This application is directed, in general, to computer processors and, more specifically, to a single instruction, multiple data (SIMD) processing unit having real and complex data multiplication capability.
The multiplication of two real numbers in a computer processor involves two inputs and therefore requires one multiplier. In contrast, the multiplication of two complex numbers in a computer processor involves four inputs (i.e., both the real and imaginary parts of the two complex numbers) and requires four multipliers. A complex multiplication requires twice as many multipliers per data input as a real multiplication, because each complex part is involved in two multiplications.
Given two real numbers, a and b, real multiplication is a×b. Given two complex numbers, a′=a.re+ja.im and b′=b.re+jb.im, complex multiplication is:
(a.re+ja.im)×(b.re+jb.im)=[(a.re×b.re)−(a.im×b.im)]+j[(a.re×b.im)+(a.im×b.re)].
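The expansion above can be checked with a minimal sketch that forms the four real products explicitly. The function name `complex_multiply` and the variable names `p0` to `p3` are illustrative assumptions, not names used in this application.

```python
def complex_multiply(a_re, a_im, b_re, b_im):
    """Multiply (a_re + j*a_im) by (b_re + j*b_im) using four real multiplies."""
    p0 = a_re * b_re  # contributes to the real part
    p1 = a_im * b_im  # subtracted from the real part
    p2 = a_re * b_im  # contributes to the imaginary part
    p3 = a_im * b_re  # contributes to the imaginary part
    return p0 - p1, p2 + p3


# Example operands chosen for illustration only.
re, im = complex_multiply(1.0, 2.0, 3.0, 4.0)
```

Each of the four input values appears in exactly two of the four products, which is why a complex multiplication needs four multipliers for four inputs.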
A multiply-accumulate (mac) operation combines the above-described multiplication with the addition of the product to another value. It is quite common in signal processing. A vector input X contains multiple a or a′ values, and a vector input Y contains multiple b or b′ values. Each element of X is multiplied by the corresponding element of Y, and the product is added to the corresponding element of an accumulated result Z, viz.:
Z[i]+=X[i]*Y[i].
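As a minimal sketch, the per-element mac above can be written as follows. The function name `mac` and the use of Python lists to stand in for vector registers are assumptions made for illustration.

```python
def mac(z, x, y):
    """Return the updated accumulator with z[i] + x[i] * y[i] in each lane i.

    The elements may be real numbers or Python complex numbers.
    """
    if not (len(z) == len(x) == len(y)):
        raise ValueError("z, x and y must have the same number of lanes")
    return [zi + xi * yi for zi, xi, yi in zip(z, x, y)]


# Two successive mac steps on a two-lane accumulator.
z = mac([0, 0], [1, 2], [3, 4])
z = mac(z, [1, 2], [3, 4])
```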
Conventional SIMD processing units keep the vector inputs and the multipliers in corresponding sets of lanes. Different conventional approaches have been used to compute a complex mac in such processing units. A simpler but slower approach is to reuse a single multiplier four times with different inputs. A faster approach is to provide both X and Y into a four-multiplier array.
One aspect provides a multiply-accumulate unit (MAU) configurable to perform both real and complex multiplication operations. In one embodiment, the MAU includes: (1) a first multiplier having a first vector input and a first scalar input and configured to multiply a first vector by a first scalar to yield a first product, (2) a second multiplier having a second vector input and a second scalar input and configured to multiply a second vector by a second scalar to yield a second product and (3) an accumulator coupled to the first multiplier and the second multiplier and configured to receive the first and second products.
Another aspect provides a method of performing a mac operation. In one embodiment, the method includes: (1) using a first multiplier having a first vector input and a first scalar input to multiply a first vector by a first scalar to yield a first product, (2) using a second multiplier having a second vector input and a second scalar input to multiply a second vector by a second scalar to yield a second product and (3) receiving the first and second products in a first accumulator coupled to the first multiplier and the second multiplier.
Yet another aspect provides a processing unit. In one embodiment, the processing unit includes: (1) a pipeline control unit, (2) register files coupled to the pipeline control unit, (3) a load/store unit coupled to the register files and (4) a multiply-accumulate unit configurable to perform both real and complex multiplication operations. In one embodiment, the MAU includes: (4a) a first multiplier having a first vector input and a first scalar input and configured to multiply a first vector by a first scalar to yield a first product, (4b) a second multiplier having a second vector input and a second scalar input and configured to multiply a second vector by a second scalar to yield a second product and (4c) an accumulator coupled to the first multiplier and the second multiplier and configured to receive the first and second products.
Reference is now made to the following descriptions taken in conjunction with the accompanying drawings, in which:
As stated above, the faster conventional approach is to provide both X and Y (two vectors taking the form of data words) into a four-multiplier array. Unfortunately, conventional SIMD processing units provide only these two vectors to the multiplier array. While this is efficient for a complex mac operation, it is inefficient for a real mac operation because X and Y then contain only enough data input for two of the four multipliers.
One simple solution to increase the efficiency of real mac operations is to pack twice the data in X and Y and perform two mac operations concurrently. Unfortunately, this then makes complex mac operations inefficient.
Introduced herein is a MAU constructed according to a novel architecture that is optimized to use all multipliers in both real and complex mac operations. Instead of providing only vector inputs X and Y and multiplying each element of X by the corresponding element of Y as described above, the architecture incorporates a third, scalar input (e.g., “W”), which is narrower than X or Y. According to the architecture, a real mac operation involves multiplying a first vector by a first scalar and a second vector by a second scalar and then accumulating the products of the two multiplications. In a MAU embodiment employing a four-multiplier array, the scalar input W provides only two scalar values, W0 and W1, and is therefore narrower than X and Y. All elements of the vector X are multiplied by W0, all elements of Y are multiplied by W1, and the products of the two multiplications are then accumulated, viz.:
Z+=X[i]*W0+Y[i]*W1.
In an alternative embodiment, a difference is derived between the products of the two multiplications. In another alternative embodiment, the products of the two multiplications are subtracted from, rather than added to, an accumulated value.
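The scalar-input real mac and its two alternatives can be sketched as below. This is an illustration under stated assumptions: the function name `scalar_mac`, the `mode` strings, the choice of one accumulator value per lane, and the sign order of the difference (first product minus second product) are not specified in this application.

```python
def scalar_mac(z, x, y, w0, w1, mode="add"):
    """Per lane i, combine z[i] with the products x[i]*w0 and y[i]*w1.

    mode="add":        z[i] + x[i]*w0 + y[i]*w1
    mode="difference": z[i] + x[i]*w0 - y[i]*w1   (sign order is an assumption)
    mode="subtract":   z[i] - (x[i]*w0 + y[i]*w1)
    """
    out = []
    for zi, xi, yi in zip(z, x, y):
        p0 = xi * w0  # vector X element times scalar W0
        p1 = yi * w1  # vector Y element times scalar W1
        if mode == "add":
            out.append(zi + p0 + p1)
        elif mode == "difference":
            out.append(zi + p0 - p1)
        elif mode == "subtract":
            out.append(zi - (p0 + p1))
        else:
            raise ValueError(f"unknown mode: {mode}")
    return out


# Illustrative values: two lanes, W0 = 10, W1 = 100.
result = scalar_mac([0, 0], [1, 2], [3, 4], 10, 100)
```

Every lane performs two multiplications per step, so a two-lane step of this kind keeps four multipliers busy.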
Because the scalar input W is narrow relative to the X and Y inputs, the additional data port that accommodates it adds little cost to the architecture. Further, the novel architecture exhibits little mismatch between data bandwidth available and multiplier resources.
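The match between data bandwidth and multiplier resources can be made concrete with a small bookkeeping sketch. It assumes a four-multiplier array in which X and Y each carry two real-valued slots (two real elements, or one complex element as a real/imaginary pair). The names `MULTIPLIERS`, `SLOTS_PER_WORD`, `products_per_issue` and `utilization` are hypothetical.

```python
MULTIPLIERS = 4      # four-multiplier array, as discussed in the text
SLOTS_PER_WORD = 2   # assumption: X and Y each carry two real-valued slots


def products_per_issue(mode):
    """Number of real multiplications that one X/Y (and W) issue can feed."""
    if mode == "complex":
        # One complex element per word; each of the four parts is used twice.
        return 4
    if mode == "real":
        # Conventional real mac: one product per pair of X and Y elements.
        return SLOTS_PER_WORD
    if mode == "real_with_w":
        # Each X element times W0 plus each Y element times W1.
        return 2 * SLOTS_PER_WORD
    raise ValueError(f"unknown mode: {mode}")


def utilization(mode):
    """Fraction of the multiplier array kept busy in the given mode."""
    return products_per_issue(mode) / MULTIPLIERS
```

Under these assumptions the conventional real mac uses half of the array, while the complex mac and the real mac with the scalar input W both use all of it.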
A MAU employing the novel architecture can provide significantly greater performance than (e.g., twice the performance of) a conventional MAU when used in finite impulse response (FIR) filters, matrix multiplication, and other real-valued signal processing functions.
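One way an FIR filter might map onto the scalar-input mac is sketched below; it is an assumption for illustration, not a mapping stated in this application. Each step applies two consecutive filter taps as W0 and W1, with X holding the input samples aligned to the first tap and Y holding the samples delayed by one more position. The names `fir_reference`, `fir_scalar_mac`, `lanes` and `sample` are hypothetical.

```python
def fir_reference(x, h):
    """Direct-form FIR: y[n] = sum over k of h[k] * x[n - k], zero before x[0]."""
    return [
        sum(h[k] * x[n - k] for k in range(len(h)) if n - k >= 0)
        for n in range(len(x))
    ]


def fir_scalar_mac(x, h, lanes=2):
    """Same filter, computing `lanes` outputs at a time and two taps per step."""
    taps = list(h) + [0] * (len(h) % 2)  # pad to an even number of taps

    def sample(n):
        # Samples outside the input are treated as zero.
        return x[n] if 0 <= n < len(x) else 0

    y = []
    for base in range(0, len(x), lanes):
        z = [0] * lanes  # one accumulator per lane
        for k in range(0, len(taps), 2):
            xv = [sample(base + i - k) for i in range(lanes)]      # vector X
            yv = [sample(base + i - k - 1) for i in range(lanes)]  # vector Y
            w0, w1 = taps[k], taps[k + 1]                          # scalar W
            z = [zi + xi * w0 + yi * w1 for zi, xi, yi in zip(z, xv, yv)]
        y.extend(z)
    return y[: len(x)]


# Illustrative data only.
signal = [1, 2, 3, 4, 5]
coeffs = [1, 2, 3]
filtered = fir_scalar_mac(signal, coeffs)
```

In this sketch each mac step consumes two taps rather than one, which is where a factor of roughly two over a one-tap-per-step real mac would come from.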
In one embodiment, X and Y are consecutive registers. However, those skilled in the pertinent art will understand that X and Y may be nonconsecutive registers. W can come from a number of different sources.
The register files 140 are likewise coupled to bypass logic 180. The bypass logic 180 is coupled to circuitry configured to perform mathematical and logical operations on constants or data stored in the register files 140 or the data cache and memory 160. The circuitry includes first and second MAUs 190-1, 190-2. The first MAU 190-1 includes an arithmetic and logic unit (ALU) 190-1a and first and second multipliers/accumulators (MACs) 190-1b, 190-1c. The second MAU 190-2 includes an ALU 190-2a and first and second MACs 190-2b, 190-2c. Another accumulator 190-3 is configured to accumulate results from the first and second MAUs 190-1, 190-2. An ALU 190-4 is likewise coupled to the bypass logic 180. One alternative embodiment employs a single MAU having SIMD capabilities. Another alternative embodiment employs an MAU having only two multipliers. Yet another alternative embodiment employs an MAU having a single multiplier that is reused as needed.
It should be noted that the SIMD processing unit of
In the embodiment of
The first multiplier 230-1 has both a Y (vector) input and a W1 (scalar) input. The Y (vector) and W1 (scalar) inputs are employed in a real mac operation. The first multiplier 230-1 has an additional X (vector) input that is not shown but employed in a complex mac operation. The second multiplier 230-2 has both an X (vector) input and a W0 (scalar) input. The second multiplier 230-2 has an additional Y (vector) input that is not shown but employed in a complex mac operation. The results (products) of the multiplications performed in the first and second multipliers 230-1, 230-2 are accumulated in an accumulator 230-3. The third multiplier 240-1 has both a Y (vector) input and a W1 (scalar) input. The third multiplier 240-1 has an additional X (vector) input that is not shown but employed in a complex mac operation. The fourth multiplier 240-2 has both an X (vector) input and a W0 (scalar) input. The fourth multiplier 240-2 has an additional Y (vector) input that is not shown but employed in a complex mac operation. The results (products) of the multiplications performed in the third and fourth multipliers 240-1, 240-2 are accumulated in an accumulator 240-3.
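The real-mode data flow described above can be modeled with a short behavioral sketch. It covers only the real mac operation; the complex-mode routing of the additional inputs is not modeled. The class name `MacPair`, the function `real_mac_step`, and the assumption that the 230 group serves one lane and the 240 group serves the next are illustrative and not taken from this application.

```python
from dataclasses import dataclass


@dataclass
class MacPair:
    """Two multipliers feeding one accumulator (e.g. 230-1, 230-2 and 230-3)."""

    acc: float = 0.0

    def real_step(self, x_elem, y_elem, w0, w1):
        first = y_elem * w1   # first multiplier: Y (vector) times W1 (scalar)
        second = x_elem * w0  # second multiplier: X (vector) times W0 (scalar)
        self.acc += first + second
        return self.acc


def real_mac_step(pairs, x, y, w0, w1):
    """Apply one real mac step; pair i is assumed to handle lane i."""
    for pair, x_elem, y_elem in zip(pairs, x, y):
        pair.real_step(x_elem, y_elem, w0, w1)
    return [pair.acc for pair in pairs]


# Two pairs stand in for the 230 and 240 groups; the values are illustrative.
pairs = [MacPair(), MacPair()]
first_result = real_mac_step(pairs, [1, 2], [3, 4], 10, 100)
second_result = real_mac_step(pairs, [1, 2], [3, 4], 10, 100)
```

All four multipliers do useful work on every real-mode step in this model, two per accumulator.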
In the embodiment of
Those skilled in the art to which this application relates will appreciate that other and further additions, deletions, substitutions and modifications may be made to the described embodiments.