1. Field of the Invention
The present invention relates generally to very large-scale integrated (VLSI) circuits and more specifically to cost effective, high-performance, dynamically or run-time reconfigurable matrix multiplier circuits having a reduced design complexity and borrow parallel counter and small multiplier circuits.
2. Description of the Related Art
Many matrix multipliers or matrix multiplication processors and related arithmetic architectures have been proposed in publications in the last two decades. Those publications include L. Breveglieri and L. Dadda, “A VLSI Inner Product Macrocell”, IEEE Transactions on VLSI Systems, vol. 6, No. 2, June 1998; L. Dadda, “Fast Serial Input Serial Output Pipelined Inner Product Units”, Dep. Elec. Eng. Inform. Sci. Politecnico di Milano, Italy, Milano, Italy, Internal Rep. 87-031, 1987; H. T. Hung, “Why Systolic Architectures?”, Computer, Vol. 15, 1982, pp. 65-112 (hereinafter “H. T. Hung”); E. L. Leiss, “Parallel and Vector Computing”, McGraw-Hill, New York, 1995; R. Lin, Low-Power High-Performance Non-Binary CMOS Arithmetic Circuits, Proc. of 2000 IEEE Workshop on signal processing systems (SiPS), Lafayette, La., October, 2000. pp. 477-486. (hereinafter “RL6”); R. Lin and M. Margala, “Novel Design And Verification of a 16×16-b Self-Repairable Reconfigurable Inner Product Processor”, in Proc. of 12th Great Lakes Symposium on VLSI, NYC, April, 2002, the contents of which are incorporated herein by reference, (hereinafter “RL5”). However, due to the complexity and cost inefficiency, such as requiring a large amount of hardware for limited speed-up in processing, none has been implemented for widely successful use. One well-studied exemplary design of such architecture includes the systolic array matrix multipliers (see H. T. Hung).
What is needed is reconfigurable matrix multiplier architecture, such as that discussed in K. Bondalapati, and V. K. Prasanna, “Reconfigurable Meshes: Theory and Practice”, Proc. of Reconfigurable Architecture Workshop: International Parallel Processing Symposium, IT press Verlag, April 1997. Such architecture should be dynamically or run-time reconfigurable with a reconfiguration mechanism for computing the product of matrices ranging from 4 to 64 bits.
The present invention describes a general dynamically or run-time reconfigurable matrix multiplier architecture with a reconfiguration mechanism for computing a product of matrices X(n×r) and Y(r×n), where n×r and r×n are the dimensions of the matrices, for any item precision or bitwidth b of the matrix elements, i.e., a bitwidth ranging from 4 to 64 bits, based on a novel scheme of trading data bitwidth for processing array or matrix size.
Additionally, the present invention teaches an efficient application for size-4 matrix operations, which are critical to graphics processing and an area-power-efficient implementation scheme utilizing novel parallel counter circuits called borrow parallel counters, which encode signals and borrow bits, i.e., bits weighted 2, as building blocks for simplified system constructions.
The present invention provides a matrix multiplying processor for a general matrix multiplier using hardware comparable with one 64×64 bit high precision multiplier that can be directly reconfigured to produce a product of two matrices in several different input forms. For example, producing the following products:
The inventive matrix multiplier or matrix multiplying processor is a special processor used for typical computer graphics applications having the same amount of hardware as one 64×64-b multiplier, and can be directly reconfigured to produce the following products:
The inventive matrix multiplier consists of 64 (8×8) small multipliers, which make up a large percentage of the matrix multiplier's area. The efficiency of an 8×8 multiplier circuit greatly affects the overall performance of the inventive matrix multiplier. The borrow parallel counter circuitry of the invention enables the inventive matrix multiplier to have a realistic and efficient implementation of the large reconfigurable matrix multiplier in terms of all aspects of very large-scale integrated (VLSI) circuits' performance including speed, power, area, and test.
The traditional one-hot-out-of-2^k-lines integer encoding, where k>=2, has the advantage of using fewer hot lines in representing small integers, and is well suited for low-power applications. However, the extra circuits and lines required for the conversion between the unary and binary signals prevent the generalized use of such encoding for low-power circuit applications. The parallel counter circuitry of this invention extends the borrow parallel counter circuits and borrow parallel small multiplier library design of the U.S. patent application Ser. No. 10/728,485 filed Dec. 5, 2003, the contents of which are incorporated herein by reference (hereinafter “RL0”). The proposed parallel counter circuitry utilizes 1-hot out of four line signal encoding and utilizes borrow bits, i.e., input bits weighted 2, in a unique way, effectively merging conversions and arithmetic operations into a single embedded full adder circuit. This leads to advantages not only in power consumption, but also in reduced VLSI area.
The invention presents an alternative library of seven small multipliers, developed based on four borrow parallel counters including borrow parallel counter 5_1 and 5_1_1 circuits (see RL0) and the newly developed borrow parallel counter circuits 6_0, 6_1. The seven new small multipliers run faster than the previously proposed multipliers due to the use of the new borrow parallel counter circuits 6_0 and 6_1.
The inventive circuits provide a significant reduction in switching activities and (hot) data paths due to the majority of the transistors being gated by or used to pass the 4-b 1-hot signals. The circuits with 0.25 μm and 0.18 μm processes for the counters and the matrix multiplying processor have shown superiority, particularly in compactness of layout and power dissipation, compared with their traditional binary counterparts.
The foregoing and other objects, aspects, and advantages of the present invention will be better understood from the following detailed description of preferred embodiments of the invention with reference to the accompanying drawings that include the following:
a is a diagram of a 4×4 partial product matrix generated by two 4-bit numbers X and Y on a network with a matrix of AND gates;
b is a diagram of a product of two numbers X and Y generated by adding all weighted partial product bits in the diagonal directions;
FIGS. 1c and 1d are diagrams of an 8×8 partial product matrix, which is decomposed into four 4×4 matrices A-D, where data from two input numbers X and Y is duplicated and sent to the decomposed multipliers;
a is a diagram of a circuit structure of four multipliers A-D of
b is a diagram of a circuit having two 4-bit input item matrices X(2×2) and Y(2×2) as for performing a matrix multiplication product Z(2×2)=XY;
c is a diagram of two structures that can be combined into a single reconfigurable matrix multiplier structure by adding two 1-bit controlled switches;
a is a diagram of a reconfigurable matrix multiplier of size (s, 4)′ and block 4-2, where s is equal to 16 or (16, 4)′;
b is a diagram of a level recursive extension of the matrix multiplying process, where a reconfigurable matrix multiplier of size (s, 4)′, where s is equal to 32 and (s/m)2=64 for base 4×4 multipliers;
a is a Q(n×n) matrix for n=8=2k, k=4 or a Q(8×8) matrix;
b is the diagram of a square-recursive-M of the Q(8×8) matrix of
c is a tree diagram of the square-recursive M of
FIGS. 5a-5c are diagrams of a matrix multiplying processor using reconfigurable matrix multipliers with a base multiplier m=8, where s is equal to 16, 32, and 64 respectively;
a is an illustration of a M(n×n) matrix, where n=2k and k=2;
b is a diagram of reconfiguration duplication switches and their states 1, 2, and 3 for inputs options 1, 2, and 3;
c is a diagram of a row-major ordering of items of a matrix (row-major-M) and a column-major ordering of items of a matrix (col-major-M) respectively of two linear arrays of ports;
d is a diagram of the conceptual duplication network of
e is a diagram of a square-recursive-M of an array of base multipliers;
f is a diagram of a duplication and distribution mechanism for a matrix multiplier of size (s, m)′=(32, 8)′;
a is a diagram of a matrix multiplication mechanism of X(4×4)*Y(4×4) of 8-bit items with input streams and switch states C=01, C1=0, and C2=0;
b is a diagram of a square-recursive matrix multiplication mechanism process of the matrix multiplier shown in
a is a diagram showing a matrix multiplication mechanism of X(2×2)*Y(2×2) of 16-bit items with an input stream and switch states C=10, C1=1, C2=0;
b is a diagram showing steps performed by the matrix multiplication mechanism of
a is a diagram of an implementation of a matrix multiplication mechanism for multiplying two 32-b numbers, with C=11, C1=1, C2=1, and C set to state 3, option 3;
b is a diagram of a conceptual view of the matrix multiplication mechanism of
FIGS. 13a-13e are diagrams of a reconfigurable duplication network of matrix multiplier of size (64, 8);
a is a diagram of pipelined data flows and accumulations for the operation option 0 of the matrix multiplier (64, 8), with four pairs of 4×4 (8-bit) matrix multiplications in parallel when C=00 and W=UV;
b is a diagram of a conceptual view of the computation of W(4×4)=U(4×4)*V(4×4) in every 4 cycles in accordance with Equation E (in 4 pipeline steps);
a is a diagram of a parallel counter designated borrow parallel counter 5_1 circuit;
b is a diagram of a parallel counter designated borrow parallel counter 5_1_1 circuit,
a is a diagram of a parallel counter designated borrow parallel counter 6_0 circuit;
b is a diagram of a parallel counter designated borrow parallel counter 6_1 circuit;
a is an existing 3:2 shift switch parallel counter;
b is a 3:2 shift switch parallel counter of the present invention
c is the 3:2 shift switch parallel counter shown in
FIGS. 20a to 20g are a library of small multipliers using 4-b 1-hot parallel counter circuits.
A novel approach of decomposing a partial product matrix, called square recursive decomposition, is described in R. Lin, “Reconfigurable Parallel Inner Product Processor Architectures”, IEEE Transactions on Very Large Scale Integration Systems (TVLSI), Vol. 9, No. 2, April, 2001, pp. 261-272, the contents of which are incorporated herein by reference, (hereinafter “RL3”); R. Lin, “Trading Bitwidth For Array Size: A Unified Reconfigurable Arithmetic Processor Design”, Proc. of IEEE 2001 International Symposium on Quality of Electronic Design, San Jose, Calif., March 2001, pp. 325-330; R. Lin, “A Reconfigurable Low-Power High-Performance Matrix Multiplier Architecture With Borrow Parallel Counters”, Proc. of 10th Reconfigurable Architectures Workshop (RAW 2003), Nice, France, April, 2003, the contents of which are incorporated herein by reference, (hereinafter “RL1”); and R. Lin, “Borrow Parallel Counters And Borrow Parallel Small Multipliers, New Technology Disclosure Documentation”, Research Foundation of SUNY, August, 2002, the contents of which are incorporated herein by reference, (hereinafter “RL2”).
The decomposition of partial product matrix approach is briefly reviewed below with reference to
The four multipliers are used to compute a product of two 8-bit numbers.
Two types of computations and the reconfigurable matrix multiplying processor are illustrated in
Here X and Y are two 8-bit numbers, where X=X7 . . . Xi . . . X0 and Y=Y7 . . . Yj . . . Y0; i and j are indices of matrix elements; and the summation over the integers u and v, for 0≦u, v≦1, implies the addition of a square of four weighted 8-b numbers having respective weights of 1, 2^4, 2^4, and 2^8, by an adder called a 3-n adder, which involves adding 3 numbers due to the weight difference.
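The decomposition described above can be checked numerically. The following is a minimal Python sketch rather than a model of the actual circuit; the function name and the names p_ll, p_hl, p_lh, and p_hh are illustrative assumptions. It forms four 4×4-bit products and combines them with weights 1, 2^4, 2^4, and 2^8. Because the weight-1 and weight-2^8 products occupy disjoint bit ranges, they can be concatenated into one 16-bit number, which leaves three numbers to add.

```python
def mul8_via_four_4x4(x: int, y: int) -> int:
    """Multiply two 8-bit numbers using only 4x4-bit base products."""
    assert 0 <= x < 256 and 0 <= y < 256
    x_lo, x_hi = x & 0xF, x >> 4
    y_lo, y_hi = y & 0xF, y >> 4

    p_ll = x_lo * y_lo  # weight 1
    p_hl = x_hi * y_lo  # weight 2^4
    p_lh = x_lo * y_hi  # weight 2^4
    p_hh = x_hi * y_hi  # weight 2^8

    # p_ll is at most 225, so it fits in 8 bits and never overlaps p_hh << 8.
    concatenated = (p_hh << 8) | p_ll

    # Three numbers remain to be added.
    return concatenated + (p_hl << 4) + (p_lh << 4)
```

An exhaustive comparison against ordinary multiplication over all 8-bit operand pairs is a convenient way to confirm the weights.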
As illustrated in
Here Xik and Ykj are 4-bit numbers. Since the numbers are weighted the same, 3-n addition is not required.
As is illustrated in
Construction of General Reconfigurable Matrix Multipliers
The reconfigurable matrix multiplying processor described above can be denoted by (s, m)′=(8, 4)′, where m represents the size of a base multiplier and s represents the matrix multiplier processor size, which is equal to sqrt(# of base multipliers)*m. The prime sign is used to indicate that the matrix multiplier is not complete. A complete matrix multiplying processor will be discussed below. The approach of decomposing a larger partial product matrix into smaller product matrices and reconfiguring them for multiple types of computation may be applied recursively to construct a large size matrix multiplying processor. For example, four pieces of block 4-1, a 3-n 16-b adder, and corresponding large accumulators plus a few additional switches controlled by bit C2 will be sufficient to construct such a matrix multiplying processor with (s, m)′=(16, 4)′.
a illustrates a reconfigurable matrix multiplier 26 of (s, 4)′, with s being equal to 16 or (16, 4)′ and block 4-2 24. Some output lines are shared by two contiguous blocks, and it is easy to verify that the structure can produce the product of:
It is also easy to verify that in general, if the matrix multiplier or matrix multiplying processor (s, m)′ is reconfigurable to compute the product of X(h×h) and Y(h×h) of b-bit items, then s=hb. As a special case, let h=1 then s=b, that means that the matrix multiplying processor (s, m)′ multiplies two s-bit numbers. So the size s of matrix multiplier (s, m)′ can also be seen as having the same size as an s-bit multiplier.
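The relation s=hb can be illustrated with a short Python sketch. The function names are hypothetical, and two assumptions are made: the item bitwidth b doubles from m up to s, and the number of base multipliers is (s/m)^2, which is consistent with the count of 64 given for s=32 with base 4×4 multipliers.

```python
def base_multiplier_count(s: int, m: int) -> int:
    """Number of m x m base multipliers in a matrix multiplier of size (s, m)."""
    return (s // m) ** 2


def input_options(s: int, m: int) -> list[tuple[int, int]]:
    """List (h, b) pairs with h * b == s, for b = m, 2m, 4m, ..., s.

    Each pair means: multiply X(h x h) by Y(h x h) with b-bit items.
    The doubling of b is an assumption of this sketch.
    """
    options = []
    b = m
    while b <= s:
        options.append((s // b, b))
        b *= 2
    return options
```

For example, `input_options(32, 8)` lists 4×4 matrices of 8-bit items, 2×2 matrices of 16-bit items, and a single pair of 32-bit numbers (h=1).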
One more level recursive extensions of the matrix multiplying process is shown in
A similar matrix multiplying processor using reconfigurable matrix multipliers 30-34 with base multiplier m=8 are shown in
Several data structures and components specific to the above described architecture can be defined. These data structures include three one-dimension arrays with respect to a given (n×n) matrix, an input reconfigurable duplication network, and a fixed data distribution network.
Definition 1
Given matrix Q(n×n) (n=2^k), a square recursive view of Q is a decomposition of Q as follows:
Definition 2
Given matrix Q(n×n) (n=2^k), the one dimensional arrays row-major ordering of items of matrix Q (row-major-Q), column major ordering of items of matrix Q (col-major-Q), and square recursive ordering of items of matrix Q (square-recursive-Q), each a re-ordering of all items of matrix Q, are defined as follows:
Based on the Definitions 1 and 2, it can be verified that the square-recursive-Q is the array of the leaf-items of the tree constructed by following recursive view of Q, i.e., its items are in square recursive order.
As an example consider a Q(n×n) matrix for n=4=2^k, k=2, i.e., a Q(4×4) matrix illustrated in
Here, row-major-Q with respect to matrix Q, Q(3, 0)=row-major-Q(3*4+0)=row-major-Q(12) is square recursive view of Matrix Q(n×n), for n=4.
The top square, i.e., the matrix is substituted by four square ordered, i.e., NE-NW-SE-SW sub-matrices, which then recursively apply the process until each sub-matrix is an item.
The square-recursive-Q, with respect to matrix Q, is the leaf-array of a 2-level full-4-branch tree constructed following the square recursive view of Q.
Here, indices: 3=011(2), 0=000(2), and Q(3, 0)=square-recursive-Q(001010(2))=square-recursive-Q(10). As with respect to matrix M(8×8) illustrated in
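The index calculations in the two examples above can be sketched in Python as follows. The function names are illustrative. The column-major formula is the conventional one and is an assumption here, since the definition is not reproduced in the text. The square recursive index is modeled as an interleaving of the row and column index bits, row bit first, which reproduces Q(3, 0)=square-recursive-Q(10).

```python
def row_major_index(i: int, j: int, n: int) -> int:
    """Position of Q(i, j) in row-major-Q for an n x n matrix."""
    return i * n + j


def col_major_index(i: int, j: int, n: int) -> int:
    """Position of Q(i, j) in col-major-Q (conventional definition, assumed)."""
    return j * n + i


def square_recursive_index(i: int, j: int, k: int) -> int:
    """Position of Q(i, j) in square-recursive-Q for n = 2**k.

    Each tree level contributes two bits: one bit of i followed by one
    bit of j, starting from the most significant bits.
    """
    index = 0
    for bit in range(k - 1, -1, -1):
        index = (index << 2) | (((i >> bit) & 1) << 1) | ((j >> bit) & 1)
    return index
```

With k=2 the mapping visits each of the 16 positions exactly once, as expected for the leaf array of a 2-level full-4-branch tree.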
For a pipelined matrix multiplication to generate accumulated outputs, only a row and a column from the two input matrices, respectively, need to be provided in each cycle. The input data stream then needs to be duplicated and distributed to the matrix multiplier, using the following two additional simple sub-networks:
Matrix 50 is illustrated in
The topology of a reconfigurable duplication network is determined by the matrix M(n×n) and all preset input options. The topology of a distribution network is determined only by the value n of the matrix M(n×n).
The duplication and distribution mechanism for a matrix multiplier of (s, m)′=(32, 8)′ is illustrated in
Option 1 is identified by reference numeral 72, and represents a first step for the input duplication and distribution network, where X(4×4) and Y(4×4) have the total of 8-b items.
Option 2 is identified by reference numeral 74, and represents a first step for the input duplication and distribution network, where X(2×2) and Y(2×2) have the total of 16-b items.
Option 3 is identified by reference numeral 76, and represents a first step for the input duplication and distribution network, where X and Y have the total of 32-b items.
While
The above discussion leads to a complete matrix multiplication mechanism. Considering Z(n×n)=X(n×n)*Y(n×n), the computation may be represented in an inner product form as Equation E:
According to Equation E, the multiplier takes n steps to compute the value of Z(n), term by term and one term per step. At the k-th step the base multiplier at position (i, j) multiplies X(ik)*Y(kj) to yield the k-th term of the inner product, i.e., Z(ij)*(k) which is accumulated into the result of the previous steps. In the inventive matrix multiplying processor this computation occurs in parallel.
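The step-by-step accumulation described above can be sketched in Python. This is a behavioral illustration only, with an assumed function name. In step k, the base multiplier at position (i, j) multiplies X(ik) by Y(kj) and accumulates the result; the inner loops over i and j stand in for hardware that operates in parallel.

```python
def pipelined_matmul(X: list[list[int]], Y: list[list[int]]) -> list[list[int]]:
    """Compute Z = X * Y for n x n matrices in n accumulation steps."""
    n = len(X)
    Z = [[0] * n for _ in range(n)]  # one accumulator per base multiplier
    for k in range(n):  # one pipeline step per term of the inner product
        x_col = [X[i][k] for i in range(n)]  # column k of X
        y_row = Y[k]                         # row k of Y
        for i in range(n):
            for j in range(n):
                Z[i][j] += x_col[i] * y_row[j]  # k-th term of Z(ij)
    return Z
```

Each step consumes only one column of X and one row of Y, which is why a single column and row per cycle suffice as input.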
Equation E suggests that n^2 base multipliers are required. Since base multipliers are very small, for n and m that are not too large, for example n≦16 and m≦8, such a matrix multiplying processor is of a common size. It can also be seen that Equations E1 and E2 presented above are equivalent forms of Equation E with terms computed in different ways.
Returning to
The pipeline process has a throughput of one result every h cycles (1/h per cycle) and a latency of h+log(s/m) cycles.
FIGS. 8a and 8b illustrate the process of X(4×4)*Y(4×4) of 8-bit items with input streams and switch states C=01, C1=0, and C2=0. Specifically,
Because the numbers are similarly weighted, there is no 3-n addition.
b illustrates the conceptual view of square-recursive illustration of the matrix multiplication mechanism process also shown in
The products of base multipliers are processed through two levels of 3-n additions associated with the two levels of squares to which they belong (this association is represented in
There are two more input options for the inventive matrix multiplying processor. For an input stream of 2×2 matrices of 16-bit items, C is set to state 2, option 2 data is processed, and the product of X(2×2)*Y(2×2) is produced.
Here i, j, and k are used to index matrix elements; u, v, and e, f are used to index the binary bits of matrix elements for an outer level-2 sub-matrix and an inner level-1 sub-matrix, respectively. For example, Xik_e, with 8u≦e≦8u+7, represents the e-th bit of matrix item Xik for some value u. In particular, Σ over 0≦k≦1 implies a sum in two pipeline steps; Σ over 0≦u, v≦1 implies the 3-n addition of (a square of) 4 weighted data; and Σ over 8u≦e≦8u+7 and 8v≦f≦8v+7, for some u and v, implies the formation of a weighted base product by a base multiplier.
b illustrates the conceptual view of the matrix multiplication mechanism. In each of the two steps, inputs are duplicated and distributed into base multipliers (entries of matrix M). In step 1 base multiplications with 3-n addition at level-1 squares are performed. Step 2 is the same as Step 1 for new data and after accumulation. The products of the base multipliers are then processed through two levels of possible 3-n additions (only inner level addition is performed here), and finally reach the accumulators for accumulated results.
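The two-step process for 2×2 matrices of 16-bit items can be sketched in Python as follows. This is a behavioral sketch under stated assumptions, not the circuit: the function names are hypothetical, each 16-bit item is split into two 8-bit halves, and every arithmetic product is taken by an 8×8-bit base multiplication. The level-1 squares combine four base products with weights 1, 2^8, 2^8, and 2^16, and the two values of k correspond to the two pipeline steps.

```python
def base_mul8(a: int, b: int) -> int:
    """An 8x8-bit base multiplication."""
    assert 0 <= a < 256 and 0 <= b < 256
    return a * b


def mul16_via_level1_square(x: int, y: int) -> int:
    """Combine four weighted base products into one 16x16-bit product."""
    total = 0
    for u in (0, 1):
        for v in (0, 1):
            x_part = (x >> (8 * u)) & 0xFF
            y_part = (y >> (8 * v)) & 0xFF
            total += base_mul8(x_part, y_part) << (8 * (u + v))
    return total


def matmul_2x2_16bit(X: list[list[int]], Y: list[list[int]]) -> list[list[int]]:
    """Z(2x2) = X(2x2) * Y(2x2) for 16-bit items, in two accumulation steps."""
    Z = [[0, 0], [0, 0]]
    for k in (0, 1):  # pipeline step
        for i in (0, 1):
            for j in (0, 1):
                Z[i][j] += mul16_via_level1_square(X[i][k], Y[k][j])
    return Z
```

Each step uses 4 positions times 4 base products, i.e., 16 base multiplications, which matches the (32/8)^2 base multipliers of a (32, 8) array.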
FIGS. 10a and 10b illustrate the process of multiplying two 32-b numbers, with C=11, C1=1, C2=1. For input of two 32-bit numbers, C is set to state 3, option 3 inputs are processed, and the product of two 32-b numbers is produced. Specifically,
This Equation is an extension of Equation E1. Here i and j are used as indices of bit positions of input numbers; u, v and e, f are used for outer-level and inner level decompositions, respectively. In particular, Σ over 0≦u, v≦1 implies the addition of an outer square of 4 weighted data sources by a 3-n adder; Σ over 0≦e, f≦1 implies the addition of an inner square of 4 weighted data sources by a 3-n adder; and Σ over 16u+8e≦i≦16u+8e+7 and 16v+8f≦j≦16v+8f+7, for some u and v, implies the formation of a weighted base 16-b product produced by the base multiplier.
b illustrates the conceptual view of a matrix multiplication mechanism. The inputs are duplicated and distributed into base multipliers (entries of matrix M). In the only step the mechanism performs base multiplications, addition at both level-1 and level-2 squares, and accumulation. The products of base multipliers are then processed through two levels of 3-n additions (3-n additions at both levels are required), and finally reach the accumulators for accumulated results.
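The two-level decomposition for a 32-bit product can be sketched recursively in Python. The function name and the `bits` and `m` parameters are assumptions made for illustration. Each level splits both operands in half and adds a square of four weighted sub-products; the recursion stops at the m-bit base multiplier.

```python
def square_recursive_mul(x: int, y: int, bits: int = 32, m: int = 8) -> int:
    """Multiply two `bits`-wide numbers using only m x m-bit base products."""
    if bits == m:
        return x * y  # base multiplier: an m x m-bit product
    half = bits // 2
    mask = (1 << half) - 1
    total = 0
    for u in (0, 1):
        for v in (0, 1):
            x_part = (x >> (half * u)) & mask
            y_part = (y >> (half * v)) & mask
            sub_product = square_recursive_mul(x_part, y_part, half, m)
            total += sub_product << (half * (u + v))  # weighted addition
    return total
```

With bits=32 and m=8 the recursion has two levels and reaches 16 base products, one per base multiplier of a (32, 8) array.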
Partitioning General Input Matrices
For example, using the matrix multiplier (32, 8) of
The operations of (4×4) matrices with various item precision are particularly important for graphics applications. The matrix items may include 8-b, 16-b and occasionally 32-b or even 64-b data for special needs. Efficient applications of matrix multipliers of (s, m)=(32, 8) and (s, m)=(64, 8) are illustrated below. First, with the (s, m)=(32, 8) matrix multiplying processor shown in
FIGS. 13a-13e show the reconfigurable duplication network of the matrix multiplier (64, 8).
The operations with C=1, 2 and 3 are the same as those for the (32, 8) matrix multiplier, except the input/output size can now be four times that for the (32, 8) matrix multiplying processor. It is noted that the (64, 8) matrix multiplying processor has about four identical components working in parallel, each equivalent to a single (32, 8) matrix multiplying processor. Also putting four blocks of (32, 8) in parallel is not able to provide multiplication of two 64-b numbers. The operation with C=0 requires an additional reconfigurable duplication unit to support an efficient operation and unified control.
The conceptual view of an input duplication net for options 1, 2, and 3 is shown in
FIGS. 14a and 14b illustrate the complete views of option 0 of the matrix multiplier (64, 8).
The Implementation Circuits
Since the large number of 8×8 base multipliers requires a significant percentage of the matrix multiplier area, a novel design of highly regular, compact, low power small multiplier circuits for the implementation of the 8×8-b base multiplier of the present invention is presented below. The 8×8 multiplier, called a borrow parallel multiplier, which is an array of borrow parallel counters, is described in R. Lin and R. Alonzo, “An Extra-Regular, Compact, Low-Power Multiplier Design Using Triple-Expansion Schemes And Borrow Parallel Counter Circuits”, Proc. Of Workshop On Complexity Reduced Design (Isca), Held In Conjunction With The 30th Intl. Symposium On Computer Architectures, San Diego, Calif., June 2003, the contents of which are incorporated herein by reference, (hereinafter “RL4”); and in RL0, RL1, and RL2. The 8×8 borrow parallel multiplier can be laid out in an area of 33 μm×167 μm (with 0.18 μm technology, 3 metal layers; see
The borrow parallel counters possess the following advantages:
utilize borrow bits, i.e., input bits weighted 2, which make it possible for a small multiplier, such as 8×8-b multiplier, to be organized in a single array of almost identical parallel counters for a compact layout.
Table 1 shows the “4-bit 1-hot” (4-b 1-hot) encoded signals and their value interpretations. The unique bit position determines the value of a 4-b 1-hot signal.
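A 4-b 1-hot signal can be modeled in Python as a tuple of four lines with exactly one line set. The sketch below assumes that line position p (named r0 to r3 here) carries the value p; the function names are illustrative and the mapping should be read as an assumption rather than as a reproduction of Table 1.

```python
def encode_1hot(value: int) -> tuple[int, int, int, int]:
    """Encode an integer 0..3 as lines (r0, r1, r2, r3) with one hot line."""
    assert 0 <= value <= 3
    return tuple(1 if position == value else 0 for position in range(4))


def decode_1hot(lines: tuple[int, int, int, int]) -> int:
    """Return the value of a 4-b 1-hot signal from the position of its hot line."""
    assert len(lines) == 4 and sum(lines) == 1
    return lines.index(1)
```

Only one line is active for any value, which is the property the text associates with low switching activity.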
The Borrow Parallel 5_1 and 5_1_1 Counters and Their Extension, Borrow Parallel 6_0 and 6_1 Counters
The present invention also sets forth a description of the borrow parallel circuits, including a new proof of the borrow parallel counter 5_1 and 5_1_1 circuits and their extension borrow parallel counter circuits 6_0 and 6_1, as well as an alternative library of small multipliers. In addition to the implementation of the proposed matrix multipliers, the borrow parallel circuits can be used for various applications including the design of a whole spectrum of large multipliers, e.g., up to 81-bit, (see RL0). The inventive borrow parallel counters utilizing the 4-b 1-hot signals and their additions are presented herein below. These counters are termed borrow (parallel) counters because one or more of the bits being counted by such counters have a weight of 2 instead of 1; such bits are called “borrowed” as they are borrowed from the left neighboring columns.
FIGS. 16a and 16b illustrate two extra-compact, low-power, high-speed CMOS circuits, serving as building blocks for parallel arithmetic designs.
Each of the borrow parallel counter circuits 5_1 and 5_1_1 has 5 inputs, A1 to A5, two outputs U and L, and three pairs of in-stage input/output bits, X, Y, Z, where the weighted sum of all outputs equals the weighted sum of all inputs. Input bit A5 (or A4), weighted 2, is usually borrowed from the higher weighted neighboring columns and its input arrow in the circuit is offset.
In addition to utilizing 4-b 1-hot signal encoding and borrow bits, the borrow parallel counter circuits provide an embedded full adder, adding non-binary (4-b, 1-hot) and binary signals without decoding. A pass-transistor circuit illustrated in
The borrow parallel counter 5_1 circuit implements the five arithmetic-logic equations shown below:
A1+A2+A3+A4+2A5=4q+2c+s (or=qcs in binary form) (M1)
Xo=s; (B1)
Yo=Xi XOR c; (B2)
Zo=Xi′ (B3)
SUM=2U+L=Yi+2Yi′ Zi′+q; (M2)
The explanation of how the circuit illustrated in
A1+A2+A3+A4=4q0+R.
Since A1+A2+A3+A4+2A5=4q0+2c0+s0+2A5,
let 4q0+2(c0+A5)+s0=4q+2c+s,
thus s=s0 (D1)
4q0+2(c0+A5)=4q+2c=>c=c0XOR A5 (D2)
q=q0 or c0A5 (D3)
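Relations D1 to D3 can be confirmed by enumerating all 32 input combinations. The following Python sketch uses an assumed function name and takes R=2c0+s0, i.e., it reads c0 and s0 as the two binary digits of the value R.

```python
from itertools import product


def d1_to_d3_hold() -> bool:
    """Exhaustively check s = s0, c = c0 XOR A5, and q = q0 OR (c0 AND A5)."""
    for a1, a2, a3, a4, a5 in product((0, 1), repeat=5):
        partial = a1 + a2 + a3 + a4          # equals 4*q0 + R
        q0, r = divmod(partial, 4)
        c0, s0 = divmod(r, 2)                # R = 2*c0 + s0 (assumed reading)

        total = partial + 2 * a5             # equals 4*q + 2*c + s
        q, rest = divmod(total, 4)
        c, s = divmod(rest, 2)

        if s != s0 or c != (c0 ^ a5) or q != (q0 | (c0 & a5)):
            return False
    return True
```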
The 4-b 1-hot encoding scheme shown in Table 1 results in:
1. (r0 or r2)=1<=>s0=0, and (r1 or r3)=1<=>s0=1; and
2. (r0 or r1)=1<=>c0=0, and (r2 or r3)=1<=>c0=1 (D4)
From Equation D4 it is verified that
Xo=s0 and Yo=(Xi XOR A5)XOR c0=Xi XOR(c0XOR A5)
Equation D1 provides:
This can also be verified from the circuit shown in
The above provided proof is also achieved by an exhaustive verification program for all possible inputs and outputs. For example, inputs shown in
A1+A2+A3+A4+2A5=5=>q=1, c=0, s=1 and
Xo=1, Yo=1, Zo=0, SUM=3, U=1, L=1.
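The numerical example above can be reproduced with a behavioral Python model of Equations M1, B1 to B3, and M2. The function name is hypothetical. The particular input assignment used in the usage note, and the in-stage inputs Xi=1, Yi=0, Zi=0, are assumptions chosen to be consistent with the stated outputs; they are not taken from the figure.

```python
def bpc_5_1(a: tuple[int, int, int, int, int], xi: int, yi: int, zi: int) -> dict:
    """Behavioral model of the borrow parallel counter 5_1 equations."""
    a1, a2, a3, a4, a5 = a
    total = a1 + a2 + a3 + a4 + 2 * a5            # M1: total = 4q + 2c + s
    q, c, s = total >> 2, (total >> 1) & 1, total & 1

    xo = s                                        # B1
    yo = xi ^ c                                   # B2
    zo = 1 - xi                                   # B3: Zo = Xi'
    sum_value = yi + 2 * (1 - yi) * (1 - zi) + q  # M2: SUM = 2U + L
    u, l = sum_value >> 1, sum_value & 1

    return {"q": q, "c": c, "s": s, "Xo": xo, "Yo": yo, "Zo": zo,
            "SUM": sum_value, "U": u, "L": l}
```

For example, `bpc_5_1((1, 1, 1, 0, 1), xi=1, yi=0, zi=0)` has a weighted input sum of 5 and yields q=1, c=0, s=1, Xo=1, Yo=1, Zo=0, SUM=3, U=1, and L=1.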
The circuit of
The above verifies that the circuit of
To explain how the circuit of
With reference to
Let s, c, q, Xi, Xo, Yi, Yo, Zi, Zo, L, U and SUM of the counter in column k be sk, ck, qk, Xik, Xok, Yik, Yok, Zik, Zok, Lk, Uk and SUMk (for k=1, 2, 3) respectively; the outputs of the adder of column 1, i.e., U1 and L1, will be computed to show
2U1+L1=s3+c2+q1.
From Equation B1 it follows that Xo3=s3;
It can be verified that if conditions Yi=s3 XOR c2 and Zi=s3′ are true, then Yi+2Yi′Zi′ is equivalent to s3+c2.
The verification is provided below by the truth table shown in Table 2.
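The four cases of this verification can also be enumerated with a few lines of Python. The function name and the row layout are illustrative and are not meant to reproduce the layout of Table 2.

```python
def equivalence_rows() -> list[tuple[int, int, int, int, int, int]]:
    """Return rows (s3, c2, Yi, Zi, Yi + 2*Yi'*Zi', s3 + c2) for all inputs."""
    rows = []
    for s3 in (0, 1):
        for c2 in (0, 1):
            yi = s3 ^ c2          # condition Yi = s3 XOR c2
            zi = 1 - s3           # condition Zi = s3'
            encoded = yi + 2 * (1 - yi) * (1 - zi)
            rows.append((s3, c2, yi, zi, encoded, s3 + c2))
    return rows
```

In every row the fifth and sixth entries agree, so the pair (Yi, Zi) carries the value s3+c2 without an explicit binary addition.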
Equation D5 provides the condition Yi1=Yo2=s3 XOR c2, and Equations B3 and B1 provide Zi1=Zo2=Xi2′=Xo3′=s3′; therefore, Yi1+2Yi1′Zi1′ is equivalent to s3+c2.
Finally Equation D6 provides:
SUM1=2U1+L1=s3+c2+q1 (D7)
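Equation D7 can be checked by chaining three columns in a small Python sketch. The wiring follows the relations stated above (Xi2=Xo3, Yi1=Yo2, Zi1=Zo2); the function name is hypothetical, and each argument is assumed to be the weighted input sum of one column, i.e., the value 4q+2c+s of that column.

```python
def chained_sum1(total1: int, total2: int, total3: int) -> tuple[int, int]:
    """Return (SUM1 from the chained in-stage signals, s3 + c2 + q1)."""
    s3 = total3 & 1
    c2 = (total2 >> 1) & 1
    q1 = total1 >> 2

    xo3 = s3                 # B1 in column 3
    xi2 = xo3                # column 3 feeds Xi of column 2
    yo2 = xi2 ^ c2           # B2 in column 2
    zo2 = 1 - xi2            # B3 in column 2
    yi1, zi1 = yo2, zo2      # column 2 feeds Yi and Zi of column 1

    sum1 = yi1 + 2 * (1 - yi1) * (1 - zi1) + q1   # M2 in column 1
    return sum1, s3 + c2 + q1
```

The two returned values agree for every combination of column sums from 0 to 6, which is the range reachable by a borrow parallel counter 5_1 circuit.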
Using the above provided proof, an array of borrow parallel counter 5_1 and/or 5_1_1 circuits can be viewed as parallel counters for reducing a 5-bit-height input matrix into a set of s, c, and q bits, which set is further reduced in accordance with Equation D7 into two numbers Ui and Li.
Each borrow parallel counter 5_1 or 5_1_1 circuit can also be viewed as an effective counter for reducing 5 input bits having one or more borrow bits into two output bits. The addition of s3 and c2, which is embedded in the 4-b 1-hot signal form, by sub-circuits as shown in the shaded area of columns 3 and 2 in
The borrow parallel counter 5_1 and 5_1_1 circuits can each be represented by a single arithmetic equation, shown below, in which the sum of all weighted inputs equals the sum of all weighted outputs:
For borrow parallel counter 5_1 circuit:
A1+A2+A3+A4+2A5+2Xi+4(Yi+2Yi′Zi′)=Xo+2Yo+4Yo′Zo′+4L+8U
For borrow parallel counter 5_1_1 circuit:
A1+A2+A3+2A4+2A5+2Xi+4(Yi+2Yi′Zi′)=Xo+2Yo+4Yo′Zo′+4L+8U
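Both conservation equations can be checked exhaustively in Python. The sketch below is a behavioral model with assumed names. It further assumes that the 5_1_1 circuit obeys the same in-stage relations (B1 to B3 and M2) as the 5_1 circuit and differs only in the weight of A4, which is consistent with the two equations above but is not stated separately in the text.

```python
from itertools import product

WEIGHTS_5_1 = (1, 1, 1, 1, 2)      # A5 weighted 2
WEIGHTS_5_1_1 = (1, 1, 1, 2, 2)    # A4 and A5 weighted 2


def conservation_holds(weights: tuple[int, ...]) -> bool:
    """Check weighted inputs == weighted outputs for all 256 input cases."""
    for bits in product((0, 1), repeat=8):
        inputs, (xi, yi, zi) = bits[:5], bits[5:]
        total = sum(w * a for w, a in zip(weights, inputs))
        q, c, s = total >> 2, (total >> 1) & 1, total & 1

        xo, yo, zo = s, xi ^ c, 1 - xi                 # B1, B2, B3
        sum_value = yi + 2 * (1 - yi) * (1 - zi) + q   # M2
        u, l = sum_value >> 1, sum_value & 1

        lhs = total + 2 * xi + 4 * (yi + 2 * (1 - yi) * (1 - zi))
        rhs = xo + 2 * yo + 4 * (1 - yo) * (1 - zo) + 4 * l + 8 * u
        if lhs != rhs:
            return False
    return True
```

The check passes for both weight vectors, which is the behavior expected of a counter that neither creates nor loses weighted bits.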
FIGS. 18a and 18b illustrate additional 4-b 1-hot borrow parallel counter variants called borrow parallel counter 6_0 and 6_1 circuits 180 and 182, respectively. Each of the circuits 180 and 182 includes 6 inputs A1 to A6. All 6 input bits of the borrow parallel counter 6_0 circuit 180 are weighted 1. For the borrow parallel counter 6_1 circuit 182, the input bit A3 is weighted 2. The borrow parallel counter 6_0 and 6_1 circuits 180 and 182 are constructed using the borrow parallel counter 5_1 or 5_1_1 circuits 160 and 168 (
a shows an existing 3:2 shift switch parallel counter (see RL6).
The Alternative Library of Small Borrow Parallel Multipliers
One of the benefits of using the above described four 4-b 1-hot parallel counter circuits is the formation of a library of small multipliers ranging from 3 to 9 bits in a single array of counters structure.
Conventional binary counter based parallel multiplier circuits, including 8×8-b multiplier, are highly irregular in shape because a partial product bit matrix has a triangular shape. It is not efficient to re-arrange the bit matrix for bit reduction using small-size binary parallel counters. The layout cost in dealing with the irregularity can be significant. One of the major benefits of the library of small multipliers, is its ability to turn irregular small multiplication units into regular circuit blocks, thereby greatly reducing local complexity of large circuits.
As illustrated in
The inventive library of small multipliers improves the library based on two borrow parallel counter 5_1 and 5_1_1 circuits (see RL0). Each multiplier in the library of this invention is constructed the same way by a single array of borrow parallel counters plus a few 3:2 and/or 2:2 shift switch parallel counters. The library of the present invention includes four borrow parallel counter 5_1, 5_1_1, 6_0 and 6_1 circuits. They all have about the same small height as that of a single borrow parallel counter 5_1 circuit, plus the height of an input net. Similarly, these borrow parallel counters have about the same delay and display a very compact layout, high speed performance, and low-power utilization features.
The 8×8 Small Borrow Parallel Multiplier
3. the bottom part 218, shown below the dotted line, representing a fast and simple one stage carry look-ahead adder with a carry propagate node denoted by CPN.
Table 3 shows the summary and comparison of the parallel counters and 8×8 multipliers. The layouts of the borrow parallel counter 5_1, 5_1_1 circuits and the 8×8 multiplier using 0.18 μm CMOS technology and 3 metal layers with areas of 12.87×16.0 μm2 and 26.5×85.5 μm2, respectively, have been produced (see RL4). The 8×8 multiplier illustrated in
The preliminary results of current studies focusing on optimal layouts of duplication-distribution networks and the block-1, block-2, and block-3 modules, have shown that all these components may be laid out in matching the total width defined by the base multiplier array 220 for 530 μm and the base multiplier array 222 for 2120 μm as shown in
Since there is no reported data available for a comparable architecture, a comparison can be made with a 54×54 floating point Booth multiplier, recently reported in N. Itoh, Y. Naemura, H. Makino, Y. Nakase, T. Yushihara, Y. Horiba, “A 600 MHz, 54×54-bit Multiplier With Rectangular-Styled Wallace Tree”, IEEE JSSCs, Vol. 35, No. 2, February 2001, (hereinafter “Itoh”) and R. Montoye, W. Belluomini, H. Ngo, C. McDowell, J. SaWada, T. Nguyen, B. Veraa, J. Wagoner, M. Lee, “A Double Precision Floating Point Multiplier”, Proc. of 2003 IEEE ISSCC, February, 2003 (hereinafter “Montoye”). The Booth multiplier has the minimum area. The comparison is achieved by first scaling up the Booth floating point multipliers to size 64, then comparing them with the inventive (64, 8) matrix multiplier. The multiplier of Itoh, fabricated in the same 0.18 μm technology, requires an area of 0.98 mm2, while the multiplier of Montoye, fabricated in the 0.13 μm technology, requires an area of 0.155 mm2, which will be 0.49 mm2 when scaled for 0.18 μm technology (see Montoye).
Based on these data, the inventive reconfigurable matrix multiplier architecture with borrow parallel counter circuits has shown itself to be competitive, particularly when the multiple provided functionalities are considered. A summary and simplified comparison of these three matrix multiplying processors are given in Table 4.
The inventive matrix multiplying processor can be run-time reconfigured to trade bitwidth for a matrix size for general multiplications of matrices. Specifically, the inventive matrix multiplying processor can be efficiently reconfigured to compute the product of matrices X(4×4) and Y(4×4) for graphics and image processing applications. The hardware comparable with one 64×64 bit high precision multiplier with minimal additional reconfiguration components can provide four computation options, which significantly reduces the total amount of hardware needed by existing computation systems.
The proposed inventive architecture minimizes the common irregularity that occurs in existing designs, and simplifies the overall logic scheme and circuit structures. The superiority of the architecture is achieved, particularly, through the use of CMOS borrow parallel counter circuits and small multipliers, which utilize 4-b, 1-hot integer encoding (valued 0 to 3), borrow bits, and a single counter array structure for multiplying small integers, achieving an extra compact layout and lower switching activity for low-power design.
The small 8×8 multiplier array based matrix multiplying processors also possess several unique features in self-testability and high design quality (see RL5). The architecture may also be extended as a unified arithmetic processor to provide inner product computation as well (see RL1).
While the invention has been shown and described with reference to certain preferred embodiments thereof, it will be understood by those skilled in the art that various changes in form and details may be made therein without departing from the spirit and scope of the invention as defined by the appended claims.
This invention was funded, at least in part, under grants from the National Science Foundation, No. CCR-0073469 and New York State Office of Advanced Science, Technology & Academic Research (NYSTAR, MDC) No. 1023263. The Government may therefore have certain rights in the invention.