1. Field of the Invention
The present invention relates to an asynchronous multiplier, and more particularly to an asynchronous multiplier with an accelerating circuit.
2. Description of the Related Art
The multiplier is an essential component of micro-processors and of digital signal processing operations such as the discrete sine transform. Multiplication usually takes the longest operational time, which is often the decisive factor in the performance of a chip. Several synchronous designs have been proposed, as have asynchronous designs. Owing to its low power consumption, low average operational time and flexibility in adapting to various processes and environments, the asynchronous circuit has been used in very large scale integrated (VLSI) circuits for better performance.
Generally, current multipliers include right-to-left array multipliers, left-to-right array multipliers, divided array multipliers and multi-select array multipliers.
In the conventional technology, the right-to-left array multiplier has the simplest connections and rules, and has thus become one of the most popular structures.
For an n-bit multiplicand and an n-bit multiplier, the area of the right-to-left carry-ripple array multiplier 100 is:
$A_{\text{R-L-CRA}} = A_{\text{PPG}} + A_{\text{CRA-array}}$ (1)
$A_{\text{PPG}} = n^2 A_{\text{AND2}}$ (2)
$A_{\text{CRA-array}} = (n-1)^2 A_{\text{FA}} + (n-1) A_{\text{HA}}$ (3)
Wherein, $A_{\text{PPG}}$ represents the area of the PPG 102. $A_{\text{AND2}}$ represents the area of a two-input AND gate. $A_{\text{CRA-array}}$ represents the area of the carry-ripple addition array 104. $A_{\text{FA}}$ represents the area of a full adder. $A_{\text{HA}}$ represents the area of a half adder.
Referring to
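With an n-bit multiplicand and an n-bit multiplier, the area of the right-to-left carry-save array multiplier 120 is:
$A_{\text{R-L-CSA}} = A_{\text{PPG}} + A_{\text{CSA-array}} + A_{\text{final-stage-add}}$ (4)
$A_{\text{PPG}} = n^2 A_{\text{AND2}}$ (5)
$A_{\text{CSA-array}} = (n-1)(n-2) A_{\text{FA}} + (n-1) A_{\text{HA}}$ (6)
$A_{\text{final-stage-add}} = A_{\text{n-bit-adder}}$ (7)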
Wherein, $A_{\text{final-stage-add}}$ represents the area of the final-stage adder 108, and the area depends on the implementation of the addition structure. In addition, in these equations, the right-to-left PPG and the left-to-right PPG have the same area. In terms of addition-array area, the CSA array is smaller than the CRA array, but the CSA scheme needs an additional final-stage adder.
In a synchronous multiplier, the addition array 104 executes faster in the carry-save array multiplier 120 than in the carry-ripple array multiplier 100. The delay can be reduced from $(2n-2)\,t_{FA}$ to $(n-1)\,t_{FA}$, where $t_{FA}$ represents the delay of a one-bit full adder.
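The cell counts and array delays stated above can be tabulated with a short sketch. It assumes that the flattened exponent in equation (3) reads (n−1)², and the function names are illustrative only, not part of the described circuits.

```python
def cra_array_cells(n):
    """Full-adder and half-adder counts of the carry-ripple array, equation (3)."""
    return {"FA": (n - 1) ** 2, "HA": n - 1}


def csa_array_cells(n):
    """Full-adder and half-adder counts of the carry-save array, equation (6)."""
    return {"FA": (n - 1) * (n - 2), "HA": n - 1}


def array_delay_in_tfa(n, carry_save):
    """Addition-array delay in units of t_FA, as quoted in the text."""
    return (n - 1) if carry_save else (2 * n - 2)


if __name__ == "__main__":
    n = 32
    print("CRA array:", cra_array_cells(n), "delay:", array_delay_in_tfa(n, False))
    print("CSA array:", csa_array_cells(n), "delay:", array_delay_in_tfa(n, True))
```

For n = 32 the carry-save array saves n−1 = 31 full adders and halves the array delay, at the price of the final-stage adder, whose area is not modeled here.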
The area of the L-R carry-ripple array multiplier 200 is:
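$A_{\text{L-R-CRA}} = A_{\text{PPG}} + A_{\text{CRA-array}} + A_{\text{final-stage-add}}$ (8)
$A_{\text{PPG}} = n^2 A_{\text{AND2}}$ (9)
$A_{\text{CRA-array}} = (n-1)(n-2) A_{\text{FA}} + (n-1) A_{\text{HA}}$ (10)
$A_{\text{final-stage-add}} = A_{\text{n-bit-adder}}$ (11)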
As shown in
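$A_{\text{L-R-CSA}} = A_{\text{PPG}} + A_{\text{CSA-array}} + A_{\text{final-stage-add}} + A_{\text{extra}}$ (12)
$A_{\text{PPG}} = n^2 A_{\text{AND2}}$ (13)
$A_{\text{CSA-array}} = (n-3)(n-2) A_{\text{FA}} + (n-2) A_{\text{HA}}$ (14)
$A_{\text{final-stage-add}} = A_{\text{2n-bit-adder}}$ (15)
$A_{\text{extra}} = (n-2) A_{\text{FA}}$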
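As a worked instance of equations (13) and (14) and of the extra-adder term, the following evaluates the cell counts for n = 32, the operand width used in Table 1. It is an illustrative calculation only. The final-stage adder area is left out because it depends on the adder structure chosen.

```latex
\begin{aligned}
A_{\text{PPG}} &= 32^2\,A_{\text{AND2}} = 1024\,A_{\text{AND2}} \\
A_{\text{CSA-array}} &= (32-3)(32-2)\,A_{\text{FA}} + (32-2)\,A_{\text{HA}} = 870\,A_{\text{FA}} + 30\,A_{\text{HA}} \\
A_{\text{extra}} &= (32-2)\,A_{\text{FA}} = 30\,A_{\text{FA}}
\end{aligned}
```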
Based on the high-level estimation, the cost of the L-R scheme is similar to that of the R-L scheme. Table 1 shows the cost and delay time of the 32 x 32 R-L multiplier.
The base-line scheme uses the R-L carry-ripple array multiplier 100. The scheme does not need the final-stage adder 106. The second row represents the right-to-left CSA array with the CRA in the final-stage adder 106. It, however, causes the longest delay.
From Table 1, the left-to-right array multipliers 200 and 220 have lower cost and better performance than the right-to-left array multipliers 100 and 120. Using a carry look-ahead adder (CLA) as the final-stage adder costs slightly more, but shortens the addition time more than a carry-ripple adder does.
The left-to-right CSA array with the CLA as the final-stage adder reduces logic cost by 8.20% and computation time by 14.75%. Compared with the other schemes, it provides a better cost/performance ratio.
Generally, an array multiplier has a longer transmission route and consumes more power. A solution is to divide the array into two parts. Then, the results are combined at the final stage. Accordingly, the computation time of this scheme can be reduced.
The data-dependent carry-save addition array 406 comprises the full adders 412 and the multiplexers 414. When the bit Bn of the multiplier 404 is 1, the partial product is equal to the bit An of the multiplicand 402. The full adder 412 adds its inputs (CI, SI and AI) and outputs the carry/sum vector through the multiplexer 414 to the next stage. If the bit Bn of the multiplier 404 is 0, the partial products of this row are zero. The full adders 412 of this row then have nothing to compute, and the multiplexers 414 simply pass the incoming carry/sum vector to the next stage.
In the data-dependent carry decomposition area of the multiplier 404, the sum and the carry are added to obtain the final product. This area must decompose all carries transmitted from the LSB and the MSB. The carry-ripple adder has the smallest carry decomposition area. A carry look-ahead adder can also be selected to reduce the computation time.
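The row bypass described above can be sketched at word level. This is a behavioral model under stated assumptions, not the gate-level array: the sum and carry vectors are held as integers, the function name is hypothetical, and the final addition stands in for the carry decomposition area.

```python
def data_dependent_multiply(a, b, n):
    """Word-level model of a data-dependent carry-save addition array.

    a: multiplicand, b: n-bit multiplier.
    Returns (product, number of rows whose full adders actually computed).
    """
    s, c = 0, 0          # sum vector and carry vector
    adder_rows = 0
    for i in range(n):
        if (b >> i) & 1:
            pp = a << i  # the partial product equals the shifted multiplicand
            # One carry-save step: bitwise sum, and majority carry shifted left.
            s, c = s ^ c ^ pp, ((s & c) | (s & pp) | (c & pp)) << 1
            adder_rows += 1
        # Otherwise the multiplexers forward (s, c) unchanged.
    return s + c, adder_rows  # final addition of the sum and the carry


if __name__ == "__main__":
    print(data_dependent_multiply(13, 10, 4))
```

Each carry-save step preserves the value s + c + pp, so the final addition yields the product. Only rows whose multiplier bit is 1 contribute adder delay.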
In the conventional technology, delay-insensitive (DI) units are used in the asynchronous array multiplier. The DI units usually include the PPGs, the DI adder, the DI carry look-ahead adder and the completion detector.
Except for a few schemes, such as the Kearney and Bergmann data-dependent multiplier, the first unit of most array multipliers is a PPG. The PPG can be defined as below:
Accordingly, a multiplier with an m-bit multiplicand and an n-bit multiplier requires m×n PPGs to generate the m×n partial-product bits.
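The m×n count can be illustrated with a single-rail sketch in which each PPG is modeled as one two-input AND of a multiplicand bit and a multiplier bit. The function names and the row/column layout are assumptions made for illustration.

```python
def partial_products(a, b, m, n):
    """Return n rows of m bits; row j, column i holds (bit i of a) AND (bit j of b)."""
    return [[((a >> i) & 1) & ((b >> j) & 1) for i in range(m)]
            for j in range(n)]


def sum_partial_products(rows):
    """Add every partial-product bit at its weight 2**(i + j)."""
    return sum(bit << (i + j)
               for j, row in enumerate(rows)
               for i, bit in enumerate(row))


if __name__ == "__main__":
    rows = partial_products(13, 10, 4, 4)
    print(len(rows) * len(rows[0]), "PPG outputs")
    print(sum_partial_products(rows))
```

Summing the weighted bits reproduces the product, which is the job the addition array performs in hardware.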
In the conventional multiplier, the least significant partial product is generated at the top of the array. In the left-to-right multiplier, by contrast, the least significant partial product is generated at the bottom of the array.
The PPG is implemented by the DI AND gate. The logic of the DI AND gate can be defined as:
$Q^1 \leftarrow A^1 B^1$ (18)
$Q^0 \leftarrow A^0 + B^0$ (19)
Wherein, $(A^1, A^0)$ and $(B^1, B^0)$ are the inputs, and $(Q^1, Q^0)$ is the output. In addition, all signals use dual-rail signaling.
In the conventional technology, the DI full adder 700 can be a basic unit of the addition array. To execute the DI full adder 702, the dual-rail signal is used for inputting (A0,A1), (B0,B1) and (C0,C1), and outputting the sum (S0,S1) and the carry (Cout0,Cout1). Wherein, the sum and the carry can be obtained from the following logic expression:
$C_{out}^0 = A^0 B^0 + A^0 C^0 + B^0 C^0$
$C_{out}^1 = A^1 B^1 + A^1 C^1 + B^1 C^1$
The DI full adder 800 can comprise the right-to-left carry-ripple array or the carry-save array of the asynchronous multiplier shown in
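Carry-kill $k_i = A_i^0 B_i^0$ (24)
Carry-generate $g_i = A_i^1 B_i^1$ (25)
Carry-propagate $p_i = A_i^0 B_i^1 + A_i^1 B_i^0$ (26)
Sum0 $S_i^0 = A_i^0 B_i^0 C_i^0 + A_i^1 B_i^1 C_i^0 + A_i^0 B_i^1 C_i^1 + A_i^1 B_i^0 C_i^1$ (27)
Sum1 $S_i^1 = A_i^1 B_i^1 C_i^1 + A_i^1 B_i^0 C_i^0 + A_i^0 B_i^1 C_i^0 + A_i^0 B_i^0 C_i^1$ (28)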
Wherein, i=0, 1 . . . , n−2, n−1. As shown in
The D module 904 can be shown as:
Block-carry-propagate $P_{i,k} = P_{i,j} P_{j-1,k}$ (29)
Block-carry-kill $K_{i,k} = K_{i,j} + P_{i,j} K_{j-1,k}$ (30)
Block-carry-generate $G_{i,k} = G_{i,j} + P_{i,j} G_{j-1,k}$ (31)
Block-carry-out $C_j^0 = K_{j-1,k} + P_{j-1,k} C_k^0$ (32)
Block-carry-out $C_j^1 = G_{j-1,k} + P_{j-1,k} C_k^1$ (33)
Wherein, i=0, 1, . . . , n−2, n−1. The input/output signals of the D module 904 can be written as $I_{i,j} = (K_{i,j}, G_{i,j}, P_{i,j})$ and $C_i = (C_i^0, C_i^1)$.
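The block equations above can be checked with a small behavioral sketch. It assumes single-rail operand bits for brevity and takes the 0-rail of the block carry-out to be formed from the block-carry-kill term. The helper names are hypothetical.

```python
from functools import reduce


def bit_kgp(a, b):
    """(kill, generate, propagate) of one bit position, equations (24)-(26)."""
    return (int(a == 0 and b == 0), a & b, a ^ b)


def merge(hi, lo):
    """Merge a more significant block with a less significant one, equations (29)-(31)."""
    k_hi, g_hi, p_hi = hi
    k_lo, g_lo, p_lo = lo
    return (k_hi | (p_hi & k_lo), g_hi | (p_hi & g_lo), p_hi & p_lo)


def block_carry_out(block, c0, c1):
    """Dual-rail carry out (rail 0, rail 1) of a block for a dual-rail carry in."""
    k, g, p = block
    return (k | (p & c0), g | (p & c1))


def carry_out(a, b, n, cin=0):
    """Carry out of an n-bit addition, computed through block merging."""
    bits = [bit_kgp((a >> i) & 1, (b >> i) & 1) for i in range(n)]
    block = reduce(merge, reversed(bits))  # fold from the MSB toward the LSB
    return block_carry_out(block, 1 - cin, cin)


if __name__ == "__main__":
    print(carry_out(0b1000, 0b1000, 4))  # rail 1 set: a carry is produced
    print(carry_out(0b0111, 0b0001, 4))  # rail 0 set: no carry out of bit 3
```

Exactly one of kill, generate and propagate is 1 for any block, so exactly one carry rail is asserted once the carry-in is valid.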
In the initial state of
In the conventional technology, the synchronous circuit uses a clock to synchronize the operations of all sub-systems, whereas the asynchronous circuit does not. The asynchronous circuit usually uses a start signal (demand) and a completion signal (response) to synchronize itself with other circuits.
N input completion detections can be executed by the two-input C element 1002 so as to build the tree structure shown in
$\text{done} = ack_0 \cdot ack_1 \cdot ack_2 \cdots ack_{n-2} \cdot ack_{n-1}$ (34)
$\text{reset} = ack_0 + ack_1 + ack_2 + \cdots + ack_{n-2} + ack_{n-1}$ (35)
The done function is performed by the n-input AND gate 1004, and the reset function is performed by the n-input OR gate 1006. The two-input C element 1002 is used for combining them. If all $ack_i$ are opened, so that done=reset=1, then donereset is opened. If all $ack_i$ are closed, so that done=reset=0, then donereset is closed. In addition, if done is not equal to reset, then donereset remains unchanged.
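The completion detector can be modeled behaviorally as follows. The sketch assumes the usual C-element behavior, in which the output follows the inputs when they agree and holds its value when they differ. The class name is illustrative.

```python
class CompletionDetector:
    """n-input AND and OR gates combined by a two-input C element."""

    def __init__(self):
        self.donereset = 0  # state held by the C element

    def update(self, acks):
        done = int(all(acks))   # n-input AND gate, equation (34)
        reset = int(any(acks))  # n-input OR gate, equation (35)
        if done == reset:
            # Both C-element inputs agree: the output follows them.
            self.donereset = done
        # Inputs differ: the previous output is held.
        return self.donereset


if __name__ == "__main__":
    detector = CompletionDetector()
    for acks in ([0, 0, 0], [1, 0, 1], [1, 1, 1], [1, 0, 1], [0, 0, 0]):
        print(acks, detector.update(acks))
```

The output rises only after every acknowledge has risen and falls only after every acknowledge has fallen, which gives the detector its hysteresis.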
Therefore, if a particular bit of the multiplier is zero, the partial product corresponding to that bit is zero, and the sum and carry vectors remain zero until a bit of the multiplier equal to 1 is met. If most of the bits of the multiplier are zero, its effective bit length is shorter than the designed length. Accordingly, the multiplier spends considerable delay time calculating these zeros.
Accordingly, a method to resolve the issues described above is desired.
Accordingly, the present invention is directed to an asynchronous multiplier, which directly outputs an ineffective bit, i.e., zero, to the final-stage adder to save operational time and enhance the operational speed.
The present invention provides an asynchronous multiplier. The asynchronous multiplier comprises a partial product generator, an addition array, a leading-zero-bit detector, a final-stage adder, and a completion detector. The partial product generator generates a plurality of partial products according to a multiplier and a multiplicand. The addition array is coupled to the partial product generator, and performs an addition operation on the partial products. The leading-zero-bit detector is coupled to the addition array to detect an effective bit of the multiplier and an effective bit of the multiplicand, and to output a set of detection signals. The final-stage adder is coupled to the addition array to add the partial products and to output a sum. The completion detector is coupled to the final-stage adder to check and output the sum.
According to an embodiment of the present invention, the addition array comprises a plurality of zero adders coupled to the partial product generator and the leading zero-bit detector, and determines either to output zero or perform the addition operation according to the set of the detection signals.
According to an embodiment of the present invention, the zero adder comprises a plurality of DI adders and a plurality of DI multiplexers. The DI adders perform an addition operation on each bit of the partial products. The DI multiplexers are coupled to the DI adders, determining either to output zero or perform the addition operation according to the set of the detection signals.
According to an embodiment of the present invention, each of the multiplier and the multiplicand comprises effective bits and an ineffective bit. The multiplier is coupled to the leading-zero-bit detector.
According to an embodiment of the present invention, the leading-zero-bit detector detects each bit between a most significant bit and a least significant bit of the multiplier.
According to an embodiment of the present invention, a logic value of the most significant bit is 0.
According to an embodiment of the present invention, the addition array is a left-to-right addition array.
The present invention applies the accelerating circuit composed of the leading-zero-bit detector and the zero adders. The effective bits and the ineffective bit of the partial products can be differentiated. The ineffective bit, i.e., 0, is directly output to the final-stage adder to save the operational time and enhance the operational speed.
The above and other features of the present invention will be better understood from the following detailed description of the embodiments of the invention that is provided in conjunction with the accompanying drawings.
In this embodiment, the ineffective bits and the effective bits are defined by checking each bit of the operand from the most significant bit (MSB) toward the least significant bit (LSB). If a bit is zero, it is defined as an ineffective bit, and the next bit is checked until a “1” bit is found. The bits from that “1” bit to the least significant bit are called effective bits. Their length is called the effective length. In addition, the length of the ineffective bits is called the ineffective length.
For example, for a 32×32 multiplication operation, if the multiplier and the multiplicand value in hexadecimal are 02E50FF0 and 00000D34, the ineffective bits are 6 bits and 20 bits, respectively. The effective bits are 26 bits and 12 bits, respectively.
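The effective and ineffective lengths in this example can be reproduced with a short sketch. The function names are illustrative, and Python's built-in `int.bit_length()` is used to locate the leading “1” bit.

```python
def effective_length(value, width):
    """Number of bits from the leading 1 down to the LSB (0 if value is 0)."""
    if value < 0 or value >> width:
        raise ValueError("value does not fit in the given width")
    return value.bit_length()


def ineffective_length(value, width):
    """Number of leading zero bits within the given operand width."""
    return width - effective_length(value, width)


if __name__ == "__main__":
    for operand in (0x02E50FF0, 0x00000D34):
        print(hex(operand),
              ineffective_length(operand, 32),
              effective_length(operand, 32))
```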
In this embodiment, the accelerating circuit comprises a leading-zero-bit detector 1310 and a zero adder 1304.
The leading-zero-bit detector 1310 detects the effective bits and outputs the detection signals to the zero adder. The zero adder 1304, according to the detection signals, determines either to output zero or perform the addition operation. Wherein, the zero adder 1304 is used to constitute the addition array to replace the conventional addition array.
In this embodiment, the leading-zero-bit detector 1310 can be, for example, a delay insensitive (DI) leading-zero-bit detector, which checks each bit between the most significant bit and the least significant bit of the multiplier. If a bit is zero, the zero-flag is 1. Then, a next bit is checked until a “1” bit is found. If the bit is 1, the corresponding zero-flag is 0, other bits of the multiplier need not be checked, and the remaining zero-flags are zero.
For example, when X1=00010010, and X0=11101101, then Z1=11100000 and Z0=00011111. When X1=00000110 and X0=11111001, then Z1=11111000 and Z0=00000111.
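The flag patterns in these examples follow from a simple MSB-first scan, sketched below on single-rail bit strings. The function name and the string representation are assumptions for illustration. The Z0 rail is the bitwise complement of the returned Z1 pattern.

```python
def zero_flags(x_bits):
    """Return the Z1 zero-flag pattern for a bit string given MSB first."""
    flags = []
    in_leading_zeros = True  # remains True until the first "1" bit is found
    for bit in x_bits:
        in_leading_zeros = in_leading_zeros and bit == "0"
        flags.append("1" if in_leading_zeros else "0")
    return "".join(flags)


def complement(bits):
    """Bitwise complement of a bit string, giving the other rail."""
    return "".join("1" if b == "0" else "0" for b in bits)


if __name__ == "__main__":
    for x1 in ("00010010", "00000110"):
        z1 = zero_flags(x1)
        print(x1, z1, complement(z1))
```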
In order to implement the DI leading-zero-bit detector 1310, dual-rail signaling is used for the input bits, the zero-flags and the zero-propagation signals. Accordingly, the 1-bit circuit can be defined as:
Zero-flag1 $Z_i^1 = P_{i+1}^1 X_i^0$ (36)
Zero-flag0 $Z_i^0 = P_{i+1}^1 X_i^1 + P_{i+1}^0 X_i^1 + P_{i+1}^0 X_i^0$ (37)
Zero-propagate1 $P_i^1 = P_{i+1}^1 X_i^0$ (38)
Zero-propagate0 $P_i^0 = P_{i+1}^1 X_i^1 + P_{i+1}^0 X_i^1 + P_{i+1}^0 X_i^0$ (39)
Wherein, i=0, 1, . . . , n−1.
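The 1-bit cell can be sketched in dual-rail form as follows. Two assumptions are made: each signal is a pair (rail 0, rail 1) with (0, 0) as the empty code, and the propagate input above the MSB is tied to logic 1. The 0-rail of the zero-flag is taken to have the same form as equation (39). All names are illustrative.

```python
def encode(bit):
    """Dual-rail code of a bit as (rail0, rail1)."""
    return (1 - bit, bit)


def lzd_cell(p_in, x):
    """One-bit leading-zero cell: returns (zero_flag, propagate_out)."""
    p0, p1 = p_in
    x0, x1 = x
    rail1 = p1 & x0                            # equations (36) and (38)
    rail0 = (p1 & x1) | (p0 & x1) | (p0 & x0)  # equations (37) and (39)
    return (rail0, rail1), (rail0, rail1)


def lzd(x, n):
    """Scan an n-bit value from the MSB; return the (Z0, Z1) patterns MSB first."""
    p = encode(1)  # assumed boundary condition above the MSB
    z0_bits, z1_bits = [], []
    for i in reversed(range(n)):
        z, p = lzd_cell(p, encode((x >> i) & 1))
        z0_bits.append(str(z[0]))
        z1_bits.append(str(z[1]))
    return "".join(z0_bits), "".join(z1_bits)


if __name__ == "__main__":
    print(lzd(0b00010010, 8))
```

While an input is still in the empty code, both output rails of the cell stay at 0, so a flag becomes valid only after its operand bit and the incoming propagate signal are valid.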
In this embodiment, the n bits are divided into several blocks to solve the issue described above. Generally, a small block has a small area and a long delay. A small input, however, makes the computation and the transmission of the result to the next stage faster. In contrast, a large block has a large area and a short delay. A large block, however, is accompanied by large fan-in and fan-out, which lengthen the delay. Accordingly, the block size determines the area and the delay time.
The delay is related to the effective length of the simulation data. A longer effective length results in a longer delay, and a shorter effective length results in a shorter delay.
Referring to
The leading-zero-bit detector 1310 generates a zero-flag Z. When Z is zero, the multiplexer 1604 selects and outputs an addition result. When Z is 1, the multiplexer 1604 does not need to wait for the operational result. The multiplexer 1604 immediately selects and outputs zero. The computation time is thus reduced.
In the DI zero adder 1304, the dual-rail signaling method is used to implement the DI full adder 1602 and the DI multiplexer 1604. The logic expression of the DI full adder 1602 can be shown as:
Carry0 $C_{i+1}^0 = A_i^0 B_i^0 + A_i^0 C_i^0 + B_i^0 C_i^0$ (40)
Carry1 $C_{i+1}^1 = A_i^1 B_i^1 + A_i^1 C_i^1 + B_i^1 C_i^1$ (41)
Sum0 $S_i^0 = A_i^0 B_i^0 C_i^0 + A_i^0 B_i^1 C_i^1 + A_i^1 B_i^0 C_i^1 + A_i^1 B_i^1 C_i^0$ (42)
Sum1 $S_i^1 = A_i^1 B_i^1 C_i^1 + A_i^1 B_i^0 C_i^0 + A_i^0 B_i^1 C_i^0 + A_i^0 B_i^0 C_i^1$ (43)
Wherein, $A_i$ and $B_i$ are the main inputs of the adder 1602, and $C_i$ is the carry input of the adder 1602. In addition, $C_{i+1}$ and $S_i$ are the carry output and the sum output of the adder 1602. The carry bits are encoded with dual-rail signaling. If the formula 44 is equal to 1, it means that no carry emerges from the last-stage adder 1602. If the formula 45 is equal to 1, it means that a carry emerges from the last-stage adder 1602.
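The full-adder expressions (40) to (43) can be exercised with the following sketch, which assumes each dual-rail signal is represented as a pair (rail 0, rail 1). The function names are illustrative.

```python
def encode(bit):
    """Dual-rail code of a bit as (rail0, rail1)."""
    return (1 - bit, bit)


def di_full_adder(a, b, c):
    """Dual-rail full adder: returns (sum, carry) as (rail0, rail1) pairs."""
    a0, a1 = a
    b0, b1 = b
    c0, c1 = c
    carry0 = (a0 & b0) | (a0 & c0) | (b0 & c0)                            # (40)
    carry1 = (a1 & b1) | (a1 & c1) | (b1 & c1)                            # (41)
    sum0 = (a0 & b0 & c0) | (a0 & b1 & c1) | (a1 & b0 & c1) | (a1 & b1 & c0)  # (42)
    sum1 = (a1 & b1 & c1) | (a1 & b0 & c0) | (a0 & b1 & c0) | (a0 & b0 & c1)  # (43)
    return (sum0, sum1), (carry0, carry1)


if __name__ == "__main__":
    for a in (0, 1):
        for b in (0, 1):
            for c in (0, 1):
                print(a, b, c, di_full_adder(encode(a), encode(b), encode(c)))
```

For every valid input combination exactly one rail of the sum and one rail of the carry is asserted.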
The DI zero adder 1304 comprises the DI adder 1602 and the DI multiplexer 1604. Its logic expression is shown as:
Carry0 $C_{i+1}^0 = Z_i^0 (A_i^0 B_i^0 + A_i^0 C_i^0 + B_i^0 C_i^0) + Z_i^1 (E_i^0)$ (44)
Carry1 $C_{i+1}^1 = Z_i^0 (A_i^1 B_i^1 + A_i^1 C_i^1 + B_i^1 C_i^1) + Z_i^1 (E_i^1)$ (45)
Sum0 $S_i^0 = Z_i^0 (A_i^0 B_i^0 C_i^0 + A_i^0 B_i^1 C_i^1 + A_i^1 B_i^0 C_i^1 + A_i^1 B_i^1 C_i^0) + Z_i^1 (E_i^0)$ (46)
Sum1 $S_i^1 = Z_i^0 (A_i^1 B_i^1 C_i^1 + A_i^1 B_i^0 C_i^0 + A_i^0 B_i^1 C_i^0 + A_i^0 B_i^0 C_i^1) + Z_i^1 (E_i^1)$ (47)
Wherein, $Z_i$ represents the zero-flag from the corresponding leading-zero-bit detector 1310. If $E_i$ is always zero, then $E_i^1 = 0$ and $E_i^0 = 1$. The equations described above can be simplified as:
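Carry0 $C_{i+1}^0 = Z_i^0 (A_i^0 B_i^0 + A_i^0 C_i^0 + B_i^0 C_i^0) + Z_i^1$ (48)
Carry1 $C_{i+1}^1 = Z_i^0 (A_i^1 B_i^1 + A_i^1 C_i^1 + B_i^1 C_i^1)$ (49)
Sum0 $S_i^0 = Z_i^0 (A_i^0 B_i^0 C_i^0 + A_i^0 B_i^1 C_i^1 + A_i^1 B_i^0 C_i^1 + A_i^1 B_i^1 C_i^0) + Z_i^1$ (50)
Sum1 $S_i^1 = Z_i^0 (A_i^1 B_i^1 C_i^1 + A_i^1 B_i^0 C_i^0 + A_i^0 B_i^1 C_i^0 + A_i^0 B_i^0 C_i^1)$ (51)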
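The simplified expressions (48) to (51) can be sketched in the same way, again assuming (rail 0, rail 1) pairs with (0, 0) as the empty code. The example shows that a zero-flag of 1 forces a valid zero on both outputs even while the operands are still empty. The names are illustrative.

```python
def encode(bit):
    """Dual-rail code of a bit as (rail0, rail1)."""
    return (1 - bit, bit)


def di_zero_adder(a, b, c, z):
    """Dual-rail zero adder: returns (sum, carry) as (rail0, rail1) pairs."""
    a0, a1 = a
    b0, b1 = b
    c0, c1 = c
    z0, z1 = z
    carry0 = (z0 & ((a0 & b0) | (a0 & c0) | (b0 & c0))) | z1            # (48)
    carry1 = z0 & ((a1 & b1) | (a1 & c1) | (b1 & c1))                   # (49)
    sum0 = (z0 & ((a0 & b0 & c0) | (a0 & b1 & c1)
                  | (a1 & b0 & c1) | (a1 & b1 & c0))) | z1              # (50)
    sum1 = z0 & ((a1 & b1 & c1) | (a1 & b0 & c0)
                 | (a0 & b1 & c0) | (a0 & b0 & c1))                     # (51)
    return (sum0, sum1), (carry0, carry1)


if __name__ == "__main__":
    empty = (0, 0)
    # Zero-flag = 1: the outputs are a valid zero before the operands arrive.
    print(di_zero_adder(empty, empty, empty, encode(1)))
    # Zero-flag = 0: the cell behaves as a full adder.
    print(di_zero_adder(encode(1), encode(1), encode(0), encode(0)))
```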
Compared with the DI full adder 1602, the DI zero adder 1304 occupies a larger area, but can reduce the delay of the multiplier.
Referring to
Partial product: $PP_i = X_i \cdot Y$ (52)
Wherein, i=0, 1 . . . , n−1. The sequence of the squares from top to bottom is from $PP_{n-1}$ to $PP_0$. Additionally, the first square $PP_{n-1}$ represents the partial product of the most significant bit of the multiplier and the multiplicand Y.
The leading-zero-bit detector 1702 generates the zero-flags ($Z_i$), wherein i=0, 1 . . . n−3. Because the first row of the addition array is the sum of the first three rows of the partial products, only n−2 zero-flag bits are generated and processed for the n-bit multiplier. Each $Z_i$ controls a corresponding row of the addition array. If $Z_i$=0, the multiplexers of the corresponding row select the addition result, and the sum vector and the carry vector are propagated to the next stage. When $Z_i$=1, the multiplexers of the corresponding row select and output 0 to the next stage. For an n-bit multiplier, if the multiplier has m effective bits, m−3 stages of the addition rows are redundant. The m-row zero adder 1708 need not wait for the result in the final stage, and directly outputs zero. The m effective bits of the multiplier can reduce the m−2 stage computation time. Accordingly, only n−2−m stage computation time can be used to reach data dependence.
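The row-skipping behavior can be sketched with a word-level model of a left-to-right multiplication. This is a simplification under stated assumptions: every partial product gets its own row (the merging of the first three partial products into the first row is not modeled), operands are plain integers, and the function name is hypothetical.

```python
def ltr_multiply(x, y, n):
    """Left-to-right multiplication of an n-bit multiplier x by y.

    Returns (product, number of rows that performed an addition).
    """
    acc = 0
    active_rows = 0
    zero_flag = True  # Z = 1 while the scan is still inside the leading zeros
    for i in reversed(range(n)):  # most significant partial product first
        x_i = (x >> i) & 1
        zero_flag = zero_flag and x_i == 0
        if zero_flag:
            continue  # the row outputs zero immediately; no addition is awaited
        acc = (acc << 1) + (y if x_i else 0)  # PP_i = X_i * Y, equation (52)
        active_rows += 1
    return acc, active_rows


if __name__ == "__main__":
    product, rows = ltr_multiply(0x00000D34, 0x02E50FF0, 32)
    print(hex(product), rows)
```

In this model the number of rows that perform an addition equals the effective length of the multiplier, so a multiplier with many leading zeros completes in correspondingly fewer stages.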
Accordingly, the asynchronous multiplier of the present invention divides the partial products into the effective bits and the ineffective bits. The ineffective bits, i.e., zeros, are directly output to the final-stage adder to save the computation time and enhance the operational speed.
Although the present invention has been described in terms of exemplary embodiments, it is not limited thereto. Rather, the appended claims should be construed broadly to include other variants and embodiments of the invention which may be made by those skilled in the field of this art without departing from the scope and range of equivalents of the invention.