The present invention relates generally to electronic processors that perform arithmetic operations (addition, subtraction, multiplication and division). Specifically, it relates to adder circuits for use in semiconductor integrated circuits and other electronic devices.
Binary addition is the single most important operation that a computer processor performs and has been investigated thoroughly since the beginning of computing. The performance of a processor is significantly influenced by the speed of its adders. M. A. Franklin and T. Pan, Performance Comparison of Asynchronous Adders, in Proc. of Int'l Symp. Advanced Research in Asynchronous Circuits and Systems, pp. 117-125, Nov. 1994, show that in a prototypical RISC machine (DLX; J. L. Hennessy and D. A. Patterson, Computer Architecture: A Quantitative Approach, Morgan Kaufmann, 1990), 72 percent of the instructions perform additions (or subtractions) in the datapath. J. D. Garside, A CMOS VLSI Implementation of an Asynchronous ALU, Asynchronous Design Methodologies, S. Furber and M. Edwards, eds., vol. A-28 of IFIP Trans., pp. 181-207, 1993, reports that this figure reaches 80 percent in ARM processors.
Adders can be sequential or combinatorial. Sequential adders are inherently slow because of their incremental mode of operation and are therefore not considered for parallel, fast addition. The basic building block of a combinatorial digital adder is a single-bit adder. The Half-Adder (HA) is the simplest single-bit adder. The Full-Adder (FA) is a single-bit adder with provision for a carry input and a carry output. A full-adder is typically composed of two HAs and is hence more expensive than a half-adder in terms of area, time and interconnection complexity.
The most common approach for designing multi-bit adders is to form a chain of FA blocks by connecting the carry out bit of a FA to the carry in bit of the next FA block.
This arrangement is known as a Ripple Carry Adder (RCA). The delay of an RCA increases linearly with the number of bits. It nevertheless remains the most efficient design, and thereby the designers' choice, for small numbers of bits (≤ 4), as explained by N. H. E. Weste and K. Eshraghian, Principles of CMOS VLSI Design: A Systems Perspective, 2nd Edition, Addison-Wesley Pub., 1994. Many different combinatorial adders have been designed to improve on the efficiency of the basic RCA, and some of them exploit the possible parallelism of the addition operation.
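The chained structure described above can be sketched in software as follows. This is only an illustrative bit-level model, not a circuit description; the helper names `full_adder`, `ripple_carry_add`, `to_bits` and `from_bits` are hypothetical and do not come from the specification.

```python
def full_adder(a, b, cin):
    """Single-bit full adder: returns (sum, carry_out)."""
    s = a ^ b ^ cin
    cout = (a & b) | ((a ^ b) & cin)
    return s, cout


def ripple_carry_add(a_bits, b_bits):
    """Add two equal-length, least-significant-bit-first bit lists.

    The carry-out of each full adder feeds the carry-in of the next one,
    so the loop below visits the bits strictly in order.
    """
    carry = 0
    sum_bits = []
    for a, b in zip(a_bits, b_bits):
        s, carry = full_adder(a, b, carry)
        sum_bits.append(s)
    return sum_bits, carry


def to_bits(value, n):
    """Illustrative helper: n-bit list of value, least significant bit first."""
    return [(value >> i) & 1 for i in range(n)]


def from_bits(bits):
    """Illustrative helper: integer value of a least-significant-first bit list."""
    return sum(bit << i for i, bit in enumerate(bits))
```

Because each loop iteration needs the carry produced by the previous one, the number of dependent steps grows linearly with the operand width, which is the linear delay noted above.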
As described by R. E. Ladner and M. J. Fischer, Parallel Prefix Computation, Journal of the ACM, 27(4), pp. 831-838, October 1980, addition is a special prefix problem, which means that each sum bit depends on all input bits of equal or lower significance. This dependency makes it difficult to implement a parallel algorithm for addition. However, the flow of bits can be carefully arranged into a tree-structured implementation of the adder that reduces the addition overhead significantly. Carry Look Ahead, Carry Select and Carry Skip adders belong to this category. Carry Save adders, on the other hand, avoid carry propagation altogether by employing a redundant number representation. Eventually the redundant number needs to be converted to the non-redundant representation by a carry propagate adder, which eliminates much of the earlier gain.
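As a rough illustration of the carry-save idea mentioned above, the sketch below keeps a running total as a pair of words (a sum word and a carry word) and resolves the carries only once, at the end. The function names `carry_save_step` and `add_many` are hypothetical, and Python integers stand in for fixed-width registers.

```python
def carry_save_step(x, y, z):
    """3:2 compression of three operands into (sum word, carry word).

    Every bit position is handled independently, so no carry ripples
    between positions inside this step.
    """
    partial_sum = x ^ y ^ z
    carry_word = ((x & y) | (x & z) | (y & z)) << 1
    return partial_sum, carry_word


def add_many(operands):
    """Accumulate operands in redundant (sum, carry) form, then convert."""
    s, c = 0, 0
    for op in operands:
        s, c = carry_save_step(s, c, op)
    # Final conversion to non-redundant form: this is the single
    # carry-propagate addition that the text says cannot be avoided.
    return s + c
```

The final `s + c` is where a conventional carry-propagate adder is still required.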
Apart from the theoretically best possible adder design, implementation issues such as circuit complexity and fabrication limitations also play a crucial role in circuit design. High circuit complexity and an irregular layout can render a design infeasible for VLSI fabrication. Moreover, the number of inputs that a single signal can drive is limited, which is known as the fan-out limitation. Fan-out also incurs extra delay, as the load capacitance increases with the fan-out. Power dissipation is another important practical issue that limits the number of interconnections in a VLSI implementation.
As reported by Fu-C. Cheng, S. H. Unger and Michael Theobald, Self-Timed Carry-Lookahead Adders, IEEE Transactions on Computers, 49(7), pp. 659-672, July 2000, the best parallel adder can perform addition in time proportional to the logarithm of the logarithm of the number of bits. Typically, area and interconnection efficiency are traded off to achieve logarithmic or sub-logarithmic performance. It thus remains a challenge for researchers to achieve a fast adder with low area and interconnection requirements.
The present invention discloses a recursive formulation for a PArallel Self-Timed Adder (PASTA). The design of PASTA is regular and uses HAs along with multiplexers, with minimal interconnection requirements. The interconnection and area requirements are thus linear, which makes the design practical to fabricate in a VLSI chip. The design works in a truly parallel manner for those bits that do not require carry propagation. For operands with a large number of bits, the carry chains are logarithmic in the number of bits and hence significantly shorter than the operands (B. Gilchrist, J. H. Pomerene, and S. Y. Wong, Fast Carry Logic for Digital Computers, IRE Trans. Electronic Computers, 4(4): 133-136, December 1955). Hence the adder can theoretically operate in logarithmic time. It is self-timed, meaning that it signals the completion of the addition as soon as it is done, thereby overcoming the clocking limitations.
Accordingly, it is an object of the present invention to provide a fast and area/interconnection efficient parallel adder.
Briefly, the embodiment of this invention provides a recursive formulation for a PArallel Self-Timed Adder (PASTA). The design of PASTA is regular and uses Half Adders along with multiplexers, with minimal interconnection requirements. The interconnection and area requirements are thus linear, making the design easy to fabricate in a VLSI chip. The design works in a truly parallel manner for those bits that do not require carry propagation. It can therefore theoretically operate in logarithmic time, as the carry chains for operands with a large number of bits are logarithmic in the number of bits and hence significantly shorter than the operands, as disclosed by B. Gilchrist, J. H. Pomerene, and S. Y. Wong, Fast Carry Logic for Digital Computers, IRE Trans. Electronic Computers, 4(4): 133-136, December 1955.
The single-bit PASTA block uses multiplexers to select the original inputs at the beginning and generates the single-bit sum in the first step. For subsequent steps, the sum bit of a single-bit adder block is fed back recursively to the same block, where it is added to the carry-in from the previous bit adder. Whenever a carry is generated at a bit position, or needs to propagate through it, the carry is transferred to the next higher bit and the block's own carry is reset to zero. The construction of a multi-bit adder is thus quite similar to that of an RCA. The advantage is that the adder is self-timed and logarithmic. It signals the completion of the addition as soon as the carry signals from all individual bit adders are zero.
These objects and advantages of the present invention will become clear to those skilled in the art as illustrated in the detailed description and figures.
A first embodiment of the parallel self-timed adder is presented in
Let $a_{n-1}a_{n-2}\ldots a_0$ and $b_{n-1}b_{n-2}\ldots b_0$ be two n-bit binary numbers with sum and carry denoted by $S_{n-1}S_{n-2}\ldots S_0$ and $c_n c_{n-1}\ldots c_0$, where the 0th bit is the least significant bit. Basic single-bit adders are now discussed.
Single bit Half-Adder (HA) and Full-Adder (FA) are the fundamental building blocks for nearly all high-speed adders. A single bit HA for ith bit addition is logically formulated as follows:
$$S_i = a_i \oplus b_i$$
$$c_{i+1} = a_i b_i \tag{1}$$
According to the delay model of A. Tyagi, A reduced-area scheme for carry-select adders, IEEE Trans. Comput., 42(10):1162-1170, October 1993, simple logic gates (AND, OR, NAND, NOR, NOT) have 1 unit of gate delay and XOR/XNOR gates have 2 units of gate delay. Thus the gate-level delays associated with the $S_i$ and $c_{i+1}$ bits are 2 and 1 respectively. The gate-level area complexity of an HA is hence 2+1=3.
A single-bit full adder additionally takes into account the carry-in input from the preceding single-bit unit and is formulated as follows:
$$S_i = a_i \oplus b_i \oplus c_i$$
$$c_{i+1} = a_i b_i + (a_i \oplus b_i)\, c_i \tag{2}$$
The gate-level delay associated with each of the $S_i$ and $c_{i+1}$ bits is 4. The gate-level area complexity of an FA is 7.
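The two formulations and the quoted area figures can be cross-checked with a short sketch. It builds a full adder from two half adders plus an OR gate and tallies gate areas under the unit-gate model cited above. The gate lists `HA_GATES` and `FA_GATES` are an assumption about how equations (1) and (2) map onto gates, and all names are illustrative.

```python
from itertools import product

# Unit-gate model quoted in the text: simple gates cost 1, XOR/XNOR cost 2.
GATE_COST = {"AND": 1, "OR": 1, "XOR": 2}

# Assumed gate inventories for equations (1) and (2).
HA_GATES = ["XOR", "AND"]
FA_GATES = ["XOR", "XOR", "AND", "AND", "OR"]


def area(gates):
    """Total area of a gate list under the unit-gate model."""
    return sum(GATE_COST[g] for g in gates)


def half_adder(a, b):
    """Equation (1): returns (sum, carry_out)."""
    return a ^ b, a & b


def full_adder(a, b, cin):
    """Equation (2), realised as two half adders and an OR gate."""
    s1, c1 = half_adder(a, b)
    s, c2 = half_adder(s1, cin)
    return s, c1 | c2


def full_adder_is_correct():
    """Compare the full adder with integer addition on all 8 input rows."""
    return all(
        full_adder(a, b, cin) == ((a + b + cin) & 1, (a + b + cin) >> 1)
        for a, b, cin in product((0, 1), repeat=3)
    )
```

Under these assumptions `area(HA_GATES)` and `area(FA_GATES)` evaluate to the figures 3 and 7 given in the text.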
The recursive binary addition formula for addition of A and B is presented as follows.
Let $S_i^j$ and $C_i^j$ be the sum and carry, respectively, for the ith bit at the jth recursion. The initial condition for the addition operation can now be defined as follows:
$$S_i^0 = a_i \oplus b_i$$
$$C_i^0 = a_i b_i \tag{3}$$
The jth iteration for the recursive addition can be found as follows:
$$S_i^j = S_i^{j-1} \oplus C_{i-1}^{j-1} \tag{4}$$
$$C_i^j = S_i^{j-1}\, C_{i-1}^{j-1} \tag{5}$$
The recursion is terminated at the kth iteration when the following condition is met.
$$C_n^k + C_{n-1}^k + \cdots + C_0^k = 0 \tag{6}$$
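Equations (3) to (6) can be followed step by step in a small software model. The sketch below is an idealized, synchronous rendering of the recursion in which every bit is updated at once in each pass; it is not a model of the self-timed circuit. The function name `recursive_add` and its return convention are hypothetical.

```python
def recursive_add(a, b, n):
    """Add two n-bit non-negative integers with recursion (3)-(6).

    Returns (result, iterations). The result includes the carry-out as
    bit n, produced by the one extra bit block placed above bit n-1.
    """
    width = n + 1
    a_bits = [(a >> i) & 1 for i in range(width)]
    b_bits = [(b >> i) & 1 for i in range(width)]

    # Equation (3): initial sum and carry of every bit, computed in parallel.
    S = [x ^ y for x, y in zip(a_bits, b_bits)]
    C = [x & y for x, y in zip(a_bits, b_bits)]

    iterations = 0
    while any(C):  # equation (6): stop once every carry is zero
        carry_in = [0] + C[:-1]  # C_{i-1} of the previous pass; 0 into bit 0
        new_S = [s ^ c for s, c in zip(S, carry_in)]  # equation (4)
        new_C = [s & c for s, c in zip(S, carry_in)]  # equation (5)
        S, C = new_S, new_C
        iterations += 1

    result = sum(bit << i for i, bit in enumerate(S))
    return result, iterations
```

For example, `recursive_add(11, 6, 4)` returns `(17, 3)`: the carry generated at bit 1 moves up one position per pass and is absorbed by the carry-out block on the third pass. Operands with no overlapping 1 bits, such as 5 and 2, finish with zero passes.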
Using the formulae presented in equations (3)-(6), a fast adder will now be designed. At first the correctness of the recursive formulation will be proved inductively by the following observation and subsequent theorem.
Observation 1: In a single bit adder with no carry in, the maximum obtainable result is 2.
Explanation. Each single-bit operand is at most 1, so the sum of two operands cannot exceed 1+1 and hence is less than or equal to 2.
The significance of this observation is that, for the individual ith bit adder, the case $S_i = 1$ and $C_i = 1$ (decimal value 3) is impossible, as it would exceed the maximum sum of two inputs, which is 2 (binary 10). Thus the only valid $(S_i, C_i)$ forms for the ith bit adder are (0, 0), (0, 1) and (1, 0).
Theorem 1: The recursive formulation of (3), (4), (5) and (6) produces the correct sum for any number of bits and terminates in finite time.
Proof. We prove the correctness of the algorithm by induction on terminating condition.
Basis: For operands A, B such that $C_i^0 = 0$ for all $i \in [0, n]$, the proposed recursive formulation produces the correct result in parallel within a single-bit computation time and terminates immediately, as condition (6) is met.
Induction: Assume $C_i^k \neq 0$ for some i. Let j be such a bit, for which $C_j^k = 1$. First we show that this carry is killed at the jth bit in the (k+1)th iteration, and next we show that it is successfully transmitted to the next higher bit in the (k+1)th iteration.
According to Observation 1, $(S_j^k, C_j^k)$ and $(S_{j+1}^k, C_{j+1}^k)$ can each take any of the forms (0, 0), (0, 1) or (1, 0). As $C_j^k = 1$, it follows that $S_j^k = 0$. Hence, from equation (5), $C_j^{k+1} = 0$ regardless of the input conditions in bits 0 to j−1.
We now consider the next higher bit $(S_{j+1}^k, C_{j+1}^k)$ at the kth iteration. By Observation 1, it can take any of the forms (0, 0), (0, 1) or (1, 0). In the (k+1)th iteration, the (0, 0) and (0, 1) forms from the kth iteration correctly produce the output (1, 0) by equations (4) and (5), and hence the carry is absorbed in the (j+1)th bit. For the (1, 0) form, the carry must propagate through this bit level, as the sum value is 1. By applying (4) and (5), we find $C_{j+1}^{k+1} = 1$. Thus carry propagation and killing are performed correctly for the carry from the jth bit adder.
Finally, there is one extra bit adder block for the carry out of the n-bit adder. It has the initial output $(S_n^0, C_n^0) = (0, 0)$. Any carry chain is hence bound to end at this bit and produce the output (1, 0), if it has not already been killed at a lower bit level during an earlier iteration. Thus all the single-bit adders successfully kill their carries or transfer them to the next level, until a carry is killed at the nth bit carry-out block. This ensures that the terminating condition is always reached by the recursive formulation. QED.
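The argument above can also be checked by brute force for small operand widths. The sketch below is not a substitute for the proof; it enumerates every operand pair, asserts the invariant of Observation 1 at each pass, and confirms that the recursion stops with the correct sum. Whole words are packed into Python integers for brevity, and the function name `check_recursion` is hypothetical.

```python
def check_recursion(n):
    """Exhaustively test recursion (3)-(6) on all pairs of n-bit operands.

    Raises AssertionError on any violation; otherwise returns the largest
    number of passes that any operand pair needed.
    """
    mask = (1 << (n + 1)) - 1  # n data bits plus the carry-out block
    worst = 0
    for a in range(1 << n):
        for b in range(1 << n):
            S, C = a ^ b, a & b  # equation (3), all bits at once
            passes = 0
            while C:  # equation (6)
                # Observation 1: no bit has sum and carry set together.
                assert S & C == 0
                shifted = (C << 1) & mask  # carry of bit i-1 arrives at bit i
                S, C = S ^ shifted, S & shifted  # equations (4) and (5)
                passes += 1
            assert S == a + b  # correct sum, carry-out included as bit n
            worst = max(worst, passes)
    return worst
```

In this idealized model a carry moves up by one bit per pass, so the returned worst case is n passes, reached for instance by adding 1 to the all-ones operand.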
The mathematical form presented above is valid under the condition that the iterations take place simultaneously and that the signals from the previous level are available synchronously. This implies a clocked design. However, a clocked circuit is expected to increase the complexity. In the next section we present a pseudo-sequential feedback circuit for the implementation of the proposed recursive formulation.
The general architecture of the proposed recursive adder is presented in
The selection input of the 2-input multiplexers is a single 0-to-1 pulse (denoted SEL in the CMOS implementation diagram). The multiplexers initially select the actual inputs while SEL=0 and switch to the feedback/carry paths for subsequent iterations (SEL=1). The feedback path from the HAs enables the recursion to continue until the terminating condition is met.
The CMOS implementation of the proposed embodiment is shown in
One particular practical issue is the synchronization of the carry transitions during the recursion. The recursion is implemented by a single pulse of the carry signal. This implies, however, that there will be switching transients on the rising and falling edges of the carry signals passed from one block to the next. To avoid the pitfall of switching twice (or multiple times) for the same signal, the sum and carry outputs are separated by an extra multiplexer, which tunes the delays in the feedback path and helps avoid the switching transients caused by the feedback path of the Sum bits.
The termination signal following equation (6) can be generated by the CMOS implementation as shown in
Although the TERM block is the only one that needs as many as n+2 interconnections, it does not create any fan-in problem, as all the connections are in parallel except for a single pull-up transistor.
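The roles of SEL, the multiplexers, the half adders and TERM can be illustrated with a behavioural sketch. It ignores all timing, transients and the extra delay-tuning multiplexer, and simply re-evaluates every bit block once per loop pass. Gating TERM with SEL is an assumption made here so that completion is not signalled before the operands have been loaded; the specification text only states that TERM follows equation (6). All function names are hypothetical.

```python
def mux2(sel, in0, in1):
    """2-input multiplexer: in0 when sel is 0, in1 when sel is 1."""
    return in1 if sel else in0


def half_adder(x, y):
    """Returns (sum, carry)."""
    return x ^ y, x & y


def evaluate_blocks(sel, a_bits, b_bits, S, C):
    """Evaluate every single-bit block once.

    With sel = 0 each half adder sees the operand bits; with sel = 1 it
    sees its own fed-back sum and the carry of the next lower block.
    """
    new_S, new_C = [], []
    for i in range(len(a_bits)):
        carry_in = C[i - 1] if i > 0 else 0
        x = mux2(sel, a_bits[i], S[i])
        y = mux2(sel, b_bits[i], carry_in)
        s, c = half_adder(x, y)
        new_S.append(s)
        new_C.append(c)
    return new_S, new_C


def term(sel, C):
    """TERM per equation (6); the SEL gating is an assumption of this sketch."""
    return int(sel == 1 and not any(C))


def self_timed_add(a, b, n):
    """Behavioural model of an n-bit adder with one extra carry-out block."""
    width = n + 1
    a_bits = [(a >> i) & 1 for i in range(width)]
    b_bits = [(b >> i) & 1 for i in range(width)]
    S, C = [0] * width, [0] * width
    S, C = evaluate_blocks(0, a_bits, b_bits, S, C)  # SEL = 0: load operands
    sel = 1                                          # SEL rises once
    while not term(sel, C):
        S, C = evaluate_blocks(sel, a_bits, b_bits, S, C)
    return sum(bit << i for i, bit in enumerate(S))
```

In this model each block touches only its own feedback and one neighbouring carry, while `term` is the only function that reads every carry.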
It is evident from the architecture and implementation of
The logic and time complexity of available adders along with PASTA are shown in Table 1. Although the theoretical limit of PASTA is similar to that of existing logarithmic algorithms, it achieves the same performance with a regular structure and constant fan-in and fan-out. It is thus better suited to VLSI implementation than the prefix algorithms. Moreover, SPICE simulations in the next section show that it is possible to achieve constant-time carry propagation with the PASTA implementation. This phenomenon can be utilized for a future constant-time parallel adder.
The specified CMOS circuit is simulated using Linear Technology SPICE version 4.04i. The 50 nm fourth-generation Berkeley Short-channel IGFET Model (BSIM4) is used. Initially, the results for 8-bit adders are presented to show the practical realization of the prototype implementation. In practice, the situation is complicated by the duration for which the carry signal remains in effect at the next block. If this duration is not properly tuned, or is too long, the carry could be fed to the next block for multiple transitions before eventually settling to zero. This is similar to a race condition, as the final outcome is not predictable. Consequently, it is important to tune the MOS dimensions for proper synchronization.
In
The worst-case, best-case and average case for maximum, minimum and average length carry propagation is highlighted in the timing diagrams of
As the proposed approach is a very basic one, without any lookahead scheme or further optimization, we compare its performance with that of the similar chained schemes of RCA and Delay Insensitive RCA (DIRCA). The results are displayed in Table 2. For the average case we have used the expected carry length for n-bit binary numbers as found by G. W. Reitwiesner, The Determination of Carry Propagation Length for Binary Addition, IRE Trans. on Electronic Computers, vol. EC-9, pp. 35-38, March 1960. The delay in the case of PASTA is measured from the switching time of the multiplexers (when SEL changes from 0 to 1).
It should be noted that the DIRCA architecture does not perform better in the best and average cases. This is because the dual-rail signals are reset at the beginning of the computation and require propagation from previously completed stages to produce a successful completion signal. We have used the completion signal described by Fu-C. Cheng, Practical Design and Performance Evaluation of Completion Detection Circuit, in Proceedings of the Intl. Conf. on Comp. Design (ICCD), pp. 354-359, October 1998.
The results clearly indicate the potential of the new PASTA, as it performs best among the cascaded logic designs. This is due to the truly parallel theoretical basis of the design for independent carry chains.
However, the biggest advantage to be gained from the proposed design may be a truly constant-time parallel adder. It is found that the cascading delay of successive carry propagation can be avoided entirely by tuning the MOS dimensions. The timing diagram of a single carry propagation in a 32-bit adder circuit for operands $A = (\mathrm{FFFF\ FFFF})_{16}$ and $B = (1)_{16}$ is shown in
For clarity, only a few carry signals are displayed (C0, C1, C15 and C31). All carry signals from C2 to C31 follow nearly the same timing. The addition thus takes merely 1.29 ns to complete under the worst-case propagation condition in a 32-bit adder.
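Why this operand pair is the worst-case propagation condition can be seen by tracing the carry word of recursion (3) to (6) in software. The sketch below is an idealized logical model that counts passes only; it says nothing about circuit timing, which is what the SPICE result above measures. The function name `trace_carries` is hypothetical.

```python
def trace_carries(a, b, n):
    """Return (sum, history), where history lists the carry word after the
    initial step (3) and after every later pass of (4)-(5)."""
    mask = (1 << (n + 1)) - 1  # n data bits plus the carry-out block
    S, C = a ^ b, a & b
    history = [C]
    while C:
        shifted = (C << 1) & mask
        S, C = S ^ shifted, S & shifted
        history.append(C)
    return S, history


# Operands from the text: A = FFFF FFFF and B = 1 (hexadecimal), n = 32.
total, history = trace_carries(0xFFFFFFFF, 0x1, 32)
```

Here `total` is `0x100000000`, and the k-th entry of `history` has only bit k set for k from 0 to 31, followed by a final zero: a single carry is generated at bit 0 and has to pass through every bit position up to C31 before the carry-out block absorbs it.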
It has long been known in theory that the delay could be reduced to that of a single gate by tuning the MOS parameters for parallel connections. However, earlier adder designs, which involve complex circuits, could not achieve constant-time carry propagation in this way.
Number | Date | Country | Kind |
---|---|---|---|
PI2010001675 | Apr 2010 | MY | national |
Filing Document | Filing Date | Country | Kind | 371c Date |
---|---|---|---|---|
PCT/MY11/00032 | 4/13/2011 | WO | 00 | 10/12/2012 |