Field of the Invention
The invention relates to an area-efficient realization of a coefficient block [A], or architecture [A], using hardware sharing techniques and optimizations applied to this block. The optimization techniques and structure of the present invention have application in linear digital filters, typically finite impulse response (FIR) filters and infinite impulse response (IIR) filters, and in other filters and applications based on combinational logic consisting of delay elements (T), multipliers (M), adders (SA) and subtractors (SS).
Explanation of Elements and Symbols Used Herein
The basic component symbol used herein are shown in
Unit delay (T)
This element is a one bit delay element. It also performs the function of a multiplier by a factor of 2. For the serial input frame, 0101011 in binary or 43 in integer representation, the output of this block is 01010110 in binary or 86 in integer representation. This element is usually a Flip-flop (D Flip-flop, J-K Flip-flop, etc.).
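The behaviour of the unit delay can be checked with a minimal sketch. It assumes the frame is transmitted least significant bit first and that the flip-flop initially holds 0; the helper names `unit_delay` and `to_int` are illustrative and do not come from the specification.

```python
def unit_delay(bits_lsb_first):
    """Model one flip-flop [T]: every bit leaves one clock later.

    Assumption: the flip-flop initially holds 0, so a 0 is emitted first
    and the frame grows by one bit position.
    """
    return [0] + list(bits_lsb_first)


def to_int(bits_lsb_first):
    """Interpret an LSB-first bit list as an unsigned integer."""
    return sum(bit << i for i, bit in enumerate(bits_lsb_first))


# 0101011 (43) written LSB first
frame = [1, 1, 0, 1, 0, 1, 0]
delayed = unit_delay(frame)
print(to_int(frame), to_int(delayed))  # 43 86
```

Delaying an LSB-first stream by one clock moves every bit to the next higher weight, which is why the same element acts as a multiplier by 2.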
Full Adder (FA)
The full adder performs binary addition. The inputs to this element are A, B, and Cin (Carryin), while the outputs are Z and Cout (Carryout). The truth table for full adder functionality is shown in
Full Subtractor (FS)
This element performs binary subtraction. The inputs to this element are A, B, and Cin (Carryin) while the outputs are Z and Cout (Carryout). The truth table for full subtractor functionality is shown in
Serial Adder (SA) and Serial Subtractor (SS)
These elements perform addition and subtraction of two serial frames, x1(nT) and x2(nT), to generate an output y(nT) represented as x1(nT)+x2(nT) or x1(nT)−x2(nT). The serial adder (or subtractor) is implemented using a full adder (or subtractor) with a Flip-Flop as shown in
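A serial adder and serial subtractor of this kind can be sketched as follows. This is a behavioural model under stated assumptions, not the circuit of the figure: both frames have equal length, bits arrive LSB first, the carry (or borrow) flip-flop starts at 0, and a final carry out of the frame is dropped. All function names are illustrative.

```python
def full_adder(a, b, cin):
    """One-bit full adder: returns (Z, Cout)."""
    z = a ^ b ^ cin
    cout = (a & b) | (cin & (a ^ b))
    return z, cout


def full_subtractor(a, b, bin_):
    """One-bit full subtractor computing a - b - bin_: returns (Z, borrow out)."""
    z = a ^ b ^ bin_
    bout = ((1 - a) & b) | ((1 - (a ^ b)) & bin_)
    return z, bout


def serial_op(x1, x2, cell):
    """Feed two LSB-first frames through one cell plus one carry flip-flop."""
    carry = 0  # state of the flip-flop, assumed cleared at frame start
    out = []
    for a, b in zip(x1, x2):
        z, carry = cell(a, b, carry)
        out.append(z)
    return out  # a carry/borrow left at the end of the frame is dropped


def serial_add(x1, x2):
    return serial_op(x1, x2, full_adder)


def serial_sub(x1, x2):
    return serial_op(x1, x2, full_subtractor)


five = [1, 0, 1, 0]   # 5, LSB first
three = [1, 1, 0, 0]  # 3, LSB first
print(serial_add(five, three))  # [0, 0, 0, 1] -> 8
print(serial_sub(five, three))  # [0, 1, 0, 0] -> 2
```

Only one full adder (or subtractor) and one flip-flop are needed regardless of the frame length, which is the source of the area economy of bit-serial arithmetic.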
Serial Multiplier (M)
The multiplier multiplies two serial input frames X(nT) and m. The output is a function represented as Y(nT)=X(nT)*m. A serial coefficient multiplier (M) can be implemented by a shift register using [T] elements and an adder element [SA] (One shift means multiply by factor of 2). As shown in
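The shift-and-add idea behind the serial coefficient multiplier can be illustrated at the integer level. The sketch below assumes a non-negative integer coefficient m; each left shift stands for one [T] element and each accumulation for one [SA] element. The names `shift_add_multiply` and `adders_needed` are hypothetical.

```python
def shift_add_multiply(x, m):
    """Multiply x by a non-negative integer m using only shifts and adds."""
    acc = 0
    shift = 0
    while m:
        if m & 1:
            acc += x << shift  # one shifted copy of x per set bit of m
        m >>= 1
        shift += 1
    return acc


def adders_needed(m):
    """Adders used to sum the shifted copies: one fewer than the set bits."""
    return max(bin(m).count("1") - 1, 0)


print(shift_add_multiply(43, 11))  # 473, since 11 = 8 + 2 + 1
print(adders_needed(11))           # 2
```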
Delay (Z−1)
Delay by one frame of data is done by a shift register (series of Flip-flops (T) connected to store and shift the input frame). The number of Unit delay (T) in one delay element is equal to the frame size of the input.
Existing Implementation of Filter
The following description discusses the elements used for implementation of the design and the existing implementations for digital filters. The proposed minimization is extendable to other applications such as Digital Signal Processing field and Digital designs.
The following description of the preferred embodiment, namely a FIR filter, is extendable to other filters as described earlier.
Stating the FIR filter equation in time and frequency domain
where X, Y are the input and output respectively, c(0), c(1), . . . c(n) represent the coefficient values that define the characteristics of the filter, and each delay [Z−1] block represents one sample delay. The filter equation can be implemented in two ways as shown in
In implementation 1, coefficient lines CLin_0, CLin_1, . . . CLin_n are common and connected to input X[n]. The output lines CLout_0, CLout_1, . . . CLout_n are connected to block [E], consisting of delay elements [Z−1] and serial adder [SA] elements. The structure enables easy realization of a shareable multiplier in the coefficient block [A]. An example of a shareable multiplier with coefficient values 3,11 is illustrated in
The embodiments of the invention reduce the hardware of the coefficient block [A] by having shareable elements in coefficients, even if the coefficient lines CLin_0, CLin_1, . . . are not commonly connected. For the existing configuration shown in
Also, as described in the previous section, implementation 2 is area efficient with respect to implementation 1 due to reduced sizes of the delay elements. Over and above this by having a shareable multiplier or reduced coefficient block [A], implementation 2 becomes still more area efficient. This reduction is extendable to other filters based on the coefficient block [A] as stated in the first section. The present invention operates on integer-valued coefficients.
Further, to quote Norsworthy and Crochiere (Delta-Sigma Data Converters IEEE press pp-435, Copyright 1997):
“Bit-serial architecture reduce the interprocessor communication down to 1 bit. Generally the number of processors is very large, but because each processor is so small, the overall economy is very high. Bit serial architectures are usually most effective for filters having a few state variables, such as IIR filters and the wave-digital filters. For this reason, bit-serial techniques are less frequently applied to FIR structures, especially when the filter length is relatively long . . . ”
However, the present invention applies optimization techniques for reducing the area in large sized coefficients by applying a number of optimizations in FIR/IIR filter structures.
To elaborate the optimization techniques of the present invention, consider an FIR filter with the symmetrical coefficients 5, 14, 25, 30, 25, 14, and 5. Though the coefficients in this example are small, they are enough to illustrate the minimization proposals. Stating the FIR filter equation in time and frequency domain
where X, Y are the input and output respectively and c(0), c(1), . . . c(n) represent the coefficient values.
Using the coefficient values in the above equation
Y(n)=5X(n)+14X(n−1)+25X(n−2)+30X(n−3)+25X(n−4)+14X(n−5)+5X(n−6)
Y(z)=X(z)[5+14Z−1+25Z−2+30Z−3+25Z−4+14Z−5+5Z−6] (EQ 1)
The Existing Method and Minimization
In one known optimization technique, advantage is taken of the symmetry in the coefficients. The streams which have to be multiplied with the same coefficients can be added first and then multiplied. For a large filter structure, this leads to a reduction by 45% in the coefficient block. (See
This is done by restructuring the equation as follows:
Y(z)=X(z)[5*(1+Z−6)+14*(Z−1+Z−5)+25*(Z−2+Z−4)+30*Z−3] (EQ 2)
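The equivalence of the direct form (EQ 1) and the folded form (EQ 2) can be checked numerically. The sketch below assumes that the lines S1 to S4 used later in the text are the pre-added sample pairs implied by EQ 2; the sample values are arbitrary test data and the function names are illustrative.

```python
COEFFS = [5, 14, 25, 30, 25, 14, 5]


def fir_direct(x, n):
    """EQ 1: one multiplication per tap (valid for n >= 6)."""
    return sum(c * x[n - k] for k, c in enumerate(COEFFS))


def fir_folded(x, n):
    """EQ 2: samples sharing a coefficient are added before multiplying."""
    s1 = x[n] + x[n - 6]
    s2 = x[n - 1] + x[n - 5]
    s3 = x[n - 2] + x[n - 4]
    s4 = x[n - 3]
    return 5 * s1 + 14 * s2 + 25 * s3 + 30 * s4


samples = [3, -1, 4, 1, -5, 9, 2, 6, -5, 3]  # arbitrary test input
results = [(fir_direct(samples, n), fir_folded(samples, n))
           for n in range(6, len(samples))]
print(all(a == b for a, b in results))  # True
```

Folding reduces seven coefficient multiplications to four at the cost of three pre-adders.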
For the rest of the optimization proposals, the description will focus on the multiplier adder series, which is shown in the dotted box referred to as coefficient block [A].
Mathematically, the restructured equation according to the structure is stated as:
Y(nT)=(4+1)S1+(8+4+2)S2+(16+8+1)S3+(16+8+4+2)S4 (EQ 3)
In this implementation, the lines S1, S2, S3, S4 are not commonly connected. This prevents hardware sharing in the coefficient block [A], so every function/operation of this block requires its own hardware. The elements required by the terms are listed as:
The generalized structure of The Existing Method and Minimization is depicted in
The presence of each of the elements in columns 1 to n (i.e., T1_1 to T1_m, T2_1 to T2_m . . . Tn_1 to Tn_m) is determined by the coefficient value. Thus the coefficient value on each of the lines S1 to Sn determines the number of [T] elements in the corresponding column. The serial adders/subtractors [SA/SS] in the columns are represented as (SA1_1 to SA1_m, SA2_1 to SA2_m . . . SAn_1 to SAn_m). The presence of each of these elements is again defined by the coefficient value.
In the structure, the [T] elements are arranged in shift register form. The input to the first [T] element is connected to one of the S lines, while the inputs to [SA/SS] are connected from input S* and/or one of the outputs of the [T] elements of the shift register, depending on the coefficient value. Finally, using the SAe_1 to SAe_n−1 elements, the addition/subtraction of the [SA/SS] outputs of all the coefficient terms depicted in the columns is done. The final output is the output of the last addition/subtraction [SA/SS].
Across the lines S1 to Sn, neither the [T] elements nor the [SA] elements in each column are shareable. Thus only limited minimization is possible in this structure.
The disclosed embodiments of the invention are directed to a co-efficient processing circuit for processing combinational logic coefficients, such as those used in digital filters, in a bit serial format having n bit positions. In one embodiment the circuit includes a plurality of serial input bit lines carrying coefficient signals; a combinational-sequential circuit, ideally made up of a plurality of serial adder blocks corresponding to the number of bit positions, each serial adder block comprising at least one serial adder and coupled to at least one of the plurality of serial input bit lines and including an output; and an output block coupled to the combinational-sequential circuit and configured to receive the output of each serial adder block, the output block comprising n−1 multiplier circuits, preferably multiplier-by-two circuits composed of flip-flops, for bit positions 0 through n−1, each multiplier circuit having an input coupled to an output of each respective serial-adder block in the corresponding bit position, the multiplier circuit in each bit position 0 to n−1 having an output coupled to an input of the serial adder block occupying the next higher bit position, and the output of the serial adder block in the n bit position comprising an output of the circuit.
In accordance with another aspect of the foregoing embodiment, the serial adder blocks comprise at least one serial adder, and the at least one serial adder block includes at least one serial subtractor.
In accordance with another embodiment of the invention, a coefficient processing circuit for processing combinational logic coefficients in a bit serial format having five bit positions is provided. The circuit includes first, second, third, and fourth serial input lines carrying respective coefficient signals; a combinational-sequential circuit comprising first, second, third, fourth, and fifth serial adder blocks corresponding to bit positions 0 through 4, respectively, each of the serial adder blocks coupled to at least two of the serial input lines; and an output circuit coupled to the combinational-sequential circuit and comprising first, second, third, and fourth multiplier circuits, each multiplier circuit having an input coupled to an output of the respective serial adder block and corresponding to bit positions 0 through 3, and each multiplier circuit having an output coupled to the serial adder block occupying the next higher bit position, the output of the fifth serial adder block comprising an output of the circuit.
In the accompanying drawings:
The device in accordance with one embodiment of the invention reduces the hardware of the coefficient block [A] by having shareable elements in the coefficient block [A], even in the implementation where the coefficient lines CLin_0, CLin_1, . . . are not commonly connected (shown as architecture [A]). This reduced hardware in the coefficient block when applied to implementation 2 (
The device of the present invention reduces the area by approximately 30-50% of that shown in
Accordingly, an embodiment of the present invention illustrated in
[(a0*S1+b0*S2+ . . . +k0*Sn),
(a1*S1+b1*S2+ . . . +k1*Sn) . . . (am*S1+bm*S2+ . . . +km*Sn)]
are represented as blocks [B], and the values of a0, b0, . . . etc. are [(+/−)1 or 0]. Block [B] is a combinational-sequential block consisting of serial adder (SA) and serial subtractor (SS) elements. The connection of the elements (SA/SS) to the S1, S2, . . . Sn lines and the interconnection of the elements (SA, SS) depend on the values of the coefficients. The SA/SS elements are arranged in matrix form, SA0_0 to SA0_n in bit position 0, SA1_1 to SA1_n in bit position 1, and similarly SAm_1 to SAm_n in bit position m; the presence of each of these elements is defined by the coefficient value. The output of each block [B] is connected to the [T] elements through lines b_1, b_2, . . . b_m. The number of [T] elements depends on the size of the maximum coefficient, and the [T] elements are shareable among all the coefficients in the coefficient architecture [A]. The output of each element [T] is connected to one of the inputs of the combinational logic of block [B] of the next bit position (i.e., to the input of an element (SA or SS) of block [B], depending upon the sign value +/−). Lines t_1, t_2, . . . t_m mark the interconnections from cluster [C] to [B]. In structure [A], all the elements in the blocks [B] are clustered together as block [D] and all the unit delay elements {T[1], T[2] . . . T[m]} are clustered together in block [C], thereby separating the combinational-sequential and sequential logic. The sequential elements [T] of block [C] are common to all the coefficients, are shareable, and are positioned at the end of each block [B]. Block [D] contains the combinational-sequential blocks [B], which are essentially SA and SS elements. In the structure, the hardware within a block [B] is shareable across the various [B] blocks and also within block [D]. The final output is taken from the output of the elements of the last bit position.
An Embodiment of the Invention or Optimization (a)
Continuing the same example of FIR filter and using Equation 3 of the previous section:
y(nT)=5*S1+14*S2+25*S3+30*S4
Y(nT)=(4+1)S1+(8+4+2)S2+(16+8+1)S3+(16+8+4+2)S4
The implementation flow for this equation is presented below this text paragraph and the hardware implementation is shown in
As shown in the above implementation flowchart, the equation defines the bit positions BIT0 to BIT4, each of which is a position of “multiplication by a power of two” (e.g., BIT0 represents multiplication by 2^0). At the BIT0 position, the addition S3+S4 is performed and the output is terminated at T(1). The output of T(1) defines the next bit position, BIT1, which adds S2+S3+S4 and the output of T(1) using the [SA] elements. The output of this addition is again terminated at T(2). The structure is repeated in the following bit positions. The final addition at bit position BIT4 gives the output of the coefficient block [A].
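The flow described above can be sketched at the integer level, with each multiplication by 2 standing for one shared [T] element. The BIT0 and BIT1 sums are taken from the text; the sums for BIT2 to BIT4 are inferred here from the power-of-two decomposition in EQ 3 and should be read as an assumption, as should the function name.

```python
def coefficient_block_a(s1, s2, s3, s4):
    """Evaluate 5*S1 + 14*S2 + 25*S3 + 30*S4 with four shared doublings."""
    acc = s3 + s4                     # BIT0: lines whose coefficient contains 16
    acc = 2 * acc + (s2 + s3 + s4)    # T(1), then BIT1: lines containing 8
    acc = 2 * acc + (s1 + s2 + s4)    # T(2), then BIT2: lines containing 4 (inferred)
    acc = 2 * acc + (s2 + s4)         # T(3), then BIT3: lines containing 2 (inferred)
    acc = 2 * acc + (s1 + s3)         # T(4), then BIT4: lines containing 1 (inferred)
    return acc


inputs = (1, 2, 3, 4)
expected = 5 * 1 + 14 * 2 + 25 * 3 + 30 * 4
print(coefficient_block_a(*inputs) == expected)  # True
```

Because the doublings are applied to the running sum rather than to each line separately, four [T] elements serve all four coefficients.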
Implementation of hardware is shown in
In the structure, all [T] elements are represented as block[C] wherein the flip-flop [T] representing multiplication by a factor of 2, is share-able among various coefficient values and their number is determined by maximum coefficient value. This is in contrast to
The number of flip-flops (T) in
For filters having large coefficients, this leads to a substantial reduction in area (30-50% of the coefficient block).
Another Embodiment of the Invention or Optimization (b)
This optimization reduces the hardware of block [D] which essentially consists of (SA) and (SS) elements. Beginning with Equation 4 and finding out the common additive factors:
A1=S2+S4
A2=S3+S4
Equation 4 can then be further reduced as:
y(nT)=(S1+S3)+2*(A1+2*(S1+A1+2*(S2+A2+2*A2))) (EQ 5)
The flow of implementation of the equation is illustrated below and is self-explanatory. Here S1, S2, S3, S4 represent the four inputs. The primary addition is done using serial adders SA(1), SA(3), SA(9), representing addition of the terms S1+S3, S2+S4, S3+S4, while the secondary and tertiary addition is done using the adders SA(5), SA(7), SA(3), SA(6), SA(4), SA(2). The multiplication by a factor of two is done using the elements T(1), T(2), T(3), T(4).
The hardware of block [D] is reduced as follows. The output b_1, representing the bit position BIT0 and the addition term A2, is connected to T[1] and also fed to the next block [B], hence reducing the adder count by 1. Also, the output of adder SA(3) of block [B] in bit position BIT3 is fed to two points: the input of adder SA(4), which eventually terminates at the [T4] element, and the input of adder SA(5), hence reducing the adder count further by 1.
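The sharing of A1 and A2 can be verified with a small sketch that evaluates EQ 5 and counts the two-input additions it uses. The counting helper `sa` and the function name `eq5` are illustrative; the adder numbering SA(1) to SA(9) of the figure is not reproduced here.

```python
def eq5(s1, s2, s3, s4):
    """Evaluate EQ 5 and return (value, number of two-input adders used)."""
    count = 0

    def sa(a, b):
        nonlocal count
        count += 1  # each call stands for one serial adder [SA]
        return a + b

    a1 = sa(s2, s4)                  # shared term A1
    a2 = sa(s3, s4)                  # shared term A2
    acc = a2                         # innermost term 2*A2 needs no extra adder
    acc = sa(sa(s2, a2), 2 * acc)    # S2 + A2 + 2*(...), doubling via T(1)
    acc = sa(sa(s1, a1), 2 * acc)    # S1 + A1 + 2*(...), doubling via T(2)
    acc = sa(a1, 2 * acc)            # A1 + 2*(...), doubling via T(3)
    acc = sa(sa(s1, s3), 2 * acc)    # (S1 + S3) + 2*(...), doubling via T(4)
    return acc, count


value, adders = eq5(1, 2, 3, 4)
print(value, adders)  # 228 9
```

With 9 adders at 2 Units each and 4 [T] elements at 1 Unit each, the count agrees with the 22 Units quoted in the text.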
Note how A1 and A2 are shared in the Structure. Comparing the hardware implementation of
In the present minimization, the approximate area is 9 serial adders (SA)+4 (T)=22 Units, whereas the area of the existing minimization of
Yet another Embodiment of the Invention or Optimization (c)
In the realization of block [D], further optimization is done by realizing the coefficient value using subtraction instead of addition. This is beneficial for coefficients whose values are close to a power of 2 (e.g., for coefficient value 63, the realization (63=64−1) is better than (63=32+16+8+4+2+1)). In the first case the number of subtractors is 1, while in the second case the number of adders is 5.
To illustrate this by an example, consider the coefficient values 5, 25, −48, −63. Writing the FIR equation using these coefficient values.
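The trade-off can be estimated with a short sketch. It assumes a positive integer coefficient and compares only two realizations: the plain binary sum, and the next higher power of two minus the remainder, with one subtractor per set bit of the remainder. Both function names are hypothetical, and the model ignores any sharing between coefficients.

```python
def adders_for_binary_sum(c):
    """Adders needed to sum the power-of-two terms of c (c > 0)."""
    return bin(c).count("1") - 1


def subtractors_from_next_power(c):
    """Subtractors needed when c is written as 2^k minus a remainder (c > 0)."""
    next_power = 1 << c.bit_length()  # smallest power of two above c
    remainder = next_power - c
    return bin(remainder).count("1")


for c in (63, 48, 5):
    print(c, adders_for_binary_sum(c), subtractors_from_next_power(c))
# 63 5 1
# 48 1 1
# 5 1 2
```

The subtractive form wins clearly for 63, ties for 48, and loses for 5, which matches the statement that the technique suits values close to a power of 2.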
Alternately arranging the terms with 63 as (64-1), the first equation reduces
The realization of Equation 6 and Equation 7 is shown in
In these realizations, the number of [T] elements is one more for Equation 7 due to the presence of the term 64. However, the structure represented by Equation 7 needs fewer adder/subtractor elements than that of Equation 6 (6 SA and 2 SS instead of 6 SA and 6 SS). Comparing the areas of the two realizations, Equation 6 gives 5T+6SA+6SS=29 Units, while Equation 7, representing optimization (c), gives 6T+6SA+2SS=22 Units, assuming 1 Unit=1 FA=2HA=1T and SA=SS=2 Units. This is an improvement of 24% in the area of the coefficient block for the current example.
Thus, the optimization reduces the area needed to realize negative coefficients. It also gives an efficient realization of coefficients having values close to a power of two. Further minimization is possible by taking subtraction as a common factor and using addition instead of subtraction wherever possible in the realization. This results in an improvement in area, due to the fact that the area of a subtractor is more than the area of an adder.
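The area figures quoted above follow directly from the stated unit assumption (1 T = 1 Unit, SA = SS = 2 Units). A minimal sketch of the arithmetic, with an illustrative helper name:

```python
def area_units(t, sa, ss):
    """Approximate area: 1 Unit per [T], 2 Units per SA or SS."""
    return 1 * t + 2 * sa + 2 * ss


area_eq6 = area_units(t=5, sa=6, ss=6)
area_eq7 = area_units(t=6, sa=6, ss=2)
improvement = 100 * (area_eq6 - area_eq7) / area_eq6
print(area_eq6, area_eq7, round(improvement))  # 29 22 24
```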
Generalized Structure of the Invention
The invention provides an area-efficient realization of a filter coefficient block [A] applicable to filter devices such as FIR, IIR and other filter structures. This architecture is also applicable to combinational and sequential logic consisting of adders, subtractors, multipliers and flip-flops [T]. This architecture is realized using serial adder (SA), serial subtractor (SS) and flip-flop [T] elements.
A generalized structure of the present invention is depicted in
Beginning with the generalized equation of FIR filter coefficient block(A):
y(nT)=a*S1+b*S2+c*S3+ . . . k*Sn (1)
Now, representing each coefficient as an addition of terms arranged in power of two and applying it to the equation:
y(nT)=(2^m*am+ . . . +2^1*a1+2^0*a0)*S1+(2^m*bm+ . . . +2^1*b1+2^0*b0)*S2+(2^m*cm+ . . . +2^1*c1+2^0*c0)*S3+ . . . +(2^m*km+ . . . +2^1*k1+2^0*k0)*Sn
Further, taking “2” as a common factor, the generalized equation for architecture is:
In
[(a0*S1+b0*S2+ . . . +k0*Sn), (a1*S1+b1*S2+ . . . +k1*Sn), . . . (am*S1+bm*S2+ . . . +km*Sn)]
are represented as blocks [B]. Block [B] is a combinational-sequential block consisting of serial adder (SA) and serial subtractor (SS) elements. The values a0, b0 . . . etc. each take the value [(+/−)1 or 0]. The connection of the elements (SA/SS) to the S1, S2 . . . Sn lines and the interconnection of the elements (SA, SS) depend on the values of the coefficients, because the value of each coefficient determines the values of a0, a1 . . . etc. and hence defines the interconnections between them. The output of each block [B] is multiplied by two using the [T] elements T[1], T[2] . . . T[m]. The number of [T] elements depends on the size of the maximum coefficient, and the [T] elements are shareable among all the coefficients in the coefficient block [A]. Thus in the structure the final outputs of all the blocks [B] are terminated at unit delay elements [T] (connected through b_1, b_2 . . . b_m).
In the structure, all the blocks [B] are clustered together as [D] and all the unit delay elements {T[1], T[2] . . . T[m]} are clustered together in [C]. The sequential logic [C] and the combinational-sequential logic [D] are thus clearly separated in this architecture. The input of each unit delay element [T] is the final output of a block [B], and the output of element [T] is connected to one of the inputs of the combinational logic of block [B] of the next bit position (i.e., to the input of an element (SA or SS) of block [B], depending upon the sign value +/−). The interconnections from cluster [C] to [B] are represented as t_1, t_2 . . . t_m.
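The generalized structure can be sketched at the integer level as a nested evaluation in which each block [B] forms a signed sum of the S lines and each [T] doubles the running result. The sketch assumes sign-magnitude digits (each digit is +1, −1 or 0 with the sign of its coefficient) and evaluates from the highest weight downwards; the function names and this digit choice are assumptions, not taken from the specification.

```python
def digit_rows(coeffs):
    """rows[j][i] is the digit (+1, -1 or 0) of weight 2^j for line S(i+1)."""
    width = max(abs(c) for c in coeffs).bit_length()
    rows = []
    for j in range(width):
        rows.append([(1 if c >= 0 else -1) * ((abs(c) >> j) & 1) for c in coeffs])
    return rows


def coefficient_block(coeffs, s):
    """Return (output, number of shared [T] elements) for input lines s."""
    rows = digit_rows(coeffs)
    acc = 0
    for row in reversed(rows):  # highest weight first
        block_b = sum(d * x for d, x in zip(row, s))  # one block [B]
        acc = 2 * acc + block_b                       # doubling via a shared [T]
    return acc, len(rows) - 1


print(coefficient_block([5, 14, 25, 30], [1, 2, 3, 4]))    # (228, 4)
print(coefficient_block([5, 25, -48, -63], [1, 2, 3, 4]))  # (-341, 5)
```

The number of [T] elements depends only on the width of the largest coefficient, not on the number of coefficients, which is the sharing property described above.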
The bit positions of serial data frame are marked as BIT0, BIT1 . . . BITm.
In the generalized structure, the flip-flops [T] are shareable among all the coefficients, and the number of flip-flops [T] is determined by the coefficient that has the maximum value. Optimization can also be applied in block [D]. The gain in area when compared with the existing design is illustrated below.
Hardware Reduction in Block [C]
Before beginning to prove the statement, we proceed to formularize the calculation of the number of flip-flops (T) for structure of “The Existing Method & Minimization” in “
Similarly, the approximate formula for calculation of total adders (SA) in the coefficient block for “The Existing Method and Minimization” and “Detailed Description of the Invention” in
Now, as an example, the formulae from the previous two paragraphs are applied to a filter with 20 coefficients. Assume the maximum coefficient value is represented in 16 bits (e.g., the maximum coefficient value is +32767 or −32768 in 2's complement representation). The average size of a coefficient, as approximated by the formula, is 16/2=8 bits. In the existing method and minimization, the total number of flip-flops (T) required for implementation is 8*20=160. In contrast, the structure of the “Detailed Description of the Invention” would require only 16 flip-flops (the flip-flops are shareable among all the coefficients and their number is limited by the coefficient which has the maximum value). Assuming in a worst case that there is no optimization of adders, the number of adders in both cases is the same and equal to 8*20=160. (Refer to
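The counts in this example can be reproduced with a short sketch. It assumes the average coefficient size is half the maximum width, as stated above, and applies the unit assumption used elsewhere in the text (1 T = 1 Unit, SA = 2 Units) to estimate area; the function names are illustrative and the resulting area figures are a derived estimate rather than values quoted from the specification.

```python
def existing_counts(num_coeffs, max_bits):
    """(flip-flops, adders) when every coefficient has its own shift register."""
    avg_bits = max_bits // 2
    return avg_bits * num_coeffs, avg_bits * num_coeffs


def shared_counts(num_coeffs, max_bits):
    """(flip-flops, adders) when the [T] elements are shared; adders unoptimized."""
    avg_bits = max_bits // 2
    return max_bits, avg_bits * num_coeffs


def area(flip_flops, adders):
    return 1 * flip_flops + 2 * adders


old = existing_counts(20, 16)
new = shared_counts(20, 16)
saving = 100 * (area(*old) - area(*new)) / area(*old)
print(old, new, area(*old), area(*new), saving)
# (160, 160) (16, 160) 480 336 30.0
```

Under these assumptions the worst case, with no adder optimization at all, already sits at the lower end of the 30-50% range claimed in the text.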
Area calculation for the existing method and minimization as shown in
The preferred embodiment of the invention is also supported by a real example of a filter coefficient device. This is referred to as optimization(a) and shown in
Hardware Reduction in Block [D]
The optimizations in block [D] are referred to as optimization (b), optimization (c) and shown in
This is illustrated using the example of the previous section, with A1=S2+S4 and A2=S3+S4. The total area for this example after the optimization is 9SA+4T=22 Units*. This, when combined with optimization (a), results in an area gain of 37% (of the coefficient design block) for this example design as compared to the existing implementation (where the area was 35 Units).
Optimization (c), as described before, can be further applied to cluster [D]. That is, besides optimization (a) and optimization (b), the technique of realizing the coefficient value using subtraction (SS) instead of addition (SA) is used here. This saves substantial area when the coefficient value is close to a power of 2 (e.g., for coefficient value 63, the realization (63=64−1) is better than (63=32+16+8+4+2+1); in the first case the number of subtractors is 1, while in the second case the number of adders is 5). The two cases are illustrated in the previous section and are shown in
With all the optimization applied, the invention while in use results in an area improvement of 30-50% of the coefficient block design or combinational logic consisting of adders, subtractor, multiplier and unit delays [T].
*Note that the inputs to the adders in [B] are interchangeable, e.g., the inputs of adders SA(5), SA(6) could be interchanged. Also the signals t_1, t_2 etc. can be connected to any input of the adders of block [B] of the next bit position.
**For approximate area calculation the following assumption is made (1 Unit of Area=1 FA=2HA=1T & SA=SS=2 Units of Area).
The present invention is most economical in terms of area of coefficient block/architecture. In fact, the present invention provides an area improvement of 30-50% of the coefficient block design or combinational logic consisting of adders, subtractor, multiplier and unit delays.
From the foregoing it will be appreciated that, although specific embodiments of the invention have been described herein for purposes of illustration, various modifications may be made without deviating from the spirit and scope of the invention. Accordingly, the invention is not limited except as by the appended claims and the equivalents thereof.
Number | Date | Country | |
---|---|---|---|
Parent | 09807498 | Jun 2001 | US |
Child | 10968822 | Oct 2004 | US |