The present invention relates to a computation system for computing an inner product of an input signal with a plurality of coefficients, and to an analog-to-digital converter (ADC). The ADC may be employed as a component of the computation system. The ADC is based on the Residue Number System, which on its own, is capable of providing a highly efficient way of implementing high resolution high speed analog to digital conversion. The computation system for computing the inner product is based on the Residue Number System and Distributed Arithmetic technique and works especially well with the ADC.
Many applications require the digitization of an analog signal, followed by digital signal processing, often involving the computation of an inner product of a vector representing the digitized signal with another vector.
The Flash ADC is the most common solid-state circuit based high speed ADC in use today. In the Flash ADC, multiple parallel comparators, equal in number to the quantization levels to be resolved, are used to convert the analog input signal to the corresponding digital output (comprising a plurality of input signal entries). A Flash ADC in the form of a parallel converter of n-bit resolution provides a dynamic range of 2^n and has 2^n−1 quantization levels (a quantization level is also known as the least significant bit, LSB) and hence requires a total of 2^n−1 parallel comparators. For instance, an 8-bit parallel type Flash ADC will need 2^8−1=255 parallel comparators. Since the number of parallel comparators needed increases exponentially with the resolution, managing the skew times between the parallel paths used by the parallel comparators in higher resolution high speed Flash ADCs becomes a complicated issue. Furthermore, the overall power dissipation and chip area required also increase tremendously with the number of parallel paths in a Flash ADC. These factors impose a practical limit to the resolution that can be achieved in these types of high speed Flash ADCs.
A high-speed alternative to the Flash ADC is the Folding ADC. Operation of a Folding ADC is similar to a two-step ADC. In particular, both the Folding ADC and the two-step ADC comprise two parts: a coarse quantizer to output the MSBs (most significant bits), and a fine quantizer to digitize the residual signal (i.e. signal remaining after removing the MSBs) and output the LSBs. However, in a Folding ADC, the residual signal is obtained directly from a folding circuit. This is unlike the two-step ADC that obtains the residual signal through the output of its coarse quantizer. As such, the Folding ADC can operate at the full speed of a Flash ADC without the need to wait for the coarse quantizer to first complete its operation.
A Folding ADC uses fewer parallel paths than a Flash ADC but is capable of retaining the high speed of the Flash ADC. With the Folding ADC, the number of parallel paths is reduced significantly and is minimized when the MSBs and the LSBs have the same number of bits. For example, an 8-bit Folding ADC having 4-bit MSBs and 4-bit LSBs will require only 2(2^4−1)=30 parallel comparators. This is much less than the 255 comparators required in an 8-bit Flash ADC.
The operation of the Folding ADC is discussed in greater detail in reference [10].
Inner product computation of a signal with a plurality of coefficients is required in the fundamental function of many digital signal processing applications. Therefore, its implementation efficiency is of major significance from a practical feasibility point of view.
The Distributed Arithmetic (DA) technique is a well-known technique for computing inner products [1]. Compared to the multiply-accumulate (MAC) approach, the DA technique allows the inner product computation to be completed in a number of cycles proportional to the bit-length of the input signal entries, instead of the number of coefficients. As such, it provides performance gain when the number of coefficients is more than the bit-length of the input signal entries. Inner product computation involves the addition of a series of products (i.e. multiplication outputs). The DA technique allows the computation of the inner products without the need to perform multiplication by using a look up table (LUT) with bit-serial data addressing to provide the products. These products are then added together to derive the final answer i.e. the inner product.
The Residue Number System (RNS) [2] is suitable for the implementation of high speed digital signal processing as parallel operations and small data bit-lengths may be achieved with the RNS.
In the RNS, a big natural number A within a legitimate dynamic range [0,P) can be uniquely represented by a set of smaller natural numbers <a1,a2, . . . , aM>. This set of smaller natural numbers is known as the residues or residue digits of the number A and is derived based on a modular arithmetic principle using a selected set of numbers [m1,m2, . . . , mM] called the moduli set. In particular, the numbers <a1,a2, . . . , aM> are the remainders obtained by dividing the number A by the respective moduli [m1,m2, . . . , mM]. The moduli are pair-wise prime positive integers (that is, they have no integer factors in common except 1) and P is equal to the product of the moduli, i.e. A < P = ∏_{i=1}^{M} m_i. The relationship between the number A and its residues <a1,a2, . . . , aM> may be referred to as an RNS relationship which may be expressed in the form A≅<a1,a2, . . . , aM>. Furthermore, the residues <a1,a2, . . . , aM> of a number A are referred to as the RNS format of the number A.
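The conversion of a number into its RNS format can be illustrated with a minimal Python sketch. The function name `to_residues` is hypothetical and chosen only for this illustration; the sketch simply takes remainders and checks the two conditions stated above (pair-wise prime moduli, and A lying in [0,P)).

```python
from math import gcd, prod

def to_residues(a, moduli):
    """Return the residue digits of a for a pair-wise prime moduli set."""
    # The moduli must share no common factor other than 1.
    for i, mi in enumerate(moduli):
        for mj in moduli[i + 1:]:
            if gcd(mi, mj) != 1:
                raise ValueError(f"moduli {mi} and {mj} are not relatively prime")
    # The number must lie inside the legitimate dynamic range [0, P).
    p = prod(moduli)
    if not 0 <= a < p:
        raise ValueError(f"{a} is outside the dynamic range [0, {p})")
    return tuple(a % m for m in moduli)

print(to_residues(179, (7, 8, 9)))  # (4, 3, 8)
```

Applying the function to every integer in [0,504) with the [7,8,9] moduli set yields 504 distinct residue sets, which is the uniqueness property described above.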
Besides being able to represent a big natural number using smaller residue digits, another important property of the RNS is that arithmetic operations such as addition, subtraction and multiplication of two numbers A and B can be equivalently performed with RNS-based arithmetic using their corresponding sets of residue digits ai and bi corresponding to the modulus mi. Moreover, these operations can be performed in an independent and parallel manner, with no carry-propagation occurring between the operations for different moduli.
For instance, using the [7,8,9] moduli set, which provides a legitimate dynamic range of [0,504), the integer R=179 can be represented by the residue digits 4, 3 and 8 (i.e. the <4,3,8>_{7,8,9} residue set) and the integer S=254 can be represented by the <2,6,2>_{7,8,9} residue set. Arithmetic operations between the integers R and S can be equivalently performed using their corresponding residue sets <4,3,8>_{7,8,9} and <2,6,2>_{7,8,9} as follows:
R∘S ≅ <|4∘2|_7, |3∘6|_8, |8∘2|_9>   (1)
where the arithmetic operator ∘ can be +, − or ×.
For example, with the arithmetic operator ∘ as +, the following is obtained.
179+254 = 433 ≅ <|4+2|_7, |3+6|_8, |8+2|_9> = <|6|_7, |9|_8, |10|_9> = <6,1,1>_{7,8,9}   (2)
Note that there is a need to perform a modulo operation on an output of the arithmetic operation if its value equals or exceeds its modulus. For example, in Equation (2), the outputs of the arithmetic operation “+” on the residue digits are 6, 9 and 10, expressed in the form <6,9,10>_{7,8,9}. Outputs 9 and 10 exceed their corresponding moduli 8 and 9 and thus, it is necessary to perform modulo operations on outputs 9 and 10 with moduli 8 and 9 respectively, giving the residue digits 1 and 1.
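The digit-wise arithmetic and the modulo step just described can be sketched in Python as follows. The helper name `rns_apply` is hypothetical; the sketch only shows that each modulus is handled independently, with no carry passing between channels.

```python
import operator

MODULI = (7, 8, 9)

def rns_apply(op, r, s, moduli=MODULI):
    """Apply op digit by digit, reducing each output by its own modulus."""
    return tuple(op(a, b) % m for a, b, m in zip(r, s, moduli))

R = tuple(179 % m for m in MODULI)  # (4, 3, 8)
S = tuple(254 % m for m in MODULI)  # (2, 6, 2)

total = rns_apply(operator.add, R, S)
print(total)  # (6, 1, 1), the residue digits of 433
```

The same helper works with `operator.sub` and `operator.mul`, but the result only identifies the true answer when that answer lies within the legitimate dynamic range [0,504); for example, 179×254 exceeds 504, so the digits obtained represent the product reduced modulo 504.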
As shown in Equation (2), in the RNS, arithmetic operations between residue digits arising from the same modulus can be performed in a parallel and independent manner from residue digits arising from other moduli. This holds as long as the resultant output from the arithmetic operation does not exceed the legitimate dynamic range provided by the moduli set. Furthermore, since the residue digits of a number are smaller than the number itself, a much shorter bit-length may be used to encode the residue digits as compared to the bit-length used to encode the number. These properties of the RNS, i.e. smaller residue digits and parallel arithmetic operations, make the RNS ideal for use with the DA technique for inner product calculation. In particular, since the performance gain that can be provided by the DA technique is dependent on the bit-lengths of the input signal entries, the smaller values of the residue digits can lead to a faster execution cycle due to the shorter bit-lengths required to encode the residue digits. Furthermore, the ability to perform parallel operations across different moduli enables simultaneous arithmetic operations to be done in multiple independent channels, each reserved for residue digits derived using the same modulus.
However, in practice, some complications arise when implementing the RNS with the DA technique (i.e. when implementing a DA-RNS system) for inner product calculation. Even if each input signal entry is in the RNS format with smaller residue digits, the residue digits themselves are usually still encoded in the binary code (BC) format. As such, there are still overheads (although, lower when compared to using a non-RNS based approach) due to localized carry propagation in the arithmetic operations performed in each channel. Furthermore, because of the 2^n bit weights associated with the BC format, a 2^n scaling process is required for the inner product computation when the residue digits are encoded in the BC format. This need for a 2^n scaling process complicates issues in a DA-RNS system for inner product computation since executing a modulo operation on the 2^n factor is complex in practice [3]. Therefore, in a DA-RNS system using the BC format to encode the residue digits (i.e. a BC based DA-RNS system), the modular adder used to compute the inner products requires a convoluted implementation. There is also no simple way to perform the modulo operation [2] for BC formatted residue digits for a generic class of moduli (i.e. not moduli with carefully selected values, such as powers of 2 or the like). Thus, to date, there are hardly any reports on efficient means to implement the DA-RNS concept.
The following provides more details of the DA technique and the BC based DA-RNS system.
Inner product computation of an input signal with a plurality of coefficients Ak may be expressed as follows:
y = Σ_{k=0}^{K−1} A_k x_k   (3)
In Equation (3), y is the inner product to be computed and it is assumed that Ak take on fixed values (e.g. Ak may be the filter coefficients of a FIR filter). The input signal is in a representation
Now consider the case whereby each input signal entry xk is encoded with a plurality of bits in the BC format with a bit-length of N. Each input signal entry xk may be expressed in terms of its plurality of bits bkn as follows:
x_k = Σ_{n=0}^{N−1} b_kn 2^n   (4)
In Equation (4), bkn represents the bit in the nth bit position (i.e. the nth bit) of the plurality of bits encoding xk and has either the binary value of 0 or 1 (i.e. is either bit ‘0’ or bit ‘1’). 2^n represents the weight of the bit bkn and differs for each bit position n.
Substituting Equation (4) into Equation (3), Equation (3) can be written in the form associated directly with the bits of the input signal entries as follows:
y = Σ_{k=0}^{K−1} A_k [Σ_{n=0}^{N−1} b_kn 2^n]   (5)
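Interchanging the order of the summations in Equation (5) and bringing Ak together with the binary bits bkn of xk, the following equation is obtained.
y = Σ_{n=0}^{N−1} [Σ_{k=0}^{K−1} A_k b_kn] 2^n   (6)
Writing the bracketed term as
f(A_k,b_kn) = Σ_{k=0}^{K−1} A_k b_kn   (7)
Equation (6) becomes
y = Σ_{n=0}^{N−1} f(A_k,b_kn) 2^n   (8)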
The function f(Ak,bkn) represents a sum of multiplications to be performed and is derived using the individual binary bits bkn of each input signal entry xk. Since each bit bkn can only take on a value of either 0 or 1 and the value of each Ak is fixed, there are altogether 2^K possible combinations of the bits bkn and the coefficients Ak for Equation (7).
In the DA technique, the values of the function f(Ak,bkn) resulting from the 2^K possible combinations may be pre-computed and stored as entries in a Look-Up-Table (DALUT). The DALUT is then successively addressed by using the nth bit of all the input signal entries xk in parallel, starting with n=0 until n=N−1. With each addressing of the DALUT, an output comprising the value of the function f(Ak,bkn) corresponding to the nth bit is provided. The successive outputs from the DALUT are then accumulated as indicated in Equation (8) and the accumulated sum obtained after the final addressing (n=N−1) is the inner product y. From Equation (8), it can be seen that due to the different weights 2^n of the binary bits bkn in the input signal entries xk, there is a need to first scale each output from the DALUT by its respective 2^n factor.
Consider an example of an inner product computation having K=4 coefficients. This inner product computation has the expression shown in Equation (9) below:
y = A_0 x_0 + A_1 x_1 + A_2 x_2 + A_3 x_3   (9)
In this example, each of the input signal entries xk: x0,x1,x2,x3 is encoded with a plurality of bits bkn in the BC format with a bit-length of N=3 as follows:
x0={b02 b01 b00}
x1={b12 b11 b10}
x2={b22 b21 b20}
x3={b32 b31 b30}   (10)
A system based on the DA technique (i.e. BC based DA system) can then be implemented.
The DALUT is then successively addressed using the nth bit of all the input signal entries xk in parallel, starting with n=0 until n=2 and the corresponding DALUT entries are successively provided as the DALUT's outputs. This takes place in three execution cycles whereby in each execution cycle, a collective bit pattern formed by concatenating the nth bit of the input signal entries in a bit-serial manner is used. The collective bit patterns bk0, bk1 and bk2 (with k=0 to 3) respectively for the execution cycles tcycle=0, tcycle=1 and tcycle=2 are as follows:
tcycle=0: bk0={b00 b10 b20 b30}
tcycle=1: bk1={b01 b11 b21 b31}
tcycle=2: bk2={b02 b12 b22 b32}   (11)
The DALUT output from each execution cycle is then scaled by its corresponding scaling factor 2^n before it is accumulated with scaled DALUT outputs from previous execution cycles (see Equation (8)).
In a conventional binary number system, the 2^n scaling of a DALUT output may be performed by a logical left shift of the bits of the DALUT output by an amount corresponding to the value of n. The adder can be any type of binary adder and the output of the adder may be stored into a register to be used for further accumulation with incoming scaled DALUT outputs.
Assuming that the scaling and accumulation execution operations for each DALUT output can be performed within one clock cycle (although, in practice, depending on the accumulator implementation, this may take more than 1 clock cycle), the inner product computation can thus be completed in N clock cycles with the DA technique. In contrast, using the MAC approach, the computation will take K execution cycles. Assuming that each MAC execution operation can be performed within one clock cycle (which is only true if one multiplication and addition can be performed in 1 cycle), the DA technique provides performance gain for the inner product computation if N<K. This is the case in the above example where N=3 and K=4. In practice, the value of N is usually much lower than that of K, i.e. N<<K. Furthermore, there is no multiplier needed in the DA technique to perform the computation due to the use of the DALUT. This is beneficial as having a multiplier is typically more hardware costly.
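The K=4, N=3 example can be sketched in Python as a behavioral model of a BC based DA system. The function names and the coefficient and input values are hypothetical, the input signal entries are assumed to be unsigned, and the mapping of entry k to address bit k is an arbitrary choice made for this illustration.

```python
def build_dalut(coeffs):
    """Pre-compute the sum of A_k * b_k for all 2^K bit patterns."""
    k = len(coeffs)
    return [sum(a for j, a in enumerate(coeffs) if (addr >> j) & 1)
            for addr in range(1 << k)]

def da_inner_product(coeffs, xs, n_bits):
    """Bit-serial inner product: one DALUT access per bit position n."""
    lut = build_dalut(coeffs)
    acc = 0
    for n in range(n_bits):
        # Collective bit pattern: the nth bit of every input signal entry.
        addr = sum(((x >> n) & 1) << j for j, x in enumerate(xs))
        # Scale the DALUT output by 2^n with a left shift, then accumulate.
        acc += lut[addr] << n
    return acc

coeffs = [3, -1, 4, 2]   # A0..A3 (illustrative values)
xs = [5, 7, 0, 6]        # x0..x3, each fitting in N=3 bits
y = da_inner_product(coeffs, xs, n_bits=3)
mac = sum(a * x for a, x in zip(coeffs, xs))  # multiply-accumulate reference
print(y, mac)
```

The loop runs N=3 times regardless of K, whereas the multiply-accumulate reference performs K=4 multiplications; the table, however, holds 2^K entries.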
BC based DA-RNS systems have been reported in publications such as [3], [5] and [6] but the number of such publications is smaller than what one would normally expect in view of such a seemingly good match between the DA technique and the RNS. This is likely due to the difficulties in implementing modulo operations on the 2^n scaling factors that originate from the weights of the bits of the BC encoded residues (BCR). The following derives the expression reflecting the implementation of the inner product computation using the RNS and DA technique, and reveals the above-mentioned difficulties.
Starting with the same inner product computation expression as in Equation (3) whereby
and expressing y in its RNS format y≅<y1,y2, . . . , yM> using a [m1,m2, . . . , mM] moduli set, a total of M residue digits based equations can be derived. Each residue digits based equation has the general expression as shown in Equation (12) where yi is the inner product for the modulus mi.
Using the binary bit representation of xk as given in Equation (4), |xk|m
Combining Equations (12) and (13) produces
The expression within the modulus of Equation (14) is the same as that in Equation (5), and hence can be similarly re-arranged as follows:
As before, the 2^n factor needs to be decoupled from the term f(Ak,bkn) that is to be stored in the DALUT. This is done by applying the algebra of RNS as follows
Equation (16) then becomes the residue expression:
The values of fm
The present invention aims, in one aspect, to provide a new and useful converter for converting an analog input signal into a digital representation.
In general terms, the one aspect of the present invention proposes an ADC which uses the input signal to generate an RNS representation of the signal based on a plurality of moduli. For each modulus there is a Residue Number System (RNS) converter which includes a number of zero-crossing based folding circuits equal to the modulus, and a comparator for each zero-crossing based folding circuit. The output of the comparators is used to form the RNS representation. This ADC may be implemented using a smaller number of comparators than known systems, and with high accuracy. Optionally, the RNS representation may be converted into different digital representations.
The present invention further aims, in another aspect, to provide a new and useful system for computing an inner product of an input signal with a plurality of coefficients.
In general terms, the other aspect of the present invention proposes a system which uses the input signal having a number K of signal entries. Each signal entry is represented in an RNS format, in which the residue for each modulus is represented as a string in which the number of components taking a first value is equal to the residue. Corresponding components of the strings for different input entries are used to obtain a summation value, and the summation values are accumulated. Since the components of the string are not associated with weight values, the accumulation of the summation values can be performed without using a scaling accumulator.
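The weight-free accumulation described above can be illustrated with a rough Python sketch for a single modulus channel. Several details are assumptions made only for this illustration and are not taken from the description: the string is taken to have m−1 components with the ones placed first, the summation value is taken from a pre-computed look-up table, and every summation value is reduced by the modulus. All function names are hypothetical.

```python
def residue_string(residue, modulus):
    """String of modulus-1 components; the number of 1s equals the residue."""
    return [1 if j < residue else 0 for j in range(modulus - 1)]

def build_lut(coeffs, modulus):
    """Assumed table: (sum of A_k * b_k) mod modulus for every bit pattern."""
    k = len(coeffs)
    return [sum(a for j, a in enumerate(coeffs) if (addr >> j) & 1) % modulus
            for addr in range(1 << k)]

def channel_inner_product(coeffs, xs, modulus):
    """Inner product residue for one modulus, accumulated without scaling."""
    lut = build_lut(coeffs, modulus)
    strings = [residue_string(x % modulus, modulus) for x in xs]
    acc = 0
    for pos in range(modulus - 1):
        # Corresponding components of the strings of all K signal entries.
        addr = sum(s[pos] << j for j, s in enumerate(strings))
        # Every component has the same weight, so no 2^n scaling is applied.
        acc = (acc + lut[addr]) % modulus
    return acc

coeffs = [3, 5, 2, 7]      # illustrative coefficients
xs = [10, 20, 30, 40]      # illustrative signal entries
moduli = (7, 8, 9)
y = sum(a * x for a, x in zip(coeffs, xs))   # 470, inside [0, 504)
channels = tuple(channel_inner_product(coeffs, xs, m) for m in moduli)
print(channels, tuple(y % m for m in moduli))
```

Because the components within a string all count equally, summing over the string positions reproduces each residue exactly, and the per-channel results agree with the residues of the directly computed inner product.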
Embodiments of the invention will now be illustrated for the sake of example only with reference to the following drawings, in which:
a)-(d) show logic gate implementations for binary adders; and
As discussed above, the RNS relies on modular arithmetic principles, which allows an integer to be uniquely defined by its remainders (the residues or residue digits) when divided by a set of pair wise prime positive integers (these integers are also known as moduli and the set of these integers is known as a moduli set). As such, a feature of the RNS is that an integer within a large dynamic range (defined by the product of the moduli) can be uniquely represented by a set of residue digits that have much smaller values corresponding to the size of the moduli set used in the computation. For example, the residue digits from a moduli set [7,8,9] have values varying within the dynamic range of 0 to 6, 0 to 7 and 0 to 8 respectively and the maximum dynamic range provided by this moduli set [7,8,9] is [0,7×8×9=504) i.e. integers lying within the range of 0 to 503 can be uniquely represented by the residue digits from this moduli set [7,8,9]. An 8-bit integer in the range of 0 to 255 lies within this dynamic range and hence, can be uniquely and more than adequately represented by the residue digits from the moduli set [7,8,9]. For example, an integer 178 can be represented by the residue digits <3,2,7>_{7,8,9} using the moduli set [7,8,9].
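That the residue digits identify the integer uniquely can be checked by reversing the conversion. The sketch below uses the Chinese Remainder Theorem, which is a standard reconstruction method and is named here as an assumption; the description does not state which reverse conversion is used. The function name `from_residues` is hypothetical.

```python
from math import prod

def from_residues(residues, moduli):
    """Recover the integer in [0, P) from its residue digits (CRT)."""
    p = prod(moduli)
    total = 0
    for r, m in zip(residues, moduli):
        partial = p // m                 # product of all the other moduli
        inverse = pow(partial, -1, m)    # modular inverse (Python 3.8+)
        total += r * partial * inverse
    return total % p

print(from_residues((3, 2, 7), (7, 8, 9)))  # 178
```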
The residue digits representing an integer follow a particular pattern as the integer value increases. In particular, as the integer value increases, the residue digit representing the integer increases as well and resets to 0 whenever the integer value reaches multiples of the modulus (including the modulus itself). For example, using the modulus m=7, the residue digits of an integer will follow a pattern of the form {0,1,2,3,4,5,6,0,1,2,3,4,5,6,0,1,2 , . . . } as the integer value increases linearly from 0 with an incremental value of 1. Hence, the digital output of the RNS-based ADC 300 should also follow a pattern. More specifically, the digital output of the ADC 300 should also reset itself repeatedly, in particular whenever the level of the analog input signal reaches multiples of the modulus used by the ADC 300.
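The reset pattern can be reproduced with a short behavioral sketch in Python. The second function is an idealized model only: it assumes the input level is quantized in steps of a quantization level `delta_v` and then reduced by each modulus, and it says nothing about how the circuits obtain the residue. Both function names and the parameter `delta_v` are hypothetical.

```python
def residue_pattern(modulus, count):
    """Residue digits of the integers 0..count-1 for one modulus."""
    return [level % modulus for level in range(count)]

def ideal_residues(v_in, delta_v, moduli):
    """Idealized model (an assumption): quantize v_in, then take residues."""
    level = int(v_in // delta_v)   # number of whole quantization steps
    return tuple(level % m for m in moduli)

print(residue_pattern(7, 17))
# [0, 1, 2, 3, 4, 5, 6, 0, 1, 2, 3, 4, 5, 6, 0, 1, 2]
print(ideal_residues(178.4, 1.0, (7, 8, 9)))
# (3, 2, 7)
```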
As shown in
Each group of zero-crossing based folding circuits is configured for a different integer modulus mn, where n=1,2, . . . , M and M≥2, and may be referred to as a modulus mn group. Each integer modulus mn is relatively prime to the other integer moduli. In other words, other than 1, there is no common factor between the integer moduli. For example, the ADC 300 may comprise three modulus mn groups of zero-crossing based folding circuits for an M=3 moduli set [3,4,5] with m1=3, m2=4 and m3=5 which are relatively prime to one another.
Each modulus mn group comprises mn parallel zero-crossing based folding circuits, each indexed mn,i where i=1, . . . , mn. Each zero-crossing based folding circuit may be implemented with any type of circuit that is capable of performing the zero-crossing based foldings. Examples of such circuits are described in references [10], [11] and [12].
With an analog input signal whose level VIN increases linearly, the mn zero-crossing based folding circuits in each modulus mn group produce mn zero-crossing based folding waveforms Wm
As shown in
The second zero-crossing based folding waveform Wm
Similar patterns are also present in the zero-crossing based folding waveforms Wm
Furthermore, the mn zero-crossing based folding circuits for each modulus mn group have the same folding factor determined by the modulus mn. In other words, their zero-crossing based folding waveforms have the same number of zero-crossings or zero-crossing voltage transitions. Note that the folding factors must be able to provide the resolution and dynamic range required by the ADC 300. Thus, the total number of zero-crossings in the zero-crossing based folding waveforms depends on the dynamic range to be provided by the ADC 300. For example, if the ADC 300 is designed to be an 8-bit ADC, the number of zero-crossings in each zero-crossing based folding waveform may be either (2^8−1)/mn or (2^8)/mn, depending on the phase differences between the waveforms generated by the circuits mn,i within each modulus group mn. The zero-crossing based folding waveforms for each modulus group mn have to comprise a number of zero-crossings sufficient to represent the total number of LSBs required by the ADC 300.
The zero-crossing based folding waveforms may be of the single-ended type or the differential-ended type which is more noise tolerant and common mode level insensitive.
Each modulus mn group of zero-crossing based folding circuits is configured to compare a level VIN of the analog input signal at different points of the input signal against a set of reference voltages (or in other words, code transition voltage levels) to produce comparison outputs.
The zero-crossings of each zero-crossing based folding waveform are at a subset of the set of reference voltages. The reference voltages are multiples of the quantization level ΔV of the ADC 300, typically measured in volts. The actual amplitudes of the reference voltages may be in the millivolt or micro-volt range. Some of the reference voltages may be obtained from a reference ladder resistor network. To reduce the number of voltages needed from the reference ladder resistor network, additional voltages may be generated by an interpolation technique using the adjacent pair of zero-crossing based folding circuits required for producing zero-crossing based folding waveforms of appropriate folding factor. For example, referring to
The comparison outputs for each modulus mn group are based on the plurality of zero-crossing based folding waveforms produced by the modulus mn group. In particular, each comparison output is a point on a respective zero-crossing based folding waveform corresponding to the level VIN. For each modulus mn group of zero-crossing based folding circuits, the comparison outputs are collectively output from the zero-crossing based folding circuits in the group and indicate a residue from a modulo operation on the input signal level VIN based on the modulus mn. The value of the residue is related to the number of parallel zero-crossing based folding circuits and the folding factor in the modulus mn group.
A more specific example of how the zero-crossing based folding circuits operate is as follows. A level VIN of the input signal at a point of the input signal is first compared against the reference voltages. This determines the location on the zero-crossing based folding waveforms the level VIN corresponds to. The comparison outputs are the points of the waveforms at this location.
For example, in
The ADC 300 further comprises a coding unit configured to transform the comparison outputs into the RNS representation. The coding unit, together with the zero-crossing based folding circuits, forms a RNS converter.
For each modulus mn, the coding unit comprises a plurality of comparators configured to convert the outputs of the plurality of zero-crossing based folding circuits (the comparison outputs) to a plurality of comparator bits with each comparator bit indicating the level of one of the plurality of waveforms (and in particular whether it has the characteristic of being above or below its associated horizontal dotted line).
The coding unit further comprises an encoder for each modulus mn whereby the encoder is configured to combine the plurality of comparator bits (from the comparators associated with the modulus mn group) to form a plurality of bits with a different format.
With a linearly increasing input signal level, the digital outputs from the encoder follow a pattern in which they are repeatedly reset to zero. More specifically, the digital outputs from the encoder are reset to zero every time the input signal level reaches the value of the modulus mn or a multiple of that value. In other words, these digital outputs encode the residue of the input signal level from a modulo operation based on the modulus mn. Thus, these digital outputs can be said to be in the RNS format i.e. the circular code pattern digital outputs (comparator bits) from the comparators associated with each modulus mn group are combined by the encoder to form digital outputs in the RNS format.
The encoder may comprise mn−1 circuits capable of performing the Exclusive OR (XOR) function. These circuits may comprise a plurality of XOR logic gates.
By combining the residue digital output codes from all the moduli groups, the corresponding input signal level within a dynamic range equal to the product of the moduli used by the ADC 300 can be uniquely determined. As shown in
The RNS is capable of detecting and correcting bit errors when redundant moduli are used. Therefore, in one example, the ADC 300 uses redundant moduli. In other words, the ADC 300 uses a plurality of non-redundant moduli which are sufficient to provide the desired level of resolution of the input voltage (because their product is sufficiently high to encode the input voltage to this desired accuracy), and one or more additional moduli, which can be considered as redundant. These redundant moduli are also relatively prime with respect to each other and to the non-redundant moduli. The residues extracted by the ADC 300 for the redundant moduli can be compared against the residues extracted for the non-redundant moduli to check the accuracy of the residues obtained for the non-redundant moduli. Such ADCs are capable of performing self bit error detection and self bit error correction, and thus are more reliable. The ADC 300 may comprise a modulus mn group of zero-crossing based folding circuits and a coding unit for each redundant modulus so as to convert the analog input signal into additional residues based on the redundant modulus. These modulus mn groups of zero-crossing based folding circuits and coding units may be used with an appropriate decoder or computation circuit that is capable of performing the error detection and correction functions. Reference [14] is a reference on the error detection and correction properties of the RNS.
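One way such a consistency check could work is sketched below in Python, for detection only. It is an illustrative assumption, not the decoder of the ADC 300: the value is reconstructed from the non-redundant residues with the Chinese Remainder Theorem, and the residue it implies for the redundant modulus is compared with the residue actually extracted. The choice of 11 as the redundant modulus and all function names are hypothetical.

```python
from math import prod

def crt(residues, moduli):
    """Reconstruct the integer in [0, P) from its residue digits."""
    p = prod(moduli)
    return sum(r * (p // m) * pow(p // m, -1, m)
               for r, m in zip(residues, moduli)) % p

def is_consistent(residues, moduli, extra_residue, extra_modulus):
    """True if the redundant residue agrees with the non-redundant ones."""
    return crt(residues, moduli) % extra_modulus == extra_residue

moduli = (7, 8, 9)        # non-redundant moduli
extra = 11                # redundant modulus, relatively prime to the others
value = 178
good = tuple(value % m for m in moduli)   # (3, 2, 7)
bad = (3, 2, 6)                           # last residue digit corrupted

print(is_consistent(good, moduli, value % extra, extra))
print(is_consistent(bad, moduli, value % extra, extra))
```

With a single redundant modulus larger than every non-redundant modulus, any error confined to one non-redundant residue digit is flagged by this comparison. Locating and correcting the faulty digit needs more redundancy and is not attempted here.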
Because of the modular nature of the circuit arrangements in the ADC 300 as well as the mathematical properties of the RNS, it is possible to independently enable and disable each modulus mn group of zero-crossing based folding circuits and its associated coding unit. In one example, a control unit comprising a control circuit is configured to enable and disable the zero-crossing based folding circuits and associated coding units for a subset of the plurality of moduli used by the ADC 300. Disabling the zero-crossing based folding circuits and coding units for a subset of the plurality of moduli does not affect the general operation of the ADC 300, except that it lowers the resolution and dynamic range provided by the ADC 300. Therefore, the number of moduli used can be reduced if a lower resolution and a smaller dynamic range are acceptable. For instance, a moduli set [7,8,9] provides a maximum dynamic range of 504 and instead of using this moduli set, it is possible to remove the modulus 7 and use a new moduli set [8,9] when a smaller dynamic range of 9×8=72 is acceptable.
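The effect of disabling a modulus group on the dynamic range can be tallied with a small Python sketch. The function names and the boolean `enabled` flags are hypothetical stand-ins for the control unit's state, and the bit count shown is simply the largest whole number of bits that fits in the range.

```python
from math import prod

def dynamic_range(moduli, enabled):
    """Product of the moduli whose groups are currently enabled."""
    return prod(m for m, on in zip(moduli, enabled) if on)

def whole_bits(dyn_range):
    """Largest b such that 2^b does not exceed the dynamic range."""
    return dyn_range.bit_length() - 1

moduli = (7, 8, 9)
full = dynamic_range(moduli, (True, True, True))       # 504
reduced = dynamic_range(moduli, (False, True, True))   # 72
print(full, whole_bits(full))        # 504 8
print(reduced, whole_bits(reduced))  # 72 6
```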
The ADC 300′ comprises a second example encoder (hereinafter, “Encoder #2”) instead of Encoder #1 in
Similar to the Encoder #1, by using a combination of the residue digital output codes generated by Encoder #2 of all the moduli groups, it is possible to uniquely determine the corresponding input signal level within a dynamic range equal to the product of the moduli used by the ADC 300′. As shown in
Similar to the ADC 300, the ADC 300′ may also use redundant moduli. Furthermore, each modulus mn group of zero-crossing based folding circuits and its associated coding unit in the ADC 300′ may also be independently enabled and disabled.
The ADC 300 or its variation 300′ is a highly efficient ADC with several advantages over existing ADCs. The following describes some of the advantages of the ADC 300 and its variation 300′.
As compared to the Folding ADC and the Flash ADC, the ADC 300 or 300′ uses a smaller number of parallel paths to achieve a same resolution. The ADC 300 or 300′ uses a zero-crossing based folding circuit together with one comparator for every parallel path and compared to the commonly used parallel based Flash ADC, a much smaller number of comparators is required for the ADC 300 or 300′ to provide a particular dynamic range. For example, an 8-bit ADC in the form of the ADC 300 or 300′ using a [7,8,9] moduli set can be more than adequately implemented by using 7+8+9=24 comparators i.e. 24 parallel paths whereas an 8-bit Flash ADC requires 2^8−1=255 parallel paths and an 8-bit Folding ADC requires 2(2^4−1)=30 parallel paths. The difference in the number of parallel paths required by a Folding ADC, a Flash ADC and ADC 300 or 300′ becomes even more pronounced when higher resolutions are required. For example, to implement a 10-bit ADC, the Flash ADC will need 1023 comparators, the Folding ADC will need 2(2^5−1)=62 comparators whereas the ADC 300 or 300′ will only require 9+11+13=33 comparators when using the [9,11,13] moduli set. This great reduction in the number of comparators and parallel paths required by the ADC 300 or 300′ is possible as the operations of the ADC 300 or 300′ are based on the theory of modular arithmetic using the RNS. Furthermore, despite the reduction in the number of parallel paths, the speed performance of the ADC 300 or 300′ is not inferior to that of the Folding ADC or the Flash ADC.
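The comparator counts quoted above can be reproduced with a short Python sketch. The function names are hypothetical, and the Folding ADC count assumes an even split between MSBs and LSBs, as in the examples given.

```python
from math import prod

def flash_comparators(bits):
    """One comparator per quantization level: 2^n - 1."""
    return 2 ** bits - 1

def folding_comparators(bits):
    """Coarse plus fine quantizer, assuming the bits are split evenly."""
    msb = bits // 2
    lsb = bits - msb
    return (2 ** msb - 1) + (2 ** lsb - 1)

def rns_comparators(moduli, bits):
    """One comparator per folding circuit: the sum of the moduli."""
    if prod(moduli) < 2 ** bits:
        raise ValueError("moduli set does not cover the required range")
    return sum(moduli)

for bits, moduli in ((8, (7, 8, 9)), (10, (9, 11, 13))):
    print(bits, flash_comparators(bits), folding_comparators(bits),
          rns_comparators(moduli, bits))
# 8 255 30 24
# 10 1023 62 33
```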
In addition, the RNS modular arithmetic also provides the ADC 300 or 300′ features of built-in bit error detection and bit error correction capability of its output bits. This is possible because of the error detection properties of the Redundant Residue Number System (RRNS). In particular, the ADC 300 or 300′ is capable of detecting and correcting errors in its output when redundant moduli are used. Extra parallel circuitry such as additional zero-crossing based folding circuits may be included for these redundant moduli. Thus, the ADC 300 or 300′ is capable of achieving a more reliable and accurate operation.
Furthermore, the ADC 300 or 300′ may comprise a control unit that enables and disables the zero-crossing based folding circuits and coding units for a subset of the plurality of moduli used. This allows an adaptive variation in the conversion resolution of the ADC 300 or 300′ to suit the need of the system operation that the ADC 300 or 300′ is used in, thereby allowing power management and reducing the overall power consumption of the system. In particular, when a lower resolution and a smaller dynamic range are acceptable, the zero-crossing based folding circuits and coding units for a subset of the plurality of moduli used by the ADC 300 or 300′ may be disabled. Although the device's resolution level is sacrificed, a lower operating power can be achieved and this is beneficial especially for devices such as a battery-operated mobile device. The zero-crossing based folding circuits and coding units may be enabled again when a higher resolution and a higher dynamic range are required.
While it is true that modular arithmetic has been applied in analog to digital conversion (see reference [13]), there are distinct differences between Pace's proposal and the ADC 300 or 300′.
The first difference is as follows. Pace's proposal requires the use of analog folding circuits with high linearity characteristics and accurate reference voltages for proper operation. Furthermore, the folding waveforms used for Pace's proposal are of a triangular shape that needs to bend sharply at the peaks of the waveforms while maintaining symmetry along the linear slopes of the waveforms. In contrast, the ADC 300 or 300′ only requires the zero-crossing based folding circuits to operate with accurate reference voltages to achieve the foldings. In particular, each of the zero-crossing based folding circuits only needs to determine whether the analog input signal level has crossed the reference voltages. Hence, the zero-crossing based folding circuits of ADC 300 or 300′ operate more like digital circuits where circuit linearity is irrelevant. This provides a significant advantage over Pace's proposal in terms of implementation practicality as the ADC 300 or 300′ may be implemented with a lower circuit complexity.
The second difference is in the output format of Pace's proposal and the ADC 300 or 300′. Pace's proposal outputs a digital code in a format that he refers to as Symmetrical Number System (SNS) in his publication [15]. Due to the ambiguity caused by the symmetrical triangular folding waveforms used in Pace's proposal, the SNS format has the disadvantage of requiring a complicated decoding process and/or additional steps to convert the outputs to the RNS format in order to apply the modular arithmetic algorithm for further processing. In contrast, the ADC 300 or 300′ outputs digital codes inherently in the RNS format. Note that the RNS format is technically based on a saw-tooth waveform while the SNS format is based on a triangular waveform, although in the ADC 300, no saw-tooth waveform is actually needed. The encoding of the digital codes output by the ADC 300 or 300′ with the RNS format is advantageous as efficient execution of signal processing algorithms may be performed on these digital codes directly based on modular arithmetic principles. Furthermore, encoding the digital codes output by the ADC 300 or 300′ with the RNS format allows unique identification of the corresponding analog input signal level.
Computation System for Computing an Inner Product of an Input Signal with a Plurality of Coefficients
Referring to
The conversion unit 1402 is configured to output the input signal in a representation comprising a plurality of input signal entries whereby the representation is in a bit-parallel format. For example, the input signal may be in the form of a K-component vector $\vec{x}_k=[x_0,x_1,\ldots,x_{K-1}]$, where $x_0,x_1,\ldots,x_{K-1}$ are the input signal entries. Each input signal entry $x_k$ indicates a characteristic of the input signal (for example, a level or magnitude of the input signal) at a point of the input signal (which may be a point in time if the input signal is a time signal).
If the input signal is an analog signal, the conversion unit 1402 is in the form of an ADC.
In one example, the conversion unit 1402 is in the form of an ADC 300 of the kind described above in relation to
However, note that the conversion unit 1402 of the DA-RNS system 1400 can also be in the form of other types of ADC. For example, the conversion unit 1402 may be in the form of an ADC that outputs data in the BC format and in this case, the BC formatted data may be converted to a format required by the summation and accumulating units 1406, 1408 before they are fed to the formatting unit 1404.
In any case, the conversion unit 1402 converts the input signal into a digital representation based on the residue number system (RNS) which uses a plurality of M relatively prime moduli, specifically a moduli set mi=[m1,m2, . . . , mM]. Each input signal entry is represented as a plurality of residues, corresponding to respective moduli of the plurality of moduli used by the system 1400. More specifically, each residue corresponds to an output from a modulo operation on the input signal entry based on its respective modulus.
Each residue is encoded as a binary string having a plurality of bits or in other words, components (at least) equal to the modulus minus one. The string has a number of bits taking a first value (say “1”) equal to the residue. Thus, the plurality of bits encoding each residue have equal weights. Any format may be used to encode the residues as long as the number of bits in the binary string taking the first value is equal to the residue.
In a more specific example, each residue is encoded in a thermometer code format as discussed below. Such a residue may be referred to as a thermometer code residue (TCR).
Thermometer code (TC) format refers to an encoding format which comprises a plurality of binary bits taking either a value of ‘0’ or ‘1’. The number of binary bits taking the value of ‘1’ is equal to the value of the datum the format encodes. For example, using the TC format, an integer with a value of 5 can be represented using a plurality of bits with the bit pattern {11111} comprising 5 bits ‘1’ (i.e. 5 bits with the value of ‘1’). Binary bits with a value of ‘0’ (i.e. bits ‘0’) may also be added to explicitly indicate the dynamic range (DR) associated with the datum. For example, an integer with a value of 5 and with a dynamic range of 10 may be represented by a plurality of bits with the bit pattern {0000011111}.
Mathematically, a TC encoded number system is a unary numeral system which is equivalent to a base-1 bit system when the symbol used is the binary bit. It is also common to describe it as a no place-value number system, since the positions of its bits ‘1’ in the bit pattern are not important. In other words, the bits representing a datum in the TC format have equal weights and the TC format can be referred to as an equal place-value number system.
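By way of illustration only, the TC format described above may be sketched as follows. The helper names `tc_encode` and `tc_decode` are hypothetical; bit patterns are held as strings with the bits '1' packed to the right, as in the examples given in the text.

```python
def tc_encode(value: int, dynamic_range=None) -> str:
    """Return a thermometer code string holding `value` bits '1'.

    If `dynamic_range` is given, bits '0' are added on the left so that the
    pattern explicitly indicates the dynamic range.
    """
    width = value if dynamic_range is None else dynamic_range
    if not 0 <= value <= width:
        raise ValueError("value must lie between 0 and the dynamic range")
    return "0" * (width - value) + "1" * value

def tc_decode(bits: str) -> int:
    # All bits have equal weight, so the value is simply the count of '1's,
    # regardless of where they sit in the pattern.
    return bits.count("1")

print(tc_encode(5))        # 11111
print(tc_encode(5, 10))    # 0000011111
print(tc_decode("0000011111"))  # 5
```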
In the output of the conversion unit 1402, each residue may be expressed in terms of its plurality of bits tkn, according to Equation (19). In Equation (19), |xk|m
Some features associated with TC based modular arithmetic are as follows. Modular addition of two TCRs can be done by first concatenating the bits encoding the TCRs. Then, the modulo operation can be done by checking a single bit of the output after removing the trailing ‘0’ of the concatenated bits as described below.
Consider an example with two TCRs, r1 and r2, each corresponding to an integer modulus m with decimal value of n. Let r1 consisting of (n−1) bits ‘1’ and r2 consisting of (n−3) bits ‘1’ be represented as follows, where each tx corresponds to a binary bit of value ‘1’ situated at bit position x in the r1 and r2 TC data.
$r_1 = 0\,t_{n-1}t_{n-2}t_{n-3}\ldots t_3t_2t_1$
$r_2 = 000\,t_{n-3}t_{n-4}t_{n-5}\ldots t_3t_2t_1$
The modulo addition of r1 and r2 comprises first concatenating r1 with
This intermediate sum is then logically shifted to the right by 3 bits to form a 2n-bit length TCR normalized to its rightmost bit position as follows:
$r_1 r_2 \gg 3 = 0000\,t_{2n-4}t_{2n-5}t_{2n-6}\ldots t_3t_2t_1$
The modulo operation on this intermediate sum, which forms the third step, is done in hardware by testing the bit value of the normalized intermediate sum's nth bit (which corresponds to the value of the modulus used for these TCRs). Based on this nth bit value, a circuit (e.g. a multiplexer-based circuit) selects the lower n bits if the nth bit has a bit value of ‘0’ or the upper n bits if the nth bit value is equal to ‘1’.
Modular subtraction operation for TCRs can also be similarly performed by concatenating the minuend with the additive inverse of the subtrahend, where the additive inverse of a TCR is obtained by taking the one's (1's) complement of its plurality of bits. With TCR based modulo operation, there is also no ambiguity in taking the additive inverse of a value ‘0’. This is because the one's complement of the plurality of bits in the TCR of the value ‘0’ is equal to the TCR of the modulus which reverts to the TCR of the value ‘0’ after the modulo operation.
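By way of illustration only, the TCR modular addition and subtraction described above may be sketched as follows. All function names are hypothetical. Each TCR is assumed to be held as an n-bit string for a modulus n, and the shift that normalizes the concatenated bits is modelled simply by packing all bits '1' to the right; a hardware implementation would not count bits in this way.

```python
def tcr(value: int, m: int) -> str:
    # m-bit thermometer code residue with `value` bits '1' on the right.
    return "0" * (m - value) + "1" * value

def normalize(bits: str) -> str:
    # Software stand-in for the shift step: pack the '1's to the right.
    ones = bits.count("1")
    return "0" * (len(bits) - ones) + "1" * ones

def tc_mod_add(a: str, b: str, m: int) -> str:
    word = normalize(a + b)   # 2m-bit normalized intermediate sum
    nth_bit = word[-m]        # m-th bit counted from the right
    # Select the upper m bits if the sum reached the modulus, else the lower.
    return word[:m] if nth_bit == "1" else word[m:]

def ones_complement(bits: str) -> str:
    return "".join("1" if c == "0" else "0" for c in bits)

def tc_mod_sub(a: str, b: str, m: int) -> str:
    # Add the additive inverse, obtained as the one's complement of b.
    return tc_mod_add(a, ones_complement(b), m)

print(tc_mod_add(tcr(6, 7), tcr(4, 7), 7))  # 0000111, i.e. |6+4|_7 = 3
print(tc_mod_sub(tcr(4, 7), tcr(5, 7), 7))  # 0111111, i.e. |4-5|_7 = 6
```

Subtracting the value '0' adds a pattern of n bits '1', which the bit test then removes again, consistent with the absence of ambiguity noted above.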
System 1400 further comprises a formatting unit 1404. The formatting unit 1404 is configured to convert the output of the conversion unit 1402 in the bit-parallel format to the bit-serial format. The formatting unit 1404 is further configured to send the bit-serial formatted data to the summation unit 1406.
System 1400 employs the DA technique and the RNS as mentioned above. Thus, it may be referred to as a DA-RNS system. A system 1400 whose summation unit 1406 receives input signal entries with residues encoded in the TC format may be referred to as a TC based DA-RNS system.
It is preferable if the TC based DA-RNS system uses more moduli with small values rather than a few moduli with medium values. For example, it is preferable to use a [5,7,8,9] moduli set rather than a [11,13,15] moduli set to cover a range equivalent to the range of an 11-bit BC system. This allows a more efficient use of the TC format with the RNS.
The equations governing the TC based DA-RNS system are similar to those governing the BC based DA-RNS system as mentioned above. However, instead of the BC's bit expression as shown in Equation (4), the TCR's bit expression as shown in Equation (19) is used. In other words,
The residue expression (corresponding to Equation (18)) for the TC based DA-RNS system can then be obtained by replacing the symbols used in Equation (18) with the TCR equivalents, namely, the number of bits for TCR is equal to mi−1, and all bits are of equal weight, 2^0=1. This residue expression is shown in Equation (21) where yi is the inner product for the modulus mi (more specifically, yi is the residue from a modulo operation on the inner product of the input signal with the plurality of coefficients Ak, whereby the modulo operation is based on the modulus mi). The inner product of the input signal with the plurality of coefficients Ak may be derived by combining all the inner products obtained for the plurality of moduli (for example, a binary representation of the inner product may be obtained by performing a reverse conversion using the Chinese Remainder Theorem). In other words, the inner product of the input signal with the plurality of coefficients Ak is a combination of the inner products obtained for the plurality of moduli after performing a reverse conversion.
Based on Equation (17), the expression of fm
The values of $f_{m_i}$ arise from dot products between the bits $t_{kn}$ of the residues corresponding to the modulus $m_i$ and the plurality of coefficients $A_k$, and modulo operations $|\bullet|_{m_i}$ based on the modulus $m_i$.
As shown in Equation (22), the DA technique is used. In particular, for each modulus, the dot product each summation value arises from is performed for a bit position n whereby the dot product is between the bits tkn at the bit position n (in other words, the bits t0n,t1n . . . , t(K-1)n) of the residues corresponding to the modulus mi and the plurality of coefficients Ak. In other words, the summation values represent the sum of the coefficients Ak over those of the set of corresponding bits which take the value 1.
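By way of illustration only, the summation values for one modulus may be precomputed as in the following minimal sketch. The names `build_dalut` and `address_from_bits` are hypothetical, and the address bit ordering (bit k of the address holds the bit of input signal entry k at the bit position n) is an assumption made for the sketch rather than a detail taken from the text. The coefficients are those of Equation (27).

```python
def build_dalut(coeffs, modulus):
    """Table of summation values for every K-bit address.

    Entry `address` holds the sum of the coefficients A_k whose address bit k
    is 1, reduced by the modulus.
    """
    k_count = len(coeffs)
    table = []
    for address in range(2**k_count):
        total = sum(a for k, a in enumerate(coeffs) if (address >> k) & 1)
        table.append(total % modulus)
    return table

def address_from_bits(bits):
    # bits[k] is the bit t_kn of input signal entry k at bit position n.
    return sum(b << k for k, b in enumerate(bits))

coeffs = [3, 11, 15, 11, 3]
dalut_mod5 = build_dalut(coeffs, 5)
print(len(dalut_mod5))                                   # 32
print(dalut_mod5[address_from_bits([1, 1, 0, 0, 0])])    # |3+11|_5 = 4
```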
In one example, the summation unit 1406 comprises a memory which in turn comprises a plurality of Look-Up-Tables (LUTs) (also referred to as DALUTs) with memory addresses addressable using the bits of the input signal entries. Each channel of the summation unit 1406 corresponding to each modulus mi comprises a DALUT. For each modulus mi, the DALUT stores the values of fm
For each modulus mi, the summation unit 1406 is configured to provide the summation values for successive values of n, by successively addressing the DALUT using an address string of length K, generated from the K bits tkn at the bit position n of the residues corresponding to the modulus mi i.e. |x0|m
The accumulating unit 1408 is configured to execute the summation and modulo operation in the residue expression as shown in Equation (21) for each modulus. In other words, it is configured to obtain an inner product yi for each modulus mi by cumulatively adding the summation values provided for the modulus mi and performing a modulo operation on the cumulative sum based on the modulus mi.
As shown in Equation (18), when the BC format is used to encode the residues of the input signal entries, it is necessary to scale fm
If a modulo operation is performed only after the summation of the summation values for all the bit positions is completed, the accumulating unit 1408 may overflow. Therefore, it is preferable to expand Equation (21) using the algebra of residue as shown below and execute modulo addition operations successively as the summation values are obtained. This can be more clearly illustrated using the example below in which a modulo operation is performed after every addition.
In other words, it is preferable to configure the accumulating unit 1408 to obtain the inner product yi for each modulus by (a) performing a summation of a first subset of the summation values (e.g. fm
In one example, the accumulating unit 1408 comprises a plurality of channels with each channel corresponding to one modulus mi. The accumulating unit 1408 further comprises a plurality of accumulators, with each accumulator configured to obtain the inner product for one modulus mi in one channel. In other words, for a moduli set [m1,m2, . . . , mM], the accumulating unit 1408 comprises a total of M channels and a total of M accumulators.
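By way of illustration only, the difference between reducing once at the end and reducing after every addition may be sketched as follows. The function names and the example summation values are hypothetical; the sketch merely tracks the largest value an accumulator register would have to hold in each case.

```python
def accumulate_late_modulo(summation_values, modulus):
    # Modulo applied only after all summation values have been added.
    acc, peak = 0, 0
    for f in summation_values:
        acc += f
        peak = max(peak, acc)
    return acc % modulus, peak

def accumulate_running_modulo(summation_values, modulus):
    # Modulo applied after every addition, so the register never has to
    # hold a value of modulus or more.
    acc, peak = 0, 0
    for f in summation_values:
        acc = (acc + f) % modulus
        peak = max(peak, acc)
    return acc, peak

values = [4, 4, 3, 2]   # arbitrary example summation values for modulus 5
print(accumulate_late_modulo(values, 5))     # (3, 13)
print(accumulate_running_modulo(values, 5))  # (3, 4)
```

Both approaches yield the same residue, but only the second keeps the stored value within the range of a single residue, which is the motivation for the per-channel modulo accumulators described above.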
Thus, the units 1406, 1408 are each implemented as a set of M channels.
The summation unit 1406 portion of the channel comprises a 16-entry DALUT 1506 and the accumulating unit 1408 portion of the channel comprises a Modulo-mi Accumulator 1508. The accumulator 1508 is configured to obtain the inner product for the corresponding modulus mi. As shown in
In one example, the modular adder 1502 as shown in
in other words, residues from modulo operations.
The binary adders 1602, 1604 are used to perform the modular addition operation:
In particular, the first binary adder 1602 is configured to perform an addition of the two operands, A and B to provide a sum S′. The second binary adder 1604 is configured to subtract the value of the modulus m from the sum S′. This subtraction is done by adding the sum S′ with the two's complement of m, i.e. $\tilde{m}$. The BC based modular adder further comprises a multiplexer 1606 whose output is controlled by a carry-out bit cout from the subtraction done by the second binary adder 1604. The multiplexer 1606 is configured to determine whether the output of the BC based modular adder should be S=A+B or S=A+B−m based on the carry-out bit cout. In other words, the multiplexer 1606 is in effect performing a modulo operation.
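By way of illustration only, the behaviour of the BC based modular adder described above may be modelled as follows. The function name and the choice of adder width (just wide enough to hold the largest possible sum A+B) are assumptions made for the sketch; the sketch models the arithmetic only and not the gate level structure of the binary adders.

```python
def bc_modular_add(a: int, b: int, m: int) -> int:
    """Model of a two-adder modular adder for residues 0 <= a, b < m."""
    width = (2 * m - 2).bit_length()      # assumed adder width in bits
    mask = (1 << width) - 1

    s_prime = a + b                       # first binary adder: S' = A + B
    m_twos_complement = (-m) & mask       # two's complement of m
    raw = s_prime + m_twos_complement     # second binary adder: S' - m
    carry_out = raw >> width              # 1 exactly when S' >= m
    s_double_prime = raw & mask

    # Multiplexer: pick S' - m when the subtraction did not go negative.
    return s_double_prime if carry_out else s_prime

print(bc_modular_add(4, 5, 7))  # 2
print(bc_modular_add(1, 2, 7))  # 3
```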
Although there is no carry propagation between channels for different moduli in the BC based modular adder, there is still a localized carry propagation occurring within each channel. This is because the residues to be summed by the BC based modular adder are encoded with the BC format whose operation is based on the principles of the binary adder. Furthermore, the BC based modular adder needs the carry-out bit cout from the subtraction performed by the second binary adder 1604 in order to generate its final output. Therefore, the performance of the BC based modular adder depends very much on the carry propagation performance of binary adders 1602 and 1604.
Each of the first and second binary adders 1602, 1604 may be in the form of a ripple carry full adder which is slow but uses a simple logic structure, or a version of the carry-look-ahead full adder which is faster but at a much higher logic gate cost.
As mentioned above, the BC based modular adder is inefficient due to the carry propagation which is in turn due to the use of the BC format. This inefficiency may be overcome by using an alternative coding format.
In another example, the modular adder 1502 is in the form of a one-hot code based modular adder (OHC based modular adder) which uses a one-hot code (OHC) format for encoding the data.
The OHC format comprises n bits, but only 1 bit is asserted at any one time. Hence, it is also known as a 1-out-of-n encoding scheme. The OHC format is normally used for decoding address bits for LUTs. When it is used to encode residues in a RNS, each residue encoded in this manner may be referred to as a one-hot residue (OHR) [7]. In the OHC format, the value of the residue corresponds directly to the asserted bit position. Compared to the TCR, the OHR uses one extra bit in order to encode the value ‘0’. For example, in a modulus-7 system, a residue with a value of 5 may be represented with 7 bits with the bit pattern {0100000}, whereas a residue with a value of 0 may be represented with 7 bits with the bit pattern {0000001}.
While the value of an OHR is intuitively clear from its bit pattern, it lacks formal mathematical properties (e.g. base-1, base-2) and hence, it is difficult to use the OHR for general mathematical purposes. Nevertheless, the inventors of the present invention have found the OHC to be uniquely useful for representing residues. In particular, addition or subtraction of OHRs may be performed using a circular shifting technique which executes not only the addition or subtraction operation, but also the modulo operation on the output from the addition or subtraction.
For example, consider two modulus-7 residues r1 and r2 which have numerical values of 4 and 5 respectively. Expressing these residues in the OHC format, the following OHRs are obtained.
$r_1 = 0010000$
$r_2 = 0100000$ (25)
The modular sum of these two OHRs can be obtained by executing a circular shift operation on the bits of one of the OHRs, based on the value of the other OHR. For example, to sum r1 and r2, the bits representing r1 are circular shifted by five bit positions to the left (since the value of r2 is 5) such that the bit ‘1’ in the n=4 bit position wraps around the n=0 bit position and moves to the n=2 bit position. This is based on the assumption that in the plurality of bits representing r1, the highest value bit is the leftmost bit in the n=6 bit position and the lowest value bit is the rightmost bit in the n=0 bit position. The output of the above-mentioned circular shifting is thus {0000100}, implying a numerical value of 2, which is consistent with the summing operation: |4+5|7=2. As can be seen, the modulo operation is performed inherently via the wrapping involved in the circular shifting technique.
The OHC based modular adder may be implemented using shifter-based circuits to perform the addition operation without carry propagation. As mentioned above, the circular shifting technique for adding or subtracting the OHRs performs not just the addition or subtraction but also the modulo operation on the output of the addition or subtraction. The implementation of the OHC based modular adder is thus simpler as compared to that of the BC based modular adder.
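By way of illustration only, the circular shifting technique may be sketched as follows using the modulus-7 example of Equation (25). The function names are hypothetical; bit patterns are held as strings with the leftmost character as the highest bit position.

```python
def ohr(value: int, m: int) -> str:
    # One-hot residue: only the bit at position `value` is asserted,
    # with position m-1 on the left and position 0 on the right.
    return "".join("1" if pos == value else "0" for pos in range(m - 1, -1, -1))

def ohr_value(bits: str) -> int:
    return len(bits) - 1 - bits.index("1")

def rotate_left(bits: str, shift: int) -> str:
    shift %= len(bits)
    return bits[shift:] + bits[:shift]

def ohc_mod_add(a_bits: str, b_value: int) -> str:
    # Circular left shift by the value of the other operand; the wrap
    # around performs the modulo operation.
    return rotate_left(a_bits, b_value)

def ohc_mod_sub(a_bits: str, b_value: int) -> str:
    # Subtraction is the same shift in the opposite direction.
    return rotate_left(a_bits, len(a_bits) - b_value % len(a_bits))

r1 = ohr(4, 7)               # 0010000
print(ohc_mod_add(r1, 5))    # 0000100, i.e. |4+5|_7 = 2
```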
With the modular adder 1502 in the form of an OHC based modular adder and the summation values from the summation unit 1406 encoded in the BC format, the accumulator 1508 comprised in the accumulating unit 1408 can be said to have a hybrid design as elaborated below.
As shown in
In particular, at the beginning of each accumulation execution cycle, the register 1504 provides a first input (set to zero) as input A to the OHC based modular adder whereas the DALUT 1506 provides a first summation value (for the modulus associated with the channel) as input B to the OHC based modular adder. The OHC based modular adder then generates a first augend from the first input and the first summation value. This first augend is then stored in the register 1504.
A plurality of iterations is then performed whereby in a first iteration, the register 1504 provides the first augend as input A to the OHC based modular adder and the DALUT 1506 provides a second summation value for the modulus as input B. The OHC based modular adder then generates a second augend from the first augend and the second summation value. The second augend is then stored in the register 1504. Similar steps are performed in the subsequent iterations for the remaining summation values for the modulus. In other words, the OHC based modular adder is configured to successively generate further augends in a plurality of iterations after generating the first augend. A further augend is generated in each iteration from a most recently generated augend and a subsequent summation value provided for the modulus. The register 1504 is configured to store the augend from each iteration and is further configured to provide the OHC based modular adder the most recently generated augend in each iteration.
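By way of illustration only, the hybrid accumulation loop described above may be sketched as follows, with the register contents held in the OHC format and each BC encoded summation value steering the shift. The staged shifter below, in which stage j rotates by 2^j positions when bit j of the summation value is set, is an assumed arrangement chosen for the sketch and is not taken from the figures; all names are hypothetical.

```python
def ohr(value: int, m: int) -> str:
    # One-hot residue with bit position m-1 on the left.
    return "".join("1" if pos == value else "0" for pos in range(m - 1, -1, -1))

def ohr_value(bits: str) -> int:
    return len(bits) - 1 - bits.index("1")

def log_shift(bits: str, amount: int) -> str:
    # One rotate stage per binary bit of `amount` (assumed structure).
    stage = 0
    while (1 << stage) < len(bits):
        if (amount >> stage) & 1:
            s = (1 << stage) % len(bits)
            bits = bits[s:] + bits[:s]
        stage += 1
    return bits

def accumulate_ohc(summation_values, m: int) -> int:
    register = ohr(0, m)                   # first input A is set to zero
    for f in summation_values:             # input B from the DALUT
        register = log_shift(register, f)  # new augend stored in register
    return ohr_value(register)

print(accumulate_ohc([4, 4, 3, 2], 5))  # 3
print(accumulate_ohc([6, 5, 3], 7))     # 0
```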
Compared to the BC based modular adder, the OHC based modular adder based on shifters operates much faster as there are no logic gate delays involved in the operation. Neither does the OHC based modular adder have the carry propagation issue. Instead, the operating speed of the OHC based modular adder is determined solely by the delay of the signal passing through the multiplexers. In addition, the number of transistors used to implement the log shifter circuit of the OHC based modular adder is even lower than that for the BC based modular adder using the ripple carry full adder which is to date, the most area efficient (but slowest) implementation for a binary adder.
As mentioned above, the plurality of bits in each TCR has equal weights. Therefore, the TC based DA-RNS system can be configured to operate at 2-bit-at-a-time (2BAAT) [1] or at an even higher rate to compensate for the longer bit-length of the TCR.
As shown in
The order of addition is not important and the two groups of bit-serial streams i.e. the first and second group of summation values may respectively comprise the summation values arising from even bits and odd bits encoding the TCR of the input signal entries. Alternatively, the first and second group of summation values may respectively comprise the summation values arising from the lower half and the upper half of an N-bit word encoding the TCR of the input signal entries. How the summation values are divided into the first and second groups usually depends on which division is more convenient to implement in hardware.
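By way of illustration only, the effect of processing two bits at a time may be sketched as follows for a single channel. The sketch uses the even/odd division; the lower-half/upper-half division would work in the same way. The function names are hypothetical, the summation values are computed directly rather than read from a DALUT, and the two additions per cycle stand in for two cascaded modular adders. The example residues correspond to the input signal entries x[5] to x[1] of the FIR example, taken modulo 8.

```python
def tc_column(residues, n):
    # Bit n (counted from 0) of each TCR: '1' when the residue exceeds n.
    return [1 if r > n else 0 for r in residues]

def summation_value(coeffs, column, m):
    return sum(a * t for a, t in zip(coeffs, column)) % m

def channel_1baat(coeffs, residues, m):
    acc, cycles = 0, 0
    for n in range(m - 1):                      # one TCR bit per cycle
        acc = (acc + summation_value(coeffs, tc_column(residues, n), m)) % m
        cycles += 1
    return acc, cycles

def channel_2baat(coeffs, residues, m):
    width = m - 1
    acc, cycles = 0, 0
    for n in range(0, width, 2):                # one even/odd pair per cycle
        f_even = summation_value(coeffs, tc_column(residues, n), m)
        # An odd TCR length is padded with a bit '0', contributing nothing.
        f_odd = (summation_value(coeffs, tc_column(residues, n + 1), m)
                 if n + 1 < width else 0)
        acc = (acc + f_even) % m                # first cascaded adder
        acc = (acc + f_odd) % m                 # second cascaded adder
        cycles += 1
    return acc, cycles

coeffs = [3, 11, 15, 11, 3]
residues = [5, 3, 3, 2, 1]
print(channel_1baat(coeffs, residues, 8))  # (6, 7)
print(channel_2baat(coeffs, residues, 8))  # (6, 4)
```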
The DA-RNS based FIR filter in
An FIR low pass filter output y[n] is related to its input signal x[n] through the K filter coefficients Ak as follows:
$y[n]=\sum_{k=0}^{K-1} A_k\,x[n-k]$ (26)
As shown in Equation (26), the operation of the FIR low pass filter comprises multiple inner product computations as a series of input signal entries are made available to the filter.
A 4th order DA-RNS based FIR digital low pass filter designed using the Parks-McClellan algorithm has coefficients as shown below.
y[n]=3x[n]+11x[n−1]+15x[n−2]+11x[n−3]+3x[n−4] (27)
The frequency response of this FIR filter is shown in
To demonstrate the operation of this filter, an input data sequence comprising a plurality of input signal entries is generated. The input data sequence comprises a first signal component with a frequency at about 0.06 fs, i.e. within the passband of the filter and a second signal component with a frequency located at about 0.35 fs. To simplify the numerical conversion between the input signal entries in the form of data binary numbers and their RNS representations later on, the values of the input signal entries are rounded to integer values. The values of the input signal entries are also kept within bounds such that the resultant output dynamic range can be adequately covered using a [5,7,8] moduli set. An example input data sequence generated with 51 points is as follows:
x[n]={1,1,2,3,3,5,5,5,6,5,5,5,3,4,2,1,2,0,0,1,0,2,2,2,5,4,5,6,5,6,5,4,4,2,2,2,0,1,0,0,2,1,2,4,3,5,6,5,6,5} (28)
The input data sequence x[n] is then applied to the 4th order FIR filter, and the output y[n] obtained is as follows:
y[n]={3,14,32,57,86,118,154,187,212,226,230,226,212,190,165,133,100,71,47,28,17,21,39,61,89,125,162,194,215,230,237,230,212,183,147,114,86,61,39,21,17,28,47,71,100,133,168,201,227,237,233} (29)
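By way of illustration only, the filter of Equation (27) may be evaluated directly in ordinary integer arithmetic as in the following minimal sketch, here for the first ten input signal entries of Equation (28). The function name is hypothetical, and input signal entries prior to x[0] are taken as 0.

```python
COEFFS = [3, 11, 15, 11, 3]          # coefficients of Equation (27)

def fir(x, coeffs):
    out = []
    for n in range(len(x)):
        # Entries before x[0] are treated as zero and simply skipped.
        out.append(sum(a * x[n - k] for k, a in enumerate(coeffs) if n - k >= 0))
    return out

x_head = [1, 1, 2, 3, 3, 5, 5, 5, 6, 5]
print(fir(x_head, COEFFS))
# [3, 14, 32, 57, 86, 118, 154, 187, 212, 226]
```

The printed values agree with the first ten values of Equation (29).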
The time domain response of the FIR filter with the input data sequence is also generated using a simulator for visual confirmation of its filtering effect and its operation as intended. The simulated input and output waveforms are shown in
The FIR filter designed is next translated to the DA-RNS based FIR filter in
The summation unit 1406 of the DA-RNS based FIR filter comprises three DALUTs, one for each channel corresponding to a modulus. The DALUT for each of the three channels is derived by calculating the summation values using Equation (22) with the plurality of filter coefficients Ak as follows:
A step-by-step calculation of the DA-RNS based FIR filter response is now presented to demonstrate the filter operation.
In particular, starting with n=0, the first group of residues sent by the conversion unit 1402 are residues of x[0], x[−1], x[−2], x[−3] and x[−4]. At n=1, the second group of residues sent by the conversion unit 1402 are residues of x[1], x[0], x[−1], x[−2] and x[−3]. In general, the residues sent by the conversion unit 1402 will progressively incorporate residues of a subsequent x[n] with residues of 4 prior input signal entries. In a practical causal system, input signal entries prior to x[0] are considered to have a value equal to 0. Hence in this case, the response of the DA-RNS based FIR filter will reach a steady state at n=4. The following shows the detail of the data operation for the three channels, A, B and C corresponding to the three moduli 5, 7 and 8.
The DALUT outputs corresponding to each row of bits received i.e. summation values provided by the summation unit 1406 are indicated under the “DALUT entries (m=5)” column. For each time instance n, four summation values are provided and are modulo-5 accumulated over four clock cycles as shown under the “Mod-5 Acc” column in the table of
The output from the 4th clock cycle (i.e. at tcycle=3) is the inner product for the modulus 5 derived from residues of the input signal entries x(n), x(n−1), x(n−2), x(n−3) and x(n−4) at time instance n and the filter's coefficients Ak. From the table of
$y_5[n]=\{3,4,2,2,1,3,4\}$ (31)
Similar steps are used to derive the output of the modulus-7 channel B. As the TCR bit-length is 6 bits long for this channel, the resultant inner product for the modulus 7 is obtained in the 6th clock cycle (indicated as tcycle=5) as shown under the “Mod-7 Acc” column in the table of
$y_7[n]=\{3,0,4,1,2,6,0\}$ (32)
iii) Channel C for Modulus 8
Similar steps are used to derive the output of the modulus-8 channel C. As the TCR bit-length is 7 bits long for this channel, the resultant inner product for the modulus 8 is obtained in the 7th clock cycle (indicated as tcycle=6) as shown under the “Mod-8 Acc” column in the table of
$y_8[n]=\{3,6,0,1,6,6,2\}$ (33)
Consolidating the outputs of all three channels from Equations (31), (32) and (33) for n=0 to 6, the output data sequence of the FIR filter, in RNS representation is as follows.
For n=0 to 6:
y[n]={<3,3,3>,<4,0,6>,<2,4,0>,<2,1,1>,<1,2,6>,<3,6,6>,<4,0,2>} (34)
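By way of illustration only, the three channels may be simulated in software as in the following minimal sketch, which addresses a table of summation values with one TCR bit of each input signal entry per cycle and accumulates modulo the channel's modulus. All names are hypothetical, and the address bit ordering is an assumption of the sketch.

```python
COEFFS = [3, 11, 15, 11, 3]
MODULI = [5, 7, 8]
X = [1, 1, 2, 3, 3, 5, 5]            # first seven entries of Equation (28)

def build_dalut(coeffs, m):
    # Summation value for every address: sum of selected coefficients mod m.
    return [sum(a for k, a in enumerate(coeffs) if (addr >> k) & 1) % m
            for addr in range(2**len(coeffs))]

def channel_output(window, m, dalut):
    residues = [x % m for x in window]
    acc = 0
    for n in range(m - 1):            # one TCR bit position per clock cycle
        # Address bit k is bit n of the TCR of entry k ('1' if residue > n).
        address = sum((1 if r > n else 0) << k for k, r in enumerate(residues))
        acc = (acc + dalut[address]) % m
    return acc

def da_rns_fir(x, coeffs, moduli):
    daluts = {m: build_dalut(coeffs, m) for m in moduli}
    out = []
    for n in range(len(x)):
        # x[n], x[n-1], ..., with entries before x[0] taken as zero.
        window = [x[n - k] if n - k >= 0 else 0 for k in range(len(coeffs))]
        out.append(tuple(channel_output(window, m, daluts[m]) for m in moduli))
    return out

print(da_rns_fir(X, COEFFS, MODULI))
# [(3, 3, 3), (4, 0, 6), (2, 4, 0), (2, 1, 1), (1, 2, 6), (3, 6, 6), (4, 0, 2)]
```

The printed residue triples agree with Equation (34).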
The correctness of this RNS based output can be confirmed by performing a reverse conversion using the Chinese Remainder Theorem (CRT) to find its binary representation. The CRT's reverse conversion formula is as follows (see reference [2]):
where
Applying the values used in this example, the CRT expression of Equation (35) becomes:
$Y=\left|\,56\,|y_1|_5+40\,|3y_2|_7+35\,|3y_3|_8\,\right|_{280}$ (36)
Substituting the residue digit values, i.e. the RNS representation of the output of the DA-RNS based FIR filter as shown in Equation (34), into Equation (36), the binary representation of the y[n] output can be obtained as follows.
Starting with n=0:
The other binary values corresponding to n=1 to 6 can be similarly calculated and the y[n] output values for these n=1 to 6 are as follows.
y[1]=<4,0,6>≅14
y[2]=<2,4,0>≅32
y[3]=<2,1,1>≅57
y[4]=<1,2,6>≅86
y[5]=<3,6,6>≅118
y[6]=<4,0,2>≅154 (38)
These calculated values are exactly the same as the first seven values given in Equation (29), hence confirming the accurate operation of the DA-RNS based FIR filter and the TC based DA-RNS system.
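By way of illustration only, the reverse conversion may be sketched as follows for the [5,7,8] moduli set. The function name is hypothetical; the three-argument `pow` with an exponent of −1 (available from Python 3.8) is used to obtain the multiplicative inverses, which for this moduli set are 1, 3 and 3, matching the weights appearing in Equation (36).

```python
from math import prod

def crt_reverse(residues, moduli):
    big_m = prod(moduli)                      # 280 for the [5,7,8] set
    total = 0
    for r, m in zip(residues, moduli):
        m_i = big_m // m                      # 56, 40 and 35 respectively
        inverse = pow(m_i, -1, m)             # multiplicative inverse mod m
        total += m_i * ((inverse * r) % m)
    return total % big_m

rns_outputs = [(3, 3, 3), (4, 0, 6), (2, 4, 0), (2, 1, 1),
               (1, 2, 6), (3, 6, 6), (4, 0, 2)]
print([crt_reverse(y, [5, 7, 8]) for y in rns_outputs])
# [3, 14, 32, 57, 86, 118, 154]
```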
To further demonstrate the practical feasibility of the TC based DA-RNS system, circuit level simulations using a PSPICE simulator are performed.
The 1BAAT design for the TC based DA-RNS system is shown in
As the bit-lengths used by the modulus 7 and 8 channels are longer, if these channels are implemented using the 1 BAAT, the accumulation for each time instance n would take 6 and 7 clock cycles respectively, as indicated in the tables of
The following presents the circuit and simulation results of the 2BAAT operation for the modulus-8 channel C.
To demonstrate the flexibility of the TC based DA-RNS system, two bit-serial streams are created for each channel. In particular, for the modulus-8 channel, a first bit-serial stream is created from the lower four bits of the TCR of each input signal entry, and a second bit-serial stream is created from the upper three bits of the TCR of each input signal entry. The second bit-serial stream is padded with one extra bit ‘0’ to balance the two bit-serial streams. These two bit-serial streams are then sent in parallel in the 2BAAT bit-serial manner to the summation unit 1406 which contains the two DALUTs for the modulus-8 channel of the DA-RNS based FIR filter. The BC encoded output i.e. summation values provided by each of the two DALUTs is then fed to respective ones of the cascaded modulus-8 adders of
The simulation results above confirm the practical feasibility of the TC based DA-RNS system. Using a combination of TC, BC and OHC formats, an efficient means to perform DA-RNS based inner product calculation can be achieved by the TC based DA-RNS system. To compensate for the longer bit-lengths of the TCRs, higher BAAT rates can be used. This possibility arises as the bits in the TCRs have equal weights and the operating principles of the TC based DA-RNS system are not complex.
An advantage of the TC based DA-RNS system lies in its simple accumulation operation during the computation of the inner product. Compared to the scaling accumulator for the BC based DA system (see
The superiority of the TC based DA-RNS system over the BC based DA and BC based DA-RNS systems thus hinges on the effectiveness of its modular adder. This section compares the performance and complexity of the OHC based modular adder against the BC based modular adder comprising binary adders.
A BC based modular adder requires two binary adders of either 3-bit or 4-bit arranged in the manner as shown in
Two standard representative binary adders may be used in the BC based modular adder for the comparison against the OHC based modular adder. These are the ripple carry full adder and the carry-look-ahead full adder. The ripple carry full adder is the most hardware efficient but slowest implementation of the binary adders, while the carry-look-ahead full adder is one of the fastest binary adders but has a high hardware circuit complexity. Note that special modular adders that are optimized for specific classes of moduli (e.g. 2^n and the like) are not considered in the comparison as the purpose of the comparison is to evaluate adders that may be employed in systems using generic moduli.
a) shows a logic gate implementation for one bit of the ripple carry full adder [9]. A 3-bit or 4-bit ripple carry full adder may use three or four such circuits, respectively.
A 4-bit binary carry-look-ahead full adder may be implemented with the circuit in
To implement the OHC based modular adder, moduli with values varying between 5 and 13 are used. With such moduli, the circuits for the OHC based modular adder may be realized in a more practical manner in terms of the hardware implementation. Furthermore, such moduli can form a moduli set with a dynamic range of more than 2^16, sufficient for most practical cases. In an OHC based modular adder using a modulus value of m, the number of multiplexers needed in the log shifter circuit with the arrangement as shown in
Gate count comparison between the OHC based modular adder and the BC based modular adder is difficult as the multiplexers in the OHC based modular adder are usually realized using transistor based circuits such as the 4-transistor based CMOS Transmission Gate or the 2-transistor based Pass-Transistor logic. Hence, it is more appropriate to compare the hardware complexity of the OHC based modular adder and the BC based modular adder in terms of transistor count. However, this does not reflect the complexity involved in the wiring of the underlying circuits. The transistor count comparison is performed based on the following: a total of 6 transistors is used for each 2-input XOR logic gate, a total of 4 transistors is used for each of all other types of 2-input logic gates, a total of 2 transistors is used for each extra input pin and a total of 2 transistors is used for each NOT gate. Each multiplexer is considered to comprise the 4-transistor based CMOS transmission gate as this is a fairly conservative design. One NOT gate is shared among all multiplexers to generate the internal complement shift control signal.
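The counting rules above can be written down as a short sketch. The function names are hypothetical, and the two example inventories are assumptions made for illustration only: a textbook one-bit full adder built from two XOR, two AND and one OR gate (which need not match the circuit in the figures), and an arbitrary multiplexer count of 20 for the log shifter.

```python
def gate_transistors(kind, inputs=2):
    """Transistors for one logic gate under the stated counting rules."""
    if kind == "NOT":
        return 2
    base = 6 if kind == "XOR" else 4   # 2-input XOR: 6, other 2-input gates: 4
    return base + 2 * (inputs - 2)     # 2 transistors per extra input pin

def circuit_transistors(gates):
    """Total for a list of (kind, inputs, count) tuples."""
    return sum(gate_transistors(kind, n) * count for kind, n, count in gates)

def ohc_adder_transistors(num_multiplexers):
    """4 transistors per transmission-gate multiplexer, plus one shared NOT
    gate for the complement shift control signal."""
    return 4 * num_multiplexers + gate_transistors("NOT", 1)

# Assumed textbook full-adder bit: 2 XOR + 2 AND + 1 OR (illustrative only).
full_adder_bit = [("XOR", 2, 2), ("AND", 2, 2), ("OR", 2, 1)]
print(circuit_transistors(full_adder_bit))  # prints: 24
print(ohc_adder_transistors(20))            # prints: 82 (20 is a hypothetical count)
```

As noted above, such a count says nothing about wiring complexity, so it is indicative only.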
Critical path gate-delay comparison is based on the longest path that a signal propagates through the circuits of the OHC based modular adder and the BC based modular adder. For the BC based modular adder, this is equal to the delay through the two binary adders 1602, 1604 to generate the S″ value for the output multiplexer 1606 as shown in
To provide a more definitive comparison, a HSPICE simulation is performed to implement a ripple carry full adder based on 65 nm technology to determine the time a signal takes to travel the critical path B0 to C0 shown in
A HSPICE simulation is also performed to estimate the signal propagation delay through a log shifter circuit comprising four multiplexers in cascade (such a log shifter circuit is suitable for an OHC based modular adder using moduli up to a value of 15). The latency, or signal propagation delay, measured via the simulation is 8.8 psec; in other words, an estimated delay of 2.2 psec is incurred as the signal travels through each multiplexer. The comparisons in this section are performed based on this estimate to obtain some indicative performance values, and to verify that using the OHC based modular adder is advantageous as compared to the BC based modular adder. However, note that in practice, the propagation delay of the signal through each multiplexer may vary depending on the actual output load, layout related parasitic effects, and the skill of the designer.
As shown in
The following describes some advantages of the system 1400, particularly the system 1400 in the form of the TC based DA-RNS system.
The TC format is normally not popular, as it appears inefficient due to the seemingly excessive number of bits it requires to represent typical data (e.g. at 8-bit resolution). Hence, using the TC format with the RNS seems to be disadvantageous, as it appears to nullify the RNS's benefit of having shorter word-lengths. Rather, such a benefit appears to be better achieved when the more conventional BC format is used for the DA-RNS implementation.
However, the inventors of the present invention have found that despite the seemingly higher number of bits required by the TC format, the TC format brings about unexpected and non-obvious advantages when used with the RNS. These advantages allow the TC format to be an attractive replacement for the BC format when used with a DA-RNS system. The use of the TC format enables the benefits of using the RNS with the DA technique to be truly realizable in a very efficient manner using simple circuit design.
One of the advantages is that when the TC format is used, the complications arising due to the 2^n scaling factor encountered when using the BC format may be avoided. The accumulators required in a TC based DA-RNS system may hence be implemented in a much simpler manner (see
Furthermore, as compared to the BC based DA-RNS system, much simpler yet very efficient TCR modular arithmetic (for example, TCR modular addition) can be used in the TC based DA-RNS system. The modular addition or modular accumulation operations may be made even simpler and faster by using an OHC based modular adder. Using the OHC based modular adder overcomes the inefficient carry propagation as well as the complications due to the modulo operation associated with performing the modular addition with a BC based modular adder. Therefore, the operating speed of the OHC based modular adder is superior to that of the BC based modular adder. The OHC based modular adder may also be implemented using simple log shifter based circuits. When the OHC based modular adder is used, the TC based DA-RNS system outputs data encoded with the OHC format. Output data in this format may be converted to data in the BC format using a look up table (LUT) based encoder design, such as the binary encoder.
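The log shifter principle can be sketched behaviourally as follows. This is a minimal model under stated assumptions, not the patented circuit: the running sum is assumed to be held in OHC as a list of m bits, the value to be added is assumed to arrive as a BC encoded residue smaller than m, and each set bit of that residue enables one cascaded stage that cyclically shifts the one-hot vector by 1, 2, 4 or 8 positions. All function names are hypothetical. Because the shift is cyclic over m positions, the modulo operation is implicit and no carry is propagated.

```python
def one_hot(value, modulus):
    """OHC word: a single 1 at position `value` in a vector of `modulus` bits."""
    return [1 if i == value else 0 for i in range(modulus)]

def rotate(code, shift):
    """Cyclically move every bit up by `shift` positions (one shifter stage)."""
    m = len(code)
    return [code[(i - shift) % m] for i in range(m)]

def ohc_mod_add(a_code, b):
    """Add binary residue `b` (0 <= b < m) to the OHC word `a_code` modulo m,
    using cascaded stages that shift by 1, 2, 4, ... when enabled."""
    m = len(a_code)
    out = a_code
    stage = 0
    while (1 << stage) < m:
        if (b >> stage) & 1:          # stage enabled by bit `stage` of b
            out = rotate(out, 1 << stage)
        stage += 1
    return out

def ohc_to_binary(code):
    """Stand-in for the LUT based binary encoder: position of the hot bit."""
    return code.index(1)

total = ohc_mod_add(one_hot(5, 7), 4)
print(ohc_to_binary(total))  # prints: 2, since (5 + 4) mod 7 = 2
```

For the moduli between 5 and 13 considered in the comparison, this model needs at most four stages, so the number of cascaded stages grows only logarithmically with the modulus.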
The performance of the TC based DA-RNS system may be further enhanced with an efficient implementation of the modular accumulators such that the TC based DA-RNS system can be operated at a higher clock rate as well as at higher bit-at-a-time (BAAT) rates [1].
In addition, the DR of each residue digit in the RNS is bounded by its modulus. For example, with a [7,8,9] moduli set, the word-length of a residue digit in the TC format is just 6, 7 and 8 bits for moduli 7, 8 and 9 respectively. These word-lengths are similar to the word-lengths of the binary numbers that may be represented by the moduli set [7,8,9] if these numbers were encoded in the BC format (in particular, the binary numbers that may be represented by the moduli set [7,8,9] are in the range of [0,504)). Therefore, using the TC format with the RNS does not lead to excessive bit-lengths when compared to the BC based DA design.
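The figures in this example can be checked with a few lines of arithmetic. The sketch below assumes that a residue for modulus m is thermometer coded with m−1 bits and that the moduli are pairwise coprime, so that the dynamic range is their product; the function names are hypothetical.

```python
from math import prod  # available from Python 3.8

def tc_word_lengths(moduli):
    """Bits per residue digit in the TC format: modulus - 1 for each channel."""
    return [m - 1 for m in moduli]

def dynamic_range(moduli):
    """Number of distinct values representable by pairwise coprime moduli."""
    return prod(moduli)

def bc_word_length(moduli):
    """Bits needed to hold the largest representable value in the BC format."""
    return (dynamic_range(moduli) - 1).bit_length()

moduli = [7, 8, 9]
print(tc_word_lengths(moduli))  # prints: [6, 7, 8]
print(dynamic_range(moduli))    # prints: 504
print(bc_word_length(moduli))   # prints: 9
```

Each TC residue word (6 to 8 bits) is therefore no longer than the 9-bit BC word covering the same range [0,504).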
As mentioned above, a TC based DA-RNS system comprising a DA-RNS based FIR filter is designed and implemented with its operation simulated using the PSPICE simulator. The simulation results validate the accuracy and practical feasibility of the TC based DA-RNS system. A broad performance comparison against the BC based DA system also shows that there is no penalty incurred in terms of transistor count and latency for the TC based DA-RNS system. Instead, there is a potential to run the TC based DA-RNS system at a higher clock rate or a higher BAAT rate (using parallel bit-serial operations) to further enhance the throughput performance of the system.
In the TC based DA-RNS system, one important practical consideration for using RNS based modular arithmetic is that a forward conversion is required to first convert the input signal (with levels coded in conventional numbers) to its residues. This is likely to be a costly operation and usually hinders the wide adoption of RNS in real world applications. This problem may be overcome by using the ADC 300 in the conversion unit 1402 of the TC based DA-RNS system as the data generated during the conversion by the ADC 300 are inherently output in the RNS pattern and with the TC format. As such, there is no extra overhead needed to convert the input signal to its RNS representation, and signal processing arithmetic operations on the input signal can be performed using the TCRs directly.
Filing Document | Filing Date | Country | Kind | 371c Date |
---|---|---|---|---|
PCT/SG2012/000160 | 5/7/2012 | WO | 00 | 12/30/2013 |
Number | Date | Country | |
---|---|---|---|
61502869 | Jun 2011 | US |