1. Field of the Invention
This invention relates to electronic circuitry and, more particularly, to reconfigurable circuitry which may be utilized to carry out a number of different logical operations on data.
2. History of the Prior Art
The manipulation of data in modern electronic systems requires the transmission of data from place to place, for example, within a local network or over the internet. The speed of data transmission has increased (and continues to increase) as various new forms of hardware are created and older forms are improved. However, no matter how rapidly data is transmitted, it is only useful if transmitted without errors.
There are a number of methods for testing to determine if data has been correctly transmitted. One of these methods makes use of what is referred to as a cyclical redundancy check (CRC) to ascertain whether data being received is what was in fact sent. Typically, at the sending station the cyclical redundancy check process adds to the data stream of a message a sequence of bits generated from that data stream, and then determines at the receiving station whether the message with the added bits is correct. Since the bits added at the sending station are determined by the content of the message being sent, those added bits may be tested against the message to determine (within reason) whether the message received is correct.
Historically, the cyclical redundancy check was conducted on a bit serial basis as the data was received. Such a process functions well when the bits of a message are transmitted serially and at slower speeds. However, as faster transmission speeds are attained (including those attained by parallel bit transmission) such a test requires a relatively large amount of time and significantly reduces the overall speed of message transmission. Recently, software techniques have been devised for handling portions of the operation in parallel to decrease the time required for the cyclical redundancy check. For example, a paper entitled Fast Parallel CRC Algorithm and Implementation on a Configurable Processor, Ji and Killian, ICC2002-IEEE International Conference on Communications, vol. 25, no. 1, April 2002, pp. 1813-17, demonstrates an operation by which a cyclical redundancy check is accomplished on a message of indeterminate length by testing message portions of consistent segment lengths and then combining the results of the tests on the message portions.
Although such techniques have decreased the time required for the cyclical redundancy check, even more transmission speed can be attained by a hardware solution. However, once chip area is allocated to a hardware solution, the circuitry is typically useful for accomplishing only the limited purposes for which it was devised.
To this end, it is desirable to enhance the speed of data transmission by hardware techniques which may be utilized to carry out a large variety of different logical operations on data.
It is an object of the present invention to enhance the speed of data transmission by providing circuitry which may be utilized for a variety of purposes including providing and testing the correctness of cyclical redundancy check values.
The present invention is realized by a reconfigurable arithmetic circuit including a matrix having a plurality of partial product mask cells arranged in rows and columns, where rows and columns have incrementing arithmetic weights assigned, each partial product mask cell including a gate implementing a logical AND function of its inputs to provide an output, and a programmable memory cell connected to furnish input to the gate, a plurality of horizontally oriented conductors each connected to furnish input to the gates of the partial product mask cells of a row, and a plurality of diagonally oriented conductors each connected to furnish input to the gates of the partial product mask cells along the diagonal of increasing arithmetic weight of rows and columns, and a compression circuit receiving inputs from the gates of the partial product mask cells of the matrix, and furnishing outputs providing conventional arithmetic compression of its inputs in carry-saved format.
These and other objects and features of the invention will be better understood by reference to the detailed description which follows taken together with the drawings in which like elements are referred to by like designations throughout the several views. It is to be understood that, in some instances, various details of the invention may be shown exaggerated or otherwise modified to facilitate an understanding of the invention. Moreover, some aspects of the invention considered to be conventional may not be shown so as to avoid obfuscating more important aspects or features of the invention.
As has been pointed out, a cyclical redundancy check is used to determine that data which has been received is the data which was, in fact, sent. A cyclical redundancy check usually appends a sequence of bits to the data stream of a message at the sending station, and then determines at the receiving station whether the message including the appended bits is correct. The appended bits are generated at the sending station based on the message being sent. Consequently, the appended bits may be tested against the message received to determine (within reason) whether it is correct.
A cyclical redundancy check may be utilized in a great many aspects of data transmission. It may be used to determine that data sent over the internet or other communications has been correctly received. It may be used for the same purpose in local area networks. It may be used to determine within a specific piece of data manipulation hardware (such as a computer) whether data is correctly transmitted and received by the different components of that piece of hardware.
Because a cyclical redundancy check may be (and often is) used in so many aspects of data transmission, the speed with which the check takes place necessarily affects the speed at which data may be transmitted. Classically, the operations of generating and later testing a cyclical redundancy value appended to serially transmitted data were accomplished a bit at a time as the data was transferred. As data transmission speed has increased by improvements such as increasing the width of the data path, such a method has become much too time consuming.
Consequently, methods have been devised for generating and testing a cyclical redundancy value while handling bit portions of the transmitted data in parallel. Typically, these methods are accomplished by executing a specified algorithm on a processor to manipulate the data to generate or check a cyclical redundancy value.
As with many elements of data manipulation, a hardware solution can provide results faster than can a software solution. However, a hardware solution is typically not as flexible as a software solution.
The present invention offers a hardware solution to the operations of obtaining and checking cyclical redundancy values which solution overcomes the limitations of prior art solutions. More particularly, the hardware of the present invention is uniquely reconfigurable so that it may be used for a variety of different forms of cyclical redundancy check computations and may be converted to accomplish a plurality of other logical functions as well.
In order to understand the invention, it will be useful to understand how a cyclical redundancy check is carried out. First, an introduction to the Galois field framework is useful.
The cyclical redundancy check of concern is implemented utilizing Galois field arithmetic in an extension field GF(2^N) of the finite field referred to as GF(2). In GF(2), which has elements that can be represented by single binary digits, the addition operation is a modulo 2 addition which is equivalent to an exclusive OR (XOR) operation, and the multiplication operation is equivalent to an AND operation. Because of these equivalences, Galois field arithmetic is useful in implementing data operations. Elements of the extension field GF(2^N) may be represented as single-digit binary coefficients of a polynomial in x of degree-bound N, or as a binary number of N digits. The addition operation in GF(2^N) is obtained by performing the addition in GF(2) on the coefficients of equal power of x, in the polynomial representation, or as a bit-wise XOR, in the binary number representation.
In order to define multiplication in GF(2^N), a primitive polynomial G(x) must be selected. This primitive polynomial G(x) is a polynomial of degree N that is prime (i.e., it cannot be factored into smaller polynomials) and that has at least one zero that is a generator of the field GF(2^N). A generator of GF(2^N) is an element that, when raised to the powers 1 through 2^N−1 sequentially, will cycle through each non-zero element of GF(2^N) exactly once [see Fast CRC Calculation, R. J. Glaise and X. Jacquart, Proceedings of the IEEE International Conference on Computer Design: VLSI in Computers, pp. 602-605, 1993. (IBM)]. In GF(2^N) with primitive polynomial G(x), any intermediate result R(x) that has a polynomial representation of degree N or higher must be folded back into a polynomial of degree-bound N by calculating its value modulo G(x) (further referred to as mod G(x)), which is obtained by performing a polynomial division of R(x) by G(x) and determining the remainder. In order to calculate the multiplication of two elements A(x) and B(x) of GF(2^N), the raw polynomial multiplication C(x), which is of degree-bound 2N, is first calculated; and then the remainder R(x)=C(x) mod G(x) is calculated.
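By way of illustration only, the multiplication just described may be sketched in software as follows. The sketch is not the circuitry of the invention. The function names clmul, poly_mod and gf_mul, and the choice of a small example field GF(2^8) with G(x)=x^8+x^4+x^3+x^2+x^0, are assumptions made for the example. A polynomial is held as an integer whose bit i is the coefficient of x^i.

```python
def clmul(a: int, b: int) -> int:
    """Raw polynomial product over GF(2): shift and XOR, no carries."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a <<= 1
        b >>= 1
    return result


def poly_mod(r: int, g: int) -> int:
    """Remainder of r(x) after polynomial division by g(x) over GF(2)."""
    deg_g = g.bit_length() - 1
    while r.bit_length() - 1 >= deg_g:
        # Cancel the leading term of r(x) with a shifted copy of g(x).
        r ^= g << (r.bit_length() - 1 - deg_g)
    return r


def gf_mul(a: int, b: int, g: int) -> int:
    """Multiply in GF(2^N): raw product C(x), then R(x) = C(x) mod G(x)."""
    return poly_mod(clmul(a, b), g)


# Assumed example field for this sketch: GF(2^8), G(x) = x^8+x^4+x^3+x^2+1.
G8 = 0b1_0001_1101

# Raise the element x (binary 10) to the powers 1 .. 2^8 - 1 and record
# each result; a generator visits every non-zero element exactly once.
powers = []
p = 1
for _ in range(2**8 - 1):
    p = gf_mul(p, 0b10, G8)
    powers.append(p)

print(len(set(powers)), "distinct non-zero elements visited")
```

In this sketch the recorded powers give a simple check that the assumed G(x) is primitive: all 255 non-zero elements appear, and the last power returns to 1.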
In its simplest form, the basic operation of a cyclical redundancy check transmitter is to extend the message data with a number of zeroes matching the order of the primitive polynomial as the least significant bits; and then, while treating the obtained value as a polynomial with binary coefficients, determine the modulo G(x) remainder (which is the CRC value), and replace the padded zeroes by this CRC value in the data that is being transmitted. The data is transmitted most-significant-bit first, so the CRC value follows after the original message has been transmitted. The result is a data stream having a value which is an exact multiple of the primitive polynomial.
When the data is received, the data including the CRC value in polynomial representation is divided by the same primitive polynomial used in generation of the CRC value. If the remainder of this division is zero, it is highly probable that the data has been transmitted and received correctly. The hardware required for implementing the CRC transmitter and receiver is largely identical.
It should be noted that in this scheme, the same primitive polynomial is used to accomplish the division both in generating and in testing the cyclical redundancy check value. Thus, the value used for the division need not be part of the information transferred with a message.
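The transmit and receive operations described above may be sketched, under simplifying assumptions, as follows. The names crc_append and crc_check and the sample message value are hypothetical. The sketch models only the simplest form described here (zero padding, remainder, replacement of the padding) and does not model bit-ordering or initialization conventions that a practical protocol may add.

```python
G32 = 0x104C11DB7  # 33-bit value of the CRC32 polynomial


def poly_mod(r: int, g: int) -> int:
    """Remainder of r(x) after polynomial division by g(x) over GF(2)."""
    deg_g = g.bit_length() - 1
    while r.bit_length() - 1 >= deg_g:
        r ^= g << (r.bit_length() - 1 - deg_g)
    return r


def crc_append(message: int, g: int) -> int:
    """Transmitter: pad with zeroes, take the remainder, replace the padding."""
    n = g.bit_length() - 1          # order of the polynomial
    padded = message << n           # message extended with n zero bits
    crc = poly_mod(padded, g)       # the CRC value
    return padded | crc             # padded zeroes replaced by the CRC value


def crc_check(codeword: int, g: int) -> bool:
    """Receiver: a zero remainder suggests correct transmission."""
    return poly_mod(codeword, g) == 0


message = 0x48656C6C6F              # arbitrary sample data (assumption)
codeword = crc_append(message, G32)
corrupted = codeword ^ (1 << 17)    # flip a single bit in transit

print(crc_check(codeword, G32), crc_check(corrupted, G32))
```

Because the same polynomial value is used on both sides, only the message and its appended CRC value travel over the link in this sketch.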
The polynomial division modulo G(x) can easily be serialized; and, in fact, the first methods for accomplishing a cyclical redundancy check using Galois field arithmetic operated serially upon the sequential bits of a message (and the message with an appended cyclical redundancy check value). As has been pointed out, such an operation functions well for sequentially transmitted messages but slows faster transmission methods. Consequently, methods for handling sequences of bits in parallel have been devised. Typically, the methods involve selecting from a message bit portions or segments having a common bit width, treating those segments individually to modulo G(x) division by means of software executing on a processor, and combining the results to complete the cyclical redundancy check for any message being transmitted whatever its data width. Examples of such methods are illustrated in Fast Parallel CRC Algorithm and Implementation on a Configurable Processor, H. M. Ji and E. Killian, ICC2002—IEEE International Conference on Communications, vol. 25, no. 1, April 2002, pp. 1813-1817.
The present invention is primarily concerned with calculating the CRC value serially on message portions that have a length that is a multiple of the size of the primitive polynomial G(x). The following discussion details how this can be achieved. The Galois field operations can be efficiently implemented according to the present invention.
Assume now a message M consisting of L binary values, where L is very large (L>>N). This message can be represented by a polynomial M(x) of degree-bound L.
To illustrate this, consider the Galois field GF(2^32) defined by the primitive polynomial:
G(x)=x^32+x^26+x^23+x^22+x^16+x^12+x^11+x^10+x^8+x^7+x^5+x^4+x^2+x^1+x^0 [Eq. 1]
as defined for use in The Ethernet, A Local Area Network, Data Link Layer and Physical Layer Specifications, Digital Equipment Corporation, Intel Corporation and Xerox Corporation, Stamford, Conn., Version 1.0. Sep. 30, 1980, page 22.
Although this example utilizes N=32 and treats CRC32 using the example polynomial, the results are intended for the general case.
Elements of GF(2^32) can be represented as binary thirty-two bit vectors, e.g., A[31:0], B[31:0] or as the equivalent polynomials in GF(2^32) as follows:
A(x)=A[31]x^31+ . . . +A[1]x^1+A[0]x^0 [Eq. 2]
B(x)=B[31]x^31+ . . . +B[1]x^1+B[0]x^0 [Eq. 3]
Similarly, G(x) has a corresponding thirty-three bit binary value:
G[32:0]=1 00000100 11000001 00011101 10110111 [Eq. 4]
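As a small check, the binary value of Equation 4 may be rebuilt from the exponents of Equation 1. The variable names below are assumptions made for the sketch.

```python
# Exponents of the non-zero terms of G(x) in Equation 1.
EXPONENTS = [32, 26, 23, 22, 16, 12, 11, 10, 8, 7, 5, 4, 2, 1, 0]

g = 0
for e in EXPONENTS:
    g |= 1 << e                     # set the coefficient of x^e

bits = format(g, "033b")            # thirty-three binary digits
# Group as in Equation 4: the leading bit, then four groups of eight.
grouped = bits[0] + " " + " ".join(bits[i:i + 8] for i in range(1, 33, 8))

print(grouped)                      # 1 00000100 11000001 00011101 10110111
print(hex(g))                       # 0x104c11db7
```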
A polynomial A(x) can be multiplied by a polynomial B(x) for the case where B(x) is fixed. This is also referred to as Galois field scaling.
To scale A[31:0] by B[31:0] in GF(2^32) defined by G(x) is equivalent to finding:
R(x)=(A(x)*B(x))mod G(x) [Eq. 5]
where mod G(x) is the modulo G(x) operation, or the remainder of the operand after polynomial division by G(x) in GF(2^32).
The following derivation relies on properties of the modulo G(x) operator:
(A(x)+B(x))mod G(x)=A(x)mod G(x)+B(x)mod G(x)
(A(x)*B(x))mod G(x)=(A(x)*(B(x)mod G(x)))mod G(x).
Equation 5 can be rewritten as:
R(x)=((A[31]x^31+ . . . +A[1]x^1+A[0]x^0)*B(x))mod G(x)
Which in turn equals:
R(x)=A[31]*F[31](x)+ . . . +A[1]*F[1](x)+A[0]*F[0](x)
where F(x) is a vector of polynomials [F[0](x) . . . F[31](x)], and where each F[i](x)=(x^i*B(x)) mod G(x). In other words, F[i](x) is the remainder of the polynomial division of x^i*B(x) by G(x). In recursive form this yields:
F[i](x)=(x*F[i−1](x))mod G(x) [Eq. 10]
Since F[i−1](x) mod G(x)=F[i−1](x), x*F[i−1](x) is of order 32 at most. When it is of order 32, a simple XOR of x*F[i−1](x) with G(x) suffices to calculate the mod G(x) remainder; otherwise x*F[i−1](x) is already the remainder.
Hence, a recursive method to pre-calculate the coefficients F[i][31:0] of F[i](x) is as follows:
F[0][31:0]=B[31:0]
for i in range 1 to 31 with increments of 1:
F[i][31:0]=(F[i−1][30:0] shifted left by one bit) XOR (G[31:0] if F[i−1][31] is 1, otherwise 0)
For each vector F[i][31:0], this describes a left shift of F[i−1][31:0], with conditional XOR with the coefficients of the primitive polynomial G(x) if F[i−1][31] is 1. An example of values for the matrix F[31:0][31:0] for scaling by B(x)=x^32 for the primitive polynomial G(x) of [Equation 1], obtained using this method, is shown in
R[i]=F[i][0]*A[0]+F[i][1]*A[1] . . . F[i][31]*A[31] [Eq. 11]
Hence, the result R[31:0] of scaling of a variable A[31:0] by a constant in the Galois field GF(2^32) can be obtained as thirty-two wide XOR operations on different subsets of elements of the variable A[31:0], where inclusion in a particular XOR operation is predetermined. Such a calculation can be performed using a structure according to the present invention.
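The pre-calculation of the vectors F[i] and the resulting scaling operation may be sketched as follows. The function names are assumptions made for the sketch, as is the index convention (F[i] holds the coefficients of (x^i*B(x)) mod G(x), with bit k the coefficient of x^k), which may be the transpose of the arrangement used in a particular figure or equation. The second routine forms each result bit as the parity of an AND of A with a predetermined mask, which is the AND followed by wide XOR structure described above.

```python
N = 32
G32 = 0x104C11DB7                   # Equation 4, including the x^32 term
MASK = (1 << N) - 1


def poly_mod(r: int, g: int) -> int:
    """Reference remainder over GF(2), used only to obtain B and to check."""
    deg_g = g.bit_length() - 1
    while r.bit_length() - 1 >= deg_g:
        r ^= g << (r.bit_length() - 1 - deg_g)
    return r


def precompute_f(b: int) -> list:
    """F[i] = (x^i * B(x)) mod G(x), by left shift and conditional XOR."""
    f = [b & MASK]
    for _ in range(1, N):
        shifted = f[-1] << 1        # multiply F[i-1](x) by x
        if shifted >> N:            # F[i-1][31] was 1, so order 32 was reached
            shifted ^= G32          # a single XOR with G(x) gives the remainder
        f.append(shifted)
    return f


def scale(a: int, f: list) -> int:
    """R(x) = sum of A[i]*F[i](x): XOR the F[i] selected by the bits of A."""
    r = 0
    for i in range(N):
        if (a >> i) & 1:
            r ^= f[i]
    return r


def scale_by_parity(a: int, f: list) -> int:
    """Same result, one output bit at a time, as AND followed by wide XOR."""
    r = 0
    for k in range(N):
        mask = 0
        for i in range(N):
            mask |= ((f[i] >> k) & 1) << i      # predetermined mask for bit k
        r |= (bin(a & mask).count("1") & 1) << k
    return r


B = poly_mod(1 << 32, G32)          # B(x) = x^32 folded into the field
F = precompute_f(B)
A = 0xDEADBEEF                      # arbitrary sample operand (assumption)
R = scale(A, F)
print(hex(R), R == scale_by_parity(A, F))
```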
“Designing TCP/IP Functions in FPGAs”, W. Lu, MSc Thesis, Delft, The Netherlands, August 2003, pp. 34-38, shows how scaling in GF(2^32) can be used for cyclical redundancy check, in the case of thirty-two bit message segments. Let A(x) be a new message segment of thirty-two bits, and part of a larger message M(x). Let CRCPrev(x) be the CRC calculated for portion P(x) of the message M(x) up to but not including A(x), then:
CRCPrev(x)=x^32*P(x) mod G(x) [Eq. 12]
Let CRCNew(x) be the CRC calculated for the portion of the message M(x) up to and including A(x):
CRCNew(x)=x^32*(x^32*P(x)+A(x)) mod G(x) [Eq. 13]
This can be rewritten in recursive format, using the earlier mentioned properties of mod G(x), as:
CRCNew(x)=((CRCPrev(x)+A(x))*x^32) mod G(x) [Eq. 14]
Equation 14 shows that a CRC can be calculated iteratively using bitwise XOR followed by Galois field scaling.
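The iteration of Equation 14 may be sketched as follows, with the result compared against a direct polynomial division of the whole message. The names crc_iterative and scale_x32 and the sample segments are assumptions. The first-arriving segment is taken as the most significant, and the running CRC value is assumed to start at zero.

```python
G32 = 0x104C11DB7


def poly_mod(r: int, g: int) -> int:
    """Remainder of r(x) after polynomial division by g(x) over GF(2)."""
    deg_g = g.bit_length() - 1
    while r.bit_length() - 1 >= deg_g:
        r ^= g << (r.bit_length() - 1 - deg_g)
    return r


# Predetermined scaling values: F32[i] = x^(32+i) mod G(x).
F32 = [poly_mod(1 << (32 + i), G32) for i in range(32)]


def scale_x32(v: int) -> int:
    """Galois field scaling of a thirty-two bit value by x^32 mod G(x)."""
    r = 0
    for i in range(32):
        if (v >> i) & 1:
            r ^= F32[i]
    return r


def crc_iterative(segments) -> int:
    """Equation 14: bitwise XOR with the new segment, then scaling."""
    crc = 0
    for a in segments:
        crc = scale_x32(crc ^ a)
    return crc


segments = [0x01234567, 0x89ABCDEF, 0xDEADBEEF]   # sample data (assumption)

# Reference: treat the whole message as one polynomial and divide once.
message = 0
for s in segments:
    message = (message << 32) | s
reference = poly_mod(message << 32, G32)

print(hex(crc_iterative(segments)), hex(reference))
```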
Next, this result is combined with the derivation from Fast Parallel CRC Algorithm and Implementation on a Configurable Processor, referred to above, to show how to calculate the CRC value iteratively, using scaling, on message portions that have lengths that are a multiple of the order of the primitive polynomial. For example, let A(x) be a sixty-four bit message portion, consisting of two thirty-two bit portions A0(x) and A1(x) such that:
A(x)=x^32*A1(x)+A0(x) [Eq. 17]
Meaning that A1(x) is the first arriving thirty-two bit portion.
Then, keeping the same notation as before:
CRCNew(x)=((CRCPrev(x)+A1(x))*x^64+A0(x)*x^32) mod G(x)
This shows that the thirty-two bit CRC value can be obtained iteratively for sixty-four bit wide message chunks, using bit-wise XOR and GF scaling by either x^32 mod G(x) or by x^64 mod G(x). The matrices F containing values corresponding to these scaling amounts can be predetermined. If stored, these values are immediately available to accomplish Galois field operations at both the transmission and reception of a message. This greatly enhances the speed of the division operation.
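The sixty-four bit case may be sketched in the same manner. The sketch assumes that the first-arriving portion A1 is combined with the previous CRC value and scaled by x^64 mod G(x), while A0 is scaled by x^32 mod G(x), and that the two results are XORed. The function names and sample chunks are assumptions.

```python
G32 = 0x104C11DB7


def poly_mod(r: int, g: int) -> int:
    """Remainder of r(x) after polynomial division by g(x) over GF(2)."""
    deg_g = g.bit_length() - 1
    while r.bit_length() - 1 >= deg_g:
        r ^= g << (r.bit_length() - 1 - deg_g)
    return r


# Two predetermined matrices, one per scaling amount.
F32 = [poly_mod(1 << (32 + i), G32) for i in range(32)]
F64 = [poly_mod(1 << (64 + i), G32) for i in range(32)]


def scale(v: int, table) -> int:
    """XOR together the table rows selected by the bits of v."""
    r = 0
    for i, row in enumerate(table):
        if (v >> i) & 1:
            r ^= row
    return r


def crc_step_64(crc: int, a1: int, a0: int) -> int:
    """One iteration on a sixty-four bit chunk; both scalings are independent."""
    return scale(crc ^ a1, F64) ^ scale(a0, F32)


def crc_by_64(chunks) -> int:
    crc = 0
    for chunk in chunks:
        a1, a0 = chunk >> 32, chunk & 0xFFFFFFFF   # A1 arrives first
        crc = crc_step_64(crc, a1, a0)
    return crc


chunks = [0x0123456789ABCDEF, 0xFEDCBA9876543210]  # sample data (assumption)
crc64 = crc_by_64(chunks)

# Reference: one polynomial division over the whole 128-bit message.
message = (chunks[0] << 64) | chunks[1]
print(hex(crc64), hex(poly_mod(message << 32, G32)))
```

Since the two scaling operations of a step do not depend on one another, they are candidates for evaluation in parallel, for example in two separate arrays.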
Concatenating a message with the remainder from modulo G(x) division provides a message value which has a remainder of zero when again divided by the same divisor. Consequently, a remainder of zero from the second division, performed at the receiver, suggests that the data has been correctly transmitted.
The present invention provides a hardware solution which produces more rapid results than the software solution of the prior art and which may be reconfigured to allow a number of distinct logical operations in addition to cyclical redundancy check scaling of different types.
The arithmetic unit 10 includes a plurality of AND gates 12 logically arranged in rows and columns. In this description, “logical arrangement” means that the circuitry functions as though the devices were physically arranged in the manner illustrated even though the individual elements may be physically positioned differently. In the basic embodiment, each row and each column of the arithmetic unit 10 includes a number of AND gates 12. The number of AND gates in a column of a basic unit may be approximately equal to a convenient number of bits of a typical input message (i.e., h0-h31). That is, if an input message typically may be divided into groups each of which includes thirty-two bits, then each column of the arithmetic unit 10 has thirty-two AND gates. On the other hand, although the number of AND gates in each row of the arithmetic unit 10 may also be thirty-two, this number is selected based on the largest polynomial to be used in the cyclical redundancy check. Moreover, depending on the actual details of the mathematical operations being conducted, a larger or smaller number of AND gates 12 might be utilized in each row and column; and extra AND gates 12 might be added to each row or column for purposes such as a parity check.
In the arithmetic unit 10, each AND gate 12 in a row receives the same bit of the message as an input. Each AND gate 12 in a column receives a logically sequentially increasing (or decreasing) bit of the message as an input. Each AND gate 12 in a column also receives a second value (referred to as a “parity masking bit”) which is a logically sequentially increasing (or decreasing) bit of a Galois field value. The Galois field values furnished to the AND gates 12 of each sequential column of the arithmetic unit 10 are the sequential Galois field values computed for the particular message length. For Galois field scaling as described above, these values are typically constants which may be precomputed and stored for ready use in parity mask memory cells associated with the AND gates 12. For a thirty-two bit message portion, the associated Galois field scaling M(x) multiplied by (x^32) mod G(x) can be obtained by using as parity mask memory values the values shown in the table of
It should be noted that the particular primitive polynomial for thirty-two bits is usually referred to as CRC32 and is the polynomial used in Ethernet communications selected by an industry standards committee and described in IEEE 802.3-2002, Section 3.8. The CRC32 polynomial is that presented in Equation 1, above, and represented in binary form in Equation 4 above.
As may be visualized, when a thirty-two bit message is presented to the arithmetic unit 10, the sequential bits h0-h31 of that message appear at the input terminals of the sequentially-positioned AND gates 12 of each column simultaneously. Each of the AND gates 12 of a column also receives one of the sequential bits bi according to the selected manipulation. In the basic case of Galois field scaling, bits bi are assigned to Galois field values such as those shown in
Thus, in a single operation, the results of the manipulation of the message bits by each of the Galois field values of
Associated with each logical column of the arithmetic unit 10 is a counting circuit comprising an exclusive OR (XOR) tree 14 (see the block diagram of
Those skilled in the art will recognize that both the logical AND function and the logical XOR function may be performed by many different circuits. Apart from the specific devices of which a logic circuit is composed, different logical steps may be performed by stages of two different circuits which provide the same ultimate result. For example, both a basic AND circuit and a NAND circuit with an inverted output provide a logical AND function.
Moreover, the following properties based on De Morgan's law may be relied upon. First, an XOR circuit with both of its inputs inverted produces the same result as an XOR circuit that does not have its inputs inverted. As a result, the XOR of the output of two AND circuits produces the same result as the XOR of the output of two NAND circuits of the same input signals. Similarly, an XOR of the output of two XOR circuits produces the same result as an XOR of the output of two NOT-XOR (NXOR) circuits of the same input signals. Hence, AND gates followed by a binary XOR tree produce the same result as NAND gates followed by a binary XOR tree, and produce the inverted result of AND or NAND gates followed by a binary NXOR tree. Secondly, the OR function with its inputs inverted is equivalent to the NAND function of its inputs. Hence, an AND circuit followed by an OR function produces the same result as a NAND function followed by another NAND function.
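The equivalences recited above may be confirmed by exhaustive evaluation over all input combinations, for example as sketched below. The helper names are assumptions made for the sketch; logic values are modeled as the integers 0 and 1.

```python
from itertools import product


def inv(a):
    return a ^ 1


def nand(a, b):
    return inv(a & b)


def nxor(a, b):
    return inv(a ^ b)


def holds(identity) -> bool:
    """True if the identity is satisfied for every combination of a, b, c, d."""
    return all(identity(a, b, c, d) for a, b, c, d in product((0, 1), repeat=4))


results = {
    # XOR is unchanged when both of its inputs are inverted.
    "xor_inverted_inputs": holds(lambda a, b, c, d: inv(a) ^ inv(b) == a ^ b),
    # XOR of two AND outputs equals XOR of two NAND outputs.
    "xor_of_ands_vs_nands": holds(
        lambda a, b, c, d: (a & b) ^ (c & d) == nand(a, b) ^ nand(c, d)),
    # XOR of two XOR outputs equals XOR of two NXOR outputs.
    "xor_of_xors_vs_nxors": holds(
        lambda a, b, c, d: (a ^ b) ^ (c ^ d) == nxor(a, b) ^ nxor(c, d)),
    # OR with inverted inputs is the NAND of the inputs.
    "or_inverted_is_nand": holds(
        lambda a, b, c, d: (inv(a) | inv(b)) == nand(a, b)),
    # AND followed by OR equals NAND followed by NAND.
    "and_or_vs_nand_nand": holds(
        lambda a, b, c, d: ((a & b) | (c & d)) == nand(nand(a, b), nand(c, d))),
}

for name, ok in results.items():
    print(name, ok)
```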
In addition, when a plurality of stages of logical operations are involved in producing a particular logical result, various of the manipulations such as inversions particular to a specific circuit may be included within others of the stages yet produce the same results. These characteristics are utilized in providing a number of advantages of the present invention. In order to facilitate an understanding of the different aspects of the invention, particular logical operations which result from a particular configuration of the circuits utilized are referred to as “logical functions” (e.g., the logical AND function, the logical OR function) no matter which specific circuit performs the function. Consequently, the scope of the invention should be considered to include the various different circuits which may be utilized to carry out the referenced logic functions.
It should be noted that the particular arrangement illustrated in
More particularly, the tree 31 illustrated in
This may be appreciated by considering the results at each stage of the tree. Presuming that the physical arrangement utilizes NAND gates to furnish the inputs a-h and that all of the SEL values are set to ZERO so that the circuits 32 function as NXOR gates and the circuit 34 as an inverter, then the outputs of the circuits 32 at the first stage of the tree are NXOR(a,b), NXOR(c,d), NXOR(e,f), NXOR(g,h). Then, the outputs of the circuits 32 at the second stage of the tree are NXOR (NXOR(a,b), NXOR(c,d)), and NXOR (NXOR(e,f), NXOR(g,h)). These values of outputs of the circuits 32 at the second stage reduce to NXOR (a,b,c,d) and NXOR (e,f,g,h). Then, the output at the third stage of the tree becomes NXOR (a,b,c,d,e,f,g,h). When inverted by the circuit 34, the result is XOR (a,b,c,d,e,f,g,h).
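The stage-by-stage reasoning above may be checked with a small model of an eight-input tree, sketched below under the stated presumptions: NAND gates furnish the inputs a-h, every two-input stage acts as an NXOR gate, and the last stage acts as an inverter. The function names are assumptions, and the model is a logical one only.

```python
from itertools import product


def nand(a, b):
    return (a & b) ^ 1


def nxor(a, b):
    return (a ^ b) ^ 1


def nxor_tree_8(inputs):
    """Three NXOR stages followed by a final inverter (all SEL values ZERO)."""
    stage1 = [nxor(inputs[i], inputs[i + 1]) for i in range(0, 8, 2)]
    stage2 = [nxor(stage1[0], stage1[1]), nxor(stage1[2], stage1[3])]
    stage3 = nxor(stage2[0], stage2[1])
    return stage3 ^ 1               # the final stage, acting as an inverter


def xor_of_ands(msg, mask):
    """Reference result: XOR over the AND of each message bit and mask bit."""
    r = 0
    for m, b in zip(msg, mask):
        r ^= m & b
    return r


# Exhaustively compare the NAND-fed NXOR tree with the reference.
tree_matches = all(
    nxor_tree_8([nand(m, b) for m, b in zip(msg, mask)]) == xor_of_ands(msg, mask)
    for msg in product((0, 1), repeat=8)
    for mask in product((0, 1), repeat=8)
)
print(tree_matches)
```

The agreement relies on the even number of inputs: inverting all eight inputs of an XOR leaves its result unchanged, so NAND gates may stand in for AND gates at the inputs of the tree.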
The truth table of
In order to allow the tree to perform the different operations, circuits 32 may be utilized such as those illustrated in
On the other hand, when the SEL setting is chosen to be ZERO, the NOR gate functions to invert the input value B. This causes the multiplexer to furnish an output which is that of a NXOR circuit. Thus, depending on its particular configuration, the circuit 32 of
The circuit 34 which is the last stage of the tree of
The results produced by the XOR trees 14 of
Using a plurality of reconfigurable arithmetic units allows longer messages to be handled in parallel. It has been shown in the Fast Parallel CRC Algorithm and Implementation on a Configurable Processor publication and in the discussion above regarding equation 5 that it is possible to divide messages longer than thirty-two bits into thirty-two bit segments and handle in parallel the processing of the cyclical redundancy check values for those individual segments. The results of processing the individual segments may then be combined to provide a result for the entire message.
In one embodiment of the present invention (see
Similarly, larger messages may be handled in series of one hundred twenty-eight bit segments furnished to the four reconfigurable arithmetic units provided for handling one hundred twenty-eight bits; since the operation is modulo 2, the remainder values wrap into the larger messages in a similar way as suggested by equation 14 for the sixty-four bit case.
It should be noted that a practical arrangement for handling messages of varying lengths might include additional circuitry. For example, if the messages to be handled are guaranteed to be multiples of thirty-two bits, four RAUs and additional multiplexers permit the last message portion to be sent into any of four, three, two, or one RAU and padded with zeroes. For messages which are guaranteed to be a multiple of eight bits in length (the most common form in Ethernet communication), an additional fifth RAU and additional multiplexers may be used to perform a necessary length correction step that can also be reduced to a Galois field scaling step as known to those skilled in the art.
Moreover, a practical arrangement might include two complete sets of four (or five) reconfigurable arithmetic units in order to accomplish the processing of both incoming and outgoing messages.
Although the values used in the Galois field manipulation are usually well known and may therefore be precomputed, the basic arithmetic unit 10 of the present invention allows a number of variations. The arrangement utilizes programmable inputs to the AND gates 12 for values generated using the primitive polynomials. Because these values are programmable, the use of the circuitry may be changed from simply accomplishing the cyclical redundancy check utilizing a constant primitive polynomial to other uses. By utilizing AND gates 12 having a variable input B, a different value may be provided in place of a standard internet Galois field value. For example, if another polynomial is utilized in the cyclical redundancy check manipulation, then the Galois field value used at the receiver may be different than that which has been precomputed. In such a case, the Galois field polynomial must be transferred from the transmitting station to the receiving station in order to accomplish the cyclical redundancy check operations. With the AND gate parity masking bit input to the arithmetic unit 10 of the present invention, this operation is easily accomplished since the Galois field values utilized in the manipulation may be readily varied.
The ability to modify the parity bit input values may also be useful for other than the determination of cyclical redundancy check values. Thus, the arithmetic unit 10 may be utilized to generate hash values useful in various operations of the circuitry with which the arithmetic unit 10 is associated (e.g., see A Performance Study of Hashing Functions for Hardware Applications, M. Ramakrishna, E. Fu, and E. Bahcekapili, Proc. 6th. Intl. Conf. Computing and Information, 1994, pp. 1621-36).
Another advantage of the invention is the ability to utilize primitive polynomials other than that for thirty-two bits in the computations. There are a number of additional cyclical redundancy check verifications having associated primitive polynomials which are utilized for other purposes. For example, there are also primitive polynomials which have been selected for cyclical redundancy check verifications for eight, ten, twelve, and sixteen bit messages. Because the B inputs of the AND gates 12 of the arithmetic unit 10 are changeable, the arrangement provides the ability to work with these additional cyclical redundancy check verifications in smaller portions of the AND-XOR array. For example, since the values furnished to the B inputs may be controlled, it is possible to utilize sixteen bit by sixteen bit portions of the array to test for other than internet cyclical redundancy check values. The same ability allows even smaller portions such as eight by eight bit portions of the arrays to be utilized for similar purposes.
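The effect of changing the programmed values may be sketched as follows: one routine builds the scaling values for whatever polynomial is supplied, so that a sixteen-by-sixteen or eight-by-eight table results for a shorter polynomial. The sixteen bit and eight bit polynomials used below are assumed examples chosen for the sketch and are not named in this description; the function names are likewise assumptions.

```python
def poly_mod(r: int, g: int) -> int:
    """Remainder of r(x) after polynomial division by g(x) over GF(2)."""
    deg_g = g.bit_length() - 1
    while r.bit_length() - 1 >= deg_g:
        r ^= g << (r.bit_length() - 1 - deg_g)
    return r


def scaling_table(g: int) -> list:
    """Values x^(n+i) mod G(x) for i = 0 .. n-1, where n is the order of G(x)."""
    n = g.bit_length() - 1
    return [poly_mod(1 << (n + i), g) for i in range(n)]


def scale(a: int, table) -> int:
    """XOR together the table rows selected by the bits of a."""
    r = 0
    for i, row in enumerate(table):
        if (a >> i) & 1:
            r ^= row
    return r


POLYS = {
    "thirty-two bit (Equation 1)": 0x104C11DB7,
    "sixteen bit (assumed example)": 0x11021,   # x^16+x^12+x^5+1
    "eight bit (assumed example)": 0x107,       # x^8+x^2+x+1
}

tables = {name: scaling_table(g) for name, g in POLYS.items()}
for name, table in tables.items():
    print(name, "->", len(table), "by", len(table), "table")
```

Only the stored values differ between the three cases in this sketch; the AND and XOR structure that consumes them is the same.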
More importantly, providing additional inputs to sub-portions of the AND gates allows the individual sub-portions of the array to be associated with one another in a manner that the parts may be made to cooperate to provide a whole that may be easily manipulated. For example, the array may be utilized in a manner that four individual sixteen by sixteen arrays are provided. By dividing the individual inputs of the message and the added inputs into sixteen bit sections, all of these arrays may be made to function so that they provide results based on individual sixteen bit inputs. Moreover, the outputs of these sub-arrays may be combined and utilized in a manner so that effectively four individual sixteen by sixteen operations are being conducted in parallel. Of course, the same is true of the smaller sub-arrays such as those of eight by eight bits as described above.
In order to obtain the aforementioned advantages, it is desirable to provide inputs to and take outputs from the individual sub-portions of the array. In order to accomplish this, it is useful to provide additional horizontal input buses to eight-by-eight and to sixteen-by-sixteen subsections of the reconfigurable arithmetic unit. These inputs may be made programmable so that individual eight-by-eight or sixteen-by-sixteen sub-sections may be utilized as well as the full thirty-two by thirty-two bit array. It should be noted that, assuming that the entire reconfigurable arithmetic unit is provided the XOR trees discussed above for each column, then these XOR trees will, in fact, function to provide the desired result without additional change.
The operations which the invention may be utilized to accomplish are increased by the ability to utilize sub-portions of the thirty-two by thirty-two bit array or other convenient sized array and the enhancements illustrated in an embodiment of the invention shown in
The provision of the additional inputs to the AND gates also allows the reconfigurable arithmetic unit to be utilized for general Galois field multiplication.
A(x)B(x)mod G(x)
in which both operands A(x) and B(x) are variable, so that the method of precalculating a table that was presented for Galois field scaling does not apply. A typical application for this is in Reed-Solomon decoders [Blahut], and a typical implementation uses the extension Galois field GF(2^8). For the following explanation, variables are elements of GF(2^8) defined by a primitive polynomial of order eight, for example [Ref BBC Whitepaper WHP031, p. 8, Eq. 8]:
G(x)=x^8+x^4+x^3+x^2+x^0 [Eq. M1]
In reference [Gill 1992], it is shown how a multiplication of two operands A(x) and B(x) may be accomplished in a two step process, consisting of first, the calculation of a raw product polynomial T(x) that is of degree-bound 16, followed by a scaling of the top portion of T(x) by a fixed value, and bit-wise XOR of that value with the lower portion of T(x). For completeness, that derivation is repeated in the following paragraphs.
Input operands and result in GF(2^8) are represented by the polynomials A(x), B(x) or the vectors A[7:0], B[7:0], and a result R(x) or R[7:0]:
First consider the raw multiplication result Prod(x) in polynomial form. This is a polynomial of order fourteen:
Prod(x) is hence obtained by calculating the partial products for every combination of elements of A[7:0] and B[7:0], and performing a bit-wise XOR for partial products corresponding to equal powers of x.
This polynomial Prod(x) can be split into a lower half ProdLow(x), containing the coefficients corresponding to the powers (0..7) of x, and a higher half ProdHigh(x), containing the coefficients corresponding to the powers (8..14), but with x^8 divided out so that:
Prod(x) = x^8*ProdHigh(x) + ProdLow(x) [Eq. M3]
where:
Note that here, the 8th coefficient of ProdHigh(x) is always zero but it is added so that all buses used in the calculation can be multiples of eight bits.
The remainder of Prod(x) divided by G(x) is the desired result polynomial R(x). Since ProdLow(x) mod G(x)=ProdLow(x) and ProdHigh(x) mod G(x)=ProdHigh(x):
T(x) = [x^8 mod G(x), x^9 mod G(x), . . . , x^14 mod G(x), x^15 mod G(x)], [Eq. M9]
The coefficients of which can be precalculated since T(x) is independent of either operand.
Hence the coefficients R[7:0] of R(x) are obtained from:
First, three sub-portions of the array 61, 62, and 63 are chosen to accomplish the operation. Each of these sub-portions may have an array size equal to the values to be utilized. That is, since values of eight bits are to be multiplied, the sub-portions of the array chosen are eight-by-eight in size. Typically a byte is the smallest useful segment of data which might be manipulated. Before input values are furnished to the array, inputs to certain of the AND gates in two of the sub-portions are rendered inoperative by zeroing the inputs provided on the diagonal input conductors. As may be seen in
One of the values to be multiplied is furnished on the horizontal input lines to the two sub-portions 61 and 62 while the other value to be multiplied is furnished to the same sub-portions on the diagonal lines which are not furnishing disabling inputs. Thus all of the AND gates of all of the columns of the two sub-portions 61 and 62 receive the same input values on the horizontal lines, while the value furnished on the diagonal lines increases by one bit weight with each column, proceeding from left to right. When the results produced by the AND gates are manipulated by the XOR trees of each column, a sixteen bit result consisting of a fifteen bit raw multiplication result padded in the most significant bit position by one 0 bit is provided by the two sub-portions.
To reduce this fifteen bit result to a GF(2^8) value, the high order bits provided as output by the XOR trees of sub-portion 61 are furnished on horizontal lines MpyH to the sub-portion 63. In sub-portion 63, these high order bits are scaled utilizing a matrix of values for a primitive polynomial for GF(2^8) Galois field operations. The results provided from the XOR trees of the columns of the sub-portion 63 are then combined (XORed) with the results provided from the XOR trees of the columns of the sub-portion 62 to provide a final result in GF(2^8) form.
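The two step multiplication described above can be illustrated in software. The following is only a minimal sketch, assuming the example polynomial of Eq. M1 and eight bit operands; the function names (raw_product, reduce_power, gf_mul) are hypothetical and do not correspond to any element of the figures. The raw product plays the role of sub-portions 61 and 62, and the table T plays the role of the scaling matrix held in sub-portion 63.

```python
# Sketch of GF(2^8) multiplication as a raw product followed by scaling.
# Assumption: G(x) = x^8 + x^4 + x^3 + x^2 + 1 (Eq. M1), encoded as 0x11D.
G = 0x11D


def raw_product(a, b):
    """AND partial products, then XOR equal powers of x (no carries)."""
    prod = 0
    for i in range(8):
        if (b >> i) & 1:
            prod ^= a << i  # partial product row shifted by bit weight i
    return prod  # at most fifteen significant bits


def reduce_power(k):
    """Return x^k mod G(x) as an eight bit vector."""
    v = 1
    for _ in range(k):
        v <<= 1
        if v & 0x100:
            v ^= G
    return v


# Precalculated scaling table: T[j] = x^(8+j) mod G(x), j = 0..7 (Eq. M9).
T = [reduce_power(8 + j) for j in range(8)]


def gf_mul(a, b):
    """Multiply two GF(2^8) elements: scale ProdHigh by T, XOR with ProdLow."""
    prod = raw_product(a, b)
    prod_low = prod & 0xFF
    prod_high = prod >> 8  # bit 7 of prod_high is always zero
    res_h = 0
    for j in range(8):
        if (prod_high >> j) & 1:
            res_h ^= T[j]
    return res_h ^ prod_low
```

For example, gf_mul(0x02, 0x80) evaluates x * x^7 = x^8 mod G(x), which is the first table entry T[0].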
Each of the RAUs 50 illustrated in
Each of the RAUs 50 is also positioned to receive input from and provide output to two vertically-placed ones of a plurality of switch boxes 40. For example, the RAU 50a is positioned to receive input from and provide output to the two vertically-placed switch boxes 40a and 40b.
In discussing the embodiment of
The input signals available to each of the RAUs 50 include horizontal input values furnished on input buses prefaced by an “h” such as h8a2 furnished to RAU 50a. The nomenclature utilized indicates that the eight bit bus furnishes signals to a number of columns convenient to the particular size of the RAUs. Those buses including an “8” furnish signals across eight columns, while those including a “16” furnish signals across sixteen columns (across two RAUs). The latter buses are used to provide signals to two RAUs at the same time. The input signals available to each of the RAUs 50 also include two different values furnished on input buses connected to diagonal inputs of the RAU 50a. These are furnished on input buses prefaced by a “d” such as d8a0 and d8b3 which connect to the RAU 50a.
Output values are furnished by the RAUs 50 on buses prefaced by a “v” such as bus v8a1 joining RAU 50a.
Each of the horizontal switch boxes 30 receives input signals on three eight bit input buses c8xx, h8xx, and d8xx and on one eight bit input bus h16xx which spans sixteen columns. Although not illustrated in order to reduce the complexity of this figure, particular embodiments may also include input buses which span thirty-two columns. Those skilled in the art will recognize that buses of other sizes also may be included depending on the particular use of the arrangement.
The internal elements of the switchboxes 30 and 40 may be similar so only a single switchbox is treated in detail. The manner in which signals may be provided to each of the RAUs from the horizontal and vertical switch boxes will be better understood by referring to
In
In addition, each of the circled intersections in
As may be seen, each of the outputs A1, D1, C1, H1, A0, D0, C0, H0, and H2 is preceded by a tristate buffer. On the other hand, the output MXOUT is furnished by a multiplexer selecting from one of two vertical buses, the output XOROUT is furnished by a XOR circuit receiving inputs from two vertical buses, and the output REGQ is furnished by a DQ register receiving inputs from two vertical buses. Each of the MXOUT, XOROUT, and REGQ outputs is also furnished as an internally routed input on a similarly labeled horizontal bus. These horizontal buses allow connections to be made (at circled crossing points) to the various output channels.
In addition to the other inputs, logical ZERO (“0”) and ONE (“1”) values are furnished within the switchbox 30 so that these values may be furnished on the various outputs of the circuits. These are especially useful for setting the diagonal values of certain areas of the RAUs in order to allow the arrangement to be used for various arithmetic purposes such as Galois field multiplication previously discussed.
Those skilled in the art recognize that a sparse crossbar can be readily implemented using multiplexor circuits.
The paths 155 and 156 meet at another sparse crossbar 158 which provides for interchanging the signals on the two signal paths. The paths 155a or 156a also proceed by bidirectional paths 165 and 166 to a sparse crossbar 161 where the signals may be switched to the outputs A and B at a reconfigurable circuit 167. Bidirectional signal paths 168 and 169 return from the circuit 167 through the sparse crossbar 161 and to the paths 155b and 156b. The sparse crossbar 161 allows signals on paths 165, 166, 168, and 169 to be cross-coupled to the other paths.
At the circuit 167, signals on paths 165 and 166 are furnished to separate AND gates 163 and 164 where they may be transferred in response to control signals. It should be noted that each AND gate shown represents a bus width of individual AND gates controlled by a single input control. The AND gate outputs are furnished to XOR circuitry, and the result furnished to the signal paths 168 and 169. The circuit 167 has the ability to function as either a bussed multiplexer or a bussed XOR circuit and offers a bussed register at one output. As is illustrated, the signal paths 168 and 169 connect back to the sparse crossbar 161 where the signals may be rerouted through the arrangement. The circuit of
In
In order to illustrate the operation of the access circuitry shown in
In particular,
In a similar manner,
Another operation of the RAU is illustrated in
In order to access a particular value stored in the RAU, the value being sought is furnished on the first, third, fifth, and seventh rows of the horizontal input bus; while ONEs are furnished on the second, fourth, sixth, and eighth rows of the horizontal input bus. That is, the value being sought is furnished on alternate lines of the input bus at rows in which ONEs are stored by memory (i.e., IN[0], IN[N/2−1]). The other lines of the input bus are furnished values of ONE. This causes the AND gates of the particular column storing the value being sought to furnish an output indicating a match of the input value and the value stored by the column. Referring to the table of settings of the select values which may be utilized for programming the XOR tree of
The present invention may also be utilized to accomplish the functions of a programmable logic array (PLA). A PLA is a circuit which allows the implementation of arbitrary logic by mapping it onto two levels of logic, either (1) AND followed by OR or (2) OR followed by AND. The functionality of a PLA may be obtained by cascading two RAU circuits so that the output of the first RAU circuit is used as the input of the second RAU circuit. In the first RAU, the selection values of the counting tree (
PLA-like structures with a higher number of logic levels can be obtained by cascading more than two RAU circuits in this manner.
A different type of PLA may be realized by using a function which XORs the sum of products. This may be implemented by configuring the counting tree of the second RAU circuit to provide XOR logic by setting the select values as delineated in line A of the accompanying truth table. Such an arrangement allows certain types of logic to be mapped more efficiently.
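The difference between the two PLA types can be sketched in software. The following is an illustrative model only: it evaluates a product plane followed by either an OR stage or an XOR stage, and does not model the select values of the counting tree. The names and_plane, pla_or, pla_xor, SOP_TERMS, and XSOP_TERMS are hypothetical. Three input parity is used as the example because it needs four three-literal product terms in AND/OR form but only three single-literal terms in AND/XOR form.

```python
from itertools import product


def and_plane(inputs, terms):
    """First level: each term is a list of (input index, required value) literals."""
    return [int(all(inputs[i] == want for i, want in term)) for term in terms]


def pla_or(inputs, terms):
    """Second level ORs the product terms (conventional sum of products)."""
    return int(any(and_plane(inputs, terms)))


def pla_xor(inputs, terms):
    """Second level XORs the product terms (XOR of products)."""
    out = 0
    for p in and_plane(inputs, terms):
        out ^= p
    return out


# Three input parity as a sum of products: four terms of three literals each.
SOP_TERMS = [
    [(0, 1), (1, 0), (2, 0)],
    [(0, 0), (1, 1), (2, 0)],
    [(0, 0), (1, 0), (2, 1)],
    [(0, 1), (1, 1), (2, 1)],
]

# The same function as an XOR of products: three terms of one literal each.
XSOP_TERMS = [[(0, 1)], [(1, 1)], [(2, 1)]]

if __name__ == "__main__":
    for bits in product((0, 1), repeat=3):
        print(bits, pla_or(bits, SOP_TERMS), pla_xor(bits, XSOP_TERMS))
```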
Another illustration will assist those skilled in the art to understand that much more sophisticated logical operations may be accomplished by the inventive arrangement utilizing the reconfigurable arithmetic units and the accompanying access circuitry shown in
It may be recognized that the following two equations are implemented in a repetitive manner for incrementing indices i−1,i,i+1 etc:
L_{i+1} <= L_i + Delta*B_i [Eq. RS1]
B_i <= Skip ? B_i : L_i*DeltaRecip [Eq. RS2]
where the variables L_{i+1}, L_i, Delta, B_i, and DeltaRecip are elements of GF(2^8) and “Skip” is a binary variable; “+” stands for addition in GF(2^8), “*” stands for multiplication in GF(2^8), “<=” denotes an assignment through a flip-flop or register on a predetermined clock edge, and the notation “A ? B:C” denotes a multiplexer that has output value B if A equals “1” and C if A equals “0”.
These equations can be divided into smaller portions, each corresponding to a specific hardware component shown in Figure [BKMSysArrayNew.ps], while keeping the same notation, as follows:
L_{i+1} <= K_{i+1} [Eq. RS1a]
implemented in register 214a of section 210a,
K_{i+1} = L_i + C_{i+1} [Eq. RS1b]
implemented in Galois field adder 212a of section 210a,
C_i = B_i*Delta [Eq. RS1c]
implemented in Galois field multiplier 232b of section 235b, where K_{i+1}, C_i, and C_{i+1} are elements of GF(2^8), and:
B_i <= A_i [Eq. RS2a]
implemented in register 220b of section 210b,
A_i = Skip ? B_i : M_i [Eq. RS2b]
implemented in multiplexer 218b of section 210b,
M_i = L_i*DeltaRecip [Eq. RS2c]
implemented in Galois field multiplier 216b of section 210b, where A_i and M_i are elements of GF(2^8).
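One clock edge of Equations RS1 and RS2 can be modeled behaviorally as follows. This is a minimal sketch under stated assumptions: it uses the example polynomial of Eq. M1, collapses the slice-to-slice index offset of Eq. RS1b into a single slice, and uses hypothetical names (gf_mul, clock_step) that do not appear in the figures. Both registers are updated from the values held before the clock edge.

```python
# Behavioral sketch of one clocked update of Eq. RS1 and Eq. RS2.
# Assumption: GF(2^8) with G(x) = x^8 + x^4 + x^3 + x^2 + 1 (0x11D).
G = 0x11D


def gf_mul(a, b):
    """Shift-and-reduce multiplication in GF(2^8)."""
    r = 0
    for _ in range(8):
        if b & 1:
            r ^= a
        b >>= 1
        a <<= 1
        if a & 0x100:
            a ^= G
    return r


def clock_step(L, B, delta, delta_recip, skip):
    """Return the register contents (L, B) after one clock edge."""
    C = gf_mul(B, delta)        # Galois field multiplier (Eq. RS1c)
    K = L ^ C                   # Galois field adder, a bitwise XOR (Eq. RS1b)
    M = gf_mul(L, delta_recip)  # Galois field multiplier (Eq. RS2c)
    A = B if skip else M        # multiplexer controlled by Skip (Eq. RS2b)
    return K, A                 # registers load K and A (Eq. RS1a, Eq. RS2a)
```

With Delta equal to zero and Skip equal to one, both registers hold their values, which is a convenient sanity check of the wiring.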
It may be recognized that the hardware described in
Except for the single bit signal “Skip”, all signals are sent on 8 bit buses representing elements of Galois field GF(2^8). Vertical crossbar 40c of
At this point, it is useful to refer to the description of Galois field multiplier 216b of
Vertical crossbar 40b is programmed to furnish a logic “0” on its output D0, and onto bus d8a1, which is connected to input D of RAU 50b, and vertical crossbar 40f is programmed to furnish a logic “0” on its output D1, and onto bus d8b7, which is connected to input E of RAU 50d. Vertical crossbars 40d and 40f have been programmed to furnish the signal of their inputs V1 on their outputs V0, and externally supplied value DeltaRecip is distributed along buses v16b3, v16b4, and v16b5.
Cross switch 60d is programmed to connect bus v16b4 to bus h16b4. Horizontal crossbar 30d is programmed to furnish the signal of its input H2 onto its outputs A0 and A1. As a result, DeltaRecip is available on bus h8a3 and on input A of RAU 50b, as well as on bus h8a5 and on input A of RAU 50d.
Furthermore, RAUs 50b and 50d have been set to operate as Galois field multipliers, and produce the high and low portions ProdHigh and ProdLow of the raw multiply value of Li*DeltaRecip on their outputs z, respectively. Vertical crossbar 40b and horizontal crossbar 30c are programmed to conduct signal ProdHigh from v8a2 to d8b3 and from d8b3 to h8a4, respectively. In
In another embodiment, diagonal buses used for routing purposes only could be present.
Vertical crossbar 40d is programmed to furnish a logic “0” on bus d8a3 to input D of RAU 50c, and vertical crossbar 40e is programmed to furnish a logic “0” on bus d8b6 to input E of RAU 50c. RAU 50c is programmed to implement the Galois field scaling of ProdHigh by x^8 mod G(x) to obtain ResH, similar to Equation M3. Vertical crossbar 40e is further programmed to produce the Galois field addition (i.e., bitwise XOR) of signal ResH on bus v8a5 and signal ProdLow on bus v8a6, and furnish the result Mi on bus c8b6.
Horizontal crossbar 30f implements multiplexer 218b and register 220b of
Due to the repetitive nature of the implementation, some logic belonging to the next slice 200a of the systolic array is also included in this figure. In particular, vertical crossbars 40a and 40d of
The usefulness of the RAU may be significantly extended by an improvement which replaces the reconfigurable XOR tree arrangement with a reconfigurable compression tree. In contrast to an XOR tree arrangement (such as that described above) which may be used to determine the parity of values furnished by an arrangement of AND gates, a compression tree may be used to determine an actual count of the values furnished by the arrangement of AND gates. Consequently, a RAU including a compression tree may be utilized to perform more advanced mathematical functions such as general multiplication of values furnished to the RAU.
As is well known, a compression circuit includes a plurality of stages representing increasing positional values (also referred to herein as bitweight) each of which stages receives a number of input values to be combined and provides an output which is equal to the sum of the input values furnished to that stage. This sum includes a value which is within the positional range for that stage and any overflow into the stage having the next positional value occasioned by a total greater than the number base. With binary numbers where the range can be only zero or one, any stage must respond to a pair of input values by producing a zero sum where both inputs are zero, a one sum where a single input value is a one, and a zero sum with a carry of one to a next higher stage where both input values are one. The number of input values is usually greater than two since, among other things, all stages except the least significant include a carry input from stages of next lower significance. Consequently, the circuitry of the stages, further referred to as counter circuits, must be able to deal with all possible results.
A relevant explanation of the design of multipliers using a compression tree is given in M. Santoro, Design and Clocking of VLSI Multipliers, Ph.D. Thesis, Stanford University, October 1989, included herein by reference.
A compression tree 180 is illustrated in block diagram in
In a similar manner, the three counter circuits 170 in the lower row of the figure receive as input values the output bit values furnished by the AND gates of the remaining four adjacent columns (indicated as PP4, PP5, PP6, and PP7) of the 8×8 RAU. These values are combined with carry values and provide output values of sum and carry from each stage.
The sum and carry values generated by the counter circuits 170 of the upper and lower rows are furnished as inputs to counter circuits 170 illustrated in the center row which provide the stages of the next level of the compressor 180. The sum values from each first level stage are furnished to the same stage at the higher level, while the carry values from each first level stage are furnished to the next highest stage. In this manner, the results of the first level of compression are combined to provide final sum and carry results for the particular 8×8 RAU.
Carry outputs of counter circuits 170 on the left edge of compressor 180 are sent out in a bus indicated as mco[0:5]. These may be connected to corresponding inputs of a similar compressor 180 located immediately to the left of the one under consideration in order to compress wider rows of partial product inputs.
In the embodiment shown in
The combination of the compression tree 180 of
One counter circuit 170 which may be utilized in the compression tree 180 for accomplishing generalized counting of input values provided in parallel is illustrated in
The details of circuitry utilized in various full adders are well known to those skilled in the art. Essentially, the sum is the result of XORing all of the input values, while the carryout value is one if two or more of the input bits to that adder 171 are one, and the carry value is similarly one if two or more of the input bits to that adder 172 are one. These carry and carryout values may be determined by circuitry which assesses the results of XORing individual pairs of input bits to the particular adder. For example, if the XOR of two input bits to an adder is ONE, then those two bits are not equal; and the carry or carryout is the same as the value of the third input bit. On the other hand, if the XOR of two input bits to an adder is ZERO, then those two bits are equal; and the carry or carryout is the same as the value of one of those bits. As is illustrated in
As with the XOR tree circuitry described above, there are various methods of implementing individual counter circuits 170 to provide the desired results. The counter circuit 190 illustrated in
As may be seen, the four input values i0 through i3 are furnished to a pair of XOR circuits at a first level. The results of each first level XOR circuit provide input values to a second level XOR circuit. The result of the second level XOR circuit is provided along with a carry-in as input values to a third level XOR circuit resulting in a sum value for the particular bitweight and level.
The output of the left one of the first level XOR circuits is also used to gate a multiplexor which furnishes a carry-out value from either the i0 or the i2 input value. The i0 value is selected if the select value provided by XORing the i0 and i1 is ZERO, while the i2 value is selected if the select value provided by XORing the i0 and i1 is ONE. Similarly, the output of the second level XOR circuit is also used to gate a multiplexor which furnishes a carry value from either the i3 or the carry-in input value. The i3 value is selected if the select value provided by the second level XOR circuit is ZERO, while the carry-in value is selected if the select value provided is ONE. It may be verified that the counter circuit 190 of
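The XOR and multiplexer arrangement just described can be checked exhaustively in software. The following is a minimal sketch that follows the description of the three XOR levels and the two multiplexers; the function name counter_4_2 and the variable names are hypothetical. The check confirms that the five input bits always equal the sum bit plus twice the two carry bits.

```python
from itertools import product


def counter_4_2(i0, i1, i2, i3, carry_in):
    """Bit-level model of the counter: returns (sum, carry, carry_out)."""
    x_left = i0 ^ i1             # left first level XOR
    x_right = i2 ^ i3            # right first level XOR
    x_second = x_left ^ x_right  # second level XOR
    sum_bit = x_second ^ carry_in  # third level XOR

    # Carry-out multiplexer: i0 when i0 and i1 are equal, otherwise i2.
    carry_out = i2 if x_left else i0
    # Carry multiplexer: i3 when the second level XOR is ZERO, else carry-in.
    carry = carry_in if x_second else i3
    return sum_bit, carry, carry_out


def check_all_inputs():
    """Verify inputs == sum + 2*(carry + carry_out) for all 32 combinations."""
    for bits in product((0, 1), repeat=5):
        s, c, co = counter_4_2(*bits)
        if sum(bits) != s + 2 * (c + co):
            return False
    return True
```

Note that the carry-out depends only on i0, i1, and i2 and not on the carry-in, so in this model a carry-in cannot ripple through more than one adjacent stage.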
Another counter circuit 300 illustrated in
Another counter circuit 310 shown in
Ultimately, the compression tree 180 provides an output which includes a sum value and a carry value for all of the bit positions of the particular RAU. The sum and carry values provided by the compression trees of a plurality of individual RAU may be furnished to and combined in a similar manner by additional higher level compression trees so that a result for a larger set of input values (e.g., 32×32) may be obtained. Finally, the resulting sum and carry values may be combined by a carry propagate adder (see
cpaco*(2^n) + T = S + C + ci [CPA1]
where S and C are n-bit wide unsigned input values, T is an n-bit wide unsigned output value, ci is a single bit carry in signal, and cpaco is a single bit carry out signal.
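The relation CPA1 can be illustrated with a bit-serial model. This is only a sketch of a ripple style carry propagate adder, assuming nothing about the actual adder circuit of the embodiment; the function name carry_propagate_add is hypothetical.

```python
def carry_propagate_add(S, C, ci, n):
    """Add n-bit values S and C with carry-in ci; return (T, cpaco)."""
    T = 0
    carry = ci
    for k in range(n):
        a = (S >> k) & 1
        b = (C >> k) & 1
        T |= (a ^ b ^ carry) << k                    # sum bit of this position
        carry = (a & b) | (a & carry) | (b & carry)  # one if two or more inputs are one
    return T, carry


if __name__ == "__main__":
    n = 8
    T, cpaco = carry_propagate_add(0xFF, 0xFF, 1, n)
    # Relation CPA1: cpaco*(2^n) + T == S + C + ci
    print(cpaco * (2 ** n) + T == 0xFF + 0xFF + 1)
```

Because S + C + ci is at most 2^(n+1) - 1, a single carry out bit is always sufficient.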
The adder of
Terminals V0, C0, D0, V1, C1, D1, and V2 can be used as outputs or inputs, and are connected through boxes 251 containing a “U V” to indicate that signals may be directed in either direction (similar to the boxes 151 of
The compressor circuit 267 of
The carry propagate circuit 268 of
Those skilled in the art understand that in an architecture containing an arithmetic-capable RAU and vertical switchboxes according to
A further enhancement consists of providing a signed compressor circuit which is capable of adding and subtracting, the latter operation obtained by first creating the two's complement version of the input that requires subtracting.
The compression tree described above may be further enhanced by designing the various stages so that they may be reconfigured in a manner similar to the reconfigurable XOR tree arrangement. More specifically,
The circuit 330 is one reconfigurable circuit which may be utilized as the counter circuit 170 in the first level of compression. The circuit 330 utilizes a plurality of NXOR/AND circuits 231. These circuits 231 may be the type such as circuits 32 illustrated in
As may be seen, the circuit 330 includes a selectable input NAND gate at the carryin input cib which can be used to fix the internal active-low carryin input acib to a logic ONE, hence breaking the carryout propagation between counter circuits in adjacent stages of the compression tree. The circuit 330 also includes a selectable output NAND gate before the carryout output co, which can be used to fix the active-output co to a logic ONE value, which may be used to enforce a carryin bit into the adjacent counter circuit of the next stage, provided that the corresponding selectable input NAND gate has a select set to ONE.
Similarly, the circuit 330 of
The circuit 380 is one reconfigurable circuit which may be utilized as the counter circuit 170 in the second level of compression. The circuit 380 utilizes a plurality of NXOR/AND circuits 381 which may also be the type such as circuits 32 illustrated in
Thus, the configurable circuits 330 and 380 illustrated may be utilized to provide the functions of the compression tree described above. In addition to these functions, however, the circuits 330 and 380 (and other circuits designed to provide similar results) allow the achievement of the results provided by the reconfigurable XOR tree described above. This will be apparent by considering the circuit illustrated in
Vertical switchboxes 40x of
Horizontal switchboxes 30x may be the circuits of
In this discussion wire segments are eight bit buses. In
By configuring the switchbox configuration muxes and tristate buffers and the parity masking bits of RAUs 50x, inputs representing numerical values may be sent into the inputs of neighboring RAUs, to produce the multiplication result according to the described operation, in carry-saved format, on outputs Y and Z. For example, in order to perform a complete multiplication of two 8-bit operands, one operand is furnished to the A input of two horizontally adjacent RAUs; and the other operand is furnished to the input D of the rightmost RAU 50c as well as to the input E of the leftmost RAU 50a, while the remaining inputs E of the RAU 50c and D of RAU 50a are both set to be zeros. Those skilled in the art will recognize that carry signals produced on multiplier carry output bus mco of RAU 50c need to be furnished to multiplier carry input bus mci. This is obtained through a hardwired, non-reconfigurable connection mc2. Similar connections mcx are present between horizontally adjacent RAUs in the fabric. In
Similarly, the 4-to-2 compressors (
out = (inA*inB + inC*inD) mod (2^16) [Eq. N1]
where inA, inB, inC and inD are unsigned eight-bit integer input values, out is an unsigned sixteen-bit output value according to the computation, and *, + and mod stand for arithmetic multiplication, addition, and modulo, respectively. In order to fit the result into a sixteen-bit value, the result bits beyond bit sixteen are ignored, which is equivalent to performing the modulo (2^16) operation. In the example, the out value is split into two eight-bit portions according to:
out=outHigh*256+outLow [Eq. N2]
The first step of the implementation is to provide input pairs (inA, inB) and (inC, inD) each to neighboring RAUs to implement the multiplication in carry-save format.
Input inA is sent in on horizontal wire segment h16b0. Horizontal switchbox 30a is configured to send inA on to h8a2, which is connected to input A of RAU 50a. Horizontal switchbox 30c is configured to send inA to input A of RAU 50c. Input inC is furnished to input A of RAUs 50b and 50d through similar configuration of horizontal switchboxes 30b and 30d.
Input inB on vertical wire segment v8a4 is furnished to input D of RAU 50c and input E of RAU 50a through configuration of vertical switchbox 40d and horizontal switch box 30c. Similarly, inD on vertical wire segment v8a3 is furnished to input E of RAU 50b and D of RAU 50d through configuration of vertical switchbox 40c and horizontal switch box 30d.
Inputs D of RAUs 50a and 50b, and inputs E of RAUs 50c and 50d, are set to constant zeros as required for the intended multiplication, through configuration of horizontal switchboxes 30a, 30b, 30e, and 30f, respectively.
The eight-bit carry C and sum S outputs of RAUs 50c and 50d are driven onto vertical wire segments v8b5 and v8a5, and v8b6 and v8a6, respectively. Vertical switchbox 40e is configured such that inputs Y0, Z0, Y1, and Z1 are propagated to compressor circuit 44e which in turn sends its S and C outputs to the carry propagate adder 42e. Finally, vertical switchbox 40e is configured to drive the output T of carry propagate adder 42e on vertical wire segment v16b4.
Compressor 44e and carry propagate adder 42e of vertical switchbox 40e, as well as RAUs 50c and 50d, are each configured to ignore their respective carry-in inputs. RAUs 50a and 50b are configured to use their carry-in inputs mci, which are hardwired to the corresponding carry-out outputs mco of RAUs 50c and 50d, respectively, to achieve a multiplier compression over a width of sixteen bits in RAU pairs 50a and 50c, and 50b and 50d, respectively.
Vertical switchbox 40b is configured in a similar way as vertical switchbox 40e, so that inputs Y0, Z0, Y1, and Z1 are propagated to compressor circuit 44b of which the outputs S and C in turn are propagated to carry propagate adder 42b, furnishing its output T on vertical wire segment v16b1. However, unlike vertical switchbox 40e, compressor 44b and carry propagate adder 42b are both configured to use their carry-in inputs cmpci and cpaci, respectively, which may be hardwired to the corresponding carry-out outputs cmpco and cpaco of compressor 44e and carry propagate adder 42e. Thus, a sixteen bit wide compression and carry propagate addition is achieved in vertical switchboxes 40b and 40e.
Vertical switchboxes 40c and 40f propagate the values among vertical wire segments, from v16b1 to v16b2, and from v16b4 to v16b5, respectively, producing the 16-bit output pair (outHigh, outLow), which corresponds to the values of equation N2 above.
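The data flow of this example (two multiplications left in carry-save form, a 4-to-2 compression of the two sum and carry pairs, and a final carry propagate addition) can be sketched arithmetically. The following is a word-level illustration only, not a model of the switchbox routing; the names csa, multiply_carry_save, and multiply_accumulate are hypothetical, and sixteen bit words are assumed throughout.

```python
MASK16 = 0xFFFF  # results beyond sixteen bits are ignored (mod 2^16)


def csa(a, b, c):
    """3-to-2 carry-save step: a + b + c == s + cy (mod 2^16)."""
    s = a ^ b ^ c
    cy = ((a & b) | (a & c) | (b & c)) << 1
    return s & MASK16, cy & MASK16


def multiply_carry_save(x, y):
    """Eight by eight multiplication left as a (sum, carry) pair."""
    s, c = 0, 0
    for i in range(8):
        # Row of AND gates: x gated by bit i of y, placed at bit weight i.
        partial = (x << i) if (y >> i) & 1 else 0
        s, c = csa(s, c, partial)
    return s, c


def multiply_accumulate(inA, inB, inC, inD):
    """Return (outHigh, outLow) per Eq. N1 and Eq. N2."""
    s0, c0 = multiply_carry_save(inA, inB)
    s1, c1 = multiply_carry_save(inC, inD)
    # 4-to-2 compression of the two carry-save pairs.
    s, c = csa(s0, c0, s1)
    s, c = csa(s, c, c1)
    out = (s + c) & MASK16  # carry propagate addition
    return out >> 8, out & 0xFF
```

For instance, multiply_accumulate(255, 255, 255, 255) corresponds to 130050 mod 2^16 = 64514, that is, outHigh = 252 and outLow = 2.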
The invention may further benefit from the use of a reconfigurable Carry Propagate Adder that has a capability of operating as a Priority Encoder. A High-performance Encoder with Priority Lookahead, J. G. Delgado-Frias and J. Nyathi, IEEE Transactions on Circuits and Systems I: Fundamental Theory and Applications, Volume 47, Issue 9, September 2000, pages 1390-1393, presents the formulas involved in designing a priority encoding circuit. Those skilled in the art will recognize that these formulas are similar to the propagate (P) formulas commonly used in carry propagate adder designs (see Principles of CMOS VLSI Design, A System Perspective, N. Weste and K. Eshraghian, 1988, p 326-331) and that substantially similar circuits may be employed to implement a carry propagate adder and a priority encoder circuit in a single circuit.
A regular arithmetic capable RAU outfitted with a reconfigurable compression tree may be operated in content addressable memory mode as described with
Although the present invention has been described in terms of a preferred embodiment, it will be appreciated that various modifications and alterations might be made by those skilled in the art without departing from the spirit and scope of the invention. The invention should therefore be measured in terms of the claims which follow.