1. Field of the Invention
This invention relates to electronic circuitry and, more particularly, to reconfigurable circuitry which may be utilized to carry out a number of different logical operations on data.
2. History of the Prior Art
The manipulation of data in modern electronic systems requires the transmission of data from place to place, for example, within a local network or over the internet. The speed of data transmission has increased (and continues to increase) as various new forms of hardware are created and older forms are improved. However, no matter how rapidly data is transmitted, it is only useful if transmitted without errors.
There are a number of methods for testing to determine if data has been correctly transmitted. One of these methods makes use of what is referred to as a cyclical redundancy check (CRC) to ascertain whether data being received is what was in fact sent. Typically, at the sending station the cyclical redundancy check process adds to the data stream of a message a sequence of bits generated from that data stream then determines at the receiving station whether the message with the added bits is correct. Since the bits added at the sending station are determined by the content of the message being sent, those added bits may be tested against the message to determine (within reason) whether the message received is correct.
Historically, the cyclical redundancy check was conducted on a bit serial basis as the data was received. Such a process functions well when the bits of a message are transmitted serially and at slower speeds. However, as faster transmission speeds are attained (including those attained by parallel bit transmission) such a test requires a relatively large amount of time and significantly reduces the overall speed of message transmission. Recently, software techniques have been devised for handling portions of the operation in parallel to decrease the time required for the cyclical redundancy check. For example, a paper entitled Fast Parallel CRC Algorithm and Implementation on a Configurable Processor, Ji and Killian, ICC2002-IEEE International Conference on Communications, vol. 25, no. 1, April 2002, pp. 1813-17, demonstrates an operation by which a cyclical redundancy check is accomplished on a message of indeterminate length by testing message portions of consistent segment lengths and then combining the results of the tests on the message portions.
Although such techniques have decreased the time required for the cyclical redundancy check, even more transmission speed can be attained by a hardware solution. However, once chip area is allocated to a hardware solution, the circuitry is typically useful for accomplishing only the limited purposes for which it was devised.
To this end, it is desirable to enhance the speed of data transmission by hardware techniques which may be utilized to carry out a large variety of different logical operations on data.
It is an object of the present invention to enhance the speed of data transmission by providing circuitry which may be utilized for a variety of purposes including providing and testing the correctness of cyclical redundancy check values.
The present invention is realized by a reconfigurable arithmetic circuit including a plurality of logical AND gates arranged in logical columns and rows, a plurality of conductors each connected to furnish input to the AND gates of a row, an array of memory cells each connected to furnish input to one of the AND gates, and a plurality of reconfigurable counting circuits, each counting circuit connected to receive the output of each of the AND gates in a column, each counting circuit being configurable to provide a count of parity of the outputs furnished by the AND gates of the column.
These and other objects and features of the invention will be better understood by reference to the detailed description which follows taken together with the drawings in which like elements are referred to by like designations throughout the several views. It is to be understood that, in some instances, various details of the invention may be shown exaggerated or otherwise modified to facilitate an understanding of the invention. Moreover, some aspects of the invention considered to be conventional may not be shown so as to avoid obfuscating more important aspects or features of the invention.
As has been pointed out, a cyclical redundancy check is used to determine that data which has been received is the data which was, in fact, sent. A cyclical redundancy check usually appends a sequence of bits to the data stream of a message at the sending station then determines whether the message including the appended bits is correct at the receiving station. The appended bits are generated at the sending station based on the message being sent. Consequently, the appended bits may be tested against the message received to determine (within reason) whether it is correct.
A cyclical redundancy check may be utilized in a great many aspects of data transmission. It may be used to determine that data sent over the internet or other communications has been correctly received. It may be used for the same purpose in local area networks. It may be used to determine within a specific piece of data manipulation hardware (such as a computer) whether data is correctly transmitted and received by the different components of that piece of hardware.
Because a cyclical redundancy check may be (and often is) used in so many aspects of data transmission, the speed with which the check takes place necessarily affects the speed at which data may be transmitted. Classically, the operations of generating and later testing a cyclical redundancy value appended to serially transmitted data were accomplished a bit at a time as the data was transferred. As data transmission speed has increased by improvements such as increasing the width of the data path, such a method has become much too time consuming.
Consequently, methods have been devised for generating and testing a cyclical redundancy value while handling bit portions of the transmitted data in parallel. Typically, these methods are accomplished by executing a specified algorithm on a processor to manipulate the data to generate or check a cyclical redundancy value.
As with many elements of data manipulation, a hardware solution can provide results faster than can a software solution. However, a hardware solution is typically not as flexible as a software solution.
The present invention offers a hardware solution to the operations of obtaining and checking cyclical redundancy values which solution overcomes the limitations of prior art solutions. More particularly, the hardware of the present invention is uniquely reconfigurable so that it may be used for a variety of different forms of cyclical redundancy check computations and may be converted to accomplish a plurality of other logical functions as well.
In order to understand the invention, it will be useful to understand how a cyclical redundancy check is carried out. First, an introduction to the Galois field framework is useful.
The cyclical redundancy check of concern is implemented utilizing Galois field arithmetic in an extension field GF(2N) of the finite field referred to as GF(2). In GF(2), which has elements that can be represent by single binary digits, the addition operation is a modulo 2 addition which is equivalent to an exclusive OR (XOR) operation, and the multiplication operation is equivalent to an AND operation. Because of these equivalences, Galois field arithmetic is useful in implementing data operations. Elements of the extension field GF(2N) may be represented as single-digit binary coefficients of a polynomial in x of degree-bound N, or as a binary number of N digits. The addition operation in GF(2N) is obtained by performing the addition in GF(2) on the coefficients of equal power of x, in the polynomial representation, or as a bit-wise XOR, in the binary number representation.
In order to define multiplication in GF(2N), a primitive polynomial G(x) must be selected. This primitive polynomial G(x) is a polynomial of degree N that is prime (i.e., it can not be factored into smaller polynomials) and that has at least one zero that is a generator of the field GF(2N). A generator of GF(2N) is an element that, when raised to the powers 1 through 2N sequentially, will cycle through each non-zero element of GF(2N) exactly once [see Fast CRC Calculation, R. J. Glaise and X. Jacquart, Proceedings of the IEEE International Conference on Computer Design: VLSI in Computers, pp. 602-605, 1993. (IBM)]. In GF(2N) with primitive polynomial G(x), any intermediate result R(x) that has a polynomial representation of degree N or higher must be folded back into a polynomial of degree-bound N by calculating its value modulo G(x) (further referred to as mod G(x)), which is obtained by performing a polynomial division of R(x) by G(x) and determining the remainder. In order to calculate the multiplication of two elements A(x) and B(x) of GF(2N), the raw polynomial multiplication C(x), which is of degree-bound 2N, is first calculated; and then the remainder R(x)=C(x) mod G(x) is calculated.
In its simplest form, the basic operation of a cyclical redundancy check transmitter is to extend the message data with a number of zeroes matching the order of the primitive polynomial as the least significant bits; and then, while treating the obtained value as a polynomial with binary coefficients, determine the modulo G(x) remainder (which is the CRC value), and replace the padded zeroes by this CRC value in the data that is being transmitted. The data is transmitted most-significant-bit first, so the CRC value follows after the original message has been transmitted. The result is a data stream having a value which is an exact multiple of the primitive polynomial. When the data is received, the data including the CRC value in polynomial representation is divided by the same primitive polynomial used in generation of the CRC value. If the remainder of this division is zero, it is highly probable that the data has been transmitted and received correctly. The hardware required for implementing the CRC transmitter and receiver is largely identical.
It should be noted that in this scheme, the same primitive polynomial is used to accomplish the division both in generating and in testing the cyclical redundancy check value. Thus, the value used for the division need not be part of the information transferred with a message.
The polynomial division modulo G(x) can easily be serialized; and, in fact, the first methods for accomplishing a cyclical redundancy check using Galois field arithmetic operated serially upon the sequential bits of a message (and the message with an appended cyclical redundancy check value). As has been pointed out, such an operation functions well for sequentially transmitted messages but slows faster transmission methods. Consequently, methods for handling sequences of bits in parallel have been devised. Typically, the methods involve selecting from a message bit portions or segments having a common bit width, treating those segments individually to modulo G(x) division by means of software executing on a processor, and combining the results to complete the cyclical redundancy check for any message being transmitted whatever its data width. Examples of such methods are illustrated in Fast Parallel CRC Algorithm and Implementation on a Configurable Processor, H. M. Ji and E. Killian, ICC2002—IEEE International Conference on Communications, vol. 25, no. 1, April 2002, pp. 1813-1817.
The present invention is primarily concerned with calculating the CRC value serially on message portions that have a length that is a multiple of the size of the primitive polynomial G(x). The following discussion details how this can be achieved. The Galois field operations can be efficiently implemented according to the present invention.
Assume now a message M consisting of L binary values, where L is very large (L>>N). This message can be represented by a polynomial M(x) of degree-bound L.
To illustrate this, consider the Galois field GF(232) defined by the primitive polynomial:
G(x)=x32+x26+x23+x22+x16+x12+x11+x10+x8+x7+x5+x4+x2+x1+x0 [Eq. 1]
as defined for use in The Ethernet, A Local Area Network, Data Link Layer and Physical Layer Specifications, Digital Equipment Corporation, Intel Corporation and Xerox Corporation, Stamford, Conn., Version 1.0. Sep. 30, 1980, page 22.
Although this example utilizes N=32 and treats CRC32 using the example polynomial, the results are intended for the general case.
Elements of GF(232) can be represented as binary thirty-two bit vectors, e.g., A[31:0], B[31:0] or as the equivalent polynomials in GF(232) as follows:
A(x)=A[31] x31+ . . . +A[1] x1+A[0] x0 [Eq. 2]
B(x)=B[31] x31+ . . . +B[1] x1+B[0] x0 [Eq. 3]
Similarly, G(x) has a corresponding thirty-three bit binary value:
G[32:0]=1 00000100 11000001 00011101 10110111 [Eq. 4]
A polynomial A(x) can be multiplied by a polynomial B(x) for the case where B(x) is fixed. This is also referred to as Galois field scaling.
To scale A[31:0] by B[31:0] in GF(232) defined by G(x) is equivalent to finding:
R(x)=(A(x)*B(x)) mod G(x) [Eq. 5]
where mod G(x) is the modulo G(x) operation, or the remainder of the operand after polynomial division by G(x) in GF(232).
The following derivation relies on properties of the modulo G(x) operator:
(A(x)+B(x)) mod G(x)=A(x) mod G(x)+B(x) mod G(x)
and, for a polynomial A(x) of order 31 or less:
(A(x)*B(x)) mod G(x)=A(x)*(B(x) mod G(x))) modG(x).
Equation 5 can be rewritten as:
Which in turn equals:
=A[0]*F[0](x)
+A[1]*F[1](x) . . .
+A[31]*F[31](x) [Eq. 8]
Or:
=F(x)*A(x) [Eq. 9]
where F(x) is a vector of polynomials [F[0](x) . . . F[31](x)], and where each F[i](x)=(xi*B(x)) mod G(x). In other words, F[i](x) is the remainder of the polynomial division of xi*B(x) by G(x). In recursive form this yields:
F[i](x)=(x*F[i-1](x)) mod G(x) [Eq. 10]
Since F[i−1](x) mod G(x)=F[i−1](x), x*F[i−1](x) can only be of order 32, in which case a simple XOR of x*F[i−1](x) with G(x) suffices to calculate the mod G(x) remainder.
Hence, a recursive method to pre-calculate the coefficients F[i][31:0] of F[i](x) is as follows:
For each vector F[i][31:0], this describes a left shift of F[i−1][31:0], with conditional XOR with the coefficients of the primitive polynomial G(x) if F[i−1][31] is 1. An example of values for the matrix F[31:0][31:0] for scaling by B(x)=x32 for the primitive polynomial G(x) of [Equation 1], obtained using this method, is shown in
R[i]=F[i][0]*A[0]+F[i][1] *A[1] . . . F[i][31] *A[31] [Eq. 11]
where in this case “*” is AND and “+” is XOR according to GF(2) arithmetic.
Hence, the result R[31:0] of scaling of a variable A[31:0] by a constant in the Galois field GF(232) can be obtained as thirty-two wide XOR operations on different subsets of elements of the variable A[31:0], where inclusion in a particular XOR operation is predetermined. Such a calculation can be performed using a structure according to the present invention.
“Designing TCP/IP Functions in FPGAs”, W. Lu, MSc Thesis, Delft, The Netherlands, August 2003, pp. 34-38, shows how scaling in GF(232) can be used for cyclical redundancy check, in the case of thirty-two bit message segments. Let A(x) be a new message segment of thirty-two bits, and part of a larger message M(x). Let CRCPrev(x) be the CRC calculated for portion P(x) of the message M(x) up to but not including A(x), then:
CRCPrev(x)=x32 P(x) mod G(x) [Eq. 12]
Let CRCNew(x) be the CRC calculated for the portion of the message M(x) up to and including A(x):
CRCNew(x)=x32 (x32 P(x)+A(x)) mod G(x) [Eq. 13]
This can be rewritten in recursive format, using the earlier mentioned properties of mod G(x), as:
CRCNew(x)=((CRCPrev(x)+A(x)) x32) mod G(x) [Eq. 14]
Equation 14 shows that a CRC can be calculated iteratively using bitwise XOR followed by Galois field scaling.
Next, this result is combined with the derivation from Fast Parallel CRC Algorithm and Implementation on a Configurable Processor, referred to above, to show how to calculate the CRC value iteratively, using scaling, on message portions that have lengths that are a multiple of the order of the primitive polynomial. For example, let A(x) be a sixty-four bit message portion, consisting of two thirty-two bit portions A0(x) and A1(x) such that:
A(x)=x32 A1(x)+A0(x) [Eq. 17]
Meaning that A1(x) is the first arriving thirty-two bit portion.
Then, keeping the same notation as before:
This shows that the thirty-two bit CRC value can be obtained iteratively for sixty-four bit wide message chunks, using bit-wise XOR and GF scaling by either x32 mod G(x) or by x64 mod G(x). The matrices F containing values corresponding to these scaling amounts can be predetermined. If stored, these values are immediately available to accomplish Galois field operations at both the transmission and reception of a message. This greatly enhances the speed of the division operation.
Concatenating a message by the remainder from modulo G(x) division provides a message value which has a remainder of zero when again divided by the same divisor. Consequently, a remainder of zero from the second division, performed at the receiver, suggests that the data has been correctly transmitted.
The present invention provides a hardware solution which produces more rapid results than the software solution of the prior art and which may be reconfigured to allow a number of distinct logical operations in addition to cyclical redundancy check scaling of different types.
The arithmetic unit 10 includes a plurality of AND gates 12 logically arranged in rows and columns. In this description, “logical arrangement” means that the circuitry functions as though the devices were physically arranged in the manner illustrated even though the individual elements may be physically positioned differently. In the basic embodiment, each row and each column of the arithmetic unit 10 includes a number of AND gates 12. The number of AND gates in a column of a basic unit may be approximately equal to a convenient number of bits of a typical input message (i.e., ho-h31). That is, if an input message typically may be divided into groups each of which includes thirty-two bits, then each column of the arithmetic unit 10 has thirty-two AND gates. On the other hand, although the number of AND gates in each row of the arithmetic unit 10 may also be thirty-two, this number is selected based on the largest polynomial to be used in the cyclical redundancy check. Moreover, depending on the actual details of the mathematical operations being conducted, a larger or smaller number of AND gates 12 might be utilized in each row and column; and extra AND gates 12 might be added to each row or column for purposes such as a parity check.
In the arithmetic unit 10, each AND gate 12 in a row receives the same bit of the message as an input. Each AND gate 12 in a column receives a logically sequentially increasing (or decreasing) bit of the message as an input. Each AND gate 12 in a column also receives a second value (referred to as a ∂parity masking bit”) which is a logically sequentially increasing (or decreasing) bit of a Galois field value. The Galois field values furnished to the AND gates 12 of each sequential column of the arithmetic unit 10 are the sequential Galois field values computed for the particular message length. For Galois field scaling as described above, these values are typically constants which may be precomputed and stored for ready use in parity mask memory cells associated with the AND gates 12. For a thirty-two bit message portion, the associated Galois field scaling M(x) multiplied by (x32) mod G(x) can be obtained by using as parity mask memory values the values shown in the table of
It should be noted that the particular primitive polynomial for thirty-two bits is usually referred to as CRC32 and is the polynomial used in Ethernet communications selected by an industry standards committee and described in IEEE 802.3-2002, Section 3.8. The CRC32 polynomial is that presented in Equation 1, above, and represented in binary form in Equation 4 above.
As may be visualized, when a thirty-two bit message is presented to the arithmetic unit 10, the sequential bits ho-h31 of that message appear at the input terminals of the sequentially-positioned AND gates 12 of each column simultaneously. Each of the AND gates 12 of a column also receives one of the sequential bits bi according to the selected manipulation. In the basic case of Galois field scaling, bits bi are assigned to Galois field values such as those shown in
Thus, in a single operation, the results of the manipulation of the message bits by each of the Galois field values of
Associated with each logical column of the arithmetic unit 12 is a counting circuit comprising an exclusive OR (XOR) tree 14 (see the block diagram of
Those skilled in the art will recognize that both the logical AND function and the logical XOR function may be performed by many different circuits. Apart from the specific devices of which a logic circuit is composed, different logical steps may be performed by stages of two different circuits which provide the same ultimate result. For example, both a basic AND circuit and a NAND circuit with an inverted output provide a logical AND function.
Moreover, the following properties based on De Morgan's law may be relied upon. First, an XOR circuit with both of its inputs inverted produces the same result as an XOR circuit that does not have its inputs inverted. As a result, the XOR of the output of two AND circuits produces the same result as the XOR of the output of two NAND circuits of the same input signals. Similarly, an XOR of the output of two XOR circuits produces the same result as an XOR of the output of two NOT-XOR (NXOR) circuits of the same input signals. Hence, AND gates followed by a binary XOR tree produce the same result as NAND gates followed by a binary XOR tree, and produce the inverted result of AND or NAND gates followed by a binary NXOR tree. Secondly, the OR function with its inputs inverted is equivalent to the NAND function of its inputs. Hence, an AND circuit followed by an OR function produces the same result as a NAND function followed by another NAND function.
In addition, when a plurality of stages of logical operations are involved in producing a particular logical result, various of the manipulations such as is inversions particular to a specific circuit may be included within others of the stages yet produce the same results. These characteristics are utilized in providing a number of advantages of the present invention. In order to facilitate an understanding of the different aspects of the invention, particular logical operations which result from a particular configuration of the circuits utilized are referred to as “logical functions” (e.g., the logical AND function, the logical OR function) no matter which specific circuit performs the function. Consequently, the scope of the invention should be considered to include the various different circuits which may be utilized to carry out the referenced logic functions.
It should be noted that the particular arrangement illustrated in
More particularly, the tree 31 illustrated in
This may be appreciated by considering the results at each stage of the tree. Presuming that the physical arrangement utilizes NAND gates to furnish the inputs a-h and that all of the SEL values are set to ZERO so that the circuits 32 function as NXOR gates and the circuit 34 as an inverter, then the outputs of the circuits 32 at the first stage of the tree are NXOR(a,b), NXOR(c,d), NXOR(e,f), NXOR(g,h). Then, the outputs of the circuits 32 at the second stage of the tree are NXOR (NXOR(a,b), NXOR(c,d)), and NXOR (NXOR(e,f), NXOR(g,h)). These values of outputs of the circuits 32 at the second stage reduce to NXOR (a,b,c,d) and NXOR (e,f,g,h). Then, the output at the third stage of the tree becomes NXOR (a,b,c,d,e,f,g,h). When inverted by the circuit 34, the result is XOR (a,b,c,d,e,f,g,h).
The truth table of
In order to allow the tree to perform the different operations, circuits 32 and 34 may be utilized such as those illustrated in
On the other hand, when the SEL setting is chosen to be ZERO, the NOR gate functions to invert the input value B. This causes the multiplexer to furnish an output which is that of a NXOR circuit. Thus, depending on its particular configuration, the circuit 32 of
The circuit 34 which is the last stage of the tree of
The results produced by the XOR trees 14 of
Using a plurality of reconfigurable arithmetic units allows longer messages to be handled in parallel. It has been shown in the Fast Parallel CRC Algorithm and Implementation on a Configurable Processor publication and in the discussion above regarding equation 5 that it is possible to divide messages longer than thirty-two bits into thirty-two bit segments and handle in parallel the processing of the cyclical redundancy check values for of those individual segments. The results of processing the individual segments may then be combined to provide a result for the entire message.
In one embodiment of the present invention (see
Similarly, larger messages may be handled in series of one hundred twenty-eight bit segments furnished to the four reconfigurable arithmetic units provided for handling one hundred twenty-eight bits; since the operation is modulo 2, the remainder values wrap into the larger messages in a similar way as suggested by equation 14 for the sixty-four bit case.
It should be noted that a practical arrangement for handling messages of varying lengths might include additional circuitry. For example, If the messages to be handled are guaranteed to be multiples of thirty-two bits, four RAUs and additional multiplexers permits the last message portion to be sent into any of four, three, two, or one RAU and padded with zeroes. For messages which are guaranteed to be a multiple of eight bits in length (the most common form in Ethernet communication), an additional fifth RAU and additional multiplexers may be used to perform a necessary length correction step that can also be reduced to a Galois field scaling step as known to those skilled in the art.
Moreover, a practical arrangement might include two complete sets of four (or five) reconfigurable arithmetic units in order to accomplish the processing of both incoming and outgoing messages.
Although the values used in the Galois field manipulation are usually well known and may therefore be precomputed, the basic arithmetic unit 10 of the present invention allows a number of variations. The arrangement utilizes programmable inputs to the AND gates 12 for values generated using the primitive polynomials. Because these values are programmable, the use of the circuitry may be changed from simply accomplishing the cyclical redundancy check utilizing a constant primitive polynomial to other uses. By utilizing AND gates 12 having a variable input B, a different value may be provided in place of a standard internet Galois field value. For example, if another polynomial is utilized in the cyclical redundancy check manipulation, then the Galois field value used at the receiver may be different than that which has been precomputed. In such a case, the Galois field polynomial must be transferred from the transmitting station to the receiving station in order to accomplish the cyclical redundancy check operations. With the AND gate parity masking bit input to the arithmetic unit 10 of the present invention, this operation is easily accomplished since the Galois field values utilized in the manipulation may be readily varied.
The ability to modify the parity bit input values may also be useful for other than the determination of cyclical redundancy check values. Thus, the arithmetic unit 10 may be utilized to generate hash values useful in various operations of the circuitry with which the arithmetic unit 10 is associated. (e.g., see A Performance Study of Hashing Functions for Hardware Applications, M. Ramakrishna, E. Fu, and E. Bahcekapili, Proc. 6th. Int'l. Conf. Computing and Information, 1994, pp. 1621-36).
Another advantage of the invention is the ability to utilize primitive polynomials other than that for thirty-two bits in the computations. There are a number of additional cyclical redundancy check verifications having associated primitive polynomials which are utilized for other purposes. For example, there are also primitive polynomials which have been selected for cyclical redundancy check verifications for eight, ten, twelve, and sixteen bit messages. Because the B inputs of the AND gates 12 of the arithmetic unit 10 are changeable, the arrangement provides the ability to work with these additional cyclical redundancy check verifications in smaller portions of the AND-XOR array. For example, since the values furnished to the B inputs may be controlled, it is possible to utilize sixteen bit by sixteen bit portions of the array to test for other than internet cyclical redundancy check values. The same ability allows even smaller portions such as eight by eight bit portions of the arrays to be utilized for similar purposes.
More importantly, providing additional inputs to sub-portions of the AND gates allows the individual sub-portions of the array to be associated with one another in a manner that the parts may be made to cooperate to provide a whole that may be easily manipulated. For example, the array may be utilized in a manner that four individual sixteen by sixteen arrays are provided. By dividing the individual inputs of the message and the added inputs into sixteen bit sections, all of these arrays may be made to function so that they provide results based on individual sixteen bit inputs. Moreover, the outputs of these sub-arrays may be combined and utilized in a manner so that effectively four individual sixteen by sixteen operations are being conducted in parallel. Of course, the same in true of the smaller sub-arrays such as those of eight by eight bits as described above.
In order to obtain the aforementioned advantages, it is desirable to provide inputs to and take outputs from the individual sub-portions of the array. In order to accomplish this, it is useful to provide additional horizontal input buses to eight-by-eight and to sixteen-by-sixteen subsections of the reconfigurable arithmetic unit. These inputs may be made programmable so that individual eight-by-eight or sixteen-by-sixteen sub-sections may be utilized as well as the full thirty-two by thirty-two bit array. It should be noted that, assuming that the entire reconfigurable arithmetic unit is provided the XOR trees discussed above for each column, then these XOR trees will, in fact, function to provide the desired result without additional change.
The operations which the invention may be utilized to accomplish are increased by the ability to utilize sub-portions of the thirty-two by thirty-two bit array or other convenient sized array and the enhancements illustrated in an embodiment of the invention shown in
The provision of the additional inputs to the AND gates also allows the reconfigurable arithmetic unit to be utilized for general Galois field multiplication.
General Galois field multiplication means the GF multiplication
A(x) B(x) mod G(x)
in which both operands A(x) and B(x) are variable, so that the method of precalculating a table that was presented for Galois field scaling does not apply. A typical application for this is in Reed-Solomon decoders [Blahut], and a typical implementation uses the extension Galois field GF(28). For the following explanation, variables are elements of GF(28) and a primitive polynomial of order eight, for example [Ref BBC Whitepaper WHP031, p. 8, Eq. 8]:
G(x)=x8+x4+x3+x2+x0 [Eq. M1]
In reference [Gill 1992], it is shown how a multiplication of two operands A(x) and B(x) may be accomplished in a two step process, consisting of first, the calculation of a raw product polynomial T(x) that is of degree-bound 16, followed by a scaling of the top portion of T(x) by a fixed value, and bit-wise XOR of that value with the lower portion of T(x). For completeness, that derivation is repeated in the following paragraphs.
Input operands and result in GF(28) represented by the polynomials A(x), B(x) or the vectors A[7:0], B[7:0], and a result R(x) or R[7:0]:
First consider the raw multiplication result Prod(x) in polynomial form. This is a polynomial of order fourteen:
Prod(x) is hence obtained by calculating the partial products for every combination of elements of A[0:7] and B[0:7], and performing a bit-wise XOR for partial products corresponding to equal powers of x.
This polynomial Prod(x) can be split in a lower half ProdLow(x), containing the coefficients corresponding to the powers (0 . . . 7) of x, and a higher half ProdHigh(x), containing the coefficients corresponding to the powers (8 . . . 14), but with x8 divided out so that:
Prod(x)=x8*ProdHigh(x)+ProdLow(x) [Eq. M3]
where:
Note that here, the 8th coefficient of ProdHigh(x) is always zero but it is added so that all buses used in the calculation can be multiples of eight bits.
The remainder of Prod(x) divided by G(x) is the desired result polynomial R(x). Since ProdLow(x) mod G(x)=ProdLow(x) and ProdHigh(x) mod G(x)=ProdHigh(x):
Where:
T(x)=[x8 mod G(x), x9 modG(x), . . . , x14 mod G(x), x15 mod G(x))], [Eq. M9]
The coefficients of which can be precalculated since T(x) is independent of either operand.
Hence the coefficients R[7:0] of R(x) are obtained from:
First, three sub-portions of the array 61, 62, and 63 are chosen to accomplish the operation. Each of these sub-portions may have an array size equal to the values to be utilized. That is, since values of eight bits are to be multiplied, the sub-portions of the array chosen are eight-by-eight in size. Typically a byte is the smallest useful segment of data which might be manipulated. Before input values are furnished to the array, inputs to certain of the AND gates in two of the sub-portions are rendered inoperative by zeroing the inputs provided on the diagonal input conductors. As may be seen in
One of the values to be multiplied is furnished on the horizontal input lines to the two sub-portions 61 and 62 while the other value to be multiplied is furnished to the same sub-portions on the diagonal lines which are not furnishing disabling inputs. Thus all of the AND gates of all of the columns of the two sub-portions 61 and 62 receive the same input values on the horizontal lines; while, the value furnished on the diagonal lines increases by one bit weight with each column, proceeding from left to right. When the results produced by the AND gates are manipulated by the XOR trees of each column, a sixteen bit result consisting of a fifteen bit raw multiplication result padded in the most significand bit position by one 0 bit is provided by the two sub-portions.
To reduce this fifteen bit result to a GF (2**8) value, the high order bits provided as output by the XOR trees of sub-portion 61 are furnished on horizontal lines MpyH to the sub-portion 63. In sub-portion 63, these high order bits are scaled utilizing a matrix of values for a primitive polynomial for GF (2**8) Galois field operations. The results provided from the XOR trees of the columns of the sub-portion 63 are then combined (XORed) with the results provided from the XOR trees of the columns of the sub-portion 62 to provide a final result in GF (2**8) form.
Each of the RAUs 50 illustrated in
In discussing the embodiment of
The input signals available to each of the RAUs 50 include horizontal input values furnished on input buses prefaced by an “h” such as h8a2 furnished to RAU 50a. The nomenclature utilized indicates that the eight bit bus furnishes signals to a number of columns convenient to the particular size of the RAUs. Those buses including an “8” furnish signals across eight columns, while those including a “16” furnish signals across sixteen columns (across two RAUs). The latter buses are used to provide signals to two RAUs at the same time. The input signals available to each of the RAUs 50 also include two different values furnished on input buses connected to diagonal inputs of the RAU 50a. These are furnished on input buses prefaced by a “d” such as d8a0 and d8b3 which connect to the RAU 50a.
Output values are furnished by the RAUs 50 on buses prefaced by a “v” such as bus v8a1 joining RAU 50a.
Each of the horizontal switch boxes 30 receives input signals on three eight is bit input buses c8xx, h8xx, and d8xx and on one eight bit input bus h16xx which spans sixteen columns. Although not illustrated in order to reduce the complexity of this figure, particular embodiments may also include input buses which span thirty-two columns. Those skilled in the art will recognize that buses of other sizes also may be included depending on the particular use of the arrangement.
The internal elements of the switchboxes 30 and 40 may be similar so only a single switchbox is treated in detail. The manner in which signals may be provided to each of the RAUs from the horizontal and vertical switch boxes will be better understood by referring to
In
In addition, each of the circled intersections in
As may be seen, each of the outputs A1, D1, C1, H1, A0, D0, C0, H0, and H2 is preceded by a tristate buffer. On the other hand, the output MXOUT is furnished by a multiplexer selecting from one of two vertical buses, the output XOROUT is furnished by a XOR circuit receiving inputs from two vertical buses, and the output REGQ is furnished by a DQ register receiving inputs from two vertical buses. Each of the MXOUT, XOROUT, and REGQ outputs is also furnished as an internally routed input on a similarly labeled horizontal bus. These horizontal buses allow connections to be made (at circled crossing points) to the various output channels.
In addition to the other inputs, logical ZERO (“0”) and ONE (“1”) values are furnished within the switchbox 30 so that these values may be furnished on the various outputs of the circuits. These are especially useful for setting the diagonal values of certain areas of the RAUs in order to allow the arrangement to be used for various arithmetic purposes such as Galois field multiplication previously discussed.
The paths 155 and 156 meet at another sparse crossbar 158 which provides for interchanging the signals on the two signal paths. The paths 155a or 156a also proceed by bidirectional paths 165 and 166 to a sparse crossbar 161 where the signals may be switched to the outputs A and B at a reconfigurable circuit 167. Bidirectional signal paths 168 and 169 return from the circuit 167 through the sparse crossbar 161 and to the paths 155b and 156b. The multiplexer 161 allows signals on paths 165, 166, 168, and 169 to be cross-coupled to the other paths.
At the circuit 167, signals on paths 165 and 166 are furnished to separate AND gates 163 and 164 where they may be transferred in response to control signals. It should be noted that each AND gate shown represents a bus width of individual AND gates controlled by a single input control. The AND gate outputs are furnished to XOR circuitry, and the result furnished to the signal paths 168 and 169. The circuit 167 has the ability to function as either a bussed multiplexer or a bussed XOR circuit and offers a bussed register at one output. As is illustrated, the signal paths 168 and 169 connect back to the sparse crossbar 161 where the signals may be rerouted through the arrangement. The circuit of
In
In order to illustrate the operation of the access circuitry shown in
In particular,
In a similar manner,
Another operation of the RAU is illustrated in
In order to access a particular value stored in the RAU, the value being sought is furnished on the first, third, fifth, and seventh rows of the horizontal input bus; while ONEs are furnished on the second, fourth, sixth, and eighth rows of the horizontal input bus. That is, the value being sought is furnished on alternate lines of the input bus at rows in which ONEs are stored by memory (i.e., IN[0], IN[N/2-1]). The other lines of the input bus are furnished values of ONE. This causes the AND gates of the particular column storing the value being sought to furnish an output indicating a match of the input value and the value stored by the column. Referring to the table of settings of the select values which may be utilized for programming the XOR tree of
The present invention may also be utilized to accomplish the functions of a programmable logic array (PLA). A PLA is a circuit which allows the implementation of arbitrary logic by mapping it onto two levels of logic, either (1) AND followed by OR or (2) OR followed by AND. The functionality of a PLA may be obtained by cascading two RAU circuits so that the output of the first RAU circuit is used as the input of the second RAU circuit. In the first RAU, the selection values of the counting tree (
A PLA-like structures of a higher number of logic levels can be obtained by cascading more than two RAU circuits in this manner.
A different type of PLA may be realized by using a function which XORs the sum of products. This may be implemented by configuring the counting tree of the second RAU circuit to provide XOR logic by setting the select values as delineated in line A of the accompanying truth table. Such an arrangement allows certain types of logic to be mapped more efficiently.
Another illustration will assist those skilled in the art to understand that much more sophisticated logical operations may be accomplished by the inventive arrangement utilizing the reconfigurable arithmetic units and the accompanying access circuitry shown in
It may be recognized that the following two equations are implemented in a repetitive manner for incrementing indices i−1,i,i+1 etc:
L
i+1
<=L
i+Delta*Bi [Eq. RS1]
B
i<=Skip ? Bi:Li*DeltaRecip [Eq. RS2]
where the variables Li+1, Li, Delta, Bi, and DeltaRecip are elements of GF(28) and “Skip” is a binary variable; “+” stands for addition in GF(28), “*” stands for multiplication in GF(28), “<=” denotes an assignment through a flip-flop or register on a predetermined clock edge, and the notation “A ? B:C” denotes a multiplexer that has output value B if A equals “1” and C if A equals “0”.
These equations can be divided into smaller portions, each corresponding to a specific hardware component shown in Figure [BKMSysArrayNew.ps], while keeping the same notation, as follows:
L
i+1
<=K
i+1 [Eq. RS1a]
implemented in register 214a of section 210a,
K
i+1
=L
i
+C
i+1 [Eq. RS1b]
implemented in Galois field adder 212a of section 210a,
C
i
=B
i*Delta [Eq. RS1c]
implemented in Galois field multiplier 232b of section 235b, where Ki+1, Ci, and Ci+1 are elements of GF(28), and:
Bi<=Ai [Eq. RS2a]
implemented in register 220b of section 210b,
Ai=skip ? Bi:Mi [Eq. RS2b]
implemented in multiplexer 218b of section 210b,
M
i
=L
i*DeltaRecip [Eq. RS2c]
implemented in Galois field multiplier 216b of section 210b, where Ai and Mi are elements of GF(28).
It may be recognized that the hardware described in
Except for the single bit signal “Skip”, all signals are sent on 8 bit buses representing elements of Galois field GF(28). Vertical crossbar 40c of
At this point, it is useful refer to the description of Galois field multiplier 216b of
Vertical crossbar 40b is programmed to furnish a logic “0” on its output D0, and onto bus d8a1, which is connected to input D of RAU 50b, and vertical crossbar 40f is programmed to furnish a logic “0” on its output D1, and onto bus d8b7, which is connected to input E of RAU 50d. Vertical crossbars 40d and 40f have been programmed to furnish the signal of their inputs V1 on their outputs V0, and externally supplied value DeltaRecip is distributed along buses v16b3, v16b4, and v16b5.
Cross switch 60d is programmed to connect bus v16b4 to bus h16b4. Horizontal crossbar 30d is programmed to furnish the signal of its input H2 onto is outputs A0 and A1. As a result, DeltaRecip is available on bus h8a3 and on input A of RAU 50b, as well as on bus h8a5 and on input A of RAU 50d.
Furthermore, RAUs 50b and 50d have been set to operate as Galois field multipliers, and produce the high and low portions ProdHigh and ProdLow of the raw multiply value of Li*DeltaRecip on their outputs z, respectively. Vertical crossbar 40b and horizontal crossbar 30c are programmed to conduct signal ProdHigh from v8a2 to d8b3 and from d8b3 to h8a4, respectively. In
In another embodiment diagonal buses for routing purposes only could be present.
Vertical crossbars 40d are programmed to furnish a logic “0” on bus d8a3 to input D of RAU 50c, and vertical crossbar 40e is programmed to furnish a logic “0” on bus d8b6 to input E of RAU 50c. RAU 50c is programmed to implement the Galois field scaling of ProdHigh by x8 mod G(x) to obtain ResH, similar to Equation M3. Vertical crossbar 40e is further programmed to produce the Galois field addition (i.e. bitwise XOR) of signal ResH on bus v8a5 and signal ProdHigh on bus v8a6, and furnish the result Mi on bus c8b6.
Horizontal crossbar 30f implements multiplexer 218b and register 220b of
Due to the repetitive nature of the implementation, some logic belonging to the next slice 200a of the systolic array is also included in this figure. In particular, vertical crossbars 40a and 40d of
Although the present invention has been described in terms of a preferred embodiment, it will be appreciated that various modifications and alterations might be made by those skilled in the art without departing from the spirit and scope of the invention. The invention should therefore be measured in terms of the claims which follow.