FIELD OF THE INVENTION
The present invention relates to a scheme for arithmetic operations in finite fields generally and, more particularly, to a method and/or apparatus for implementing a universal Galois field multiplier.
BACKGROUND OF THE INVENTION
An error correction code is a technique for expressing a sequence of numbers such that any errors which are introduced may be detected and corrected (within certain limitations) based on the remaining numbers. The study of error correction codes and the associated mathematics is known as coding theory. The commonly used error correction codes in digital communications and data storage include BCH (Bose-Chaudhuri-Hocquenghem) codes, Reed-Solomon (RS) codes (which are a subset of BCH codes), turbo codes, and the like.
Error correction codes are often defined in terms of Galois or finite field arithmetic. A Galois field is commonly identified by the number of elements which the field contains. The elements of a Galois field may be represented as polynomials in a particular primitive field element, with coefficients in the prime subfield. Since the number of elements contained in a Galois field is always equal to a prime number, q, raised to a positive integer power, m, the notation $GF(q^m)$ is commonly used to refer to the finite field containing $q^m$ elements. In such a field, all operations between elements comprising the field yield results which are each elements of the field.
Finite fields of characteristic 2 are important because these fields have data structures suitable for computers and may be utilized in error correction coding and cryptography. Conventionally, inverse calculation over a finite field with characteristic 2 may require an enormous amount of calculations compared with multiplication. For example, a well-known method for calculating inverses in a finite field follows directly from the cyclic structure of the multiplicative group of such a field: the inverse of a field element may be obtained directly by exponentiation. To be more precise, in $GF(2^n)$: $a^{-1}=a^{2^n-2}$. A person skilled in the art will recognize that this operation may be accomplished with $2n-3$ multiplications. Logic circuits for inverse operation based on such a method may thus have large depth and complexity. The depth of a logic circuit is the maximal number of logic elements in a path from a circuit input to a circuit output. The depth may determine the delay of the circuit. The complexity of a logic circuit is the number of logic elements in the circuit. The logic elements may have two inputs and one output.
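The exponentiation method described above may be sketched as follows. This is a minimal illustration only, assuming the field $GF(2^8)$ with the reduction polynomial $x^8+x^4+x^3+x+1$ and bit-packed integers as field elements; the function names gf_mul and gf_inv are hypothetical and are not part of the described apparatus.

```python
def gf_mul(a, b, n, poly):
    """Multiply a and b in GF(2^n); poly includes the x^n term."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        b >>= 1
        a <<= 1
        if a >> n:          # degree reached n: reduce once
            a ^= poly
    return r


def gf_inv(a, n, poly):
    """Return (a^(2^n - 2), number of field multiplications used).

    2^n - 2 = 2 + 4 + ... + 2^(n-1), so the result is the product of the
    successive squares a^2, a^4, ..., a^(2^(n-1)).
    """
    count = 1
    sq = gf_mul(a, a, n, poly)            # a^2
    result = sq
    for _ in range(n - 2):
        sq = gf_mul(sq, sq, n, poly)      # next square
        result = gf_mul(result, sq, n, poly)
        count += 2
    return result, count


if __name__ == "__main__":
    inv, used = gf_inv(0x53, 8, 0x11B)
    print(hex(inv), used, gf_mul(0x53, inv, 8, 0x11B))
```

The multiplication counter reaches 1+2(n−2)=2n−3, which is the figure stated above; squarings are counted as multiplications in this sketch.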
It would be desirable to provide a method for constructing logic circuits of small depth and complexity for operation of inversion in finite fields of characteristic 2.
SUMMARY OF THE INVENTION
The present invention concerns an apparatus including a multiplier circuit and a multiplexing circuit. The multiplier circuit may be configured to multiply a first multiplicand and a second multiplicand based on a programmable base value and generate a plurality of intermediate values, each intermediate value representing a result of the multiplication reduced by a respective irreducible polynomial. The multiplexing circuit may be configured to generate an output in response to the plurality of intermediate values received from the multiplier circuit and the programmable base value.
The objects, features and advantages of the present invention include providing a method and/or apparatus for implementing a universal Galois field multiplier that may perform multiplication in any field $GF(2^n)$, n=8, . . . , 16, in standard polynomial bases and/or normal bases.
BRIEF DESCRIPTION OF THE DRAWINGS
These and other objects, features and advantages of the present invention will be apparent from the following detailed description and the appended claims and drawings in which:
FIG. 1 is a block diagram of a module 100 illustrating a universal Galois field multiplier in accordance with an embodiment of the present invention;
FIG. 2 is a diagram illustrating an example implementation of the multiplier circuit of FIG. 1 configured as a standard base multiplier;
FIG. 3 is a diagram illustrating an example implementation of a binary-unary encoder of FIG. 2;
FIG. 4 is a diagram illustrating an example implementation of a double conjunction module of FIG. 2;
FIG. 5 is a diagram illustrating an example implementation of a linear transform module of FIG. 2;
FIG. 6 is a diagram illustrating an example vector-column C over GF(2) used by the submodules of FIG. 5;
FIG. 7 is a diagram illustrating an example transform matrix Li over GF(2) used by the submodules of FIG. 5;
FIG. 8 is a diagram illustrating an example vector-column C over GF(2) of the linear transform module of FIG. 5;
FIG. 9 is a diagram illustrating an example of submatrices of a transform matrix Li over GF(2) of the linear transform module of FIG. 5;
FIG. 10 is a diagram illustrating a step in the computation of the matrix Li for i=5;
FIG. 11 is a diagram illustrating another step in the computation of the matrix Li for i=5;
FIG. 12 is a diagram illustrating still another step in the computation of the matrix Li for i=5;
FIG. 13 is a diagram illustrating a final step in the computation of the matrix Li for i=5;
FIG. 14 is a diagram illustrating the matrix L5 corresponding to the steps illustrated in FIGS. 10-13;
FIG. 15 is a diagram illustrating cover of the matrix L5 by unit submatrices;
FIGS. 16-18 are diagrams illustrating calculations corresponding to an alternative irreducible polynomial to the one associated to FIGS. 10-14;
FIG. 19 is a diagram illustrating the matrix L5 corresponding to the steps illustrated in FIGS. 16-18;
FIG. 20 is a diagram illustrating cover of the matrix L5 of FIG. 19 by unit submatrices;
FIG. 21 is a diagram illustrating an example implementation of the multiplexing module of FIG. 1;
FIG. 22 is a diagram illustrating an example of a 9 to 1 multiplexer in accordance with an embodiment of the present invention;
FIG. 23 is a diagram illustrating an example implementation of the multiplier circuit of FIG. 1 configured as a normal base multiplier;
FIG. 24 is a diagram illustrating an example implementation of the module of FIG. 23 that performs the linear transform from normal to standard bases; and
FIG. 25 is a diagram illustrating a circuit based on a matrix and taking into account cover with units-submatrices.
DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENTS
Referring to FIG. 1, a block diagram of a module 100 is shown illustrating a universal Galois field multiplier in accordance with an embodiment of the present invention. The module 100 may perform multiplication in any field $GF(2^n)$, n=8, . . . , 16, in standard polynomial bases or normal bases. The module 100 may have an input 102 that may receive a first multiplicand signal (e.g., A), an input 104 that may receive a second multiplicand signal (e.g., B), an input 106 that may receive a programmable base value (e.g., R) and an output 108 that may present a result of the multiplication (e.g., O). In one example, the multiplicand A may comprise a plurality of values (e.g., a0, . . . , a15), the multiplicand B may comprise a plurality of values (e.g., b0, . . . , b15), the programmable base value R may comprise a plurality of values (e.g., R0, R1, R2, R3) and the result O may comprise a plurality of values (e.g., o0, . . . , o15). The signals R0, R1, R2, and R3 generally determine a dimension n of the field (e.g., n=8+8R3+4R2+2R1+R0, with R3 the most significant bit, as in TABLE 1).
The module 100 may comprise a block (or circuit) 110 and a block (or circuit) 112. The block 110 may comprise a multiplier circuit. The block 112 may comprise a multiplexing circuit. The block 110 may have a number of inputs that may receive the signals A, B and R and a number of outputs that may present a number of intermediate values (or results). The block 112 may have a number of inputs that may receive the intermediate values from the block 110 and the signal R. The block 112 may be configured to generate the signal O in response to the intermediate values from the block 110 and the signal R.
Referring to FIG. 2, a diagram is shown illustrating the module 100 configured as a standard base Galois field multiplier in accordance with an embodiment of the present invention. In one example, the block 110 may comprise a block 120, a number of blocks 122, a block 124 and a block 126. The block 120 may comprise a binary-unary encoder. The blocks 122 may comprise double conjunction circuits. The block 124 may comprise a polynomial multiplier. In one example, the block 124 may be configured to obtain a result C(x) by multiplying a first multiplicand A(x) and a second multiplicand B(x). In one example, the multiplicands may be of degree 15 and the result may be of degree 30. The block 126 may comprise a linear transform block. In one example, the block 126 may be configured to generate a number of intermediate values by reducing the result C(x) from the block 124 using a number of irreducible polynomials (e.g., C mod P0, C mod P1, C mod P2, C mod P3, C mod P4, C mod P5, C mod P6, C mod P7, C mod P8).
With the module 100 configured as a standard base multiplier and appropriate values of the input signals R0, R1, R2, R3, any elements of $GF(2^n)$ may be represented as vectors (e.g., $(a_0, \ldots, a_{n-1})$, $(b_0, \ldots, b_{n-1})$). The result of multiplication also may be represented as a vector (e.g., $(c_0, \ldots, c_{n-1})$). In standard base, the vectors $(a_0, \ldots, a_{n-1})$, $(b_0, \ldots, b_{n-1})$ may be represented as the vectors of coefficients of polynomials:
\[ A(x)=a_0+\cdots+a_{n-1}x^{n-1}, \quad B(x)=b_0+\cdots+b_{n-1}x^{n-1}. \]
Multiplication of these elements may be performed by the multiplication of the corresponding polynomials A(x), B(x) modulo $P_{n-8}(x)$, where $P_{n-8}(x)$ is an irreducible polynomial. The particular irreducible polynomial $P_{n-8}(x)$ generally corresponds to the given standard base in $GF(2^n)$.
In one example, the irreducible polynomials $P_{n-8}(x)$ corresponding to the standard bases may be as follows:
$P_0=x^8+x^4+x^3+x+1$,
$P_1=x^9+x+1$,
$P_2=x^{10}+x^3+1$,
$P_3=x^{11}+x^2+1$,
$P_4=x^{12}+x^3+1$,
$P_5=x^{13}+x^4+x^3+x+1$,
$P_6=x^{14}+x^5+1$,
$P_7=x^{15}+x+1$,
$P_8=x^{16}+x^5+x^3+x+1$.
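Standard-base multiplication modulo the polynomials listed above may be modelled in software as follows. This is a minimal sketch, assuming field elements are bit-packed integers (bit k holds the coefficient of $x^k$); the names POLY, poly_mul, poly_reduce and gf_mul are hypothetical and only mirror the two stages of polynomial multiplication followed by reduction.

```python
# Reduction polynomials P_(n-8), bit-packed, keyed by field dimension n.
POLY = {
    8: 0x11B,     # x^8 + x^4 + x^3 + x + 1
    9: 0x203,     # x^9 + x + 1
    10: 0x409,    # x^10 + x^3 + 1
    11: 0x805,    # x^11 + x^2 + 1
    12: 0x1009,   # x^12 + x^3 + 1
    13: 0x201B,   # x^13 + x^4 + x^3 + x + 1
    14: 0x4021,   # x^14 + x^5 + 1
    15: 0x8003,   # x^15 + x + 1
    16: 0x1002B,  # x^16 + x^5 + x^3 + x + 1
}


def poly_mul(a, b):
    """Carry-less product C(x) = A(x) B(x) over GF(2)."""
    c = 0
    while b:
        if b & 1:
            c ^= a
        a <<= 1
        b >>= 1
    return c


def poly_reduce(c, n):
    """Reduce C(x) modulo P_(n-8)(x), clearing terms from the top down."""
    p = POLY[n]
    for d in range(c.bit_length() - 1, n - 1, -1):
        if (c >> d) & 1:
            c ^= p << (d - n)
    return c


def gf_mul(a, b, n):
    """Multiply two elements of GF(2^n) given in the standard base."""
    return poly_reduce(poly_mul(a, b), n)


if __name__ == "__main__":
    for n in range(8, 17):
        # x * x^(n-1) = x^n, which reduces to the low-order terms of P_(n-8)
        print(n, bin(gf_mul(0b10, 1 << (n - 1), n)))
```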
Various circuits (e.g., simple and optimized) are presented herein for implementing a universal standard base multiplier in accordance with the present invention. In one example, a simple circuit may contain 482 AND(OR) cells and fewer than 478 XOR cells. The depth (delay) of the example simple circuit may be determined as $7D_{XOR}+10D_{AND}$, where $D_{XOR}$ and $D_{AND}$ denote the delays of an XOR cell and an AND cell, respectively. In one example, an optimized circuit may contain 431 AND(OR) cells and fewer than 447 XOR cells. The depth of the optimized circuit may be determined as $9D_{XOR}+10D_{AND}$.
An example of a universal normal base multiplier approximately double the size and the depth (e.g., $13D_{XOR}+11D_{AND}$) is also presented (described below in connection with FIGS. 23 and 24). A normal base in $GF(2^n)$, n=8, . . . , 16, is a linearly independent system $B^{\alpha}=\{\alpha, \alpha^2, \alpha^4, \alpha^8, \ldots, \alpha^{2^{n-1}}\}$, where α is a root of an irreducible polynomial $P_{n-8}(x)$. Then the system $B_{\alpha}=\{1, \alpha, \alpha^2, \alpha^3, \ldots, \alpha^{n-1}\}$ is the standard base corresponding to the normal base $B^{\alpha}$. Also presented is an example in which the irreducible polynomials of the bases are:
$P_0=x^8+x^7+x^2+x+1$ (e.g., corresponding to a unique normal base in $GF(2^8)$),
$P_1=x^9+x^8+x^6+x^5+x^4+x+1$ (e.g., corresponding to the optimal normal base of type 2),
$P_2=x^{10}+x^9+x^8+x^7+x^6+x^5+x^4+x^3+x^2+x+1$ (e.g., corresponding to the optimal normal base of type 1),
$P_3=x^{11}+x^{10}+x^8+x^4+x^3+x^2+1$ (e.g., corresponding to the optimal normal base of type 2),
$P_4=x^{12}+x^{11}+x^{10}+x^9+x^8+x^7+x^6+x^5+x^4+x^3+x^2+x+1$ (e.g., corresponding to the optimal normal base of type 1),
$P_5=x^{13}+x^{12}+x^{10}+x^7+x^4+x^3+1$ (e.g., corresponding to the normal base with minimal complexity, but a better normal base may be obtained with the polynomial $x^{13}+x^{12}+x^{11}+x^{10}+x^9+x^8+x^7+x^6+x^5+x^4+x^3+x^2+1$),
$P_6=x^{14}+x^{13}+x^{12}+x^9+x^8+x+1$ (e.g., corresponding to the optimal normal base of type 2),
$P_7=x^{15}+x^{14}+x^{12}+x^9+x^7+x^5+x^4+x^2+1$ (e.g., corresponding to the normal base with minimal complexity),
$P_8=x^{16}+x^{15}+x^{14}+x^5+1$ (e.g., corresponding to a random base).
The module 124 may be implemented using conventional techniques. The module 124 may comprise, in one example, 256 AND cells and 225 XOR cells. The depth of the module 124 may be determined as $4D_{XOR}+D_{AND}$. The size of the module 124 may be reduced as follows (e.g., by application of Karatsuba's method). The polynomials A(x) and B(x) may be represented as
\[ A(x)=A_0(x)+A_1(x)x^8, \quad B(x)=B_0(x)+B_1(x)x^8, \]
then
\[ C(x)=A(x)B(x)=(A_0(x)+A_1(x)x^8)(B_0(x)+B_1(x)x^8)=A_0(x)B_0(x)+\bigl((A_0(x)+A_1(x))(B_0(x)+B_1(x))-A_0(x)B_0(x)-A_1(x)B_1(x)\bigr)x^8+A_1(x)B_1(x)x^{16}. \]
A module implementing a 16-bit polynomial multiplier using Karatsuba's method may be constructed from three 8-bit polynomial multiplier modules and 8+8+15+15+7+7=60 XOR cells. The size of Karatsuba's module is 60 XOR+3(64 AND+49 XOR)=192 AND+207 XOR. The depth of the module is $6D_{XOR}+D_{AND}$.
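The Karatsuba decomposition above may be checked with a short software model. This is a sketch only, assuming bit-packed 16-bit operands; the names clmul and karatsuba16 are hypothetical. Over GF(2) the subtractions in the formula coincide with additions, so both are written as XOR.

```python
def clmul(a, b):
    """Schoolbook carry-less product of two GF(2) polynomials."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        a <<= 1
        b >>= 1
    return r


def karatsuba16(a, b):
    """16-bit carry-less product built from three 8-bit products."""
    a0, a1 = a & 0xFF, a >> 8          # A(x) = A0(x) + A1(x) x^8
    b0, b1 = b & 0xFF, b >> 8          # B(x) = B0(x) + B1(x) x^8
    lo = clmul(a0, b0)                 # A0 B0
    hi = clmul(a1, b1)                 # A1 B1
    mid = clmul(a0 ^ a1, b0 ^ b1) ^ lo ^ hi
    return lo ^ (mid << 8) ^ (hi << 16)


if __name__ == "__main__":
    a, b = 0xBEEF, 0x1234
    print(hex(karatsuba16(a, b)), hex(clmul(a, b)))
```

The three calls to clmul on 8-bit operands correspond to the three 8-bit multiplier modules; the remaining XOR operations correspond to the additional XOR cells counted above.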
In one example, binary-unary coding may be performed as summarized in the following coding table TABLE 1:
TABLE 1
R3, R2, R1, R0    U0, U1, U2, U3, U4, U5, U6, U7
0, 0, 0, 0        0, 0, 0, 0, 0, 0, 0, 0
0, 0, 0, 1        1, 0, 0, 0, 0, 0, 0, 0
0, 0, 1, 0        1, 1, 0, 0, 0, 0, 0, 0
0, 0, 1, 1        1, 1, 1, 0, 0, 0, 0, 0
0, 1, 0, 0        1, 1, 1, 1, 0, 0, 0, 0
0, 1, 0, 1        1, 1, 1, 1, 1, 0, 0, 0
0, 1, 1, 0        1, 1, 1, 1, 1, 1, 0, 0
0, 1, 1, 1        1, 1, 1, 1, 1, 1, 1, 0
1, 0, 0, 0        1, 1, 1, 1, 1, 1, 1, 1
A multiple-output Boolean function U(R)=(Ui(R0,R1,R2,R3)), i=0, . . . , 7, for TABLE 1 above may be defined as follows (∨ denotes logical OR, & denotes logical AND, and & binds more tightly than ∨):
U0=R0 ∨ R1 ∨ R2 ∨ R3=U1 ∨ R0,
U1=R1 ∨ R2 ∨ R3=U3 ∨ R1,
U2=R0 & R1 ∨ R2 ∨ R3=U3 ∨ R0 & R1,
U3=R2 ∨ R3,
U4=(R0 ∨ R1) & R2 ∨ R3=U5 ∨ R0 & R2,
U5=R1 & R2 ∨ R3,
U6=(R0 & R2) & R1 ∨ R3,
U7=R3.
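The Boolean functions above may be checked against TABLE 1 with the following sketch. It assumes only the nine codes of TABLE 1 are applied; the function name unary_code is hypothetical.

```python
def unary_code(r3, r2, r1, r0):
    """Evaluate U0..U7 using the shared sub-expressions given above."""
    u3 = r2 | r3
    u1 = u3 | r1
    u0 = u1 | r0
    u2 = u3 | (r0 & r1)
    u5 = (r1 & r2) | r3
    u4 = u5 | (r0 & r2)
    u6 = ((r0 & r2) & r1) | r3
    u7 = r3
    return [u0, u1, u2, u3, u4, u5, u6, u7]


if __name__ == "__main__":
    for v in range(9):                      # the nine rows of TABLE 1
        r3, r2, r1, r0 = (v >> 3) & 1, (v >> 2) & 1, (v >> 1) & 1, v & 1
        print((r3, r2, r1, r0), unary_code(r3, r2, r1, r0))
```

For the binary value v=8R3+4R2+2R1+R0 the output is v ones followed by 8−v zeros, i.e. a unary (thermometer) code.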
Referring to FIG. 3, a diagram is shown illustrating an example implementation of the binary-unary encoding module 120 of FIG. 2. The size of the binary-unary encoding (B-U E) module 120 computing the multiple-output Boolean function U(R) as defined above may be equal to 11 and the depth may be equal to 3. In one example, the module 120 may comprise a block (or circuit) 140, a block (or circuit) 142, a block (or circuit) 144, a block (or circuit) 146, a block (or circuit) 148, a block (or circuit) 150, a block (or circuit) 152, a block (or circuit) 154, a block (or circuit) 156, a block (or circuit) 158 and a block (or circuit) 160. The blocks 140-146 may be implemented as two-input AND gates. The blocks 148-160 may be implemented as two-input OR gates.
In one example, the block 140 may have a first input that may receive the signal R0, a second input that may receive the signal R1, and an output. The block 142 may have a first input that may receive the signal R0, a second input that may receive the signal R2, and an output. The block 144 may have a first input that may receive the signal R1, a second input that may receive the signal R2, and an output. The block 146 may have a first input that may receive the signal R1, a second input that may be connected to the output of the block 142, and an output. The block 148 may have a first input that may receive the signal R2, a second input that may receive the signal R3, and an output that may present the signal U3. The block 150 may have a first input that may receive the signal U3, a second input that may be connected to the output of the block 140, and an output that may present the signal U2.
The block 152 may have a first input that may receive the signal R3, a second input that may be connected to the output of the block 144, and an output that may present the signal U5. The block 154 may have a first input that may receive the signal U3, a second input that may receive the signal R1, and an output that may present the signal U1. The block 156 may have a first input that may receive the signal R3, a second input that may be connected to the output of the block 146, and an output that may present the signal U6. The block 158 may have a first input that may receive the signal U5, a second input that may be connected to the output of the block 142, and an output that may present the signal U4. The block 160 may have a first input that may receive the signal U1, a second input that may receive the signal R0, and an output that may present the signal U0.
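The connections of the blocks 140-160 described above may be transcribed into a small netlist and simulated. This is an illustrative sketch; the internal net names (n140, n142, n144, n146) and the function simulate are hypothetical, and U7 is modelled as a direct wire from R3.

```python
# (block, gate type, first input, second input, output net)
NETLIST = [
    (140, "AND", "R0", "R1", "n140"),
    (142, "AND", "R0", "R2", "n142"),
    (144, "AND", "R1", "R2", "n144"),
    (146, "AND", "R1", "n142", "n146"),
    (148, "OR", "R2", "R3", "U3"),
    (150, "OR", "U3", "n140", "U2"),
    (152, "OR", "R3", "n144", "U5"),
    (154, "OR", "U3", "R1", "U1"),
    (156, "OR", "R3", "n146", "U6"),
    (158, "OR", "U5", "n142", "U4"),
    (160, "OR", "U1", "R0", "U0"),
]


def simulate(r0, r1, r2, r3):
    """Return ([U0..U7], depth) for the gate list above."""
    value = {"R0": r0, "R1": r1, "R2": r2, "R3": r3}
    depth = {name: 0 for name in value}
    for _, gate, a, b, out in NETLIST:      # list is in topological order
        if gate == "AND":
            value[out] = value[a] & value[b]
        else:
            value[out] = value[a] | value[b]
        depth[out] = 1 + max(depth[a], depth[b])
    value["U7"], depth["U7"] = r3, 0        # U7 is R3 itself
    return [value["U%d" % i] for i in range(8)], max(depth.values())


if __name__ == "__main__":
    print(len(NETLIST), simulate(1, 0, 1, 0))
```

The gate list has 11 entries and the longest input-to-output path passes through 3 gates, in agreement with the size and depth stated for the module 120.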
Referring to FIG. 4, a diagram is shown illustrating an example implementation of the blocks 122 of FIG. 2. In one example, the blocks 122 may comprise a block (or circuit) 162 and a block (or circuit) 164. The blocks 162 and 164 may be implemented, in one example, as two-input AND gates. The block 162 may receive a first multiplicand signal (e.g., $a_{i+8}$) at a first input and a signal (e.g., $U_i$) at a second input. The block 164 may receive a second multiplicand signal (e.g., $b_{i+8}$) at a first input and the signal (e.g., $U_i$) at a second input. The blocks 162 and 164 are generally configured to logically AND the signal $U_i$ with the respective multiplicand $a_{i+8}$ or $b_{i+8}$.
Referring to FIG. 5, a diagram is shown illustrating an example implementation of the module 126 of FIG. 2. The module 126 generally computes, in parallel, the result signals C(x) mod Pi(x), i=0, . . . , 8, where Pi represents respective irreducible polynomials over field GF(2). In one example, the polynomials Pi, i=0, . . . , 8, may be comprised of the following:
$P_0=x^8+x^4+x^3+x+1$,
$P_1=x^9+x+1$,
$P_2=x^{10}+x^3+1$,
$P_3=x^{11}+x^2+1$,
$P_4=x^{12}+x^3+1$,
$P_5=x^{13}+x^4+x^3+x+1$,
$P_6=x^{14}+x^5+1$,
$P_7=x^{15}+x+1$,
$P_8=x^{16}+x^5+x^3+x+1$.
In one example, the block 126 may comprise a number of blocks (or circuits) 170a-170n. The blocks 170a-170n may be implemented as submodules configured to perform the linear transformation C mod Pi. In one example, each of the blocks 170a-170n may be configured to compute the linear transform C→LiC, where C represents a vector-column over GF(2) (e.g., illustrated in FIG. 6) and Li represents a (8+i, 15+2i)-matrix over GF(2) (e.g., illustrated in FIG. 7).
The block 126 generally computes the overall linear transform C→LC, where C is the vector-column over GF(2), illustrated in FIG. 8, and L is a (108,31)-matrix over GF(2) consisting of submatrices Li, i=0, . . . , 8 (illustrated in FIG. 9, where the symbol 0 indicates null-submatrices).
Using as an example i=5, the matrix Li may be computed as follows. At first, each monomial $c_{13+j}x^{13+j}$ may be replaced by $c_{13+j}(x^{4+j}+x^{3+j}+x^{1+j}+x^{j})$, j=0, . . . , 9 (illustrated in FIG. 10, where each monomial is represented as a cell in a table). In subsequent steps, the manipulations of the first step are repeated (illustrated in FIGS. 11-13). The final result is represented in FIG. 13. The corresponding matrix L5 obtained by the above manipulations is illustrated in FIG. 14.
The following formula represents the linear transform C(x)→C(x) mod P5, $P_5=x^{13}+x^4+x^3+x+1$, where ai is written instead of ci:
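The repeated substitution described above amounts to computing $x^d$ mod $P_i$ for every degree d from n to 2(n−1), n=8+i. The following sketch derives those columns of the reduction matrix in software. The names reduction_columns and xor_cell_count are hypothetical, and the cell count assumes a direct implementation with one two-input XOR per unit in the columns of degree n and above.

```python
def reduction_columns(poly, n, max_deg):
    """Map each degree d in n..max_deg to x^d mod poly (bit-packed)."""
    cols = {}
    r = 1 << (n - 1)                 # start from x^(n-1)
    for d in range(n, max_deg + 1):
        r <<= 1                      # multiply by x
        if (r >> n) & 1:             # degree n appeared: substitute x^n
            r ^= poly
        cols[d] = r
    return cols


def xor_cell_count(poly, n):
    """Units in the columns for degrees n..2(n-1); one XOR cell per unit."""
    cols = reduction_columns(poly, n, 2 * (n - 1))
    return sum(bin(v).count("1") for v in cols.values())


if __name__ == "__main__":
    p5 = 0x201B                      # x^13 + x^4 + x^3 + x + 1
    for d, v in reduction_columns(p5, 13, 24).items():
        print(d, format(v, "013b"))
    print(xor_cell_count(p5, 13))
```

Under that counting assumption the sketch gives 55 for $x^{13}+x^4+x^3+x+1$, and 16, 20 and 28 for $x^9+x+1$, $x^{10}+x^3+1$ and $x^{15}+x+1$, respectively.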
The linear transform may be implemented as a circuit comprising two-input XOR-cells (as illustrated below in connection with FIG. 25). The complexity of such a circuit is equal to 55 and the depth is equal to 3.
Referring to FIG. 15, a diagram is shown illustrating a cover of the matrix of FIG. 14 by unit-submatrices. A set of unit-submatrices (e.g., submatrices with all element values equal to “1”) forms a cover of a given matrix if and only if each unit of the given matrix belongs to exactly one unit-submatrix from the set. The notion of cover is generally used in OR-circuits, but may also be applied, as herein, to XOR-circuits corresponding to the various tables and figures, as will be readily apparent to those skilled in the art(s). The complexity of a circuit implementing the matrix of FIG. 14 may be minimized by covering the given matrix with nontrivial unit-submatrices (e.g., all cells of a particular unit-submatrix are indicated by similar shading). The complexity of the linear transform performed by the block 126 with such a matrix is equal to 42 and the depth is equal to 3.
In another example, an irreducible polynomial $P_5(x)=f(x)=x^{13}+x^{12}+\cdots+x^2+1=(x^{14}+1)/(x+1)+x$ may be used instead of $P_5(x)=x^{13}+x^4+x^3+x+1$. The reduction modulo f(x) may be computed as follows:
\[ c_{24}x^{24}+\cdots+c_1x+c_0 \pmod{f(x)} = \bigl(c_{24}x^{24}+\cdots+c_1x+c_0 \pmod{f(x)(x+1)}\bigr) \pmod{f(x)} \]
\[ = \bigl(c_{24}x^{24}+\cdots+c_1x+c_0 \pmod{x^{14}+x^2+x+1}\bigr) \pmod{f(x)} = b_{13}x^{13}+\cdots+b_1x+b_0 \pmod{f(x)} \]
\[ = (b_{12}+b_{13})x^{12}+\cdots+(b_2+b_{13})x^2+b_1x+b_0+b_{13}. \]
The steps of the corresponding calculations are illustrated in FIGS. 16-18. The corresponding matrix L5 is illustrated in FIG. 19. The complexity of a corresponding circuit implemented by a brute force method is equal to 42 and the depth is equal to 3. Minimizing the complexity of the matrix of FIG. 19 with units submatrices is illustrated in FIG. 20.
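The two-step reduction for $f(x)=x^{13}+x^{12}+\cdots+x^2+1$ may be compared with a direct reduction using the following sketch. The names poly_mod, F, G and reduce_two_step are hypothetical; polynomials are bit-packed integers.

```python
def poly_mod(c, m):
    """Remainder of the GF(2) polynomial c modulo m."""
    dm = m.bit_length() - 1
    while c.bit_length() - 1 >= dm:
        c ^= m << (c.bit_length() - 1 - dm)
    return c


F = sum(1 << k for k in range(2, 14)) | 1   # x^13 + x^12 + ... + x^2 + 1
G = (1 << 14) | 0b111                       # f(x)(x+1) = x^14 + x^2 + x + 1


def reduce_two_step(c):
    """Reduce modulo the sparse multiple G first, then fold in b13."""
    b = poly_mod(c, G)          # b13 x^13 + ... + b1 x + b0
    if (b >> 13) & 1:
        b ^= F                  # adds b13 to coefficients 12..2 and 0
    return b


if __name__ == "__main__":
    c = 0x1ABCDEF               # a sample C(x) of degree 24
    print(hex(reduce_two_step(c)), hex(poly_mod(c, F)))
```

The first step uses only the sparse polynomial $x^{14}+x^2+x+1$; the second step adds the single coefficient $b_{13}$ to twelve outputs, which is the source of the high fan-out in this variant.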
The following formula represents the linear transform C(x)→C(x) mod P5, $P_5=x^{13}+x^{12}+\cdots+x^2+1$, where ai is written instead of ci:
The complexity of the linear transform with the matrix of FIG. 20 is equal to 36. The depth of the linear transform with the matrix of FIG. 20 is equal to 3. However, fan-out of the input c13 is equal to 12 (in the case $P_5(x)=x^{13}+x^4+x^3+x+1$ all fanouts are less than 4).
For brevity, examples (similar to the formulas for the case i=5 presented above) are presented below with only the formulas for reducing C(x) mod Pi(x), i=0, 1, 2, 3, 4, 6, 7, 8. As was illustrated above for the case i=5, implementation of the linear transform corresponding to each of the matrices below may be optimized by constructing an optimal cover of the set of unit cells by some rectangular all-units submatrices with the indicated number of XOR cells.
For the case i=0, the following formula represents the linear transform C(x)→C(x) mod P0, where ai is written instead of ci:
The corresponding matrix of coefficients is as follows:
The depth of the corresponding circuit is equal to 3. The complexity of the corresponding circuit is equal to 21.
For the case i=1, the following formula represents the linear transform C(x)→C(x) mod P1, where ai is written instead of ci:
The corresponding matrix of coefficients is as follows:
The depth of the corresponding circuit is equal to 2. The complexity of the corresponding circuit is 16.
For the case i=2, the following formula represents the linear transform C(x)→C(x) mod P2, where ai is written instead of ci:
The corresponding matrix of coefficients is as follows:
The depth of the corresponding circuit is equal to 2. The complexity of the corresponding circuit is 20.
In the i=2 case, the irreducible polynomial $P_2=x^{10}+x^9+\cdots+x+1$ may be used instead of $P_2=x^{10}+x^3+1$. Using the irreducible polynomial $P_2=x^{10}+x^9+\cdots+x+1$ results in the formulas having only 28 monomials:
The corresponding matrix of coefficients is as follows:
The depth of the corresponding circuit is equal to 2. The complexity of the corresponding circuit is 18. However, after optimization of the first formula, circuits with the same complexity are generally obtained.
For the case i=3, the following formula represents the linear transform C(x)→C(x) mod P3, where ai is written instead of ci:
The corresponding matrix of coefficients is as follows:
The depth of the corresponding circuit is equal to 2. The complexity of the corresponding circuit is 21.
For the case i=4, the following formula represents the linear transform C(x)→C(x) mod P4, where ai is written instead of ci:
The corresponding matrix of coefficients is as follows:
The depth of the corresponding circuit is equal to 2. The complexity of the corresponding circuit is 24.
In the i=4 case, the irreducible polynomial $P_4=x^{12}+x^{11}+\cdots+x+1$ may be used instead of $P_4=x^{12}+x^3+1$. When the irreducible polynomial $P_4=x^{12}+x^{11}+\cdots+x+1$ is used, the formulas have only 34 monomials:
The corresponding matrix of coefficients is as follows:
The depth of the corresponding circuit is equal to 2. The complexity of the corresponding circuit is 22. However, after optimization of the first formula, circuits with the same complexity may be obtained.
For the case i=6, the following formula represents the linear transform C(x)→C(x) mod P6, where ai is written instead of ci:
The corresponding matrix of coefficients is as follows:
The depth of the corresponding circuit is equal to 2. The complexity of the corresponding circuit is 30.
For the case i=7, the following formula represents the linear transform C(x)→C(x) mod P7, where ai is written instead of ci:
The corresponding matrix of coefficients is as follows:
The depth of the corresponding circuit is equal to 2. The complexity of the corresponding circuit is 28.
For the case i=8, the following formula represents the linear transform C(x)→C(x) mod P8, where ai is written instead of ci:
The corresponding matrix of coefficients is as follows:
The depth of the corresponding circuit is equal to 3. The complexity of the corresponding circuit is 70.
The total complexity of the block 126 implemented based upon the above formulas is less than 268 XOR cells, and the depth is equal to $3D_{XOR}$. The full (108,31) matrix of linear transform L may be summarized as follows (null elements in the ends of rows are omitted):
Implementation of the linear transform corresponding to the above matrix may be optimized by constructing an optimal cover of the set of unit cells by some rectangular all-units submatrices, as will be apparent to those skilled in the art(s). For example, the complexity of the linear transform above may be reduced by approximately 30. As a result, the above linear transform may be implemented by a circuit containing less than 240 XOR cells.
Referring to FIG. 21, a diagram is shown illustrating an example implementation of the multiplexing circuit 112 of FIG. 1 in accordance with an embodiment of the present invention. The multiplexing module 112 may be constructed as follows. The outputs of the submodules computing C(x) mod Pi may be designated as $C_{i,j}$, j=0, 1, . . . , 7+i, i=0, 1, . . . , 8. The multiplexing module 112 may include, in one example, 15 multiplexers:
$MUX_j=MUX(C_{0,j}, C_{1,j}, \ldots, C_{8,j}, R_0, R_1, R_2, R_3)$, j=0, 1, . . . , 7,
$MUX_8=MUX(C_{1,8}, \ldots, C_{8,8}, R_0, R_1, R_2)$,
$MUX_9=MUX(C_{2,9}, \ldots, C_{8,9}, R_0, R_1, R_2)$,
$MUX_{10}=MUX(C_{3,10}, \ldots, C_{8,10}, R_0, R_1, R_2)$,
$MUX_{11}=MUX(C_{4,11}, \ldots, C_{8,11}, R_0, R_1, R_2)$,
$MUX_{12}=MUX(C_{5,12}, C_{6,12}, C_{7,12}, C_{8,12}, R_0, R_1)$,
$MUX_{13}=MUX(C_{6,13}, C_{7,13}, C_{8,13}, R_0, R_1)$,
$MUX_{14}=MUX(C_{7,14}, C_{8,14}, R_0)$.
The functions of the individual multiplexers may be summarized as in the following TABLES 2-9:
TABLE 2
R3, R2, R1, R0    Muxi (U0, U1, U2, U3, U4, U5, U6, U7, U8, R0, R1, R2, R3)
0, 0, 0, 0        U0
0, 0, 0, 1        U1
0, 0, 1, 0        U2
0, 0, 1, 1        U3
0, 1, 0, 0        U4
0, 1, 0, 1        U5
0, 1, 1, 0        U6
0, 1, 1, 1        U7
1, 0, 0, 0        U8
TABLE 3
R2, R1, R0    Mux8 (U1, U2, U3, U4, U5, U6, U7, U8, R0, R1, R2)
0, 0, 1       U1
0, 1, 0       U2
0, 1, 1       U3
1, 0, 0       U4
1, 0, 1       U5
1, 1, 0       U6
1, 1, 1       U7
0, 0, 0       U8
TABLE 4
R2, R1, R0    Mux9 (U2, U3, U4, U5, U6, U7, U8, R0, R1, R2)
0, 1, 0       U2
0, 1, 1       U3
1, 0, 0       U4
1, 0, 1       U5
1, 1, 0       U6
1, 1, 1       U7
0, 0, 0       U8
TABLE 5
R2, R1, R0    Mux10 (U3, U4, U5, U6, U7, U8, R0, R1, R2)
0, 1, 1       U3
1, 0, 0       U4
1, 0, 1       U5
1, 1, 0       U6
1, 1, 1       U7
0, 0, 0       U8
TABLE 6
R2, R1, R0    Mux11 (U4, U5, U6, U7, U8, R0, R1, R2, R3)
1, 0, 0       U4
1, 0, 1       U5
1, 1, 0       U6
1, 1, 1       U7
0, 0, 0       U8
TABLE 7
R1, R0    Mux12 (U5, U6, U7, U8, R0, R1)
0, 1      U5
1, 0      U6
1, 1      U7
0, 0      U8
TABLE 8
R1, R0    Mux13 (U6, U7, U8, R0, R1)
1, 0      U6
1, 1      U7
0, 0      U8
TABLE 9
R0    Mux14 (U7, U8, R0)
1     U7
0     U8
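The selection rules of TABLES 2-9 may be modelled as follows. This is a sketch under the assumption that only the nine codes of TABLE 1 are applied and that a multiplexer for output position j is meaningful only for field indices i≥j−7; the names SELECT_BITS, mux_full and mux_reduced are hypothetical.

```python
# Number of low-order select bits used by the multiplexer for position j.
SELECT_BITS = {8: 3, 9: 3, 10: 3, 11: 3, 12: 2, 13: 2, 14: 1}


def mux_full(inputs, r):
    """TABLE 2: select inputs[i] with i = 8*R3 + 4*R2 + 2*R1 + R0."""
    r3, r2, r1, r0 = r
    return inputs[8 * r3 + 4 * r2 + 2 * r1 + r0]


def mux_reduced(inputs, j, r):
    """TABLES 3-9: use only the k low-order select bits for position j.

    A zero code selects input 8; a nonzero code v selects 8 - 2^k + v.
    """
    k = SELECT_BITS[j]
    v = (4 * r[1] + 2 * r[2] + r[3]) & ((1 << k) - 1)
    return inputs[8 if v == 0 else 8 - (1 << k) + v]


if __name__ == "__main__":
    labels = ["U%d" % i for i in range(9)]
    for j in range(8, 15):
        picks = []
        for i in range(j - 7, 9):           # fields in which position j exists
            r = ((i >> 3) & 1, (i >> 2) & 1, (i >> 1) & 1, i & 1)
            picks.append(mux_reduced(labels, j, r))
        print(j, picks)
```

Because the code 1, 0, 0, 0 is the only one with R3=1, dropping the high-order select bits does not create ambiguity among the inputs that remain for each position.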
In general, a standard 2-input multiplexer may comprise three two-input cells and have a depth of 2, a standard 3-input multiplexer may comprise six two-input cells and have a depth of 4, a standard 4-input multiplexer may comprise nine two-input cells and have a depth of 4, a standard 5-input multiplexer may comprise twelve two-input cells and have a depth of 6, a standard 6-input multiplexer may comprise fifteen two-input cells and have a depth of 6, a standard 7-input multiplexer may comprise eighteen two-input cells and have a depth of 6, a standard 8-input multiplexer may comprise twenty-one two-input cells and have a depth of 6, and a standard 9-input multiplexer may comprise twenty-four two-input cells and have a depth of 8. Consequently, the size of the multiplexing module 112 is generally less than or equal to 8·24+21+18+15+12+9+6+3=276 AND(OR) cells and the depth is $8D_{AND(OR)}$.
Referring to FIG. 22, a diagram of a 9 to 1 multiplexing circuit 180 is shown illustrating an alternative embodiment of the multiplexing circuit 112 of FIG. 21. In one example, all of the multiplexers may have joint inputs Ri, i=0, 1, 2, 3. Consequently, each of the multiplexers MUXi, i=0, . . . , 14, in FIG. 21 may be implemented using a circuit similar to the multiplexing circuit 180. In one example, the circuit 180 may comprise a block (or circuit) 182 and a block (or circuit) 184. The block 182 may have four inputs that may receive the signals Ri, i=0, 1, 2, 3, and thirteen outputs that may present control signals (e.g., yi, i=0, 1, . . . , 12). The block 182 may be configured to implement a function K(R0, . . . , R3) with 4 inputs (e.g., R0, . . . , R3) and 13 outputs (e.g., R1&R0, ¬R1&R0, R1&¬R0, ¬R1&¬R0, R2&R1&R0, R2&¬R1&R0, R2&R1&¬R0, R2&¬R1&¬R0, ¬R2&R1&R0, ¬R2&¬R1&R0, ¬R2&R1&¬R0, ¬R2&¬R1&¬R0, R3&¬R2&¬R1&¬R0, where ¬ represents the operation of taking the logical complement). In one example, the block 182 may be constructed from 13 two-input cells with a depth of 3. The block 184 may be configured to implement a function Y=x1&y1 ∨ . . . ∨ xm&ym, where m represents the number of inputs multiplexed (e.g., m=2, . . . , 9).
In one example, the multiplexers MUXi, i=0, . . . , 14, in FIG. 21 may be implemented with a common block 182. Consequently, the size of the multiplexing module 112 implemented with the multiplexing structure similar to the block 180 is less than or equal to 8·17+15+13+11+9+7+5+3+13=212 AND(OR) cells and the depth (from inputs xi) is $5D_{AND(OR)}$.
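The AND-OR selection function Y=x1&y1 ∨ . . . ∨ xm&ym and the cell count above may be illustrated with the following sketch. It assumes an m-input AND-OR stage costs m AND cells plus m−1 OR cells and that the control signals form a one-hot vector; the names and_or_mux, and_or_cells and one_hot are hypothetical, and the derivation of the control signals from R0, . . . , R3 is not modelled.

```python
def and_or_mux(xs, ys):
    """Y = x1&y1 | ... | xm&ym for data bits xs and control bits ys."""
    out = 0
    for x, y in zip(xs, ys):
        out |= x & y
    return out


def and_or_cells(m):
    """Two-input cells in an m-input AND-OR stage: m ANDs and m-1 ORs."""
    return 2 * m - 1


def one_hot(i, m):
    """Control vector with a single 1 at position i."""
    return [1 if k == i else 0 for k in range(m)]


# Eight 9-input stages, one stage each of 8, 7, ..., 2 inputs,
# plus the 13 cells of the shared control block.
TOTAL_CELLS = (8 * and_or_cells(9)
               + sum(and_or_cells(m) for m in range(2, 9))
               + 13)

if __name__ == "__main__":
    data = [0, 1, 1, 0, 1, 0, 0, 1, 1]
    print([and_or_mux(data, one_hot(i, 9)) for i in range(9)], TOTAL_CELLS)
```

Sharing the control block among all fifteen multiplexers is what lets each stage be counted as 2m−1 cells.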
Referring to FIG. 23, a diagram of a module 200 is shown illustrating a universal multiplier for normal bases in accordance with an embodiment of the present invention. In one example, the module 200 may comprise a block (or circuit) 202, a block (or circuit) 204, a block (or circuit) 206 and a block (or circuit) 208. The block 202 may be configured to perform a linear transform from a normal base to a standard base. The block 204 may comprise a standard multiplier circuit. In one example, the block 204 may be implemented similarly to the block 110 (described above in connection with FIG. 2). For example, the block 204 may comprise a binary-unary encoder, a polynomial multiplier and a linear transform block configured to reduce results from the polynomial multiplier modulo Pi, i=0, . . . , 8. The block 206 may be configured to perform a linear transform from the standard base back to the normal base. The block 208 may comprise a multiplexing circuit. The block 208 may be implemented similarly to the block 112 (described above in connection with FIGS. 21 and 22).
Referring to FIG. 24, a diagram is shown illustrating an example implementation of the block 202 of FIG. 23. The block 202 may comprise a block (or circuit) 210, a block (or circuit) 212, a block (or circuit) 214 and a block (or circuit) 216. The block 210 may be configured to perform a linear transform from the normal base to the standard base for the first multiplicand (e.g., A). The block 212 may comprise a multiplexing circuit. The block 214 may be configured to perform a linear transform from the normal base to the standard base for the second multiplicand (e.g., B). The block 216 may comprise a multiplexing circuit. The blocks 212 and 216 may be implemented similarly to the block 112 (described above in connection with FIGS. 21 and 22).
The block 206 (FIG. 23) and the blocks 210 and 214 may be implemented similarly. For example, each of the blocks 206, 210 and 214 may comprise nine submodules Li, i=0, . . . , 8. Each submodule Li has 8+i inputs and 8+i outputs and computes the linear transform Li·Xi, where Xi is a vector-column with 8+i components and Li is a (8+i)×(8+i)-matrix.
Taking i=0 for example, the standard base is
$B_\alpha=\{1,\alpha,\alpha^2,\alpha^3,\ldots,\alpha^7\}$,
and the corresponding normal base is
$B^\alpha=\{\alpha,\alpha^2,\alpha^4,\alpha^8,\ldots,\alpha^{128}\}$,
where $\alpha$ is a root of the irreducible polynomial $P_0(x)=x^8+x^7+x^2+x+1$. The transition matrix from the standard (polynomial) base $B_\alpha$ to the corresponding normal base $B^\alpha$ is equal to $M=(M_{i,j})=$
This means that
$\alpha^{2^i}=M_{i,0}+M_{i,1}\alpha+M_{i,2}\alpha^2+\ldots+M_{i,7}\alpha^7$, i=0, . . . , 7.
For example,
$\alpha^8=1+\alpha+\alpha^2+\alpha^7$.
The transition matrix from the normal base $B^\alpha$ to the corresponding polynomial base $B_\alpha$ is equal to the inverse matrix $M^{-1}=(M^{-1}_{i,j})=$
If $X=(x_0,\ldots,x_7)$ is the coordinate vector of any element of GF($2^8$) in the standard base, and the coordinate vector of the same element in the normal base is $Y=(y_0,\ldots,y_7)$, then
$\sum_i x_i\alpha^i=\sum_i y_i\alpha^{2^i}=\sum_i y_i(M_{i,0}+M_{i,1}\alpha+M_{i,2}\alpha^2+\ldots+M_{i,7}\alpha^7)$,
$x_j=\sum_i y_iM_{i,j}$, j=0, . . . , 7,
consequently
$X^T=M^T\cdot Y^T$, $Y^T=(M^T)^{-1}X^T=(M^{-1})^TX^T$,
where the symbol T means transposition of the matrix.
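The relations above may be checked numerically. The following Python sketch, offered only as an illustration, builds the rows of M for P0 by repeated squaring, inverts M over GF(2) by Gauss-Jordan elimination, and converts coordinates in both directions. Storing each matrix row as an integer bit mask (bit j holds the entry in column j) and all function names are assumptions of the sketch, not part of the described circuit.

```python
P0 = 0b110000111  # x^8 + x^7 + x^2 + x + 1 (bit k = coefficient of x^k)
N = 8

def mulmod(a, b, poly=P0, n=N):
    """Multiply two elements given in the standard base, reducing modulo poly."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        b >>= 1
        a <<= 1
        if (a >> n) & 1:
            a ^= poly
    return r

def transition_matrix(poly=P0, n=N):
    """Row i holds the standard-base coordinates of alpha^(2^i)."""
    rows, e = [], 0b10  # e starts as alpha
    for _ in range(n):
        rows.append(e)
        e = mulmod(e, e, poly, n)
    return rows

def invert_gf2(rows, n=N):
    """Gauss-Jordan inversion over GF(2); raises ValueError if singular."""
    a = list(rows)
    inv = [1 << i for i in range(n)]  # identity matrix
    for col in range(n):
        pivot = next((r for r in range(col, n) if (a[r] >> col) & 1), None)
        if pivot is None:
            raise ValueError("matrix is singular over GF(2)")
        a[col], a[pivot] = a[pivot], a[col]
        inv[col], inv[pivot] = inv[pivot], inv[col]
        for r in range(n):
            if r != col and (a[r] >> col) & 1:
                a[r] ^= a[col]
                inv[r] ^= inv[col]
    return inv

def apply_rows(v, rows):
    """Row vector times matrix over GF(2): XOR the rows selected by bits of v."""
    out = 0
    for i, row in enumerate(rows):
        if (v >> i) & 1:
            out ^= row
    return out

M = transition_matrix()
M_INV = invert_gf2(M)

def normal_to_standard(y):
    return apply_rows(y, M)      # x_j = sum_i y_i * M[i][j]

def standard_to_normal(x):
    return apply_rows(x, M_INV)  # y = x * M^(-1)
```

For P0 the row of M with index 3 is 1+α+α^2+α^7, and M contains 25 ones, so one XOR tree per output needs 25−8=17 two-input XOR cells, which agrees with the circuit size stated in the text.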
The linear transform $X\to M\cdot X^T$ (or $X\to M^T\cdot X^T$) may be implemented as a circuit comprising 2-input XOR-cells. The size of such a circuit is equal to 17 and the depth is equal to 3. The size of an optimized circuit is less than or equal to 13 (for example, a cover of the given matrix by unit submatrices may be used). For any matrix M the minimal size of the circuit for performing the linear transform $X\to M\cdot X^T$ is the same as for performing the linear transform $X\to M^T\cdot X^T$ (but the depths may be different).
The linear transform module of the block 204 is generally configured to reduce modulo Pi, i=0, . . . , 8 and may comprise, in one example, nine submodules, each configured to compute a respective C(x) mod Pi, i=0, . . . , 8. For example, the linear transform C(x)→Q(x)=C(x) mod P0 may be represented by a formula Q=S·CT, where the matrix S comprises the following matrix:
The linear transform $C\to S\cdot C^T$ may be implemented as a circuit comprising 2-input XOR-cells. The size of the circuit is equal to 32 and the depth is equal to 3. The size of the optimized circuit may be less than or equal to 22. For brevity, only the matrices $M$, $M^{-1}$, and $S$ are presented below for the other values of the index i=1, . . . , 8. The depth is always given for the transposed matrices $M^T$ and $(M^{-1})^T$. For the polynomial $1+x+x^4+x^5+x^6+x^8+x^9$, the transition matrix M from the standard base to the normal base may be implemented as follows:
Complexity 9, depth 2. The transition matrix $M^{-1}$ from the normal base to the standard base may be as follows:
Complexity 9, depth 2.
The matrix S for C(x)→Q(x)=C(x) mod P1 may be as follows:
Complexity 21, depth 3.
For the polynomial $1+x+x^2+x^3+x^4+x^5+x^6+x^7+x^8+x^9+x^{10}$, the transition matrix M from the standard base to the normal base may be as follows:
Complexity 9, depth 1. The transition matrix $M^{-1}$ from the normal base to the standard base:
Complexity 9, depth 1.
The matrix S for C(x)→Q(x)=C(x) mod P2 may be as follows:
Complexity 18, depth 2.
For the polynomial $1+x^2+x^3+x^4+x^8+x^{10}+x^{11}$, the transition matrix M from the standard base to the normal base may be as follows:
Complexity 12, depth 3. The transition matrix $M^{-1}$ from the normal base to the standard base may be as follows:
Complexity 12, depth 2.
The matrix S for C(x)→Q(x)=C(x) mod P3 may be as follows:
Complexity 37, depth 3.
For the polynomial $1+x+x^2+x^3+x^4+x^5+x^6+x^7+x^8+x^9+x^{10}+x^{11}+x^{12}$, the transition matrix M from the standard base to the normal base may be as follows:
Complexity 11, depth 1. The transition matrix $M^{-1}$ from the normal base to the standard base may be as follows:
Complexity 11, depth 1.
The matrix S for C(x)→Q(x)=C(x) mod P4 may be as follows:
Complexity 22, depth 2.
For the polynomial $1+x^3+x^4+x^7+x^{10}+x^{12}+x^{13}$, the transition matrix M from the standard base to the normal base may be as follows:
Complexity 29, depth 3. The transition matrix $M^{-1}$ from the normal base to the standard base may be as follows:
Complexity 27, depth 3.
The matrix S for C(x)→Q(x)=C(x) mod P5 may be as follows:
Complexity 49, depth 3. A better normal base in GF($2^{13}$) may be provided by the following polynomial: $1+x^2+x^3+x^4+x^5+x^6+x^7+x^8+x^9+x^{10}+x^{11}+x^{12}+x^{13}$. The transition matrix M from the standard base to the normal base may be as follows:
Complexity 27, depth 3.
The transition matrix $M^{-1}$ from the normal base to the standard base:
Complexity 26, depth 3.
The matrix S for C(x)→Q(x)=C(x) mod P5 may be as follows:
Complexity 33, depth 3.
For the polynomial $1+x+x^8+x^9+x^{12}+x^{13}+x^{14}$, the transition matrix M from the standard base to the normal base may be as follows:
Complexity 19, depth 3. The transition matrix $M^{-1}$ from the normal base to the standard base may be as follows:
Complexity 19, depth 3.
The matrix S for C(x)→Q(x)=C(x) mod P6 may be as follows:
Complexity 53, depth 4.
For the polynomial $1+x^2+x^4+x^5+x^7+x^9+x^{12}+x^{14}+x^{15}$, the transition matrix M from the standard base to the normal base may be as follows:
Complexity 35, depth 3. The transition matrix $M^{-1}$ from the normal base to the standard base may be as follows:
Complexity 35, depth 3.
The matrix S for C(x)→Q(x)=C(x) mod P7 may be as follows:
Complexity 78, depth 4.
For the polynomial $1+x^5+x^{14}+x^{15}+x^{16}$, the transition matrix M from the standard base to the normal base may be as follows:
Complexity 59, depth 4. The transition matrix $M^{-1}$ from the normal base to the standard base:
Complexity 41, depth 3.
The matrix S for C(x)→Q(x)=C(x) mod P8 may be as follows:
Complexity 81, depth 4.
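The reduction matrices S discussed above may be regenerated from the polynomials alone. In the following Python sketch, column k of S is the remainder of x^k modulo the field polynomial, for k=0, . . . , 2n−2. The function names and the bit-mask storage (bit j of a column is the entry in row j) are assumptions of the sketch. The XOR count it reports is the unoptimized one, obtained with one XOR tree per output.

```python
def reduction_columns(poly, n):
    """Column k of S: coefficients of x^k mod poly, for k = 0 .. 2n-2."""
    cols, e = [], 1
    for _ in range(2 * n - 1):
        cols.append(e)
        e <<= 1                # multiply by x
        if (e >> n) & 1:
            e ^= poly          # subtract poly when the degree reaches n
    return cols

def reduce_with_matrix(c, cols):
    """Q = S * C^T over GF(2): XOR the columns selected by the bits of c."""
    q = 0
    for k, col in enumerate(cols):
        if (c >> k) & 1:
            q ^= col
    return q

def poly_mod(c, poly, n):
    """Reference reduction by polynomial long division."""
    for k in range(c.bit_length() - 1, n - 1, -1):
        if (c >> k) & 1:
            c ^= poly << (k - n)
    return c

def naive_xor_count(cols, n):
    """Ones in S minus the number of outputs (n rows)."""
    return sum(bin(col).count("1") for col in cols) - n

P0 = 0b110000111   # x^8 + x^7 + x^2 + x + 1
P2 = (1 << 11) - 1 # 1 + x + ... + x^10

S0 = reduction_columns(P0, 8)
S2 = reduction_columns(P2, 10)
```

With these definitions `naive_xor_count(S0, 8)` evaluates to 32 and `naive_xor_count(S2, 10)` to 18, which agrees with the figures given in the text for P0 and for the all-ones polynomial of degree 10. No such agreement is claimed here for the remaining polynomials.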
The normal multiplier module 200 may be implemented with a total size of less than 1153 XOR+893 AND(OR) and a depth of less than 15DXOR+11DAND. Use of an unbalanced tree of OR-cells in the multiplexer submodules may decrease the depth by 2DAND. If a Karatsuba construction is used for the polynomial multiplier within the block 204, the total size of the normal multiplier may be less than 1135 XOR+829 AND(OR) and the depth may be less than 15DXOR+11DAND.
The linear transform module 124 may be represented as a circuit for performing the linear transform X→L·XT, where L is the (108,31)-matrix given below.
The matrix L may be partially optimized as above by using covers for separated submatrices. Additional optimization of the full (108,31)-matrix may be performed to unite submatrices.
A similar method of optimization may be applied to the modules 210 and 214 configured to perform the linear transform from normal bases to standard bases for the first and second multiplicand. The united matrix has a size of (108,16) in both cases.
The depth of the multiplier 204 may also be decreased. At first, the module configured to reduce the result of multiplication modulo Pi and the module configured to perform the linear transform from the standard bases to the normal bases may be combined in one module configured to perform the linear transform with a (108,31)-matrix similar to the matrix described above in connection with FIG. 9. For example, the linear transform performed by the module corresponding to the matrix of FIG. 9 is a superposition of the linear transforms corresponding to the modules configured to reduce the result of multiplication modulo Pi and perform the linear transform from the standard bases to the normal bases. The depth of the corresponding circuit for the (108,31) linear transform is 5. The total depth of modules configured to reduce the result of multiplication modulo Pi and perform the linear transform from the standard bases to the normal bases is 7. Consequently, the total depth of the whole multiplier may be less than 13DXOR+11DAND. The size may be less than 31-16+29·15+27·14+25·13+23·12+21·11+19·10+17·9+15·8−108=2242. However, the complexity (number of ones) of a particular matrix is essentially less than 2000.
The circuit may be optimized by applying a method similar to the method described above as well as other well-known methods. An upper bound on the size of the circuit may be determined without computing the matrix. If an (a,b)-matrix contains N ones, then the size of the circuit for the (a,b)-matrix is less than or equal to N−a. The negation of the (a,b)-matrix has ab−N ones and may be computed by a circuit with the size ab−N−a. Using the circuit for the negation of a given matrix, the circuit for the matrix may be constructed with a size of ab−N+b−1. The depth of the circuit increases by 1. The size of minimal circuits is less than (ab+b−a+1)/2 in any case. Applying this bound to each submatrix with sizes (31,16), (29,15), . . . , (15,8), the total bound may be determined as 1229. The upper bound for the depth of the circuit is 6. The real values of depth and size are less than the values given earlier.
Another upper bound on the size of the circuit may be determined without computing the matrix. All inputs x1, . . . , x31 are separated into blocks X1=(x1, . . . , x6), . . . , X4=(x19, . . . , x24), X5=(x25, . . . , x31). For any i=1, . . . , 4 the module computing all linear forms on variables from Xi consists of 26−7=54 XOR cells and has a depth of 3. The module computing all linear forms on variables from X5 may be comprised of 23−4+15+16=35 XOR cells and may have a depth of 3 (see FIG. 9 above). Using this module and in addition 2·8+2·9+3·10+3·11+3·12+4·13+ . . . +4·16=365 XOR cells, the circuit for any (108,31) linear transform may be constructed (see FIG. 9 above). The size of the circuit is 365+4·54+35=616 and the depth is 6. Consequently, the total size of the normal base multiplier constructed above is less than 1225 XOR+893 AND(OR) and the depth is less than 14DXOR+11DAND.
The size of the multiplier presented may be compared with the straightforward multiplier constructed from Hasan-Reyhani-Masoleh (HMR) multipliers in GF($2^n$), n=8, . . . , 16 used in parallel. The size of an HMR multiplier in normal base B for GF($2^n$) equals $n(C_B+3n-2)/2$, where $C_B$ is the complexity of base B (the number of ones in the Massey-Omura matrix of B). For an optimal normal base of type 1 the size is equal to $2n^2-1$. The minimal values of complexity of normal bases in GF($2^n$), n=8, . . . , 16 are 29, 17, 19, 21, 23, 45, 27, 45, 85. Hence, the size of the straightforward multiplier is $(8(29+22)+9(17+25)+2\cdot10^2-1+11(21+31)+2\cdot12^2-1+13(45+37)+14(27+40)+15(45+43)+16(85+46))/2+212=3844$. The depth is equal to 8DXOR+6DAND.
The straightforward standard base multiplier constructed from Mastrovito multipliers in GF($2^n$), n=8, . . . , 16 used in parallel, has the size $(8^2+7^2)+(9^2+8^2)+(10^2+9^2)+(11^2+10^2)+(12^2+11^2)+(13^2+12^2)+(14^2+13^2)+(15^2+14^2)+(16^2+15^2)+365+212=3082$. The depth is equal to 8DXOR+6DAND.
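The two size totals may be reproduced with a few lines of Python. The sketch evaluates the expressions exactly as they are written in the text, with the type 1 terms for n=10 and n=12 placed inside the halved sum. The constants 365 and 212 are copied from the text without interpretation, and all variable names are hypothetical.

```python
# Minimal complexities C_B of normal bases in GF(2^n), n = 8..16, from the text.
C_B = {8: 29, 9: 17, 10: 19, 11: 21, 12: 23, 13: 45, 14: 27, 15: 45, 16: 85}
TYPE_1 = {10, 12}  # fields for which the text uses the type 1 term 2n^2 - 1

def hmr_term(n):
    """Summand for field size n, as it appears inside the halved sum."""
    if n in TYPE_1:
        return 2 * n * n - 1
    return n * (C_B[n] + 3 * n - 2)

hmr_numerator = sum(hmr_term(n) for n in range(8, 17))
hmr_total = hmr_numerator // 2 + 212

# Mastrovito summand n^2 + (n-1)^2 per field, plus the two constants in the text.
mastrovito_total = sum(n * n + (n - 1) * (n - 1) for n in range(8, 17)) + 365 + 212

print(hmr_total, mastrovito_total)  # 3844 3082
```

Both printed values match the totals stated above.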
Referring to FIG. 25, a diagram is shown illustrating a circuit 300 for computing a linear transform based on a matrix and taking into account a cover with unit-submatrices. The techniques described below may be utilized in the realization of the matrices presented above. In one example, a matrix may be defined as follows:
that computes the following linear transform:
$y_1=x_1+x_2+x_3$,
$y_2=x_1+x_2+x_4$,
$y_3=x_1+x_2$,
$y_4=x_2+x_3$.
Using a cover of one 2×3 unit-submatrix and four 1×1 unit-submatrices, the XOR circuit 300 may be constructed for the above transform as follows:
$z_1=x_1+x_2$,
$y_1=z_1+x_3$,
$y_2=z_1+x_4$,
$y_3=z_1$,
$y_4=x_2+x_3$.
Implementing each of the above equations as a single XOR gate generally results in the circuit 300.
While the invention has been particularly shown and described with reference to the preferred embodiments thereof, it will be understood by those skilled in the art that various changes in form and details may be made without departing from the scope of the invention.