Low complexity bit-parallel systolic architecture for computing $C+AB$, $AB$, $C+AB^2$ or $AB^2$ over a class of $GF(2^m)$

Information

  • Patent Application
  • 20060106908
  • Publication Number
    20060106908
  • Date Filed
    November 17, 2004
  • Date Published
    May 18, 2006
Abstract
A systolic architecture, free of global connections, for computing $C+AB$, $AB$, $C+AB^2$ or $AB^2$ over a class of $GF(2^m)$, wherein A, B and C are input elements of $GF(2^m)$. The systolic architecture includes an inner product unit and a modular unit. The inner product unit includes $m^2$ U cells and $2m+1$ latch units. Each U cell includes an AND gate, an exclusive-OR (XOR) gate and three latches. The coefficients $A_j$, $B_j$ and $C_{\langle 2j\rangle}$ of A, B and C are respectively inputted via the input ends $A_j$, $S_j$ and $C_{\langle 2j\rangle}$ of $U_{0,j}$, wherein $\langle 2j\rangle$ represents $2j$ modulo $m+1$. The modular unit includes m XOR gates for computing the reduction modulo $p(x)$.
Description
BACKGROUND OF THE INVENTION

1. Field of the Invention


The present invention relates to a low complexity bit-parallel systolic architecture, and more particularly to a low complexity bit-parallel systolic architecture, free of global connections, for computing $C+AB$, $AB$, $C+AB^2$ or $AB^2$ over a class of $GF(2^m)$.


2. Description of Related Art


Finite fields $GF(2^m)$ have been broadly applied to error control coding and cryptography [reference 12]. The fundamental operations in a finite field are addition, multiplication, exponentiation, division and multiplicative inversion. In error control coding, however, information processing usually requires the power-sum operation $C+AB^2$. $AB^2$ circuits have been shown to be more effective than $AB$ circuits in performing exponentiation, inversion and division in $GF(2^m)$. The $AB^2$ operation can be performed by ordinary multiplication, but not necessarily efficiently. Recently, several studies have sought to solve this problem. For example, Wei [reference 1] presented a systolic array with bi-directional data flow to compute $C+AB^2$ over $GF(2^m)$ using the standard basis representation; Wang and Guo [reference 2] presented a systolic array with unidirectional data flow over $GF(2^m)$; Liu [reference 3] proposed an $AB^2$ multiplier that used a cellular architecture in $GF(2^m)$ and was based on an irreducible all one polynomial (AOP); and Lee [reference 4] presented a bit-parallel systolic array over a class of $GF(2^m)$ that is also based on an irreducible AOP. This study focuses on the implementation of the systolic circuit of the $C+AB$, $AB$, $C+AB^2$ or $AB^2$ operation over the class of AOP-based $GF(2^m)$ and the class of equally spaced polynomial based (ESP-based) $GF(2^m)$.


An irreducible AOP or an irreducible ESP generates a special finite field in which arithmetic operations can be simplified. In 1989, Itoh and Tsujii [reference 5] designed two low-complexity multipliers in a class of $GF(2^m)$ based on the irreducible AOP of degree m or the irreducible ESP of degree mr. Since then, many bit-parallel low-complexity multipliers have been proposed for error-control coding or cryptographic applications, such as those described in [references 6-9]. Recently, Lee et al. [reference 10] employed cyclic shifting and the inner product to implement efficient systolic multipliers over a class of $GF(2^m)$ in which an irreducible AOP or an irreducible ESP generates each element of the finite field, such that the systolic circuits have low latency and low complexity. However, the circuit in [reference 10] includes many surplus inputs and latches if the order m of $GF(2^m)$ is large. Later, Lee et al. [reference 11] used some global connections to eliminate the unused inputs and latches in another design. In particular, public-key cryptography applies the finite field $GF(2^m)$ [reference 12] with an order m that ranges from dozens to hundreds. If m is on the order of hundreds, then reducing the number of redundant inputs and latches or eliminating the global connections becomes important.


This study develops an algorithm for computing $C+AB$, $AB$, $C+AB^2$ or $AB^2$ over a class of fields $GF(2^m)$ using the characteristics of an irreducible AOP of degree m. Based on the algorithm, a ringed parallel-in parallel-out systolic multiplier for computing $C+AB^2$ is proposed. The multiplier consists of $m^2$ identical cells, each consisting of one 2-input AND gate, one 2-input XOR gate and three 1-bit latches. The multiplier uses fewer gates than those in [reference 3, 4, 10 or 11]. The architecture includes no redundant inputs or latches and has no global connections; it is therefore suitable for use in VLSI design. Moreover, extending this algorithm enables the ringed bit-parallel systolic architecture over the class of $GF(2^m)$ to be applied to ESP-based multiplication over the class of $GF(2^{nr})$ as well.


SUMMARY OF THE INVENTION

The main objective of the present invention is to provide an improved bit-parallel systolic architecture for computing $C+AB$, $AB$, $C+AB^2$ or $AB^2$ over a class of $GF(2^m)$ based on the irreducible all one polynomial (AOP) or the irreducible equally spaced polynomial (ESP), where A, B and C are elements of $GF(2^m)$.


To achieve the objective, elements over $GF(2^m)$ are represented in extended form. Such elements have two important properties: first, the polynomial of the elements is cyclic modulo $x^{m+1}+1$, and second, some fixed zero terms of the product of two elements can be ignored in the polynomials. With these properties, ringed low-complexity bit-parallel systolic multipliers are presented. The ringed bit-parallel systolic multiplier over the class of $GF(2^m)$ requires few gates and no global connections. Accordingly, the new multiplier has a low complexity and few input pins. This ringed configuration can be easily implemented by taking advantage of three-dimensional routing in VLSI systems. As examples, the architecture of the multiplier was designed to compute $C+AB^2$ over $GF(2^4)$, based on the irreducible AOP, and over $GF(2^6)$, based on the irreducible ESP. Notably, the fields $GF(2^4)$ and $GF(2^6)$ are used to illustrate the structures and operations of the two new multipliers presented herein; however, the extension of these structures to the general case of $GF(2^m)$ is straightforward.


Further benefits and advantages of the present invention will become apparent after a careful reading of the detailed description with appropriate reference to the accompanying drawings.




BRIEF DESCRIPTION OF THE DRAWINGS


FIG. 1(a) is a bit-parallel systolic inner product unit for computing $C+AB$, $AB$, $C+AB^2$ or $AB^2$ over $GF(2^4)$ in accordance with the present invention;



FIG. 1(b) is a detailed circuit of the $U_{i,j}$ cell;



FIG. 1(c) is a modular unit;



FIG. 2 is a cyclic sequence $\langle \alpha^0\ \alpha^2\ \alpha^4\ \alpha^1\ \alpha^3\rangle$ modulo $(\alpha^5+1)$;



FIG. 3 is a ringed bit-parallel systolic circuit for computing $C+AB$, $AB$, $C+AB^2$ or $AB^2$ over $GF(2^4)$ based on the irreducible AOP of degree 4; and



FIG. 4 is a ringed systolic structure for computing $C+AB$, $AB$, $C+AB^2$ or $AB^2$ over $GF(2^6)$ based on the irreducible ESP of degree 6.




DETAILED DESCRIPTION OF THE INVENTION

1. Mathematical Background


This section introduces the properties of the cyclic shifting and the inner product of the field $GF(2^m)$ based on an irreducible AOP, as introduced in [reference 10]. These properties are important in developing the multipliers hereinafter.


1.1 Extended Canonical Basis


A polynomial of the form $p(x)=p_0+p_1x+\cdots+p_mx^m$ over $GF(2)$ is called an AOP of degree m if $p_i=1$ for $i=0, 1, \ldots, m$ [reference 5]. An AOP has been shown to be irreducible if and only if $m+1$ is a prime and 2 is a primitive element of the field $GF(m+1)$. For $m\le 100$, the possible values of m for which an AOP of degree m is irreducible are 2, 4, 10, 12, 18, 28, 36, 52, 58, 60, 66, 82 and 100.
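The irreducibility criterion above can be checked numerically. The following is a minimal sketch in Python; the function names are illustrative and not part of the original disclosure. It tests, for each m from 2 upward, whether $m+1$ is prime and whether 2 has multiplicative order m modulo $m+1$ (that is, 2 is a primitive element of $GF(m+1)$).

```python
def is_prime(n):
    """Trial-division primality test, adequate for small n."""
    if n < 2:
        return False
    d = 2
    while d * d <= n:
        if n % d == 0:
            return False
        d += 1
    return True


def multiplicative_order(a, p):
    """Smallest k >= 1 with a**k == 1 (mod p); assumes gcd(a, p) == 1."""
    k, x = 1, a % p
    while x != 1:
        x = (x * a) % p
        k += 1
    return k


def irreducible_aop_degrees(limit):
    """Degrees m in [2, limit] for which m+1 is prime and 2 is primitive mod m+1."""
    degrees = []
    for m in range(2, limit + 1):
        p = m + 1
        # is_prime short-circuits, so the order is only computed for odd primes p
        if is_prime(p) and multiplicative_order(2, p) == m:
            degrees.append(m)
    return degrees


print(irreducible_aop_degrees(100))
```

Under these assumptions the printed list should coincide with the degrees enumerated in the paragraph above.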


Suppose that $\alpha$ is a root of an irreducible AOP of degree m; then any element A in the Galois field $GF(2^m)$ can be represented as $A=a_0+a_1\alpha+a_2\alpha^2+\cdots+a_{m-1}\alpha^{m-1}$, where the coefficients $a_i\in GF(2)$ for $0\le i\le m-1$, and $\{1, \alpha, \alpha^2, \ldots, \alpha^{m-1}\}$ is called a canonical basis of $GF(2^m)$. Notably, the element A can also be represented as $A=A_0+A_1\alpha+A_2\alpha^2+\cdots+A_m\alpha^m$, with $A_i=a_i+A_m$ for $0\le i\le m-1$ and $A_m=0$ or 1. The basis $\{1, \alpha, \alpha^2, \ldots, \alpha^m\}$ is then called an extended basis of the canonical basis $\{1, \alpha, \alpha^2, \ldots, \alpha^{m-1}\}$.
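The relation $A_i=a_i+A_m$ follows from $\alpha^m=1+\alpha+\cdots+\alpha^{m-1}$, so each element has two extended representations, one for each choice of $A_m$. A minimal sketch of the conversion, with coefficient lists ordered from the constant term upward (the function names are hypothetical):

```python
def to_extended(a, top_bit=0):
    """Canonical coefficients a_0..a_{m-1} -> extended A_0..A_m with A_m = top_bit."""
    return [ai ^ top_bit for ai in a] + [top_bit]


def to_canonical(A):
    """Extended coefficients A_0..A_m -> canonical a_i = A_i + A_m (addition in GF(2))."""
    return [Ai ^ A[-1] for Ai in A[:-1]]


a = [1, 0, 1, 1]                      # an element of GF(2^4), m = 4
print(to_extended(a, top_bit=0))      # representation with A_4 = 0
print(to_extended(a, top_bit=1))      # representation with A_4 = 1
```

Both extended forms map back to the same canonical coefficients, which is why the multipliers described later are free to fix $A_m=0$.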


1.2 Inner Product


Let $P(x)=1+x+x^2+\cdots+x^m$ be an irreducible AOP of degree m, and let $\alpha$ be a root of $P(x)$, such that $P(\alpha)=1+\alpha+\alpha^2+\cdots+\alpha^m=0$. Then,

$$\alpha^{m+1}=1. \qquad (1)$$

Definition 1: Let $A=A_0+A_1\alpha+A_2\alpha^2+\cdots+A_m\alpha^m$ be an element in $GF(2^m)$ represented with the extended basis. Then $A^{(1)}=A_m+A_0\alpha+A_1\alpha^2+\cdots+A_{m-1}\alpha^m$ and $A^{(-1)}=A_1+A_2\alpha+A_3\alpha^2+\cdots+A_0\alpha^m$ denote the elements obtained by shifting A cyclically one position to the right and one position to the left, respectively.


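Analogously, $A^{(i)}$ and $A^{(-i)}$, where $i=0, 1, 2, \ldots, m$, represent the elements obtained by shifting A cyclically i positions to the right and i positions to the left, respectively:

$$A^{(i)}=A_{\langle m-i+1\rangle}+A_{\langle m-i+2\rangle}\alpha+\cdots+A_{\langle m-i\rangle}\alpha^m=\sum_{j=0}^{m}A_{\langle j-i\rangle}\alpha^j, \qquad (2)$$

$$A^{(-i)}=A_{\langle i\rangle}+A_{\langle i+1\rangle}\alpha+\cdots+A_{\langle m+i\rangle}\alpha^m=\sum_{j=0}^{m}A_{\langle j+i\rangle}\alpha^j, \qquad (3)$$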

where $\langle\theta\rangle$, the subscript of $A_{\langle\theta\rangle}$, represents the least nonnegative residue of $\theta$ modulo $m+1$ (for all AOP-based $GF(2^m)$). Notably, $A^{(0)}=A^{(-0)}=A$.


An important operation, called the inner product, is defined as follows.


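Definition 2: Let $A=A_0+A_1\alpha+\cdots+A_m\alpha^m$ and $B=B_0+B_1\alpha+\cdots+B_m\alpha^m$ be two elements of $GF(2^m)$, where $\alpha$ is a root of the irreducible AOP of degree m. Then the inner product of A and B is defined as,

$$A\cdot B=\Big(\sum_{j=0}^{m}A_j\alpha^j\Big)\cdot\Big(\sum_{j=0}^{m}B_j\alpha^j\Big)=\sum_{j=0}^{m}A_jB_j\alpha^{2j}. \qquad (4)$$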

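By Definitions 1 and 2, the inner product of $A^{(i)}$ and $B^{(-i)}$ is given by,

$$A^{(i)}\cdot B^{(-i)}=\Big(\sum_{j=0}^{m}A_{\langle j-i\rangle}\alpha^j\Big)\cdot\Big(\sum_{j=0}^{m}B_{\langle j+i\rangle}\alpha^j\Big)=\sum_{j=0}^{m}A_{\langle j-i\rangle}B_{\langle j+i\rangle}\alpha^{2j}. \qquad (5)$$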


The inner product operation defined in Definition 2 is important in the proposed algorithm.


Theorem 1: Assume that $A=A_0+A_1\alpha+\cdots+A_m\alpha^m$ and $B=B_0+B_1\alpha+\cdots+B_m\alpha^m$ are two elements in $GF(2^m)$. Then A and B over $GF(2^m)$ can be multiplied using,

$$AB=A^{(0)}\cdot B^{(-0)}+A^{(1)}\cdot B^{(-1)}+\cdots+A^{(m)}\cdot B^{(-m)}=\sum_{i=0}^{m}A^{(i)}\cdot B^{(-i)}. \qquad (6)$$
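Theorem 1 can be sanity-checked by comparing the sum of shifted inner products with an ordinary product reduced by $\alpha^{m+1}=1$. The sketch below is illustrative only (all function names are assumptions); coefficient lists have length $m+1$, which is odd for every irreducible AOP.

```python
def bits(value, width):
    """Little-endian list of the low `width` bits of value."""
    return [(value >> i) & 1 for i in range(width)]


def shift_right(A, i):
    """A^(i): coefficient j of the result is A_<j-i>, as in Eq. (2)."""
    n = len(A)
    return [A[(j - i) % n] for j in range(n)]


def shift_left(A, i):
    """A^(-i): coefficient j of the result is A_<j+i>, as in Eq. (3)."""
    n = len(A)
    return [A[(j + i) % n] for j in range(n)]


def inner_product(A, B):
    """Eq. (4): the term A_j B_j lands on alpha^(2j), reduced with alpha^(m+1) = 1."""
    n = len(A)
    out = [0] * n
    for j in range(n):
        out[(2 * j) % n] ^= A[j] & B[j]
    return out


def multiply_by_inner_products(A, B):
    """Eq. (6): sum over i of A^(i) . B^(-i)."""
    n = len(A)
    acc = [0] * n
    for i in range(n):
        term = inner_product(shift_right(A, i), shift_left(B, i))
        acc = [x ^ y for x, y in zip(acc, term)]
    return acc


def cyclic_multiply(A, B):
    """Reference: schoolbook product over GF(2) reduced with alpha^(m+1) = 1."""
    n = len(A)
    out = [0] * n
    for j in range(n):
        for k in range(n):
            out[(j + k) % n] ^= A[j] & B[k]
    return out


A, B = bits(0b01101, 5), bits(0b10011, 5)
print(multiply_by_inner_products(A, B) == cyclic_multiply(A, B))
```

The agreement relies on $m+1$ being odd, so that the map $j\mapsto\langle 2j\rangle$ visits every exponent exactly once.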


Based on Theorem 1, bit-parallel systolic multipliers for computing $C+AB^2$ were presented in [reference 3] and [reference 4]; the latency of those multipliers is only $m+1$ clock cycles. However, the circuit still requires $(m+1)^2$ cells and $5m+3$ input pins. Following the above preliminaries, Section 2 presents a modified multiplier for computing $C+AB^2$ over $GF(2^m)$, based on an irreducible AOP.


2. Multiplier for Computing C+AB2


2.1 Representation for Computing C+AB2


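Definition 3: Let $B=B_0+B_1\alpha+\cdots+B_m\alpha^m$ be an element of $GF(2^m)$ generated by an irreducible AOP $p(x)$, where $\alpha$ is a root of $p(x)$. Then the square of B is defined as,

$$B^2=(B_0+B_1\alpha+B_2\alpha^2+\cdots+B_m\alpha^m)^2=B_0+B_1\alpha^2+B_2\alpha^4+\cdots+B_m\alpha^{2m}=S_0+S_1\alpha+S_2\alpha^2+\cdots+S_m\alpha^m, \qquad (7)$$

where,

$$S_i=\begin{cases}B_{i/2}, & i\ \text{even}\\ B_{(i+m+1)/2}, & i\ \text{odd}\end{cases} \qquad (8)$$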
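Eq. (8) says that squaring in the extended basis costs no gates: it only permutes coefficients. The following sketch (hypothetical function names) computes $S$ both from the case formula in Eq. (8) and by scattering $B_i$ to exponent $\langle 2i\rangle$, and the two should agree whenever $m+1$ is odd.

```python
def square_by_formula(B):
    """S_i from Eq. (8); len(B) = m + 1 is assumed to be odd."""
    n = len(B)
    return [B[i // 2] if i % 2 == 0 else B[(i + n) // 2] for i in range(n)]


def square_by_scatter(B):
    """Place B_i on alpha^<2i>, using alpha^(m+1) = 1 and (x + y)^2 = x^2 + y^2 in GF(2)."""
    n = len(B)
    S = [None] * n
    for i in range(n):
        S[(2 * i) % n] = B[i]
    return S


labels = ["B0", "B1", "B2", "B3", "B4"]      # symbolic coefficients for m = 4
print(square_by_formula(labels))
print(square_by_scatter(labels))
```

For $m=4$ both calls give the ordering $B_0, B_3, B_1, B_4, B_2$ that Example 1 uses for $S_0,\ldots,S_4$.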


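Let A and B be two elements of $GF(2^m)$, both represented with the extended basis $\{1, \alpha, \alpha^2, \ldots, \alpha^m\}$; then the inner product of A and $B^2$ is obtained by,

$$A\cdot B^2=(A_0)(S_0)+(A_1\alpha)(S_1\alpha)+\cdots+(A_m\alpha^m)(S_m\alpha^m)=\Big(\sum_{j=0}^{m}A_j\alpha^j\Big)\cdot\Big(\sum_{j=0}^{m}S_j\alpha^j\Big)=\sum_{j=0}^{m}A_jS_j\alpha^{2j}. \qquad (9)$$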

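By Definitions 1 and 2 again, the inner product of $A^{(i)}$ and $(B^2)^{(-i)}$ is given by,

$$A^{(i)}\cdot(B^2)^{(-i)}=\Big(\sum_{j=0}^{m}A_{\langle j-i\rangle}\alpha^j\Big)\cdot\Big(\sum_{j=0}^{m}S_{\langle j+i\rangle}\alpha^j\Big)=\sum_{j=0}^{m}A_{\langle j-i\rangle}S_{\langle j+i\rangle}\alpha^{2j}. \qquad (10)$$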


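According to Eqs. (1) and (7), the product of A and $B^2$ over $GF(2^m)$ is,

$$AB^2=(A_0+A_1\alpha+A_2\alpha^2+\cdots+A_m\alpha^m)(S_0+S_1\alpha+S_2\alpha^2+\cdots+S_m\alpha^m)=\Big(\sum_{j=0}^{m}A_j\alpha^j\Big)\Big(\sum_{i=0}^{m}S_i\alpha^i\Big)=\sum_{i=0}^{m}\sum_{j=0}^{m}A_jS_{\langle i-j\rangle}\alpha^i, \qquad (11)$$

where,

$$S_{\langle i-j\rangle}=\begin{cases}B_{\langle i-j\rangle/2}, & \langle i-j\rangle\ \text{even}\\ B_{(\langle i-j\rangle+m+1)/2}, & \langle i-j\rangle\ \text{odd}\end{cases}$$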


EXAMPLE 1

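Assume that $A=A_0+A_1\alpha+A_2\alpha^2+A_3\alpha^3+A_4\alpha^4$ and $B=B_0+B_1\alpha+B_2\alpha^2+B_3\alpha^3+B_4\alpha^4$ are two elements in the field $GF(2^4)$. Let $D=D_0+D_1\alpha+D_2\alpha^2+D_3\alpha^3+D_4\alpha^4$ denote the product of A and $B^2$ over $GF(2^4)$:

$$\begin{aligned}D=AB^2&=(A_0+A_1\alpha+A_2\alpha^2+A_3\alpha^3+A_4\alpha^4)(S_0+S_1\alpha+S_2\alpha^2+S_3\alpha^3+S_4\alpha^4)\\&=(A_0+A_1\alpha+A_2\alpha^2+A_3\alpha^3+A_4\alpha^4)(B_0+B_3\alpha+B_1\alpha^2+B_4\alpha^3+B_2\alpha^4).\end{aligned}$$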


Then, from Eq. (1), $\alpha^5=1$, and from Eq. (11), the coefficients of D are given by,

$$\begin{aligned}
D_0&=A_0B_0+A_4B_3+A_3B_1+A_2B_4+A_1B_2,\\
D_1&=A_1B_0+A_0B_3+A_4B_1+A_3B_4+A_2B_2,\\
D_2&=A_2B_0+A_1B_3+A_0B_1+A_4B_4+A_3B_2,\\
D_3&=A_3B_0+A_2B_3+A_1B_1+A_0B_4+A_4B_2,\\
D_4&=A_4B_0+A_3B_3+A_2B_1+A_1B_4+A_0B_2.
\end{aligned}$$

2.2 AOP-Based Algorithm and Circuit


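Theorem 2: Assume that $A=A_0+A_1\alpha+A_2\alpha^2+\cdots+A_m\alpha^m$ and $B=B_0+B_1\alpha+B_2\alpha^2+\cdots+B_m\alpha^m$ are two elements in $GF(2^m)$. Then A and $B^2$ over $GF(2^m)$ can be multiplied using,

$$AB^2=A^{(0)}\cdot(B^2)^{(-0)}+A^{(1)}\cdot(B^2)^{(-1)}+\cdots+A^{(m)}\cdot(B^2)^{(-m)}=\sum_{i=0}^{m}A^{(i)}\cdot(B^2)^{(-i)}.$$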

Proof: Since A and B are two elements in $GF(2^m)$, the product of A and $B^2$ can be obtained from Eq. (11) as,

$$AB^2=\sum_{i=0}^{m}\sum_{j=0}^{m}A_jS_{\langle i-j\rangle}\alpha^i.$$

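Re-indexing the inner sum with $j\to\langle i-j\rangle$ and splitting the right side of this equation into two terms, one with even i and one with odd i, yields,

$$AB^2=\sum_{\substack{i=0\\ i\ \mathrm{even}}}^{m}\sum_{j=0}^{m}A_{\langle i-j\rangle}S_j\alpha^i+\sum_{\substack{i=1\\ i\ \mathrm{odd}}}^{m-1}\sum_{j=0}^{m}A_{\langle i-j\rangle}S_j\alpha^i. \qquad (12)$$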

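Notably, m must be even for an irreducible AOP of degree m. Substituting $\alpha^i=\alpha^{m+1+i}$ and $\langle i-j\rangle=\langle m+1+i-j\rangle$ into the second term on the right side of Eq. (12) gives

$$AB^2=\sum_{\substack{i=0\\ i\ \mathrm{even}}}^{m}\sum_{j=0}^{m}A_{\langle i-j\rangle}S_j\alpha^i+\sum_{\substack{i=1\\ i\ \mathrm{odd}}}^{m-1}\sum_{j=0}^{m}A_{\langle m+1+i-j\rangle}S_j\alpha^{m+1+i}. \qquad (13)$$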

Taking $i=2p$ for even i, where $p=0, 1, \ldots, m/2$, and taking $i=2p-m-1$ for odd i, where $p=(m/2)+1, (m/2)+2, \ldots, m$, Eq. (13) can be rewritten as,

$$AB^2=\sum_{p=0}^{m}\sum_{j=0}^{m}A_{\langle 2p-j\rangle}S_j\alpha^{2p}. \qquad (14)$$

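Let k be an integer such that $0\le k\le m$. Then $\langle p+k\rangle$ must be in the range $0\le\langle p+k\rangle\le m$ for $0\le p\le m$. Thus, $j=\langle p+k\rangle$ can be substituted into the subscripts of $A_{\langle 2p-j\rangle}S_j$ in Eq. (14) to obtain,

$$AB^2=\sum_{k=0}^{m}\sum_{p=0}^{m}A_{\langle p-k\rangle}S_{\langle p+k\rangle}\alpha^{2p}. \qquad (15)$$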

Comparing Eq. (15) with Eq. (10) finally gives,

$$AB^2=\sum_{k=0}^{m}A^{(k)}\cdot S^{(-k)}.$$

That is,

$$AB^2=\sum_{i=0}^{m}A^{(i)}\cdot(B^2)^{(-i)}.$$


EXAMPLE 2

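Assume that $\{1, \alpha, \alpha^2, \alpha^3, \alpha^4\}$ is an extended basis of the field $GF(2^4)$. Let $A=A_0+A_1\alpha+A_2\alpha^2+A_3\alpha^3+A_4\alpha^4$ and $B=B_0+B_1\alpha+B_2\alpha^2+B_3\alpha^3+B_4\alpha^4$ be two elements of the field $GF(2^4)$, and let $D=D_0+D_1\alpha+D_2\alpha^2+D_3\alpha^3+D_4\alpha^4$ be the product of A and $B^2$. By employing the property $\alpha^{m+1+i}=\alpha^i$ modulo $(\alpha^{m+1}+1)$ for $m=4$, the product D can then be computed using Theorem 2:

$$\begin{array}{rccccc}
 & \alpha^0 & \alpha^2 & \alpha^4 & \alpha^6(=\alpha^1) & \alpha^8(=\alpha^3)\\
A^{(0)}\cdot(B^2)^{(-0)}= & A_0B_0 & A_1B_3 & A_2B_1 & A_3B_4 & A_4B_2\\
A^{(1)}\cdot(B^2)^{(-1)}= & A_4B_3 & A_0B_1 & A_1B_4 & A_2B_2 & A_3B_0\\
A^{(2)}\cdot(B^2)^{(-2)}= & A_3B_1 & A_4B_4 & A_0B_2 & A_1B_0 & A_2B_3\\
A^{(3)}\cdot(B^2)^{(-3)}= & A_2B_4 & A_3B_2 & A_4B_0 & A_0B_3 & A_1B_1\\
+\ A^{(4)}\cdot(B^2)^{(-4)}= & A_1B_2 & A_2B_0 & A_3B_3 & A_4B_1 & A_0B_4\\
\hline
 & D_0 & D_2 & D_4 & D_1 & D_3
\end{array}$$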
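The rows of the array in Example 2 can be regenerated mechanically from Eq. (10). The sketch below is illustrative (the function name and the string labels are assumptions): row i, column j holds the term $A_{\langle j-i\rangle}S_{\langle j+i\rangle}$, which contributes to $\alpha^{\langle 2j\rangle}$, and $S$ is expanded to the underlying $B$ coefficient through Eq. (8).

```python
def partial_product_table(m):
    """Return (column exponents, rows of 'AxBy' labels) for AB^2 over an AOP-based GF(2^m)."""
    n = m + 1
    # s_to_b[i] is the index k with S_i = B_k, from Eq. (8)
    s_to_b = [i // 2 if i % 2 == 0 else (i + n) // 2 for i in range(n)]
    columns = [(2 * j) % n for j in range(n)]       # exponent served by column j
    rows = []
    for i in range(n):                               # one row per shift amount i
        row = []
        for j in range(n):
            a_index = (j - i) % n                    # coefficient j of A^(i)
            b_index = s_to_b[(j + i) % n]            # coefficient j of (B^2)^(-i)
            row.append(f"A{a_index}B{b_index}")
        rows.append(row)
    return columns, rows


columns, rows = partial_product_table(4)
print("columns:", [f"a^{e}" for e in columns])
for i, row in enumerate(rows):
    print(i, " ".join(row))
```

Summing each printed column over GF(2) gives the coefficient $D_{\langle 2j\rangle}$ for that column, which matches the ordering $D_0, D_2, D_4, D_1, D_3$ in the last line of the array.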


Definition 4: Let $A=A_0+A_1\alpha+\cdots+A_m\alpha^m$ and $B=B_0+B_1\alpha+\cdots+B_m\alpha^m$ be two elements of $GF(2^m)$, represented with the extended basis $\{1, \alpha, \alpha^2, \ldots, \alpha^m\}$, where $\alpha$ is a root of the irreducible AOP of degree m. If A and B are represented with $A_m=B_m=0$, then $A_iB_m$ and $A_mB_i$ equal zero for $0\le i\le m$. Those terms are called fixed zero terms.


Definition 4 yields the following theorem.


Theorem 3: Assume that $A=A_0+A_1\alpha+\cdots+A_m\alpha^m$ and $B=B_0+B_1\alpha+\cdots+B_m\alpha^m$ are two elements in $GF(2^m)$, and $\alpha$ is a root of the irreducible AOP of degree m. If A and B are represented with $A_m=B_m=0$, then the product of A and $B^2$ over $GF(2^m)$ includes $2m+1$ fixed zero terms.


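Proof: According to Eq. (11), the product of A and $B^2$ over $GF(2^m)$ has $(m+1)^2$ terms. Since $A_m=B_m=0$, Eq. (11) can be simplified as,

$$\begin{aligned}AB^2&=(A_0+A_1\alpha+\cdots+A_{m-1}\alpha^{m-1}+0\cdot\alpha^m)(B_0+B_1\alpha+\cdots+B_{m-1}\alpha^{m-1}+0\cdot\alpha^m)^2\\&=(A_0+A_1\alpha+\cdots+A_{m-1}\alpha^{m-1})(B_0+B_1\alpha^2+\cdots+B_{m-1}\alpha^{2(m-1)})\\&=\Big(\sum_{j=0}^{m-1}A_j\alpha^j\Big)\Big(\sum_{i=0}^{m-1}B_i\alpha^{\langle 2i\rangle}\Big).\end{aligned} \qquad (16)$$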


According to Eq. (16), the product of A and $B^2$ over $GF(2^m)$ has $m\times m=m^2$ terms. Therefore, the product of A and $B^2$ over $GF(2^m)$ has $(m+1)^2-m^2=2m+1$ fixed zero terms.


Using Theorem 3, the $C+AB^2$ circuit can be simplified by omitting the fixed zero terms. The following example illustrates the fixed zero terms of $C+AB^2$ over $GF(2^4)$.


EXAMPLE 3

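Assume that $\{1, \alpha, \alpha^2, \alpha^3, \alpha^4\}$ is an extended basis of the field $GF(2^4)$. Let $A=A_0+A_1\alpha+A_2\alpha^2+A_3\alpha^3+A_4\alpha^4$, $B=B_0+B_1\alpha+B_2\alpha^2+B_3\alpha^3+B_4\alpha^4$ and $C=C_0+C_1\alpha+C_2\alpha^2+C_3\alpha^3+C_4\alpha^4$ be three elements of the field $GF(2^4)$, where $A_4=B_4=C_4=0$. Let $D=D_0+D_1\alpha+D_2\alpha^2+D_3\alpha^3+D_4\alpha^4$ be the result of $C+AB^2$. The result D can then be computed using Theorems 2 and 3:

$$\begin{array}{rccccc}
 & \alpha^0 & \alpha^2 & \alpha^4 & \alpha^6(=\alpha^1) & \alpha^8(=\alpha^3)\\
C= & C_0 & C_2 & C_4 & C_1 & C_3\\
A^{(0)}\cdot(B^2)^{(-0)}= & A_0B_0 & A_1B_3 & A_2B_1 & (A_3B_4=0) & (A_4B_2=0)\\
A^{(1)}\cdot(B^2)^{(-1)}= & (A_4B_3=0) & A_0B_1 & (A_1B_4=0) & A_2B_2 & A_3B_0\\
A^{(2)}\cdot(B^2)^{(-2)}= & A_3B_1 & (A_4B_4=0) & A_0B_2 & A_1B_0 & A_2B_3\\
A^{(3)}\cdot(B^2)^{(-3)}= & (A_2B_4=0) & A_3B_2 & (A_4B_0=0) & A_0B_3 & A_1B_1\\
+\ A^{(4)}\cdot(B^2)^{(-4)}= & A_1B_2 & A_2B_0 & A_3B_3 & (A_4B_1=0) & (A_0B_4=0)\\
\hline
D= & D_0 & D_2 & D_4 & D_1 & D_3
\end{array}$$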


Example 3 involves nine fixed zero terms: the terms of the forms $A_4B_i$ and $A_iB_4$ are zeroes and need not be computed.



FIG. 1(a) shows a parallel-in parallel-out systolic multiplier to perform the above computation. The multiplier consists of 16 U cells and nine latch units. Each U cell employs one 2-input AND gate and one 2-input XOR gate, as shown in FIG. 1(b). The three 1-bit latches in each cell are used to delay each output of the cell by one clock cycle. Notably, bits $A_4$, $B_4$ and $C_4$ are zeroes and need not be input. The modular unit (MU), as shown in FIG. 1(c), performs the reduction modulo $p(\alpha)$. Since $p(\alpha)=1+\alpha+\alpha^2+\alpha^3+\alpha^4=0$ (or $\alpha^4=1+\alpha+\alpha^2+\alpha^3$), the result can be obtained from the relationship $D(\alpha)=d_0+d_1\alpha+d_2\alpha^2+d_3\alpha^3=D_0+D_1\alpha+D_2\alpha^2+D_3\alpha^3+D_4\alpha^4 \bmod p(\alpha)$; therefore $d_i=D_i+D_4$ for $i=0, 1, 2, 3$.
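As a behavioural cross-check of the inner product unit followed by the modular unit, the sketch below computes $C+AB^2$ through the extended basis and the rule $d_i=D_i+D_m$, and compares it with ordinary polynomial arithmetic modulo $p(x)$. It models arithmetic only, not gates, latches or timing; the function names and the integer encoding (bit i holds the coefficient of $x^i$) are assumptions made for illustration.

```python
def poly_mulmod(a, b, p):
    """Reference product of GF(2)[x] polynomials a and b modulo p (all encoded as ints)."""
    deg = p.bit_length() - 1
    r = 0
    while b:
        if b & 1:
            r ^= a
        b >>= 1
        a <<= 1
        if (a >> deg) & 1:       # keep a reduced below degree deg
            a ^= p
    return r


def power_sum_extended(a, b, c, m):
    """C + A*B^2 in an AOP-based GF(2^m), via the extended basis with A_m = B_m = C_m = 0."""
    n = m + 1
    A = [(a >> i) & 1 for i in range(m)] + [0]
    B = [(b >> i) & 1 for i in range(m)] + [0]
    D = [(c >> i) & 1 for i in range(m)] + [0]       # accumulator starts at C
    S = [0] * n
    for i in range(n):
        S[(2 * i) % n] = B[i]                         # squaring is a permutation, Eq. (8)
    for j in range(n):
        for k in range(n):
            D[(j + k) % n] ^= A[j] & S[k]             # product reduced with alpha^(m+1) = 1
    d = [D[i] ^ D[m] for i in range(m)]               # modular unit: d_i = D_i + D_m
    return sum(bit << i for i, bit in enumerate(d))


AOP4 = 0b11111                                        # 1 + x + x^2 + x^3 + x^4
a, b, c = 0b1011, 0b0110, 0b1101
print(power_sum_extended(a, b, c, 4) == c ^ poly_mulmod(a, poly_mulmod(b, b, AOP4), AOP4))
```

The two computations agree because $p(x)$ divides $x^{m+1}+1$, so reducing first by $\alpha^{m+1}=1$ and then by $p(\alpha)$ gives the same element as reducing by $p(\alpha)$ directly.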


2.3 Ringed AOP-Based Circuit


FIG. 1(a) shows some global connections that cause a long delay in a VLSI circuit over $GF(2^m)$ if m is large. From Eq. (5), the order of $\alpha^{2i}$ has a cyclic property modulo $(\alpha^{m+1}+1)$. For example, the sequence $\langle\alpha^0\ \alpha^2\ \alpha^4\ \alpha^1\ \alpha^3\rangle$ is cyclic modulo $(\alpha^5+1)$, as in FIG. 2.


Using the cyclic property of the sequence $\langle\alpha^0\ \alpha^2\ \alpha^4\ \alpha^1\ \alpha^3\rangle$, FIG. 3 depicts a ringed parallel-in parallel-out systolic multiplier circuit that realizes the computation in Example 3. The circuit includes 16 U cells, $U_{i,j}$, where i and j are the row and column numbers, respectively. The circuit of the U cell is the same as that shown in FIG. 1(b). The circuit in FIG. 3 implements the following equations.

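$$T_{0,j}=C_{\langle 2j\rangle}\ \text{(initialization)},\quad j=0, 1, \ldots, m, \qquad (17)$$

$$T_{i+1,j}=T_{i,j}+A^{(i)}_jS^{(-i)}_j,\quad i=0, 1, \ldots, m\ \text{and}\ j=0, 1, \ldots, m, \qquad (18)$$

$$D_{\langle 2j\rangle}=T_{m+1,j},\quad j=0, 1, \ldots, m, \qquad (19)$$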


where $S_j$ is defined as in Eq. (8). The result D can be computed by the following steps:

embedded image
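The recurrence in Eqs. (17) to (19) can be sketched as a level-by-level update, one ring level per clock cycle. This is a behavioural model under the assumption that $A^{(i)}_j=A_{\langle j-i\rangle}$ and $S^{(-i)}_j=S_{\langle j+i\rangle}$, as in Eqs. (2) and (3); the function names are hypothetical and the code does not model latches or cell wiring.

```python
def bits(value, width):
    """Little-endian list of the low `width` bits of value."""
    return [(value >> i) & 1 for i in range(width)]


def square_coefficients(B):
    """S with B^2 = sum S_i alpha^i: B_i moves to exponent <2i>, as in Eq. (8)."""
    n = len(B)
    S = [0] * n
    for i in range(n):
        S[(2 * i) % n] = B[i]
    return S


def ringed_power_sum(A, B, C):
    """Extended-basis coefficients of C + A*B^2 via Eqs. (17)-(19); len(A) = m + 1."""
    n = len(A)
    S = square_coefficients(B)
    T = [C[(2 * j) % n] for j in range(n)]                       # Eq. (17)
    for i in range(n):                                           # ring level i
        T = [T[j] ^ (A[(j - i) % n] & S[(j + i) % n])            # Eq. (18)
             for j in range(n)]
    D = [0] * n
    for j in range(n):
        D[(2 * j) % n] = T[j]                                    # Eq. (19)
    return D


def direct_power_sum(A, B, C):
    """Reference: C plus the product of A and B^2 reduced with alpha^(m+1) = 1, Eq. (11)."""
    n = len(A)
    S = square_coefficients(B)
    D = list(C)
    for j in range(n):
        for k in range(n):
            D[(j + k) % n] ^= A[j] & S[k]
    return D


A, B, C = bits(0b01011, 5), bits(0b00110, 5), bits(0b01101, 5)
print(ringed_power_sum(A, B, C) == direct_power_sum(A, B, C))
```

Column j only ever combines values held in column j of the previous level with the cyclically shifted operands, which is the property that lets the circuit avoid global connections.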


In the above steps, the term $\alpha^3$ is moved to the leftmost position by the cyclic property. The advantage of the circuit in FIG. 3 is that it has no global connections. Several points should be addressed. Using Eq. (18), in ring level 0, the U cell at position $P_{0,3}$ for computing the bit operation $T_{1,3}=T_{0,3}+A_3B_4$ can be replaced by a bit latch because $B_4=0$, and the U cell at position $P_{0,4}$ for computing the bit operation $T_{1,4}=T_{0,4}+A_4B_2$ can be replaced by a bit latch because $A_4=0$. In the next ring level, $A_4$ and $B_4$ shift to the right and to the left, respectively. Then, in ring level 1, at positions $P_{1,0}$ and $P_{1,2}$, the bit operations $T_{2,0}=T_{1,0}+A_4B_3$ and $T_{2,2}=T_{1,2}+A_1B_4$ each require only one bit latch rather than a U cell. Likewise, the U cells at the remaining positions $P_{2,1}$, $P_{3,0}$, $P_{3,2}$, $P_{4,3}$ and $P_{4,4}$ can be replaced by bit latches.


The positions of the ring that use latches instead of U cells are as follows.

embedded image


Here, $P_{i,j}$ denotes the position in row i and column j. In FIG. 3, as in the example illustrated in FIG. 1, the three elements A, B and C in $GF(2^4)$ are used as the three inputs of the modified version, and D represents the result of $C+AB^2$. Comparing the modified circuit with the circuit in [reference 4] shows that the total number of input pins has been reduced from 23 to 12, and the number of U cells has been reduced from 25 to 16.


3. Modified ESP-Based Multiplier


This section proposes an ESP-Based multiplier. The method for computing C+AB2 based on an irreducible AOP can also be applied to compute the multiplication based on an irreducible ESP.


3.1 Algorithm


A polynomial of the form g(x)=1+xr+ . . . +x(n−1)r+xnr is called an r-equally spaced polynomial (r-ESP) of degree nr. Let g(x)=p(xr), then p(x) is an AOP of degree n. If p(x) is an irreducible AOP, then r-ESP g(x) has been shown to be irreducible if and only if r=(n+1)j≠1 modulo (n+1)r, for j≧1 [reference 5]. For nr≦100, the possible pairs (nr,r) for which an r-ESP of degree nr is irreducible, are (6,3), (18,9), (20,5), (54,27) and (100,25).


Now, suppose that $\alpha$ is a root of the irreducible r-ESP of degree nr. Then an element A in the Galois field $GF(2^{nr})$ can be represented as $A=a_0+a_1\alpha+\cdots+a_{nr-1}\alpha^{nr-1}$ using the canonical basis $\{1, \alpha, \alpha^2, \ldots, \alpha^{nr-1}\}$, where $a_i\in GF(2)$ for $0\le i\le nr-1$. The element A can also be represented using the extended basis $\{1, \alpha, \alpha^2, \ldots, \alpha^{(n+1)r-1}\}$ as,

$$A=A_0+A_1\alpha+\cdots+A_{(n+1)r-1}\alpha^{(n+1)r-1}=\sum_{i=0}^{(n+1)r-1}A_i\alpha^i,$$

where $A_i=a_i$ for $0\le i\le nr-1$ and $A_i=0$ for $nr\le i\le(n+1)r-1$.


EXAMPLE 4

Assume that $\alpha$ is a root of the r-ESP $g(x)=1+x^3+x^6$ (that is, $g(x)$ is an irreducible ESP with $nr=6$ and $r=3$). Then $\{1, \alpha, \alpha^2, \alpha^3, \alpha^4, \alpha^5\}$ is a canonical basis of the Galois field $GF(2^6)$ and $\{1, \alpha, \alpha^2, \ldots, \alpha^8\}$ can be used as an extended basis of this canonical basis. Thus, an element in $GF(2^6)$ can be represented as $A=a_0+a_1\alpha+a_2\alpha^2+a_3\alpha^3+a_4\alpha^4+a_5\alpha^5=A_0+A_1\alpha+A_2\alpha^2+\cdots+A_8\alpha^8$ using the extended basis, where $A_i=a_i$ for $0\le i\le 5$, and $A_6=A_7=A_8=0$.


Theorem 4: Assume that $A=A_0+A_1\alpha+\cdots+A_{(n+1)r-1}\alpha^{(n+1)r-1}$ and $B=B_0+B_1\alpha+\cdots+B_{(n+1)r-1}\alpha^{(n+1)r-1}$ are two elements in $GF(2^{nr})$, which are represented with the extended basis $\{1, \alpha, \alpha^2, \ldots, \alpha^{(n+1)r-1}\}$, where $\alpha$ is a root of the irreducible r-ESP of degree nr. Then the product of A and $B^2$ over $GF(2^{nr})$ includes $(2n+1)r^2$ fixed zero terms of the form $A_iB_j$ or $A_jB_i$, for $nr\le j\le(n+1)r-1$ and $0\le i\le(n+1)r-1$, if A and B are represented with $A_j=B_j=0$ for $nr\le j\le(n+1)r-1$.


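Proof: According to Eq. (16), the product of A and $B^2$ over $GF(2^{nr})$ is,

$$\begin{aligned}AB^2&=(A_0+A_1\alpha+\cdots+A_{(n+1)r-1}\alpha^{(n+1)r-1})(B_0+B_1\alpha+\cdots+B_{(n+1)r-1}\alpha^{(n+1)r-1})^2\\&=\Big(\sum_{j=0}^{(n+1)r-1}A_j\alpha^j\Big)\Big(\sum_{i=0}^{(n+1)r-1}B_i\alpha^{\langle 2i\rangle}\Big)\\&=\sum_{i=0}^{(n+1)r-1}\sum_{j=0}^{(n+1)r-1}A_jB_i\alpha^{\langle 2i+j\rangle},\end{aligned} \qquad (20)$$

where $\langle\theta\rangle$ denotes the least nonnegative residue of $\theta$ modulo $(n+1)r$ (for all ESP-based $GF(2^{nr})$). Equation (20) has $((n+1)r)^2$ multiplicative terms. Since $A_j=B_j=0$ for $nr\le j\le(n+1)r-1$, Eq. (20) can be simplified as,

$$\begin{aligned}AB^2&=(A_0+A_1\alpha+\cdots+A_{nr-1}\alpha^{nr-1})(B_0+B_1\alpha+\cdots+B_{nr-1}\alpha^{nr-1})^2\\&=\Big(\sum_{j=0}^{nr-1}A_j\alpha^j\Big)\Big(\sum_{i=0}^{nr-1}B_i\alpha^{\langle 2i\rangle}\Big)\\&=\sum_{i=0}^{nr-1}\sum_{j=0}^{nr-1}A_jB_i\alpha^{\langle 2i+j\rangle}.\end{aligned} \qquad (21)$$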

According to Eq. (21), the product of A and $B^2$ over $GF(2^{nr})$ has $(nr)^2$ terms. Therefore, the product of A and $B^2$ over $GF(2^{nr})$ has $((n+1)r)^2-(nr)^2=(2n+1)r^2$ fixed zero terms.


Since $\alpha$ is a root of the irreducible r-ESP $g(x)=1+x^r+\cdots+x^{nr}$, $g(\alpha)=1+\alpha^r+\cdots+\alpha^{nr}=0$. Given two elements $A=A_0+A_1\alpha+A_2\alpha^2+\cdots+A_{(n+1)r-1}\alpha^{(n+1)r-1}$ and $B=B_0+B_1\alpha+B_2\alpha^2+\cdots+B_{(n+1)r-1}\alpha^{(n+1)r-1}$, the product of A and $B^2$, according to Theorem 2 and Eq. (20), can be expressed as,

$$AB^2=A^{(0)}\cdot(B^2)^{(-0)}+A^{(1)}\cdot(B^2)^{(-1)}+\cdots+A^{((n+1)r-1)}\cdot(B^2)^{(-(n+1)r+1)}=\sum_{i=0}^{(n+1)r-1}A^{(i)}\cdot(B^2)^{(-i)}. \qquad (22)$$

Thus, the method of multiplication based on an irreducible AOP can also be used for multiplication based on an irreducible ESP.


3.2 Ringed Circuit of an ESP-Based Multiplier


Assume two elements $A=a_0+a_1\alpha+a_2\alpha^2+a_3\alpha^3+a_4\alpha^4+a_5\alpha^5=A_0+A_1\alpha+A_2\alpha^2+\cdots+A_8\alpha^8$ and $B=b_0+b_1\alpha+b_2\alpha^2+b_3\alpha^3+b_4\alpha^4+b_5\alpha^5=B_0+B_1\alpha+B_2\alpha^2+\cdots+B_8\alpha^8$. Let $D=D_0+D_1\alpha+D_2\alpha^2+\cdots+D_8\alpha^8$ be the result of $AB^2+C$, where A, B and C are elements over $GF(2^6)$. Set the initial value $T_0=C$. The result D can then be computed using Eq. (22), as follows.

embedded image


The sequence $D_0, D_2, D_4, D_6, D_8, D_1, D_3, D_5, D_7$ is a permutation of the sequence $D_0, D_1, D_2, D_3, D_4, D_5, D_6, D_7, D_8$. Notably, the terms that include $A_6$, $A_7$, $A_8$, $B_6$, $B_7$ and $B_8$ are all zeros, such that $A_jB_k$ and $A_kB_j$ need not be computed for $6\le j\le 8$ and $0\le k\le 8$. Using Eq. (18), in the zeroth ring level, the U cells for computing the bit operations $T_{1,3}=T_{0,3}+A_3B_6$, $T_{1,5}=T_{0,5}+A_5B_7$ and $T_{1,7}=T_{0,7}+A_7B_8$ can be replaced by bit latches because $B_6=B_7=B_8=0$, and those for performing the bit operations $T_{1,6}=T_{0,6}+A_6B_3$, $T_{1,7}=T_{0,7}+A_7B_8$ and $T_{1,8}=T_{0,8}+A_8B_4$ can be replaced by bit latches since $A_6=A_7=A_8=0$. In the first ring level, the zero coefficients of A and B shift to the right and to the left, respectively. Then each of the bit operations $T_{2,2}=T_{1,2}+A_1B_6$, $T_{2,4}=T_{1,4}+A_3B_7$, $T_{2,6}=T_{1,6}+A_5B_8$, $T_{2,7}=T_{1,7}+A_6B_4$, $T_{2,8}=T_{1,8}+A_7B_0$ and $T_{2,\langle 9\rangle}=T_{2,0}=T_{1,0}+A_8B_5$ requires only one bit latch instead of a U cell.


Now, the positions of the ring that use latches rather than cells are described briefly as follows.

embedded image


where $P_{i,j}$ denotes the position in row i and column j.


As introduced in Section 2.3, a ringed structure is used to realize the circuit of the cyclic shift sequence $\langle\alpha^0\ \alpha^2\ \alpha^4\ \alpha^6\ \alpha^8\ \alpha^1\ \alpha^3\ \alpha^5\ \alpha^7\rangle$. FIG. 4 depicts the ringed bit-parallel systolic multiplier based on the 3-ESP $x^6+x^3+1$, as a simple illustration; the detail of the U-cell circuit is as shown in FIG. 1(b). FIG. 4 shows the positions of each ring level that uses a latch rather than a U cell. The proposed ESP-based systolic multiplier comprises $(nr)^2$ U cells and $(2n+1)r^2$ latch units. Herein, only the positions of the ring in which cells can be replaced by latches are discussed. From FIG. 4, cells over $GF(2^6)$ in positions $P_{i,\langle 2j\rangle}$ with $A_kB_6$, $A_kB_7$, $A_kB_8$ and $A_6B_k$, $A_7B_k$, $A_8B_k$ for $0\le k\le 8$ can be replaced by latches.


The positions of the ringed ESP-based over GF(2nr) are obtained according to a general rule as follows.

  • Step 1: // Initialization. Hereafter, $P_{i,j}$ denotes the position of level i and column j in an r-ESP structure
    • for every $i=1, 2, \ldots, (n+1)r-1$ and $j=1, 2, \ldots, (n+1)r-1$: $P_{i,j}$ = U-Cell;
  • Step 2: // Replace U-cells of $A_jB_k$ and $A_kB_j$ with latches
    • for every $i=1, 2, \ldots, (n-1)r-1$:
      • for $j=nr+i, nr+i+1, \ldots, (n+1)r+i-1$:
        • $P_{i,j}$ = Latch; // for $A_jB_k$, where $0\le k\le(n+1)r-1$, fixed zero terms
      • for $j=(n-1)r-i, (n-1)r-i+2, \ldots, (n+1)r-i-2$:
        • $P_{i,j}$ = Latch; // for $A_kB_j$, where $0\le k\le(n+1)r-1$, fixed zero terms


This rule is suitable for both AOP-based and ESP-based systolic architectures. For $r=1$, the above algorithm reduces to the AOP-based systolic architecture.
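The placement rule can be sketched and cross-checked against a direct search for fixed zero terms. The sketch below makes two assumptions that are not stated explicitly in the rule: levels and columns are indexed from 0 up to $(n+1)r-1$, and every column index is reduced modulo $(n+1)r$. With these assumptions the rule reproduces the nine latch positions listed for $GF(2^4)$. The function names are hypothetical.

```python
def latch_positions_by_rule(n, r):
    """Latch positions (level i, column j) from the general rule, indices taken mod (n+1)r."""
    size = (n + 1) * r
    latches = set()
    for i in range(size):
        # columns whose A operand is a fixed zero coefficient
        for j in range(n * r + i, (n + 1) * r + i):
            latches.add((i, j % size))
        # columns whose S operand is a fixed zero coefficient of B (step 2 in the column index)
        for j in range((n - 1) * r - i, (n + 1) * r - i - 1, 2):
            latches.add((i, j % size))
    return latches


def latch_positions_direct(n, r):
    """Positions where A_<j-i> or the B coefficient behind S_<j+i> has index >= nr."""
    size = (n + 1) * r                      # odd for the fields considered here
    half = (size + 1) // 2                  # inverse of 2 modulo size
    latches = set()
    for i in range(size):
        for j in range(size):
            a_index = (j - i) % size
            b_index = (((j + i) % size) * half) % size   # S_s = B_k with 2k = s (mod size)
            if a_index >= n * r or b_index >= n * r:
                latches.add((i, j))
    return latches


print(sorted(latch_positions_by_rule(4, 1)))    # AOP case, GF(2^4)
print(len(latch_positions_by_rule(2, 3)))       # ESP case, GF(2^6)
```

For $n=2$, $r=3$ the number of positions found equals $(2n+1)r^2=45$, in line with Theorem 4.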


Clearly, the proposed three-dimensional ESP-based systolic architecture over $GF(2^{nr})$ requires only $(n+1)r$ clock cycles. Moreover, the circuit needs no global connections, and the proposed ESP-based systolic multiplier can save $(2n+1)r^2$ U cells by ignoring the fixed zero terms.


4. Comparison and Discussion


This work has presented a three-dimensional ringed parallel systolic AOP-based multiplier for computing $C+AB$, $AB$, $C+AB^2$ or $AB^2$ over $GF(2^m)$. The latency of the AOP-based multipliers is only $m+1$ clock cycles in performing a multiplication over $GF(2^m)$. The number of input pins is only $3m$, which equals the total number of bits in A, B and C. Table 1 compares the new AOP-based parallel systolic multipliers with those of Liu [reference 3], Lee [reference 4] and Lee [reference 11]. The table reveals that the ringed AOP-based multipliers (RAOPM) include fewer gates and fewer input pins than the other multipliers. Clearly, the ringed systolic multipliers have much lower hardware complexity and no global connections, characteristics that are advantageous in VLSI implementation. Notably, the architecture for $C+AB^2$ is used to illustrate the structures and operations of the new multiplier presented herein; however, the extension of these structures to the general case of $C+AB$, $AB$ or $AB^2$ is straightforward.


Although the invention has been explained in relation to its preferred embodiment, it is to be understood that many other possible modifications and variations can be made without departing from the spirit and scope of the invention as hereinafter claimed.

TABLE 1. Comparison of the ringed AOP multiplier with related bit-parallel systolic multipliers over $GF(2^m)$.

| Items | Liu [3] | Lee [4] | Lee [11] | Proposed in FIG. 3 |
|---|---|---|---|---|
| Type | $C+AB^2$ | $C+AB^2$ | $C+AB$ | $C+AB^2$ |
| Number of 2-input AND gates | $(m+1)^2$ | $(m+1)^2$ | $(m+1)^2$ | $m^2$ |
| Number of 2-input XOR gates | $(m+1)^2$ | $(m+1)^2$ | $(m+1)^2$ | $m^2$ |
| Number of 1-bit latches | $3(m+1)^2$ | $3(m+1)^2$ | $3(m+1)^2$ | $3m^2+4m-1$ |
| Minimum possible clock period | $T_A+T_X+T_L$ | $T_A+T_X+T_L$ | $T_A+T_X+T_L$ | $T_A+T_X+T_L$ |
| Global connections | Free, but jump connections | Free | Yes | Free |
| Input pins | $5m+3$ | $5m+3$ | $3m+3$ | $3m$ |
| Latency | $2m+2$ | $m+1$ | $m+1$ | $m+1$ |


REFERENCES



  • [1] S. W. Wei, “A Systolic Power-Sum Circuit for GF(2m),” IEEE Trans. on Computers vol. 43, no. 2, pp. 226-229, February 1994.

  • [2] C. L. Wang and J. H. Guo, “New Systolic Array for C+AB2, Inversion, and Division in GF(2m),” IEEE Trans. on Computers vol. 49, no. 10, pp. 1120-1125, October 2000.

  • [3] C. H. Liu, N. F. Huang and C. Y. Lee, “Computation of AB2 Multiplier in GF(2m) Using an Efficient Low-Complexity Cellular Architecture,” IEICE Trans. Fundamentals, vol. E83-A, no. 12, pp. 2657-2663, December 2000.

  • [4] C. Y. Lee, E. H. Lu and L. F. Sun, “Low-Complexity Bit-parallel Systolic Architecture for Computing AB2+C in a Class of Finite Field GF(2m),” IEEE Trans. on Circuits Syst. II vol. 48, no. 5, pp. 519-523, May 2001.

  • [5] T. Itoh and S. Tsujii, “Structure of parallel multipliers for a class of fields GF(2m),” Information and Computation, Vol. 83, pp. 21-40, 1989.

  • [6] M. A. Hasan, M. Z. Wang, and V. K. Bhargava, “Modular construction of low complexity parallel multipliers for a class of finite fields GF(2m),” IEEE Trans. on Computers vol. 41, no. 8, pp. 962-971, August 1992.

  • [7] C. K. Koc and B. Sunar, “Low complexity bit-parallel canonical and normal basis multipliers for a class of finite fields,” IEEE Trans. on Computers vol. 47, no. 3, pp. 353-356, March 1998.

  • [8] H. Wu, and M. A. Hasan, “Low-complexity bit-parallel multipliers for a class of finite fields,” IEEE Trans. on Computers vol. 47, no. 8, pp. 883-887, August 1998.

  • [9] H. Wu, M. A. Hasan, and I. F. Blake, “New low-complexity bit-parallel finite field multipliers using weakly dual bases,” IEEE Trans. on Computers vol. 47, no. 11, pp. 1223-1234, November 1998.

  • [10] C. Y. Lee, E. H. Lu, and J. Y. Lee, “Bit-Parallel Systolic Multipliers for GF(2m) Fields Defined by All-One and Equally-Spaced Polynomials,” IEEE Trans. on Computers, No. 5, pp. 385-393, May 2001.

  • [11] C. Y. Lee, E. H. Lu, and J. Y. Lee, “Bit-Parallel Systolic Modular Multipliers for a class of GF(2m),” 15th IEEE Symposium on Computer Arithmetic (Arith-2001), Vail, Colo., USA, pp. 51-58, June 2001.

  • [12] IEEE-SA Standards Board, “IEEE Std. 1363-2000, IEEE Standard Specifications for Public-Key Cryptography,” January 2000.


Claims
  • 1. A low complexity bit-parallel systolic architecture, free of global connections, for computing $C+AB$, $AB$, $C+AB^2$ or $AB^2$ over a class of $GF(2^m)$, wherein A, B and C are the input elements of $GF(2^m)$.
  • 2. The systolic architecture as claimed in claim 1 comprising an inner product unit and a modular arithmetic unit, the inner product unit including $m^2$ U cells and $2m+1$ latch units, each U cell including an AND gate, an XOR gate and three latches, the coefficients $A_j$, $B_j$ and $C_{\langle 2j\rangle}$ of A, B and C respectively inputted via the input ends $A_j$, $S_j$ and $C_{\langle 2j\rangle}$ of $U_{0,j}$, wherein $\langle 2j\rangle$ represents $2j$ modulo $m+1$, the modular arithmetic unit including m XOR gates for computing the reduction modulo $p(x)$.
  • 3. The systolic architecture as claimed in claim 1 further comprising an inner product unit, after the inner product unit computing the U cell of the first stratum, the A and B respectively right and left endlessly moved into the cell of the second stratum and running the following formula,
  • 4. The systolic architecture as claimed in claim 1, wherein the circuit achieves GF(24) and the output D is a result of C+AB that can be easily popularized to a class of GF(2m), wherein the m is a plus integer that is kept in a modular polynomial.
  • 5. The systolic architecture as claimed in claim 1 being used to computing A multiply B when the coefficient of C is zero.
  • 6. The systolic architecture as claimed in claim 1 being used in GF(2m) formed by a modular polynomial for computing C+AB2.
  • 7. The systolic architecture as claimed in claim 6 comprising an inner product unit and a modular arithmetic unit, the inner product unit including $m^2$ U cells and $2m+1$ latch units, each U cell including an AND gate, an XOR gate and three latches, the coefficients $A_j$, $B_j$ and $C_{\langle 2j\rangle}$ of A, B and C respectively inputted via the input ends $A_j$, $S_j$ and $C_{\langle 2j\rangle}$ of $U_{0,j}$, wherein $\langle 2j\rangle$ represents $2j$ modulo $m+1$, the modular arithmetic unit including m XOR gates for computing the reduction modulo $p(x)$.
  • 8. The systolic architecture as claimed in claim further comprising an inner product unit, after the inner product unit computing the U cell of the first stratum, the A and B respectively right and left endlessly moved into the cell of the second stratum and running the following formula,
  • 9. The systolic architecture as claimed in claim 6, wherein the circuit achieves GF(24) and the output D is a result of C+AB2 that can be easily popularized to a class of GF(2m), wherein the m is a plus integer that is kept in a modular polynomial.
  • 10. The systolic architecture as claimed in claim 6 being used to computing A multiply B2 when the coefficient of C is zero.
  • 11. An architecture for computing $C+AB$ over a class of $GF(2^{nr})$ formed by an all one polynomial, wherein A, B and C are the input elements of $GF(2^{nr})$.
  • 12. The systolic architecture as claimed in claim 11 comprising an inner product unit and a modular arithmetic unit, the inner product unit including (nr)2 pieces of U cells and (2n+1)r2 pieces of latch units, each U cell including a AND gate, an XOR gate and three latches, the coefficients Aj, Bj and C<2j> of A, B and C respectively inputted via the input ends Aj, Sj and C<2j> of U0,j, wherein the <2j> represents the 2j modulo (n+1)r, the modular arithmetic unit including n*r XOR gates for computing the modular p(x).
  • 13. The systolic architecture as claimed in claim 11 further comprising an inner product unit, after the inner product unit computing the U cell of the first stratum, the A and B respectively right and left endlessly moved into the cell of the second stratum and running the following formula,
  • 14. The systolic architecture as claimed in claim 11, wherein the circuit achieves GF(26) and the output D is a result of C+AB that can be easily popularized to a class of GF(2nr), wherein the nr is a plus integer that is kept in a modular polynomial.
  • 15. The systolic architecture as claimed in claim 11 being used to computing A multiply B when the coefficient of C is zero.
  • 16. An architecture for computing $C+AB$ over a class of $GF(2^{nr})$ based on an equally spaced polynomial (ESP), wherein A, B and C are the input elements of $GF(2^{nr})$.
  • 17. The systolic architecture as claimed in claim 16 comprising an inner product unit and a modular arithmetic unit, the inner product unit including (nr)2 pieces of U cells and (2n+1)r2 pieces of latch units, each U cell including an AND gate, an XOR gate and three latches, the coefficients Aj, Bj and C<2j> of A, B and C respectively inputted via the input ends Aj, Sj and C<2j> of U0,j, wherein the <2j> represents the 2j modulo (n+1)r, the modular arithmetic unit including n*r XOR gates for computing the modular p(x).
  • 18. The systolic architecture as claimed in claim 16 further comprising an inner product unit, after the inner product unit computing the U cell of the first stratum, the A and B respectively right and left endlessly moved into the cell of the second stratum and running the following formula,
  • 19. The systolic architecture as claimed in claim 16, wherein the output D is a result of C+AB that can be easily popularized to a class of GF(2nr) based on ESP, wherein the n and r are integers.
  • 20. The systolic architecture as claimed in claim 16 being used to computing A multiply B when the coefficients of C are zeroes.