Modular polynomial multiplication involves determining the product of two polynomials of order n or less and then reducing the product modulo $(x^n+1)$. Such modular polynomial multiplication is used in cryptography with values of n equal to or greater than 256. In the discussion below, a modular polynomial product is the polynomial that results from reducing a product of two polynomials modulo $(x^n+1)$. A device that determines a modular polynomial product is referred to as a modular polynomial multiplier.
The discussion above is merely provided for general background information and is not intended to be used as an aid in determining the scope of the claimed subject matter. The claimed subject matter is not limited to implementations that solve any or all disadvantages noted in the background.
A modular polynomial multiplier includes a plurality of processing elements. Each processing element includes a multiplication unit, an addition unit and a delay unit. The multiplication unit has a first input, a second input and an output, wherein with each of a series of clock cycles, the output of the multiplication unit carries the product of a value provided on the first input and a value provided on the second input. The addition unit has a first input, a second input and an output wherein the first input is connected to the output of the multiplication unit. The delay unit has an input connected to the output of the addition unit and an output, wherein the input carries an input value and the output provides the input value delayed by one clock cycle. The first input of the multiplication unit of each processing element carries a respective coefficient of a first polynomial and the second input of the multiplication unit of each processing element is connected to one of an input line carrying a sequence of coefficients of a second polynomial having n coefficients and a delay line carrying the sequence of coefficients of the second polynomial delayed by n clock cycles and negated.
In accordance with a further embodiment, a modular polynomial multiplier includes a first modular polynomial multiplier configured to produce a first modular product of a first portion of a first polynomial and a first portion of a second polynomial, the first modular product produced as a first series of coefficients with a separate coefficient at each of a set of clock cycles. A second modular polynomial multiplier is configured to produce a second modular product of a second portion of the first polynomial and a second portion of the second polynomial, the second modular product produced as a second series of coefficients with a separate coefficient at each of the set of clock cycles. A first delay circuit is configured to delay the first series of coefficients by one clock cycle to form a delayed series of coefficients and a second delay circuit is configured to delay a first coefficient in the second series of coefficients by a number of clock cycles equal to the number of coefficients in the second series of coefficients to form a modified series of coefficients. An addition unit is configured to add coefficients in the delayed series of coefficients to coefficients in the modified series of coefficients.
In accordance with a still further embodiment, a modular polynomial multiplier includes a first circuit receiving a first sub-polynomial of a first polynomial and a first sub-polynomial of a second polynomial and producing a modular product of the first sub-polynomial of the first polynomial and the first sub-polynomial of the second polynomial. A second circuit receives a second sub-polynomial of the first polynomial and a second sub-polynomial of the second polynomial and produces a modular product of the second sub-polynomial of the first polynomial and the second sub-polynomial of the second polynomial. The first circuit and the second circuit are identical to each other.
This Summary is provided to introduce a selection of concepts in a simplified form that are further described below in the Detailed Description. This Summary is not intended to identify key features or essential features of the claimed subject matter, nor is it intended to be used as an aid in determining the scope of the claimed subject matter.
The embodiments described below improve the response time and latency of systems that perform modular polynomial multiplication. The response time is defined as the number of clock cycles between when a first coefficient of a polynomial is input to the system and when a first coefficient of the modular polynomial product is output. Latency is defined as the number of clock cycles between when the first coefficient of the polynomial is input to the system and when the last coefficient of the modular polynomial product is output.
In accordance with one embodiment, a modular polynomial multiplier with a sequential weight-stationary systolic structure is used for modular polynomial multiplication. This structure achieves low latency and full hardware utilization. In a further embodiment, a low-latency fast-parallel modular polynomial multiplication architecture is used for modular polynomial multiplication that integrates a modular reduction at a merging level. In a still further embodiment, an iterated fast-parallel architecture is used for modular polynomial multiplication.
For the product P(x) of two polynomials
$$A(x)=a[0]+a[1]x+a[2]x^2+\cdots+a[n-1]x^{n-1} \qquad (1)$$
$$B(x)=b[0]+b[1]x+b[2]x^2+\cdots+b[n-1]x^{n-1} \qquad (2)$$
over $R_q$, all the coefficients of P(x) need to be non-negative integers less than q, while the degree of P(x) should be less than n, where $R_q=Z_q[x]/(x^n+1)$ is the polynomial ring and $Z_q$ is the ring of integers modulo a power-of-two integer q. The schoolbook polynomial multiplication between A(x) and B(x) modulo $(x^n+1,\,q)$ can be described as
$$P(x)=\sum_{i=0}^{n-1}\sum_{j=0}^{n-1}a[i]\,b[j]\,x^{i+j} \bmod (x^n+1,\,q) \qquad (3)$$
To improve the efficiency and reduce the complexity of schoolbook polynomial multiplication, methods based on the divide-and-conquer strategy to increase the parallelism are of great interest. One example is the Karatsuba algorithm. The 2-level Karatsuba polynomial multiplication first decomposes the input polynomials into higher-degree and lower-degree parts as $A(x)=A_0(x)+A_1(x)\cdot x^{n/2}$ and $B(x)=B_0(x)+B_1(x)\cdot x^{n/2}$ and computes
$$C_0(x)=A_0(x)\cdot B_0(x)$$
$$C_1(x)=(A_0(x)+A_1(x))\cdot(B_0(x)+B_1(x))$$
$$C_2(x)=A_1(x)\cdot B_1(x) \qquad (4)$$
Then the above products are summed up and polynomial modular reduction is carried out to derive the product P(x) over the ring as
$$P(x)=C_0(x)+C_3(x)\cdot x^{n/2}+C_2(x)\cdot x^{n} \bmod (x^n+1) \qquad (5)$$
where
$$C_3(x)=C_1(x)-C_0(x)-C_2(x) \qquad (6)$$
Note that the degrees of $C_3(x)\cdot x^{n/2}$ and $C_2(x)\cdot x^{n}$ are bounded by $3n/2$ and $2n$, respectively. Hence polynomial subtractions are needed to perform the modular reduction by $x^n+1$. Based on this divide-and-conquer strategy of the Karatsuba algorithm, the number of coefficient multiplications is reduced from $n^2$ to $3(n/2)^2$.
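As a non-limiting illustration of Equations (4) through (6), the following Python sketch performs one Karatsuba split and then reduces the result modulo (x^n+1, q). The function names `poly_mul` and `karatsuba_negacyclic`, the list-of-coefficients representation (list index equals the power of x), and the requirement that n be even are assumptions made for this example only.

```python
def poly_mul(a, b, q):
    """Plain schoolbook product of two coefficient lists, coefficients mod q."""
    out = [0] * (len(a) + len(b) - 1)
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            out[i + j] = (out[i + j] + ai * bj) % q
    return out


def karatsuba_negacyclic(a, b, q):
    """One Karatsuba split, then reduction modulo (x^n + 1, q); n is even."""
    n = len(a)
    h = n // 2
    a0, a1 = a[:h], a[h:]      # A(x) = A0(x) + A1(x) * x^(n/2)
    b0, b1 = b[:h], b[h:]
    # Equation (4): three half-size products instead of four.
    c0 = poly_mul(a0, b0, q)
    c1 = poly_mul([x + y for x, y in zip(a0, a1)],
                  [x + y for x, y in zip(b0, b1)], q)
    c2 = poly_mul(a1, b1, q)
    # Equation (6): the middle term.
    c3 = [(m - lo - hi) % q for m, lo, hi in zip(c1, c0, c2)]
    # Equation (5) before reduction: place the parts at x^0, x^(n/2), x^n.
    wide = [0] * (2 * n)
    for k in range(n - 1):
        wide[k] += c0[k]
        wide[k + h] += c3[k]
        wide[k + n] += c2[k]
    # x^(n+k) is congruent to -x^k modulo x^n + 1, hence the subtraction.
    return [(wide[k] - wide[k + n]) % q for k in range(n)]


print(karatsuba_negacyclic([1, 2, 3, 4], [5, 6, 7, 8], 16))  # [8, 12, 2, 12]
```

For n=4 the three half-size products use 3·(4/2)^2 = 12 coefficient multiplications, compared with 16 for the schoolbook method.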
Consider the design for a degree-n modular polynomial multiplier described by Equation (3). In this section, we use n=4 as an example to illustrate our proposed novel modular polynomial multiplier. The modular polynomial multiplication is described by:
$$P(x)=A(x)\cdot B(x) \bmod (x^4+1,\,q),\quad A(x)=\sum_{i=0}^{3}a[i]x^{i},\quad B(x)=\sum_{j=0}^{3}b[j]x^{j} \qquad (7)$$
The polynomial multiplication of A(x) and B(x) leads to
$$P'(x)=p'[0]+p'[1]x+p'[2]x^2+p'[3]x^3+p'[4]x^4+p'[5]x^5+p'[6]x^6 \qquad (8)$$
Since the polynomial product has a degree higher than three, the terms $x^4$, $x^5$, and $x^6$ are replaced by $-1$, $-x$, and $-x^2$, respectively, to perform the modular reduction. Thus, the coefficients of the modular polynomial multiplication are:
p[3]=a[3]b[0]+a[2]b[1]+a[1]b[2]+a[0]b[3],
p[2]=a[2]b[0]+a[1]b[1]+a[0]b[2]−a[3]b[3],
p[1]=a[1]b[0]+a[0]b[1]−a[3]b[2]−a[2]b[3],
p[0]=a[0]b[0]−a[3]b[1]−a[2]b[2]−a[1]b[3]. (9)
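The substitution of x^(n+k) by −x^k can be checked numerically. The following Python sketch, offered only as an illustration, implements the schoolbook product modulo (x^n+1, q) and compares it against Equation (9) written out term by term for n=4. The function name `negacyclic_schoolbook` and the sample values of a, b and q are assumptions chosen for the example.

```python
def negacyclic_schoolbook(a, b, q):
    """Return p[0..n-1] of A(x)*B(x) mod (x^n + 1, q)."""
    n = len(a)
    p = [0] * n
    for i in range(n):
        for j in range(n):
            k = i + j
            if k < n:
                p[k] += a[i] * b[j]
            else:
                # x^k is congruent to -x^(k-n) modulo x^n + 1.
                p[k - n] -= a[i] * b[j]
    return [c % q for c in p]


# Arbitrary sample coefficients a[0..3], b[0..3] and a power-of-two modulus.
a = [1, 2, 3, 4]
b = [5, 6, 7, 8]
q = 16

p = negacyclic_schoolbook(a, b, q)

# Equation (9), one expression per coefficient.
eq9 = [
    (a[0]*b[0] - a[3]*b[1] - a[2]*b[2] - a[1]*b[3]) % q,   # p[0]
    (a[1]*b[0] + a[0]*b[1] - a[3]*b[2] - a[2]*b[3]) % q,   # p[1]
    (a[2]*b[0] + a[1]*b[1] + a[0]*b[2] - a[3]*b[3]) % q,   # p[2]
    (a[3]*b[0] + a[2]*b[1] + a[1]*b[2] + a[0]*b[3]) % q,   # p[3]
]
assert p == eq9
print(p)  # [8, 12, 2, 12]
```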
A dependence graph (DG) 100 of the modular polynomial multiplication for the n=4 example is shown in
Modular polynomial multiplier 200 includes input line 208, shift register 212, negation unit 213, delay line 214, multiplexers, such as multiplexers 216, 218, and 220, processing elements, such as processing elements 222, 224, 226 and 228, and output line 210.
With each clock cycle, the current coefficient on input line 208 is loaded into shift register 212 and any coefficients previously loaded into shift register 212 are shifted one place. After n clock cycles, the oldest coefficient in shift register 212 is negated by negation unit 213 and is output onto delay line 214. With each subsequent clock cycle, another respective coefficient in shift register 212 is negated and output on delay line 214. Thus, for the first n clock cycles, the coefficients of A(x) appear on input line 208, one coefficient per clock cycle, in order from the most-significant coefficient (a[n−1]) to the least-significant coefficient (a[0]). For the next n clock cycles, the negatives of the coefficients of A(x) appear on delay line 214, one coefficient per clock cycle, in order from the most-significant coefficient (−a[n−1]) to the least-significant coefficient (−a[0]).
There are n−1 multiplexers. Each multiplexer has two inputs, a control line and an output. One input of each multiplexer is connected to input line 208 and the other input is connected to delay line 214. Each control line receives a respective control signal that causes the multiplexer to either connect input line 208 to the output of the multiplexer or connect delay line 214 to the output of the multiplexer. The output of each multiplexer is connected to a respective processing element. For example, multiplexer 216 has input 230 connected to input line 208, input 232 connected to delay line 214, control line 234 and output 236 connected to processing element 224.
There are n processing elements. Processing element 222, referred to as the first tap, includes a multiplication unit 238 and a delay unit 240. Multiplication unit 238 has two inputs 242 and 244 and an output 246. Input 242 is connected to input line 208 and input 244 receives the least-significant coefficient, b[0], of input polynomial B(x) 204. With each clock cycle, multiplication unit 238 multiplies the current value on input line 208 with coefficient b[0] and provides the product on output 246. Output 246 of multiplication unit 238 is connected to an input of delay unit 240. Delay unit 240 delays each value received from multiplication unit 238 by one clock cycle and outputs the delayed value on a processing element output 248.
Processing element 228, referred to as the last tap, includes a multiplication unit 250 and an addition unit 252. Multiplication unit 250 has two inputs 254 and 256 and an output 258. Input 254 is connected to the output of multiplexer 220 and input 256 receives the most-significant coefficient, b[n−1], of input polynomial B(x) 204. With each clock cycle, multiplication unit 250 multiplies the current value provided by multiplexer 220 with coefficient b[n−1] and provides the product on output 258. Output 258 of multiplication unit 250 is connected to an input 260 of addition unit 252, which also includes an input 262 and an output 264. Input 262 carries an accumulated sum produced by other processing elements as discussed further below. Addition unit 252 adds the value on input 260 to the value on input 262 and provides the sum on output 264. In accordance with one embodiment, addition unit 252 forms the sum in less than a clock cycle of modular polynomial multiplier 200. Output 264 of addition unit 252 forms output line 210 of modular polynomial multiplier 200.
Between first tap processing element 222 and last tap processing element 228, there are n−2 structurally identical processing elements, such as processing elements 224 and 226, connected in series. Since all of the n−2 processing elements are identical, the structure is described below with reference to just processing element 224. However, the description of processing element 224 is applicable to all of the structurally-identical processing elements.
Processing element 224 has a multiplication unit 270, an addition unit 272 and a delay unit 274. Multiplication unit 270 has two inputs 276 and 278 and an output 280. Input 276 is connected to the output of a respective multiplexer (in this case, output 236 of multiplexer 216) and input 278 receives a respective coefficient of input polynomial B(x) (in this case, coefficient b[1]). With each clock cycle, multiplication unit 270 multiplies the two coefficients on inputs 276 and 278 and provides the product on output 280. Addition unit 272 includes two inputs 282 and 284 and an output 286. Input 282 is connected to output 280 of multiplication unit 270 and input 284 is connected to the output of a delay unit of a respective preceding processing element (in this case output 248 of delay unit 240 of preceding processing element 222). Addition unit 272 adds the values on inputs 282 and 284 and provides the sum on output 286. Addition unit 272 operates in less than a clock cycle so that the sum is provided within the same clock cycle that the product is provided on output 280 by multiplication unit 270. Output 286 is connected to delay unit 274, which delays the value on output 286 by one clock cycle and provides the delayed value on a processing element output 288.
For a value of n=4, modular polynomial multiplier 200 implements Equation 9 above. At a first clock cycle, a[3]b[0] is determined by multiplication unit 238. At the next clock cycle, a[2]b[1] is determined by multiplication unit 270, a[2]b[0] is determined by multiplication unit 238, and a[3]b[0] is output by delay unit 240. Within this same clock cycle, addition unit 272 forms the sum a[3]b[0]+a[2]b[1]. During the next clock cycle, a[1]b[2] is determined by the multiplication unit of processing element 226, a[1]b[1] is determined by multiplication unit 270 and a[1]b[0] is determined by multiplication unit 238. Within this same clock cycle, the addition unit of processing element 226 forms the sum a[3]b[0]+a[2]b[1]+a[1]b[2], and addition unit 272 forms the sum a[2]b[0]+a[1]b[1].
During the next clock cycle, a[0]b[3] is determined by multiplication unit 250, a[0]b[2] is determined by the multiplication unit of processing element 226, a[0]b[1] is determined by multiplication unit 270 and a[0]b[0] is determined by multiplication unit 238. Within this same clock cycle, addition unit 252 forms the sum a[3]b[0]+a[2]b[1]+a[1]b[2]+a[0]b[3], the addition unit of processing element 226 forms the sum a[2]b[0]+a[1]b[1]+a[0]b[2], and addition unit 272 forms the sum a[1]b[0]+a[0]b[1]. As shown in Equation 9, the sum produced by addition unit 252 represents p[3].
At the next clock cycle, the control signal to the multiplexers causes all of the multiplexers to switch from connecting input line 208 to the processing elements to connecting delay line 214 to the processing elements. As a result, during this clock cycle −a[3] is input to each processing element after processing element 222 and −a[3]b[3] is determined by multiplication unit 250, −a[3]b[2] is determined by the multiplication unit of processing element 226, and −a[3]b[1] is determined by multiplication unit 270. Within this same clock cycle, addition unit 252 forms the sum a[2]b[0]+a[1]b[1]+a[0]b[2]−a[3]b[3], the addition unit of processing element 226 forms the sum a[1]b[0]+a[0]b[1]−a[3]b[2], and addition unit 272 forms the sum a[0]b[0]−a[3]b[1]. As shown in Equation 9, the sum produced by addition unit 252 represents p[2].
During the next clock cycle −a[2] is input to each processing element after processing element 222 and −a[2]b[3] is determined by multiplication unit 250, and −a[2]b[2] is determined by the multiplication unit of processing element 226. Within this same clock cycle, addition unit 252 forms the sum a[1]b[0]+a[0]b[1]−a[3]b[2]−a[2]b[3], and the addition unit of processing element 226 forms the sum a[0]b[0]−a[3]b[1]−a[2]b[2]. As shown in Equation 9, the sum produced by addition unit 252 represents p[1].
During the next clock cycle −a[1] is input to each processing element after processing element 222 and −a[1]b[3] is determined by multiplication unit 250. Within this same clock cycle, addition unit 252 forms the sum a[0]b[0]−a[3]b[1]−a[2]b[2]−a[1]b[3]. As shown in Equation 9, the sum produced by addition unit 252 represents p[0].
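The cycle-by-cycle behavior described above can be modeled in software. The following Python sketch is a simplified behavioral model, not a description of any particular circuit: it assumes that the input line carries zeros after the n coefficients of A(x), that every adder result is reduced modulo q, and that all multiplexers switch to the delay line together after n clock cycles. The function name `systolic_negacyclic` and the variable names are hypothetical.

```python
def systolic_negacyclic(a, b, q):
    """Behavioral model of an n-tap weight-stationary array.

    a, b: coefficient lists (index = power of x). Returns p[0..n-1].
    """
    n = len(a)
    regs = [0] * (n - 1)   # delay units of the first n-1 processing elements
    history = []           # models the shift register behind the delay line
    outputs = []
    for t in range(2 * n - 1):
        # Input line: a[n-1] first, a[0] last, then zeros (assumption).
        x_in = a[n - 1 - t] if t < n else 0
        history.append(x_in)
        # Delay line: the value from n cycles earlier, negated.
        x_del = -history[t - n] if t >= n else 0
        next_regs = list(regs)
        for k in range(n):
            # The first tap is wired to the input line. Every other tap
            # sees the input line for n cycles and then the delay line.
            x = x_in if (k == 0 or t < n) else x_del
            total = x * b[k] + (regs[k - 1] if k > 0 else 0)
            if k < n - 1:
                next_regs[k] = total % q       # captured by the delay unit
            elif t >= n - 1:
                outputs.append(total % q)      # last tap drives the output
        regs = next_regs
    # Coefficients emerge as p[n-1] ... p[0]; return them as p[0] ... p[n-1].
    return outputs[::-1]


print(systolic_negacyclic([1, 2, 3, 4], [5, 6, 7, 8], 16))  # [8, 12, 2, 12]
```

In this model the first product coefficient, p[n−1], appears at the n-th clock cycle and the last, p[0], at clock cycle 2n−1, matching the walk-through for n=4.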
In the description above, the coefficients provided on output line 210 are surrounded by random values. In other embodiments, the coefficients on output line 210 can be surrounded by zeros by adding n zeros before the coefficients of A(x), n zeros after the coefficients of A(x) and controlling the multiplexers so that they output a value of zero for the values that surround the product coefficients. Using this technique, when a[n−1] appears on input line 208, delay line 214 carries a zero. Thus, during this clock cycle, all of the multiplexers connect delay line 214 to the processing elements so each of the processing elements other than processing element 222 receives a value of zero. With the next clock cycle, multiplexer 216 connects processing element 224 to input line 208 so that processing elements 222 and 224 receive a[n−2] while the remaining processing elements remain connected to delay line 214 and thus receive a value of zero. This progression continues until all of the processing elements are connected to input line 208. At the next clock cycle, each multiplexer other than multiplexer 216 is switched so that the output of the multiplexer is connected to delay line 214. As a result, each of the switched multiplexers provides −a[n−1] at its output while multiplication units 238 and 270 receive a value of zero from input line 208. With each clock cycle thereafter, an additional multiplexer is switched to connect its output to input line 208 until all of the multiplexer outputs are connected to input line 208.
Taken together,
$$T_{lat}=n\cdot(L+1)-1 \qquad (10)$$
The modular reduction of each coefficient can be performed by simply keeping the $\epsilon$ least-significant bits for a modulus of $2^{\epsilon}$. For the lattice-based cryptography schemes, the degrees of the polynomials are relatively large, i.e., n can be in the hundreds or thousands, which could cause a high fan-out issue on the output of the shift register and the input node. To overcome this, buffers (registers) are inserted after the multiplexers, shown as dashed line 290 in
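As a small illustration of the power-of-two coefficient reduction mentioned above, the following Python sketch keeps only the low-order bits of an integer. The function name `reduce_pow2` and the parameter name `eps` are assumptions for this example. Python integers behave as two's-complement values under the `&` operator, so the mask also yields the non-negative residue for negative inputs.

```python
def reduce_pow2(value, eps):
    """Reduce an integer modulo 2**eps by keeping its eps low-order bits."""
    return value & ((1 << eps) - 1)


# The mask agrees with the % operator for positive and negative inputs.
assert reduce_pow2(60, 4) == 60 % 16 == 12
assert reduce_pow2(-56, 4) == (-56) % 16 == 8
```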
In accordance with some embodiments, modular polynomial multiplier 200 is used to construct a highly parallel modular polynomial multiplier that is based on a fast parallel filter algorithm. These embodiments have a significantly lower addition cost in the post-processing stage than the Karatsuba algorithm. Furthermore, these embodiments require less resource overhead than prior schoolbook polynomial multipliers.
One example of a fast parallel modular polynomial multiplier is the fast 2-parallel modular polynomial multiplier 300 shown in
In a pre-processing step (step 1), input polynomials A(x) and B(x) are decomposed based on the even and odd indices (also called polyphase decomposition). With $y=x^2$, the polynomial A(x) is expressed as:
$$A(x)=A_0(x^2)+A_1(x^2)\cdot x=A_0(y)+A_1(y)\cdot x \qquad (11)$$
where the even-indexed polynomial $A_0(y)$ and the odd-indexed polynomial $A_1(y)$ are expressed as:
$$A_0(y)=a[0]+a[2]y+a[4]y^2+\cdots+a[n-2]y^{n/2-1}$$
$$A_1(y)=a[1]+a[3]y+a[5]y^2+\cdots+a[n-1]y^{n/2-1}$$
Similar decomposition is applied to B(x) to obtain its even and odd polynomials B0(y) and B1(y). The coefficients of the even and odd polynomials of each respective power are then summed by an adder 301 to form (A0(y)+A1(y)) and by an adder (not shown) to form (B0(y)+B1(y)).
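The product P(x) can be computed as:
$$P(x)=\left(A_0(y)+A_1(y)x\right)\left(B_0(y)+B_1(y)x\right)=A_0(y)B_0(y)+\left(A_0(y)B_1(y)+A_1(y)B_0(y)\right)x+A_1(y)B_1(y)x^2 \bmod (x^n+1,\,q)$$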
The polyphase decomposition describes one polynomial multiplication of length-n in terms of four polynomial multiplications of length-n/2. While this step in itself does not reduce the computation complexity, it is an essential first step.
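The polyphase decomposition can be illustrated with list slicing. The following Python sketch is a minimal example under the assumption that a polynomial is stored as a list whose index equals the power of x and whose length is a multiple of the number of phases. The function names `polyphase_split` and `polyphase_merge` are hypothetical.

```python
def polyphase_split(coeffs, phases=2):
    """Split a coefficient list into sub-polynomials by index modulo `phases`."""
    return [coeffs[r::phases] for r in range(phases)]


def polyphase_merge(parts):
    """Inverse of polyphase_split: interleave the sub-polynomials again."""
    phases = len(parts)
    out = [0] * (phases * len(parts[0]))
    for r, part in enumerate(parts):
        out[r::phases] = part
    return out


a = [10, 11, 12, 13, 14, 15, 16, 17]      # a[0] ... a[7], sample values
a0, a1 = polyphase_split(a)
print(a0)  # [10, 12, 14, 16]  coefficients of A0(y): even indices
print(a1)  # [11, 13, 15, 17]  coefficients of A1(y): odd indices
assert polyphase_merge([a0, a1]) == a
```

The same helper with `phases=3` yields a three-phase decomposition.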
In Step 2 of algorithm 1, modular polynomial multiplier 300 uses three modular polynomial multipliers 302, 304 and 306 to perform three modular multiplications in parallel. In accordance with one embodiment, each of modular polynomial multipliers 302, 304 and 306 is structurally identical to systolic modular polynomial multiplier 200 of
Modular polynomial multiplier 302 determines the modular product of A0(y)B0(y), referred to as U(y); modular polynomial multiplier 304 determines the modular product of (A0(y)+A1(y))(B0(y)+B1(y)), referred to as W(y); and modular polynomial multiplier 306 determines the modular product of A1(y)B1(y), referred to as V(y).
P1(y) of the product P(x) is computed as:
$$P_1(y)=A_0(y)B_1(y)+A_1(y)B_0(y)=W(y)-U(y)-V(y) \bmod (y^{n/2}+1,\,q)$$
Thus, P1(y) can be determined by subtracting the output of modular polynomial multipliers 302 and 306 (U(y), V(y)) from the output of modular polynomial multiplier 304 (W(y)). These subtractions are performed by negation units 307 and 309 and addition units 308 and 310 in
P0(y) of the product P(x) is computed as:
$$P_0(y)=A_0(y)B_0(y)+A_1(y)B_1(y)\cdot y=U(y)+V(y)\cdot y \bmod (y^{n/2}+1,\,q)$$
Since V(y) needs to be multiplied by y before adding the coefficients of U(y), the highest-degree term exceeds the range of the ring $(y^{n/2}+1)$, so a modular reduction is applied to obtain:
$$P_0(y)=(u[0]-v[n/2-1])+p_0[1]y+p_0[2]y^2+\cdots+p_0[n/2-1]y^{n/2-1}$$
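The relationships among U(y), V(y), W(y), P0(y) and P1(y) can be checked with a short software model. The following Python sketch is illustrative only: the schoolbook routine `negacyclic` stands in for the three half-size modular polynomial multipliers, and the helper names `times_y` and `fast_2_parallel` are assumptions. The length n is assumed to be even.

```python
def negacyclic(a, b, q):
    """Schoolbook product modulo (x^m + 1, q) for two length-m lists."""
    m = len(a)
    p = [0] * m
    for i in range(m):
        for j in range(m):
            if i + j < m:
                p[i + j] += a[i] * b[j]
            else:
                p[i + j - m] -= a[i] * b[j]
    return [c % q for c in p]


def times_y(v):
    """Multiply by y modulo y^m + 1: shift up, wrap the top term negated."""
    return [-v[-1]] + v[:-1]


def fast_2_parallel(a, b, q):
    a0, a1 = a[0::2], a[1::2]              # even and odd phases of A(x)
    b0, b1 = b[0::2], b[1::2]
    u = negacyclic(a0, b0, q)                                   # U(y)
    v = negacyclic(a1, b1, q)                                   # V(y)
    w = negacyclic([x + y for x, y in zip(a0, a1)],
                   [x + y for x, y in zip(b0, b1)], q)          # W(y)
    p1 = [(wi - ui - vi) % q for wi, ui, vi in zip(w, u, v)]    # W - U - V
    p0 = [(ui + yi) % q for ui, yi in zip(u, times_y(v))]       # U + V*y
    out = [0] * len(a)
    out[0::2] = p0                         # even-indexed product coefficients
    out[1::2] = p1                         # odd-indexed product coefficients
    return out


a, b, q = [1, 2, 3, 4], [5, 6, 7, 8], 16
assert fast_2_parallel(a, b, q) == negacyclic(a, b, q)
print(fast_2_parallel(a, b, q))  # [8, 12, 2, 12]
```

Note that `times_y` places −v[n/2−1] in position 0, which is the source of the (u[0]−v[n/2−1]) term in P0(y).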
In accordance with one embodiment, the summation of Equation 21 is achieved using multiplexers and delays and is explained using the timing diagrams for n=8 shown in
In order to implement the multiplication of V(y) by y, the embodiments delay U(y) by one clock cycle. This aligns the coefficient for $y^{k}$ in U(y) with the coefficient for $y^{k-1}$ in V(y) as shown in
The modular reduction is performed by delaying the most-significant coefficient, v[n/2−1], by n/2 clock cycles and then subtracting the delayed value from u[0] as shown in
After n/2 clock cycles, control signal 336 causes multiplexer 330 to connect the output of negation unit 328 to addition unit 332. As a result, v[n/2−1], which is held in delay unit 326, is negated by negation unit 328 and is applied to the input of addition unit 332. Addition unit 332 then adds the negative of v[n/2−1] to u[0] to provide the last coefficient of P0(y).
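The delay-and-select behavior described above can be sketched as a streaming model. The following Python example is a simplified illustration in which the coefficients of U(y) and V(y) arrive most-significant first, one per clock cycle. The variable `u_delayed` plays the role of the one-cycle delay applied to U(y) and `held` plays the role of delay unit 326. The function name `merge_p0_stream` is hypothetical.

```python
def merge_p0_stream(u, v, q):
    """Form P0(y) = U(y) + V(y)*y mod (y^m + 1, q) from two streams.

    u, v: coefficient lists (index = power of y). Returns p0[0..m-1].
    """
    m = len(u)
    u_stream = u[::-1]     # most-significant coefficient first
    v_stream = v[::-1]
    u_delayed = None       # U(y) delayed by one clock cycle
    held = None            # stored copy of v[m-1]
    out_stream = []
    for t in range(m + 1):
        if t == 0:
            # v[m-1] arrives first and is captured for later use.
            held = v_stream[0]
        elif t < m:
            # u[k] meets v[k-1]: the multiplication by y is a one-step shift.
            out_stream.append((u_delayed + v_stream[t]) % q)
        else:
            # After m cycles the negated stored value is added to u[0].
            out_stream.append((u_delayed - held) % q)
        u_delayed = u_stream[t] if t < m else None
    return out_stream[::-1]    # emitted as p0[m-1] ... p0[0]


print(merge_p0_stream([0, 6], [12, 8], 16))  # [8, 2]
```

The single adder is busy on every cycle after the first, which is consistent with the observation that no additional adder or subtractor is required.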
Note that no additional adder/subtractor is needed and full hardware utilization is retained for all the components in the circuit. Moreover, this optimization technique still allows continuous processing of modular polynomial multiplications without requiring any null operations.
In accordance with some embodiments, registers are added along dashed line 350 to reduce the critical path of modular polynomial multiplier 300.
The computation V(y)·y is inherently a non-causal operation. This is transformed to a causal operation by introducing delay unit 320. This does not increase the latency beyond one clock cycle and preserves the feed-forward property of the architecture and continuous data-flow property.
Different from the traditional methods that execute the polynomial modular reduction during or after post-processing (i.e., combining the intermediate polynomials back to a single polynomial), the embodiments integrate polynomial modular reduction into the three intermediate polynomial multiplications. This is achieved by using the sequential systolic modular polynomial multiplication described in
or (3n−3) modular additions/subtractions for post-processing. In contrast, by employing the sequential weight-stationary systolic polynomial modular multiplier as shown in
modular additions/subtractions.
In the fast 2-parallel modular polynomial multiplier discussed above, the input polynomials and the output polynomial are decomposed into two phases. The invention is not limited to two phases and can be implemented using any number of phases. For example,
During the polyphase decomposition (step 1), polynomials A(x) and B(x) are decomposed as
$$A(x)=A_0(y)+A_1(y)\cdot x+A_2(y)\cdot x^2,$$
$$B(x)=B_0(y)+B_1(y)\cdot x+B_2(y)\cdot x^2. \qquad (22)$$
The modular multiplication result P(x) is also decomposed as:
$$P(x)=P_0(y)+P_1(y)\cdot x+P_2(y)\cdot x^2, \qquad (23)$$
where $y=x^3$.
Fast 3-parallel modular polynomial multiplier 500 includes six modular polynomial multipliers 502, 504, 506, 508, 510 and 512 that operate in parallel with each other and that each perform a modulo $(y^{n/3}+1)$ multiplication of two respective polynomials of length n/3. In accordance with one embodiment, each of modular polynomial multipliers 502, 504, 506, 508, 510 and 512 is structurally identical to modular polynomial multiplier 200.
In step 2 of algorithm 2, multiplier 502 determines the modular polynomial product C0(y) of A0(y)B0(y); multiplier 504 determines the modular polynomial product C1(y) of A1(y)B1(y); multiplier 506 determines the modular polynomial product C2(y) of A2(y)B2(y); multiplier 508 determines the modular polynomial product C3(y) of (A0(y)+A1(y))(B0(y)+B1(y)) where (A0(y)+A1(y)) is produced by addition unit 514 and (B0(y)+B1(y)) is determined by another addition unit (not shown); multiplier 510 determines the modular polynomial product C4(y) of (A1(y)+A2(y))(B1(y)+B2(y)) where (A1(y)+A2(y)) is produced by addition unit 516 and (B1(y)+B2(y)) is determined by another addition unit (not shown); and multiplier 512 determines the modular polynomial product C5(y) of (A0(y)+A1(y)+A2(y))(B0(y)+B1(y)+B2(y)) where (A0(y)+A1(y)+A2(y)) is produced by addition unit 518 and (B0(y)+B1(y)+B2(y)) is determined by another addition unit (not shown).
In step 3, negation unit 519 and addition unit 520 determine D0(y)=C3(y)+(−C1(y)) and negation unit 521 and addition unit 522 determine D1(y)=C4(y)+(−C1(y)). In addition, an addition unit 534, a delay unit 532, and a delay circuit that includes demultiplexer 524 (also referred to as a switch), delay unit 526, negation unit 528, and multiplexer 530 (also referred to as a switch) determine D2(y)=C0(y)−C2(y)·y mod $(y^{n/3}+1,\,q)$. The modular reduction is performed by delaying the most-significant coefficient, c2[n/3−1], by n/3 clock cycles and then subtracting the delayed value from c0[0]. Note that n/3 is equal to the number of coefficients in C2(y). When c2[n/3−1] appears on the output of modular polynomial multiplier 506, a control signal causes demultiplexer 524 to connect the output of modular polynomial multiplier 506 to the input of delay unit 526, which stores c2[n/3−1]. At the next clock cycle, the control signal to demultiplexer 524 and a control signal to multiplexer 530 cause demultiplexer 524 and multiplexer 530 to connect the output of modular polynomial multiplier 506 to an input of addition unit 534. As a result, for the next n/3−1 clock cycles, the coefficients of C2(y) are provided to one input of addition unit 534. The other input of addition unit 534 is connected to the output of delay unit 532 and thus receives the coefficients of C0(y) delayed by one clock cycle. As a result, addition unit 534 determines the following sums: c0[n/3−1]+c2[n/3−2], c0[n/3−2]+c2[n/3−3], . . . , c0[1]+c2[0]. After n/3 clock cycles, the control signal causes multiplexer 530 to connect the output of negation unit 528 to addition unit 534. As a result, c2[n/3−1], which is held in delay unit 526, is negated by negation unit 528 and is applied to the input of addition unit 534. Addition unit 534 then adds the negative of c2[n/3−1] to c0[0] to provide the last coefficient of D2(y).
In step 4, negation unit 535 and addition unit 536 determine P1(y)=D0(y)+(−D2(y)) and negation units 537 and 539 and addition units 538 and 540 determine P2(y)=C5(y)+(−D0(y))+(−D1(y)). In order to align D2(y) with D0(y) before the addition, a delay unit 542 delays D0(y) by one clock cycle. In addition, an addition unit 562 and a delay circuit that includes a demultiplexer 554 (also referred to as a switch), a delay unit 556, a negation unit 558, and a multiplexer 560 (also referred to as a switch), determine P0(y)=D2(y)+D1(y)·y mod $(y^{n/3}+1,\,q)$. The modular reduction is performed by delaying the most-significant coefficient, d1[n/3−1], by n/3 clock cycles and then subtracting the delayed value from d2[0]. Note that n/3 is equal to the number of coefficients in D2(y). When d1[n/3−1] appears on the output of addition unit 522, a control signal causes demultiplexer 554 to connect the output of addition unit 522 to the input of delay unit 556, which stores d1[n/3−1]. At the next clock cycle, the control signal to demultiplexer 554 and a control signal to multiplexer 560 cause demultiplexer 554 and multiplexer 560 to connect the output of addition unit 522 to an input of addition unit 562. As a result, for the next n/3−1 clock cycles, the coefficients of D1(y) are provided to one input of addition unit 562. The other input of addition unit 562 is connected to the output of addition unit 534 and thus receives the coefficients of D2(y). As a result, addition unit 562 determines the following sums: d2[n/3−1]+d1[n/3−2], d2[n/3−2]+d1[n/3−3], . . . , d2[1]+d1[0]. After n/3 clock cycles, the control signal causes multiplexer 560 to connect the output of negation unit 558 to addition unit 562. As a result, d1[n/3−1], which is held in delay unit 556, is negated by negation unit 558 and is applied to the input of addition unit 562. Addition unit 562 then adds the negative of d1[n/3−1] to d2[0] to provide the last coefficient of P0(y).
To align P2(y) with P0(y), P2(y) passes through two delay units 544 and 546. To align P1(y) with P0(y), P1(y) passes through delay unit 564.
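The expressions for C0(y) through C5(y), D0(y), D1(y), D2(y), P0(y), P1(y) and P2(y) given above can be exercised with a short software model. The following Python sketch is illustrative only: it ignores timing and alignment delays, the schoolbook routine `negacyclic` stands in for the six length-n/3 modular polynomial multipliers, and the helper names are assumptions. The length n is assumed to be a multiple of three.

```python
def negacyclic(a, b, q):
    """Schoolbook product modulo (x^m + 1, q) for two length-m lists."""
    m = len(a)
    p = [0] * m
    for i in range(m):
        for j in range(m):
            if i + j < m:
                p[i + j] += a[i] * b[j]
            else:
                p[i + j - m] -= a[i] * b[j]
    return [c % q for c in p]


def add(p, r):
    return [x + y for x, y in zip(p, r)]


def sub(p, r):
    return [x - y for x, y in zip(p, r)]


def times_y(v):
    """Multiply by y modulo y^m + 1: shift up, wrap the top term negated."""
    return [-v[-1]] + v[:-1]


def fast_3_parallel(a, b, q):
    a0, a1, a2 = a[0::3], a[1::3], a[2::3]     # three phases, y = x^3
    b0, b1, b2 = b[0::3], b[1::3], b[2::3]
    # Step 2: six modular products of length n/3.
    c0 = negacyclic(a0, b0, q)
    c1 = negacyclic(a1, b1, q)
    c2 = negacyclic(a2, b2, q)
    c3 = negacyclic(add(a0, a1), add(b0, b1), q)
    c4 = negacyclic(add(a1, a2), add(b1, b2), q)
    c5 = negacyclic(add(add(a0, a1), a2), add(add(b0, b1), b2), q)
    # Step 3: intermediate polynomials.
    d0 = sub(c3, c1)
    d1 = sub(c4, c1)
    d2 = sub(c0, times_y(c2))                  # C0 - C2*y mod (y^(n/3) + 1)
    # Step 4: the three phases of the product.
    p0 = add(d2, times_y(d1))                  # D2 + D1*y mod (y^(n/3) + 1)
    p1 = sub(d0, d2)
    p2 = sub(sub(c5, d0), d1)
    out = [0] * len(a)
    for r, phase in enumerate((p0, p1, p2)):
        out[r::3] = [c % q for c in phase]
    return out


a, b, q = [1, 2, 3, 4, 5, 6], [6, 5, 4, 3, 2, 1], 256
assert fast_3_parallel(a, b, q) == negacyclic(a, b, q)
```

The six length-n/3 products replace the nine that a direct three-phase expansion would require.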
In accordance with one embodiment, registers are added at the modular polynomial multiplier's outputs as shown by dashed line 580 to shorten the critical path of the system.
The fast 2-parallel architecture and/or fast 3-parallel architecture can be iterated to achieve higher levels of parallelism. Therefore, we can implement various fast M-parallel architectures, where the level of parallelism M can be a power-of-two integer, power-of-three integer, or product of any power-of-two and power-of-three. Note that the coefficients from all the sub-polynomials of P(x) should be aligned after all operations. This is similar to inserting a pipelining cutset to transform non-causal operations to causal operations, at the expense of an increase in latency by one cycle.
For example,
In Step 1 of Algorithm 3, A(x) and B(x) are each split as two parts, referred to as portions or sub-polynomials, based on the odd and even indices as part of the first iteration of the fast 2-parallel modular polynomial multiplier. This results in:
$$A(x)=A_0(x^2)+A_1(x^2)\cdot x$$
$$B(x)=B_0(x^2)+B_1(x^2)\cdot x$$
In the fast 2-parallel modular multiplier, the sub-polynomials formed through this decomposition and their sums were applied to three modular polynomial multipliers that were structurally identical to the modular multiplier of
The first step of applying the sub-polynomials and their sums to each fast 2-parallel modular multiplier is to decompose each sub-polynomial into sub-sub-polynomials (also referred to as sub-polynomials of portions of polynomials). For fast 2-parallel modular multiplier 602, this involves the following decompositions:
$$A_0(x^2)=A_{00}(x^4)+A_{01}(x^4)\cdot x^2$$
$$B_0(x^2)=B_{00}(x^4)+B_{01}(x^4)\cdot x^2$$
For fast 2-parallel modular multiplier 606 this involves the following decompositions:
$$A_1(x^2)=A_{10}(x^4)+A_{11}(x^4)\cdot x^2$$
$$B_1(x^2)=B_{10}(x^4)+B_{11}(x^4)\cdot x^2$$
For fast 2-parallel modular multiplier 604 this involves the following decompositions:
$$(A_0(x^2)+A_1(x^2))=(A_{00}(x^4)+A_{10}(x^4))+(A_{01}(x^4)+A_{11}(x^4))\cdot x^2$$
$$(B_0(x^2)+B_1(x^2))=(B_{00}(x^4)+B_{10}(x^4))+(B_{01}(x^4)+B_{11}(x^4))\cdot x^2$$
These decompositions result in the coefficients a[ ] and b[ ] of A(x) and B(x) being assigned to each sub-sub-polynomial as:
$$A_{00}(y)=a[0]+a[4]y+a[8]y^2+\cdots+a[n-4]y^{n/4-1},$$
$$A_{10}(y)=a[1]+a[5]y+a[9]y^2+\cdots+a[n-3]y^{n/4-1},$$
$$A_{01}(y)=a[2]+a[6]y+a[10]y^2+\cdots+a[n-2]y^{n/4-1},$$
$$A_{11}(y)=a[3]+a[7]y+a[11]y^2+\cdots+a[n-1]y^{n/4-1},$$
where
$$A(x)=A_{00}(y)+A_{10}(y)x+A_{01}(y)x^2+A_{11}(y)x^3,$$
$$B_{00}(y)=b[0]+b[4]y+b[8]y^2+\cdots+b[n-4]y^{n/4-1},$$
$$B_{10}(y)=b[1]+b[5]y+b[9]y^2+\cdots+b[n-3]y^{n/4-1},$$
$$B_{01}(y)=b[2]+b[6]y+b[10]y^2+\cdots+b[n-2]y^{n/4-1},$$
$$B_{11}(y)=b[3]+b[7]y+b[11]y^2+\cdots+b[n-1]y^{n/4-1},$$
where
$$B(x)=B_{00}(y)+B_{10}(y)x+B_{01}(y)x^2+B_{11}(y)x^3$$
and $y=x^4$.
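The assignment of coefficients to the four sub-sub-polynomials can be illustrated as follows. This Python sketch is an example only. It assumes a list representation in which the index equals the power of x, and the dictionary keys "00", "10", "01" and "11" are simply labels chosen to match the subscripts used above. The check at the end confirms that splitting by even and odd indices twice gives the same assignment as taking every fourth coefficient.

```python
def split_4(coeffs):
    """Return the four sub-sub-polynomials, keyed by their subscripts."""
    return {
        "00": coeffs[0::4],    # a[0], a[4], a[8], ...
        "10": coeffs[1::4],    # a[1], a[5], a[9], ...
        "01": coeffs[2::4],    # a[2], a[6], a[10], ...
        "11": coeffs[3::4],    # a[3], a[7], a[11], ...
    }


a = list(range(8))             # a[i] = i, so values show the original index
parts = split_4(a)
print(parts)  # {'00': [0, 4], '10': [1, 5], '01': [2, 6], '11': [3, 7]}

# Two successive even/odd splits give the same sub-sub-polynomials.
a0, a1 = a[0::2], a[1::2]
assert parts["00"] == a0[0::2] and parts["01"] == a0[1::2]
assert parts["10"] == a1[0::2] and parts["11"] == a1[1::2]
```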
At step 3, the three fast 2-parallel modular multipliers 602, 604 and 606, also referred to as circuits, execute in parallel resulting in sub-sub-polynomials of the product. In particular, fast 2-parallel modular multiplier 602 produces sub-sub-polynomials C0(y) and C1(y), fast 2-parallel modular multiplier 604 produces sub-sub-polynomials C2(y) and C3(y), fast 2-parallel modular multiplier 606 produces sub-sub-polynomials C4(y) and C5(y).
As shown in
At step 4, post processing is performed to form the sub-polynomials of the product: P0(y), P1(y), P2(y), and P3(y). P1(y)=C2(y)−C0(y)−C4(y) and is produced using negation units 607 and 609 and addition units 608 and 610. P2(y)=C1(y)+C4(y) and is produced using addition unit 612. P3(y)=C3(y)−C1(y)−C5(y) and is produced using negation units 613 and 615 and addition units 614 and 616.
Sub-polynomial P0(y) requires a modular reduction. The modular reduction is performed by delaying the most-significant coefficient, C5[n/4-1], by n/4 clock cycles and then subtracting the delayed value from C0[0]. Note that n/4 is equal to the number of coefficients in C5(y). To implement this modular reduction, fast 4-parallel modular polynomial multiplier 600 uses an addition unit 632 and a delay circuit that includes a demultiplexer 624 (also referred to as a switch), a delay unit 626, a negation unit 628, and a multiplexer 630 (also referred to as a switch). When C5[n/4-1] appears on the output of fast 2-parallel modular polynomial multiplier 606, a control signal causes demultiplexer 624 to connect output 620 of modular polynomial multiplier 606 to the input of delay unit 626, which stores C5[n/4-1]. At the next clock cycle, the control signal to demultiplexer 624 and a control signal to multiplexer 630 cause demultiplexer 624 and multiplexer 630 to connect the output of fast 2-parallel modular polynomial multiplier 606 to an input of addition unit 632. As a result, for the next n/4-1 clock cycles, the coefficients of C5(y) are provided to one input of addition unit 632. The other input of addition unit 632 is connected to the output of a delay unit 652 and thus receives the coefficients of C0(y) delayed by one clock cycle. As a result, addition unit 632 determines the following sums: C0[n/4-1]+C5[n/4-2], C0[n/4-2]+C5[n/4-3], . . . , C0[1]+C5[0].
After n/4 clock cycles, the control signal causes multiplexer 630 to connect the output of negation unit 628 to addition unit 632. As a result, C5[n/4-1], which is held in delay unit 626, is negated by negation unit 628 and is applied to the input of addition unit 632. Addition unit 632 then adds the negative of C5[n/4-1] to C0[0] to provide the last coefficient of P0(y).
Delay units 654, 656 and 658 are used to align P2(y), P1(y), and P3(y), respectively, with P0(y).
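The arithmetic of steps 3 and 4 can be checked in software. The sketch below is an illustration under stated assumptions, not the described circuit: it models only the coefficient arithmetic (no clock cycles, delay units or switches), uses a plain schoolbook product as a stand-in for the innermost multipliers, and all function names (`negacyclic_mul`, `times_var`, `fast2`, `fast4`) are hypothetical. It applies P1 = C2 − C0 − C4, P2 = C1 + C4, P3 = C3 − C1 − C5, and forms P0 by adding C5 shifted up by one position with its most-significant coefficient negated and wrapped to position 0.

```python
def negacyclic_mul(a, b):
    """Schoolbook product of a and b modulo (var^n + 1), n = len(a) = len(b)."""
    n = len(a)
    out = [0] * n
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            k = i + j
            if k < n:
                out[k] += ai * bj
            else:
                out[k - n] -= ai * bj   # var^n = -1
    return out


def add(u, v):
    return [s + t for s, t in zip(u, v)]


def sub3(u, v, w):
    return [s - t - r for s, t, r in zip(u, v, w)]


def times_var(p):
    """Multiply p by the variable modulo (var^len(p) + 1): the top
    coefficient is negated and wraps around to position 0."""
    return [-p[-1]] + p[:-1]


def fast2(a, b):
    """2-parallel product: returns the (even, odd) parts of a*b mod (var^n + 1)."""
    a0, a1 = a[0::2], a[1::2]
    b0, b1 = b[0::2], b[1::2]
    m0 = negacyclic_mul(a0, b0)
    m1 = negacyclic_mul(add(a0, a1), add(b0, b1))
    m2 = negacyclic_mul(a1, b1)
    return add(m0, times_var(m2)), sub3(m1, m0, m2)


def fast4(a, b):
    """4-parallel product built from three fast2 calls plus post processing."""
    a0, a1 = a[0::2], a[1::2]
    b0, b1 = b[0::2], b[1::2]
    c0, c1 = fast2(a0, b0)                     # role of multiplier 602
    c2, c3 = fast2(add(a0, a1), add(b0, b1))   # role of multiplier 604
    c4, c5 = fast2(a1, b1)                     # role of multiplier 606
    p0 = add(c0, times_var(c5))   # P0[0] = C0[0] - C5[n/4-1]; P0[k] = C0[k] + C5[k-1]
    p1 = sub3(c2, c0, c4)
    p2 = add(c1, c4)
    p3 = sub3(c3, c1, c5)
    out = [0] * len(a)
    out[0::4], out[1::4], out[2::4], out[3::4] = p0, p1, p2, p3
    return out


a_demo = list(range(1, 17))                          # arbitrary test coefficients
b_demo = [(-1) ** i * (i % 5) for i in range(16)]    # arbitrary small signed values
print(fast4(a_demo, b_demo) == negacyclic_mul(a_demo, b_demo))
```

Comparing `fast4` against the direct schoolbook product on the same inputs is a convenient sanity check of the post-processing signs.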
The timing performance can be theoretically derived as follows. The fast M-parallel design can reduce the response time to approximately n/M clock cycles. In general, the total latency of an M-parallel modular polynomial multiplier for L polynomial multiplications can be expressed as:
$$T_{lat} = \frac{n(1+L)}{M} + \lceil \log_2(M) \rceil \qquad (24)$$
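As a quick numerical illustration of equation (24), the following Python sketch evaluates the latency for a few parameter choices. The function name `total_latency` is hypothetical, the sketch assumes M divides n(1+L) so that the first term is an integer, and the sample values of L and M are chosen only for illustration (n = 256 is the size mentioned in the background).

```python
def total_latency(n, M, L):
    """Evaluate T_lat = n(1+L)/M + ceil(log2(M)) in clock cycles."""
    if M < 1 or (n * (1 + L)) % M != 0:
        raise ValueError("sketch assumes M >= 1 and M divides n(1+L)")
    ceil_log2_m = (M - 1).bit_length()   # equals ceil(log2(M)) for M >= 1
    return n * (1 + L) // M + ceil_log2_m


for M in (1, 2, 4):
    print(M, total_latency(256, M, L=1))
```

For n = 256 and L = 1 this evaluates to 512, 257 and 130 clock cycles for M = 1, 2 and 4, respectively.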
The performance of the embodiments described above was evaluated for the Saber scheme using a Verilog HDL implementation. Several changes are adopted specifically for the Saber scheme. Due to the Saber scheme's advantages, the basic components do not consume a large amount of hardware resources. In particular, the modular multiplier discussed above can be replaced by general adders because the random elements are small (the coefficients of polynomial B(x) are in [−4, 4]). As the moduli are power-of-two integers, the modular reduction can again be performed by simply keeping the lower bits. Note that the coefficients in both polynomials A(x) and B(x) are represented in sign-magnitude form, and the word-lengths of the magnitudes of these two polynomials are 13 bits and 3 bits, respectively. The modular adder calculates the 13-bit sum (sum) by adding the product (prod) of the corresponding a[i] and b[j], and the output from the register (acc) as shown in
where asign and bsign are the sign bits of the two operands a[i] and b[j], respectively.
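The following Python sketch illustrates one way such an accumulate step could behave. It is an assumption-laden illustration rather than the described circuit: it assumes the product is added to acc when the two sign bits agree and subtracted when they differ, that acc is held as a 13-bit residue, and that the small multiplication is formed with shifts and adds. The names `small_product` and `mac_step` are hypothetical.

```python
MASK = (1 << 13) - 1   # keep the lower 13 bits, i.e. reduce modulo 2**13


def small_product(a_mag, b_mag):
    """a_mag * b_mag for b_mag in 0..4 using only shifts and adds (assumed form)."""
    shift_add = {
        0: 0,
        1: a_mag,
        2: a_mag << 1,
        3: (a_mag << 1) + a_mag,
        4: a_mag << 2,
    }
    return shift_add[b_mag] & MASK


def mac_step(acc, a_mag, a_sign, b_mag, b_sign):
    """One accumulate step: add prod if the signs agree, subtract otherwise
    (assumed sign rule), then keep the lower 13 bits."""
    prod = small_product(a_mag, b_mag)
    if a_sign == b_sign:
        return (acc + prod) & MASK
    return (acc - prod) & MASK


print(mac_step(0, 5, 0, 3, 0))   # signs agree: 5 * 3 is added
print(mac_step(0, 5, 1, 3, 0))   # signs differ: 5 * 3 is subtracted mod 2**13
```

Masking with `MASK` mirrors the statement that reduction modulo a power of two only requires keeping the lower bits.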
The experiment was performed on the Xilinx Artix-7 AC701 FPGA board, which is recommended by NIST for PQC hardware implementation. In addition, since several prior works also used the high-performance Xilinx UltraScale+ FPGA board, we also report the performance of the present embodiments on this board to allow further comparison. For functionality verification, communication and data transmission between the FPGA and the PC use the universal asynchronous receiver-transmitter (UART) module provided by the AC701 device.
We first examine the performance of the modular polynomial multipliers, including systolic architecture (
For a fair comparison, we focus on the evaluation against the Roy architecture, since both designs use the same clock frequency while the implementation of the Zhu design has a much lower clock frequency. Compared to Roy, the present systolic modular polynomial multiplier has slightly fewer LUTs and less total latency while requiring a larger number of flip-flops (FFs) due to the additional shift registers. Our design achieves 18% and 25% reductions in the LUTs and the clock cycles, respectively, for all the polynomial multiplications in the Saber scheme. Even though our design requires more FFs in the data-path and shift registers, we argue that this has a smaller influence on the overall performance on UltraScale+ and Artix-7 FPGA boards, since both devices have a much higher resource budget for FFs than for LUTs.
Furthermore, the polynomial multiplier in Zhu's LWRpro and the compact polynomial multiplier in Zhu and Mera use the Karatsuba algorithm with 8 levels and 4 levels, respectively. For instance, the compact polynomial multiplier has a long critical path of five adders/subtractors and two multipliers in the interpolation part, which requires two pipelining stages to reduce the critical path and maintain a high frequency. The compact polynomial multiplier of Zhu and Mera targets low-area performance and requires only limited numbers of LUTs and FFs and only 38 DSP units, as shown in Table 1. While the compact polynomial multiplier has a lower LUT usage than the embodiments described above, it suffers from a low speed since it uses degree-64 polynomial multipliers that require 1168 clock cycles for each computation, which causes the actual latency in such a compact design to be around 19 times the latency in the present fast 4-parallel architecture, as presented in Table 2. If we consider the area-time product (ATP) [LUTs × µs] as the performance metric, our proposed fast 4-parallel architecture and the prior low-area design yield an ATP of 1.71×10^5 and 6.86×10^5, respectively, for the key generation. In other words, our design achieves a 75.07% reduction in the ATP. In addition, the modular polynomial multiplier in Zhu has the lowest clock cycle count among all the prior works but a lower clock frequency, as illustrated in Table 2. In comparison, the present fast 4-parallel architecture requires 14.72% fewer clock cycles and achieves a 65.85% reduction in the actual latency for the encryption. The present embodiments also achieve a 13.24% lower ATP than Zhu (1.36×10^5 in Zhu versus 1.18×10^5 in the present embodiments). Moreover, the design in Zhu requires 24.71% more DSPs than the present fast 4-parallel architecture.
Thus, the present embodiments achieve significant reductions in latency or in the delay (critical path), which lead to reductions in ATP when compared to the two prior works that employ Karatsuba polynomial multiplication.
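The ATP reduction percentages quoted above follow directly from the stated ATP values. The short Python sketch below reproduces that arithmetic from the rounded figures given in the text; the helper name `percent_reduction` and the variable names are hypothetical.

```python
def percent_reduction(baseline, improved):
    """Relative reduction of `improved` with respect to `baseline`, in percent."""
    return 100.0 * (baseline - improved) / baseline


# ATP values [LUTs x us] as quoted in the text.
vs_low_area = percent_reduction(6.86e5, 1.71e5)   # prior low-area design vs. fast 4-parallel
vs_zhu = percent_reduction(1.36e5, 1.18e5)        # Zhu vs. present embodiments

print(f"{vs_low_area:.2f}% {vs_zhu:.2f}%")
```

Rounded to two decimal places, these evaluate to 75.07% and 13.24%, matching the figures stated above.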
For the implementation of the entire Saber scheme, the modular polynomial multiplication is implemented by the present fast 4-parallel architecture. Table 3 presents the comparison of the FPGA performance with recent hardware implementations for the PQC schemes, including Saber as well as some other schemes for a more comprehensive comparison. The latency in our design is 52% less than the latency in Roy, where the reduction comes mainly from our optimized low-latency modular polynomial multiplier and the hash function block. For example, the total latency of SHA3-256 (which needs to process 32-byte, 64-byte, 992-byte, and 1088-byte seeds) operating in the hash function block is reduced from 585 clock cycles to 526 clock cycles in the Saber encapsulation. The rationale behind this latency reduction is as follows. Most open-source packages add stages of pipelining to achieve a high-frequency (short critical path) design in order to adapt to general applications. However, the critical path in the prior works lies in the NTT-based or schoolbook modular polynomial multiplier, which requires addition or multiplication, and is much longer than that of the Keccak core provided in the open-source packages, implying that some pipeline stages are redundant. Unlike the prior works, we implement our own hash function block, as we aim to reduce the total latency for computing the hash functions by eliminating unnecessary pipelining stages.
Although elements have been shown or described as separate embodiments above, portions of each embodiment may be combined with all or part of other embodiments described above.
Although the subject matter has been described in language specific to structural features and/or methodological acts, it is to be understood that the subject matter defined in the appended claims is not necessarily limited to the specific features or acts described above. Rather, the specific features and acts described above are disclosed as example forms for implementing the claims.