This disclosure relates to multipliers and in particular to large unsigned integer multipliers.
The schoolbook method (classical approach) to multiply two polynomials is to multiply each term of a first polynomial by each term of a second polynomial. For example, a first polynomial of degree 1 with two terms a1x+a0 may be multiplied by a second polynomial of degree 1 with two terms b1x+b0 by performing four multiply operations and three addition operations to produce a polynomial of degree 2 with three terms as shown below:
(a1x+a0)(b1x+b0)=a1b1x^2+(a0b1+a1b0)x+a0b0
The number of multiply operations and additions increases with the number of terms in the polynomials. For example, using the schoolbook method, the number of multiply operations to multiply two polynomials each having n terms is n^2 and the number of additions is (n−1)^2.
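As an illustration only, the schoolbook method can be sketched in Python as follows. The function name `schoolbook_multiply` and the coefficient-list representation (lowest degree first) are assumptions made for this sketch and are not part of the disclosure.

```python
def schoolbook_multiply(a, b):
    """Multiply two polynomials given as coefficient lists (lowest degree first).

    Returns (product_coefficients, number_of_scalar_multiplications).
    """
    result = [0] * (len(a) + len(b) - 1)
    multiplies = 0
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            result[i + j] += ai * bj  # one scalar multiply per pair of terms
            multiplies += 1
    return result, multiplies


# (3x + 2)(5x + 7): a1=3, a0=2, b1=5, b0=7
coeffs, count = schoolbook_multiply([2, 3], [7, 5])
print(coeffs, count)  # [14, 31, 15] 4
```

For two n-term inputs the inner statement runs n·n times, which matches the n^2 multiply count stated above.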
The Karatsuba (KA) algorithm reduces the number of multiply operations compared to the schoolbook method by multiplying two two-term polynomials (A(x)=(a1x+a0) and B(x)=(b1x+b0)), each having two coefficients ((a1, a0) and (b1, b0)), using three scalar multiplications instead of four multiplications as shown below:
C(x)=(a1x+a0)(b1x+b0)=a1b1x^2+((a0+a1)(b0+b1)−a0b0−a1b1)x+a0b0
Thus, four additions and three multiplications are required to compute the result C(x) of multiplying two two-term polynomials using the KA algorithm. The KA algorithm relies on the ability to perform shift operations faster than a standard multiplication operation.
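A minimal sketch of the two-term KA step is shown below, assuming plain Python integers as coefficients. The function name `karatsuba_two_term` is hypothetical and chosen only for this illustration.

```python
def karatsuba_two_term(a1, a0, b1, b0):
    """Return (c2, c1, c0) of (a1*x + a0)(b1*x + b0) using three multiplications."""
    high = a1 * b1                # multiplication 1
    low = a0 * b0                 # multiplication 2
    mid = (a0 + a1) * (b0 + b1)   # multiplication 3, after two additions
    c1 = mid - low - high         # two subtractions recover a0*b1 + a1*b0
    return high, c1, low


print(karatsuba_two_term(3, 2, 5, 7))  # (15, 31, 14)
```

When the polynomial is evaluated at x = 2^t to multiply integers, the multiplications by x and x^2 become left shifts by t and 2t bits, which is why fast shift operations matter for the KA algorithm.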
Encryption/decryption operations typically require integer multiply operations to be performed on large operand sizes, for example, 512-bit operands. This is typically implemented in a large core hardware multiplier.
Features of embodiments of the claimed subject matter will become apparent as the following detailed description proceeds, and upon reference to the drawings, in which like numerals depict like parts, and in which:
Although the following Detailed Description will proceed with reference being made to illustrative embodiments of the claimed subject matter, many alternatives, modifications, and variations thereof will be apparent to those skilled in the art. Accordingly, it is intended that the claimed subject matter be viewed broadly, and be defined only as set forth in the accompanying claims.
Examples of multipliers that operate on operands (multiplier/multiplicand) having 8 through 64 bits include multipliers that use an array-multiplier organization, a shift accumulate algorithm and tree-based multipliers such as Wallace or Dadda. However, these multipliers do not scale well for operand sizes greater than 64-bits.
The Karatsuba (KA) algorithm reduces the number of multiply operations compared to the schoolbook method by multiplying two two-term polynomials (A(x)=(a1x+a0) and B(x)=(b1x+b0)), each having two coefficients ((a1, a0) and (b1, b0)), using three scalar multiplications instead of four multiplications.
In an embodiment of the present invention, a multiplication problem having an operand size greater than 64-bits is decomposed using the KA algorithm into a plurality of multiplication operations that operate on operands having less than or equal to 64-bits. The decomposition allows techniques used in multipliers that operate efficiently on operands in the range 8 through 64-bits to be combined in a modular fashion. The decomposition of large multiply operations (operating on operands greater than 64-bits) into small multiply operations (operating on operands in the range 8 through 64-bits) results in fewer multiply operations at the expense of more additions/subtractions.
In an embodiment of the present invention, a large integer multiplier unit includes a small multiplier block to perform a sequence of small multiply and add/subtract operations efficiently using the KA algorithm.
The system 100 includes a processor 101, a Memory Controller Hub (MCH) 102 and an Input/Output (I/O) Controller Hub (ICH) 104. The MCH 102 includes a memory controller 106 that controls communication between the processor 101 and memory 108. The processor 101 and MCH 102 communicate over a system bus 116. In an alternate embodiment, the functions in the MCH 102 may be integrated in the processor 101 and the processor 101 coupled directly to the ICH 104.
The processor 101 may be any one of a plurality of processors such as a single core Intel® Pentium IV® processor, a single core Intel Celeron processor, an Intel® XScale processor or a multi-core processor such as Intel® Pentium D, Intel® Xeon® processor, or Intel® Core® Duo processor or any other type of processor.
The memory 108 may be Dynamic Random Access Memory (DRAM), Static Random Access Memory (SRAM), Synchronized Dynamic Random Access Memory (SDRAM), Double Data Rate 2 (DDR2) RAM or Rambus Dynamic Random Access Memory (RDRAM) or any other type of memory.
The ICH 104 may be coupled to the MCH 102 using a high speed chip-to-chip interconnect 114 such as Direct Media Interface (DMI). DMI supports 2 Gigabit/second concurrent transfer rates via two unidirectional lanes.
The ICH 104 may include a storage I/O controller 110 for controlling communication with at least one storage device 112 coupled to the ICH 104. The storage device may be, for example, a disk drive, Digital Video Disk (DVD) drive, Compact Disk (CD) drive, Redundant Array of Independent Disks (RAID), tape drive or other storage device. The ICH 104 may communicate with the storage device 112 over a storage protocol interconnect 118 using a serial storage protocol such as Serial Attached Small Computer System Interface (SAS) or Serial Advanced Technology Attachment (SATA).
The processor 101 includes a large integer multiplier 103 to perform multiplication problems, that is, to compute the result of multiplying a multiplier and a multiplicand. The multiplication problems (operations) may be used to encrypt or decrypt information stored in memory 108 and/or stored in the storage device 112.
In an embodiment of the present invention, the 512×512 “large” multiply operation is decomposed into a plurality of 64×64 “small” multiply operations. The decomposition of large multiply operations into a plurality of small KA multiply operations results in fewer multiply operations at the expense of more additions and subtractions.
Referring to
The Karatsuba algorithm requires 3 multiply operations to multiply two two-term polynomials (A(x)=(a1x+a0) and B(x)=(b1x+b0)), each having two coefficients ((a1, a0) and (b1, b0)), as shown below:
C(x)=(a1x+a0)(b1x+b0)=a1b1x^2+((a0+a1)(b0+b1)−a0b0−a1b1)x+a0b0
The three multiply operations are: (1) a1 b1; (2) a0 b0; and (3) (a0+a1) (b0+b1). The first two multiply operations, (1) a1 b1 and (2) a0 b0, use t-bit operands, whereas the third multiply operation, (3) (a0+a1) (b0+b1), uses (t+1)-bit operands.
As shown in
The first level of decomposition subdivides 512-bit multiplier (operand) A and 512-bit multiplicand (operand) B into two 256-bit sub-elements (for example, A21, A20, B21, B20 (
The second level of decomposition subdivides each of the two 256-bit sub-elements from the first level into two 128-bit sub-elements (for example, A13, A12, A11, A10, B13, B12, B11, B10(
Each second level operation includes 3 third level operations that each perform a multiply operation on 68-bit operands (64 data bits plus extension bits that hold up to 3 carry bits) using the “small” multiplier block. As shown in
In the embodiment shown, the number of levels is three; however, the number of levels is not limited to three. The number of levels used depends on the performance required, the size of the multiplier and the cost of the additional add and subtract operations, which require extra registers for storage.
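The three-level decomposition can be modeled in software with a short recursive sketch. This is an illustration under stated assumptions, not the hardware schedule described in this disclosure: the function name `karatsuba`, the `stats` dictionary and the use of Python's built-in multiply as a stand-in for the "small" multiplier block are all hypothetical.

```python
def karatsuba(a, b, bits, levels, stats):
    """Multiply unsigned integers a and b, each nominally `bits` wide,
    using `levels` levels of Karatsuba decomposition."""
    if levels == 0:
        stats["small_multiplies"] += 1
        stats["max_operand_bits"] = max(
            stats["max_operand_bits"], a.bit_length(), b.bit_length())
        return a * b  # stand-in for the "small" multiplier block
    half = bits // 2
    mask = (1 << half) - 1
    a0, a1 = a & mask, a >> half   # a1 may carry extra bits from earlier sums
    b0, b1 = b & mask, b >> half
    low = karatsuba(a0, b0, half, levels - 1, stats)
    high = karatsuba(a1, b1, half, levels - 1, stats)
    mid = karatsuba(a0 + a1, b0 + b1, half, levels - 1, stats)
    return (high << bits) + ((mid - low - high) << half) + low


stats = {"small_multiplies": 0, "max_operand_bits": 0}
a = (1 << 512) - 1
b = (1 << 512) - 1
product = karatsuba(a, b, 512, 3, stats)
assert product == a * b
print(stats["small_multiplies"])  # 27
```

In this model each level of recursion can widen the operands by one bit, so the operands reaching the leaf multiplies never exceed 67 bits for 512-bit inputs, which fits within a 68-bit small multiplier.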
Referring to
C(x)=A21·B21·x^2+((A21+A20)·(B21+B20)−A20·B20−A21·B21)·x+A20·B20
There are three multiply operations: (i) A21·B21, performed in 302-1; (ii) A20·B20, performed in 302-2; and (iii) (A21+A20)·(B21+B20), performed in 302-3. As shown in FIG. 2, A21 includes segments a7-a4; A20 includes segments a3-a0; B21 includes segments b7-b4; and B20 includes segments b3-b0.
In the second level, each of the first level operations is decomposed into a KA multiplication, with each of the second levels 304-1, . . . , 304-9 having three multiply operations. For example, the three second level multiply operations decomposed from first level 302-1 are: (i) A13·B13, performed in 304-1; (ii) A12·B12, performed in 304-2; and (iii) (A13+A12)·(B13+B12), performed in 304-3. As shown in
In the third level, each of the second level operations is decomposed into a KA multiplication, with each of the third levels having three multiply operations. For example, the three third level multiply operations decomposed from second level 304-1 are: (i) a7·b7, performed in 306-1; (ii) a6·b6, performed in 306-2; and (iii) (a7+a6)·(b7+b6), performed in 306-3. Segments a7-a6 and b7-b6 are shown in
In an embodiment of the invention, 27 multiply operations are performed using 64-bit segments a7:a0 and b7:b0 of the 512-bit operands A, B shown in
An embodiment of a multiple phase multiplier that performs the 27 multiply operations using 64-bit segments of the 512-bit operands and combines the partial results of the multiply operation to provide a 1024-bit result will be described later in conjunction with
A First In First Out memory (FIFO) in the MMP 402 stores the multiplier (A), and the multiplicand (B). The multiplier unit 103 starts working on the multiplier and multiplicand (A, B) of a new problem when it has finished a previous problem and detects that a sufficient portion of the bits of each of the multiplier and multiplicand have been enqueued into result FIFOs in the MMP 402. In an embodiment, the least-significant-words (LSW) are enqueued first. The multiplier unit 103 is designed to operate without stalling to maximize performance.
The multiplier 103 is a (16*k+e−1) by (16*k+e−1) bit multiplier that is fully parameterized using two global variables: k and e. Global variable ‘e’ is derived from the fact that the KA decomposition grows the operands at the Most Significant Bit (MSB). Every recursion of the KA algorithm increments the largest potential MSB by one. Thus, the selection of e=4 is sufficient to handle multiplication of operands having up to 2^(10+log2(k))−1 bits.
A multiply operation is optimized based on the choice of the number of levels of Karatsuba decomposition and the order in which the plurality of (2k+e)×(2k+e) multiply operations are performed and the results of the multiply operations are combined. In an embodiment, the Karatsuba Multiplier Unit 400 includes full-adders, a core (2k+e)×(2k+e) multiplier, Carry Save Adders (CSAs), and memory such as Random Access Memory (RAM). Partial products may be sequenced and re-combinations ordered in multiple balanced phases to provide efficient usage of the memory and low latency, largely independent of the operand size. In an embodiment, the KA multiplier unit 400 includes two (k+e)-bit carry propagate adders and five k-bit carry propagate adders. The value k is a power of two in order to simplify the transfer of data to/from 32-bit data paths. The multiplier and multiplicand operands are 2k bits wide. In one embodiment, k is 32.
The MMP 402 serializes the data for the multiplier and multiplicand by dividing the multiplier and multiplicand into k-bit segments and sending multiplier and multiplicand data to the multiplier k-bits at a time. In an embodiment, the KA multiplier unit 400 includes five logic blocks (referred to as phase 0-4 interfaces) which will be described in greater detail later in conjunction with
The “small” multiplication operations use a Karatsuba Multiplier unit 400 that performs multiply operations on operands having 2k bits. The results of all of these multiply operations are combined using add/subtract operations spread over a plurality of pipeline stages in the Karatsuba Multiplier 400.
Referring to
Referring to
Returning to
An embodiment will be described to compute a 1024-bit product of two operands each having 512 bits (that is, k is 32, e is 4). However, the invention is not limited to computing a 1024-bit product of 512-bit operands. The large integer multiplier unit 103 may compute a 2×(N×2^M)-bit product of two operands each having (N×2^M) bits using M levels of Karatsuba in 3^M cycles. For example, N=64 and M=3 correspond to 512-bit operands and 27 cycles.
In the embodiment shown, the KA Multiplier unit 400 includes a (2 k+e)-bit unsigned core multiplier 502 (integer multiplier block), carry-save accumulator blocks, carry propagate adders, registers and memory 524a-b, that may be Random Access Memory for storing data between phase interfaces. The KA multiplier 400 may also include a state machine for sequencing multiply operations, addition operations and data-transfers to/from input/result First In First Out (FIFO)s in the MMPs 402.
In an embodiment for multiplying a 512-bit multiplicand and a 512-bit multiplier with operands treated as unsigned integers, the Karatsuba multiplier unit 400 takes 27-cycles to compute the 1024-bit product.
In the embodiment shown in
Operand segments each having (2*k)-bits are received from an MMP 402 and stored in memory in the phase 0 interface (block) 506. In an embodiment, the (2*k)-bits are received as two k-bit segments. The first k-bits received has the low order (Least Significant Bits (LSB)) k-bits of the (2*k)-bits segment and the second k-bits received has the high-order (Most Significant Bits (MSB)) k-bits of the (2*k) bits segment.
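The serialization into k-bit words and the regrouping into (2*k)-bit segments can be sketched as follows. The helper names `to_k_bit_words` and `pair_into_2k_segments` are hypothetical, and the sketch models only the data ordering (low-order word first), not the FIFO or memory hardware.

```python
K = 32  # k in the text; one embodiment uses k = 32


def to_k_bit_words(value, total_bits, k=K):
    """Split an unsigned integer into k-bit words, least significant word first."""
    mask = (1 << k) - 1
    return [(value >> shift) & mask for shift in range(0, total_bits, k)]


def pair_into_2k_segments(words, k=K):
    """Join consecutive (low, high) k-bit words into (2*k)-bit segments."""
    return [words[i] | (words[i + 1] << k) for i in range(0, len(words), 2)]


operand = (1 << 511) | 0xDEADBEEF
words = to_k_bit_words(operand, 512)      # sixteen 32-bit words
segments = pair_into_2k_segments(words)   # eight 64-bit segments
print(len(words), len(segments))          # 16 8
```

The first word of each pair supplies the low-order k bits of the segment and the second word supplies the high-order k bits, matching the arrival order described above.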
The phase 0 block 506 includes four propagate adders 504, two for operand A (one for each sub-segment) and two for operand B (one for each sub-segment).
The phase 0 interface 506 also includes a plurality of registers (memory buffers) for performing the level 3 decompositions described in conjunction with
The phase 0 interface 506 performs the initial additions and multiplications. A 27-element ‘Karatsuba triangle’ is generated from 8-element operands, each element having 64 bits. The KA algorithm requires subtractions in the middle section of the triangle as shown below:
C(x)=A21·B21·x^2+((A21+A20)·(B21+B20)−A20·B20−A21·B21)·x+A20·B20
The subtractions ((A21+A20)·(B21+B20)−A20·B20−A21·B21) are handled separately in combining Carry Save Adders (CSAs) using the ones-complement of the products and compensating at a suitable point in time.
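The ones-complement technique can be illustrated with fixed-width modular arithmetic. This is a sketch under assumptions: the 136-bit width is taken from the product width of the core multiplier in the described embodiment, the function names are hypothetical, and the "+1" compensation is applied immediately here, whereas the hardware may defer it to a later point.

```python
WIDTH = 136                 # assumed working width (width of a core product)
MASK = (1 << WIDTH) - 1


def ones_complement(x):
    """Bitwise inversion within WIDTH bits."""
    return ~x & MASK


def middle_term(mid, low, high):
    """Compute mid - low - high using inversion plus a +1 compensation.

    In WIDTH-bit arithmetic, -(low + high) == ~(low + high) + 1.
    """
    inverted = ones_complement((low + high) & MASK)
    uncorrected = (mid + inverted) & MASK
    return (uncorrected + 1) & MASK  # compensation for the ones-complement


a0, a1 = 0xFFFFFFFFFFFFFFFF, 0x0123456789ABCDEF
b0, b1 = 0x0F0F0F0F0F0F0F0F, 0xFEDCBA9876543210
result = middle_term((a0 + a1) * (b0 + b1), a0 * b0, a1 * b1)
assert result == a0 * b1 + a1 * b0
```

Because the true middle term is non-negative and fits in the working width, the modular result equals the exact value of a0·b1 + a1·b0.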
Table 1 below illustrates an embodiment of a schedule of operations performed in 27-cycles in the phase 0 interface 506 to decompose one of the sub-segments of the (2*k)-bits segment.
The sub-segment (LSB or MSB) of the 512-bit operand includes 256-bits which are further sub-divided into eight 32-bit portions. The LSB sub-segment and the MSB sub-segment are identical other than the carry handling. The carry-in bits for the LSB sub-segment are all zero and may be zero or one for the MSB sub-segment dependent on the result of the operation performed in the LSB sub-segment.
KA multiplication is performed in the core multiplier 502 using one of the 32-bit portions of the LSB sub-segment and the corresponding one of the 32-bit portions of the LSB segment of each respective operand (A, B) which may be referred to as aL0-aL7 for the LSB sub-segment of operand A. Each element 600 in the 27-element triangle shown in
In the example shown in
As discussed previously, the KA multiplication algorithm performs the following operations:
C(x)=A1·B1·x^2+((A1+A0)·(B1+B0)−A0·B0−A1·B1)·x+A0·B0
Referring to Table 1, the KA multiplication algorithm is performed nine times using different elements 600. The KA multiplication algorithm computes the following nine products using a total of 27 multiply operations: (1) E1·E2; (2) E4·E5; (3) E10·E11; (4) E13·E14; (5) (E1:E2)·(E4:E5); (6) (E10:E11)·(E14:E15); (7) E19·E20; (8) E22·E23; and (9) (E22:E19)·(E23:E20). To compute the first of the nine products, that is, (1) E1·E2, the KA multiplication algorithm uses elements labeled E1, E2 and E3, with element E1 corresponding to A0, element E2 corresponding to A1 and element E3 corresponding to (A1+A0). The decomposition shown in
In cycle 1 of the 27-cycle KA multiplication, the phase 0 interface 506 outputs the last element (element number 27) associated with the previous problem.
In cycle 2, element E1 of the current problem is output to the core multiplier. For operand A, element E1 includes the LSBs of the LSB sub-segment of the A operand, that is, the 32 LSBs of A0, which as shown in
In cycle 3, while the carry sum adder computes the sum of element E1 and element E4 for use in computing (E1:E2)·(E4:E5), element E2 is output to the core multiplier to compute the second of the three products of the KA algorithm for E1·E2. A carry (C) may also be added to the sum of element E1 and element E4 to provide element E7. As discussed earlier, the carry (C) added to the sum of element E1 and element E4 for the LSB sub-segment is zero. The carry (C) may be zero or one for the MSB sub-segment dependent on the result of the operation performed in the LSB sub-segment.
In cycle 4, while the carry sum adder computes the sum of element E5 and element E4 for use in computing (E1:E2)·(E4:E5), element E3 (A1+A0) is output to the core multiplier to compute the last of the three products of the KA algorithm for the product of E1·E2, that is, (A1+A0)·(B1+B0).
In cycle 5, while the carry sum adder computes the sum of element E2 and element E5 to provide element E8 used to compute (E1:E2)·(E4:E5), element E4, that is, A2, is output to the core multiplier to compute product (E1:E2)·(E4:E5).
In cycle 6, while the carry sum adder computes the sum of element E8 and element E7 to provide element E9 used to compute (E1:E2)·(E4:E5), element E5, that is, A3, is output to the core multiplier to compute product E4·E5.
In cycle 7, while the carry sum adder computes the sum of element E10 and element E13 to provide element E12 used to compute E10·E11, element E7 is output to the core multiplier to compute (E1:E2)·(E4:E5).
In cycle 8, while the carry sum adder computes the sum of element E10 and element E11 to provide element E12 to compute E10·E11, element E8 is output to the core multiplier to compute (E1:E2)·(E4:E5).
In cycle 9, while the carry sum adder computes the sum of element E14 and element E11 to provide element E17 to compute (E10:E11)·(E14:E15), element E9 is output to the core multiplier to compute E10·E11.
In cycle 10, element E10 is output to the core multiplier to compute E10·E11.
The remaining cycles 11 through 27 follow a similar pattern to cycles 1 to 10 as shown in Table 1 to compute the remainder of the nine KA multiplication operations.
The core multiplier 502 computes all of the partial products of the KA Algorithm as discussed earlier. The core multiplier receives two (2*k+e)-bit operands from the phase 0 interface and outputs a (4*k+2*e)-bit result in redundant form to the phase 1 interface (block). In an embodiment, the core multiplier may be pipelined in order to decrease the delay of the critical path.
The core multiplier 502 computes the result of multiplying a 68 bit multiplicand and a 68 bit multiplier. The result is a 136 bit partial product. The order of the plurality of 68 bit×68 bit multiply operations is fixed as discussed in conjunction with the sequence of operations performed by the phase 0 interface in Table 1 and is chosen to reduce latency and minimize storage space in registers in the multiply unit 100. The 136-bit product is generated in carry-save redundant (CSR) format.
The partial products are combined with previous accumulated partial results (also in carry-save redundant format) in the carry-save accumulator blocks in each of the phase interfaces.
The phase 1 interface (module/block) 508 performs recombination of the lowest recursion level. For example, the phase 1 interface 508 performs the recombination of the second lowest recursion level of the KA Algorithm, that is, first level operations 302-1, 302-2, 302-3, . . . 302-27. The recombination is performed using elements numbered 1, 2, 3 in
First, the 128-bit products (a0*b0) and (a1*b1) received from the core multiplier 502, plus carry bits, are added in a carry save adder and the result of the computation is inverted to provide −(a0*b0+a1*b1). Next, the product (a0+a1)*(b0+b1) received from the core multiplier 502 is added in the carry save adder to −(a0*b0+a1*b1). The result of these two operations, that is, recombination (x1, x2), is forwarded to the phase 2 interface 510.
In the first pass through the adder in the phase 2 interface 510, the segment x0 received from the phase 1 interface 508 is passed through the 4k adder, with the other adder inputs set to 0, to the phase 3 interface 512. This value is output as y0.
In the second pass, −(x0+x2) is calculated by the adder and temporarily stored in memory for use in the next pass through the adder.
In the third pass, the result of the second pass is added to x1 to compute x1−(x0+x2). The result is stored in memory for use in the next pass.
In the fourth pass, the result of the third pass is added to x4 to compute x4+x1−(x0+x2). The result is stored in memory for use in the next pass and is output as the second segment of the output (y1) of the phase 2 module.
In the fifth pass, x1 passes through the adder with the other adder inputs set to 0. The value x1 is stored in memory for use in the next pass.
In the sixth pass, x3 and the result of the fifth pass (x1) are added to compute −(x1+x3). The result is stored in memory for use in the next pass.
In the seventh pass, x2 and the result of the sixth pass are added to compute x2−(x1+x3). The result is stored in memory for use in the next pass.
In the eighth pass, x5 and the result of the seventh pass are added to compute x5+x2−(x1+x3). The result is forwarded to the phase 3 module as the third segment of the output (y2).
In the ninth pass, x3 passes through the adder with the other adder inputs set to 0 and is forwarded to the phase 3 module as the fourth segment of the output (y3).
These nine passes through the adder in the phase 2 module occur three times per problem. Every recursive iteration operates for one segment of the output. The nine passes discussed above construct four recursive iterations of addition, each iteration handling one segment.
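The nine passes can be checked with a small software model. The sketch below assumes one particular reading of the inputs that is not stated explicitly in the text: (x1:x0) is the low product, (x3:x2) is the high product and (x5:x4) is the middle product of one KA step, each split into t-bit segments. The function name `phase2_recombine` is hypothetical, and the outputs are left unnormalized (they may be negative or wider than t bits), similar in spirit to a redundant representation.

```python
def phase2_recombine(x):
    """Model the nine adder passes; x = [x0, x1, x2, x3, x4, x5]."""
    x0, x1, x2, x3, x4, x5 = x
    y0 = x0 + 0        # pass 1: x0 passes through
    t = -(x0 + x2)     # pass 2
    t = t + x1         # pass 3: x1 - (x0 + x2)
    y1 = t + x4        # pass 4: x4 + x1 - (x0 + x2)
    t = x1 + 0         # pass 5: x1 passes through
    t = -(t + x3)      # pass 6: -(x1 + x3)
    t = t + x2         # pass 7: x2 - (x1 + x3)
    y2 = t + x5        # pass 8: x5 + x2 - (x1 + x3)
    y3 = x3 + 0        # pass 9: x3 passes through
    return [y0, y1, y2, y3]


T = 64                                   # assumed segment width t
MASK = (1 << T) - 1
a, b = 0xFEDCBA98765432100123456789ABCDEF, 0x0F1E2D3C4B5A69788796A5B4C3D2E1F0
a0, a1, b0, b1 = a & MASK, a >> T, b & MASK, b >> T
low, high, mid = a0 * b0, a1 * b1, (a0 + a1) * (b0 + b1)
x = [low & MASK, low >> T, high & MASK, high >> T, mid & MASK, mid >> T]
y = phase2_recombine(x)
assert sum(seg << (T * i) for i, seg in enumerate(y)) == a * b
```

Under this reading, weighting segment y_i by 2^(t·i) and summing reproduces the full product, which is the KA recombination low + (mid − low − high)·2^t + high·2^(2t).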
The phase 3 interface 512 handles the recombination of the highest recursion level of the KA Algorithm for the three levels shown in
The phase 4 interface 514 performs conversion of the redundant output of the phase 3 module (z7, . . . z0) into non-redundant form. The phase 4 interface 514 includes a carry-propagation adder that retires 64-bit result words. The carry-propagation adder in the phase 4 interface 514 returns a non-redundant result. The data output from the phase 4 interface 514 is sent through separate First In First Out (FIFO) blocks as low order data and high order data back to the MMP 402.
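The conversion from redundant to non-redundant form can be sketched as a word-by-word carry-propagate addition. This is a simplified model under assumptions: the redundant value is taken to be a (sum, carry) pair of non-negative integers, and the function name `retire_words` is hypothetical.

```python
def retire_words(sum_part, carry_part, num_words, word_bits=64):
    """Add a redundant (sum, carry) pair and return result words, LSW first."""
    mask = (1 << word_bits) - 1
    carry_in = 0
    words = []
    for i in range(num_words):
        s = (sum_part >> (i * word_bits)) & mask
        c = (carry_part >> (i * word_bits)) & mask
        total = s + c + carry_in       # carry-propagate add of one word
        words.append(total & mask)     # retire one 64-bit result word
        carry_in = total >> word_bits  # carry into the next word
    return words


words = retire_words((1 << 64) - 1, 1, 2)
print(words)  # [0, 1]
```

Retiring the least significant word first lets each carry ripple into the next word as it is produced, so only a word-wide adder is modeled here rather than a full-width one.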
The combining CSA phases (phase interfaces 1-3 508, 510, 512) are decomposed into phases that are well-balanced in terms of critical paths and correspond to the level of recursion in the Karatsuba algorithm. The width of the CSA phases is optimized for area; when a larger sum is required in a phase, it is performed in multiple cycles dependent on the latency (delay) before the sum is needed for a subsequent operation.
The 512-bit×512-bit multiply operation is performed using a sequence of multiplication operations using the “small” 68-bit×68-bit multiplier and combining add/subtract operations on the results of the small multiply operations spread over a plurality of pipeline stages. The sequence of multiply and add/subtract operations is performed efficiently using a hardware implementation of the KA algorithm using carry-save adders (CSA). A CSA computes the sum of three or more n-bit numbers and outputs a partial sum and carry bit(s).
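For readers unfamiliar with carry-save addition, the following sketch shows a 3:2 carry-save adder operating on Python integers. The names `carry_save_add` and `accumulate` are hypothetical, and the model handles non-negative values only.

```python
def carry_save_add(x, y, z):
    """Reduce three numbers to a (partial_sum, carry) pair.

    No carry propagates between bit positions; each bit is handled independently.
    """
    partial_sum = x ^ y ^ z                      # per-bit sum without carries
    carry = ((x & y) | (x & z) | (y & z)) << 1   # per-bit majority, weighted by 2
    return partial_sum, carry


def accumulate(values):
    """Accumulate values in redundant form, then resolve once at the end."""
    s, c = 0, 0
    for v in values:
        s, c = carry_save_add(s, c, v)
    return s + c  # single carry-propagate addition


print(carry_save_add(5, 6, 7))      # (4, 14)
print(accumulate([3, 5, 7, 11]))    # 26
```

Deferring the carry-propagate addition to the end is what allows many partial products to be combined with short critical paths.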
The ordering of partial results (partial sum and carry bit(s)) from each level affects the overall propagate size, the number and width of CSAs, the number and width of carry-propagate adders, the registers and memory required.
An embodiment of the invention has been described for 512-bit operands. In other embodiments other operand sizes such as 256-bit or 1024-bit may be used.
Performance is optimized based on selection of the number of levels of Karatsuba decomposition, the organization of the full-adders, multiplier, Carry Save Adders (CSAs), and memory, and the sequencing of partial products and ordering of recombinations in multiple balanced phases with efficient usage of memory and low latency. Latency is largely independent of the operand size.
It will be apparent to those of ordinary skill in the art that methods involved in embodiments of the present invention may be embodied in a computer program product that includes a computer usable medium. For example, such a computer usable medium may consist of a read only memory device, such as a Compact Disk Read Only Memory (CD ROM) disk or conventional ROM devices, or a computer diskette, having a computer readable program code stored thereon.
While embodiments of the invention have been particularly shown and described with references to embodiments thereof, it will be understood by those skilled in the art that various changes in form and details may be made therein without departing from the scope of embodiments of the invention encompassed by the appended claims.