1. Field of the Invention
The present invention relates to high-performance digital arithmetic algorithms and circuitry. In particular, the present invention relates to an apparatus and method for high-speed modulo multiplication and division that are particularly useful for the implementation of data encryption in computer systems and networks.
2. Description of the Related Art
Advances in networking and data processing speeds have led to the need for high-speed cryptosystems. Military applications, financial transactions and multimedia communications are examples of particular fields and applications that require fast authentication and secure communication.
Public-key cryptosystems, which are based upon one-way mathematical functions, are popular because they do not require a complex key distribution mechanism. Commonly used public-key systems, e.g., the Rivest-Shamir-Adleman system (RSA), the Elgamal system and Elliptic-Curve Cryptosystems (ECC), utilize modular multiplication operations heavily for both encryption and decryption.
Encryption and decryption algorithms may be implemented using either software or hardware. Software implementations are less expensive and easy to modify, but slow. Hardware implementations are more expensive and difficult to modify, but are considerably faster than software implementations. Hardware implementations are being studied for mass distribution because of their high speed, which results in greater convenience, increased network efficiency, greater productivity, and consequent cost savings. The speed of hardware cryptosystems depends upon the complexity of the implemented algorithm, the efficiency of the hardware implementation, and the technology used for the implementation. Accordingly, efficient hardware implementation of modular multipliers is essential in the design of efficient high-speed crypto-processors.
The RSA algorithm is one of the most widely used public key cryptographic methods. According to the RSA algorithm, if M represents a message to be encrypted (M being an integer produced by processing a plain text message by a symmetric algorithm, with padding if required to prevent unauthorized decryption of the message) and C represents the ciphered message, then the RSA algorithm is based upon the following three requirements: 1) finding integers e, d and N satisfying $M = M^{ed} \bmod N$; 2) it should be relatively easy to compute $M^e$ and $C^d$; and 3) it should be almost impossible to find d knowing only e and N.
Typically, N is a large, difficult to factor integer, and the message block M satisfies $0 \le M \le N$. The ciphertext C is computed by the relation $C = M^e \bmod N$. The plaintext message can be retrieved using the decryption key d as follows: $M = C^d \bmod N = (M^e)^d \bmod N = M^{ed} \bmod N$. With key sizes of approximately 1024 or 2048 bits, the speed of both encryption and decryption depends heavily on the speed of the modulo multiplication operation.
The modulus N is defined as the product of two prime numbers p, q where N=pq. Therefore, $\varphi(pq) = (p-1)(q-1)$, where φ(x) is the number of positive integers which are smaller than x and are relatively prime or coprime to x. The decryption key d is computed such that $\gcd(\varphi(N), d) = 1$, $1 < d < \varphi(N)$, and $e \equiv d^{-1} \bmod \varphi(N)$.
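The relations above can be exercised with a toy example. The following is a minimal sketch only, assuming tiny illustrative primes and a hypothetical helper name (rsa_toy_keys); it chooses e first and derives d as its inverse modulo φ(N), which satisfies the same congruence as stated above. Real systems use much larger keys and padding.

```python
from math import gcd

def rsa_toy_keys(p, q, e):
    """Derive (N, e, d) from two primes p, q and a public exponent e (toy sizes only)."""
    n = p * q
    phi = (p - 1) * (q - 1)          # phi(pq) = (p-1)(q-1)
    if gcd(e, phi) != 1:
        raise ValueError("e must be coprime to phi(N)")
    d = pow(e, -1, phi)              # e*d = 1 (mod phi(N)); needs Python 3.8+
    return n, e, d

# Illustrative values only (assumed, not taken from the text).
n, e, d = rsa_toy_keys(61, 53, 17)
m = 65                               # message block, 0 <= m < n
c = pow(m, e, n)                     # C = M^e mod N
recovered = pow(c, d, n)             # M = C^d mod N
```

Both the encryption and the decryption line reduce to repeated modular multiplications, which is why the speed of that single operation dominates.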
The Elgamal algorithm has two public keys, N and g, where N is a large prime number, N−1 has at least one large prime factor, and g is a primitive element mod N. Each party has its own private key KR_x (where 1<KR_x<N−1) and its own public key KU_x, which can be computed from the private key as follows: KU_x=gK
For USER_A to send a message M ($0 \le M \le N$) to USER_B, USER_A must first choose a random number U ($0 < U < N$), and then a transaction key K is computed using USER_B's public key, KU_b, as follows: $K = (KU\_b)^{U} \bmod N$.
The ciphered message is then computed as a pair $C = (c_1, c_2)$, where $c_1 = g^U \bmod N$ and $c_2 = KM \bmod N$. It should be noted that the size of the encrypted message is twice the size of the original message. USER_B may decrypt the ciphered message C by first retrieving the transaction key K. This should be a relatively easy process for USER_B, since: K≡KU_bU≡(gKR
Elliptic curve cryptosystems (ECC) are commonly viewed as being secure for both commercial and government usage. According to the IEEE 1363-2000 standard, an RSA key of 1024 bits has security equivalent to an ECC with keys of 172 bits. The cost of complex mathematical operations increases significantly with the length of the input operands. For prime fields of characteristic p>3, the elliptic curve equation is given by E: $y^2 = x^3 + ax + b \pmod{p}$.
The primary operation in an ECC is point multiplication C=kP, where P is a point (x, y) on the curve and k is an integer. The multiplication is performed using group operation. The operation in the Abelian group of points on an elliptic curve is called “point addition”. This operation adds two curve points yielding another point on the curve. Using an ECC for signatures involves the repeated application of the group law. The group law using affine coordinates is shown below:
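As a hedged illustration, the standard textbook affine formulas for the group law on $y^2 = x^3 + ax + b \pmod{p}$ can be sketched as follows. The function names (ec_add, ec_mul), the use of None for the point at infinity, and the tiny curve used in the usage lines are assumptions made for this sketch.

```python
def ec_add(P, Q, a, p):
    """Add affine points P and Q on y^2 = x^3 + a*x + b (mod p); None is the point at infinity."""
    if P is None:
        return Q
    if Q is None:
        return P
    x1, y1 = P
    x2, y2 = Q
    if x1 == x2 and (y1 + y2) % p == 0:
        return None                                   # P + (-P) = point at infinity
    if P == Q:
        lam = (3 * x1 * x1 + a) * pow((2 * y1) % p, -1, p) % p   # tangent slope (doubling)
    else:
        lam = (y2 - y1) * pow((x2 - x1) % p, -1, p) % p          # chord slope (addition)
    x3 = (lam * lam - x1 - x2) % p
    y3 = (lam * (x1 - x3) - y1) % p
    return (x3, y3)

def ec_mul(k, P, a, p):
    """Compute kP by left-to-right double-and-add."""
    R = None
    for bit in bin(k)[2:]:
        R = ec_add(R, R, a, p)
        if bit == "1":
            R = ec_add(R, P, a, p)
    return R

# Toy curve y^2 = x^3 + 2x + 2 (mod 17) with base point (5, 1), chosen for illustration.
doubled = ec_add((5, 1), (5, 1), 2, 17)
tripled = ec_mul(3, (5, 1), 2, 17)
```

Every slope, x and y computation above is a multiplication, inversion, or subtraction modulo p.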
These field operations are all modular operations, thus requiring modular multiplication to be used heavily.
As noted above, modular arithmetic operations are of great importance in encryption systems and methodologies. Exponentiation is performed as a number of squaring and multiplication operations depending on the length of the exponent. A generalized exponentiation algorithm (hereafter referred to as Algorithm 1) is shown below, with the objective being to compute $X = Y^E$:
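A minimal sketch of a left-to-right square-and-multiply procedure of this kind is given here; the function name and the use of Python's bin() to walk the exponent bits are assumptions of the sketch, not the listing of Algorithm 1 itself.

```python
def power(y, e):
    """Compute y**e by scanning the bits of e from most to least significant."""
    x = 1
    for bit in bin(e)[2:]:      # bits of the exponent, most significant first
        x = x * x               # one squaring per exponent bit
        if bit == "1":
            x = x * y           # one extra multiplication per 1 bit
    return x
```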
In the above, k is the number of bits in the exponent E; $E = e_{k-1}, e_{k-2}, \ldots, e_2, e_1, e_0$; and $e_i$ is the ith bit of E. The above algorithm can be easily modified for modular exponentiation by replacing the multiplication in the above algorithm with a modular multiplication, as shown below. The objective of the following algorithm (hereafter referred to as Algorithm 2) is to compute $X = Y^E \bmod N$:
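The modular variant differs only in that every product is reduced modulo N. The following sketch assumes an illustrative function name and uses Python's % operator as a stand-in for the modular multiplier discussed in the remainder of this description.

```python
def mod_power(y, e, n):
    """Compute y**e mod n with one modular squaring per bit and one modular multiply per 1 bit."""
    x = 1 % n
    for bit in bin(e)[2:]:      # exponent bits, most significant first
        x = (x * x) % n         # modular squaring
        if bit == "1":
            x = (x * y) % n     # modular multiplication
    return x
```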
The modulo multiplication operation computes (A×B mod N), where A, B and N are k-bit integers. Modular multiplication is generally considered a difficult arithmetic operation to implement, since it involves both multiplication and division operations. The multiplication is performed either through first performing the multiplication operation and then performing the modular reduction operation through division; or through interleaving the reduction operations with the multiplication steps.
For k-bit operands, the first approach requires a k×k-bit multiplier with a 2k-bit output register followed by a 2k×k-bit divider. Thus, the hardware requirements of the first approach are quite excessive. In the second approach, the product is computed iteratively by accumulating one partial product term ($2^i b_i \times A$) per iteration. The modular reduction operation is performed after each such iteration. The reduction step involves a trial subtraction of the modulus N from the running product P. The algorithm given below (hereafter referred to as Algorithm 3) shows the general procedure for this approach, where the trial subtractions keep the running product less than the modulus N. In this case, the adder size and the P register size are only (k+2) bits. The two additional bits are to accommodate a sign bit and the left shift operation (P=2P). The second approach is thus more hardware efficient, but requires more additions and/or subtractions. It would be advantageous if only a few bits (the most significant bits) of P could determine the correct multiple of N to be subtracted from the running product P in order to avoid costly comparisons or trial subtractions. The objective of Algorithm 3 is to compute AB mod N:
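The interleaved approach can be sketched as follows. This is an illustrative sketch under the assumption that A and B are already less than N; the function name is hypothetical, and the trial subtractions are written as comparisons rather than as signed subtract-and-restore steps.

```python
def interleaved_modmul(a, b, n, k):
    """Compute a*b mod n for 0 <= a, b < n < 2**k by shift, add, and trial subtraction."""
    p = 0
    for i in range(k - 1, -1, -1):   # multiplier bits, most significant first
        p = 2 * p                    # left shift: P = 2P
        if (b >> i) & 1:
            p += a                   # accumulate the partial product b_i * A
        # Here p < 3n, so at most two trial subtractions bring it back below n.
        if p >= n:
            p -= n
        if p >= n:
            p -= n
    return p
```

Each iteration costs up to three full-width additions or subtractions, which is the overhead that a selection based on only a few leading bits of P would avoid.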
For the past two decades, the dominant approach for performing modulo multiplication has been the Montgomery algorithm, which is characterized by the following: it uses the least, instead of the most, significant bits of the running product to perform an addition, rather than a subtraction; it performs a shift right operation on each iteration instead of a shift left; it maps operands into another domain, processes them, and maps the result back to the normal domain, so that significant pre- and post-computations are necessary; and it works only if N and $2^k$ are coprime or relatively prime, i.e., $\gcd(N, 2^k) = 1$. Algorithm 4, given below, shows a general Montgomery Product (hereafter referred to as the function “MonPro”) algorithm, in which $R = 2^k$; $R^{-1}$ is the multiplicative inverse of R modulo N, i.e., $R R^{-1} \bmod N = 1$; and N′ is defined such that $R R^{-1} - N N' = 1$, i.e., $N' = -N^{-1} \bmod R$. The objective of Algorithm 4 is to compute MonPro(A, B, N):
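A minimal sketch of the textbook Montgomery product, using the definitions of R and N′ given above, is shown here. The function signature and variable names are assumptions of the sketch; it requires an odd modulus N < R and operands less than N.

```python
def monpro(a, b, n, k):
    """Return a*b*R^-1 mod n, with R = 2**k, n odd, and 0 <= a, b < n < R."""
    r = 1 << k
    n_prime = (-pow(n, -1, r)) % r   # N' = -N^-1 mod R (Python 3.8+)
    t = a * b
    m = (t * n_prime) % r            # chosen so that t + m*n is divisible by R
    u = (t + m * n) >> k             # exact division by R
    return u - n if u >= n else u    # single final correction
```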
The MonPro(A, B, N) algorithm does not directly yield the required result of AB mod N, but rather $\mathrm{MonPro}(A, B, N) = A B R^{-1} \bmod N$. Accordingly, instead of operating on the inputs A and B directly, the MonPro algorithm operates on the N-residues of A and B. The N-residue of some number A is defined as $\bar{A} = (A \times R) \bmod N$. The N-residue domain contains all the values between 0 and (N−1). Therefore, there is a one-to-one mapping between the elements of the N-residue domain and integers between 0 and (N−1). The MonPro procedure itself may be used to compute the N-residue of A, as follows:
$\bar{A} = \mathrm{MonPro}(A, R^2 \bmod N, N) = A R \bmod N$
However, this requires the precomputation of $R^2 \bmod N$. Accordingly, the modulo multiplication $A \times B \bmod N$ is computed as follows:
Precomputation of steps 1 and 2 above needs to be performed only once for a given system with a particular value of k and N. However, precomputations of steps 3 and 4 must be performed for each new set of MonPro operands. Thus, the operands A and B should first be mapped into the N-residue domain where A is mapped into Ā=AR mod N, and B is mapped into
For a single modular multiplication operation, the cost of precomputations and mapping to and from the N-residue domain is unacceptably excessive. However, for modulo exponentiation $X^E \bmod N$, where modulo multiplication is performed repeatedly, this cost is tolerable since mapping is performed only once at the beginning to the N-residue domain and once at the end from the N-residue domain. No intermediate mapping is required and the exponentiation process is performed on the mapped N-residue input. The below algorithm (hereinafter referred to as Algorithm 5) shows the modulo exponentiation algorithm utilizing the MonPro procedure. The primary objective of Algorithm 5 is to compute $X = Y^E \bmod N$:
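The flow described above (map in once, square and multiply in the N-residue domain, map out once) can be sketched as follows. The helper monpro is a compact textbook Montgomery product included only to keep the sketch self-contained; all names are illustrative assumptions, and N must be odd with Y < N < $2^k$.

```python
def monpro(a, b, n, k):
    """Textbook Montgomery product: a*b*R^-1 mod n with R = 2**k (n odd)."""
    r = 1 << k
    m = (a * b * (-pow(n, -1, r))) % r
    u = (a * b + m * n) >> k
    return u - n if u >= n else u

def mont_exp(y, e, n, k):
    """Compute y**e mod n using only MonPro calls after one precomputation."""
    r2 = (1 << (2 * k)) % n            # precomputed R^2 mod N
    y_bar = monpro(y, r2, n, k)        # map Y into the N-residue domain (Y*R mod N)
    x_bar = monpro(1, r2, n, k)        # N-residue of 1 (R mod N)
    for bit in bin(e)[2:]:             # exponent bits, most significant first
        x_bar = monpro(x_bar, x_bar, n, k)
        if bit == "1":
            x_bar = monpro(x_bar, y_bar, n, k)
    return monpro(x_bar, 1, n, k)      # map the result back to the normal domain
```

The two mapping calls and the one unmapping call are the fixed overhead; the loop body never leaves the N-residue domain.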
Algorithm 4 is a relatively inefficient implementation of the Montgomery multiplication method. A more efficient simplified radix 2 version is shown in the below algorithm (hereinafter referred to as Algorithm 6). In Algorithm 6, two addition operations are performed per iteration. Thus, the total number of additions per MonPro computation is (2k+1). Using a Carry Propagate Adder (CPA) with a delay of order k, denoted as O(k), the delay of one MonPro computation is $O(2k^2)$. Alternatively, if Carry Save Adders (CSAs) are used, the main MonPro loop will have a constant delay irrespective of the value of k. In this case, two CSAs will be required for the main loop, and a carry propagate adder will be required to both assimilate the result and perform the final correction step (If P>N Then P=P−N). With CSAs, the loop delay equals the delay of the two CSAs plus the delay of two AND gates (computing $b_i A$ and $p_0 N$) plus the delay of latching the results into registers. Accordingly, with k loop iterations, the loop delay of one MonPro computation is O(2k).
The objective of Algorithm 6 is to compute MonPro(A, B, N).
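A hedged sketch of a radix-2 Montgomery product with two conditional additions per iteration and one final correction is given here. The function name is an assumption, operands are assumed to be less than an odd N < $2^k$, and the final test is written as P ≥ N so that a result equal to N is reduced to zero.

```python
def monpro_radix2(a, b, n, k):
    """Return a*b*2^-k mod n for odd n, 0 <= a, b < n < 2**k, one multiplier bit per step."""
    p = 0
    for i in range(k):          # multiplier bits, least significant first
        if (b >> i) & 1:
            p += a              # first addition: b_i * A
        if p & 1:
            p += n              # second addition: p_0 * N makes P even
        p >>= 1                 # exact right shift
    if p >= n:
        p -= n                  # final correction step
    return p
```

Counting both conditional additions over k iterations plus the correction gives the (2k+1) additions mentioned above.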
Table I below summarizes the delay for Modulo Exponentiation where TCPA is the worst-case delay of a CPA and TCSA is the delay of a CSA.
None of the above methods or algorithms, taken either singly or in combination, is seen to describe the instant invention as claimed. Thus, an apparatus and method for high-speed modular multiplication and division solving the aforementioned problems is desired.
The method for high-speed modulo multiplication is a method for multiplying integers A and B modulo N that is optimized for high speed implementation in an electronic device, which may be implemented in software, but is preferably implemented in hardware. The multiplication is performed on devices requiring no more than k+2 bits, where k is the number of significant bits in A, B, and N, and where the most significant bit of N must be 1. The method accumulates the running product from the terms $b_i A_W$, where $A_W$ is either A when the previous running product is negative, or W when the previous running product is positive, W being a negative quantity designated the N-conjugate of A, which equals A−N if A−N is negative, or A−2N otherwise. On each iteration, the magnitude of the running product is reduced by a scaling factor no greater than 2N according to the state of the two most significant bits of the running product when carry propagate adders are used, or three bits of the running product carry and product sum when carry save adders are used.
When implemented by a carry propagate adder, the running product is simply summed by the adder. When implemented by a carry save adder, the product carry and the product sum are separately reduced according to the state of the sum of the three most significant bits of the product carry and product sum. With slight modification, the method can produce the quotient of A×B/N as well as AB (mod N).
These and other features of the present invention will become readily apparent upon further review of the following specification and drawings.
Similar reference characters denote corresponding features consistently throughout the attached drawings.
The present invention is directed towards an apparatus and method for high-speed modulo multiplication and division. In its simplest form, the invention is a method for high-speed modulo multiplication. The method includes an algorithm that may be implemented in software, but is preferably implemented in hardware for greater speed. The apparatus includes a circuit configured to carry out the algorithm. The circuit may be incorporated into the architecture of a computer processor, into a security coprocessor integrated on a motherboard with a main microprocessor, into a digital signal processor, into an application specific integrated circuit (ASIC), or other circuitry associated with a computer, electronic calculator, or the like. The method may be modified so that the circuit may include carry propagate adders, or the circuit may include carry save adders. With additional modification, the method can not only perform modulo multiplication, but also simultaneous multiplication and division.
A primary application for the apparatus and method is in connection with networked computer or digital communication devices, where the method and circuitry provide for high speed performance of modular arithmetic operations involved in the encryption and decryption of messages, where the method and the circuitry provide increased speed for greater circuit efficiency, increased productivity, and lower network overload and costs.
Turning first to a method for high-speed modulo multiplication using carry propagate adders, the method is used when it is required to compute P=AB mod N, where the multiplicand A, the multiplier B, and the modulus N are all k-bit unsigned numbers. The modulus N is typically, for cryptographic algorithms, chosen to be a large odd number so that $2^{k-1} < N \le 2^k - 1$. Thus, the smallest possible value of N is $N_{min} = 2^{k-1} + 1$, and the largest possible value of N is $N_{max} = 2^k - 1$.
The steps of the algorithm are shown below in Algorithm 7.
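The steps as they are described in this specification (initialize, shift, add A or W, scale by inspecting the two leading bits, correct) can be sketched in software as follows. This is an illustrative reading of the description, not the Algorithm 7 listing itself: the function name is hypothetical, Python integers stand in for the (k+2)-bit two's complement register, and the final correction is written with loops as a defensive choice even though the description states that one addition or subtraction suffices.

```python
def modmul_alg7_sketch(a, b, n, k):
    """Compute a*b mod n for k-bit a, b and a k-bit n whose most significant bit is 1."""
    assert n >> (k - 1) == 1, "N must have a 1 in its most significant (k-th) bit"
    assert 0 <= a < (1 << k) and 0 <= b < (1 << k)
    w = a - n if a < n else a - 2 * n       # N-conjugate of A, always negative
    p = 0
    for i in range(k - 1, -1, -1):          # multiplier bits, most significant first
        p *= 2                              # shift step
        if (b >> i) & 1:                    # add step: operand with sign opposite to P
            p += a if p < 0 else w
        top = (p >> k) & 0b11               # (P_{k+1}, P_k) of the two's complement value
        if top == 0b01:
            p -= 2 * n                      # large positive: scale down by 2N
        elif top == 0b10:
            p += 2 * n                      # large negative: scale up by 2N
        elif top == 0b11 and i == 0:
            p += n                          # last iteration only
    while p < 0:                            # correction step (defensive loops)
        p += n
    while p >= n:
        p -= n
    return p
```

Because every added quantity is congruent to either A or 0 modulo N, the running value stays congruent to the partial product throughout, which is why the result agrees with ordinary modular multiplication for both odd and even moduli.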
In Algorithm 7, the parameter W is the N-conjugate of A and is a negative quantity, and is the only parameter that needs to be precomputed. The product P is computed iteratively by simple addition and left-shifting of k partial product terms ($b_i A$). The product is computed cumulatively so that the value of the running product P in each iteration is kept within k bits by adding/subtracting a scaling quantity that is a multiple of the modulus ($\alpha N$) so that it does not affect the final result ($x \bmod N = (x \pm \alpha N) \bmod N$).
Whenever $b_i \ne 0$, the add step (step c of Algorithm 7) will always reduce the magnitude of the running product P. This is done by adding either A or its N-conjugate (W), whichever has an opposite sign to P. The product P=AB mod N is represented in signed 2's complement format using k+2 bits, i.e., two additional bits are needed. One bit, $P_{k+1}$, is used as a sign bit while the other is required to accommodate the left shift operation (step b of Algorithm 7). This leads to area-efficient implementations with registers and adders that are only k+2 bits. Thus, the smallest allowed value of P is $P_{min} = -2^{k+1}$, and the largest allowed value of P is $P_{max} = 2^{k+1} - 1$.
By adding/subtracting the proper multiple of N to/from the running product P, the scaling step (step d of Algorithm 7) guarantees that no overflow may occur as a result of the shift operation performed in step b. Thus, the objective of the scaling step is to obtain a scaled running product value $P_s$ with a reduced magnitude so that its left-shifted value (step b of Algorithm 7) is within the allowed range, i.e., $P_{min} \le 2P_s \le P_{max}$. Thus, the lower bound of the scaled running product, $P_{s(min)}$, is $-2^k$, and the upper bound of the scaled running product, $P_{s(max)}$, is $2^k - 1$. Further, the correction step (step e of Algorithm 7) requires no more than one addition/subtraction to get the correct result.
In the first step of the loop, the running product is left shifted by one bit, as indicated at block 320. The loop performs an addition, as indicated at step 330, for each bit in B that is a binary 1, beginning in the first iteration with the most significant bit of B. If the k+1 bit (the sign bit) in the running product register is a binary 1 (the partial sum is negative), then the addition at step 330 comprises adding A to the running product; otherwise, the N-conjugate of A (a negative integer) is added to the running product.
In the next step of the loop, the running product is scaled, as indicated at 340, to ensure that the result will be k-bits long. If the k+1 and k bits of the running product are both equal to 0 or both equal to 1, no scaling is necessary, except that when both of the bits are binary 1, N is added to the running product in the last iteration of the loop, i.e., for the least significant bit of B. If the k+1 and k bits of the running product are binary 0 and binary 1, respectively, then 2N is subtracted from the running product. If the k+1 and k bits of the running product are binary 1 and binary 0, respectively, then 2N is added to the running product.
The index is then decremented and the loop is reiterated until all bits in B have been tested.
Upon completion of k iterations through the loop, a correction may be made to the running product, if necessary, as indicated at step 350. If the k+1 bit of the running product is a binary 1, i.e., the running product is negative, then the modulus N is added to the running product, or if the running product is greater than or equal to the modulus, then the modulus N is subtracted from the running product. The output of the algorithm is the corrected running product P, which is equal to AB (mod N).
The scaling factor α is computed so that $P_{s(min)} \le P + \alpha N \le P_{s(max)}$. The scaling factor is fully defined by inspecting the two most significant bits ($P_{k+1}$, $P_k$) of the running product P. Thus, only four cases need to be considered, i.e., ($P_{k+1}$, $P_k$)=00, 01, 10 or 11.
For ($P_{k+1}$, $P_k$)=00 or 11, the magnitude of P fits within k bits and, accordingly, can be left-shifted without risk of overflow. Thus, in these cases, the value of P is passed without any scaling, i.e., α=0. In the last iteration of the algorithm, however, N is added instead of zero if ($P_{k+1}$, $P_k$)=11 in order to improve the execution efficiency of the correction step (step e of Algorithm 7).
In the case where ($P_{k+1}$, $P_k$)=01, P is a large positive number with a 1 in the (k+1)th bit position and, accordingly, must be scaled down by adding a negative scaling quantity. Since the k least significant bits of P are unknown, the scaling constant α (which is negative in this case) must satisfy the following two conditions:
$\mathrm{Max}(P) + \alpha N_{min} \le P_{s(max)}$; and (a)
$\mathrm{Min}(P) + \alpha N_{max} \ge P_{s(min)}$. (b)
For the above condition (a), $\alpha N_{min} \le P_{s(max)} - \mathrm{Max}(P)$, which can alternatively be expressed as $\alpha(2^{k-1} + 1) \le (2^k - 1) - (2^{k+1} - 1)$, so that $\alpha \le -2^k/(2^{k-1} + 1)$. By defining $\delta_1$ as $2/(2^{k-1} + 1)$, α is finally expressed as $\alpha \le -2 + \delta_1$.
For the above condition (b), $\alpha N_{max} \ge P_{s(min)} - \mathrm{Min}(P)$, which can alternatively be expressed as $\alpha(2^k - 1) \ge (-2^k) - (2^k)$, so that $\alpha \ge -2^{k+1}/(2^k - 1)$. By defining $\delta_2$ as $2/(2^k - 1)$, α is finally expressed as $\alpha \ge -2 - \delta_2$. Thus, for ($P_{k+1}$, $P_k$)=01, the proper value of α is given by −2.
For the case where ($P_{k+1}$, $P_k$)=10, P is a large negative number with a magnitude of k+1 bits, and α is positive. Accordingly, P must be scaled up by adding a proper multiple of N. In this case, the scaling factor α must satisfy the following conditions:
$\mathrm{Min}(P) + \alpha N_{min} \ge P_{s(min)}$; and (c)
$\mathrm{Max}(P) + \alpha N_{max} \le P_{s(max)}$. (d)
For the above condition (c), $\alpha N_{min} \ge P_{s(min)} - \mathrm{Min}(P)$, which can alternatively be expressed as $\alpha(2^{k-1} + 1) \ge -2^k - (-2^{k+1})$, so that $\alpha \ge 2^k/(2^{k-1} + 1)$. By defining $\delta_3$ as $2/(2^{k-1} + 1)$, α is finally expressed as $\alpha \ge 2 - \delta_3$.
For the above condition (d), $\alpha N_{max} \le P_{s(max)} - \mathrm{Max}(P)$, which can alternatively be expressed as $\alpha(2^k - 1) \le (2^k - 1) - (-2^{k+1} + 2^k - 1)$, so that $\alpha \le 2^{k+1}/(2^k - 1)$. By defining $\delta_4$ as $2/(2^k - 1)$, α is finally expressed as $\alpha \le 2 + \delta_4$. Thus, for ($P_{k+1}$, $P_k$)=10, the proper value of α is 2.
It should be noted that without the magnitude reduction of the running product P resulting from the addition step (step c of Algorithm 7), it would not have been possible to find solutions for the scaling factor α in all cases using two bits. Further, it should be noted that whereas Montgomery's algorithm works only for odd moduli, Algorithm 7 works for both odd and even moduli. To show that the above scaling process also applies to even moduli, only the value of $N_{min}$ needs to be changed from $(2^{k-1} + 1)$ to $2^{k-1}$. This affects only the two conditions that involve $N_{min}$, where the values of the corresponding δ terms ($2/(2^{k-1} + 1)$ in the derivation above) become zero. However, this does not alter the selected values of the scaling factors α, showing that the algorithm can work for even as well as odd moduli.
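The bounds derived above can also be checked exhaustively for small word sizes. The following sketch (function name assumed) tests every k-bit modulus with a leading 1, odd or even, and every register value in the 01 and 10 ranges, confirming that α=−2 and α=2 respectively bring P into the range [$-2^k$, $2^k - 1$].

```python
def scaling_keeps_range(k):
    """Check that P - 2N (top bits 01) and P + 2N (top bits 10) land in [-2**k, 2**k - 1]."""
    lo, hi = -(1 << k), (1 << k) - 1                  # Ps(min), Ps(max)
    for n in range(1 << (k - 1), 1 << k):             # all N with a 1 in the top bit
        for p in range(1 << k, 1 << (k + 1)):         # (P_{k+1}, P_k) = 01
            if not lo <= p - 2 * n <= hi:
                return False
        for p in range(-(1 << (k + 1)), -(1 << k)):   # (P_{k+1}, P_k) = 10
            if not lo <= p + 2 * n <= hi:
                return False
    return True

ok_for_small_k = all(scaling_keeps_range(k) for k in range(2, 9))
```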
The operation of the algorithm can be illustrated by an example. The numbers used will be trivial for the sake of brevity. Suppose it is desired to find 2×3 (mod 4). Then A=2, B=3, and N=4. The number of bits, k, should be large enough to encompass the significant digits of A, B, and N. Thus, k=3 and, accordingly, the size of the running product is k+2=5 bits.
In the initialization step, $P_s$=00000 (the 0 at k+2 is the sign bit and the 0 at k+1 is an extra bit to accommodate the left shifts and prevent overflow). W=A−N=2−4=−2, which is expressed as 11110 in 2's complement. Finally, the index i for the selected bit of B is initialized to k−1=3−1=2.
In the first iteration of the loop, the left shift leaves $P_s$=00000, and since B is expressed as 011 in binary, $b_2$=0, so no addition is performed. ($P_{k+1}$, $P_k$)=00, so no scaling is done. Index i is decremented to a value of 1.
In the second iteration, the left shift of P is again 00000. Since $b_1$=1 and $P_{k+1}$=0, P=P+W=00000+11110=11110. In the case statement, ($P_{k+1}$, $P_k$) is 11, so that no scaling is needed. The index i is decremented to 0. In the third iteration through the loop, the left shift produces P=11100, and since $b_0$=1 and $P_{k+1}$=1, P=P+A=11100+00010=11110. In the case statement, ($P_{k+1}$, $P_k$) is 11, and since i=0, scaling requires that $P_s$=P+N=11110+00100=00010. In the correction step, $P_{k+1}$=0, and since $P_s$=2, $P_s$<N, so that no correction is required, and by the algorithm 2×3 (mod 4)=2. It is easily verified that the result is correct by performing the multiplication and division in base 10 (2×3=6, and 6 mod 4=2).
The clock period of circuit 10 is equal to the worst-case delay of the (k+2) CPA 18 plus the delay of the two multiplexers 14 and 16 plus the latching delay of the P-register 20. The clock period is dependent on the value of k, since the worst-case adder delay depends on the carry propagation delay through all of the (k+2) adder bits.
Algorithm 7 may be modified to yield a quotient resulting from dividing (A×B) by N; i.e., the modified algorithm implements a multiplier-divider which computes A×B/N, yielding both a quotient Q and a remainder P, i.e., A×B=(Q×N)+P, where |P|<N. In the following Algorithm 8, the multiplier-divider requires a (k+2)-bit adder and register, which is far more efficient than the SRT divider, which requires a (2k+2)-bit adder and register:
Algorithm 8 is substantially the same as Algorithm 7, with the addition of Quotient Q and constant g. Q is initialized to 0 and g is initialized to 1 if A<N or to 2 if A≧N. Q is left shifted on each iteration through the loop and incremented by g when the corresponding bit of B is equal to 1. Q is scaled whenever the running product P is, according to the rules set forth above. Q is corrected by decrementing Q by 1 when P is negative, or by adding 1 when P is greater than or equal to modulus N. It should be noted that whereas the above Algorithm 8 can yield both the remainder and the quotient, the Montgomery algorithm can only yield the remainder.
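A hedged software sketch of such a multiplier-divider is given below. It is one reading of the description rather than the Algorithm 8 listing: the function name is hypothetical, and the sketch assumes that Q is incremented by g exactly when W (that is, A−gN) is the operand added, and that Q moves by 2 or 1 in the opposite direction whenever 2N or N is added to P, so that A×(bits of B consumed so far)=Q×N+P holds after every step.

```python
def muldiv_alg8_sketch(a, b, n, k):
    """Return (Q, P) with a*b == Q*n + P and 0 <= P < n, for k-bit inputs and MSB(n) = 1."""
    assert n >> (k - 1) == 1 and 0 <= a < (1 << k) and 0 <= b < (1 << k)
    g = 1 if a < n else 2
    w = a - g * n                          # N-conjugate of A (negative)
    p = q = 0
    for i in range(k - 1, -1, -1):
        p *= 2                             # shift both the running product
        q *= 2                             # and the running quotient
        if (b >> i) & 1:
            if p < 0:
                p += a                     # adding A leaves Q unchanged
            else:
                p += w                     # adding A - g*N
                q += g                     # is compensated in Q (assumed rule)
        top = (p >> k) & 0b11              # (P_{k+1}, P_k)
        if top == 0b01:
            p -= 2 * n
            q += 2
        elif top == 0b10:
            p += 2 * n
            q -= 2
        elif top == 0b11 and i == 0:
            p += n
            q -= 1
    while p < 0:                           # correction (defensive loops)
        p += n
        q -= 1
    while p >= n:
        p -= n
        q += 1
    return q, p
```

For example, muldiv_alg8_sketch(14, 83, 100, 7) yields the quotient and remainder of 1162 divided by 100.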
More efficient hardware implementations of Algorithm 7 are possible if carry save adders (CSAs) are utilized rather than the CPAs. The major advantage of this approach is getting a constant clock period, which is independent of the adder size, i.e., independent of k. In this case, the product P is represented in a redundant format as two signed components: a sum component PS and a carry component PC. Since the scale factors used in the scaling step depend on the most significant bits of P, a 3-bit CPA is used to add the three most significant bits (i.e., the (k+1)th, the kth, and the (k−1)th) of PS and PC. The resulting three sum bits Z2:0=PSk+1:k−1+PCk+1:k−1 are used to choose a proper scale factor in the scaling step. It should be noted that the resulting Z bits are not necessarily equal to the most significant bits of P; i.e., Pk+1:k−1. The computation error ε is given by ε=Pk+1:k−1−Z2:0, where 0≦ε<2k−1. Accordingly, Z2:0≦Pk+1:k−1≦Z2:0+ε, or, given an upper bound, Z2:0≦Pk+1:k−1≦Z2:0+001.
Given this upper bound of the error ε, the proper values of the scale factor α may be computed for various values of Z. The following Algorithm 9 is similar to Algorithm 7, but utilizes CSAs, as described above:
Similar to the scaling procedure shown above, the scaling factor α may also be computed for the CSA implementation so that the minimum and maximum ranges are described by $P_{s(min)} \le P + \alpha N \le P_{s(max)}$. The scale factor value is fully defined by inspecting the three sum bits ($Z_2 Z_1 Z_0$). Accordingly, eight separate cases must be considered. In the following analysis, $N_{min}$ is set equal to $2^{k-1}$, rather than $(2^{k-1} + 1)$, in order to guarantee that the algorithm works for both odd and even moduli. Thus, the only restriction is that N has a 1 in the most significant bit position.
In the first four cases, we consider $Z_2 Z_1 Z_0 = XY0$, where the following condition is satisfied: $XY0 \le P_{k+1:k-1} \le XY1$, i.e., $Z_2 Z_1 = P_{k+1} P_k$, irrespective of the error value. In this case, the scale factor is the same as that computed in the CPA algorithm (Algorithm 7), irrespective of the values of X or Y. Thus, we have:
Z2Z1Z0=000; α=0;
Z2Z1Z0=110; α=0;
Z2Z1Z0=010; α=−2; and,
Z2Z1Z0=100; α=2;
In the next case, we consider $Z_2 Z_1 Z_0 = 111$. For maximum error, we may also consider $Z_2 Z_1 Z_0 = 111 + 001 = 000$. In either of these situations, we have $Z_2 Z_1 Z_0 \in \{111, 000\}$, and no scaling is required, i.e., α=0. In the form given above, $Z_2 Z_1 Z_0 = 111$, which implies that α=0.
In the sixth case, we consider $Z_2 Z_1 Z_0 = 001$. Taking the maximum error into consideration, $Z_2 Z_1 Z_0 \in \{001, 010\}$ and P is positive within the range of $2^{k-1} \le P \le 2^k + 2^{k-1} - 3$. Under these conditions, the scale factor is negative and must satisfy the following conditions (where α is a negative quantity):
$\mathrm{Max}(P) + \alpha N_{min} \le P_{s(max)}$; and (a)
$\mathrm{Min}(P) + \alpha N_{max} \ge P_{s(min)}$. (b)
The first condition can be rewritten as $\alpha N_{min} \le P_{s(max)} - \mathrm{Max}(P)$, which can further be rewritten as $\alpha N_{min} \le (2^k - 1) - (2^k + 2^{k-1} - 3) = -2^{k-1} + 2$. With $N_{min} = 2^{k-1}$, if we define δ as $2^{-k+2}$, then $\alpha \le -1 + \delta$, or $\alpha \le -1$.
The second condition can be rewritten as $\alpha N_{max} \ge P_{s(min)} - \mathrm{Min}(P)$, which can further be rewritten as $\alpha(2^k - 1) \ge -2^k - 2^{k-1} = -1.5 \times 2^k$; thus, we have $\alpha \ge -1.5$, or $\alpha \ge -1$. Accordingly, when $Z_2 Z_1 Z_0 = 001$, the scale factor limits are $-1 \ge \alpha \ge -1$, i.e., α=−1.
In the seventh case, we consider $Z_2 Z_1 Z_0 = 101$. Thus, taking the maximum error into consideration, $Z_2 Z_1 Z_0 \in \{101, 110\}$. P is negative with a value range of $-2^{k+1} + 2^{k-1} \le P \le -2^{k-1} - 3$. The scale factor, in this situation, is positive and must satisfy the following conditions:
$\mathrm{Max}(P) + \alpha N_{max} \le P_{s(max)}$; and (c)
$\mathrm{Min}(P) + \alpha N_{min} \ge P_{s(min)}$. (d)
The first condition, (c), can be rewritten as $\alpha N_{max} \le P_{s(max)} - \mathrm{Max}(P)$, which can further be rewritten as $\alpha N_{max} \le (2^k - 1) - (-2^{k-1} - 3) = 1.5 \times 2^k + 2$. Or, if we define δ as $3.5/(2^k - 1)$, then $\alpha \le 1.5 + \delta$, or $\alpha \le 1$ for k>3.
The second condition, (d), can be rewritten as $\alpha N_{min} \ge P_{s(min)} - \mathrm{Min}(P)$, so that $\alpha \cdot 2^{k-1} \ge -2^k - (-2^{k+1} + 2^{k-1}) = 2^{k-1}$. Thus, we have $\alpha \ge 1$. Accordingly, when $Z_2 Z_1 Z_0 = 101$, the scale factor limits are $1 \ge \alpha \ge 1$, i.e., α=1.
In the final case, we consider $Z_2 Z_1 Z_0 = 011$. This case may only occur if PS and PC are either both negative or both positive quantities. In this case, if the error ε=000, i.e., $P_{k+1} P_k = Z_2 Z_1 = 01$, then the required scale factor is α=−2. However, if the error ε=001, then P is a large negative value with $P_{k+1} P_k P_{k-1} = 100$, requiring a positive scale factor of α=2. This latter case (ε=001 and $Z_2 Z_1 Z_0 = 011$) may only occur if both PS and PC are negative quantities. This condition is easily detected by testing that either PS<1, PC<1, or the carry-out bit $Z_3$=1.
Table II (below) lists the derived values of the scale factor α for various combinations of Z2Z1Z0:
Operation of Algorithm 9 is similar to operation of Algorithm 7. The sum component and carry component, PS and PC, respectively, are initialized to 0 in (k+2)-bit long registers. The N-conjugate of the multiplicand, W, is computed in the same manner as in Algorithm 7, and the loop counter i is initialized to k−1. In the first step of the loop, the shifting step, both the PS and PC registers are shifted left by one bit.
In the next step of the loop, the addition step, the current bit of the multiplier (starting with the most significant bit) is tested to see if the bit is equal to one. To determine the sign of P, the three most significant bits of PS and PC are added using a carry propagate adder. The most significant bit of the sum indicates the sign of P. If $b_i$=1 and P is negative, then PS, PC and the multiplicand A are added using a carry-save adder, storing the sum component in PS and the carry component in PC. If P is positive, then PS, PC and W (the N-conjugate of the multiplicand A) are added using carry-save addition.
In the next step of the loop, the scaling step, the magnitude of the running product P, as represented by the sum component PS and carry component PC, is reduced by an appropriate scaling factor. The case step is used to determine the proper scaling factor by adding the k+1, k, and k−1 bits of PS to the corresponding bits of PC using carry propagate addition and comparing the result to the chart in Algorithm 9. The scaling factor, PS, and PC are added together using carry-save addition. The resulting partial sum and partial carry are passed back in the loop to be shifted (Algorithm 9, step b) after decrementing the loop index.
After the last iteration, the next step is the assimilation step in which P is computed by adding the PS and PC registers using carry propagate addition. The final step is the correction step. If the result is negative, then N is added to the result. Otherwise, if P≧N, then N is subtracted from P until P is less than N.
A moderately complex partial example will make operation of Algorithm 9 clear. It is desired to compute 14×83 (mod 100), so that A=14decimal=000001110, B=83=001010011, N=1100=001100100, and k=7. The size of the adders is k+2=9 bits. PS and PC are initialized to binary 000000000, W=14−100=−86=110101010 in 2's complement notation, and the counter is initialized to i=6.
On the first iteration through the loop, PS and PC remain zero after left shifting. Since the sixth bit of integer B is one (b6=1), and since P=0 (P is obtained by adding PS and PC using carry propagate addition), W is added to (PS,PC) so that PS=W, and PC=0 since there are no carry bits. Z2Z1Z0=110+000=110 (the k+1, k, and k−1 bits of PS are 110 and the k+1, k, and k−1 bits of PC are 000). By the chart, (PS,PC)=(PS,PC)+N, so that PS=111001110 and PC=001000000. The counter is decremented to i=5 and the loop reverts to the shift step.
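The two carry-save additions of this first iteration can be reproduced with the following Python sketch. It assumes a conventional 3-input carry-save adder (bitwise XOR for the sum word, bitwise majority shifted left by one for the carry word) truncated to 9 bits; the function name carry_save_add is an illustrative assumption.

```python
WIDTH = 9
MASK = (1 << WIDTH) - 1


def carry_save_add(x, y, z):
    """Add three WIDTH-bit words; return (sum word, carry word).

    Assumption: any carry out of the top bit is discarded by the mask.
    """
    s = (x ^ y ^ z) & MASK                           # per-bit sum, no carry propagation
    c = (((x & y) | (x & z) | (y & z)) << 1) & MASK  # per-bit carries, moved up one place
    return s, c


W = 0b110101010        # N-conjugate of A, -86 in 9-bit 2's complement
N = 0b001100100        # modulus, 100

# Addition step: PS = PC = 0 after the shift, so adding W gives PS = W, PC = 0.
ps, pc = carry_save_add(0, 0, W)
after_add = (ps, pc)

# Scaling step for Z2Z1Z0 = 110: add N to (PS, PC).
ps, pc = carry_save_add(ps, pc, N)
after_scale = (ps, pc)
```

Under these assumptions the sketch yields PS=111001110 and PC=001000000 after the scaling step, matching the values given above for the first iteration.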
Upon shifting left by one bit, PS=110011100 and PC=010000000. In the add step, b5=0, so that no addition occurs. Z2Z1Z0=110+010=000, so that the scaling factor is zero and no scaling occurs. The counter is decremented to i=4, and program flow moves to the shift step.
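The shift and the 3-bit estimate Z2Z1Z0 for the first two iterations can be checked with the sketch below. The function names shift_left, top3 and estimate are illustrative assumptions; the estimate is simply the sum of bits k+1, k and k−1 of PS and PC taken modulo 8, as a 3-bit carry propagate adder would produce.

```python
K = 7
WIDTH = K + 2
MASK = (1 << WIDTH) - 1


def shift_left(word):
    """Shift a WIDTH-bit register left by one bit, dropping the bit shifted out."""
    return (word << 1) & MASK


def top3(word):
    """Return bits k+1, k and k-1 of a register as a 3-bit value."""
    return (word >> (K - 1)) & 0b111


def estimate(ps, pc):
    """Z2Z1Z0: 3-bit carry propagate sum of the top bits of PS and PC."""
    return (top3(ps) + top3(pc)) & 0b111


# First iteration, before scaling: PS = W, PC = 0.
z_first = estimate(0b110101010, 0b000000000)

# Second iteration: shift the scaled registers, then estimate again.
ps = shift_left(0b111001110)
pc = shift_left(0b001000000)
z_second = estimate(ps, pc)
```

Here z_first evaluates to binary 110 and z_second to binary 000, in agreement with the values of Z2Z1Z0 stated for the first and second iterations.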
Upon left shifting by one bit, PS=100111000 and PC=100000000. Since b4=1 and the sign of P is positive (the sign of P is obtained by adding the k+1, k, and k−1 bits of PS and PC), W is added to (PS,PC), giving PS=010010010 and PC=101010000. Z2Z1Z0=010+101=111, so that the scaling factor is zero and no reduction is needed. The counter is decremented to i=3, and the loop continues in the same fashion through the remaining bits of the multiplier B. Assimilation and correction produce the final result, 14×83 (mod 100)=62.
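The overall flow (shift, conditional addition of A or W according to the sign of P, scaling, and final correction) can be illustrated at the integer level with the following Python sketch. It is a simplification under stated assumptions, not Algorithm 9 itself: P is held as a single signed integer rather than as a carry-save pair, and the scaling loop is an assumed stand-in for the 3-bit chart. Its intermediate values therefore need not match the register contents listed above; only the final result is expected to agree. The function name modmul_sketch is hypothetical.

```python
def modmul_sketch(a, b, n, k):
    """Return a*b mod n using MSB-first interleaved shift/add/scale.

    Assumes 0 <= a < n and 0 <= b < 2**k.
    """
    w = a - n                          # N-conjugate of the multiplicand
    p = 0
    for i in range(k - 1, -1, -1):     # i = k-1 down to 0
        p *= 2                         # shifting step
        if (b >> i) & 1:               # addition step, only when b_i = 1
            p += a if p < 0 else w     # add A if P is negative, otherwise W
        while p >= n:                  # scaling step (assumed stand-in for the chart)
            p -= n
        while p <= -n:
            p += n
    if p < 0:                          # correction step
        p += n
    while p >= n:
        p -= n
    return p


result = modmul_sketch(14, 83, 100, 7)
```

Because W=A−N is congruent to A modulo N, and scaling only adds or subtracts multiples of N, every step preserves the value of the running product modulo N. For the example operands the sketch returns 62.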
It should be noted that whereas Montgomery's algorithm works only for odd moduli, Algorithms 7 and 9 work for both odd and even moduli. Further, the CSA algorithm (Algorithm 9) requires 3-bit carry propagate adders (CPAs) in order to determine the sign of P as required by step (c), and to determine the value of Z2Z1Z0 used in the scaling step (d).
Table III (below) shows that, at most, two additions may be required during the correction step (Algorithm 9, step f) to get the final result under extreme values of P and N. More specifically, Table III illustrates the following:
(a) If the assimilated value of P (Algorithm 9, step e) is positive, up to one subtraction operation may be required;
(b) If the assimilated value of P (Algorithm 9, step e) is negative, up to two addition operations may be required;
(c) For the case of Z2Z1Z0=110, the bottom two rows of Table III show that even though the derived correction factor value of α=0 would properly scale the running product P, a correction factor of α=1 is preferred, since a following correction step would require only up to one addition as compared to two additions for α=0.
With minor modification, Algorithm 9 can also be made to work as a multiplier-divider, which computes (A×B/N), yielding both the quotient Q and the remainder P, such that A×B=Q×N+P, where |P|<N. This modification is shown in Algorithm 10, as follows:
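One way to see how a quotient can be accumulated alongside the remainder is the integer-level Python sketch below. It is not Algorithm 10; it is an illustration under assumptions, holding P as a single signed integer and using an assumed scaling loop in place of the 3-bit chart. The function name muldiv_sketch and the bookkeeping of Q are hypothetical. Q is doubled on each shift, incremented whenever W=A−N is added in place of A, and adjusted by one for each multiple of N removed or added during scaling.

```python
def muldiv_sketch(a, b, n, k):
    """Return (q, p) with a*b == q*n + p and |p| < n.

    Assumes 0 <= a < n and 0 <= b < 2**k.
    """
    w = a - n                      # N-conjugate of the multiplicand
    p = 0
    q = 0
    for i in range(k - 1, -1, -1):
        p *= 2                     # shifting step applies to both P and Q
        q *= 2
        if (b >> i) & 1:
            if p < 0:
                p += a
            else:
                p += w             # A = W + N, so one multiple of N moves into Q
                q += 1
        while p >= n:              # scaling (assumed stand-in for the chart)
            p -= n
            q += 1
        while p <= -n:
            p += n
            q -= 1
    return q, p


q, p = muldiv_sketch(14, 83, 100, 7)
```

For the operands of the earlier example this sketch gives Q=12 and P=−38, and 12×100−38=1162=14×83. The remainder is left signed here, consistent with the condition |P|<N.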
Thus, CSA 118 performs the shift and add operations (steps b and c, respectively, in Algorithm 9), i.e., it computes 2PS+2PC+bi×AW, where AW is chosen to be either the multiplicand A, its conjugate W, or zero. The value of AW is chosen based on the value of bi (the ith bit of B) and sign of the previously computed value of P (Q2 in
The sign (Q2) of the product P, which decides whether A or its N-conjugate W is to be used in the add step (step c of Algorithm 9), is computed by the top 3-bit CLA 116 after the product has been scaled to fit into k bits. Table IV (below) shows the possible values of the output sum bits of the top 3-bit CLA 116 (Q2Q1Q0) and the corresponding sign of the product P. It is clear that Q2 may be used to determine the sign of P. The bottom 3-bit CLA 124 computes Z2Z1Z0, which is needed for the scaling step and is input to multiplexer 112 so that the proper scaling factor is supplied to CSA 114.
It should be noted that multiplexers 110, 112 are provided with enable control to allow for all zero outputs. Further, to avoid pre-computation and storage of the scaling value (−N) and, accordingly, (−2N),
Contrary to Montgomery's algorithm, where N-residues of both A and B need to be pre-computed, the only quantity that needs to be pre-computed in Algorithms 7 or 9 is W=A−N, which is much simpler than the N-residue computation. It should be noted that the N-residue of x is defined as
In the embodiment of
The following Table V illustrates the delay of the modular multiplication of Algorithms 7 and 9 using the CPA and CSA methodologies, as described above:
It is to be understood that the present invention is not limited to the embodiments described above, but encompasses any and all embodiments within the scope of the following claims.