The disclosure pertains to cryptographic computing applications, more specifically to improving efficiency of cryptographic operations with a cryptographic engine capable of parallel and streaming computations.
The present disclosure will be understood more fully from the detailed description given below and from the accompanying drawings of various implementations of the disclosure.
Aspects of the present disclosure are directed to hardware cryptographic engines for improving computational efficiency and memory utilization in cryptographic operations that include, but are not limited to, public-key cryptography applications. More specifically, aspects of the present disclosure are directed to multi-lane cryptographic engines for efficient parallel and streaming processing of public key and private key operations, key generation, modular multiplication, Montgomery multiplication, modular inversion, Jacobi symbol computation, elliptic curve cryptographic operations, and numerous other cryptographic applications.
Various cryptographic applications may involve operations that are efficiently performed by offloading them from a main processor to a dedicated cryptographic engine (accelerator) that includes hardware circuits designed to improve speed and efficiency of arithmetic operations (multiplication, division, addition, etc.) and memory accesses. For example, in Rivest-Shamir-Adleman (RSA) public key/private key applications, large prime numbers p and q may be selected to generate a pair of a public (encryption) exponent e and a secret (decryption) exponent d such that e and d are inverses of each other modulo a certain number (e.g., modulo (p−1)·(q−1) or the least common multiple of p−1 and q−1). The numbers e and N=p·q are revealed as part of the public key while p, q, and d are stored in secret as parts of the private key. A message m may be encrypted into a ciphertext c using modular exponentiation, $c=m^{e} \bmod N$, and can be deciphered using another modular exponentiation based on the private exponent d, $m=c^{d} \bmod N$. To prevent unauthorized actors from recovering the private exponent d, the prime factors p and q are typically selected to be large numbers, e.g., 1024-bit numbers.
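By way of a non-limiting illustration, the RSA relations described above may be sketched in software as follows. The toy primes, the exponent e=17, and the helper names `encrypt` and `decrypt` are assumptions chosen for readability and are not taken from the disclosure; practical moduli are far larger.

```python
from math import gcd

# Toy parameters for illustration only; real primes are ~1024 bits or more.
p, q = 61, 53
N = p * q                                        # public modulus
lam = (p - 1) * (q - 1) // gcd(p - 1, q - 1)     # least common multiple of p-1 and q-1
e = 17                                           # public exponent (assumed value)
assert gcd(e, lam) == 1
d = pow(e, -1, lam)                              # private exponent: inverse of e modulo lam

def encrypt(m):
    """c = m^e mod N"""
    return pow(m, e, N)

def decrypt(c):
    """m = c^d mod N"""
    return pow(c, d, N)
```

Each call to `pow` with three arguments is a modular exponentiation, i.e., a long chain of modular multiplications of the kind a cryptographic engine is intended to accelerate.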
Some applications use elliptic curve cryptography that involves operations with points (x,y) on an elliptic curve, e.g., an elliptic Weierstrass curve, $y^{2}=x^{3}+ax+b$. Arithmetic operations (such as addition, doubling, and infinity operations) are defined via a set of geometric rules; e.g., a sum of three points on an elliptic curve is zero, P1+P2+P3=0, if the points P1, P2, P3 are located at the intersection of the elliptic curve with a straight line. The strength of elliptic curve cryptography is based on the fact that for large values of k, a product Q=P·k can be practically anywhere on the elliptic curve. As a result, the inverse operation of determining an unknown value k (e.g., a private key) from a known public value Q can be a prohibitively difficult computational operation. In elliptic curve cryptography, it is typically sufficient to use numbers that are much smaller (e.g., 256-bit numbers) than numbers used in RSA applications.
Decryption and encryption operations often require a large number of arithmetic operations to be performed, which may take many clock cycles, especially when performed on low-bit microprocessors, such as smart card readers, wireless sensor nodes, and so on. Cryptographic engines (accelerators, co-processors) are specially designed circuits that execute specialized computationally intensive cryptographic operations more efficiently than a general purpose processor (e.g., CPU). Because in many applications (including network and cloud applications) cryptographic operations may constitute a significant portion of the total computational load, small and efficient cryptographic engines are highly desired.
Described in the instant disclosure are cryptographic engines that allow a high degree of parallelism during performance of cryptographic computations. In some implementations, a cryptographic engine may include at least four multiplication circuits capable of operating synchronously on different inputs (e.g., different multiplicands and different multipliers) or streaming inputs. For example, a multiplier or multiplicand of a multiplication performed by a particular circuit may have previously been used in multiplication operations performed by a preceding circuit (such that consecutive circuits compute increasingly more significant bits of the product). The cryptographic engine may further have two or more addition circuits similarly capable of operating synchronously with each other. The addition circuits may receive inputs from other addition circuits and/or multiplication circuits and may further provide outputs as inputs to any of the multiplication circuits. The cryptographic engine may include two or more memory devices, such as random access memory (RAM) units that permit one read or one write operation per cycle, scratchpad (SP) memory units that permit one read and one write operation per cycle, flip-flop memory, and the like. In some implementations, a co-processor may facilitate efficient performance of inverse multiplication, Jacobi symbol computations, and the like, by performing operations that are not reduced to multiplications and/or additions.
The disclosed cryptographic engine may be used for a wide range of cryptographic operations. Each multiplication and each addition circuit may, at a given cycle, process an N-bit operand. The size of the operand may be different in different implementations. For the sake of specificity, implementations disclosed herein will sometimes be illustrated using an example cryptographic accelerator that operates on N=64 bit operands, but it should be understood that circuits configured to process operands of any other size (e.g., N=8, 16, 32, 128, etc.) may also be used. During a cycle of computations, a word (to be understood as a group of, e.g., N bits) of a multiplier and a word of a multiplicand may be processed by one of the multiplication circuits. A low N-bit word of the output may be stored (e.g., in a SP memory) as the accumulator value and a high N-bit word may be stored as a carry (e.g., in a buffer, such as a flip-flop memory device). The accumulator and the carry may subsequently be used during processing of other words of the multiplier and multiplicand (some of which may be processed by the same multiplication circuit while other words may be processed by other circuits).
Multiplication (and addition) operations performed by the circuits of the cryptographic engine may be modular operations defined on a ring of p elements (e.g., elements belonging to the interval of integers [0, p−1]). Reduction modulo p may be performed by the circuits subsequently to the performance of multiplication. Because calculations modulo p require finding a remainder of a (computationally heavy) division operation, in some implementations a Montgomery reduction may be used. To find AB mod p, the multiplier A and the multiplicand B can first be transformed into the Montgomery domain, A mod p→Ā=AR mod p, B mod p→
The system architecture 100 may further include an input/output (I/O) interface 104 to facilitate connection of the computer system 102 to peripheral hardware devices 106 such as card readers, terminals, printers, scanners, internet-of-things devices, and the like. The system architecture 100 may further include a network interface 108 to facilitate connection to a variety of networks (Internet, wireless local area networks (WLAN), personal area networks (PAN), public networks, private networks, etc.), and may include a radio front end module and other devices (amplifiers, digital-to-analog and analog-to-digital converters, dedicated logic units, etc.) to implement data transfer to/from the computer system 102. Various hardware components of the computer system 102 may be connected via a system bus 112 that may include its own logic circuits, e.g., a bus interface logic unit (not shown).
The computer system 102 may support one or more cryptographic applications 110-n, such as an embedded cryptographic application 110-1 and/or external cryptographic application 110-2. The cryptographic applications 110-n may be secure authentication applications, encrypting applications, decrypting applications, secure storage applications, and so on. The external cryptographic application 110-2 may be instantiated on the same computer system 102, e.g., by an operating system executed by the processor 120 and residing in the memory device 130. Alternatively, the external cryptographic application 110-2 may be instantiated by a guest operating system supported by a virtual machine monitor (hypervisor) executed by the processor 120. In some implementations, the external cryptographic application 110-2 may reside on a remote access client device or a remote server (not shown), with the computer system 102 providing cryptographic support for the client device and/or the remote server.
The processor 120 may include one or more processor cores having access to a single or multi-level cache and one or more hardware registers. In implementations, each processor core may execute instructions to run a number of hardware threads, also known as logical processors. Various logical processors (or processor cores) may be assigned to one or more cryptographic applications 110, although more than one processor core (or a logical processor) may be assigned to a single cryptographic application for parallel processing. A multi-core processor 120 may simultaneously execute multiple instructions. A single core processor 120 may typically execute one instruction at a time (or process a single pipeline of instructions). The processor 120 may be implemented as a single integrated circuit, two or more integrated circuits, or may be a component of a multi-chip module.
The memory device 130 may refer to a volatile or non-volatile memory and may include a read-only memory (ROM) 132, a random-access memory (RAM) 134, high-speed cache 136, as well as (not shown) electrically erasable programmable read-only memory (EEPROM), flash memory, flip-flop memory, or any other device capable of storing data. The RAM 134 may be a dynamic random-access memory (DRAM), synchronous DRAM (SDRAM), a static memory, such as static random-access memory (SRAM), and the like. Some of the cache 136 may be implemented as part of the hardware registers of the processor 120. In some implementations, the processor 120 and the memory device 130 may be implemented as a single field-programmable gate array (FPGA).
The computer system 102 may include a cryptographic engine 200 for fast and efficient performance of cryptographic computations. Cryptographic engine 200 may include processing and memory components, as described in more detail below. Cryptographic engine 200 may perform authentication of applications, users, and access requests in association with operations of the cryptographic applications 110-n or any other applications operating on or in conjunction with the computer system 102. Cryptographic engine 200 may further perform encryption and decryption of secret data.
Cryptographic engine 200 may further include a number of memory circuits, such as static random-access memory (SRAM), e.g., SRAM 240-1 and 240-2, and scratchpad memory (SP), such as 242-1, 242-2, and 242-3. Even though two SRAM and three SP are shown in
Each of the MUL units 220-n and ADD units 230-n may be a circuit that operates on N-bit words (e.g., 64-bit inputs or inputs of any other suitable size) and may have at least two inputs (indicated by horizontal arrows). Additionally, MUL units 220-1 . . . 220-3 may stream outputs (as well as inputs, in some instances) of multiplications performed by the respective circuits as inputs into subsequent MUL units 220-2 . . . 220-4 (as depicted by the downward arrows). For example, an output of MUL unit 220-1 may be provided to the next MUL unit 220-2 or to the ALU bus 232. From ALU bus 232 the outputs of multiplications may be delivered to any of the ADD units 230-n (or buffer 234) or any of the memory circuits (SRAM 240-n or SP 242-n). In some instances, when an addition operation involves a number that is not an output of a previous multiplication operation, an input into an addition operation may be delivered via bus 244 from one of the memory circuits 240-n or 242-n (as depicted by the upward arrow between bus 244 and ALU bus 232).
An additional ALU support unit 260 may include circuits that perform operations different from multiplications or additions. ALU support unit 260 may include a read-only memory (ROM) 262, which may store constants (such as modulus p, Montgomery radix R, numbers p′ and $R^{-1} \bmod p$, and various other auxiliary numbers, such as powers of radix R, e.g., $R^{2} \bmod p$ or modulo some other suitable modulus), various instructions for control unit 250, and so on. ALU support unit 260 may further include a random number generator (RNG) 264 for generation of random (or pseudorandom) numbers, an XOR unit 266 for performing XOR operations, a shift unit 268 to perform bit shifting and bit masking, a compare unit 270 to perform comparison of input numbers, a copy unit 272 for copying numbers, and an arithmetic-to-Boolean and/or Boolean-to-arithmetic conversion (A2B/B2A) unit 274. The A2B/B2A unit 274 may be used for handling keys and other secret data that is stored in masked Boolean or masked arithmetic form (e.g., as a plurality of randomized values whose Boolean or arithmetic sum, difference, etc. represents a secret value). For example, A2B/B2A unit 274 may convert data stored in a Boolean-masked form to an arithmetic-masked form (if a cryptographic application is configured to process data in the latter form), and/or vice versa. ALU support unit 260 may also include other auxiliary units (circuits) performing various functions that may be used in operations of cryptographic engine 200.
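For illustration, the role of the stored Montgomery constants may be sketched in software as follows. This is a minimal sketch under stated assumptions: p′ is taken to be the commonly used value −p^(−1) mod R (the disclosure does not define it in this passage), the modulus 2^255−19 and radix R=2^256 are arbitrary example values, and all function names are hypothetical.

```python
def montgomery_constants(p, r_bits):
    """Derive constants of the kind that could be kept in ROM (assumed definitions)."""
    R = 1 << r_bits
    assert p % 2 == 1 and p < R
    p_prime = (-pow(p, -1, R)) % R      # assumed convention: p' = -p^(-1) mod R
    r2 = (R * R) % p                    # R^2 mod p, used to enter the Montgomery domain
    return R, p_prime, r2

P = 2**255 - 19                         # example modulus, not from the disclosure
R_BITS = 256
R, P_PRIME, R2 = montgomery_constants(P, R_BITS)
R_MASK = R - 1

def redc(t):
    """Montgomery reduction: returns t * R^(-1) mod P for 0 <= t < R * P."""
    m = ((t & R_MASK) * P_PRIME) & R_MASK   # makes t + m*P divisible by R
    u = (t + m * P) >> R_BITS               # exact division by R via a shift
    return u - P if u >= P else u

def to_montgomery(a):
    return redc((a % P) * R2)               # a * R mod P

def from_montgomery(a_bar):
    return redc(a_bar)                      # a_bar * R^(-1) mod P

def montgomery_multiply(a_bar, b_bar):
    return redc(a_bar * b_bar)              # (a*b) * R mod P
```

The reduction uses only multiplications, additions, masking, and a shift, which is why it maps naturally onto multiplication and addition circuits rather than onto a divider.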
is, generally, a 2N-word number A=A7 . . . . A0. MUL units 220-n may be configured to perform multiplication on N-bit input numbers, e.g., 128-bit words of integer data or 256-bit words of integer data.
During cycle 1, MUL unit 220-1 may receive the low (least significant) word of multiplier X0, and the low word of multiplicand, Y0, and compute the product X0·Y0. The low word of X0·Y0 represents the low word A0 of the product A and may be stored in one of memory circuits of the cryptographic engine 200 (or in an outside memory device). The high word of the product X0·Y0 may be stored in MUL unit 220-1 (e.g., in a flip-flop memory buffer associated with MUL unit 220-1) as a carry C into the operations of the next cycle. During (or prior to) cycle 2, MUL unit 220-1 may provide carry C and the low word of the multiplicand Y0 to MUL unit 220-2, load the next word of the multiplicand Y1 from memory, and multiply the previously loaded low word of the multiplier X0 by the new word of the multiplicand Y1. MUL unit 220-1 may then compute X0·Y1, buffer a new carry (the high word of X0·Y1) until the next cycle, and provide the accumulator value (the low word of X0·Y1) to MUL unit 220-2. The following notations are used in
During cycle 2, MUL unit 220-2 may load the next word of the multiplier X1 from the memory circuits, receive the low word of the multiplicand Y0 from MUL unit 220-1 (as well as the respective carry), as depicted schematically with the dashed arrow, and may further receive the accumulator value computed by MUL unit 220-1 during the same cycle 2. MUL unit 220-2 may add the received carry and the accumulator to the product X1·Y0. MUL unit 220-2 may buffer the high word of the result as a carry (to be passed on to MUL unit 220-3 in cycle 3), and may store the low word A1 as the next word of the product A. An addition unit, e.g., adder circuit 235 (or some other addition unit) may perform the addition operations described herein. In some implementations, the addition unit may be a multi-way addition circuit capable of adding more than two numbers per cycle; e.g., the addition unit may be capable of adding X1·Y0+carry+accumulator value in one operation. In some implementations, the addition unit may be configured to perform multiple consecutive additions of two numbers over one cycle (e.g., obtaining a first sum X1·Y0+carry during the first operation and then adding the accumulator value to the first sum during the second operation).
Similar computations may be performed in subsequent cycles. In cycle k, MUL 220-1 passes the multiplicand word Yk−2 (loaded during cycle k−1) and the carry (computed during the cycle k−1) to MUL 220-2 and loads the next multiplicand word Yk−1. Similarly, other multiplication units pass previously processed multiplicand words (and computed carries) to the next multiplication units. In addition, during cycle k≤M, MUL 220-k loads the multiplier word Xk−1 from memory and multiplies it by Y0. During cycle k, different multiplication units compute products Xj·Yk−j−1 with different j. Accumulator values are passed from the respective multiplication units to the accumulator unit, e.g., adder circuit 235.
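By way of a non-limiting illustration, the word-by-word data flow described above (a product of two words added to a received carry and accumulator value, with the low word kept as the new accumulator value and the high word kept as the new carry) may be modeled in software as follows. The sketch models the arithmetic only and not the cycle timing: the outer loop runs sequentially here, whereas in the engine each multiplier word would be handled by a different MUL unit in staggered cycles. The 64-bit word size and the names `split_words`, `join_words`, and `multiply_words` are assumptions.

```python
WORD_BITS = 64                          # assumed word size
MASK = (1 << WORD_BITS) - 1

def split_words(value, count):
    """Split an integer into `count` words, least significant word first."""
    return [(value >> (WORD_BITS * i)) & MASK for i in range(count)]

def join_words(words):
    """Inverse of split_words."""
    return sum(w << (WORD_BITS * i) for i, w in enumerate(words))

def multiply_words(x_words, y_words):
    """Return the words of the product X*Y, least significant word first."""
    product = [0] * (len(x_words) + len(y_words))
    for j, xj in enumerate(x_words):            # row j: the unit holding multiplier word Xj
        carry = 0                               # high word buffered between steps
        for i, yi in enumerate(y_words):        # multiplicand words stream through the row
            t = xj * yi + product[i + j] + carry
            product[i + j] = t & MASK           # low word: accumulator value
            carry = t >> WORD_BITS              # high word: carry into the next step
        product[j + len(y_words)] = carry       # last carry of the row
    return product
```

The intermediate value `t` never exceeds two words, since (2^64−1)^2 + 2·(2^64−1) = 2^128−1, which is why a one-word accumulator and a one-word carry suffice in this single-word variant.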
At the end of cycle k≤M, the word Ak−1 of the product A is determined (and stored in one of the memory circuits). At the end of cycle k=M+1, the low word of the result of multiplication XM−1·Y1 (plus received carry and accumulator value) is passed onto adder circuit 235, depicted via a shaded box, which adds the carry from the last block of cycle M (as depicted by the top dotted line). The low word of the sum represents the word AM (e.g., A4, as depicted) of the final product A and is stored in one of the memory circuits (e.g., together with previously computed words Aj). The high word of the sum is retained in the adder (as depicted by the downward dotted arrow). At the end of each subsequent cycle, the adder adds a new carry (broken dotted line) and a new accumulator (solid arrow) to the previously stored high word, identifies the new low word as the next Aj of the final product A, buffers the new carry, and so on. At the end of the last cycle k=2M−1 (after computing the last multiplication XM−1·YM−1) both the high word and the low word of the last addition operation are stored as the last two words of the final product, A2M−1A2M−2 (e.g., A7A6). As depicted in
Each of MUL units 220-1 . . . 220-4 may include any number of processing elements (circuits), such as multiplication elements, addition elements, accumulator buffers, carry buffers, and the like. In some implementations, words X0, X1, . . . of the multiplier and words Y0, Y1, . . . of the multiplicand may be processed by respective MUL units 220-n in a systolic way. More specifically, each word may be subdivided into two or more sub-words and processed sequentially by two or more processing elements. For example, a 256-bit multiplier word X0 may be subdivided into four 64-bit sub-words x00, x01, x02, and x03 with a first processing element of MUL unit 220-1 processing (e.g., multiplying, buffering, passing, and adding carry and accumulator values) the first sub-word x00, a second processing element of MUL unit 220-1 processing the second sub-word x01, and so on. Similarly, a 256-bit multiplicand word Y0 may be subdivided into four 64-bit sub-words y00, y01, y02, and y03 with a first processing element of MUL unit 220-1 processing the first sub-word y00 during a first part of cycle 1 (in the notations of
Multiplication (and addition) units performing multiplication operations 201 illustrated in
During cycle 2, MUL unit 220-1 may provide carry C and the two low words of the multiplicand Y1Y0 to MUL unit 220-2, load the next two words of the multiplicand Y3Y2, and multiply the previously loaded low word of the multiplier X0 by the new words of the multiplicand Y3Y2. MUL unit 220-1 may then compute X0·Y3Y2, buffer a new carry (the high two words of X0·Y3Y2) until the next cycle, and provide the accumulator value (the low word of X0·Y3Y2) to MUL unit 220-2 (as indicated by the solid arrow). Additionally, during the same cycle 2, MUL unit 220-2 may load the next word of the multiplier X1 from one of the memory circuits and receive the low two words of the multiplicand Y1Y0 from MUL unit 220-1 (as well as the respective carry), as depicted schematically with the dashed arrow. MUL unit 220-2 may further receive the accumulator value computed by MUL unit 220-1 during the same cycle 2. MUL unit 220-2 may add the received two-word carry and the one-word accumulator to the product X1·Y1Y0. MUL unit 220-2 may buffer the two high words of the obtained result as a next carry (to be passed on to MUL unit 220-3 in cycle 3), and may store the low word A1 as the next word of the product A.
Similar streaming computations may be performed in subsequent cycles, as depicted. In cycle k, MUL unit 220-1 passes the two multiplicand words Y2k−3Y2k−4 (loaded during cycle k−1) and the two-word carry (computed during cycle k−1) to MUL unit 220-2 and loads the next two multiplicand words Y2k−1Y2k−2. Similarly, other multiplication units pass previously processed multiplicand words (and computed carries) to the next multiplication units. In addition, during cycle k≤M, MUL unit 220-k loads the multiplier word Xk−1 from memory and multiplies it by Y1Y0. During cycle k, products Xj·Y2k−2j−1Y2k−2j−2 with different j are computed by different multiplication units. At the end of cycle k≤M, the word Ak−1 of the product A is determined (and stored in one of the memory circuits). At the end of cycle k=M+1, the low word of the result of multiplication XM−1·Y3Y2 (plus the received carry and accumulator value) is passed onto an adder circuit 235 (depicted via a shaded box), which adds the carry from the last block of cycle M (as depicted by the top dotted line). The adder circuit 235 may be a processing sub-unit that is internal to MUL unit 220-4 (or some other MUL unit). The low two words of the sum represent the words AM+1AM (e.g., A5A4 as depicted) of the final product A and are stored in one of the memory circuits (e.g., together with previously computed words Aj). The high word of the sum is retained in the adder (the vertical dotted arrow). At the end of each subsequent cycle, the adder adds a new two-word carry (broken dotted line) and a new one-word accumulator (solid arrow) to the previously stored high word, identifies the two new low words as the next two words of the final product A, and so on. After cycle M+1 (after computing the last multiplication XM−1·YM−1YM−2) both the high word and the low word of the last addition operation are stored as the last two words of the final product, A2M−1A2M−2 (e.g., A7A6). Similarly to
In the example illustrated in
Referring back to
In some instances, modular (or Montgomery) reduction may be performed by the same multiplication unit that computes the original product, e.g., when a special prime modulus p is being used, such as one of Solinas primes (e.g., $p=2^{192}-2^{64}-1$, $p=2^{384}-2^{128}-2^{96}+2^{32}-1$), Mersenne primes, Crandall primes, and other simple primes.
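The reason such primes permit inexpensive reduction may be illustrated with the following sketch for p=2^192−2^64−1. Because 2^192 is congruent to 2^64+1 modulo p, the part of a number above bit 192 can be folded back using shifts and additions only. The folding loop shown is one simple way to exploit this congruence and is an assumption of this example, not the specific circuit of the disclosure; the function name is hypothetical.

```python
P192 = 2**192 - 2**64 - 1
LOW_MASK = (1 << 192) - 1

def reduce_p192(x):
    """Reduce a non-negative integer modulo P192 without a division."""
    while x >> 192:
        hi = x >> 192                    # part above bit 192
        lo = x & LOW_MASK                # low 192 bits
        x = lo + hi + (hi << 64)         # uses 2^192 = 2^64 + 1 (mod P192)
    while x >= P192:                     # x < 2^192 here, so at most one subtraction
        x -= P192
    return x
```

Each pass strictly shrinks the operand, so a double-width product needs only a few folds followed by a conditional subtraction.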
In some implementations, the cryptographic engine 200 is used for elliptic curve cryptographic (ECC) computations with Weierstrass curves, Brainpool curves, NIST curves, etc. ECC computations may involve multiplying a number represented by a base point P on an elliptic curve by a large number k (e.g., a private key). Finding the product P·k may be performed efficiently using one of the available ladder algorithms, such as the Montgomery ladder algorithm, the double-and-add algorithm, the Joye ladder algorithm, windowed algorithms, non-adjacent form algorithms, or any other suitable algorithms. These algorithms are executed by performing a number (of the order of $\log_2 k$) of iterations, keeping track of working points, e.g., X1 and X2, and defining a set of rules, conditional upon the value of the next bit of the key k, that manipulate the working points. The manipulations may include (depending on a specific algorithm being used) one or more of: adding the working points X1 and X2, doubling one of the working points X1 or X2 while keeping the other working point intact, doubling one of the working points and then adding the other working point, etc., until (at the completion of the algorithm) one of the working points provides a representation of the target product P·k.
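As a non-limiting illustration of one such ladder, a Montgomery ladder over a toy Weierstrass curve may be sketched as follows. The curve y^2=x^3+2x+3 over the 97-element prime field, the base point (3, 6), affine coordinates, and all function names are assumptions made for readability; the point at infinity is represented by `None`. A practical implementation would use a standardized curve, projective coordinates, and constant-time conditional swaps.

```python
P_MOD, A, B = 97, 2, 3      # toy curve y^2 = x^3 + A*x + B over GF(97), illustration only
BASE = (3, 6)               # on the curve: 6^2 = 36 = 3^3 + 2*3 + 3

def add(p1, p2):
    """Affine point addition; None stands for the point at infinity."""
    if p1 is None:
        return p2
    if p2 is None:
        return p1
    (x1, y1), (x2, y2) = p1, p2
    if x1 == x2 and (y1 + y2) % P_MOD == 0:
        return None                                           # p2 is the negative of p1
    if p1 == p2:
        lam = (3 * x1 * x1 + A) * pow((2 * y1) % P_MOD, -1, P_MOD)   # tangent slope
    else:
        lam = (y2 - y1) * pow((x2 - x1) % P_MOD, -1, P_MOD)          # chord slope
    lam %= P_MOD
    x3 = (lam * lam - x1 - x2) % P_MOD
    y3 = (lam * (x1 - x3) - y1) % P_MOD
    return (x3, y3)

def montgomery_ladder(k, point):
    """Compute point*k; the two working points always differ by `point`."""
    r0, r1 = None, point
    for bit in bin(k)[2:]:                  # key bits, most significant first
        if bit == "0":
            r1 = add(r0, r1)
            r0 = add(r0, r0)
        else:
            r0 = add(r0, r1)
            r1 = add(r1, r1)
    return r0

def naive_multiply(k, point):
    """Reference result obtained by k repeated additions."""
    result = None
    for _ in range(k):
        result = add(result, point)
    return result
```

Every ladder step performs one addition and one doubling regardless of the key bit, and each of those expands into several modular multiplications and additions that may be distributed across multiplication and addition circuits.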
Each ladder algorithm may specify how coordinates of the working points change with each ladder step. In various implementations, coordinates can be scaled Jacobi coordinates. In some implementations, the algorithms track one or more auxiliary variables, such as a slope of the line associated with one or more of the working points, and so on. Each step may involve a number of operations (multiplications and additions) to update all (e.g., four or five) values being tracked. Cryptographic engine 200 of
The cryptographic engine 200 may also be used for modular inversion, namely for computing an inverse of one number x modulo another number y: $z=x^{-1} \bmod y$. The inverse number z multiplied by x equals 1, up to an integer multiple of y: z·x=1+s·y. According to the extended Euclidean algorithm, a two-component vector made of x and y may be expressed via a 2×2 matrix $\hat{M}$ whose determinant is −1:
The off-diagonal element of the matrix then gives the target inverse number, $\hat{M}_{12}=x^{-1} \bmod y$. The matrix $\hat{M}$ may be determined iteratively, by dividing y by x and identifying the quotient q0 and the remainder x1,
which may be, equivalently, expressed in matrix form:
via step matrix $\hat{M}_1$. The process is continued by further dividing x by x1 and finding a new quotient and a new remainder, so that during the j-th iteration the quotient $q_{j-1}$ and the remainder $x_j$ satisfy $x_{j-2}=q_{j-1}\cdot x_{j-1}+x_j$, or in matrix form (with $x_0\equiv x$ and $x_{-1}\equiv y$),
The iterations stop when during a final (n-th) iteration it is determined that xn−2 is divisible by xn−1 (xn−2=qn−1·xn−1); the inverse number is then given by the off-diagonal matrix element of the product of all identified step matrices:
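One possible concrete form of this iteration is sketched below. The particular step matrix [[0, 1], [1, −q]], which maps the pair (x_{j−2}, x_{j−1}) to (x_{j−1}, x_j), is an assumed convention chosen so that the example is self-consistent; the matrices in the original equations may be arranged differently. Function names are hypothetical.

```python
def mat_mul(m, t):
    """Product of two 2x2 integer matrices given as nested lists."""
    return [
        [m[0][0] * t[0][0] + m[0][1] * t[1][0], m[0][0] * t[0][1] + m[0][1] * t[1][1]],
        [m[1][0] * t[0][0] + m[1][1] * t[1][0], m[1][0] * t[0][1] + m[1][1] * t[1][1]],
    ]

def modinv_step_matrices(x, y):
    """Return x^(-1) mod y by accumulating Euclidean step matrices."""
    total = [[1, 0], [0, 1]]                 # running product of step matrices
    a, b = y, x % y                          # the pair (x_{j-2}, x_{j-1})
    while b != 0:
        q, r = divmod(a, b)                  # a = q*b + r
        step = [[0, 1], [1, -q]]             # determinant -1
        total = mat_mul(step, total)
        a, b = b, r
    if a != 1:
        raise ValueError("x is not invertible modulo y")
    # total applied to (y, x) gives (1, 0), hence total[0][0]*y + total[0][1]*x = 1
    return total[0][1] % y
```

In this arrangement the inverse is read from the off-diagonal element of the accumulated product, and each update of `total` consists of independent word products that can be assigned to separate multiplication units.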
The binary Euclidean algorithm determines a greatest common divisor (GCD) of two numbers, x and y, while avoiding division operations (other than division by 2 or powers of 2, which may be performed by bit shifting). More specifically, if x and y are both even, GCD(x,y)=2·GCD(x/2,y/2). If x is even and y is odd, GCD(x,y)=GCD(x/2,y) (and, symmetrically, GCD(x,y)=GCD(x,y/2) if x is odd and y is even). If both x and y are odd, GCD(x,y)=GCD(|x−y|, min(x,y)). By iteratively repeating these steps, the numbers are progressively reduced until one of the numbers is zero, e.g., x=0, and the GCD is given by the other number, e.g., GCD(0,y)=y.
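The rules above may be sketched in software as follows; the function name is hypothetical and the sketch assumes non-negative integer inputs.

```python
def binary_gcd(x, y):
    """Greatest common divisor using only shifts, subtraction, and comparison."""
    if x == 0:
        return y
    if y == 0:
        return x
    shift = 0
    while ((x | y) & 1) == 0:       # both even: factor out a common 2
        x >>= 1
        y >>= 1
        shift += 1
    while x != 0:
        while (x & 1) == 0:         # x even, y odd: halve x
            x >>= 1
        while (y & 1) == 0:         # y even, x odd: halve y
            y >>= 1
        x, y = abs(x - y), min(x, y)    # both odd: subtract and keep the smaller
    return y << shift               # restore the common powers of 2
```

The only data-dependent decision that needs the most significant words is the comparison hidden in `abs` and `min`.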
Cryptographic engine 200 may perform matrix multiplication as described above, with four MUL units 220-n computing matrix elements of the product $\hat{M}_j\cdot\hat{M}_{j-1}$ in a parallel or streaming fashion. For example, the cryptographic engine 200 may first compute the first column of the product $\hat{M}_j\cdot\hat{M}_{j-1}$, e.g., $(\hat{M}_j\cdot\hat{M}_{j-1})_{11}=(\hat{M}_j)_{11}(\hat{M}_{j-1})_{11}+(\hat{M}_j)_{12}(\hat{M}_{j-1})_{21}$ and $(\hat{M}_j\cdot\hat{M}_{j-1})_{21}=(\hat{M}_j)_{21}(\hat{M}_{j-1})_{11}+(\hat{M}_j)_{22}(\hat{M}_{j-1})_{21}$, using four MUL units 220-1 . . . 220-4 and store the computed matrix elements (e.g., in SRAM 240-1, 240-2, and/or SP 242-1, 242-2, etc.). If the cryptographic engine has more than four MUL units, then the matrix elements of the second column of the product $\hat{M}_j\cdot\hat{M}_{j-1}$ may be computed in parallel in a similar manner; otherwise, the matrix elements may be computed over several cycles. The stored matrix elements are then used in subsequent iterations of product $\prod_{j=1}^{n}\hat{M}_j$ computations. In some implementations, the size of the matrix elements may exceed the size of the operands of MUL units 220-n. In such implementations, the products, e.g., $(\hat{M}_j)_{11}(\hat{M}_{j-1})_{11}$, may be computed in a streaming fashion with a first portion of a multiplicand, e.g., $(\hat{M}_j)_{11}$, handled by MUL unit 220-1 and a second portion of the multiplicand handled by MUL unit 220-2, with portions of a multiplier, e.g., $(\hat{M}_{j-1})_{11}$, streamed through both MUL units 220-1 and 220-2. Similarly, the product $(\hat{M}_j)_{12}(\hat{M}_{j-1})_{21}$ may be computed by MUL units 220-3 and 220-4.
Accordingly, computation of matrix element $(\hat{M}_j\cdot\hat{M}_{j-1})_{11}$ may take several cycles of cryptographic engine 200 with matrix element $(\hat{M}_j\cdot\hat{M}_{j-1})_{21}$ computed during the following several cycles. In some implementations, even when the size of the matrix elements does not exceed the size of the operands of MUL units 220-n, computations of the products, e.g., $(\hat{M}_j)_{11}(\hat{M}_{j-1})_{11}$, may still be performed by two MUL units 220-n, with one MUL unit computing the corresponding product, and the next MUL unit performing Montgomery modular reduction of the computed product.
The cryptographic engine 200 can also be used for computation of Jacobi (and Legendre) symbols. A Legendre symbol $\left(\frac{x}{y}\right)$ indicates whether x is a quadratic residue modulo prime number y; namely, the Legendre symbol $\left(\frac{x}{y}\right)$ is +1 if there exists a number z whose square modulo y is equal to x: $z^{2}=x \bmod y$. The Legendre symbol is −1 if no such number z exists (and the Legendre symbol is 0 if x is divisible by y). The Jacobi symbol $\left(\frac{x}{y}\right)$ extends the definition of the Legendre symbol to non-prime odd numbers y and amounts to a product of Legendre symbols for all prime factors of y (counted with multiplicity). The Jacobi and Legendre symbols are frequently used in cryptographic applications, e.g., for generation (and primality testing) of prime number candidates. A quadratic reciprocity theorem expresses a Jacobi symbol $\left(\frac{x}{y}\right)$ via its swapped counterpart $\left(\frac{y \bmod x}{x}\right)$. Because, by definition, y mod x<x, such swapping results in a Jacobi symbol having smaller arguments. Repeating the swapping operation until the top number is 0 or 1 (1 is a quadratic residue modulo any number), and using known rules for the change of sign of the Jacobi symbol during each swapping, the value of the target Jacobi symbol $\left(\frac{x}{y}\right)$ may be determined.
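For illustration, the swapping procedure may be sketched with the textbook sign rules as follows. The rules used (a sign flip when a factor of 2 is removed and y mod 8 is 3 or 5, and a sign flip on a swap when both numbers are congruent to 3 modulo 4) are standard number-theoretic facts rather than statements taken from the disclosure, and the function name is hypothetical.

```python
def jacobi(x, y):
    """Jacobi symbol (x/y) for odd positive y; returns -1, 0, or +1."""
    if y <= 0 or y % 2 == 0:
        raise ValueError("y must be odd and positive")
    x %= y
    sign = 1
    while x != 0:
        while x % 2 == 0:               # remove factors of 2 from the top number
            x //= 2
            if y % 8 in (3, 5):         # (2/y) = -1 when y is 3 or 5 modulo 8
                sign = -sign
        x, y = y, x                     # swap (quadratic reciprocity)
        if x % 4 == 3 and y % 4 == 3:   # sign flips only if both are 3 modulo 4
            sign = -sign
        x %= y                          # reduce the new top number
    return sign if y == 1 else 0        # y > 1 here means x and y share a factor
```

For a prime y the result can be cross-checked against Euler's criterion, x^((y−1)/2) mod y.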
The above method of computing the Jacobi symbol using the quadratic reciprocity leads to a large number of subtraction and swapping operations. Alternatively, the binary Euclidean algorithm (similar to the one used to find a greatest common divisor of two numbers) may be used, which amounts to a set of the following rules. If x is even, it can be replaced x→x/2, with the ensuing symbol
to be multiplied by an appropriate factor (more specifically, (−1)(y
with the vector
denoting the iterated
Jacobi symbol. The computation of the total transformation matrix can be performed on the cryptographic engine as a product of multiple step matrices $\hat{M}_j$, using the streaming processing, as described above in relation to the modular inversion.
Computation of the Jacobi symbols using the binary Euclidean algorithm involves a substantial number of subtraction and swapping operations. Additionally, while division by 2 and subtraction of the denominator from the numerator may be performed in a streaming fashion, with low-words of the numerator and the denominator processed before the high words, the swapping operation depends on which number, x or y, is greater than the other number, which depends on the highest non-zero word of each number. To avoid delaying computations until the highest words are determined, in some implementations, cryptographic engine 200 may compute the Jacobi symbol(s) using a method that exploits some concepts of the 2019 Bernstein-Yang algorithm for modular inversion. More specifically, the numerical comparisons of x and y may be replaced with a uniformity tracker δ that indicates a degree of uniformity to which matrix {circumflex over (M)}j is reducing the numbers x and y.
The uniformity tracker δ starts at an initial value of zero and its absolute value |δ| increases or decreases in increments of one, per iteration. If the numerator x is even, the uniformity tracker δ is incremented by one while the numerator is halved:
$\delta' = \delta + 1, \quad x' = x/2, \quad y' = y.$
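If the numerator x is odd, the update step depends on the sign of the uniformity tracker δ. If the uniformity tracker δ is negative or zero, δ≤0, the uniformity tracker δ is incremented by one while the numerator is replaced with the mean of the numerator x and the denominator y:
$\delta' = \delta + 1, \quad x' = (x + y)/2, \quad y' = y,$
or in matrix form
$\begin{pmatrix} x' \\ y' \end{pmatrix} = \begin{pmatrix} 1/2 & 1/2 \\ 0 & 1 \end{pmatrix} \begin{pmatrix} x \\ y \end{pmatrix}.$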
If the uniformity tracker δ is positive, δ>0, the uniformity tracker δ is decremented by one and the sign of the ensuing value is reversed, the Jacobi symbol is swapped, and the new denominator is one half of the difference of the old denominator and the old numerator:
or in matrix form
In addition to the change of numbers, as expressed by the latter rule, the Jacobi symbol flips its sign when y′>0 and x′<0. (There is no additional sign flipping when x is even or when the uniformity tracker δ is negative.) Case 3 may be visualized as a 90-degree rotation in the xy-plane: x→y, y→−x followed by the Case 2 transformation. The number of times the sign of the Jacobi symbol is to be flipped is determined by the number of times the element $(\hat{M}^{-1})_{11}$ of the transformation matrix has changed signs, which may be tracked using a sign counter. Because the Jacobi symbol is periodic with the value of the counter modulo 4, the counter may be a 2-bit counter. The counter may additionally track the number of times the current value of y (y′, etc.) has changed signs and add this number to the value stored in the counter.
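The three update rules can be illustrated with a short Python sketch that drives x to zero without ever comparing x and y. Several points here are assumptions rather than statements of the disclosed method: the names `tracker_step` and `gcd_by_tracker` are hypothetical, Case 3 is read literally as the rotation x→y, y→−x followed by the Case 2 transformation, and the sign counter is omitted, so the sketch recovers only the greatest common divisor and not the Jacobi symbol.

```python
def tracker_step(delta, x, y):
    """One update of (delta, x, y); y is assumed odd throughout."""
    if x % 2 == 0:
        # Case 1: numerator even -> halve it, increment the tracker.
        return delta + 1, x // 2, y
    if delta <= 0:
        # Case 2: numerator odd, tracker <= 0 -> mean of x and y.
        return delta + 1, (x + y) // 2, y
    # Case 3 (assumed reading): rotate x -> y, y -> -x, then apply Case 2.
    # The tracker is decremented and its sign reversed: -(delta - 1).
    return 1 - delta, (y - x) // 2, -x


def gcd_by_tracker(x, y):
    """GCD of x and odd y using only parity tests and the tracker."""
    if y % 2 == 0:
        raise ValueError("y must be odd")
    delta = 0  # initial tracker value of zero, as in the text
    while x != 0:
        delta, x, y = tracker_step(delta, x, y)
    # Every step preserves gcd(x, y) because y stays odd; at x = 0, |y| is the GCD.
    return abs(y)
```

Each branch depends only on the lowest bit of x and the sign of δ, which is what allows the decisions to be made before the high words of x and y are available.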
Bernstein and Yang observed that the first k steps of the computation of the matrix $\hat{M}_{1\ldots k} := \prod_{j=1}^{k} \hat{M}_j$ may be performed based on the k least significant bits of the numbers x and y. For example, the first k=32 (or some other number of) steps of computation of the matrix $\hat{M}_{1\ldots k}$ may be performed before the computed matrix is applied to x and y. The numbers x and y may then be updated by multiplication by $\hat{M}_{1\ldots k}$, thus obtaining x′ and y′. The same procedure may then be repeated starting with the updated x′ and y′. Such an iterative procedure may be performed on the cryptographic engine 200 using input data streaming, as described above in conjunction with
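The batching idea can be sketched in Python as follows. The sketch rests on assumptions that are not taken from the disclosure: the helper names (`batch_matrix`, `apply_batch`, `reference_steps`) are hypothetical, the step rules follow the rotation reading of Case 3, and each step matrix is stored as an integer matrix with an implied factor of 1/2, so that the batch matrix carries an implied factor of 2^(-k).

```python
def batch_matrix(delta, x, y, k):
    """Build the product of k step matrices from the k low bits of x and y.

    Returns (delta, (a, b, c, d)) where the batch matrix is
    [[a, b], [c, d]] / 2**k acting on the column vector (x, y).
    """
    mask = (1 << k) - 1
    xl, yl = x & mask, y & mask      # only the k least significant bits are used
    a, b, c, d = 1, 0, 0, 1          # identity
    for _ in range(k):
        if xl % 2 == 0:
            s, delta = (1, 0, 0, 2), delta + 1      # x -> x/2
        elif delta <= 0:
            s, delta = (1, 1, 0, 2), delta + 1      # x -> (x + y)/2
        else:
            s, delta = (-1, 1, -2, 0), 1 - delta    # x -> (y - x)/2, y -> -x
        # Advance the truncated working values by one step.
        xl, yl = (s[0] * xl + s[1] * yl) // 2, (s[2] * xl + s[3] * yl) // 2
        # Left-multiply the accumulated matrix by the new step matrix.
        a, b, c, d = (s[0] * a + s[1] * c, s[0] * b + s[1] * d,
                      s[2] * a + s[3] * c, s[2] * b + s[3] * d)
    return delta, (a, b, c, d)


def apply_batch(m, x, y, k):
    """Apply the batch matrix to the full-width numbers (division is exact)."""
    a, b, c, d = m
    return (a * x + b * y) >> k, (c * x + d * y) >> k


def reference_steps(delta, x, y, k):
    """Reference: k single steps applied directly to the full-width numbers."""
    for _ in range(k):
        if x % 2 == 0:
            delta, x = delta + 1, x // 2
        elif delta <= 0:
            delta, x = delta + 1, (x + y) // 2
        else:
            delta, x, y = 1 - delta, (y - x) // 2, -x
    return delta, x, y
```

In this sketch the small-operand loop in `batch_matrix` plays the role of the co-processor, while `apply_batch` corresponds to the wide multiplications that would be streamed through the multiplication circuits.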
In some implementations, the co-processor 410 may use the least significant bit (LSB) of x and the LSB of y to compute a step matrix $\hat{M}_1$ and apply the matrix $\hat{M}_1$ to x and y, and to the current value of the batch matrix. This process may be repeated k times. In some implementations, the co-processor 410 may use the two LSBs of x to compute a double-step matrix $\hat{M}_2 \cdot \hat{M}_1$, e.g., using a look-up table, and apply the double-step matrix to x and y, and to the current value of the batch matrix. This process may be repeated k/2 times, at each iteration building the batch matrix by multiplying it by an additional (single or double) step matrix. Also, during each of the k (or k/2) iterations, the co-processor 410 may use the next most significant bits of x and y (e.g., 3 or 4 bits in total) to determine whether the transformations used change the sign of the Jacobi symbol, and/or whether element 11 of the transformation matrix and/or y has become negative. The co-processor 410 may then update the sign counter (e.g., a 2-bit sign counter, as described above).
At the completion of k (or k/2) iterations, the co-processor 410 may provide the computed coefficients of the batch matrix $\hat{M}_{1\ldots k}$ to ALU 210 (e.g., MUL units 220-n), which may apply the batch matrix to the numbers x and y to obtain the updated numbers, e.g., x′ and y′. Subsequently, the co-processor 410 may compute the next batch of step matrices, e.g., $\hat{M}_{k+1\ldots 2k}$. The next batch may be computed in parallel with ALU 210 applying batch $\hat{M}_{1\ldots k}$, and so on. When k+2 LSBs of the updated numbers x′ and y′ have become available, ALU 210 may provide these bits to the co-processor 410 and the co-processor 410 may begin computations of the next batch of step matrices, $\hat{M}_{k+1\ldots 2k}$. When the sign of y becomes available, ALU 210 may provide the sign of y to the co-processor 410 and the co-processor 410 may update the sign counter, if indicated by the sign. At the conclusion of all iterations, the Jacobi symbol may be read from the sign counter of the co-processor 410, whereas a greatest common divisor (GCD) of x and y, as well as the modular inverse $y^{-1} \bmod x$, are stored (as different elements of matrix $\hat{M}^{-1}$) in memory circuits of ALU 210. In the instances where the GCD is greater than 1, the Jacobi symbol is zero; otherwise the Jacobi symbol is given by the value in the sign counter.
A cryptographic processor that performs methods 500, 600, and 700 may include a plurality of four or more multiplication circuits (e.g., MUL units 220-n). The cryptographic processor may further include a plurality of two or more addition circuits (e.g., ADD units 230-n). Each of the plurality of the addition circuits may be communicatively coupled (e.g., via one or more buses) to at least one of the multiplication circuits. In some implementations, each of the plurality of the addition circuits is coupled to all multiplication circuits. In some implementations, some or all of the multiplication circuits may be configured to perform modular multiplication and some or all of the addition circuits may be configured to perform modular addition. In some implementations, some or all of the multiplication circuits may be configured to perform Montgomery multiplication.
The cryptographic processor may further include a memory system having two or more memory units. Each of the memory units may be communicatively coupled to at least one of the multiplication circuits and at least one of the addition circuits. One or more of the memory units may be double-port memory units capable of performing a read operation and a write operation within a same cycle of cryptographic processor operations.
At block 520, the cryptographic processor may, during a second cycle, obtain a second plurality of multiplication products. Each of the second plurality of multiplication products may be obtained by a respective multiplication circuit of at least a subset of the plurality of multiplication circuits and may be based on a multiplier or a multiplicand used, during the first cycle, by a different multiplication circuit. For example, with reference to
At least some of the first plurality of multiplication products or the second plurality of multiplication products may be obtained using multipliers loaded from the memory circuits. For example, during cycle 2, multiplier X1 and multiplicand Y3Y2 may be loaded from the memory circuits (while multiplier X0 is loaded during a previous cycle and multiplicand Y1Y0 is passed from MUL unit 220-1 to MUL unit 220-2). Similarly, during cycle 3, multiplier X2 may be loaded from one of the memory circuits.
At block 530, the cryptographic processor may use at least one of the plurality of addition circuits to perform an addition operation using at least one of the first plurality of multiplication products and at least one of the second plurality of multiplication products. For example, with reference to
In some implementations, each of the first plurality of multiplication products and the second plurality of multiplication products may be modular multiplication products. In some implementations, each of the second plurality of multiplication products may be obtained by a Montgomery reduction of a respective multiplication product of the first plurality of multiplication products. For example, while MUL unit 220-1 may be computing a multiplication product during the first cycle, MUL unit 220-2 may be performing (during the second cycle) the Montgomery reduction of the computed product.
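As an illustration of the two-stage pattern described above (a plain product followed by its Montgomery reduction), the following Python sketch separates the multiplication from the reduction. It is a textbook formulation of Montgomery reduction under stated assumptions, not the disclosed circuit: the function names and the 32-bit word size used to choose R are hypothetical, and `pow(n, -1, r)` requires Python 3.8 or later.

```python
def montgomery_setup(n, word_bits=32):
    """Choose R = 2**r_bits > n (a whole number of words) and n' = -n^(-1) mod R."""
    if n % 2 == 0:
        raise ValueError("modulus must be odd")
    r_bits = ((n.bit_length() + word_bits - 1) // word_bits) * word_bits
    r = 1 << r_bits
    n_prime = (-pow(n, -1, r)) % r   # modular inverse via pow (Python 3.8+)
    return r_bits, n_prime


def to_montgomery(a, n, r_bits):
    """Map a to its Montgomery form a * R mod n."""
    return (a << r_bits) % n


def redc(t, n, r_bits, n_prime):
    """Montgomery reduction: returns t * R^(-1) mod n for 0 <= t < n * R."""
    mask = (1 << r_bits) - 1
    m = ((t & mask) * n_prime) & mask   # makes t + m*n divisible by R
    u = (t + m * n) >> r_bits           # exact division by R
    return u - n if u >= n else u


def montgomery_multiply(a_bar, b_bar, n, r_bits, n_prime):
    """Stage 1: plain product. Stage 2: Montgomery reduction of that product."""
    product = a_bar * b_bar                    # first stage (one multiplier)
    return redc(product, n, r_bits, n_prime)   # second stage (another multiplier)
```

Because the reduction consumes the product starting from its low bits, the second stage can in principle begin before the first stage has produced the high words, which is the property a pipelined arrangement would rely on.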
At block 620, method 600 may continue with the first multiplication circuit (e.g., MUL unit 220-1) determining a first product (e.g., X0·Y0) of the first multiplier and the first multiplicand. At block 630, the cryptographic processor (e.g., using instructions of control unit 250 depicted in
At block 640, the cryptographic processor may load (e.g., in conjunction with cycle 2, as depicted in
At block 670, method 600 may continue with the second multiplication circuit (e.g., MUL unit 220-2) determining a third product (e.g., X1·Y0) of the second multiplier (e.g., X1) and the first multiplicand (e.g., Y0). At block 680, method 600 may continue with one of the addition circuits (e.g., adder circuit 235) computing a sum of addends. The addends may include: i) a first predetermined number of low bits (e.g., N bits) of the second product (e.g., X0·Y1), ii) the third product (e.g., X1·Y0), and iii) a second predetermined number (e.g., 2N bits or N bits) of high bits of the first product (e.g., X0·Y0). At block 690, the cryptographic processor may store the first predetermined number of low bits of the sum (e.g., accumulator value A1) in a memory unit (e.g., SRAM 240-1 or SP 242-1). Additionally, the cryptographic processor may store the second predetermined number of high bits of the sum in a second memory unit (e.g., ADD unit 230-2 or a buffer memory of MUL unit 220-2). The stored high bits may be used (e.g., as a carry) in a subsequent cycle of computations.
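The accumulate-and-carry pattern of blocks 620 through 690 can be illustrated with a simplified Python sketch of word-by-word multiplication. This is a generic column-wise schoolbook multiplication under assumptions, not the disclosed schedule: the helper names are hypothetical, all partial products of a column are summed in one step (whereas the engine spreads them over several multiplication circuits and cycles), and the word width `w` stands in for the N-bit words of the text.

```python
def to_words(value, n_words, w):
    """Split a non-negative integer into n_words little-endian words of w bits."""
    mask = (1 << w) - 1
    return [(value >> (w * i)) & mask for i in range(n_words)]


def from_words(words, w):
    """Reassemble little-endian w-bit words into an integer."""
    return sum(word << (w * i) for i, word in enumerate(words))


def streaming_multiply(xs, ys, w):
    """Multiply two little-endian word arrays, emitting low result words first."""
    mask = (1 << w) - 1
    result = []
    carry = 0                                 # high bits kept for the next column
    for i in range(len(xs) + len(ys) - 1):
        acc = carry
        for j in range(len(xs)):
            if 0 <= i - j < len(ys):
                acc += xs[j] * ys[i - j]      # e.g., X0*Y1 + X1*Y0 for i = 1
        result.append(acc & mask)             # low w bits: accumulator value A_i
        carry = acc >> w                      # high bits: carry into column i + 1
    result.append(carry)                      # final carry fits in one word
    return result
```

For example, with `w = 32` the word A1 is the low 32 bits of X0·Y1 + X1·Y0 plus the high bits carried over from X0·Y0, and the remaining high bits are held back for the following column.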
It will be understood that Jacobi symbols also include, as a special case, the Legendre symbols, which may also be computed using method 700. Method 700 may involve performing, by a cryptographic processor, a plurality of iterations to identify the result of the respective modular operation. The cryptographic processor may include a plurality of multiplication circuits and a co-processor, each performing a portion of operations of method 700.
At block 710, method 700 may include iteratively determining, by the co-processor, a plurality of k step matrices, wherein each of the plurality of step matrices is based on a respective subset of k least significant bits of the first number and the second number. For example, step matrices $\hat{M}_1$ and $\hat{M}_2$ may be based on the least significant bit and the second least significant bit of each of the first number x and the second number y. At block 720, method 700 may continue with the co-processor determining a tracking matrix as a product of the computed step matrices, e.g., $\hat{M}_{1\ldots k} = \prod_{j=1}^{k} \hat{M}_j$. At block 730, method 700 may continue with the plurality of multiplication circuits modifying numbers x and y using matrix multiplication with the tracking matrix, e.g., $(x', y')^T = \hat{M}_{1\ldots k}\,(x, y)^T$.
As indicated by block 740, the co-processor may determine a first number of times an element of the tracking matrix $\hat{M}_{1\ldots k}$, iteratively modified, becomes negative. As indicated by block 750, at each iteration, the co-processor may further determine a second number of occurrences that the second number, iteratively modified (e.g., y), becomes negative. For example, if the step matrices $\hat{M}_j$ obey certain mathematical properties, the number of times that the sign of the element of $\hat{M}_{1\ldots k}$ changes may be the same as (or one less than) the number of times that the sign of the second number (e.g., y) changes. This property, in conjunction with the final signs of the element of $\hat{M}_{1\ldots k}$ and the second number, may be used to determine the number of times the second number changes sign (e.g., becomes negative) based on the number of times the element of $\hat{M}_{1\ldots k}$ changes sign. At block 760, method 700 may identify the result of the modular operation using the modified first number, the modified second number, the first determined number of times, and/or the second determined number of times. For example, if the modular operation involves a computation of a Jacobi symbol, the sign of the result of the operation may be changed if the first number of occurrences or the second number of occurrences is odd.
Example computer system 800 may include a processing device 802 (also referred to as a processor or CPU), a main memory 804 (e.g., read-only memory (ROM), flash memory, dynamic random access memory (DRAM) such as synchronous DRAM (SDRAM), etc.), a static memory 806 (e.g., flash memory, static random access memory (SRAM), etc.), and a secondary memory (e.g., a data storage device 818), which may communicate with each other via a bus 830.
Processing device 802 represents one or more general-purpose processing devices such as a microprocessor, central processing unit, or the like. More particularly, processing device 802 may be a complex instruction set computing (CISC) microprocessor, reduced instruction set computing (RISC) microprocessor, very long instruction word (VLIW) microprocessor, processor implementing other instruction sets, or processors implementing a combination of instruction sets. Processing device 802 may also be one or more special-purpose processing devices such as an application specific integrated circuit (ASIC), a field programmable gate array (FPGA), a digital signal processor (DSP), network processor, or the like. In accordance with one or more aspects of the present disclosure, processing device 802 may be configured to execute instructions implementing methods 500 and 600 of a streaming multiplication performed on a cryptographic processor operating in accordance with one or more aspects of the present disclosure and method 700 of determining results of certain modular operations using the cryptographic processor.
Example computer system 800 may further comprise a network interface device 808, which may be communicatively coupled to a network 820. Example computer system 800 may further comprise a video display 810 (e.g., a liquid crystal display (LCD), a touch screen, or a cathode ray tube (CRT)), an alphanumeric input device 812 (e.g., a keyboard), a cursor control device 814 (e.g., a mouse), and an acoustic signal generation device 816 (e.g., a speaker).
Data storage device 818 may include a computer-readable storage medium (or, more specifically, a non-transitory computer-readable storage medium) 828 on which is stored one or more sets of executable instructions 822. In accordance with one or more aspects of the present disclosure, executable instructions 822 may comprise executable instructions implementing methods 500 and 600 of a streaming multiplication performed on a cryptographic processor operating in accordance with one or more aspects of the present disclosure and method 700 of determining results of certain modular operations using the cryptographic processor.
Executable instructions 822 may also reside, completely or at least partially, within main memory 804 and/or within processing device 802 during execution thereof by example computer system 800, main memory 804 and processing device 802 also constituting computer-readable storage media. Executable instructions 822 may further be transmitted or received over a network via network interface device 808.
While the computer-readable storage medium 828 is shown in
Some portions of the detailed descriptions above are presented in terms of algorithms and symbolic representations of operations on data bits within a computer memory. These algorithmic descriptions and representations are the means used by those skilled in the data processing arts to most effectively convey the substance of their work to others skilled in the art. An algorithm is here, and generally, conceived to be a self-consistent sequence of steps leading to a desired result. The steps are those requiring physical manipulations of physical quantities. Usually, though not necessarily, these quantities take the form of electrical or magnetic signals capable of being stored, transferred, combined, compared, and otherwise manipulated. It has proven convenient at times, principally for reasons of common usage, to refer to these signals as bits, values, elements, symbols, characters, terms, numbers, or the like.
It should be borne in mind, however, that all of these and similar terms are to be associated with the appropriate physical quantities and are merely convenient labels applied to these quantities. Unless specifically stated otherwise, as apparent from the following discussion, it is appreciated that throughout the description, discussions utilizing terms such as “identifying,” “determining,” “storing,” “adjusting,” “causing,” “returning,” “comparing,” “creating,” “stopping,” “loading,” “copying,” “throwing,” “replacing,” “performing,” or the like, refer to the action and processes of a computer system, or similar electronic computing device, that manipulates and transforms data represented as physical (electronic) quantities within the computer system's registers and memories into other data similarly represented as physical quantities within the computer system memories or registers or other such information storage, transmission or display devices.
Examples of the present disclosure also relate to an apparatus for performing the methods described herein. This apparatus may be specially constructed for the required purposes, or it may be a general purpose computer system selectively programmed by a computer program stored in the computer system. Such a computer program may be stored in a computer readable storage medium, such as, but not limited to, any type of disk including optical disks, CD-ROMs, and magnetic-optical disks, read-only memories (ROMs), random access memories (RAMs), EPROMs, EEPROMs, magnetic disk storage media, optical storage media, flash memory devices, other types of machine-accessible storage media, or any type of media suitable for storing electronic instructions, each coupled to a computer system bus.
The methods and displays presented herein are not inherently related to any particular computer or other apparatus. Various general purpose systems may be used with programs in accordance with the teachings herein, or it may prove convenient to construct a more specialized apparatus to perform the required method steps. The required structure for a variety of these systems will appear as set forth in the description below. In addition, the scope of the present disclosure is not limited to any particular programming language. It will be appreciated that a variety of programming languages may be used to implement the teachings of the present disclosure.
It is to be understood that the above description is intended to be illustrative, and not restrictive. Many other implementation examples will be apparent to those of skill in the art upon reading and understanding the above description. Although the present disclosure describes specific examples, it will be recognized that the systems and methods of the present disclosure are not limited to the examples described herein, but may be practiced with modifications within the scope of the appended claims. Accordingly, the specification and drawings are to be regarded in an illustrative sense rather than a restrictive sense. The scope of the present disclosure should, therefore, be determined with reference to the appended claims, along with the full scope of equivalents to which such claims are entitled.
| Filing Document | Filing Date | Country | Kind |
|---|---|---|---|
| PCT/US22/37024 | 7/13/2022 | WO | |

| Number | Date | Country |
|---|---|---|
| 63203464 | Jul 2021 | US |
| 63261165 | Sep 2021 | US |