The disclosure pertains to cryptographic computing applications and, more specifically, to improving efficiency of cryptographic operations with cryptographic engines having systolic processing arrays capable of performing parallel and streaming computations.
The present disclosure will be understood more fully from the detailed description given below and from the accompanying drawings of various implementations of the disclosure.
Aspects of the present disclosure are directed to cryptographic engines and methods of using said cryptographic engines for improving computational efficiency and memory utilization in cryptographic operations that include, but are not limited to, public-key cryptography applications. More specifically, aspects of the present disclosure are directed to multi-lane cryptographic engines with systolic architecture for efficient multiplication of numbers of various sizes, modular multiplication, Montgomery multiplication and reduction, and other operations used in cryptographic applications.
Various cryptographic computations may involve operations that are efficiently performed by offloading them from a main processor to a dedicated cryptographic engine (accelerator) that includes hardware circuits designed to improve speed and efficiency of arithmetic operations (multiplication, division, addition, etc.) and memory accesses. For example, in Rivest-Shamir-Adleman (RSA) public key/private key applications, large prime numbers p and q may be selected to generate a pair of a public (encryption) exponent e and a secret (decryption) exponent d such that e and d are inverses of each other modulo a certain number (e.g., modulo (p−1)·(q−1) or the least common multiple of p−1 and q−1). The numbers e and N=p·q are revealed as part of the public key while p, q, and d are stored in secret as parts of the private key. A message m may be encrypted into a ciphertext c using modular exponentiation, c=m^e mod N, and can be deciphered using another modular exponentiation, m=c^d mod N, based on the private exponent d. To prevent unauthorized actors from recovering the private exponent d, the prime factors p and q are typically selected to be large numbers, e.g., 1024-bit numbers.
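For illustration only, the RSA relations described above can be checked with a minimal Python sketch. The tiny primes, the exponent e=17, and the message value are assumptions chosen for readability and are far too small for any real use; Python 3.8 or later is assumed for the modular inverse computed with the built-in pow.

```python
# Toy RSA round trip; every numeric value here is an illustrative assumption.
p, q = 61, 53                 # real systems use primes of about 1024 bits
N = p * q                     # public modulus N = p*q
phi = (p - 1) * (q - 1)       # e and d are inverses modulo (p-1)*(q-1)
e = 17                        # public exponent, chosen coprime to phi
d = pow(e, -1, phi)           # private exponent (modular inverse of e)

m = 65                        # message, must satisfy 0 <= m < N
c = pow(m, e, N)              # encryption: c = m^e mod N
recovered = pow(c, d, N)      # decryption: m = c^d mod N
```

Both exponentiations reduce to long chains of modular multiplications, which is the workload a cryptographic engine is meant to accelerate.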
Some applications use elliptic curve cryptography that involves operations with points (x,y) on an elliptic curve, e.g., an elliptic Weierstrass curve, y^2=x^3+ax+b. Arithmetic operations (such as addition, doubling, and infinity operations) are defined via a set of geometric rules; e.g., a sum of three points on an elliptic curve is zero, P1+P2+P3=0, if the points P1, P2, P3 are located at the intersection of the elliptic curve with a straight line. The strength of the elliptic curve cryptography is based on the fact that for large values of k, a product Q=P·k can be practically anywhere on the elliptic curve. As a result, the inverse operation of determining an unknown value k (e.g., a private key) from a known public value Q can be a prohibitively difficult computational operation. In elliptic curve cryptography, it is typically sufficient to use numbers that are much smaller (e.g., 256-bit numbers) than numbers used in RSA applications.
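By way of a hedged example, the point addition, doubling, and scalar multiplication mentioned above can be sketched in Python on a toy Weierstrass curve. The curve y^2=x^3+2x+2 over integers modulo 17 and the base point (5,1) are assumptions taken from common textbook practice, not from this disclosure; the function names are hypothetical, and None stands for the point at infinity.

```python
# Toy curve y^2 = x^3 + A*x + B over integers modulo P_MOD (illustrative values).
A, B, P_MOD = 2, 2, 17
G = (5, 1)  # assumed base point on the toy curve

def on_curve(pt):
    """Return True if pt satisfies the curve equation (None is the infinity point)."""
    if pt is None:
        return True
    x, y = pt
    return (y * y - (x * x * x + A * x + B)) % P_MOD == 0

def ec_add(p1, p2):
    """Add two points using the chord-and-tangent rule."""
    if p1 is None:
        return p2
    if p2 is None:
        return p1
    x1, y1 = p1
    x2, y2 = p2
    if x1 == x2 and (y1 + y2) % P_MOD == 0:
        return None                      # P + (-P) is the infinity point
    if p1 == p2:                         # doubling: slope of the tangent
        lam = (3 * x1 * x1 + A) * pow((2 * y1) % P_MOD, -1, P_MOD) % P_MOD
    else:                                # addition: slope of the chord
        lam = (y2 - y1) * pow((x2 - x1) % P_MOD, -1, P_MOD) % P_MOD
    x3 = (lam * lam - x1 - x2) % P_MOD
    y3 = (lam * (x1 - x3) - y1) % P_MOD
    return (x3, y3)

def scalar_mult(k, pt):
    """Compute k*pt by double-and-add."""
    result = None
    while k:
        if k & 1:
            result = ec_add(result, pt)
        pt = ec_add(pt, pt)
        k >>= 1
    return result
```

Every slope computation is a handful of modular multiplications and one inversion, so the same multiplication hardware serves both RSA-sized and curve-sized operands.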
Decryption and encryption operations often require a large number of arithmetic operations, which may take many clock cycles, especially when performed on low-bit microprocessors, such as those of smart card readers, wireless sensor nodes, and so on. Cryptographic engines (accelerators, co-processors) are specially designed collections of circuits that execute specialized computationally intensive cryptographic operations more efficiently than a general purpose processor (e.g., a central processing unit). Because in many applications (including network and cloud applications) cryptographic operations may constitute a significant portion of the total computational load, small and efficient cryptographic engines are highly desired.
In applications, cryptographic engines are often called on to operate on numbers of different sizes. For example, the same cryptographic engine may provide computational support for cryptographic applications that use the RSA algorithm (with large, e.g., 1024-bit inputs) whereas other applications use ECC algorithms (with smaller, e.g., 256-bit inputs). Multiplication of large numbers may be more efficiently performed by splitting large numbers into segments (words) and multiplying the large numbers word by word with accumulator values and carries propagated through various word multiplications, e.g., as in the schoolbook algorithm. For example, two 1024-bit input numbers X and Y may be segmented into sets of sixteen 64-bit words {Xj} and {Yj} and processed through sixteen multiplication circuits connected into a systolic array, each word of the multiplier Xj being handled by a specific multiplication circuit and each word of the multiplicand Yk streamed into and out of each (and into the next) multiplication circuit. When smaller, e.g., 256-bit, numbers are processed by such an array of multiplication circuits, the multiplication operations may be completed by the first four multiplication circuits, but the data may still have to be streamed through the remaining twelve multiplication circuits. Such streaming slows down the computations, makes the pass-through circuits unavailable for other multiplication operations, and increases power consumption.
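The word-by-word schoolbook multiplication referred to above can be illustrated with the following Python sketch. It is a software model only, under the assumption of 64-bit words stored least significant first; the helper names to_words, from_words, and schoolbook_multiply are hypothetical.

```python
def to_words(value, count, w=64):
    """Split a non-negative integer into `count` w-bit words, least significant first."""
    mask = (1 << w) - 1
    return [(value >> (w * i)) & mask for i in range(count)]

def from_words(words, w=64):
    """Reassemble an integer from w-bit words, least significant first."""
    return sum(word << (w * i) for i, word in enumerate(words))

def schoolbook_multiply(x_words, y_words, w=64):
    """Multiply word by word, propagating an accumulator and a one-word carry."""
    mask = (1 << w) - 1
    out = [0] * (len(x_words) + len(y_words))
    for j, xj in enumerate(x_words):          # one multiplier word per row
        carry = 0
        for k, yk in enumerate(y_words):      # multiplicand words streamed through
            total = xj * yk + out[j + k] + carry
            out[j + k] = total & mask         # low word stays as the accumulator
            carry = total >> w                # high word moves to the next position
        out[j + len(y_words)] = carry
    return out
```

For two 1024-bit inputs this performs sixteen rows of sixteen word multiplications each, which is the work a sixteen-element array would distribute across its multiplication circuits.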
Described in the instant disclosure are cryptographic engines that allow increased flexibility in handling multiplications (and other operations) of numbers of different sizes. Described herein is a segmented systolic array (SSA) having multiple processing elements, e.g., computational units that may include multiplication circuits, addition circuits, memory buffers, and other components (such as special prime units). The systolic array may be partitioned into multiple (e.g., N) processing lanes, each having multiple (e.g., n) processing elements. Each processing lane may have an independent data input and data output. Each processing lane may receive data input directly from a preceding lane and provide data output directly into a subsequent lane. Each processing lane may have a control unit that can configure operations performed by the respective lane and a buffer that can store outputs of the lane in the instances where the outputs are to be used by a subsequent lane while the subsequent lane is finishing ongoing operations. Also described are example operations, e.g., multiplications, modular multiplications, Montgomery reductions, which may be performed on an SSA (although various other operations can also be performed using the disclosed SSA). For example, multiplication of small (e.g., 256-bit) numbers may be handled by a single processing lane, which may output and store the obtained results without affecting processing by other processing lanes. Multiplication of larger (e.g., 512-bit or 1024-bit) numbers may be performed by multiple processing lanes, e.g., two, three, or more adjacent processing lanes.
The system architecture 100 may further include an input/output (I/O) interface 104 to facilitate connection of the computer system 102 to peripheral hardware devices 106 such as card readers, terminals, printers, scanners, internet-of-things devices, and the like. The system architecture 100 may further include a network interface 108 to facilitate connection to a variety of networks (Internet, wireless local area networks (WLAN), personal area networks (PAN), public networks, private networks, etc.), and may include a radio front end module and other devices (amplifiers, digital-to-analog and analog-to-digital converters, dedicated logic units, etc.) to implement data transfer to/from the computer system 102. Various hardware components of the computer system 102 may be connected via a system bus 112 that may include its own logic circuits, e.g., a bus interface logic unit (not shown).
The computer system 102 may support one or more cryptographic applications 110-n, such as an embedded cryptographic application 110-1 and/or external cryptographic application 110-2. The cryptographic applications 110-n may be secure authentication applications, encrypting applications, decrypting applications, secure storage applications, and so on. The external cryptographic application 110-2 may be instantiated on the same computer system 102, e.g., by an operating system executed by the processor 120 and residing in the memory device 130. Alternatively, the external cryptographic application 110-2 may be instantiated by a guest operating system supported by a virtual machine monitor (hypervisor) executed by the processor 120. In some implementations, the external cryptographic application 110-2 may reside on a remote access client device or a remote server (not shown), with the computer system 102 providing cryptographic support for the client device and/or the remote server.
The processor 120 may include one or more processor cores having access to a single-level or multi-level cache and one or more hardware registers. In implementations, each processor core may execute instructions to run a number of hardware threads, also known as logical processors. Various logical processors (or processor cores) may be assigned to one or more cryptographic applications 110, although more than one processor core (or a logical processor) may be assigned to a single cryptographic application for parallel processing. A multi-core processor 120 may simultaneously execute multiple instructions. A single-core processor 120 may typically execute one instruction at a time (or process a single pipeline of instructions). The processor 120 may be implemented as a single integrated circuit, two or more integrated circuits, or may be a component of a multi-chip module.
The memory device 130 may refer to a volatile or non-volatile memory and may include a read-only memory (ROM) 132, a random-access memory (RAM) 134, high-speed cache 136, as well as (not shown) electrically erasable programmable read-only memory (EEPROM), flash memory, flip-flop memory, or any other device capable of storing data. The RAM 134 may be a dynamic random-access memory (DRAM), synchronous DRAM (SDRAM), a static memory, such as static random-access memory (SRAM), and the like. Some of the cache 136 may be implemented as part of the hardware registers of the processor 120. In some implementations, the processor 120 and the memory device 130 may be implemented as a single field-programmable gate array (FPGA).
The computer system 102 may include a cryptographic engine 200 for fast and efficient performance of cryptographic computations, as described in more detail below. Cryptographic engine 200 may include processing and memory components, as described in more detail below. Cryptographic engine 200 may facilitate exchange of secret data, authentication of applications, users, access requests, and the like, in association with operations of the cryptographic applications 110-n or any other applications operating on or in conjunction with the computer system 102. Cryptographic engine 200 may further perform encryption and decryption of secret information.
Each processing lane may include a number of processing elements (PE). For conciseness, shown are four PEs within each processing lane, even though each processing lane may have any number n of processing elements (e.g., more or fewer than four). For example, as depicted, PL 220 includes PE 222, PE 224, PE 226, and PE 228; PL 230 includes PE 232, PE 234, PE 236, and PE 238; PL 240 includes PE 242, PE 244, PE 246, and PE 248; and PL 250 includes PE 252, PE 254, PE 256, and PE 258. Each processing element may be capable of performing a multiplication on a k-bit multiplier and an l-bit multiplicand (also referred to herein as words). For example, in one implementation, k=l=64. In another implementation, k=32 and l=64. A word upon which a processing element operates may be a complete number or a portion of a larger number that is being processed (concurrently and/or sequentially, as described in more detail below) by multiple processing elements and multiple processing lanes. Unidirectional solid arrows in
As depicted, each processing lane may receive input data from bus 212 and output data into bus 212. Data received by a first processing element of each processing lane may be processed and passed to the next processing element of the same processing lane. Although not depicted (for the sake of reader's convenience), data may be received by any of the subsequent processing elements directly from bus 212, and not only from a preceding processing element. For example, during a first cycle of computations, data may be received by PE 222 of PL 220 from bus 212. The received data may include a word of a multiplier X and a word of a multiplicand Y. PE 222 may perform multiplication (in some implementations, modular multiplication) of the received words and store a low word of the product in an accumulator circuit (e.g., buffer) while passing a high (carry) word to the next processing element, e.g., PE 224. PE 222 may additionally pass the used multiplicand word to the downstream PE 224. During the next cycle, PE 222 may receive from bus 212 a new word of the multiplicand and multiply the previously received word of the multiplier by the new word of the multiplicand. In the meantime, PE 224 may load the next word of the multiplier X and multiply the loaded word of the multiplier by the word of the multiplicand passed by PE 222. Other processing elements of PL 220 may operate in a similar fashion by streaming data (e.g., multiplicand words, accumulator values, carry values, etc.) to downstream processing elements, with words of the multipliers loaded and retained by various processing elements and words of the multiplicands loaded by an upstream processing element and passed to downstream processing elements. In some implementations, words of both the multiplier and the multiplicand may be loaded from memory prior to each cycle of computations.
Some or all processing lanes may include a lane buffer for temporary storage of outputs. For example, PL 220 may include lane buffer 229; PL 230 may include lane buffer 239; PL 240 may include lane buffer 249; and PL 250 may include lane buffer 259. Lane buffers may be utilized when the output of a processing lane is used as an input into the next processing lane (e.g., output of PL 220 used as an input into PL 230) rather than stored in memory 280, for example, in instances where the next processing lane is finishing a previous computation and is not yet ready to process inputs from the preceding lane.
Some or all processing lanes may include a lane control unit (LCU) for controlling operations within the respective processing lane and directing data flow between various processing elements and other components of the lane. For example, PL 220 may include LCU 221; PL 230 may include LCU 231; PL 240 may include LCU 241; and PL 250 may include LCU 251. For example, LCU 221 may determine that PL 220 is to multiply a first 128-bit number by a second 128-bit number and may only use PE 222 and PE 224 for the multiplication operations (on 64-bit operands) while designating PE 226 and PE 228 as pass-through elements. On the other hand, LCU 231 may determine that PL 230 is to multiply a third 256-bit number by a fourth 256-bit number and may use all four PEs of PL 230 for the respective multiplication operations.
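The sizing decisions described above (how many lanes an operation occupies and which processing elements act as pass-through elements) can be sketched as follows. This is an illustrative model only: it assumes one 64-bit multiplier word per processing element, four processing elements per lane, and four lanes, and the function name plan_lanes and its return fields are hypothetical.

```python
import math

def plan_lanes(operand_bits, word_bits=64, pes_per_lane=4, num_lanes=4):
    """Estimate how many lanes and PEs a multiplication of the given size occupies."""
    words = math.ceil(operand_bits / word_bits)      # multiplier words to place
    lanes = math.ceil(words / pes_per_lane)          # adjacent lanes needed
    if lanes > num_lanes:
        raise ValueError("operand does not fit in the array in a single pass")
    active_in_last = words - (lanes - 1) * pes_per_lane
    return {
        "words": words,
        "lanes": lanes,
        "active_pes_in_last_lane": active_in_last,
        "pass_through_pes": pes_per_lane - active_in_last,
    }
```

Under these assumptions, a 128-bit operand occupies two processing elements of one lane with two pass-through elements, a 256-bit operand fills one lane, and 512-bit and 1024-bit operands occupy two and four lanes, respectively.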
Memory 280 of cryptographic engine 200 may include a number of memory units (circuits), such as any number of static random-access memory (SRAM) units 282 and any number of scratchpad (SP) units 284. Each SRAM 282 may be a single-port memory unit configured to load one word or store one word per cycle. Each SP unit 284 may be a two-port memory unit configured to load one number and store one number per cycle.
Bus 212 may include a number of data communication lines (data bus) for transferring data (input and output numbers) between the aforementioned components of cryptographic engine. Additionally, bus 212 may include an address bus for communicating signals that identify source and destination of data. Bus 212 may also include a control bus, e.g., lines for communicating control signals from a control unit 290. Control unit 290 may include a clock to maintain cycles of computations and memory access operations. Control unit 290 may store instructions to the cryptographic engine to perform various cryptographic computations. Control unit 290 may determine which processing lanes are to perform a particular operation and may further determine an order of such operations. For example, control unit 290 may identify that cryptographic engine 200 is to perform a multiplication of two 512-bit numbers and direct PL 220 and PL 230 to perform the multiplication, while PL 240 and PL 250 may remain idle (or perform multiplications of some other numbers). As another example, control unit 290 may identify that cryptographic engine 200 is to perform a multiplication of two 1024-bit numbers and direct all four PLs 220-250 to perform the multiplication. As another example, control unit 290 may determine that PL 220 and PL 240 are to perform multiplications while PL 230 and PL 250 are to perform Montgomery reduction of the outputs of PL 220 and PL 240, as described in more detail below in relation to
An additional ALU support unit 260 may include circuits that perform operations different from multiplications or additions. ALU support unit 260 may include a read-only memory (ROM) 262, which may store constants (such as modulus p, auxiliary number s, Montgomery radix R, inverse radix R^−1 mod p, and various other auxiliary numbers, such as powers of radix R, e.g., R^2 mod p or modulo some other suitable modulus, etc.), various instructions to be used by control unit 290, and so on. ALU support unit 260 may further include a random number generator (RNG) 264 for generation of random (or pseudorandom) numbers, an XOR unit 266 for performing XOR operations, a shift unit 268 to perform bit shifting and bit masking, a compare unit 270 to perform comparison of input numbers, a copy unit 272 for copying numbers, an A2B/B2A unit 274, as well as any other auxiliary units (circuits) performing a function that may be used in operations of the cryptographic engine 200.
A multiplication circuit 330 may process the received words of the multiplier and multiplicand. If a word of the multiplier has m bits and the word of the multiplicand has M bits, the output of multiplication circuit 330 may be an (M+m)-bit word. An addition circuit 340 may process the output of multiplication circuit 330 and may further add an accumulator (“accumulator in”) and a carry (“carry in”) from one or more of the preceding circuits. The resulting (M+m)-bit word may be split between a carry buffer 350 (which may be a flip-flop memory or any other suitable memory device) and an accumulator buffer. For example, the high M-bit word of the result may be stored in carry buffer 350 while the low m-bit word of the result may be stored in an accumulator buffer 360. The content of accumulator buffer 360 may then be passed on (e.g., at the beginning of the next computational cycle) to a next processing element that processes the words of the same significance. The content of carry buffer 350 may be passed on (“carry out”) to a processing element that processes words of a higher significance, as described in more detail below in relation to
In some implementations, an operation performed by cryptographic engine 200 may be a modular multiplication that uses one of special prime moduli p, such as one of Solinas primes (e.g., p=2^192−2^64−1, p=2^384−2^128−2^96+2^32−1), Mersenne primes, Crandall primes, and other simple primes. In such implementations, as depicted with dashed arrows, modular reduction may be performed for each word of the result (product) without waiting for other words of higher significance to be processed. For example, the last processing element that completes computations of the k-th least significant word of the result may perform modular reduction of said word using a special prime unit 370. In a signed-binary representation, such special prime values p have only a few nonzero digits (±1), which are separated by 31 or more bits of 0. As a result, modular reduction may be performed with one of the known algorithms that use several additions and subtractions, which may be implemented with addition circuits and shifting circuits (e.g., linear feedback shift register) that are part of special prime unit 370. An output of modular reduction performed by special prime unit 370 may be added by an addition circuit 342 and output as a new carry value. In those instances where processing element 300 computes an intermediate value of a word of the result, output data may be directed to accumulator buffer 360 and used in the next cycle (e.g., by other processing elements).
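As a hedged illustration of why special primes permit reduction with only shifts and additions, the following Python sketch reduces a number modulo the Solinas prime p=2^192−2^64−1 by repeatedly folding the bits above position 192, using the congruence 2^192 ≡ 2^64+1 (mod p). The sketch operates on whole integers rather than word by word, so it does not model the per-word flow through special prime unit 370; the function name reduce_p192 is hypothetical.

```python
P192 = (1 << 192) - (1 << 64) - 1     # Solinas prime 2^192 - 2^64 - 1

def reduce_p192(x):
    """Reduce a non-negative integer modulo P192 using shifts and additions only."""
    mask = (1 << 192) - 1
    while x >> 192:                    # fold while any bits remain above 2^192
        hi, lo = x >> 192, x & mask
        x = lo + hi + (hi << 64)       # hi * 2^192 is congruent to hi * (2^64 + 1)
    if x >= P192:                      # x < 2^192 < 2*P192, so one subtraction suffices
        x -= P192
    return x
```

No division or general multiplication appears in the loop, which is the property that makes such moduli attractive for a small dedicated unit.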
For the sake of illustration but not limitation, operations depicted in
is, generally, a 16-word number A=A15 . . . A0, each word having m bits.
The following notations are used in
During cycle 1, PE 222 may receive the low (least significant) word X0 of multiplier, and two low words Y1Y0 of multiplicand, and compute the product X0·Y1Y0, which is (generally) a three-word number. The low word of X0·Y1Y0 represents the low word A0 of the product A and may be stored in one of memory units (as depicted schematically by symbol A0 next to PE 222 box in cycle 1). The high two words of the product X0·Y1Y0 may be stored (buffered) in PE 222 as a carry (e.g., in carry buffer 350 in
During cycle 2, PE 222 may provide the stored carry and two low words Y1Y0 of the multiplicand to PE 224, load the next two words Y3Y2 of the multiplicand, and multiply the previously loaded low word X0 of the multiplier by the new words Y3Y2 of the multiplicand. PE 222 may then compute X0·Y3Y2, buffer a new carry (two high words of X0·Y3Y2) until the next cycle (e.g., in carry buffer 350) and provide the accumulator value (the low word of X0·Y3Y2) to PE 224 (as indicated by the solid arrow). Additionally, during the same cycle 2, PE 224 may load the next word X1 of the multiplier from the memory and receive two words Y1Y0 of the multiplicand from PE 222 (as well as the respective carry), as depicted schematically with the dashed arrow. PE 224 may further receive the accumulator value computed by PE 222 during the same cycle 2. PE 224 may then add the received two-word carry and one-word accumulator to the computed product X1·Y1Y0. PE 224 may buffer the high two words of the obtained result as the next carry (to be passed on to PE 226 in cycle 3), and may store a low word A1 of the result as the next word of the product A. In some implementations, the addition operation performed by PE 224 may be done by a multi-way addition circuit (e.g., addition circuit 340) capable of adding more than two numbers per cycle; e.g., adding X1·Y1Y0+carry+accumulator value in one operation. In some implementations, the addition circuit may be configured to perform multiple consecutive additions of two numbers over one cycle, e.g., obtaining a first sum X1·Y1Y0+carry during the first operation and then adding the accumulator value to the first sum during the second operation (or in any other order).
Similar streaming computations may be performed in subsequent cycles, as depicted. In cycle k, PE 222 passes two words Y_{2k−3}Y_{2k−4} of the multiplicand (loaded during cycle k−1) and the two-word carry (computed during cycle k−1) to PE 224 and loads the next two words Y_{2k−1}Y_{2k−2} of the multiplicand. Similarly, other PEs pass previously processed multiplicand words (and computed carries) to the next PE. In addition, during cycle k≤M, the k-th processing element loads the multiplier word X_{k−1} from memory and multiplies it by Y1Y0. During cycle k, products X_j·Y_{2k−2j−1}Y_{2k−2j−2} with different j are computed by different PEs. Because there are twice as many words of the multiplier to load as there are PEs in PL 220, computations do not stop after the processing reaches the last PE 228 of PL 220. For the next three cycles, computations are shared by PL 220 and PL 230, with multiplicand words, accumulators, and carries streamed from PL 220 to PL 230. Starting from cycle 8, processing is performed solely by PL 230.
At the end of each cycle k≤8, the word Ak−1 of the product A is determined (and stored in one of the memory circuits). At the end of cycle k>8, the low word of the result of multiplication X7·Y3Y2 (plus the received carry and accumulator value) may be passed to an addition circuit that may add the carry from the last block of cycle 8 (as depicted by the downward dashed arrow). The low two words of the sum represent the words A9A8 of the final product A and are stored in memory (e.g., together with previously computed words Aj). The high word of the sum is retained in the addition circuit. At the end of each subsequent cycle, the addition circuit adds a new two-word carry from the previous cycle (vertical dashed arrows) and a new one-word accumulator (horizontal solid arrows) to the previously stored high word, identifies the new two low words as the next two words of the final product A and so on. After cycle 11 (upon computing the last multiplication X7·Y7Y6) both the high word and the low word of the last addition operation are stored as the last two words of the final product, A15A14.
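The cycle-by-cycle streaming described above can be approximated with a short software model. The Python sketch below is not the disclosed circuit; it is a hypothetical simulation built on the following assumptions: processing element j retains multiplier word X_j, handles the i-th two-word chunk of the multiplicand in cycle i+j+1, receives the carry of the upstream element from the previous cycle, and receives the accumulator word of the upstream element from the same cycle, aligned one word higher. Carries are kept as unbounded Python integers rather than fixed two-word values, and the final addition circuit is modeled as a plain running sum.

```python
def systolic_multiply(x_words, y_words, w=64):
    """Model the streaming flow; returns (product, index of the last cycle)."""
    n = len(x_words)
    assert len(y_words) % 2 == 0, "multiplicand must have an even number of words"
    # Two-word multiplicand chunks Y1Y0, Y3Y2, ...
    chunks = [y_words[i] | (y_words[i + 1] << w) for i in range(0, len(y_words), 2)]
    c = len(chunks)
    mask = (1 << w) - 1
    low = [[0] * c for _ in range(n)]     # low[j][i]: accumulator word from PE j, chunk i
    high = [[0] * c for _ in range(n)]    # high[j][i]: carry from PE j, chunk i
    product = 0
    last_cycle = n + c - 1
    for cycle in range(1, last_cycle + 1):
        for j in range(n):                # upstream PEs are evaluated first
            i = cycle - 1 - j             # chunk seen by PE j in this cycle
            if not 0 <= i < c:
                continue
            carry_in = high[j - 1][i] if j > 0 else 0
            acc_in = low[j - 1][i + 1] if j > 0 and i + 1 < c else 0
            total = x_words[j] * chunks[i] + carry_in + (acc_in << w)
            low[j][i], high[j][i] = total & mask, total >> w
            pos = j + 2 * i               # word position of the low word
            if i == 0 or j == n - 1:      # finished word, or output of the last PE
                product += low[j][i] << (w * pos)
            if j == n - 1:                # last PE hands its carry to the final adder
                product += high[j][i] << (w * (pos + 1))
    return product, last_cycle
```

With eight multiplier words and eight multiplicand words, this model finishes in cycle 11, when the last element multiplies X7 by Y7Y6.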
In the example illustrated in
As depicted in
Operations illustrated in
The systolic array architecture illustrated in
For example, in a synchronous memory access system, in which equal number of words of multiplicand and multiplier are loaded, each processing element may include (or have access to) a synchronizer buffer (not shown in
As can be seen from Table 1, during cycle 1, multiplier word X0 is loaded into buffer of PE 222, multiplier word X1 is loaded into buffer of PE 224, and multiplicand words Y1 and Y0 are loaded into PE 222 for processing, e.g., multiplication X0·Y1Y0. (In some implementations, the multiplicand words Y1 and Y0 may first be loaded into a staging register of PE 222 prior to processing). During cycle 2, multiplier word X2 is loaded into buffer of PE 222, multiplier word X3 is loaded into buffer of PE 224, multiplicand words Y3 and Y2 are loaded into PE 222, and multiplier word X1 is moved from buffer of PE 224 to processing by PE 224 (multiplication X1·Y1Y0). Similarly, during cycle 3, multiplier word X2 is moved from buffer of PE 222 into PE 226, multiplier word X3 is moved from buffer of PE 224 into buffer of PE 228, and multiplicand words Y5 and Y4 are loaded into PE 222. During cycle 4, multiplier word X3 is moved from buffer of PE 228 to processing by PE 228 (multiplication X3·Y1Y0), and so on. A similar loading sequence may be followed for other processing elements not shown in Table 1. As a result, multiplier words are delivered to every second processing element (e.g., PE 224, PE 228, etc.) one cycle before the words are used for multiplication (with buffers holding data for one cycle), whereas multiplier words are delivered to other processing elements (e.g., PE 222, PE 226, etc.) during the same cycle in which the words are used in multiplications.
Depicted with brackets, e.g., [X0], [X1], are multiplier words that may optionally be loaded as shown, as the corresponding values are not used by the respective (or subsequent) processing elements. For example, [X0] may be loaded (e.g., for the uniformity of the data flow) or not loaded (for reduced power consumption) into buffer of PE 226 during cycle 2 with X0 not used by PE 226 (or other downstream PEs). While Table 1 indicates one possible way of buffering data for gear ratio 1:2 operations, it should be understood that multiple other data management schemes may achieve similar functionality. For example, instead of using single-word buffers with every processing element, in some implementations, double-word buffers may be used with every second processing element (e.g., PE 224, PE 228, etc.).
Computations performed by the processing lanes and processing elements illustrated in
Because computations modulo p require finding a remainder of a (computationally heavy) division operation, in some implementations a Montgomery reduction may be used. To find A=X·Y mod p, the multiplier X and the multiplicand Y can first be transformed into the Montgomery domain, X mod p→
Using the Montgomery representation, any number of consecutive multiplications (and additions/subtractions) may be performed directly in the Montgomery domain without the need to perform any division operations (other than bit shifting) with only the final output transferred back from the Montgomery domain. Such a transformation may be performed as one additional Montgomery reduction.
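As an illustration of chaining operations in the Montgomery domain, the following Python sketch performs a modular exponentiation in which every intermediate multiplication stays in the Montgomery representation and only the final value is converted back by one additional reduction. It uses the textbook form of Montgomery reduction and assumes that the auxiliary number is s=−p^−1 mod R and that the conversion constant is R^2 mod p; the function names are hypothetical, and Python 3.8 or later is assumed.

```python
def make_montgomery(p, r_bits):
    """Build Montgomery helpers for an odd modulus p < R = 2**r_bits."""
    R = 1 << r_bits
    mask = R - 1
    s = (-pow(p, -1, R)) % R           # assumed auxiliary number: -p^-1 mod R
    r2 = (R * R) % p                   # assumed constant for entering the domain

    def redc(a):
        """Montgomery reduction: returns a * R^-1 mod p for 0 <= a < p * R."""
        b = ((a & mask) * s) & mask    # reduction factor; only low bits are needed
        t = (a + b * p) >> r_bits      # low r_bits of a + b*p are zero, so shift
        return t - p if t >= p else t

    def to_mont(x):
        return redc(x * r2)            # x * R mod p

    def mont_mul(x_bar, y_bar):
        return redc(x_bar * y_bar)     # product stays in the Montgomery domain

    return redc, to_mont, mont_mul

def mont_pow(base, exponent, p, r_bits):
    """Square-and-multiply with all intermediate values in the Montgomery domain."""
    redc, to_mont, mont_mul = make_montgomery(p, r_bits)
    base_bar = to_mont(base % p)
    acc = to_mont(1)
    for bit in bin(exponent)[2:]:
        acc = mont_mul(acc, acc)
        if bit == "1":
            acc = mont_mul(acc, base_bar)
    return redc(acc)                   # one extra reduction leaves the domain
```

The only division-like step is the right shift by r_bits, which is why no general division circuit is needed inside the loop.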
For the sake of illustration but not limitation, operations depicted in
in which both the multiplicand and the multiplier may be numbers in the Montgomery representation. (Bars over the letters, indicating the Montgomery representation, are being omitted for the sake of conciseness). Based on the computed product A, a reduction factor
is computed. As described in more detail below, computation of the reduction factor B may be split (for additional efficiency) between PL 220 and PL 230. (Multiplications used for determining words of B are depicted with shaded blocks.) Based on the computed reduction factor B, a product B·p is computed. Finally, an addition circuit (which may be a part of one of the processing elements, e.g., PE 238, or a separate addition circuit) computes the sum A+B·p and reduces the computed sum by radix R, e.g., by bit shifting, to remove the log₂ R least significant bits of the sum (which have value 0).
The operations involved in computations of the product A=X·Y are performed similarly to operations of
where the words indicated by strikethroughs are inconsequential and may be omitted. For example, during computation of A3·s1s0, the high word of the auxiliary number s need not be loaded (or a null word may be loaded) and the same multiplication may be performed as A3·s0.
In some implementations, all six multiplications in the computation of B mod r4 may be performed by PL 230. This may extend the total process of Montgomery reduction by an additional cycle. Also, in such implementations, PL 230 is performing significantly more computations (e.g., six multiplications) than PL 220. To enhance the uniformity of the flow of data, in some implementations (as depicted in
More specifically, the low word B0 may be computed in two multiplications, A0·s1s0 and A0·s3s2 (e.g., as the low word of the sum of these two products). These two multiplications may be performed during a cycle (e.g., cycle 3) that is subsequent to (e.g., immediately after) a cycle in which word A0 is computed (e.g., cycle 2). As depicted, multiplication A0·s3s2 may be performed by PL 220 while multiplication A0·s1s0 may be performed by PL 230. Similarly, two multiplications, A1·s1s0 and A1·s3s2, that determine the next word B1 may be performed in the cycle (e.g., cycle 4) that is after a cycle in which word A1 is computed. Multiplication A1·s3s2 may be performed by PL 220 while multiplication A1·s1s0 may be performed by PL 230. As depicted, to facilitate passage of multiplicands between PEs within each processing lane, the four multiplications that have s1s0 as multiplicands may be performed by PL 230 while the two multiplications that have s3s2 as multiplicands may be performed by PL 220. Additionally, the multiplicand s3s2 may be loaded into PE 222 and passed through the PEs of PL 220, similarly to other multiplicands (e.g., Yj+1Yj and pj+1pj). The first two operations with the multiplicand s3s2 may be null multiplications: 0·s3s2. Some data may be passed between PL 220 and PL 230, e.g., accumulator value and carry obtained by PE 226 during computation of A0·s3s2 may be passed to PE 232. Similarly, accumulator value and carry obtained by PE 228 during computation of A1·s3s2 may be passed to PE 234, as depicted by the respective arrows.
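The six partial products named above can be checked numerically with a short Python sketch. The sketch rests on assumptions that are not stated explicitly in the surviving text: that the reduction factor is taken modulo the fourth power of the word radix (here 2^64), and that the auxiliary number is s=−p^−1 modulo that same power. The function name and the test modulus are hypothetical, and the assignment of products to lanes is not modeled.

```python
def reduction_factor_low_words(a_words, s_words, w=64):
    """Combine the six partial products that determine the four low words of B."""
    a0, a1, a2, a3 = a_words[:4]
    s10 = s_words[0] | (s_words[1] << w)       # two-word multiplicand s1s0
    s32 = s_words[2] | (s_words[3] << w)       # two-word multiplicand s3s2
    partial_products = [
        (a0, s10, 0), (a0, s32, 2),            # contributions starting at words 0 and 2
        (a1, s10, 1), (a1, s32, 3),            # contributions starting at words 1 and 3
        (a2, s10, 2),                          # A2 * s3s2 would land above word 3
        (a3, s10, 3),                          # only the s0 part of s1s0 survives here
    ]
    total = sum((a * s) << (w * pos) for a, s, pos in partial_products)
    return total % (1 << (4 * w))              # words above the fourth are dropped
```

Products that would contribute only at or above the fifth word are never formed, which is why six multiplications suffice instead of eight.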
The word B0 is determined by PE 232 in cycle 3; the word B1 is determined by PE 234 in cycle 4; the word B2 is determined by PE 236 in cycle 5; and the word B3 is determined by PE 238 in cycle 6. The determined words Bj may be retained in the multiplier buffers of the respective PEs and used in the next (e.g., four) cycles with different multipliers pj+1pj of the modulus. The product B·p determined by PL 230 may then be added to the value A determined by PL 220 and the reduction modulo radix R may be performed (e.g., by bit shifting).
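The arithmetic carried out by the two lanes can be summarized, at the level of whole numbers rather than words, with the following minimal sketch. It assumes the textbook Montgomery setup, in which the auxiliary number is s = −p⁻¹ mod R, the radix is R = 2^r_bits, and the modulus p is odd; the function name `montgomery_reduce` and its parameters are hypothetical and do not describe the hardware interface.

```python
def montgomery_reduce(a, p, r_bits):
    """Return (a + b*p) / R with b = a*s mod R and R = 2**r_bits.

    Assumptions (standard Montgomery reduction, not taken from the text):
    p is odd and s = -p^{-1} mod R. Requires Python 3.8+ for pow(p, -1, R).
    """
    radix = 1 << r_bits
    s = (-pow(p, -1, radix)) % radix    # auxiliary number s
    b = (a * s) % radix                 # reduction factor B (low words only)
    total = a + b * p                   # low r_bits bits are zero by construction
    return total >> r_bits              # exact division by R via bit shift
```

Because the low r_bits bits of A + B·p are zero, the final shift loses no information; the returned value is congruent to A·R⁻¹ modulo p and lies below 2p whenever A < R·p.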
In some implementations, the multiplier X may be longer than four words (with each word representing a size of a portion of the multiplier that a processing element can handle per cycle), e.g., 4k words, with some integer k>1. In such implementations, the multiplication operation may be performed in k iterations. In each iteration, four words of the multiplier may be processed, an accumulator value may be stored, and a Montgomery reduction (e.g., by R=2^r, where r is the number of bits in the four words) may be performed. Each iteration may be performed by one PL (e.g., for special primes) or two PLs (e.g., for general primes), with the next iteration performed by the next one or two PLs, and so on.
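The iteration scheme described above can be modeled in software as an interleaved Montgomery multiplication. The sketch below is an illustration under stated assumptions, not the disclosed hardware flow: it treats each four-word group of X as one chunk of r_bits bits, uses the textbook auxiliary number s = −p⁻¹ mod R, requires an odd modulus p and Y < p, and uses the hypothetical name `mont_mul_chunked`.

```python
def mont_mul_chunked(x, y, p, k, r_bits):
    """Return a value congruent to x*y*R^(-k) mod p, where R = 2**r_bits.

    x is consumed in k chunks of r_bits bits (four words per chunk in the
    text). Assumes p is odd, y < p, and x < R**k. Requires Python 3.8+.
    """
    radix = 1 << r_bits
    s = (-pow(p, -1, radix)) % radix
    acc = 0
    for i in range(k):
        chunk = (x >> (i * r_bits)) & (radix - 1)   # next four words of X
        acc += chunk * y                            # multiply-accumulate step
        b = (acc * s) % radix                       # reduction factor for this iteration
        acc = (acc + b * p) >> r_bits               # Montgomery reduction by R
    return acc
```

With these assumptions the stored accumulator stays below 2p after every iteration, so it does not grow with k; a final conditional subtraction of p (omitted here) would bring the result into the range [0, p).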
A cryptographic engine or processor that performs methods 600 and 700 may include a systolic array having a plurality of processing lanes. In a systolic array, various data, such as operands (e.g., words of multiplier and multiplicand), accumulator values, carry values, and other lane outputs, may be passed along a direction that may be set by a control unit of the cryptographic processor, e.g., from PL 220 to PL 230, from PL 230 to PL 240, and from PL 240 to PL 250 (or vice versa), as shown in
Each PE may be configured to multiply two numbers to obtain a multiplication product of the two numbers. In some implementations, the two numbers may include a 32-bit number and a 64-bit number, a 64-bit number and a 128-bit number, two 32-bit numbers, two 64-bit numbers, two 128-bit numbers, or any other suitable numbers. In some implementations, each PE may include an addition circuit (e.g., addition circuit 340 in
The control unit of the cryptographic processor may cause one or more input numbers to be selectively input into any of the plurality of PLs. For example, numbers X and Y may be input into PL 220 while numbers U and V may be input into PL 230. In some instances, numbers X and Y may be input into PL 220 and number U may be input into PL 230 while number Y is passed to PL 230 from PL 220. Similarly, the control unit may cause one or more output numbers to be selectively output by any of the plurality of PLs. For example, in some instances, the product X·Y may be output by PL 220 and stored in the memory. In other instances, the product X·Y may be passed to PL 230 for further processing, and in yet other instances, one part (e.g., a low word) of the product X·Y may be stored in the memory while another part (e.g., a high word) of the same product may be passed to PL 230 for further processing. In some implementations, the systolic array may include N PLs and may be configured (during performance of some tasks) to perform M parallel multiplication operations. More specifically, each set of N/M PLs may be performing a respective one of the M parallel multiplication operations.
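The grouping of N processing lanes into M parallel multiplication operations can be pictured with a short sketch. The function name `partition_lanes` is hypothetical, and representing lanes by their reference numerals is purely an illustrative choice; the sketch also assumes that M divides N evenly and that each group consists of adjacent lanes.

```python
def partition_lanes(lanes, m):
    """Split a list of lane identifiers into m equal groups of adjacent lanes."""
    if m <= 0 or len(lanes) % m != 0:
        raise ValueError("number of lanes must be a positive multiple of m")
    size = len(lanes) // m              # N/M lanes per parallel operation
    return [lanes[i * size:(i + 1) * size] for i in range(m)]


# Four lanes running two parallel multiplications, two lanes each.
groups = partition_lanes([220, 230, 240, 250], 2)
```

Here `groups` is `[[220, 230], [240, 250]]`, i.e., each pair of lanes would handle one of the two multiplications.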
At block 620, method 600 may continue with processing a first set of words of the multiplier (e.g., X0, X1, X2, X3) using a first PL of the plurality of PLs, wherein each PE of the first PL is processing a respective word of the first set of words of the multiplier. For example, PE 222 in
At block 640, method 600 may include processing sequentially each word of the multiplicand by each PE of the first PL. For example, as illustrated in
At block 650, method 600 may continue with obtaining, based on the processing of the first set of words (e.g., X0, X1, X2, X3) of the multiplier by the first PL and the processing of each word Yj of the multiplicand by the first PL, a product of the multiplier and the multiplicand. In the instances of the joint multiplication operations, obtaining the product of the multiplier and the multiplicand may be further based on the processing of the second set of words (e.g., X4, X5, X6, X7) of the multiplier by the second PL and the processing of each word Yj of the multiplicand by the second PL. The product of the multiplier and the multiplicand may be represented with a set of accumulator words (e.g., A0, A1, . . . ) determined by various PLs and PEs.
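The data flow of blocks 620 through 650 can be approximated with a word-level software model in which each PE holds one multiplier word Xi and the multiplicand words Yj arrive one at a time. The sketch below is not cycle-accurate and is not the disclosed circuit: it assumes a 16-bit word, processes the PEs sequentially rather than in a pipelined fashion, and uses the hypothetical names `lane_multiply`, `to_words`, and `from_words`.

```python
W = 16                      # assumed word width in bits (illustrative only)
WORD_MASK = (1 << W) - 1


def to_words(x, n):
    """Split integer x into n words, least significant word first."""
    return [(x >> (W * i)) & WORD_MASK for i in range(n)]


def from_words(words):
    """Reassemble an integer from words, least significant word first."""
    return sum(w << (W * i) for i, w in enumerate(words))


def lane_multiply(x_words, y_words):
    """Multiply two multi-word numbers and return the accumulator words."""
    n = len(x_words)
    acc = [0] * (n + len(y_words))
    for j, yj in enumerate(y_words):        # multiplicand words stream in
        carry = 0
        for i, xi in enumerate(x_words):    # PE i holds multiplier word X_i
            t = acc[i + j] + xi * yj + carry
            acc[i + j] = t & WORD_MASK      # accumulator word kept at this position
            carry = t >> W                  # carry handed to the next PE
        acc[j + n] = carry                  # top carry becomes a new accumulator word
    return acc
```

For a longer multiplier, a second lane would hold the words X4 through X7 and receive the same Yj stream together with the carries and accumulator values of the first lane; in this model that is equivalent to passing a longer `x_words` list.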
In some implementations, at optional block 660, method 600 may include performing a Montgomery reduction of the obtained product of the multiplier and the multiplicand. For example, in those instances where a first subset of PLs (which may include one or more PLs) performed a multiplication operation (e.g., in conjunction with blocks 610-650), a second subset of PLs may perform the Montgomery reduction (or any other suitable way of performing a modular reduction) of the obtained product number. For example, PLs 220 and 230 may obtain a product of an eight-word multiplier X and a multiplicand Y (of an arbitrary length) and PLs 240 and 250 may determine a Montgomery-reduced value of the obtained product.
Method 700 may continue, at block 740, with computing, using the reduction factor, a Montgomery-reduced product of the first number and the second number. For example, the product of the first number and the second number (e.g., A) may be added to the product of the reduction factor and a modulus number p, and the sum may be divided by a Montgomery radix R: (A+B·p)/R. In some implementations, as illustrated with callout box 742, during computation of the Montgomery-reduced product of the first number and the second number, each word of the reduction factor (e.g., B) or each word of a modulus number (e.g., p) may be processed by a PE, designated for the respective word, of the second set of the plurality of PEs (e.g., PL 230). For example, as depicted in
Example computer system 800 may include a processing device 802 (also referred to as a processor or CPU), a main memory 804 (e.g., read-only memory (ROM), flash memory, dynamic random access memory (DRAM) such as synchronous DRAM (SDRAM), etc.), a static memory 806 (e.g., flash memory, static random access memory (SRAM), etc.), and a secondary memory (e.g., a data storage device 818), which may communicate with each other via a bus 830.
Processing device 802 represents one or more general-purpose processing devices such as a microprocessor, central processing unit, or the like. More particularly, processing device 802 may be a complex instruction set computing (CISC) microprocessor, reduced instruction set computing (RISC) microprocessor, very long instruction word (VLIW) microprocessor, processor implementing other instruction sets, or processors implementing a combination of instruction sets. Processing device 802 may also be one or more special-purpose processing devices such as an application specific integrated circuit (ASIC), a field programmable gate array (FPGA), a digital signal processor (DSP), network processor, or the like. In accordance with one or more aspects of the present disclosure, processing device 802 may be configured to execute instructions facilitating implementation of method 600 of a multiplication and method 700 of a Montgomery reduction performed on a cryptographic processor that operates in accordance with one or more aspects of the present disclosure.
Example computer system 800 may further comprise a network interface device 808, which may be communicatively coupled to a network 820. Example computer system 800 may further comprise a video display 810 (e.g., a liquid crystal display (LCD), a touch screen, or a cathode ray tube (CRT)), an alphanumeric input device 812 (e.g., a keyboard), a cursor control device 814 (e.g., a mouse), and an acoustic signal generation device 816 (e.g., a speaker).
Data storage device 818 may include a computer-readable storage medium (or, more specifically, a non-transitory computer-readable storage medium) 828 on which are stored one or more sets of executable instructions 822. In accordance with one or more aspects of the present disclosure, executable instructions 822 may comprise executable instructions implementing method 600 of a multiplication and method 700 of a Montgomery reduction performed on a cryptographic processor that operates as described above.
Executable instructions 822 may also reside, completely or at least partially, within main memory 804 and/or within processing device 802 during execution thereof by example computer system 800, main memory 804 and processing device 802 also constituting computer-readable storage media. Executable instructions 822 may further be transmitted or received over a network via network interface device 808.
While the computer-readable storage medium 828 is shown in
Some portions of the detailed descriptions above are presented in terms of algorithms and symbolic representations of operations on data bits within a computer memory. These algorithmic descriptions and representations are the means used by those skilled in the data processing arts to most effectively convey the substance of their work to others skilled in the art. An algorithm is here, and generally, conceived to be a self-consistent sequence of steps leading to a desired result. The steps are those requiring physical manipulations of physical quantities. Usually, though not necessarily, these quantities take the form of electrical or magnetic signals capable of being stored, transferred, combined, compared, and otherwise manipulated. It has proven convenient at times, principally for reasons of common usage, to refer to these signals as bits, values, elements, symbols, characters, terms, numbers, or the like.
It should be borne in mind, however, that all of these and similar terms are to be associated with the appropriate physical quantities and are merely convenient labels applied to these quantities. Unless specifically stated otherwise, as apparent from the following discussion, it is appreciated that throughout the description, discussions utilizing terms such as “identifying,” “determining,” “storing,” “adjusting,” “causing,” “returning,” “comparing,” “creating,” “stopping,” “loading,” “copying,” “throwing,” “replacing,” “performing,” or the like, refer to the action and processes of a computer system, or similar electronic computing device, that manipulates and transforms data represented as physical (electronic) quantities within the computer system's registers and memories into other data similarly represented as physical quantities within the computer system memories or registers or other such information storage, transmission or display devices.
Examples of the present disclosure also relate to an apparatus for performing the methods described herein. This apparatus may be specially constructed for the required purposes, or it may be a general purpose computer system selectively programmed by a computer program stored in the computer system. Such a computer program may be stored in a computer readable storage medium, such as, but not limited to, any type of disk including optical disks, CD-ROMs, and magneto-optical disks, read-only memories (ROMs), random access memories (RAMs), EPROMs, EEPROMs, magnetic disk storage media, optical storage media, flash memory devices, other types of machine-accessible storage media, or any type of media suitable for storing electronic instructions, each coupled to a computer system bus.
The methods and displays presented herein are not inherently related to any particular computer or other apparatus. Various general purpose systems may be used with programs in accordance with the teachings herein, or it may prove convenient to construct a more specialized apparatus to perform the required method steps. The required structure for a variety of these systems will appear as set forth in the description below. In addition, the scope of the present disclosure is not limited to any particular programming language. It will be appreciated that a variety of programming languages may be used to implement the teachings of the present disclosure.
It is to be understood that the above description is intended to be illustrative, and not restrictive. Many other implementation examples will be apparent to those of skill in the art upon reading and understanding the above description. Although the present disclosure describes specific examples, it will be recognized that the systems and methods of the present disclosure are not limited to the examples described herein, but may be practiced with modifications within the scope of the appended claims. Accordingly, the specification and drawings are to be regarded in an illustrative sense rather than a restrictive sense. The scope of the present disclosure should, therefore, be determined with reference to the appended claims, along with the full scope of equivalents to which such claims are entitled.
This application is a 371 application of International Application No. PCT/US2022/037206, filed Jul. 14, 2022, which claims the benefit of U.S. Provisional Patent Application No. 63/203,469, filed Jul. 23, 2021, which is incorporated herein by reference.
| Filing Document | Filing Date | Country | Kind |
|---|---|---|---|
| PCT/US2022/037206 | 7/14/2022 | WO | |

| Number | Date | Country |
|---|---|---|
| 63203469 | Jul 2021 | US |