This disclosure relates to area-efficient circuitry of an integrated circuit to perform iterative modular multiplication.
This section is intended to introduce the reader to various aspects of art that may be related to various aspects of the present disclosure, which are described and/or claimed below. This discussion is believed to be helpful in providing the reader with background information to facilitate a better understanding of the various aspects of the present disclosure. Accordingly, it may be understood that these statements are to be read in this light, and not as admissions of prior art.
Integrated circuits are found in numerous electronic devices and provide a variety of functionality. Many integrated circuits include arithmetic circuit blocks to perform arithmetic operations such as addition and multiplication. For example, a digital signal processing (DSP) block may supplement programmable logic circuitry in a programmable logic device, such as a field programmable gate array (FPGA). Programmable logic circuitry and DSP blocks may be used to perform numerous different arithmetic functions.
As cryptographic and blockchain applications become increasingly prevalent, integrated circuits are increasingly used to compute very large combinatorial functions. Verifiable delay functions (VDFs), for example, are used in blockchain and cryptocurrency operations. Cryptographic puzzles, such as the CSAIL2019 puzzle, also involve solving a large number of intrinsically sequential computations (computations that cannot be parallelized), with each iteration performing a very large arithmetic operation. Existing VDFs are too slow on central processing units (CPUs). And application-specific integrated circuits (ASICs) cannot readily keep up with the rapidly changing VDF specifications for various blockchain and cryptocurrency applications. Yet FPGA solutions, while flexible, are very logic-intensive in a way that introduces timing closure problems and are very power hungry.
Various aspects of this disclosure may be better understood upon reading the following detailed description and upon reference to the drawings in which:
One or more specific embodiments will be described below. In an effort to provide a concise description of these embodiments, not all features of an actual implementation are described in the specification. It should be appreciated that in the development of any such actual implementation, as in any engineering or design project, numerous implementation-specific decisions must be made to achieve the developers’ specific goals, such as compliance with system-related and business-related constraints, which may vary from one implementation to another. Moreover, it should be appreciated that such a development effort might be complex and time consuming, but would nevertheless be a routine undertaking of design, fabrication, and manufacture for those of ordinary skill having the benefit of this disclosure.
When introducing elements of various embodiments of the present disclosure, the articles “a,” “an,” and “the” are intended to mean that there are one or more of the elements. The terms “comprising,” “including,” and “having” are intended to be inclusive and mean that there may be additional elements other than the listed elements. Additionally, it should be understood that references to “one embodiment” or “an embodiment” of the present disclosure are not intended to be interpreted as excluding the existence of additional embodiments that also incorporate the recited features.
Verifiable delay functions (VDFs) have gained widespread use in cryptocurrency and blockchain applications. Many VDFs employ modular multiplication operations. Polynomial modular multiplication involves two parts - multiplication and modular reduction, sometimes also referred to as multiplicative expansion and division reduction. This disclosure describes a circuit that can calculate modular multiplication of any suitable precision (e.g., even many thousands of bits) using a multi-cycle implementation. This may allow any size of this type of operation to be implemented in a field-programmable gate array (FPGA). As such, it may outperform other processors such as central processing units (CPUs) and graphics processing units (GPUs) by many orders of magnitude. It is several times more arithmetically efficient (e.g., with respect to a number of operations per normalized precision squared) and about 10x as power efficient as any other FPGA solution presently known.
This solution benefits from many innovations, including:
Thus, the iterative multiplicative reduction circuit of this disclosure may provide more flexibility, higher performance, and lower power. The flexibility of the circuit may allow it to scale to any suitable future value or size of the modulus. Higher performance is also gained — several times more arithmetically dense than any known algorithm. Lower power consumption is also achieved, reaching about 10x lower power compared to the next fastest method.
In a configuration mode of the integrated circuit system 12, a designer may use an electronic device 13 (e.g., a computer) to implement high-level designs (e.g., a system user design) using design software 14, such as a version of INTEL® QUARTUS® by INTEL CORPORATION. The electronic device 13 may use the design software 14 and a compiler 16 to convert the high-level program into a lower-level description (e.g., a configuration program, a bitstream). The compiler 16 may provide machine-readable instructions representative of the high-level program to a host 18 and the integrated circuit system 12. The host 18 may receive a host program 22 that may control or be implemented by the kernel programs 20. To implement the host program 22, the host 18 may communicate instructions from the host program 22 to the integrated circuit system 12 via a communications link 24 that may include, for example, direct memory access (DMA) communications or peripheral component interconnect express (PCIe) communications. In some embodiments, the kernel programs 20 and the host 18 may configure programmable logic blocks 110 on the integrated circuit system 12. The programmable logic blocks 110 may include circuitry and/or other logic elements and may be configurable to implement a variety of functions in combination with digital signal processing (DSP) blocks 120.
The designer may use the design software 14 to generate and/or to specify a low-level program, such as the low-level hardware description languages described above. Further, in some embodiments, the system 10 may be implemented without a separate host program 22. Thus, embodiments described herein are intended to be illustrative and not limiting.
An illustrative embodiment of a programmable integrated circuit system 12 such as a programmable logic device (PLD) that may be configured to implement a circuit design is shown in
Programmable logic of the integrated circuit system 12 may contain programmable memory elements. Memory elements may be loaded with configuration data (also called programming data or a configuration bitstream) using input-output elements (IOEs) 102. Once loaded, the memory elements each provide a corresponding static control signal that controls the operation of an associated functional block (e.g., LABs 110, DSP 120, RAM 130, or input-output elements 102).
In one scenario, the outputs of the loaded memory elements are applied to the gates of metal-oxide-semiconductor transistors in a functional block to turn certain transistors on or off and thereby configure the logic in the functional block including the routing paths. Programmable logic circuit elements that may be controlled in this way include parts of multiplexers (e.g., multiplexers used for forming routing paths in interconnect circuits), look-up tables, logic arrays, AND, OR, NAND, and NOR logic gates, pass gates, etc.
The memory elements may use any suitable volatile and/or non-volatile memory structures such as random-access-memory (RAM) cells, fuses, antifuses, programmable read-only-memory cells, mask-programmed and laser-programmed structures, combinations of these structures, etc. Because the memory elements are loaded with configuration data during programming, the memory elements are sometimes referred to as configuration memory, configuration random-access memory (CRAM), or programmable memory elements. Programmable logic device (PLD) 100 may be configured to implement a custom circuit design. For example, the configuration RAM may be programmed such that LABs 110, DSP 120, and RAM 130, programmable interconnect circuitry (i.e., vertical channels 140 and horizontal channels 150), and the input-output elements 102 form the circuit design implementation.
In addition, the programmable logic device may have input-output elements (IOEs) 102 for driving signals off the integrated circuit system 12 and for receiving signals from other devices. Input-output elements 102 may include parallel input-output circuitry, serial data transceiver circuitry, differential receiver and transmitter circuitry, or other circuitry used to connect one integrated circuit to another integrated circuit.
The integrated circuit system 12 may also include programmable interconnect circuitry in the form of vertical routing channels 140 (i.e., interconnects formed along a vertical axis of the integrated circuit 100) and horizontal routing channels 150 (i.e., interconnects formed along a horizontal axis of the integrated circuit 100), each routing channel including at least one track to route at least one wire. If desired, the interconnect circuitry may include pipeline elements, and the contents stored in these pipeline elements may be accessed during operation. For example, a programming circuit may provide read and write access to a pipeline element.
Note that other routing topologies, besides the topology of the interconnect circuitry depicted in
The integrated circuit system 12 may be programmed to perform a wide variety of operations, including the iterative modular multiplication of this disclosure. As mentioned above, the iterative modular multiplication has a wide variety of uses, including many relating to cryptography, cryptocurrency, and blockchain applications. One example use case is as a solver of a crypto-puzzle known as CSAIL2019. Since this is a particularly challenging crypto-puzzle, this disclosure will describe many ways in which the iterative modular multiplication of this disclosure may be used to work to solve the CSAIL2019 puzzle. Indeed, since the iterative modular multiplication of this disclosure provides a dramatic step toward solving the CSAIL2019 puzzle, the iterative modular multiplication of this disclosure is well suited for numerous other commercial cryptography, cryptocurrency, and blockchain applications.
The cryptographic puzzle is specified as the computation of 2^(2^t) mod N, where
N=4748097547272012866175034130616773885051260744920056444867106196360710424558147654252707604941012311775892012567579064620536874633385055919001167621577710311366072057029421705135684303934811390137937802096433163959216892351184826691180016055198866796536230085523200683549066995672155839042282955591568494603061113292039044753843846484807112228389204239581712931108919820250218586352043897306238872025378193141111507426311444613498736315614218304761735541626997839036517728000688394015610618179768868342070395100147620295616695834440894241147905565567808298149024668527045239650145862092904119412874007763041042314287604772876861294417664020832796209135587181826458235580003825823724235800850160284850809737200983703552179354691863876044443377822439834079313578029085658078575731290244778595615229472411326831502667425768520006371752963274296294506063182258064362048788338392528266351511304921847854750642192694541125065873977.
While the full puzzle requires a solution for t = 2^56, CSAIL is also interested in solutions for t = 2^k for 56/2 ≤ k < 56. These intermediate solutions are called “milestone versions of the puzzle”. The iterative modular multiplication circuit of this disclosure has been tested and shown to be remarkably effective, reaching 21 milestone solutions in just the first six months of operation.
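For reference, the underlying computation may be sketched in software as follows (a direct rendering of the repeated-squaring definition; running it to completion for t = 2^56 is infeasible on a CPU, which motivates the hardware of this disclosure):

```python
def csail2019(N: int, t: int) -> int:
    """Compute 2^(2^t) mod N by t sequential modular squarings.

    Each squaring depends on the previous one, so the t iterations
    cannot be parallelized."""
    x = 2
    for _ in range(t):
        x = (x * x) % N
    return x
```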
Before describing the iterative modular multiplication circuit of this disclosure, a method known as the Ozturk method will be discussed briefly. While the Ozturk method of modular multiplication may work well in some circumstances, it may be ineffective under certain conditions, such as when the method is used with large word sizes (e.g., greater than a 1024-bit word size). As will be discussed in the next section, there are a number of ways to manage this. First, this disclosure will explore high-performance multi-cycle approaches, which may fit into FPGAs more readily. Second, this disclosure will describe more efficient ways of implementing the Ozturk algorithm. Indeed, DSP blocks are intrinsically more efficient than soft logic, as the functionality is already in ASIC form. This provides a strong basis for restating the Ozturk approach from a table-based to an arithmetic-based reduction operation. This section will briefly review the Ozturk approach, and the next section will describe the iterative modular multiplication of this disclosure, which uses the embedded FPGA DSP resources for a more efficient and higher performance result. The DSP-based approach of the iterative modular multiplication of this disclosure may be further implemented as an efficient multi-cycle version.
The large integer multiplication is implemented as a polynomial multiplication. The inputs A and B are unsigned integers represented in polynomial form as d + 1 radix-R digits, where R = 2^(w+1). Note that there is a one-bit overlap between consecutive digits.
From the radix-R digit notation, the polynomial notation (x = 2^w) follows: A = Ad·x^d + ... + A1·x + A0 and B = Bd·x^d + ... + B1·x + B0.
Here Ai, Bi are the coefficients of the polynomials, and correspond to the radix-R digits from the original representation. This is highlighted in
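As an illustration of this representation, a software sketch of the digit split and the polynomial evaluation follows (simplified to canonical, non-overlapping w-bit digits; the redundant form used by the hardware allows digits of up to w + 1 bits):

```python
def to_digits(X: int, w: int, d: int) -> list[int]:
    """Split X into d + 1 canonical w-bit digits (least significant first)."""
    return [(X >> (w * i)) & ((1 << w) - 1) for i in range(d + 1)]

def from_digits(digits: list[int], w: int) -> int:
    """Evaluate the polynomial at x = 2^w. Digits wider than w bits
    (the redundant form) are folded in automatically by the addition."""
    return sum(c << (w * i) for i, c in enumerate(digits))
```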
The product P of two degree d polynomials A and B is a degree 2d polynomial, which may undergo modular reduction to reduce back to a degree d polynomial.
Knowing that x = 2^w, the subproduct alignments are such that: the middle part of a subproduct Ai·Bj overlaps the low parts of the subproducts Ak·Bl where k + l = i + j + 1, and the high part of Ai·Bj overlaps the low parts of the subproducts Ak·Bl where k + l = i + j + 2.
These alignments can be observed in
A set of w-bit wide additions aligned on the column output sums 240 may be performed, creating modified polynomial coefficients 228 such that their maximum widths do not exceed w + 1. This is accomplished by a level of short adders 242 that sum the lower w bits of Di (Di mod 2^w) with the bits having weights larger than 2^w from Di-1 (Di-1 >> w). This propagation may only be implemented for i ≥ 2, as the column for i = 0 holds a single low subproduct part and does not produce any carry-out.
This level of adders 242 is depicted on the bottom of
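A software model of the column summation and the short-adder stage may make these alignments concrete (a sketch; function and variable names are illustrative, and the inputs are taken as lists of radix-2^w coefficients):

```python
def column_sums(A, B, w):
    """Accumulate the low/middle/high parts of each subproduct Ai*Bj
    into the columns they align with (x = 2^w)."""
    d = len(A) - 1
    D = [0] * (2 * d + 3)
    mask = (1 << w) - 1
    for i, Ai in enumerate(A):
        for j, Bj in enumerate(B):
            p = Ai * Bj                      # up to 2w + 2 bits wide
            D[i + j] += p & mask             # low part, column i + j
            D[i + j + 1] += (p >> w) & mask  # middle part, column i + j + 1
            D[i + j + 2] += p >> (2 * w)     # high part, column i + j + 2
    return D

def short_adders(D, w):
    """One level of short adders: Ci = (Di mod 2^w) + (D(i-1) >> w).
    Column 0 holds a single low part, so it never produces a carry-out."""
    mask = (1 << w) - 1
    return [(D[i] & mask) + ((D[i - 1] >> w) if i else 0)
            for i in range(len(D))]
```

As a quick check, from_digits(short_adders(column_sums(A, B, w), w), w) equals the integer product of the values that A and B represent.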
The product P 228 can be written in polynomial form, P = Σi Ci·x^i, with each coefficient Ci holding on w + 1 bits.
The second part of the modular multiplication 220 is modular reduction 224, which involves reducing the product P 228 (e.g., the polynomial output previously generated) modulo N.
In the context of modular exponentiation, an exact (e.g., fully reduced) M may not necessarily be required; rather, any equivalent M (i.e., any M congruent to P modulo N) is sufficient, as long as it meets a number of properties. One of these is that it be sufficiently easy to obtain; another is that the output have the same form as the polynomial multiplication inputs 200.
The following property is used for obtaining M:
We split P into two parts, a low part PL and a high part PH, such that P = PL + PH·2^(w(d+1)). The high part is composed of d + 1 radix 2^(w+1) digits. For each digit PH,i, the reduced value mod N is tabulated: Mi = (PH,i · x^(d+1+i)) mod N.
Additionally, each Mi can be viewed as a degree-d polynomial, with coefficients Mi,j that are radix-2^w digits. This allows for the following rewrite:
This results again in column-based summations, as shown in
Note that the output is still in redundant form, and a full-width addition (which would be expensive in both area and latency) would be required in order to return the output to standard form. But this may not matter when w + 1 is chosen to be the same width as the multiplier for each polynomial element (e.g., to match the 27-bit multipliers found in the DSP blocks 120). The columns of the next iteration are still w bits wide, and any additional word growth of each column, caused by the coefficients being w + 1 instead of w bits wide, is contained by the column-width (e.g., w-bit) additions at the end of the iteration. Hence, the maximum coefficient size will always remain at w bits.
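A simplified software model of this table-based reduction follows (a sketch using canonical w-bit digits and one full table per digit position; the hardware instead tabulates redundant radix-2^(w+1) digits in embedded memories):

```python
def make_tables(N, w, d):
    """T[i][v] = (v * 2^(w*(d+1+i))) mod N, for every digit value v."""
    return [[(v << (w * (d + 1 + i))) % N for v in range(1 << w)]
            for i in range(d + 1)]

def lut_reduce(P, N, w, d, T):
    """Return M congruent to P mod N, built from tabulated digit reductions."""
    split = w * (d + 1)
    low = P & ((1 << split) - 1)   # part already within the modulus width
    high = P >> split              # d + 1 digits to be reduced
    M = low
    for i in range(d + 1):
        digit = (high >> (w * i)) & ((1 << w) - 1)
        M += T[i][digit]           # tabulated (digit * x^(d+1+i)) mod N
    return M                       # still in redundant (not fully reduced) form
```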
While the method of
Equation (13) may be rewritten as:
Here, x^i mod N is a constant that may be precomputed. Mi is then calculated by simply multiplying Ci by that constant, which may be done using a layer of DSP blocks 120, as shown in
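A corresponding software sketch of this multiplicative reduction follows (illustrative; it reduces every coefficient for simplicity, whereas the hardware only needs to reduce coefficients that lie beyond the modulus width):

```python
def mult_reduce(C, N, w):
    """Sum of Ci * (x^i mod N): congruent to sum(Ci * 2^(w*i)) modulo N.
    Each term is one rectangular multiply (a layer of DSP blocks); the
    terms are then summed column-wise, as in the multiplier itself."""
    consts = [pow(2, w * i, N) for i in range(len(C))]  # precomputed constants
    return sum(Ci * k for Ci, k in zip(C, consts))
```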
Many blockchain and cryptographic applications, such as those providing the impetus for CSAIL2019, may use a larger value N than many VDF problems addressed in prior works. A fully parallel N-bit modular square operation for CSAIL2019 does not fit into even the biggest FPGAs available today.
An iterative approach saves FPGA area, but it also increases latency, and therefore could reduce overall design performance. A straightforward iterative modular multiplication mapping uses an iterative multiplication block followed by an iterative modular reduction block. The overall operation latency is therefore a sum of the multiplication latency and the modular reduction latency.
Instead, an improved iterative modular multiplication method has been developed where the iterative multiplication and the iterative modular reduction work in parallel. This is made possible by performing polynomial multiplication starting with most significant words — in effect, performing polynomial multiplication from left to right.
An example overview of a modular multiplication 270 is shown in
In the modular multiplication 270 of
Indeed, the first iteration of modular reduction 224 may start immediately after the first iteration of the polynomial multiplication 272. Consequently, the overall modular multiplication 270 operation latency is only slightly greater than the latency of a regular, non-modular multiplication.
The method may be summarized below:
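A software sketch of the method appears below for reference. It is reconstructed from the description that follows, and it uses full-width integers where the hardware uses redundant polynomial coefficients and dedicated reduction circuits:

```python
def modmul_msb_first(A, B, N, n, w):
    """A * B mod N, processing the limbs of B most significant first.
    The inputs are W-bit values, with W = n * w (n limbs of w bits)."""
    mask_n = (1 << (w * n)) - 1
    S = 0                                    # running accumulator of reduced terms
    M = 0                                    # lower n limbs of the running product
    for i in range(n - 1, -1, -1):
        Bi = (B >> (w * i)) & ((1 << w) - 1) # i-th limb of B
        T = (M << w) + A * Bi                # rectangular product plus feedback
        Pn = T >> (w * n)                    # upper limb, weight 2^(w*(n+i))
        M = T & mask_n                       # lower n limbs, kept for next pass
        S = (S + ((Pn << (w * (n + i))) % N)) % N  # reduce the upper limb mod N
    return (S + M) % N                       # final reduction of M, then combine
```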
Here, W-bit inputs A and B are subdivided into n limbs (e.g., bytes, words). On every iteration (loop index i is decremented from n - 1 down to 0), A is multiplied by Bi by a rectangular multiplier to produce an n + 1 limb rectangular product. The lower n limbs of the product are stored in variable M for use in the next iteration. The upper limb of that product (Pn) is sent to the multiplier-based modular reduction circuit, where it is reduced modulo N. The reduced value is then fed into the running accumulator S. Upon completion of the loop, one modular reduction is done in order to reduce M mod N, before constructing the final result Z.
Using this method, A∗B mod N may be calculated in n + 1 latency cycles (e.g., assuming multiplication and reduction each take one cycle) using n times fewer resources when compared with a fully parallel implementation. The direction of computation is from the most significant bits of B, or from left to right, rather than the classical (pen-and-paper) right-to-left multiplication.
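As a quick consistency check of the sketch above (illustrative parameters, far smaller than the CSAIL2019 operands):

```python
import random

# Small illustrative parameters: 8 limbs of 16 bits (128-bit operands).
w, n = 16, 8
N = random.getrandbits(w * n) | 1            # an odd 128-bit modulus
A, B = random.randrange(N), random.randrange(N)
assert modmul_msb_first(A, B, N, n, w) == (A * B) % N
```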
A hardware implementation of the iterative modular multiplier 270 is shown in
In the example of
Based on a selection signal Start, the selection circuitry 290 and 292 may provide new input polynomials into registers 294 and 296, respectively, or may maintain the polynomials A and B. The polynomial B is split into 8 limbs, with each limb having 16 coefficients (the most significant limb has only the 9 least significant coefficients populated, with the rest tied to zero). A limb shifter 298 may shift the limbs of the polynomial B so that the most significant limb 300 is the limb multiplied in each iteration.
A polynomial multiplier 302 component iteratively multiplies A by the limbs of B, starting from the most significant one, as previously explained in Algorithm 1. For the first iteration, the product flows through a polynomial adder 304 unaltered and is split into an upper part Pn stored in a first register 306 and a lower part {Pn-1, ..., P0} stored in a second register 308, which may be shifted by a shifter 310 (e.g., << 16) back onto the second input of the adder 304, to be summed with the next partial product A·Bi-1.
For each iteration, the high part of the sum Pn is propagated to the modular reduction 274 component. The modular reduction 274 component includes a circuit to perform DSP-based modular reduction 312 and a circuit to perform lookup table (LUT)-based modular reduction 314. The DSP-based modular reduction 312 outputs a 121-coefficient result that is fed into a polynomial accumulator 316 that includes a polynomial adder 318 and an accumulator register 320 that stores an accumulated value S. Selection circuitry 322 may pass either a value of 0 or all but the most significant bit of the most significant limb of {Pn-1, ..., P0} into the polynomial adder 318. On the last iteration, all but the most significant bit of the most significant limb of {Pn-1, ..., P0} also get added into S. The most significant bit of Pn-1 is passed through the LUT-based modular reduction 314, and gets added into S as well. The 3120-bit range offered by the 121-coefficient polynomials ensures that, at the output of the DSP-based modular reduction 312, no overflow can happen in the most significant coefficient when summing up 17 terms of 3104 bits. Even considering the 8 iterations involved in performing the full modular multiplication, the most significant limb contribution would not grow above 3111 bits, which is lower than 3120.
Any suitable modular reduction circuits 312 and 314 may be used. Because the relative weight of the Pn term changes with every iteration, the value that is to be reduced consequently changes. There is a similar challenge for multiplicative reduction in that every iteration involves a different constant. Yet here, the cost of tabulation may be much less and, in some cases, may be absorbed by the FPGA DSP blocks 120 themselves. One example is shown in
In actual cryptocurrency applications, there is a race to finish first. Calculation speed (over many billions of calculations) has an outsized effect on how quickly results are obtained. Overclocking, however, can introduce errors. To account for this, the iterative multiplicative reduction circuit of this disclosure may be overclocked without long-term problems due to a very fast method for checking the value of intermediate results.
Consider an example in which the circuit is overclocked by 10%, making the calculations from the circuit 10% faster, but causing the circuit to produce an error every 1,000,000 calculations. Since the circuit is operated iteratively, all calculations following an error will be wrong (e.g., the desired answer comes only after billions of iterations). But if correctness is checked offline (e.g., using a processor running in the FPGA such as a Nios processor, using a processor running apart from the FPGA, or using a separate computing system such as the host 18), then an error may be detected within a bounded number of iterations, and the computation may be resumed from a recent error-free checkpoint, so that only a small amount of work is lost.
Returning to the example where the overclock is 10%: suppose that checkpointing and recovery cost 2×10,000 iterations every 1,000,000 clock cycles. In so doing, iteration performance is reduced by 2%, but throughput is increased by 10%, which means the circuit is still 8% faster overall. In an actual implementation, the iterative multiplicative reduction circuit may be overclocked even more (e.g., by 20% or more in some cases).
For very long-running computations it is very useful to be able to detect errors early on. If an error goes undetected, then all computations performed after the error (which may be many years of computations) would be useless. On the other hand, with an error detection mechanism in place, the hardware can be safely overclocked (run using a clock frequency larger than what is reported by the design software, e.g., the Timing Analyzer of the Quartus software by Intel Corporation), relying on the error detection mechanism to catch errors caused by overclocking. In the unlikely event of an error, the system may be able to simply revert to a checkpoint state as a starting point. When the checkpoint state is saved every few minutes, if an error is detected, the system may revert to a starting point only several minutes old. This is insignificant in the case of what can be multi-month or multi-year runs.
Numerous approaches may be used to detect an error. In one example, the following approach may be used: instead of doing calculations modulo N, calculations modulo N′=NP may be performed, where P = 4294963787 is a 32-bit prime that produces the longest possible cycle L = (P - 3)/2 = 2147481892. Conversion of a value modulo N′ to a value modulo N involves taking the remainder modulo N of that value. Thus, operating mod N′ provides a way to check for errors in the calculations at any moment in time. The process involves comparing the result modulo P with the expected value as shown below. Note that K in Algorithm 2 represents the total number of modular multiplications (squarings) done so far.
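A software rendering of this check follows (a sketch; the expected residue is computed by reducing the exponent 2^K modulo P - 1, which is valid by Fermat's little theorem because the sequence of squarings starts from 2):

```python
P = 4294963787                   # the 32-bit prime from above; N' = N * P

def residue_check(s: int, K: int) -> bool:
    """Compare s mod P against the independently computed expected residue.
    Here s is the running value modulo N' after K squarings, starting from 2."""
    expected = pow(2, pow(2, K, P - 1), P)   # 2^(2^K) mod P, via Fermat
    return s % P == expected
```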
Running the error detection algorithm takes just a couple of seconds on a CPU. The probability of an undetected error is 1/2147481892, which is extremely small.
If there is no error (decision block 356), the calculations may continue to run and/or the stored checkpoint may be identified as error-free (block 358). If there is an error (decision block 356), however, the current run may be stopped (block 360), the most recent error-free checkpoint (e.g., the previous checkpoint) may be retrieved (block 362), and the run may be restarted using the values (e.g., output, index) from the most recent error-free checkpoint (block 364).
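The control flow of blocks 356-364 may be sketched as follows (illustrative; square_step and verify stand in for the hardware iteration and the offline residue check):

```python
def run_with_checkpoints(square_step, verify, total_iters, ckpt_every):
    """Periodically verify the running value; on an error, revert to the
    most recent error-free checkpoint and restart from there."""
    x, k = 2, 0
    checkpoint = (x, k)             # last known error-free state
    while k < total_iters:
        x = square_step(x)          # one (possibly overclocked) iteration
        k += 1
        if k % ckpt_every == 0:
            if verify(x, k):        # decision block 356
                checkpoint = (x, k) # block 358: mark checkpoint error-free
            else:
                x, k = checkpoint   # blocks 360-364: stop, revert, restart
    return x
```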
Recent designs for large modular multiplication contain a datapath organized as a “simple loop,” as shown in
An improved way to introduce pipelining into a very deep combinatorial design for FPGA may use inner loops. Indeed, the iterative multiplicative reduction circuit 270 shown in
For example, denote by X the inner loop iteration count (e.g., the execution stays in the inner loop for X = 8 clock cycles), while completing one iteration of the outer loop takes an additional Y clock cycles. The total number of clock cycles per iteration is therefore X + Y. Adding an extra pipeline stage into the outer loop (Y → Y + 1) increases the total number of clock cycles per iteration by 1 (X + Y → X + Y + 1). The relative increase in clock cycles required to compute one modular multiplication (e.g., outer loop) can be expressed as C1 = (X + Y + 1)/(X + Y). Adding an additional pipeline stage in the outer loop using an additional register 390, as shown by a schematic diagram 394 of
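For example, using illustrative numbers consistent with the 19-cycle iteration reported later in this disclosure:

```python
# Relative cost of one extra outer-loop pipeline stage, for X = 8 inner-loop
# cycles and Y = 11 outer-loop cycles (19 clock cycles per multiplication).
X, Y = 8, 11
C1 = (X + Y + 1) / (X + Y)   # ~1.053, i.e., only about a 5% increase in cycles
```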
This approach may be applied to the two-level loop datapath of the iterative multiplicative reduction circuit 270, as shown in
A CSAIL solver using the pipelined iterative multiplicative reduction circuit of
One goal in implementing this circuit is to create a regularly placed design. In one example, this was achieved by a combination of explicit (e.g., placing DSP blocks in columnar groups) and implicit (e.g., the directed pipelining method introduced in the earlier section) methods. An example floorplan implementing the pipelined iterative multiplicative reduction circuit of
In addition to the resources shown in the table, an additional 6573 ALMs and 14 M20K memory blocks are used to construct the entire iterative modular implementation. The presented architecture used a 350 MHz clock. As can be observed from Table 2, the proposed circuit balances the resources well between the multiplication and reduction components (e.g., both adaptive logic modules (ALMs) and digital signal processing (DSP) blocks). In addition, the circuit is highly energy efficient. Indeed, this approach has significantly greater arithmetic efficiency (e.g., normalized latency) than the next nearest known method, as well as much greater energy efficiency.
The solver started running in February 2022, and found 21 milestone solutions (from t = 2^28 to t = 2^48) in the first 6 months.
The solver uses a 350 MHz clock. One squaring operation takes 19 clock cycles, which gives about 54 nanoseconds per squaring operation. (At this rate, for example, the t = 2^48 milestone corresponds to roughly 2^48 × 54 ns, or approximately 177 days of continuous squaring.) The actual run times and estimated run times for future solutions are given in Table 3.
These results were reported to the MIT CSAIL team and confirmed as correct.
To achieve even higher performance results, the current iteration time may be reduced. Indeed, while the example architecture that has been implemented uses an 8-cycle iteration, which is driven by the number of resources (e.g., the 4150 DSP Blocks) on the mid-size device mentioned above, larger FPGAs may be used, with over 12 K DSP Blocks on some of the larger devices.
Several variations may significantly improve these current results. Using a device with more DSP blocks (e.g., an FPGA with twice the number of DSP blocks) may enable doubling the throughput. In fact, the same FPGA family used to achieve the results above also has members with 3x the DSP blocks. Although this would not evenly divide into the iteration granularity, a mixed (e.g., multiplicative and table-based) approach may be employed. Because both types of reduction calculations remap portions of values that lie outside the modulus width (the width of N) back into that space, they are compatible with each other. A relatively small amount of soft logic is used compared to the amount of DSP circuitry, so a reasonably routable solution may be realized. One caveat is that reducing the number of iterations may increase the critical path, especially in the summation of the partial product columns (e.g., including both the multiplication and reduction portions of the implementation), which may impact the operating frequency negatively.
Moreover, the FPGA may be overclocked to achieve even faster results. For example, the FPGA may be overclocked by 10%. With low logic use, and almost no memory blocks in the design, the power consumption is lower than typical for a full-chip design of this size. As such, the design is not near the thermal limits of the device. Moreover, there is a robust error checking method to continuously verify the results. There are several different possibilities to further develop overclocking, both by tuning the methods of this disclosure and by other techniques. For example, the operating frequency may be boosted by 25% over the reported value by monitoring power (which may increase with increased frequency, thereby increasing temperature and, in turn, reducing the thermal margin). The power-based performance improvement may be a slowly varying parameter. The continued correct operation of the circuit can be monitored by the error checking methodology explained earlier in this disclosure.
The circuit discussed above may be implemented on the integrated circuit system 12, which may be a component included in a data processing system, such as a data processing system 500, shown in
The data processing system 500 may be part of a data center that processes a variety of different requests. For instance, the data processing system 500 may receive a data processing request via the network interface 506 to perform encryption, decryption, machine learning, video processing, voice recognition, image recognition, data compression, database search ranking, bioinformatics, network security pattern identification, spatial navigation, digital signal processing, or other specialized tasks.
The techniques and methods described herein may be applied with other types of integrated circuit systems. For example, the iterative modular multiplication circuitry described herein may be used with central processing units (CPUs), graphics cards, hard drives, or other components.
While the embodiments set forth in the present disclosure may be susceptible to various modifications and alternative forms, specific embodiments have been shown by way of example in the drawings and have been described in detail herein. However, the disclosure is not intended to be limited to the particular forms disclosed. The disclosure is to cover all modifications, equivalents, and alternatives falling within the spirit and scope of the disclosure as defined by the following appended claims.
The techniques presented and claimed herein are referenced and applied to material objects and concrete examples of a practical nature that demonstrably improve the present technical field and, as such, are not abstract, intangible or purely theoretical. Further, if any claims appended to the end of this specification contain one or more elements designated as “means for [perform]ing [a function]...” or “step for [perform]ing [a function]...”, it is intended that such elements are to be interpreted under 35 U.S.C. 112(f). However, for any claims containing elements designated in any other manner, it is intended that such elements are not to be interpreted under 35 U.S.C. 112(f).
EXAMPLE EMBODIMENT 1. Circuitry comprising:
EXAMPLE EMBODIMENT 2. The circuitry of example embodiment 1, wherein the first input value and the second input value comprise a plurality of limbs, and wherein the polynomial multiplication circuitry multiplies the first input value to the second input value from a most significant limb to a least significant limb.
EXAMPLE EMBODIMENT 3. The circuitry of example embodiment 2, wherein the polynomial multiplication circuitry generates the first component of the product as a partial product corresponding to multiplying the most significant limb of the first input value to the most significant limb of the second input value.
EXAMPLE EMBODIMENT 4. The circuitry of example embodiment 1, wherein the circuitry is implemented in programmable logic and digital signal processing (DSP) blocks of a field programmable gate array (FPGA).
EXAMPLE EMBODIMENT 5. The circuitry of example embodiment 1, wherein the modular reduction circuitry performs modular reduction by multiplicative modular reduction to generate a modular reduction result that is a sum of multiple individual multiplicative reduction results.
EXAMPLE EMBODIMENT 6. The circuitry of example embodiment 5, wherein the modular reduction circuitry comprises a lookup table having entries that are used in the multiplicative modular reduction.
EXAMPLE EMBODIMENT 7. The circuitry of example embodiment 6, comprising a state register that stores an operation read address for the lookup table as an index in a current iteration.
EXAMPLE EMBODIMENT 8. The circuitry of example embodiment 7, wherein the state register comprises an embedded memory of a digital signal processing (DSP) block of field programmable gate array (FPGA) circuitry.
EXAMPLE EMBODIMENT 9. The circuitry of example embodiment 1, comprising clock circuitry to operate at an overclocked frequency.
EXAMPLE EMBODIMENT 10. The circuitry of example embodiment 1, wherein the polynomial multiplication circuitry and the modular reduction circuitry are pipelined with multiple levels of pipelining using a plurality of intermediate registers, wherein different groups of the registers operate on different clocks.
EXAMPLE EMBODIMENT 11. An article of manufacture comprising one or more tangible, non-transitory, machine-readable media storing instructions to program a programmable logic device with a system design comprising:
EXAMPLE EMBODIMENT 12. The article of manufacture of example embodiment 11, wherein the modular reduction circuitry comprises digital signal processor (DSP)-based modular reduction circuitry and lookup table (LUT)-based modular reduction circuitry.
EXAMPLE EMBODIMENT 13. The article of manufacture of example embodiment 11, wherein the multiplication circuitry and the modular reduction circuitry comprise a plurality of pipeline registers.
EXAMPLE EMBODIMENT 14. A method comprising:
EXAMPLE EMBODIMENT 15. The method of example embodiment 14, wherein iteratively performing multiplication and modular reduction operations is carried out using a first integrated circuit and the error checking is performed using a different integrated circuit.
EXAMPLE EMBODIMENT 16. The method of example embodiment 14, wherein the second of the plurality of iterations occurs multiple iterations after the first of the plurality of iterations and wherein the error checking is performed at a lower clock speed than the multiplication and modular reduction operations.
EXAMPLE EMBODIMENT 17. The method of example embodiment 14, wherein iteratively performing multiplication and modular reduction operations is carried out using the integrated circuitry, wherein the integrated circuitry is overclocked.
EXAMPLE EMBODIMENT 18. The method of example embodiment 14, wherein the error checking is performed not modulo N, where N is a modulus of the modular reduction operations, but rather modulo N′=NP, wherein the value modulo N′ is converted to a value modulo N by taking a remainder modulo N of that value, and wherein the error checking comprises comparing the result modulo P with an expected value.
EXAMPLE EMBODIMENT 19. The method of example embodiment 14, wherein the first output is stored in memory on the integrated circuitry on which the multiplication and modular reduction operations are iteratively performed.
EXAMPLE EMBODIMENT 20. The method of example embodiment 14, wherein the first output is stored in memory of a computing system distinct from the integrated circuitry on which the multiplication and modular reduction operations are iteratively performed.
This application claims priority to U.S. Provisional Application No. 63/409,201 filed Sep. 22, 2022, titled “INTEGRATED CIRCUIT ARCHITECTURE FOR A CRYPTOGRAPHIC PUZZLE SOLVER,” which is incorporated herein by reference in its entirety for all purposes.