The present invention generally relates to cryptography techniques; more specifically, the present invention relates to a cryptosystem processor utilizing split-radix Discrete Galois Transformation (DGT).
The rapid development of quantum computers and quantum algorithms such as Shor's algorithm threatens the security basis of conventional public-key cryptosystems such as RSA and ECC. The urgent need to replace conventional public-key cryptosystems with quantum-resistant cryptography, also called post-quantum cryptography (PQC), has drawn the attention of researchers and standards organizations. In July 2020, the round 3 candidates of the NIST PQC competition were announced, and four public-key algorithms were disclosed as the finalists. The finalists include three lattice-based schemes (i.e., CRYSTALS-KyberKEM, NTRU, SABER) and one code-based scheme (i.e., Classic McEliece). As stated by NIST, the PQC competition evaluates the submissions by different criteria, including security, cost, and algorithm and implementation characteristics.
To evaluate the PQC candidates on the criteria of cost, there have been lots of published articles comparing different implementations of candidates. Within the supplementary functions of the proposed PQC schemes, the most time-consuming parts are the polynomial multiplication and the hash functions. The recent hardware and software/hardware co-design works offload the polynomial multiplication and the hash functions to dedicated hardware accelerators.
The polynomial multiplication is computationally intensive. Polynomial multiplication can be solved in quasi-linear time O(n log n) using number-theoretic transforms (NTTs) when it is treated as a discrete convolution problem. It is worth noting that polynomial multiplication over different polynomial rings should be treated carefully. For instance, the polynomial multiplication over Z_q[x]/(x^n+1) can be treated as a negative wrapped convolution between the coefficient vectors of the input polynomials. The polynomial ring Z_q[x]/(x^n+1) is widely used in many lattice-based cryptography algorithms because of its high computing efficiency. For instance, the cryptosystems CRYSTALS-KyberKEM, SABER, and Lyubashevsky's public-key cryptosystem employ the aforementioned polynomial ring in their cryptography algorithms.
In this regard, the further related works are provided as follows. KyberKEM is one of the final round key encapsulation mechanisms in the NIST post-quantum cryptography competition. NTT, as the computing bottleneck of KyberKEM, has been widely studied.
There are several works concerning the implementation of Crystals-KyberKEM. Software implementation of KyberKEM has been studied. One of the related works proposed a memory-efficient high-speed optimization of KyberKEM on ARM Cortex-M4 core. Furthermore, the side-channel defence of KyberKEM was treated, which is also based on ARM Cortex-M4 core. Likewise, one of the related works studied the software optimization of KyberKEM on a high-performance platform.
Pure hardware implementations of KyberKEM have been investigated. The related work is a compact hardware implementation of KyberKEM for the third round submission in NIST PQC competition on Xilinx Artix-7 FPGA platform. The related work proposed an implementation for both FPGA and ASIC design with an improvement in polynomial sampling cores. One of the related works proposed the high-performance implementation of KyberKEM, NTRU and Saber with a novel Polynomial Vector Multiplication Unit (PVMU) design. One of the related works concerns the side-channel protection for KyberKEM in pure hardware.
On the other hand, software/hardware co-design implementations of KyberKEM have been investigated. One of the related works proposed an ASIC crypto-processor based on RISC-V architecture supporting Crystals-KyberKEM, Crystals-Dilithium, FrodoKEM, NewHope, and qTesla for the second round submission in NIST PQC competition, which was extended to FPGA platform. The related work proposed the integration of instruction sets for finite field arithmetic operations in a RISC-V processor, supporting PQC algorithms including KyberKEM and NewHope. The related work integrated the vectorized modular arithmetic operations and NTT computation in a RISC-V processor and presents the ASIC and FPGA implementation result, supporting PQC algorithms including KyberKEM, Saber and NewHope.
The literature regarding improving the polynomial multiplication in hardware is also being discussed. Several works focusing on the implementation of the NTT computation have been investigated. One of the related works proposed a low-complexity NTT/INTT algorithm, absorbing the pre-process and post-process into NTT and INTT, respectively. In the related work, a parallel architecture is proposed for high-speed NTT design. Other related works proposed NTT-based polynomial multiplication architectures for KyberKEM on FPGA.
Currently, there are several challenges in the field of cryptography, particularly related to computing complexity. The aforementioned methods may not sufficiently reduce the computing complexity in a smooth manner, leading to difficulties in improving performance. Therefore, there is a need to develop robust transformation algorithms that can effectively address these challenges.
It is an objective of the present invention to provide an apparatus and a method to address the aforementioned shortcomings and unmet needs in the state of the art. In accordance with one aspect of the present invention, a cryptosystem processor for operating split-radix discrete Galois transformation/inverse discrete Galois transformation is provided. The cryptosystem processor includes a twiddle factor memory, at least one split radix discrete Galois transformation/inverse discrete Galois transformation butterfly unit (SRDGT BFU), and a stream permutation network (SPN). The twiddle factor memory is instantiated by dual-port read only memory (ROM) and has a first ZETA port and a second ZETA port. The at least one SRDGT BFU has six input ports and four output ports and is switchable among a discrete Galois transformation (DGT) mode, an inverse discrete Galois transformation (IDGT) mode, and a component-wise multiplication (CWM) mode, in which two of the input ports electrically communicate with the first ZETA port and the second ZETA port, respectively. The SRDGT BFU is configured to read and write two data points when working under the DGT or IDGT mode and is configured to read and write four data points when working under the CWM mode. The SPN electrically communicates with the SRDGT BFU and has a first dual-port block random access memory (BRAM), a second dual-port BRAM, and a third dual-port BRAM which serve as memory caches configured to store polynomials, wherein the SPN is configured to support the required number of data points read or written per cycle in the DGT mode, the IDGT mode, or the CWM mode.
In accordance with another aspect of the present invention, a split radix DGT apparatus is provided. The split radix DGT apparatus includes a cryptosystem processor, a RAM module, and an input/output part. The RAM module electrically communicates with the cryptosystem processor and is configured to store polynomials and to pass the polynomials into the cryptosystem processor. The input/output part is configured to work as an input/output buffer for the architecture of the split radix DGT apparatus.
By the configuration of the present invention, a novel DGT algorithm that leverages the split-radix method is provided. It is an objective to reduce computing complexity while maintaining the transform length. The algorithm achieves lower computing complexity without compromising the transform length, making it a more efficient alternative to existing NTT algorithms when implemented in software or hardware. Additionally, the configuration of the present invention ensures efficient processing through a fully-pipelined scheduling technique, facilitated by a dedicated stream permutation network. The applied approach of the present invention enables smooth data flow during the transformation process, enhancing overall performance.
To optimize hardware utilization, the configuration of the present invention introduces a compact and unified split-radix DGT processor. This processor shares multipliers among the nine working modes of the split-radix DGT algorithm, reducing hardware requirements and allowing for parameterization based on specific computing tasks. By integrating different operations and designing compact hardware modules, these processors streamline operations and enhance overall system efficiency. The combination of the split-radix DGT algorithm and the efficient processor design results in improved performance, resource optimization, and reduced hardware complexity.
Embodiments of the invention are described in more details hereinafter with reference to the drawings, in which:
In the following description, a cryptosystem utilizing split-radix Discrete Galois Transformation (DGT) and the like are set forth as preferred examples. It will be apparent to those skilled in the art that modifications, including additions and/or substitutions, may be made without departing from the scope and spirit of the invention. Specific details may be omitted so as not to obscure the invention; however, the disclosure is written to enable one skilled in the art to practice the teachings herein without undue experimentation.
To make the present invention understandable, notations and basic operations are stated before embodiments.
Table I provides the mathematical notations used in the present disclosure. R_q is used to denote the polynomial ring Z_q[x]/(x^n+1) defined over the field Z_q, where q is a prime integer.
CRYSTALS-KyberKEM is a key-encapsulation mechanism with security against adaptive chosen ciphertext attacks (IND-CCA2). The security of KyberKEM is based on the hardness of the learning-with-errors problem in module lattices (i.e., the MLWE problem). To construct an IND-CCA2-secure KEM, CRYSTALS-Kyber applies a slightly tweaked Fujisaki-Okamoto (FO) transform to a Public-Key Encryption (PKE) scheme that is secure against chosen plaintext attacks (IND-CPA), which is called CRYSTALS-KyberPKE. The parameter sets for CRYSTALS-KyberKEM are shown in Table II. The key generation, encryption, and decryption of CRYSTALS-KyberPKE are defined as follows, using the definitions of the helper functions CBD, Parse, Compress, NTT, and INTT:
NTT is a variant of the Discrete Fourier Transform (DFT) obtained by changing the complex number field into the finite field Z_q. Given a polynomial A of length n, the length-n NTT (noted as NTT_n) is defined as Â_j = NTT_n(A)_j = Σ_{i=0}^{n−1} A_i ω_n^{ij} mod q, where 0≤j<n. ω_n (mod q) denotes the primitive n-th root of unity over Z_q, or the twiddle factor of the length-n NTT. ω_n (mod q) exists when q≡1 (mod n).
The inverse NTT (INTT) can be performed by replacing the twiddle factor ω_n (mod q) of the length-n NTT by ω_n^{−1} (mod q), and multiplying by the scalar factor n^{−1} (mod q) after the summation. The length-n INTT (noted as INTT_n) is defined as A_i = INTT_n(Â)_i = n^{−1} Σ_{j=0}^{n−1} Â_j ω_n^{−ij} mod q, where 0≤i<n.
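By way of non-limiting illustration, the NTT and INTT definitions above can be sketched directly as quadratic-time sums and used to compute a negative wrapped convolution over Z_q[x]/(x^n+1). The toy parameters (q=17, n=4, and the 2n-th root of unity 2), the function names, and the scaling of the inputs by powers of a 2n-th root of unity are assumptions chosen for readability; they are not the KyberKEM parameters and not the low-complexity algorithms of the related work.

```python
# Toy parameters (assumptions for illustration only, not the Kyber values).
Q = 17                   # prime modulus with 2n | (q - 1)
N = 4                    # transform length n
PSI = 2                  # primitive 2n-th root of unity mod Q (2**4 = -1 mod 17)
OMEGA = PSI * PSI % Q    # primitive n-th root of unity mod Q


def ntt(a, omega=OMEGA, q=Q):
    """Length-n NTT evaluated directly from its defining sum."""
    n = len(a)
    return [sum(a[i] * pow(omega, i * j, q) for i in range(n)) % q
            for j in range(n)]


def intt(a_hat, omega=OMEGA, q=Q):
    """Length-n INTT: inverse twiddle factor plus scaling by n^-1 mod q."""
    n = len(a_hat)
    n_inv = pow(n, -1, q)          # modular inverse (Python 3.8+)
    omega_inv = pow(omega, -1, q)
    return [n_inv * sum(a_hat[j] * pow(omega_inv, i * j, q) for j in range(n)) % q
            for i in range(n)]


def nwc(a, b, psi=PSI, q=Q):
    """Negative wrapped convolution via NTT with pre- and post-processing."""
    n = len(a)
    a_scaled = [a[i] * pow(psi, i, q) % q for i in range(n)]   # pre-processing
    b_scaled = [b[i] * pow(psi, i, q) % q for i in range(n)]
    c_hat = [x * y % q for x, y in zip(ntt(a_scaled), ntt(b_scaled))]
    c_scaled = intt(c_hat)
    psi_inv = pow(psi, -1, q)
    return [c_scaled[i] * pow(psi_inv, i, q) % q for i in range(n)]  # post-processing


def schoolbook_negacyclic(a, b, q=Q):
    """Reference product in Z_q[x]/(x^n + 1), using x^n = -1."""
    n = len(a)
    c = [0] * n
    for i in range(n):
        for j in range(n):
            k = i + j
            if k < n:
                c[k] = (c[k] + a[i] * b[j]) % q
            else:
                c[k - n] = (c[k - n] - a[i] * b[j]) % q
    return c
```

For example, `nwc([1, 2, 3, 4], [5, 6, 7, 8])` agrees with `schoolbook_negacyclic` on the same inputs, and `intt(ntt(a))` returns `a`.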
According to fast Fourier transform and convolution algorithms, polynomial multiplication over R_q can be solved efficiently by negative wrapped convolution (NWC) when the prime parameter q satisfies 2n|(q−1). NWC introduces pre-processing before NTT and post-processing after INTT. In order to reduce the computing complexity of NWC, a related work proposed low-complexity NTT and INTT algorithms (noted as LC NTT/INTT) by merging the pre-processing and post-processing into NTT and INTT without additional modular multiplication. Based on the related work, the number of modular multiplications in the LC NTT/INTT algorithms is
Starting from the second-round submission of Crystals-KyberKEM, the parameter set (n,q) is selected as (256,3329) as shown in Table II. Given that n|(q−1) but 2n∤(q−1), the aforementioned NWC via NTT cannot be applied directly. A proposed variant of NTT can be adopted to apply the NWC in Crystals-KyberKEM. Such a variant is based on the observation that, when a polynomial F is factored into a product F=GH over the finite field Z_q, an isomorphism by the Chinese remainder theorem is provided:
Since x^256+1 = Π_{i=0}^{127} (x^2 − ω_256^{2i+1}), where ω_256 is a primitive 256-th root of unity, the definition of NTT working on Z_3329[x]/(x^256+1) (noted as NTT_256^3329) is given by:
where 0≤i<128, Â_{2i+1} = Σ_{j=0}^{127} A_{2j+1} ω_256^{(2i+1)j}, and Â_{2i} = Σ_{j=0}^{127} A_{2j} ω_256^{(2i+1)j}. Now both Â_{2i+1} and Â_{2i} can be solved by the length-128 low-complexity NTT. As for the inverse transform, two length-128 low-complexity INTTs can be used to reconstruct A(x) from Â_{2i+1} and Â_{2i}. The NWC working on Z_3329[x]/(x^256+1) can also be solved as:
INTT_256^3329(NTT_256^3329(a) ∘ NTT_256^3329(b)).
Such component-wise multiplication is defined as:
where 0≤i<128. Thus, polynomial multiplication over Z_3329[x]/(x^256+1) is performed by four length-128 low-complexity NTTs, one length-128 component-wise multiplication, and two length-128 low-complexity INTTs.
Consider A(x)∈R_q. Let m = n/2
and z = x^m = √(x^n) ≡ √(−1) (mod q). Then, A(x) is rewritten as: A(x) = Σ_{i=0}^{m−1} Ā_i x^i,
where Ā_i = (A_{i+m} z + A_i), 0≤i<m. It is notable that Ā_i ∈ Z_q[z]/(z^2+1), which is isomorphic to GF(q^2). Given 0≤i,j<m, some arithmetic operations over Z_q[z]/(z^2+1) are defined as:
The twiddle factor ζ_m ∈ Z_q[z]/(z^2+1) can be defined such that ζ_m^m = z. Such ζ_m exists when 4m|(q−1). It is observed that {ζ_m^{4i+1}, ∀0≤i<m} is a set of solutions of the equation x^m = z on Z_q[z]/(z^2+1), indicating that {ζ_m^{4i+1}, ∀0≤i<m} fulfills the following properties:
where k is a power-of-two integer that is smaller than m. Thus, the set of twiddle factors in the Discrete Galois Transform (DGT) is defined as {ζ_m^{4i+1}, ∀0≤i<m}. For a length-m polynomial Ā, whose entries Ā_i ∈ Z_q[z]/(z^2+1), the definition of length-m DGT (noted as DGT_m) can be
multiplications in Z_q[z]/(z^2+1) are needed in DGT_m and IDGT_m, respectively. Recall that the addition in Z_q[z]/(z^2+1) involves no modular multiplication, while each multiplication in Z_q[z]/(z^2+1) includes three modular multiplications using the Karatsuba method. The number of modular multiplications in DGT_m and IDGT_m can be
respectively.
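By way of non-limiting illustration, the multiplication over Z_q[z]/(z^2+1) with the Karatsuba method mentioned above can be sketched as follows. An element a0 + a1·z is represented as the pair (a0, a1); this pair representation and the function names are assumptions made for the sketch. The Karatsuba variant uses three modular multiplications where the schoolbook variant uses four.

```python
Q = 3329  # KyberKEM modulus


def gf_add(x, y, q=Q):
    """Addition in Z_q[z]/(z^2 + 1): two modular additions, no multiplication."""
    return ((x[0] + y[0]) % q, (x[1] + y[1]) % q)


def gf_mul_schoolbook(x, y, q=Q):
    """Reference product using four modular multiplications."""
    a0, a1 = x
    b0, b1 = y
    # (a0 + a1 z)(b0 + b1 z) with z^2 = -1
    return ((a0 * b0 - a1 * b1) % q, (a0 * b1 + a1 * b0) % q)


def gf_mul_karatsuba(x, y, q=Q):
    """Karatsuba product: three modular multiplications, five additions/subtractions."""
    a0, a1 = x
    b0, b1 = y
    p0 = a0 * b0 % q                  # multiplication 1
    p1 = a1 * b1 % q                  # multiplication 2
    p2 = (a0 + a1) * (b0 + b1) % q    # multiplication 3 (two additions)
    real = (p0 - p1) % q              # one subtraction
    imag = (p2 - p0 - p1) % q         # two subtractions
    return (real, imag)
```

For example, `gf_mul_karatsuba((0, 1), (0, 1))` returns `(3328, 0)`, that is, z·z ≡ −1 (mod 3329).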
According to a related work, the length-n NWC can also be solved via DGT as:
IDGTm(DGTm(
where
Thus, length-n NWC can be performed as two length-m DGTs after pre-processing, one length-m point-wise multiplication, and one length-m IDGT followed by post-processing.
In the present invention, split-radix DGT and inverse Discrete Galois Transform (IDGT) are proposed to reduce the computing complexity. In the present disclosure, an approach is proposed to integrate the split radix and decimation-in-time (DIT) into the low-complexity DGT algorithm, while using split radix and decimation-in-frequency (DIF) to derive the IDGT. These novel split-radix DGT/IDGT algorithms inherit the small number of multiplications from the split-radix nature and the short transformation length from DGT/IDGT, which enables low-complexity NWCs.
The low-complexity DGT is derived using the split radix and decimation-in-time (DIT). Consider a length-m polynomial Ā, whose entries Ā_i ∈ Z_q[z]/(z^2+1). The derivation is started by splitting the summation of DGT into three groups according to the index of
where 0≤j<m. The degree-m DGT can be decomposed into
is set as
For
is rewritten in terms of
j, and
(Eq. 3A) represents the asymmetric DIT butterfly computation for split-radix DGT. It is noted there are two boundary cases at m=2 and m=4. When m=2, the DGT problem
And, when m=4, the DGT problem
The details of the proposed split-radix DIT DGT are shown in Algorithm 1. The split-radix DGT butterfly in (Eq. 3A) is observed as being asymmetric, and different butterflies can be processed at boundary cases m=2,4. It is recommended to decompose the asymmetric butterfly operations as well as the butterfly operations in boundary cases into the similar butterfly operators. In the proposed algorithm of the present invention, four butterfly operators shown in
The low complexity IDGT is derived in the split radix and decimation-in-frequency (DIF) nature. Given a length-m polynomial
The derivation of
For
is substituted into
with defining
It is found that (Eq. 5) is equivalent to the length
of
To construct the other two subproblems of length
namely
again, the derivation is started from the definition of IDGT with post-processing, but splitting the summation to four groups according to the index of
For
is substituted into
Defining:
One can simplify (Eq. 7) as:
It is found that (Eq. 9) is equivalent to the length
of
can also be constructed similar to (Eq. 7)-(Eq. 9). In summary, the split-radix DIF IDGT butterfly operations are defined as:
It is noted the two boundary cases at m=2,4. When m=2, the IDGT problem Āi can be solved as:
And, when m=4, the IDGT problem is solved as:
Complexity analysis on the split radix DGT/IDGT is provided herein.
To analyze the computation cost of split-radix DIT DGT, one can set up the recurrence equations based on the asymmetric split-radix butterflies and the two boundary cases. The numbers of modular multiplications and modular additions in a length-m DGT are defined as M(m) and A(m), respectively. Given that the size of each sub-problem in (Eq. 3A) is m/4, one can find that m/2 additions over Z_q[z]/(z^2+1) and m/2 multiplications over Z_q[z]/(z^2+1) are needed in the first stage of the split-radix DIT DGT butterfly computation. The second stage of the DGT butterfly computation involves m additions over Z_q[z]/(z^2+1) but no multiplication. In summary, 3m/2 additions over Z_q[z]/(z^2+1) and m/2 multiplications over Z_q[z]/(z^2+1) are required to compute the length-m/4 sub-problem of split-radix DIT DGT. Recall that each addition over Z_q[z]/(z^2+1) is separated into 2 modular additions, and each multiplication over Z_q[z]/(z^2+1) involves 5 modular additions and 3 modular multiplications when using the Karatsuba algorithm. Accordingly, 11m/2 modular additions and 3m/2 modular multiplications in GF(q) are required for the length-m/4 sub-problem of split-radix DIT DGT. Recall that when m=2, the DGT problem consists of 2 additions over Z_q[z]/(z^2+1) and 1 multiplication over Z_q[z]/(z^2+1) as shown in (Eq. 3B). Accordingly, 9 modular additions and 3 modular multiplications in GF(q) are required when m=2. Similarly, 31 modular additions and 9 modular multiplications are required when m=4.
Similar to the split-radix DIT DGT, the recurrence equations based on the asymmetric split-radix DIF IDGT butterfly and the two boundary cases (i.e., (Eq. 10), (Eq. 11), and (Eq. 12)) can be set up to analyze the computation cost. Observing that the split-radix DIF IDGT butterfly and the two boundary cases require the same number of multiplications and additions over Z_q[z]/(z^2+1) as in DGT, the cost of the split-radix DIT DGT and the split-radix DIF IDGT can be represented in terms of modular multiplications M(m) and modular additions A(m) by the following recurrences:
such that,
Having the above analysis,
The split-radix DGT and IDGT can also save 9.6% of modular multiplications compared to the low-complexity NTT and INTT. Similarly, the split-radix DGT/IDGT needs one less stage than the low-complexity NTT/INTT. The reason is that a length-n NTT/INTT is equivalent to a length-n/2 DGT/IDGT (which means the transform size is halved in DGT/IDGT compared to NTT/INTT). Additionally, as DIT is applied in DGT and DIF is used in IDGT, no bit-reordering on the coefficients is required.
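By way of non-limiting illustration, the operation counts discussed above can be evaluated numerically. The recurrences themselves are not reproduced in the text, so the sketch assumes the usual split-radix form, in which a length-m problem splits into one length-m/2 and two length-m/4 sub-problems, combined with the per-level costs (3m/2 modular multiplications, 11m/2 modular additions) and the boundary values for m=2 and m=4 stated above. The count (n/2)·log2(n) used for the low-complexity NTT is likewise an assumption. Under these assumptions the sketch reproduces the 9.6% saving for a length-64 DGT against a length-128 NTT.

```python
from functools import lru_cache


@lru_cache(maxsize=None)
def modular_multiplications(m):
    """M(m), assuming M(m) = M(m/2) + 2*M(m/4) + 3m/2 with M(2)=3, M(4)=9."""
    if m == 2:
        return 3
    if m == 4:
        return 9
    return modular_multiplications(m // 2) + 2 * modular_multiplications(m // 4) + 3 * m // 2


@lru_cache(maxsize=None)
def modular_additions(m):
    """A(m), assuming A(m) = A(m/2) + 2*A(m/4) + 11m/2 with A(2)=9, A(4)=31."""
    if m == 2:
        return 9
    if m == 4:
        return 31
    return modular_additions(m // 2) + 2 * modular_additions(m // 4) + 11 * m // 2


def lc_ntt_multiplications(n):
    """Assumed count for the low-complexity NTT: (n/2) * log2(n)."""
    return (n // 2) * (n.bit_length() - 1)


def saving(n):
    """Fraction of modular multiplications saved by a length-n/2 split-radix DGT."""
    return 1 - modular_multiplications(n // 2) / lc_ntt_multiplications(n)


print(modular_multiplications(64))   # 405
print(lc_ntt_multiplications(128))   # 448
print(round(saving(128), 3))         # 0.096
```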
The split-radix DGT and IDGT can be applied to solve the polynomial multiplication on Z_q[x]/(x^n+1) when 2n|(q−1) and n is a power of 2, and they are a more efficient variant compared with the classic DGT/IDGT. The split-radix DGT and IDGT also provide a shorter transform length and need one less stage compared with other NTT/INTT algorithms. Thus, the split-radix DGT/IDGT of the present invention is competitive in the design of high-performance NWC architectures.
In the present invention, an architecture design is provided as well, which refers to a cryptosystem apparatus utilizing split-radix DGT/IDGT.
As mentioned above, CRYSTALS-KyberKEM adopts the parameter set (n,q) as (256,3329), which allows the length-256 NTT to be divided into two length-128 NTTs of the odd-index terms and the even-index terms, respectively. When DGT is used to replace the length-128 NTT in CRYSTALS-KyberKEM, the pack operation (e.g., as shown in (Eq. 0)) is required to pack the odd-index terms and the even-index terms from Z_q into Z_q[z]/(z^2+1). Therefore, the DGT in CRYSTALS-KyberKEM consists of two length-64 DGTs for the odd-index terms and the even-index terms (which are noted as the odd polynomial and the even polynomial in this disclosure, respectively). Additionally, the available twiddle factor ζ_m in KyberKEM can be set as ζ_64 = 1+737z.
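By way of non-limiting illustration, the pack operation and the twiddle factor 1+737z stated above can be checked numerically. The sketch represents a0 + a1·z as the pair (a0, a1) and packs a length-2m coefficient vector by pairing the i-th and (i+m)-th coefficients; the function names and the pair layout are assumptions made for the sketch. It confirms that the 64-th power of the stated twiddle factor equals z modulo 3329, and that each power with exponent 4i+1 is also a solution of x^64 = z.

```python
Q = 3329  # KyberKEM modulus


def mul(x, y, q=Q):
    """Product of x0 + x1*z and y0 + y1*z in Z_q[z]/(z^2 + 1)."""
    return ((x[0] * y[0] - x[1] * y[1]) % q, (x[0] * y[1] + x[1] * y[0]) % q)


def power(x, e, q=Q):
    """Square-and-multiply exponentiation in Z_q[z]/(z^2 + 1)."""
    result, base = (1, 0), x
    while e:
        if e & 1:
            result = mul(result, base, q)
        base = mul(base, base, q)
        e >>= 1
    return result


def pack(coeffs):
    """Pack 2m coefficients over Z_q into m pairs (A_i, A_{i+m}), read as A_{i+m}*z + A_i."""
    m = len(coeffs) // 2
    return [(coeffs[i], coeffs[i + m]) for i in range(m)]


ZETA_64 = (1, 737)   # twiddle factor 1 + 737z
Z = (0, 1)           # the element z, with z^2 = -1

twiddles = [power(ZETA_64, 4 * i + 1) for i in range(64)]
print(power(ZETA_64, 64) == Z)                  # True
print(all(power(t, 64) == Z for t in twiddles)) # True
```

Packing a length-128 vector, for example `pack(list(range(128)))`, yields 64 pairs beginning with `(0, 64)`.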
As shown in
The unified SRDGT BFU 112 is designed to compute DGT and IDGT in iterative nature.
The SRDGT BFU 112 of the present invention is designed to support nine working modes to implement the SRDGT butterfly as shown in
Among the nine working modes of the SRDGT BFU 112, four are for the iterative DGT (DGT 0-1, DGT 0-2, DGT 0-3, and DGT 1 as shown in
The CWM is defined as
By using the Karatsuba-based CWM approach, the number of multiplications over Z_q[z]/(z^2+1) can be obtained via:
The Karatsuba-based CWM approach is computed using two working modes of the SRDGT BFU 112 of the present invention, namely CWM 0 and CWM 1. The computation of (Eq. 13) is mapped to the data flow of the BFU as:
The detailed dataflow and working mechanism of the SRDGT BFU 112 are stated in
The multiplier over Z_q[z]/(z^2+1) is provided herein.
Notice that each multiplication in (Eq. 13) is a multiplication over Z_q[z]/(z^2+1). Therefore, a multiplier over Z_q[z]/(z^2+1) is required to compute this operation.
The DSP48E1 slice in Xilinx FPGAs consists of one multiplier and two adders. Since all the operators in the DSP48E1 are programmable, one can fully utilize these high-performance hardware resources to design a high-throughput multiplier over Z_q[z]/(z^2+1). In the present disclosure, the Karatsuba algorithm is adopted, with given
As can be seen from (Eq. 15), there are three multiplications, four additions, two subtractions, and two modular reductions in each multiplication over Z_q[z]/(z^2+1). In the present invention, the whole computation in (Eq. 15) is mapped onto three DSP48E1 slices, as shown in
Stream Permutation Network and Fully Pipelined Scheduling are provided herein.
In the present invention, the stream permutation network (SPN) and the data scheduling plan are designed to support two main goals for a single SRDGT BFU (i.e., the SRDGT BFU 112 as described above): (1) the SPN can satisfy the bandwidth requirement of the SRDGT BFU; and (2) the schedule of the SPN can ensure a fully pipelined working mode of DGT/IDGT.
The above goals can be achieved based on at least three features observable from
Based on the above features, in some embodiments, the SPN 116 can support the required/desired numbers of data points reading/writing per cycle in the DGT/IDGT/CWM mode. For example, the SPN 116 can support 2/2/4 data points reading per cycle in the DGT/IDGT/CWM mode, respectively. Similarly, the SPN 116 can also support 2/2/4 data points writing per cycle in the DGT/IDGT/CWM mode, respectively.
In order to satisfy the SPN data width requirement, three true dual-port block RAM (BRAM), namely MEM 0, MEM 1, and MEM 2, as shown in
When the SRDGT BFU 112 works in DGT/IDGT mode, the two read ports of one BRAM and the two write ports of another BRAM are enabled.
As shown in
In some embodiments, when the SRDGT BFU 112 works in CWM mode, there are 4 data points received by the SRDGT BFU 112 and 4 data points output from the SRDGT BFU 112 in each cycle.
The main challenge of implementing a fully pipelined iterative DGT/IDGT lies in the data dependency between adjacent transform stages. The fully pipelined scheduling plan of the present invention is specific for the DGT/IDGT in KyberKEM, consisting of two length-64 DGTs. The two length-64 DGTs are interspersed and processed alternately to eliminate the data dependency between adjacent transform stages.
The cycle counts of the fully pipelined SRDGT BFU (i.e., the SRDGT BFU 112) and the state-of-the-art LC NTT are then analyzed and compared. The SRDGT BFU requires 2×64/2×log2 64=384 cycles for the length-64 DGTs of the odd and even polynomials, and no pipeline bubble exists. For the same computation using the length-128 NTT, LC NTT requires 128/2×log2 128=448 cycles for the odd and even polynomials, with an additional 64 cycles of pipeline bubbles to write the results back to the BRAMs. The above comparison demonstrates the advantages of the halved transform length of DGT, the data scheduling plan, and the fully pipelined architecture of the present invention.
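By way of non-limiting illustration, the cycle counts quoted above can be reproduced with a few lines of arithmetic. The sketch simply evaluates the two expressions given in the text; the function names and default arguments are assumptions made for the sketch and do not model the hardware itself.

```python
def stages(length):
    """Number of butterfly stages, log2(length), for a power-of-two length."""
    return length.bit_length() - 1


def srdgt_cycles(length=64, transforms=2):
    """Cycles for the interleaved length-64 DGTs: 2 x 64/2 x log2(64), no bubbles."""
    return transforms * (length // 2) * stages(length)


def lc_ntt_cycles(length=128, bubble_cycles=64):
    """Cycles quoted for LC NTT: 128/2 x log2(128), plus write-back bubbles."""
    return (length // 2) * stages(length) + bubble_cycles


print(srdgt_cycles())                       # 384
print(lc_ntt_cycles(bubble_cycles=0))       # 448
print(lc_ntt_cycles())                      # 512
print(srdgt_cycles() / lc_ntt_cycles())     # 0.75
```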
The extensions to multiple BFUs are provided herein. In KyberKEM, a higher security level requires more DGT computation tasks. In some embodiments, in order to support multiple tasks simultaneously, the extension to multiple the SRDGT BFUs 112 is available, as shown in
More features regarding hardware architecture of KyberKEM are provided herein.
In some embodiments, KyberKEM involves key generation, encapsulation, and decapsulation. In the present invention, hardware architecture is provided as shown in
Referring to
The CBD module 212 and the reject sampling module 214 can be configured to perform sampling in the functions CBDη and Parse, respectively. The compress and decompress modules 220, 222 are responsible for the compress and decompress of ciphertext, respectively. The encode module 218 is configured to transfer the data format from the byte array to the coefficients of a polynomial, and the decode module 216 transfers the coefficients of a polynomial back to the byte array. The encode and decode modules 218 and 216 are modified from the open-source code. The Keccak module 210 is configured to compute the functions of SHAKE128, SHAKE256, SHA3-256, and SHA3-512. The functionality of the Keccak module 210 is expanded from the open-source code, and it will take 24 clock cycles to execute 24 rounds in the function KECCAK-f.
Bandwidth matching is applied throughout the architecture to increase the area-time efficiency. The entire structure is divided into three parts, with different data bit widths for different parts. The advantage of bandwidth matching across the parts is that the overall hardware latency and the consumed resources can be traded off based on the security level. The data bandwidth is 64 bits, 48 bits, and 48×k bits in the I/O part, the sample/serialization part, and the DGT part, respectively, where k is the security level parameter of KyberKEM and equals the scalability parameter in the split radix DGT module 230 (i.e., the cryptosystem processor 100) as defined above. The I/O part includes the input and output FIFOs, working as the input/output buffer of the architecture. In the sample/serialization part, the byte array from the input FIFO can be sent to the Keccak module 210 for sampling and to the decode module 216 for de-serialization into 48-bit width. The compress module 220 is able to accept the 48-bit-width data from the encode module 218 and serialize it into 64-bit-width data for the output FIFO. In some embodiments, the KyberKEM-Split Radix DGT apparatus 200 may further include a data register 232, which electrically communicates with the compress module 220 and is configured to store the 64-bit-width data from the compress module 220. The RAM module 224 is configured to store the sampling polynomials from the CBD module 212 and the reject sampling module 214 and the decompressed polynomials from the decompress module 222 (i.e., the decompress module 222 can be configured to decompress polynomials). The byte write function of the Xilinx BRAM instance can be used in the RAM module 224 to facilitate the flexibility of the write bandwidth.
In some embodiments, when k polynomials for DGT/IDGT/CWM are ready, the SRDGT module with the split radix DGT module 230 (i.e., the cryptosystem processor 100) will load these k polynomials and process them simultaneously. In some embodiments, the KyberKEM-Split Radix DGT apparatus 200 may further include control units, the input and output FIFOs.
In the present invention, the just-in-time strategy is applied to minimize the memory footprint. The just-in-time strategy means that the sampling polynomials are generated based on the requirement of the succeeding computation. For example, the strategy is applied to the data generated by the reject sampling module 214. The reject sampling module 214 samples the output from the Keccak module 210 under the uniform distribution. The output of the reject sampling module 214, including Â in key generation and Â^T in encryption, is stored in the RAM module 224 and is passed to the SRDGT module with the split radix DGT module 230 (i.e., the cryptosystem processor 100) once k polynomials are ready. Each of these polynomials is used only once. Thus, the memory space can be overwritten by the following k polynomials based on the just-in-time strategy, and the memory space reserved can be reduced from k^2 polynomials to k polynomials.
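By way of non-limiting illustration, the memory reduction from k^2 polynomials to k polynomials can be quantified as follows. The sketch assumes 256 coefficients per polynomial and 12 bits per coefficient (enough to hold values below 3329); the bit width per coefficient, the function name, and the constants are assumptions made for the sketch, not figures taken from the implementation results.

```python
COEFFS_PER_POLY = 256   # n in the KyberKEM parameter set
BITS_PER_COEFF = 12     # assumption: 3329 < 2**12


def sampled_matrix_storage_bits(k, just_in_time):
    """Storage reserved for the reject-sampled polynomials of the k x k matrix.

    Without the just-in-time strategy all k*k polynomials are held at once;
    with it, only the k polynomials needed by the next computation are held.
    """
    polynomials = k if just_in_time else k * k
    return polynomials * COEFFS_PER_POLY * BITS_PER_COEFF


for k in (2, 3, 4):
    full = sampled_matrix_storage_bits(k, just_in_time=False)
    jit = sampled_matrix_storage_bits(k, just_in_time=True)
    print(k, full, jit, full // jit)   # the reduction factor equals k
```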
The implementation results and comparisons are provided herein.
The hardware design of KyberKEM-SRDGT of the present invention has been synthesized and implemented using Vivado 2019.2 design suite on Xilinx XC7A200 (Artix-7) FPGA device, with all the building blocks implemented in hardware.
Regarding split-radix DGT module results and comparisons, the hardware resource utilization and the latency specification of the SRDGT module are shown in Table V in
Due to the careful placement of registers and the usage of high-speed DSP48E1 slices in the Artix-7 FPGA, the SRDGT module is able to operate at a frequency of 239 MHz. Another merit of the SRDGT algorithm and the architecture is the relatively small cycle count. Specifically, the DGT, IDGT, and CWM computations require 384, 384, and 132 cycles, respectively, for length-256 polynomial multiplication. The latencies of DGT, IDGT, and CWM are 1.6 μs, 1.6 μs, and 0.55 μs, respectively.
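By way of non-limiting illustration, the latencies above follow from dividing the cycle counts by the operating frequency. The helper name below is an assumption made for the sketch.

```python
def latency_us(cycles, freq_mhz):
    """Latency in microseconds for a cycle count at a clock frequency in MHz."""
    return cycles / freq_mhz


FREQ_MHZ = 239  # operating frequency stated above

print(round(latency_us(384, FREQ_MHZ), 1))   # 1.6  (DGT and IDGT)
print(round(latency_us(132, FREQ_MHZ), 2))   # 0.55 (CWM)
```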
In comparison to the SW implementation, the cycle count of the SRDGT architecture achieves a speedup of 20.1×, 24.3×, 211.2× for NTT (DGT), INTT(IDGT), and CWM, respectively. In comparison to the HW/SW implementations, the SRDGT hardware achieves more than 32.4× speedup for NTT (DGT) computation. Besides, some related works use 1.86× and 1.81× more LUTs than design of the present disclosure in a similar FPGA platform, respectively.
The state-of-the-art HW implementations are divided into two groups depending on whether the CWM is supported. The hardware in some related works support CWM. One of the related works has a higher NTT ATP ratio in LUT, BRAM, and DSP compared to the architecture of the present invention, indicating the high efficiency of our architecture. The architecture of the present invention still has lower cycle counts because the transform size is halved, and only six stages are required in our split radix DGT and IDGT, with the full-pipelined working nature provided by the SPN. One of the related works also presented a unified butterfly unit for NTT, INTT, and CWM. However, taking advantage of the novel split-radix DGT algorithm of the present invention, the cycle count of the present invention is only 384/512=75% of the counts in the previous work for NTT (DGT), and only 132/256=51.6% of the counts in the previous work for CWM. The architecture of the present invention outperforms the NTT ATP and CWM ATP compared to the previous work except for the NTT-DSP ATP because of the compact design in the unified BFU of the previous work. One of the related works proposes three different configurations to trade off the hardware resources and speed. The architecture of the present invention outperforms all these configurations concerning the LUT-NTT and BRAM-NTT ATP, while their work can have a better DSP-NTT ATP. Besides, the architecture of the present invention outperforms the CWM ATP ratios for LUT, BRAM, and DSP compared to the related works.
For fairness, when the architecture of the present invention is compared with related works that do not support CWM, the clock cycle counts of NTT (DGT) and INTT (IDGT) are used instead of the ATP of the NTT and CWM, since the architecture of the present invention uses additional hardware resources for CWM.
Regarding the KyberKEM results and comparisons, an ATP ratio is the product of an FPGA resource count and the total time, normalized with the architecture of the present invention as the baseline.
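The ATP ratio described above can be sketched as follows. This is only an illustration under stated assumptions: the ratio is taken as the compared design's resource-time product divided by the baseline's, the `Design` type and `atp_ratios` function are hypothetical names, and the resource and timing numbers in the usage example are made-up placeholders rather than measured results.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Design:
    """FPGA resource usage and total time of one design (hypothetical type)."""
    luts: float
    brams: float
    dsps: float
    time_us: float


def atp_ratios(design: Design, baseline: Design) -> dict:
    """Return the resource-time products of `design` normalized to `baseline`.

    Assumption: ratio = (resource * time) of design / (resource * time) of
    baseline, so a value above 1.0 means the design has a larger ATP.
    """
    return {
        "LUT": (design.luts * design.time_us) / (baseline.luts * baseline.time_us),
        "BRAM": (design.brams * design.time_us) / (baseline.brams * baseline.time_us),
        "DSP": (design.dsps * design.time_us) / (baseline.dsps * baseline.time_us),
    }


if __name__ == "__main__":
    # Placeholder numbers for illustration only; not taken from any experiment.
    baseline = Design(luts=10000, brams=5, dsps=4, time_us=10.0)
    other = Design(luts=20000, brams=5, dsps=2, time_us=15.0)
    print(atp_ratios(other, baseline))
```

By construction, the baseline compared against itself yields a ratio of 1.0 for every resource type.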
In the KyberKEM architecture with the SRDGT module of the present invention, all the dimensions k defined in the KyberKEM specification are supported. The data bandwidth of the implementation of the present invention is set to 64 bits. The design of the present invention achieves more than 227.6× speedup when compared with one of the related works, which is a software implementation on ARM Cortex-M4. Compared with the HW/SW co-design in one of the related works, the architecture of the present invention achieves at least 43.8× speedup and 340.5/333.5/163.8× smaller LUT-Time ATP, BRAM-Time ATP, and DSP-Time ATP, respectively, across all the security levels of KyberKEM.
The architecture of the present invention is also compared with the related pure hardware implementations. For all the security levels of KyberKEM, the hardware corresponding to the architecture of the present invention obtains at least 1.0×, 2.1×, 2.8×, and 10.7× speedup compared with four of the related works, respectively. Compared with one of the related works, the architecture of the present invention utilizes 1.0/0.6/0.4× the ATP in Kyber-512-CCA, but only 0.7/0.5/0.3× the ATP in Kyber-1024-CCA, in terms of LUT-Time ATP, BRAM-Time ATP, and DSP-Time ATP, respectively. The reason may be that the KyberKEM architecture of the present invention can benefit more from the split-radix DGT module at a lower security level. At a higher security level, however, the scheduling bottleneck can be the Keccak and Reject Sample modules rather than the split-radix DGT module. This causes the total cycle gap between the architecture of the present invention and one of the related works to decrease gradually, namely from 1.3× to 1.0× total cycles from Kyber-512-CCA to Kyber-1024-CCA.
As discussed above, the development of quantum computers threatens the security of the conventional public-key cryptography algorithms. CRYSTALS-KyberKEM is one of the leading algorithms in the ongoing NIST Post-Quantum Cryptography (PQC) competition. Because CRYSTALS-KyberKEM is a lattice-based cryptographic scheme, its efficiency depends on the polynomial multiplication over Rq, or equivalently on the negative wrapped convolution (NWC).
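To make the equivalence between multiplication over Rq and NWC concrete, the following is a minimal reference sketch of a schoolbook negative wrapped convolution. It is a quadratic-time illustration only, not the split-radix DGT of the present invention; the function name `nwc_schoolbook` is hypothetical, and the default modulus q = 3329 with n = 256 is assumed from the Kyber parameter set.

```python
Q = 3329  # Kyber modulus (assumed parameter, per the Kyber specification)
N = 256   # Kyber polynomial length (assumed parameter)


def nwc_schoolbook(a, b, q=Q):
    """Multiply two polynomials modulo (x^n + 1) and q by direct summation.

    Coefficients are listed from the constant term upward. A product term
    whose degree reaches n wraps around with a sign flip, since x^n = -1.
    """
    n = len(a)
    if len(b) != n:
        raise ValueError("inputs must have the same length")
    c = [0] * n
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            k = i + j
            if k < n:
                c[k] = (c[k] + ai * bj) % q
            else:
                # Negative wrap: x^k = -x^(k - n) in the ring.
                c[k - n] = (c[k - n] - ai * bj) % q
    return c


if __name__ == "__main__":
    # x^(N-1) * x = x^N, which equals -1 in the ring.
    a = [0] * N
    b = [0] * N
    a[N - 1] = 1
    b[1] = 1
    print(nwc_schoolbook(a, b)[:4])
```

A transform-based method such as NTT or DGT computes the same result as this reference in quasi-linear time, so a routine of this kind is mainly useful for checking a fast implementation on small inputs.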
In the present disclosure, the implementation of DGT with the split-radix method is explored, thereby providing a higher level of parallelism compared to the LC NTT and lower computational complexity compared to the classic DGT. The architecture of the split-radix DGT module of the present invention can support the DGT, IDGT, and CWM specific to KyberKEM, and outperforms state-of-the-art NWC modules. In the meantime, the KyberKEM architecture with the split-radix DGT module of the present invention is configured to support all the security levels of KyberKEM.
The architecture of the present invention can achieve higher performance and hardware efficiency than the state of the art. In some experiments, only 35.7 μs, 47.6 μs, and 68.6 μs are required for Kyber-512-CCA, Kyber-768-CCA, and Kyber-1024-CCA, respectively.
ROM, RAM, and other logical components can be realized through a well-designed layout that incorporates a range of physical components. In some embodiments, this includes the incorporation of passive elements like resistors, inductors, and capacitors, which are crucial for regulating current flow and stabilizing voltage levels. In some embodiments, active elements such as transistors and integrated circuits are arranged in specific configurations to amplify and control signals within the system. In some embodiments, the layout is further supported by the presence of interconnecting wires and conductive traces, which enable seamless transmission of signals between components.
The functional units and modules of the apparatuses and methods in accordance with the embodiments disclosed herein may be implemented using computing devices, computer processors, or electronic circuitries including but not limited to application specific integrated circuits (ASIC), field programmable gate arrays (FPGA), microcontrollers, and other programmable logic devices configured or programmed according to the teachings of the present disclosure. Computer instructions or software codes running in the computing devices, computer processors, or programmable logic devices can readily be prepared by practitioners skilled in the software or electronic art based on the teachings of the present disclosure.
All or portions of the methods in accordance with the embodiments may be executed in one or more computing devices including server computers, personal computers, laptop computers, and mobile computing devices such as smartphones and tablet computers.
The embodiments may include computer storage media, transient and non-transient memory devices having computer instructions or software codes stored therein, which can be used to program or configure the computing devices, computer processors, or electronic circuitries to perform any of the processes of the present invention. The storage media, transient and non-transient memory devices can include, but are not limited to, floppy disks, optical discs, Blu-ray Discs, DVDs, CD-ROMs, magneto-optical disks, ROMs, RAMs, flash memory devices, or any type of media or devices suitable for storing instructions, codes, and/or data.
Each of the functional units and modules in accordance with various embodiments also may be implemented in distributed computing environments and/or cloud computing environments, wherein the whole or portions of machine instructions are executed in distributed fashion by one or more processing devices interconnected by a communication network, such as an intranet, Wide Area Network (WAN), Local Area Network (LAN), the Internet, and other forms of data transmission media.
The foregoing description of the present invention has been provided for the purposes of illustration and description. It is not intended to be exhaustive or to limit the invention to the precise forms disclosed. Many modifications and variations will be apparent to the practitioner skilled in the art.
The embodiments were chosen and described in order to best explain the principles of the invention and its practical application, thereby enabling others skilled in the art to understand the invention for various embodiments and with various modifications that are suited to the particular use contemplated.