CRYPTOSYSTEM WITH UTILIZING SPLIT-RADIX DISCRETE GALOIS TRANSFORMATION

TECHNICAL FIELD

The present invention generally relates to cryptography techniques; more specifically, the present invention relates to a cryptosystem processor utilizing a split-radix Discrete Galois Transformation (DGT).

BACKGROUND

The rapid development of quantum computers and quantum algorithms such as Shor's algorithm threatens the security basis of conventional public-key cryptosystems such as RSA and ECC. The urgent need to replace conventional public-key cryptosystems with quantum-resistant cryptography, the so-called post-quantum cryptography (PQC), has drawn the attention of researchers and standards organizations. In July 2020, the round 3 candidates of the NIST PQC competition were announced, and four public-key algorithms were disclosed as the finalists. The finalists include three lattice-based schemes (i.e., CRYSTALS-KyberKEM, NTRU, SABER) and one code-based scheme (i.e., Classic McEliece). As stated by NIST, the PQC competition evaluates the submissions by different criteria, including security, cost, and algorithm and implementation characteristics.

To evaluate the PQC candidates on the criteria of cost, there have been lots of published articles comparing different implementations of candidates. Within the supplementary functions of the proposed PQC schemes, the most time-consuming parts are the polynomial multiplication and the hash functions. The recent hardware and software/hardware co-design works offload the polynomial multiplication and the hash functions to dedicated hardware accelerators.

The polynomial multiplication is computationally intensive. When polynomial multiplication is treated as a discrete convolution problem, it can be computed in quasi-linear time O(n log n) using number-theoretic transforms (NTTs). It is worth noting that polynomial multiplication over different polynomial rings must be treated carefully. For instance, polynomial multiplication over ℤ_q[x]/(xⁿ+1) can be treated as a negative wrapped convolution between the coefficient vectors of the input polynomials. The polynomial ring ℤ_q[x]/(xⁿ+1) is widely used in lattice-based cryptography algorithms because of its high computing efficiency. For instance, the cryptosystems CRYSTALS-KyberKEM, SABER, and Lyubashevsky's public-key cryptosystem employ the aforementioned polynomial ring in their cryptography algorithms.

In this regard, the further related works are provided as follows. KyberKEM is one of the final round key encapsulation mechanisms in the NIST post-quantum cryptography competition. NTT, as the computing bottleneck of KyberKEM, has been widely studied.

There are several works concerning the implementation of Crystals-KyberKEM. Software implementation of KyberKEM has been studied. One of the related works proposed a memory-efficient high-speed optimization of KyberKEM on ARM Cortex-M4 core. Furthermore, the side-channel defence of KyberKEM was treated, which is also based on ARM Cortex-M4 core. Likewise, one of the related works studied the software optimization of KyberKEM on a high-performance platform.

Pure hardware implementations of KyberKEM have been investigated. The related work is a compact hardware implementation of KyberKEM for the third round submission in NIST PQC competition on Xilinx Artix-7 FPGA platform. The related work proposed an implementation for both FPGA and ASIC design with an improvement in polynomial sampling cores. One of the related works proposed the high-performance implementation of KyberKEM, NTRU and Saber with a novel Polynomial Vector Multiplication Unit (PVMU) design. One of the related works concerns the side-channel protection for KyberKEM in pure hardware.

On the other hand, software/hardware co-design implementations of KyberKEM have been investigated. One of the related works proposed an ASIC crypto-processor based on RISC-V architecture supporting Crystals-KyberKEM, Crystals-Dilithium, FrodoKEM, NewHope, and qTesla for the second round submission in NIST PQC competition, which was extended to FPGA platform. The related work proposed the integration of instruction sets for finite field arithmetic operations in a RISC-V processor, supporting PQC algorithms including KyberKEM and NewHope. The related work integrated the vectorized modular arithmetic operations and NTT computation in a RISC-V processor and presents the ASIC and FPGA implementation result, supporting PQC algorithms including KyberKEM, Saber and NewHope.

The literature regarding improving the polynomial multiplication in hardware is also being discussed. Several works focusing on the implementation of the NTT computation have been investigated. One of the related works proposed a low-complexity NTT/INTT algorithm, absorbing the pre-process and post-process into NTT and INTT, respectively. In the related work, a parallel architecture is proposed for high-speed NTT design. Other related works proposed NTT-based polynomial multiplication architectures for KyberKEM on FPGA.

Currently, there are several challenges in the field of cryptography, particularly related to computing complexity. The aforementioned methods may not sufficiently reduce the computing complexity in a smooth manner, leading to difficulties in improving performance. Therefore, there is a need to develop robust transformation algorithms that can effectively address these challenges.

SUMMARY OF INVENTION

It is an objective of the present invention to provide an apparatus and a method to address the aforementioned shortcomings and unmet needs in the state of the art. In accordance with one aspect of the present invention, a cryptosystem processor for operating split-radix discrete Galois transformation/inverse discrete Galois transformation is provided. The cryptosystem processor includes a twiddle factor memory, at least one split radix discrete Galois transformation/inverse discrete Galois transformation butterfly unit (SRDGT BFU), and a stream permutation network (SPN). The twiddle factor memory is instantiated by dual-port read only memory (ROM) and has a first ZETA port and a second ZETA port. The at least one SRDGT BFU has six input ports and four output ports and is switchable among a discrete Galois transformation (DGT) mode, an inverse discrete Galois transformation (IDGT) mode, and a component-wise multiplication (CWM) mode, in which two of the input ports electrically communicate with the first ZETA port and the second ZETA port, respectively. The SRDGT BFU is configured to read and write two data points when working under the DGT or IDGT mode and is configured to read and write four data points when working under the CWM mode. The SPN electrically communicates with the SRDGT BFU and has a first dual-port block random access memory (BRAM), a second dual-port BRAM, and a third dual-port BRAM which serve as memory caches configured to store polynomials, wherein the SPN is configured to support the required number of data points read or written per cycle in the DGT mode, the IDGT mode, or the CWM mode.

In accordance with another aspect of the present invention, a split radix DGT apparatus is provided. The split radix DGT apparatus includes a cryptosystem processor, a RAM module, and an input/output part. The RAM module electrically communicates with the cryptosystem processor and is configured to store polynomials and to pass the polynomials into the cryptosystem processor. The input/output part is configured to work as an input/output buffer for the architecture of the split radix DGT apparatus.

By the configuration of the present invention, a novel DGT algorithm that leverages the split-radix method is provided. It is an objective to reduce computing complexity while maintaining the transform length. The algorithm achieves lower computing complexity without compromising the transform length, making it a more efficient alternative to existing NTT algorithms when implemented in software or hardware. Additionally, the configuration of the present invention ensures efficient processing through a fully-pipelined scheduling technique, facilitated by a dedicated stream permutation network. The applied approach of the present invention enables smooth data flow during the transformation process, enhancing overall performance.

To optimize hardware utilization, the configuration of the present invention introduces a compact and unified split-radix DGT processor. This processor shares multipliers among the nine working modes of the split-radix DGT algorithm, reducing hardware requirements and allowing for parameterization based on specific computing tasks. By integrating different operations and designing compact hardware modules, these processors streamline operations and enhance overall system efficiency. The combination of the split-radix DGT algorithm and the efficient processor design results in improved performance, resource optimization, and reduced hardware complexity.

BRIEF DESCRIPTION OF DRAWINGS

Embodiments of the invention are described in more detail hereinafter with reference to the drawings, in which:

FIG. 1 shows Table I and Table II for illustrating mathematical notations and parameters according to one aspect of the present disclosure;

FIG. 2 shows the data flow and the butterfly of an 8-point split-radix DGT according to some embodiments of the present invention;

FIG. 3 shows Algorithm 1 for split-radix DGT according to some embodiments of the present invention;

FIG. 4 shows Algorithm 2 for split-radix DGT according to some embodiments of the present invention;

FIG. 5 shows Table III for illustrating comparison on the number of modular operations according to one aspect of the present disclosure;

FIG. 6 depicts architecture of a cryptosystem processor for operating the split-radix DGT/IDGT according to some embodiments of the present invention;

FIG. 7A and FIG. 7B illustrate detailed block diagrams of the SRDGT BFU according to some embodiments of the present invention;

FIG. 7C shows Table IV for illustrating nine modes for the SRDGT BFU according to one aspect of the present disclosure;

FIG. 8 illustrates an exemplary architecture of a multiplier over ℤ_q[z]/(z²+1) according to one aspect of the present disclosure;

FIG. 9 depicts exemplary scheduling of memory operations for the SRDGT BFU according to one aspect of the present disclosure;

FIG. 10 depicts exemplary scheduling of memory operations for the CWM mode according to some embodiments of the present invention;

FIG. 11 depicts architecture of a Split Radix DGT apparatus according to some embodiments of the present invention;

FIG. 12 shows Table V for illustrating an implementation result for the comparison; and

FIG. 13 depicts Table VI for showing the hardware resource utilization and the latency of the hardware system according to one aspect of the present disclosure.

DETAILED DESCRIPTION OF THE INVENTION

In the following description, a cryptosystem utilizing a split-radix Discrete Galois Transformation (DGT) and the like are set forth as preferred examples. It will be apparent to those skilled in the art that modifications, including additions and/or substitutions, may be made without departing from the scope and spirit of the invention. Specific details may be omitted so as not to obscure the invention; however, the disclosure is written to enable one skilled in the art to practice the teachings herein without undue experimentation.

To make the present invention understandable, notations and basic operations are stated before embodiments. FIG. 1 shows Table I and Table II for illustrating mathematical notations and parameters according to one aspect of the present disclosure.

Table I provides the mathematical notations used in the present disclosure. R_q is used to denote the polynomial ring ℤ_q[x]/(xⁿ+1) defined over the field ℤ_q, where q is a prime integer.

CRYSTALS-KyberKEM is a key-encapsulation mechanism that is secure against adaptive chosen-ciphertext attacks (IND-CCA2). The security of KyberKEM is based on the hardness of the learning-with-errors problem in module lattices (i.e., the MLWE problem). To construct an IND-CCA2-secure KEM, CRYSTALS-Kyber applies a slightly tweaked Fujisaki-Okamoto (FO) transform to a public-key encryption (PKE) scheme that is secure against chosen-plaintext attacks (IND-CPA), called CRYSTALS-KyberPKE. The parameter sets for CRYSTALS-KyberKEM are shown in Table II. The key generation, encryption, and decryption of CRYSTALS-KyberPKE are defined as follows, using the helper functions CBD, Parse, Compress, NTT, and INTT:

- KeyGen(⋅): Key generation samples s and e from the centered binomial distribution (CBD), and Â from the uniform distribution (Parse). The public key pk=(ρ, t̂) and the secret key sk=ŝ are returned, where ρ is the random seed and t̂=Â∘ŝ+ê.
- Enc(pk, M): Encryption samples r, e₁, and E₂ from CBD, and Â from Parse. The ciphertext ct=(Compress(u), Compress(V)) is returned, where u=INTT(Â^T∘r̂)+e₁ and V=INTT(t̂^T∘r̂)+E₂+M.
- Dec(sk, ct): Decryption returns the recovered message M=Compress(V−INTT(ŝ^T∘û)), where u and V are decompressed from ct.

NTT and Inverse NTT (INTT):

NTT is a variant of the Discrete Fourier Transform (DFT) in which the complex number field is replaced by the finite field ℤ_q. Given a polynomial A of length n, the length-n NTT (noted as NTT_n) is defined as

$\hat{A}_{j} = {NTT}_{n} (A)_{j} = \sum_{i = 0}^{n - 1} A_{i} ω_{n}^{i j} \mod q,$

where 0≤j<n. Here ω_n (mod q) denotes a primitive n-th root of unity over ℤ_q, also called the twiddle factor of the length-n NTT; ω_n (mod q) exists when q≡1 (mod n).

The inverse NTT (INTT) is obtained by replacing the twiddle factor ω_n (mod q) with ω_n⁻¹ (mod q) and multiplying by the scalar factor n⁻¹ (mod q) after the summation. The length-n INTT (noted as INTT_n) is defined as

$A_{i} = {INTT}_{n} (\hat{A})_{i} = n^{- 1} \sum_{j = 0}^{n - 1} \hat{A}_{j} ω_{n}^{- i j} \mod q,$

where 0≤i<n.
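The two definitions above can be checked with a direct O(n²) evaluation. The following is a minimal sketch, not an optimized transform: the function names are illustrative, the root of unity is found by a simple search, and n is assumed to be a power of two.

```python
def primitive_root_of_unity(n, q):
    """Smallest w of multiplicative order n modulo the prime q (n a power of two, n >= 2)."""
    if (q - 1) % n != 0:
        raise ValueError("a primitive n-th root of unity requires n | (q - 1)")
    for w in range(2, q):
        # For n a power of two, w^(n/2) == -1 (mod q) forces the order to be exactly n.
        if pow(w, n // 2, q) == q - 1:
            return w
    raise ValueError("no primitive n-th root of unity found")


def ntt(a, w, q):
    """Length-n NTT by direct evaluation of the defining sum."""
    n = len(a)
    return [sum(a[i] * pow(w, i * j, q) for i in range(n)) % q for j in range(n)]


def intt(a_hat, w, q):
    """Length-n INTT: twiddle factor w^-1 and scaling by n^-1 (mod q)."""
    n = len(a_hat)
    w_inv = pow(w, q - 2, q)  # Fermat inverse, valid because q is prime
    n_inv = pow(n, q - 2, q)
    return [n_inv * sum(a_hat[j] * pow(w_inv, i * j, q) for j in range(n)) % q
            for i in range(n)]


if __name__ == "__main__":
    q, n = 3329, 128
    w = primitive_root_of_unity(n, q)
    a = [(7 * i + 3) % q for i in range(n)]
    assert intt(ntt(a, w, q), w, q) == a  # the inverse recovers the input
```

The round trip succeeds because the sum of ω_n^{j(i−i′)} over j vanishes unless i=i′, which is why a primitive root is needed.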

Polynomial Multiplication Via NTT:

According to fast Fourier transform and convolution algorithms, polynomial multiplication over R_q=ℤ_q[x]/(xⁿ+1) can be computed efficiently by negative wrapped convolution (NWC) when the prime parameter q satisfies 2n|(q−1). NWC introduces pre-processing before the NTT and post-processing after the INTT. In order to reduce the computing complexity of NWC, a related work proposed low-complexity NTT and INTT algorithms (noted as LC NTT/INTT) that merge the pre-processing and post-processing into the NTT and INTT without additional modular multiplications. Based on the related work, the number of modular multiplications in the LC NTT/INTT algorithms is

$\frac{n}{2} \log_{2} n .$

Starting from the second-round submission of CRYSTALS-KyberKEM, the parameter set (n,q) is selected as (256,3329), as shown in Table II. Given that n|(q−1) but 2n∤(q−1), the aforementioned NWC via NTT cannot be applied directly. Instead, a variant of NTT can be adopted to apply the NWC in CRYSTALS-KyberKEM. Such a variant is based on the observation that, when a polynomial F is factored into a product F=GH of coprime factors over the finite field ℤ_q, the Chinese remainder theorem provides an isomorphism:

$ℤ_{q} [x] / (F) ≅ ℤ_{q} [x] / (G) \times ℤ_{q} [x] / (H) .$

Since x²⁵⁶+1=Π_{i=0}^{127}(x²−ω₂₅₆^{2i+1}), where ω₂₅₆ is a primitive 256-th root of unity, the NTT working on ℤ₃₃₂₉[x]/(x²⁵⁶+1) (noted as NTT₂₅₆³³²⁹) is defined by:

${{NTT}_{256}^{3329} (A)}_{i} = A (x) \mod (x^{2} - ω_{256}^{2 i + 1})$

$\begin{matrix} = x \sum_{j = 0}^{127} A_{2 j + 1} ω_{256}^{(2 i + 1) j} + \sum_{j = 0}^{127} A_{2 j} ω_{256}^{(2 i + 1) j} \\ = x {\hat{A}}_{2 i + 1} + {\hat{A}}_{2 i}, \end{matrix}$

where 0≤i<128, Â_{2i+1}=Σ_{j=0}^{127} A_{2j+1} ω₂₅₆^{(2i+1)j}, and Â_{2i}=Σ_{j=0}^{127} A_{2j} ω₂₅₆^{(2i+1)j}. Both Â_{2i+1} and Â_{2i} can be computed by the length-128 low-complexity NTT. As for the inverse transform, two length-128 low-complexity INTTs can be used to reconstruct A(x) from Â_{2i+1} and Â_{2i}. The NWC over ℤ₃₃₂₉[x]/(x²⁵⁶+1) can then be computed as:

INTT₂₅₆³³²⁹(NTT₂₅₆³³²⁹(a)∘NTT₂₅₆³³²⁹(b)).

Such component-wise multiplication is defined as:

$A (x) \circ B (x) \mod (x^{2} - ω_{2 5 6}^{2 i + 1}) = (x Â_{2 i + 1} + Â_{2 i}) (x {\hat{B}}_{2 i + 1} + {\hat{B}}_{2 i}) \mod (x^{2} - ω_{2 5 6}^{2 i + 1}),$

where 0≤i<128. Thus, polynomial multiplication over ℤ₃₃₂₉[x]/(x²⁵⁶+1) is performed by four length-128 low-complexity NTTs, one length-128 component-wise multiplication, and two length-128 low-complexity INTTs.
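As an illustration of the residue representation and the component-wise multiplication modulo x²−ω₂₅₆^{2i+1}, the following sketch evaluates the defining sums directly and compares the result with schoolbook negacyclic multiplication. It is a quadratic-time check under stated assumptions, not the low-complexity NTT: the function names are illustrative, the root ω₂₅₆ is found by search, and the residues are kept in natural order i rather than any bit-reversed order an implementation might use.

```python
Q, N = 3329, 256
HALF = N // 2

# A primitive 256-th root of unity mod Q, found by search: g^128 == -1 forces order 256.
OMEGA = next(g for g in range(2, Q) if pow(g, HALF, Q) == Q - 1)
OMEGA_INV = pow(OMEGA, Q - 2, Q)


def ntt256(a):
    """Residues of a(x) mod (x^2 - OMEGA^(2i+1)) as pairs (A_hat_2i, A_hat_2i+1)."""
    out = []
    for i in range(HALF):
        r = pow(OMEGA, 2 * i + 1, Q)  # x^2 is replaced by r
        even = sum(a[2 * j] * pow(r, j, Q) for j in range(HALF)) % Q
        odd = sum(a[2 * j + 1] * pow(r, j, Q) for j in range(HALF)) % Q
        out.append((even, odd))
    return out


def intt256(a_hat):
    """Recover the 256 coefficients from the 128 degree-1 residues."""
    a = [0] * N
    half_inv = pow(HALF, Q - 2, Q)
    for j in range(HALF):
        even = odd = 0
        for i, (e, o) in enumerate(a_hat):
            t = pow(OMEGA_INV, (2 * i + 1) * j, Q)
            even += e * t
            odd += o * t
        a[2 * j] = even * half_inv % Q
        a[2 * j + 1] = odd * half_inv % Q
    return a


def component_wise_mul(a_hat, b_hat):
    """(x*a1 + a0)(x*b1 + b0) mod (x^2 - r) for each residue pair."""
    out = []
    for i, ((a0, a1), (b0, b1)) in enumerate(zip(a_hat, b_hat)):
        r = pow(OMEGA, 2 * i + 1, Q)
        out.append(((a0 * b0 + a1 * b1 * r) % Q, (a0 * b1 + a1 * b0) % Q))
    return out


def schoolbook_negacyclic(a, b):
    """Reference product in Z_Q[x]/(x^N + 1)."""
    c = [0] * N
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            k = i + j
            if k < N:
                c[k] = (c[k] + ai * bj) % Q
            else:
                c[k - N] = (c[k - N] - ai * bj) % Q  # x^N = -1
    return c
```

With these helpers, `intt256(component_wise_mul(ntt256(a), ntt256(b)))` is expected to equal `schoolbook_negacyclic(a, b)` for any coefficient lists of length 256.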

Negative Wrapped Convolution Via DGT:

Consider A(x)∈R_q=ℤ_q[x]/(xⁿ+1). Let

$m = \frac{n}{2}$

and z=x^m=√(xⁿ)≡√(−1) (mod q). Then, A(x) is rewritten as:

$\begin{matrix} A (x) = A_{n - 1} x^{n - 1} + A_{n - 2} x^{n - 2} + \dots + A_{1} x^{1} + A_{0} = (A_{n - 1} x^{m} + A_{m - 1}) x^{m - 1} + \dots + (A_{m} x^{m} + A_{0}) = (A_{n - 1} z + A_{m - 1}) x^{m - 1} + \dots + (A_{m} z + A_{0}) = {\bar{A}}_{m - 1} x^{m - 1} + \dots + {\bar{A}}_{1} x^{1} + {\bar{A}}_{0}, & (Eq . 0) \end{matrix}$

where Ā_i=A_{i+m}z+A_i, 0≤i<m. It is notable that Ā_i∈ℤ_q[z]/(z²+1), which is isomorphic to GF(q²). Given 0≤i,j<m, arithmetic operations over ℤ_q[z]/(z²+1) are defined as:

$\begin{matrix} Addition : {\bar{A}}_{i} + {\bar{A}}_{j} = (A_{i + m} + A_{j + m}) z + (A_{i} + A_{j}); & (Eq . 1) \end{matrix}$

$Multiplication : {\bar{A}}_{i} \circ {\bar{A}}_{j} = (A_{i} A_{j + m} + A_{i + m} A_{j}) z + (A_{i} A_{j} - A_{i + m} A_{j + m}) .$
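The arithmetic over ℤ_q[z]/(z²+1) can be sketched as follows, storing an element A_{i+m}z+A_i as the pair (A_i, A_{i+m}). The tuple layout and function names are assumptions of this sketch. The three-multiplication variant follows the usual Karatsuba rearrangement and uses five modular additions or subtractions.

```python
Q = 3329  # modulus used for the illustration


def gadd(a, b, q=Q):
    """Addition: coefficient-wise, two modular additions."""
    return ((a[0] + b[0]) % q, (a[1] + b[1]) % q)


def gmul_schoolbook(a, b, q=Q):
    """Multiplication with z^2 = -1, four modular multiplications."""
    a0, a1 = a
    b0, b1 = b
    return ((a0 * b0 - a1 * b1) % q, (a0 * b1 + a1 * b0) % q)


def gmul_karatsuba(a, b, q=Q):
    """Same product with three modular multiplications and five additions."""
    a0, a1 = a
    b0, b1 = b
    t0 = a0 * b0 % q
    t1 = a1 * b1 % q
    t2 = (a0 + a1) * (b0 + b1) % q   # two additions, one multiplication
    real = (t0 - t1) % q             # one subtraction
    imag = (t2 - t0 - t1) % q        # two subtractions
    return (real, imag)


if __name__ == "__main__":
    z = (0, 1)
    assert gmul_karatsuba(z, z) == (Q - 1, 0)  # z * z = -1
```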

An element ζ_m∈ℤ_q[z]/(z²+1) can be defined such that ζ_m^m=z. Such ζ_m exists when 4m|(q−1). It is observed that {ζ_m^{4i+1}, ∀0≤i<m} is a set of solutions of the equation x^m=z over ℤ_q[z]/(z²+1), and that this set fulfills the following properties:

$Symmetry : ζ_{m}^{4 (i + \frac{m}{2}) + 1} = ζ_{m}^{4 i + 1} ζ_{m}^{2 m} = (- 1) ζ_{m}^{4 i + 1};$

$Periodicity : ζ_{m}^{4 (i + m) + 1} = ζ_{m}^{4 i + 1} ζ_{m}^{4 m} = ζ_{m}^{4 i + 1};$

$Scalability : ζ_{m / k}^{4 (i / k) + 1} = ζ_{m / k}^{4 (i / k)} ζ_{m / k}^{1} = ζ_{m}^{4 i + k};$

$Semi - symmetry : ζ_{m}^{4 (i + \frac{m}{4}) + 1} = ζ_{m}^{4 i + 1} ζ_{m}^{m} = ζ_{m}^{4 i + 1} z;$

where k is a power-of-two integer that is smaller than m. Thus, the set of twiddle factors in the Discrete Galois Transform (DGT) is defined as {ζ_m^{4i+1}, ∀0≤i<m}. For a length-m polynomial Ā whose entries Ā_i∈ℤ_q[z]/(z²+1), the length-m DGT (noted as DGT_m) is defined as

$\hat{\overline{A}}_{j} = {DGT}_{m} (\overline{A})_{j} = \sum_{i = 0}^{m - 1} ({\overline{A}}_{i} ζ_{m}^{i}) ζ_{m}^{4 j i},$

where 0≤j<m. Similarly, the length-m IDGT (noted as IDGT_m) is defined as

${\overline{A}}_{i} = {IDGT}_{m} (\hat{\overline{A}})_{i} = m^{- 1} ζ_{m}^{- i} \sum_{j = 0}^{m - 1} \hat{\overline{A}}_{j} ζ_{m}^{- 4 j i},$

where 0≤i<m. According to a related work, one can perform the DGT_m and IDGT_m algorithms similarly to the classic NTT and INTT by replacing the arithmetic operations in ℤ_q with the arithmetic operations in ℤ_q[z]/(z²+1) defined in (Eq. 1), which means that

$\frac{m}{2} \log_{2} m + m$ and $\frac{m}{2} \log_{2} m + 2 m$

multiplications in ℤ_q[z]/(z²+1) are needed in DGT_m and IDGT_m, respectively. Note that an addition in ℤ_q[z]/(z²+1) involves no modular multiplication, while each multiplication in ℤ_q[z]/(z²+1) costs three modular multiplications using the Karatsuba method. The numbers of modular multiplications in DGT_m and IDGT_m are therefore

$3 \frac{m}{2} \log_{2} m + 3 m$ and $3 \frac{m}{2} \log_{2} m + 6 m,$

respectively.
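The twiddle-factor properties and the DGT/IDGT definitions can be exercised on a toy instance. The sketch below assumes q=257 and m=8 (chosen only so that 4m|(q−1); not a Kyber parameter set), finds a ζ_m by exhaustive search, and evaluates both transforms directly from their defining sums. All names are illustrative.

```python
Q, M = 257, 8            # toy parameters (assumption) with 4*M | (Q - 1)
Z, ONE = (0, 1), (1, 0)  # elements are stored as (constant part, z part)


def gadd(a, b):
    return ((a[0] + b[0]) % Q, (a[1] + b[1]) % Q)


def gneg(a):
    return ((-a[0]) % Q, (-a[1]) % Q)


def gmul(a, b):
    return ((a[0] * b[0] - a[1] * b[1]) % Q, (a[0] * b[1] + a[1] * b[0]) % Q)


def gpow(a, e):
    """Square-and-multiply exponentiation in Z_Q[z]/(z^2 + 1)."""
    result = ONE
    while e:
        if e & 1:
            result = gmul(result, a)
        a = gmul(a, a)
        e >>= 1
    return result


def find_zeta(m):
    """Exhaustive search for an element zeta with zeta^m = z."""
    for re in range(Q):
        for im in range(Q):
            if gpow((re, im), m) == Z:
                return (re, im)
    raise ValueError("no zeta with zeta^m = z; check that 4m | (q - 1)")


def dgt(a, zeta):
    """A_hat_j = sum_i A_i * zeta^((4j+1) i), evaluated directly."""
    m = len(a)
    out = []
    for j in range(m):
        acc = (0, 0)
        for i in range(m):
            acc = gadd(acc, gmul(a[i], gpow(zeta, (4 * j + 1) * i)))
        out.append(acc)
    return out


def idgt(a_hat, zeta):
    """A_i = m^-1 * zeta^-i * sum_j A_hat_j * zeta^(-4ji), evaluated directly."""
    m = len(a_hat)
    zeta_inv = gpow(zeta, 4 * m - 1)     # zeta^(4m) = 1
    m_inv = (pow(m, Q - 2, Q), 0)
    out = []
    for i in range(m):
        acc = (0, 0)
        for j in range(m):
            acc = gadd(acc, gmul(a_hat[j], gpow(zeta_inv, 4 * j * i)))
        out.append(gmul(m_inv, gmul(gpow(zeta_inv, i), acc)))
    return out
```

For any ζ returned by `find_zeta(M)`, the symmetry, periodicity, and semi-symmetry identities can be checked with `gpow`, and `idgt(dgt(a, zeta), zeta)` is expected to return `a`.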

According to a related work, the length-n NWC can also be solved via DGT as:

IDGT_m(DGT_m(A)∘DGT_m(B)),

where

$m = \frac{n}{2} .$

Thus, the length-n NWC can be performed as two length-m DGTs after pre-processing, one length-m point-wise multiplication, and one length-m IDGT followed by post-processing.
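A small end-to-end sketch of the NWC via DGT is given below. It assumes toy parameters q=257 and n=16 (so m=8), folds each input as Ā_i=A_{i+m}z+A_i, and compares the result with schoolbook negacyclic multiplication. The helper names, the pair representation, and the direct O(m²) transforms are assumptions of the sketch rather than the scheduling used by the processor.

```python
Q, N = 257, 16           # toy parameters (assumption): 4*(N/2) | (Q - 1)
M = N // 2
Z, ONE = (0, 1), (1, 0)  # elements are stored as (constant part, z part)


def gadd(a, b):
    return ((a[0] + b[0]) % Q, (a[1] + b[1]) % Q)


def gmul(a, b):
    return ((a[0] * b[0] - a[1] * b[1]) % Q, (a[0] * b[1] + a[1] * b[0]) % Q)


def gpow(a, e):
    result = ONE
    while e:
        if e & 1:
            result = gmul(result, a)
        a = gmul(a, a)
        e >>= 1
    return result


def find_zeta(m):
    """Exhaustive search for zeta with zeta^m = z."""
    return next((r, i) for r in range(Q) for i in range(Q) if gpow((r, i), m) == Z)


def gsum(terms):
    acc = (0, 0)
    for t in terms:
        acc = gadd(acc, t)
    return acc


def dgt(a, zeta):
    m = len(a)
    return [gsum(gmul(a[i], gpow(zeta, (4 * j + 1) * i)) for i in range(m))
            for j in range(m)]


def idgt(a_hat, zeta):
    m = len(a_hat)
    zeta_inv = gpow(zeta, 4 * m - 1)
    m_inv = (pow(m, Q - 2, Q), 0)
    return [gmul(m_inv, gmul(gpow(zeta_inv, i),
                             gsum(gmul(a_hat[j], gpow(zeta_inv, 4 * j * i))
                                  for j in range(m))))
            for i in range(m)]


def fold(a):
    """A_bar_i = A_{i+m} z + A_i, stored as (A_i, A_{i+m})."""
    return [(a[i], a[i + M]) for i in range(M)]


def unfold(a_bar):
    return [c[0] for c in a_bar] + [c[1] for c in a_bar]


def nwc_via_dgt(a, b, zeta):
    a_hat, b_hat = dgt(fold(a), zeta), dgt(fold(b), zeta)
    c_hat = [gmul(x, y) for x, y in zip(a_hat, b_hat)]  # point-wise multiplication
    return unfold(idgt(c_hat, zeta))


def schoolbook_negacyclic(a, b):
    c = [0] * N
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            k, s = (i + j, 1) if i + j < N else (i + j - N, -1)  # x^N = -1
            c[k] = (c[k] + s * ai * bj) % Q
    return c
```

The check works because folding turns the ring ℤ_q[x]/(xⁿ+1) into polynomials of length m with x^m=z, and the DGT evaluates them at the m solutions ζ_m^{4j+1} of that equation.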

In the present invention, a split-radix DGT and a split-radix inverse Discrete Galois Transform (IDGT) are proposed to reduce the computing complexity. In the present disclosure, an approach is proposed to integrate split radix and decimation-in-time (DIT) into the low-complexity DGT algorithm, while using split radix and decimation-in-frequency (DIF) to derive the IDGT. These novel split-radix DGT/IDGT algorithms inherit the small number of multiplications from the split-radix nature and the short transformation length from the DGT/IDGT, which enables low-complexity NWCs.

The Proposed Split Radix DGT:

The low-complexity DGT is derived using split radix and decimation-in-time (DIT). Consider a length-m polynomial Ā whose entries Ā_i∈ℤ_q[z]/(z²+1). The derivation starts by splitting the summation of the DGT into three groups according to the index of Ā as follows:

$\begin{matrix} {\hat{\overline{A}}}_{j} = \sum_{i = 0}^{\frac{m}{4} - 1} {\overline{A}}_{4 i + 1} (ζ_{m}^{4 j + 1})^{4 i + 1} + \sum_{i = 0}^{\frac{m}{2} - 1} {\overline{A}}_{2 i} (ζ_{m}^{4 j + 1})^{2 i} + \sum_{i = 0}^{\frac{m}{4} - 1} {\overline{A}}_{4 i + 3} (ζ_{m}^{4 j + 1})^{4 i + 3} \\ = ζ_{m}^{4 j + 1} \sum_{i = 0}^{\frac{m}{4} - 1} {\overline{A}}_{4 i + 1} (ζ_{m / 4}^{4 j + 1})^{i} + \sum_{i = 0}^{\frac{m}{2} - 1} {\overline{A}}_{2 i} (ζ_{m / 2}^{4 j + 1})^{i} + ζ_{m}^{3 (4 j + 1)} \sum_{i = 0}^{\frac{m}{4} - 1} {\overline{A}}_{4 i + 3} (ζ_{m / 4}^{4 j + 1})^{i}, \end{matrix}$

where 0≤j<m. The length-m DGT can thus be decomposed into two length-m/4 DGTs and one length-m/2 DGT. Namely, setting

${\hat{\overline{W}}}_{j} = \sum_{i = 0}^{\frac{m}{2} - 1} {\bar{A}}_{2 i} (ζ_{m / 2}^{4 j + 1})^{i}, \quad {\hat{\overline{X}}}_{j} = \sum_{i = 0}^{\frac{m}{4} - 1} {\bar{A}}_{4 i + 1} (ζ_{m / 4}^{4 j + 1})^{i}, \quad {\hat{\overline{Y}}}_{j} = \sum_{i = 0}^{\frac{m}{4} - 1} {\bar{A}}_{4 i + 3} (ζ_{m / 4}^{4 j + 1})^{i},$

then:

$\begin{matrix} {\hat{\overline{A}}}_{j} = {\hat{\overline{W}}}_{j} + ζ_{m}^{4 j + 1} {\hat{\overline{X}}}_{j} + ζ_{m}^{3 (4 j + 1)} {\hat{\overline{Y}}}_{j} & (Eq . 2) \end{matrix}$

For

$0 \leq j < \frac{m}{4},$

(Eq. 2) is rewritten in terms of Ŵ_j, Ŵ_{j+m/4}, X̂_j, and Ŷ_j as:

$\begin{matrix} {\hat{\overline{A}}}_{j} = {\hat{\overline{W}}}_{j} + (ζ_{m}^{4 j + 1} {\hat{\overline{X}}}_{j} + ζ_{m}^{3 (4 j + 1)} {\hat{\overline{Y}}}_{j}), & (Eq . 3 A) \end{matrix}$

${\hat{\overline{A}}}_{j + \frac{m}{4}} = {\hat{\overline{W}}}_{j + \frac{m}{4}} + z (ζ_{m}^{4 j + 1} \hat{\overline{X_{j}}} - ζ_{m}^{3 (4 j + 1)} \hat{\overline{Y_{j}}}),$

${\hat{\overline{A}}}_{j + \frac{m}{2}} = {\hat{\overline{W}}}_{j} - (ζ_{m}^{4 j + 1} \hat{\overline{X_{j}}} + ζ_{m}^{3 (4 j + 1)} \hat{\overline{Y_{j}}}),$

${\hat{\overline{A}}}_{j + \frac{3 m}{4}} = {\hat{\overline{W}}}_{j + \frac{m}{4}} - z (ζ_{m}^{4 j + 1} \hat{\overline{X_{j}}} - ζ_{m}^{3 (4 j + 1)} \hat{\overline{Y_{j}}}) .$

(Eq. 3A) represents the asymmetric DIT butterfly computation for the split-radix DGT. There are two boundary cases, at m=2 and m=4. When m=2, the DGT is computed as:

$\begin{matrix} \hat{{\overline{A}}_{0}} = {\overline{A}}_{0} + {\overline{A}}_{1} ζ_{2}, & (Eq . 3 B) \end{matrix}$

$\hat{{\overline{A}}_{1}} = {\overline{A}}_{0} + {\overline{A}}_{1} ζ_{2}^{5} = {\overline{A}}_{0} - {\overline{A}}_{1} ζ_{2} .$

And, when m=4, the DGT is computed as:

$\hat{{\overline{A}}_{0}} = ({\bar{A}}_{0} + {\bar{A}}_{2} ζ_{4}^{2}) + ({\bar{A}}_{1} ζ_{4} + {\bar{A}}_{3} ζ_{4}^{3}),$

$\hat{{\overline{A}}_{1}} = ({\bar{A}}_{0} - {\bar{A}}_{2} ζ_{4}^{2}) + z ({\bar{A}}_{1} ζ_{4} - {\bar{A}}_{3} ζ_{4}^{3}),$

$\hat{{\overline{A}}_{2}} = ({\bar{A}}_{0} + {\bar{A}}_{2} ζ_{4}^{2}) - ({\bar{A}}_{1} ζ_{4} + {\bar{A}}_{3} ζ_{4}^{3}),$

$\hat{{\overline{A}}_{3}} = ({\bar{A}}_{0} - {\bar{A}}_{2} ζ_{4}^{2}) - z ({\bar{A}}_{1} ζ_{4} - {\bar{A}}_{3} ζ_{4}^{3}) .$
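The butterfly in (Eq. 3A) and the boundary case in (Eq. 3B) can be sketched as a recursive routine and compared with a direct evaluation of the DGT definition. This is an illustrative recursion under assumed toy parameters (q=257), not the iterative, operator-based schedule of Algorithm 1. As a shortcut of this sketch, the m=4 boundary case is obtained by letting the recursion bottom out at length 1, where the transform is the identity.

```python
Q = 257                  # toy modulus (assumption); needs 4*m | (Q - 1)
Z, ONE = (0, 1), (1, 0)  # elements are stored as (constant part, z part)


def gadd(a, b):
    return ((a[0] + b[0]) % Q, (a[1] + b[1]) % Q)


def gsub(a, b):
    return ((a[0] - b[0]) % Q, (a[1] - b[1]) % Q)


def gmul(a, b):
    return ((a[0] * b[0] - a[1] * b[1]) % Q, (a[0] * b[1] + a[1] * b[0]) % Q)


def gmul_z(a):
    """Multiply by z without any modular multiplication: (c + d z) z = -d + c z."""
    return ((-a[1]) % Q, a[0])


def gpow(a, e):
    result = ONE
    while e:
        if e & 1:
            result = gmul(result, a)
        a = gmul(a, a)
        e >>= 1
    return result


def find_zeta(m):
    """Exhaustive search for zeta with zeta^m = z."""
    return next((r, i) for r in range(Q) for i in range(Q) if gpow((r, i), m) == Z)


def naive_dgt(a, zeta):
    m = len(a)
    out = []
    for j in range(m):
        acc = (0, 0)
        for i in range(m):
            acc = gadd(acc, gmul(a[i], gpow(zeta, (4 * j + 1) * i)))
        out.append(acc)
    return out


def split_radix_dgt(a, zeta):
    """Recursive split-radix DIT DGT; zeta must satisfy zeta^len(a) = z."""
    m = len(a)
    if m == 1:
        return list(a)
    if m == 2:                                  # (Eq. 3B)
        t = gmul(a[1], zeta)
        return [gadd(a[0], t), gsub(a[0], t)]
    zeta2 = gmul(zeta, zeta)                    # twiddle base of the length-m/2 problem
    zeta4 = gmul(zeta2, zeta2)                  # twiddle base of the length-m/4 problems
    w = split_radix_dgt(a[0::2], zeta2)
    x = split_radix_dgt(a[1::4], zeta4)
    y = split_radix_dgt(a[3::4], zeta4)
    quarter = m // 4
    out = [None] * m
    for j in range(quarter):                    # (Eq. 3A)
        t1 = gmul(gpow(zeta, 4 * j + 1), x[j])
        t2 = gmul(gpow(zeta, 3 * (4 * j + 1)), y[j])
        s = gadd(t1, t2)
        d = gmul_z(gsub(t1, t2))
        out[j] = gadd(w[j], s)
        out[j + quarter] = gadd(w[j + quarter], d)
        out[j + 2 * quarter] = gsub(w[j], s)
        out[j + 3 * quarter] = gsub(w[j + quarter], d)
    return out
```

Each butterfly spends two multiplications over ℤ_q[z]/(z²+1) on the twiddle factors, while the multiplication by z is a swap and a negation.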

FIG. 2 shows the data flow and the butterfly of an 8-point split-radix DGT according to some embodiments of the present invention. In FIG. 2, the dataflow and the proposed butterfly operators of the low-complexity split-radix DGT and IDGT are provided, in which TW_1 and TW_2 are twiddle factors that are described in Algorithm 1 and Algorithm 2, as shown in FIG. 3 and FIG. 4.

The details of the proposed split-radix DIT DGT are shown in Algorithm 1. The split-radix DGT butterfly in (Eq. 3A) is asymmetric, and different butterflies are processed at the boundary cases m=2 and m=4. It is recommended to decompose the asymmetric butterfly operations, as well as the butterfly operations in the boundary cases, into similar butterfly operators. In the proposed algorithm of the present invention, four butterfly operators shown in FIG. 2 are applied, namely DGT 1, DGT 0-1, DGT 0-2, and DGT 0-3. Additionally, the order in which the operators are applied can be pre-computed and stored in an integer SEQ (i.e., the pre-computed integer). In some embodiments, a method is proposed to generate the SEQ as well as the corresponding control logic to select the target operator, as shown in Algorithm 1. The helper function br_l(i) generates the bit reversal of an l-bit integer i ranging from 0 to (2^l−1). For example, br₄(1011_b)=1101_b. The helper function scramble_l(A) permutes the length-2^l polynomial A, moving the i-th term to index br_l(i).
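The two helper functions can be sketched as follows. The Python names `br` and `scramble` and the argument order are choices of this sketch; the generation of SEQ is defined in Algorithm 1 and is not reproduced here.

```python
def br(i, l):
    """Bit reversal of the l-bit integer i, with 0 <= i < 2**l."""
    out = 0
    for _ in range(l):
        out = (out << 1) | (i & 1)  # move the lowest bit of i to the top of out
        i >>= 1
    return out


def scramble(a, l):
    """Permute a length-2**l list, moving the i-th term to index br(i, l)."""
    assert len(a) == 1 << l
    out = [None] * len(a)
    for i, value in enumerate(a):
        out[br(i, l)] = value
    return out


if __name__ == "__main__":
    assert br(0b1011, 4) == 0b1101            # the example given above
    print(scramble(list(range(8)), 3))        # prints [0, 4, 2, 6, 1, 5, 3, 7]
```

Because bit reversal is its own inverse, applying `scramble` twice returns the original order.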

The Proposed Split Radix IDGT:

The low-complexity IDGT is derived using split radix and decimation-in-frequency (DIF). Given a length-m polynomial Â, one has Ā=IDGT_m(Â). Define

${\bar{W}}_{i} = {\bar{A}}_{2 i} \text{ for } 0 \leq i < \frac{m}{2}, \quad {\bar{X}}_{i} = {\bar{A}}_{4 i + 1} \text{ and } {\overline{Y}}_{i} = {\bar{A}}_{4 i + 3} \text{ for } 0 \leq i < \frac{m}{4} .$

The derivation of W̄_i starts by splitting the summation of the IDGT into two groups according to the index of Â as follows:

$\begin{matrix} {\bar{A}}_{i} = m^{- 1} ζ_{m}^{- i} \sum_{j = 0}^{\frac{m}{2} - 1} ({\hat{\bar{A}}}_{j} + {\hat{\bar{A}}}_{j + \frac{m}{2}} ζ_{m}^{- 4 i \frac{m}{2}}) ζ_{m}^{- 4 j i} . & (Eq . 4) \end{matrix}$

For

$0 \leq i < \frac{m}{2},$

(Eq. 4) is substituted into W̄_i=Ā_{2i}, and the scalability property is applied together with ζ_m^{4m}≡1 over ℤ_q[z]/(z²+1), as follows:

$\begin{matrix} \begin{matrix} {\bar{W}}_{i} = m^{- 1} ζ_{m}^{- 2 i} \sum_{j = 0}^{\frac{m}{2} - 1} (\hat{{\overline{A}}_{j}} + \hat{{\overline{A}}_{j + \frac{m}{2}}} ζ_{m}^{- 4 m i}) ζ_{m}^{2 \times (- 4 i j)} \\ = {(\frac{m}{2})}^{- 1} ζ_{\frac{m}{2}}^{- i} \sum_{j = 0}^{\frac{m}{2} - 1} (\frac{\hat{{\overline{A}}_{j}} + \hat{{\overline{A}}_{j + \frac{m}{2}}}}{2}) ζ_{\frac{m}{2}}^{- 4 i j} \\ = {(\frac{m}{2})}^{- 1} ζ_{\frac{m}{2}}^{- i} \sum_{j = 0}^{\frac{m}{2} - 1} \hat{\overline{W_{j}}} ζ_{\frac{m}{2}}^{- 4 i j}, \end{matrix} & (Eq . 5) \end{matrix}$

with the definition

${\hat{\overline{W}}}_{j} = \frac{\hat{{\overline{A}}_{j}} + \hat{{\overline{A}}_{j + \frac{m}{2}}}}{2} .$

It is found that (Eq. 5) is equivalent to the length-m/2 IDGT of Ŵ_j. Thus, the subproblem of length m/2 is constructed.

To construct the other two subproblems of length m/4, namely X̄_i and Ȳ_i for

$0 \leq i < \frac{m}{4},$

the derivation again starts from the definition of the IDGT with post-processing, but the summation is split into four groups according to the index of Â. For 0≤i<m:

$\begin{matrix} {\bar{A}}_{i} = m^{- 1} ζ_{m}^{- i} \sum_{j = 0}^{\frac{m}{4} - 1} (\hat{{\overline{A}}_{j}} + \hat{{\overline{A}}_{j + \frac{m}{4}}} ζ_{m}^{- 4 i \frac{m}{4}} + {\hat{\bar{A}}}_{j + \frac{m}{2}} ζ_{m}^{- 4 i \frac{m}{2}} + {\hat{\bar{A}}}_{j + \frac{3 m}{4}} ζ_{m}^{- 4 i \frac{3 m}{4}}) ζ_{m}^{- 4 j i} . & (Eq . 6) \end{matrix}$

For

$0 \leq i < \frac{m}{4},$

(Eq. 6) is substituted into X̄_i=Ā_{4i+1}, and the scalability property is applied together with ζ_m^{4m}≡1 over ℤ_q[z]/(z²+1), which gives:

$\begin{matrix} {\bar{X}}_{i} = {\bar{A}}_{4 i + 1} = {(\frac{m}{4})}^{- 1} ζ_{\frac{m}{4}}^{- i} \sum_{j = 0}^{\frac{m}{4} - 1} [\frac{1}{2} (\hat{{\overline{A}}_{j}} - \hat{{\overline{A}}_{j + \frac{m}{2}}}) + \frac{- z}{2} (\hat{{\overline{A}}_{j + \frac{m}{4}}} - \hat{{\overline{A}}_{j + \frac{3 m}{4}}})] \frac{ζ_{m}^{- (4 j + 1)}}{2} ζ_{\frac{m}{4}}^{- 4 ji} . & (Eq . 7) \end{matrix}$

Defining

$\begin{matrix} {\hat{\overline{X}}}_{j} = [\frac{1}{2} ({\hat{\overline{A}}}_{j} - {\hat{\overline{A}}}_{j + \frac{m}{2}}) + \frac{- z}{2} ({\hat{\overline{A}}}_{j + \frac{m}{4}} - {\hat{\overline{A}}}_{j + \frac{3 m}{4}})] \frac{ζ_{m}^{- (4 j + 1)}}{2}, & (Eq . 8) \end{matrix}$

one can simplify (Eq. 7) as:

$\begin{matrix} {\bar{X}}_{i} = {(\frac{m}{4})}^{- 1} ζ_{\frac{m}{4}}^{- i} \sum_{j = 0}^{\frac{m}{4} - 1} {\hat{\overline{X}}}_{j} ζ_{\frac{m}{4}}^{- 4 j i} . & (Eq . 9) \end{matrix}$

It is found that (Eq. 9) is equivalent to the length-m/4 IDGT of X̂. Thus, the subproblem of length m/4 is constructed. The subproblem Ȳ_i=Ā_{4i+3} for

$0 \leq i < \frac{m}{4}$

can be constructed similarly to (Eq. 7)-(Eq. 9). In summary, the split-radix DIF IDGT butterfly operations are defined as:

$\begin{matrix} {\hat{\overline{W}}}_{j} = \frac{1}{2} ({\hat{\overline{A}}}_{j} + {\hat{\overline{A}}}_{j + \frac{m}{2}}), & (Eq . 10) \end{matrix}$

${\hat{\overline{W}}}_{j + \frac{m}{4}} = \frac{1}{2} ({\hat{\overline{A}}}_{j + \frac{m}{4}} + {\hat{\overline{A}}}_{j + \frac{3 m}{4}}),$

$\hat{{\overline{X}}_{j}} = [\frac{1}{2} ({\hat{\overline{A}}}_{j} - {\hat{\overline{A}}}_{j + \frac{m}{2}}) + \frac{- z}{2} ({\hat{\overline{A}}}_{j + \frac{m}{4}} - {\hat{\overline{A}}}_{j + \frac{3 m}{4}})] \frac{ζ_{m}^{- (4 j + 1)}}{2},$

$\hat{{\overline{Y}}_{j}} = [\frac{1}{2} ({\hat{\overline{A}}}_{j} - {\hat{\overline{A}}}_{j + \frac{m}{2}}) - \frac{- z}{2} ({\hat{\overline{A}}}_{j + \frac{m}{4}} - {\hat{\overline{A}}}_{j + \frac{3 m}{4}})] \frac{ζ_{m}^{- 3 (4 j + 1)}}{2},$

It is noted the two boundary cases at m=2,4. When m=2, the IDGT problem Ā_ican be solved as:

$\begin{matrix} {\bar{A}}_{0} = \frac{1}{2} ({\hat{\overline{A}}}_{0} + {\hat{\overline{A}}}_{1}) & (Eq . 11) \end{matrix}$

${\bar{A}}_{1} = \frac{1}{2} ({\hat{\overline{A}}}_{0} + {\hat{\overline{A}}}_{1} ζ_{2}^{- 4}) ζ_{2}^{- 1} = \frac{1}{2} ({\hat{\overline{A}}}_{0} - {\hat{\overline{A}}}_{1}) ζ_{2}^{- 1} .$

And, when m=4, the IDGT problem is solved as:

$\begin{matrix} {\bar{A}}_{0} = \frac{1}{2} (\frac{{\hat{\overline{A}}}_{0} + {\hat{\overline{A}}}_{2}}{2} + \frac{{\hat{\overline{A}}}_{1} + {\hat{\overline{A}}}_{3}}{2}), & (Eq . 12) \end{matrix}$

${\bar{A}}_{1} = \frac{ζ_{4}^{- 1}}{2} (\frac{{\hat{\overline{A}}}_{0} - {\hat{\overline{A}}}_{2}}{2} - z \frac{{\hat{\overline{A}}}_{1} - {\hat{\overline{A}}}_{3}}{2}),$

${\bar{A}}_{2} = \frac{ζ_{4}^{- 2}}{2} (\frac{{\hat{\overline{A}}}_{0} + {\hat{\overline{A}}}_{2}}{2} - \frac{{\hat{\overline{A}}}_{1} + {\hat{\overline{A}}}_{3}}{2}),$

${\bar{A}}_{3} = \frac{ζ_{4}^{- 3}}{2} (\frac{{\hat{\overline{A}}}_{0} - {\hat{\overline{A}}}_{2}}{2} + z \frac{{\hat{\overline{A}}}_{1} - {\hat{\overline{A}}}_{3}}{2}) .$
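The DIF butterfly of (Eq. 10) and the m=2 boundary case of (Eq. 11) can be sketched recursively and checked by inverting a directly evaluated DGT. The toy modulus q=257, the names, and the recursive structure are assumptions of this sketch; Algorithm 2 uses an iterative, operator-based schedule instead. As a shortcut, the m=4 case is handled by letting the recursion reach length 1, where the inverse transform is the identity.

```python
Q = 257                        # toy modulus (assumption); needs 4*m | (Q - 1)
Z, ONE = (0, 1), (1, 0)        # elements are stored as (constant part, z part)
INV2 = (pow(2, Q - 2, Q), 0)   # the factor 1/2 as a ring element


def gadd(a, b):
    return ((a[0] + b[0]) % Q, (a[1] + b[1]) % Q)


def gsub(a, b):
    return ((a[0] - b[0]) % Q, (a[1] - b[1]) % Q)


def gmul(a, b):
    return ((a[0] * b[0] - a[1] * b[1]) % Q, (a[0] * b[1] + a[1] * b[0]) % Q)


def gmul_z(a):
    return ((-a[1]) % Q, a[0])


def gpow(a, e):
    result = ONE
    while e:
        if e & 1:
            result = gmul(result, a)
        a = gmul(a, a)
        e >>= 1
    return result


def find_zeta(m):
    """Exhaustive search for zeta with zeta^m = z."""
    return next((r, i) for r in range(Q) for i in range(Q) if gpow((r, i), m) == Z)


def naive_dgt(a, zeta):
    m = len(a)
    out = []
    for j in range(m):
        acc = (0, 0)
        for i in range(m):
            acc = gadd(acc, gmul(a[i], gpow(zeta, (4 * j + 1) * i)))
        out.append(acc)
    return out


def split_radix_idgt(a_hat, zeta):
    """Recursive split-radix DIF IDGT; zeta must satisfy zeta^len(a_hat) = z."""
    m = len(a_hat)
    if m == 1:
        return list(a_hat)
    zeta_inv = gpow(zeta, 4 * m - 1)            # zeta^(4m) = 1
    if m == 2:                                  # (Eq. 11)
        s = gmul(INV2, gadd(a_hat[0], a_hat[1]))
        d = gmul(INV2, gsub(a_hat[0], a_hat[1]))
        return [s, gmul(d, zeta_inv)]
    half, quarter = m // 2, m // 4
    w_hat = [gmul(INV2, gadd(a_hat[j], a_hat[j + half])) for j in range(half)]
    x_hat, y_hat = [], []
    for j in range(quarter):                    # (Eq. 10)
        u = gmul(INV2, gsub(a_hat[j], a_hat[j + half]))
        v = gmul_z(gmul(INV2, gsub(a_hat[j + quarter], a_hat[j + 3 * quarter])))
        tw1 = gmul(INV2, gpow(zeta_inv, 4 * j + 1))
        tw3 = gmul(INV2, gpow(zeta_inv, 3 * (4 * j + 1)))
        x_hat.append(gmul(gsub(u, v), tw1))     # [u + (-z/2)(...)] * zeta^-(4j+1) / 2
        y_hat.append(gmul(gadd(u, v), tw3))     # [u - (-z/2)(...)] * zeta^-3(4j+1) / 2
    zeta2 = gmul(zeta, zeta)
    zeta4 = gmul(zeta2, zeta2)
    out = [None] * m
    out[0::2] = split_radix_idgt(w_hat, zeta2)  # W_i = A_{2i}
    out[1::4] = split_radix_idgt(x_hat, zeta4)  # X_i = A_{4i+1}
    out[3::4] = split_radix_idgt(y_hat, zeta4)  # Y_i = A_{4i+3}
    return out
```

The factors of 1/2 are spread over the stages here exactly as in (Eq. 10), so no separate scaling by m⁻¹ is applied at the end.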

FIG. 2 also shows the data flow and the butterfly of an 8-point IDGT. The details of the split-radix DIF IDGT are shown in Algorithm 2. The helper functions scramble_l and br_l are defined as above. Similar to the DGT, the butterfly operations, as well as the boundary cases m=2 and m=4, are decomposed into multiple computing operators. In the proposed IDGT algorithm, three butterfly operators shown in FIG. 2 are applied, namely DIF 1, DIF 0-1, and DIF 0-2. It is observed that the proposed DGT and IDGT can share the pre-computed integer (SEQ), so that the memory overhead for operator selection is reduced.

Complexity analysis on the split radix DGT/IDGT is provided herein.

To analyze the computation cost of split-radix DIT DGT, one can set up the recurrent equations based on the asymmetric split-radix butterflies and the two boundary cases. The number of modular multiplication and modular addition in a length-m DGT is defined as M(m) and A(m), respectively. Given the size of each sub-problems in (Eq. 3A) is m/4, one can find that m/2 additions over custom-character _q[z]/z²+1 and m/2 multiplications over _q[z]/z²+1 are needed in the first stage of split-radix DIT DGT butterfly computation. The second stage of the DGT butterfly computation involved m additions over _q[z]/z²+1 but no multiplication. In summary, 3m/2 additions over _q[z]/z²+1 and m/2 multiplications over custom-character _q[z]/z²+1 are required to compute the length-m/4 sub-problem of split-radix DIT DGT. Recall that each addition over _q[z]/z²+1 is separated into 2 modular additions, and each multiplication over _q[z]/z²+1 involves 5 modular additions and 3 modular multiplications when using Karatsuba algorithm. Accordingly, 11m/2 modular additions and 3m/2 modular multiplications in GF(q) are required for the length-m/4 sub-problem of split-radix DIT DGT. Recall that when m=2, the DGT problem consists of 2 additions over custom-character _q[z]/z²+1 and 1 multiplication over _q[z]/z²+1 as shown in (Eq. 3B). Accordingly, 9 modular additions and 3 modular multiplications in GF(q) are required when m=2. Similarly, 31 modular additions and 9 modular multiplications are required when m=4.

Similar to the split-radix DIT DGT, the recurrence equations based on the asymmetric split-radix DIF IDGT butterfly and the two boundary cases (i.e., (Eq. 10), (Eq. 11), and (Eq. 12)) can be set up to analyze the computation cost. Observing that the split-radix DIF IDGT butterfly and the two boundary cases require the same number of multiplications and additions over GF(q)[z]/(z²+1) as in the DGT, the cost of the split-radix DIT DGT and the split-radix DIF IDGT can be represented in terms of modular multiplications M(m) and modular additions A(m) by the following recurrences:

$M(m) = \begin{cases} M\left(\frac{m}{2}\right) + 2M\left(\frac{m}{4}\right) + \frac{3m}{2}, & \text{if } m > 4, \\ 9, & \text{if } m = 4, \\ 3, & \text{if } m = 2. \end{cases}$

$A(m) = \begin{cases} A\left(\frac{m}{2}\right) + 2A\left(\frac{m}{4}\right) + \frac{11m}{2}, & \text{if } m > 4, \\ 31, & \text{if } m = 4, \\ 9, & \text{if } m = 2, \end{cases}$

such that,

$M(m) = m \log_{2} m + \frac{m}{3} - \frac{(-1)^{\log_{2} m}}{3},$

$A(m) = \frac{11m}{3} \log_{2} m + \frac{5m}{9} - \frac{5(-1)^{\log_{2} m}}{9}.$
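The recurrences and closed forms for M(m) and A(m) can be cross-checked numerically. The following Python sketch is illustrative only; the function names are hypothetical and are not part of the disclosure. It evaluates both forms for power-of-two sizes so that they can be compared.

```python
def mult_recurrence(m: int) -> int:
    """Modular multiplications M(m), evaluated from the recurrence."""
    if m == 2:
        return 3
    if m == 4:
        return 9
    return mult_recurrence(m // 2) + 2 * mult_recurrence(m // 4) + 3 * m // 2


def add_recurrence(m: int) -> int:
    """Modular additions A(m), evaluated from the recurrence."""
    if m == 2:
        return 9
    if m == 4:
        return 31
    return add_recurrence(m // 2) + 2 * add_recurrence(m // 4) + 11 * m // 2


def mult_closed_form(m: int) -> int:
    """M(m) = m*log2(m) + m/3 - (-1)^log2(m)/3, kept in exact integers."""
    k = m.bit_length() - 1  # log2(m) for a power of two
    return (3 * m * k + m - (-1) ** k) // 3


def add_closed_form(m: int) -> int:
    """A(m) = (11m/3)*log2(m) + 5m/9 - 5*(-1)^log2(m)/9, in exact integers."""
    k = m.bit_length() - 1
    return (33 * m * k + 5 * m - 5 * (-1) ** k) // 9


if __name__ == "__main__":
    for exponent in range(1, 11):
        m = 1 << exponent
        print(m, mult_recurrence(m), mult_closed_form(m),
              add_recurrence(m), add_closed_form(m))
```

For the length-64 transform used with KyberKEM, both forms give 405 modular multiplications and 1443 modular additions.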

Having the above analysis, FIG. 5 shows Table III, which compares the number of modular operations according to one aspect of the present disclosure. Table III compares the modular multiplications and modular additions of the low-complexity NTT/INTT, the classic DGT/IDGT, and the split-radix DGT/IDGT of the present invention for given problem sizes n. In terms of modular multiplication, the split-radix DGT/IDGT of the present invention has the smallest number of modular multiplications among the three algorithms. Compared with the classic DGT and IDGT, the split-radix DGT and IDGT reduce the modular multiplications by 47.3% and 57.8%, respectively, when the polynomial size n=128 (i.e., a DGT/IDGT size of $m = \frac{n}{2} = 64$). The split-radix DGT and IDGT can also save 9.6% of modular multiplications compared to the low-complexity NTT and INTT. Similarly, the split-radix DGT/IDGT needs one fewer stage than the low-complexity NTT/INTT. The reason is that a length-n NTT/INTT is equivalent to a length-$\frac{n}{2}$ DGT/IDGT (which means the transform size is halved in DGT/IDGT compared to NTT/INTT). Additionally, as DIT is applied in DGT and DIF is used in IDGT, no bit-reordering on the coefficients is required.

The split-radix DGT and IDGT can be applied to solve the polynomial multiplication over GF(q)[x]/(xⁿ+1) when 2n|(q−1) and n is a power of 2, and they are a more efficient variant compared with the classic DGT/IDGT. The split-radix DGT and IDGT also provide a shorter transform length and need one fewer stage compared with the other NTT/INTT algorithms. Thus, the split-radix DGT/IDGT of the present invention is competitive in the design of high-performance NWC architecture.

In the present invention, an architecture design is provided as well, which refers to an apparatus of cryptosystem with utilizing split-radix DGT/IDGT.

As aforementioned, CRYSTALS-KyberKEM adopts the parameter set (n,q) = (256,3329), which allows the length-256 NTT to be divided into two length-128 NTTs of the odd-index terms and the even-index terms, respectively. When DGT is used to replace the length-128 NTT in CRYSTALS-KyberKEM, the pack operation (e.g., as shown in (Eq. 0)) is required to pack the odd-index terms and the even-index terms from GF(q) into GF(q)[z]/(z²+1). Therefore, the DGT in CRYSTALS-KyberKEM consists of two length-64 DGTs for the odd-index terms and the even-index terms (i.e., which are noted as odd polynomial and even polynomial in this disclosure, respectively). Additionally, the available twiddle factor ζ_m in KyberKEM can be set as ζ₆₄=1+737z.

FIG. 6 depicts the architecture of a cryptosystem processor 100 for operating the split-radix DGT/IDGT according to some embodiments of the present invention. In order to cut down the hardware overhead of implementing the NWC via split-radix DGT/IDGT, the operations of the split-radix DGT, the split-radix IDGT, and the component-wise multiplication are integrated into a unified Split Radix DGT/IDGT (SRDGT) module.

As shown in FIG. 6, a cryptosystem processor 100 can be referred to as a unified SRDGT module including a SRDGT butterfly unit (SRDGT BFU) 112, a twiddle factor memory (ZETA_ROM) 114, a stream permutation network (SPN) 116, and a control unit 118 electrically communicating with the SRDGT BFU 112, the ZETA_ROM 114, and the SPN 116. In some embodiments, extensions to k parallel SRDGT BFUs 112 can be arranged, where k is noted as scalability coefficient. The index of each of the SRDGT BFUs 112 determines the route, ranging from 0 to k−1. The SPN 116 has Mem 0, Mem 1 and Mem 2 which are instantiated by dual-port random access memory (RAM). In some embodiments, each of the Mem 0, Mem 1 and Mem 2 can serve as a true dual-port random access memory. The ZETA_ROM 114 is instantiated by dual-port read only memory (ROM). In some embodiments, the ZETA_ROM 114 has a first ZETA port and a second ZETA port electrically communicating with the SRDGT BFU 112.

The unified SRDGT BFU 112 is designed to compute the DGT and IDGT iteratively. FIG. 7A and FIG. 7B illustrate detailed block diagrams of the SRDGT BFU 112 according to some embodiments of the present invention. FIG. 7B further illustrates the details of the active data path and the operators for each mode. Output ports out_A, out_B, out_C, and out_D are illustrated separately, with input ports in_a, in_b, in_c, in_d, in_e, and in_f. In some embodiments, each SRDGT BFU 112 includes six input ports and four output ports. These exact numbers of input ports and output ports are chosen to increase hardware efficiency. The control signal “sel” for the different operations can be found in Table IV, as shown in FIG. 7C. A pipelined architecture in FIG. 7A is designed to increase the throughput of the SRDGT BFU 112. When the pipeline is full, the SRDGT BFU 112 of the present invention can read and write two data points per cycle when working in the DGT/IDGT mode. When the SRDGT BFU 112 is switched to compute in a component-wise multiplication (CWM) mode, the SRDGT BFU 112 of the present invention can support reading and writing of four data points simultaneously.

The SRDGT BFU 112 of the present invention is designed to support nine working modes to implement the SRDGT butterfly as shown in FIG. 2 in a compact way. The 6-bit control signal “sel” and its corresponding modes are shown in FIG. 7A and Table IV of FIG. 7C.

Among the nine working modes of the SRDGT BFU 112, four are for the iterative DGT (DGT 0-1, DGT 0-2, DGT 0-3, and DGT 1 as shown in FIG. 2), three are for the iterative IDGT (IDGT 0-1, IDGT 0-2, and IDGT 1 as shown in FIG. 2), and two are for CWM (CWM 0, and CWM 1). The modes for iterative DGT and IDGT need to be switched during the computation, as described in Algorithms 1 and 2. The SEQ can also be applied to controlling the mode switch. Since the computation of CRYSTALS-KyberKEM only involves length-64 DGT/IDGT, the SEQ can be a 32-bit constant integer 0X0000FF0D.

The CWM is defined as

$(x \hat{\overline{r}}_{2i+1} + \hat{\overline{r}}_{2i}) \bmod (x^{2} - \zeta_{64}^{4\,\mathrm{br}(i)+1}) \equiv (x \hat{\overline{a}}_{2i+1} + \hat{\overline{a}}_{2i}) (x \hat{\overline{b}}_{2i+1} + \hat{\overline{b}}_{2i}).$

By using the Karatsuba-based CWM approach, the number of multiplications over GF(q)[z]/(z²+1) can be reduced, and the result is obtained via:

$\begin{aligned} \hat{\overline{r}}_{2i} &= \hat{\overline{a}}_{2i} \hat{\overline{b}}_{2i} + \hat{\overline{a}}_{2i+1} \hat{\overline{b}}_{2i+1} \cdot \zeta_{64}^{4\,\mathrm{br}(i)+1}, \\ \hat{\overline{r}}_{2i+1} &= (\hat{\overline{a}}_{2i} + \hat{\overline{a}}_{2i+1}) (\hat{\overline{b}}_{2i} + \hat{\overline{b}}_{2i+1}) - (\hat{\overline{a}}_{2i} \hat{\overline{b}}_{2i} + \hat{\overline{a}}_{2i+1} \hat{\overline{b}}_{2i+1}). \end{aligned} \qquad \text{(Eq. 13)}$

The Karatsuba-based CWM approach can be computed using two working modes of the SRDGT BFU 112 of the present invention, namely CWM 0 and CWM 1. The computation of (Eq. 13) is mapped to the data flow of the BFU as:

$\begin{aligned} \text{CWM 0}: \quad & s_{0} = \hat{\overline{a}}_{2i} + \hat{\overline{a}}_{2i+1}, \quad s_{1} = \hat{\overline{b}}_{2i} + \hat{\overline{b}}_{2i+1}, \\ & m_{0} = \hat{\overline{a}}_{2i} \hat{\overline{b}}_{2i}, \quad m_{1} = \hat{\overline{a}}_{2i+1} \hat{\overline{b}}_{2i+1}, \\ \text{CWM 1}: \quad & \hat{\overline{r}}_{2i} = m_{0} + m_{1} \cdot \zeta_{64}^{4\,\mathrm{br}(i)+1}, \\ & \hat{\overline{r}}_{2i+1} = s_{0} \cdot s_{1} - (m_{0} + m_{1}). \end{aligned} \qquad \text{(Eq. 14)}$
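The two-pass mapping of (Eq. 13) onto the CWM 0 and CWM 1 modes can be illustrated in software. The following Python sketch is an assumption-laden model, not the hardware data path: elements of GF(q)[z]/(z²+1) are represented as hypothetical tuples (z-coefficient, constant), the helper names are invented for illustration, and the twiddle value passed in is an arbitrary ring element rather than a specific power of ζ₆₄. It compares the two-pass result with a direct product modulo (x² − ζ).

```python
import random

Q = 3329  # KyberKEM modulus


def g_add(s, t):
    """Addition in GF(q)[z]/(z^2+1); elements are (z_coeff, const)."""
    return ((s[0] + t[0]) % Q, (s[1] + t[1]) % Q)


def g_sub(s, t):
    """Subtraction in GF(q)[z]/(z^2+1)."""
    return ((s[0] - t[0]) % Q, (s[1] - t[1]) % Q)


def g_mul(s, t):
    """Schoolbook product (a z + b)(c z + d) with z^2 = -1."""
    a, b = s
    c, d = t
    return ((a * d + b * c) % Q, (b * d - a * c) % Q)


def cwm_two_pass(a0, a1, b0, b1, zeta):
    """Karatsuba-based CWM split into the two passes of (Eq. 14)."""
    # Pass "CWM 0": two sums and two products.
    s0, s1 = g_add(a0, a1), g_add(b0, b1)
    m0, m1 = g_mul(a0, b0), g_mul(a1, b1)
    # Pass "CWM 1": combine into the two output coefficients.
    r0 = g_add(m0, g_mul(m1, zeta))
    r1 = g_sub(g_mul(s0, s1), g_add(m0, m1))
    return r0, r1


def cwm_reference(a0, a1, b0, b1, zeta):
    """Direct (a1 x + a0)(b1 x + b0) mod (x^2 - zeta) for comparison."""
    r0 = g_add(g_mul(a0, b0), g_mul(g_mul(a1, b1), zeta))
    r1 = g_add(g_mul(a0, b1), g_mul(a1, b0))
    return r0, r1


def random_element(rng):
    """Uniformly random ring element, used only for testing."""
    return (rng.randrange(Q), rng.randrange(Q))


if __name__ == "__main__":
    rng = random.Random(0)
    args = [random_element(rng) for _ in range(5)]
    print(cwm_two_pass(*args) == cwm_reference(*args))
```

In this model the two-pass form uses four ring multiplications (m0, m1, s0·s1, and m1·ζ), whereas the direct form uses five.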

The detailed dataflow and working mechanism of the SRDGT BFU 112 are stated in FIG. 6 and Table IV as well.

The multiplier over GF(q)[z]/(z²+1) is provided herein.

Notice that each multiplication in (Eq. 13) is a multiplication over GF(q)[z]/(z²+1). Therefore, a multiplier over GF(q)[z]/(z²+1) is required to compute this operation.

The DSP48E1 slice in Xilinx FPGA consists of one multiplier and two adders. Since all the operators in the DSP48E1 are programmable, one can design a high-throughput multiplier over GF(q)[z]/(z²+1) by fully utilizing these high-performance hardware resources. In the present disclosure, the Karatsuba algorithm is adopted; given $\bar{s}=az+b$ and $\bar{t}=cz+d$, the multiplication over GF(q)[z]/(z²+1) is rearranged as:

$\bar{s} \circ \bar{t} = [(a(d - c) + c(a + b)) \bmod q]\, z + [(b(c + d) - c(a + b)) \bmod q]. \qquad \text{(Eq. 15)}$

As can be seen from (Eq. 15), there are three multiplications, four additions, two subtractions, and two modular reductions in each multiplication over GF(q)[z]/(z²+1). In the present invention, the whole computation in (Eq. 15) is mapped onto three DSP48E1 slices, as shown in FIG. 8, which illustrates an exemplary architecture of the multiplier over GF(q)[z]/(z²+1) with three DSP48E1 slices, so as to reduce the wiring delay between the logic and the DSP core. Regarding the illustration of FIG. 8, the rearranged multiplication over GF(q)[z]/(z²+1) is described in (Eq. 15). The architecture of the GF(q)[z]/(z²+1) multiplier achieves a working frequency of 299 MHz on the Xilinx Artix-7 platform.
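The rearrangement in (Eq. 15) can be checked against the schoolbook product in software. The following Python sketch is a behavioral model only; the function names and the tuple layout (z-coefficient, constant) are assumptions made for illustration and do not describe the DSP48E1 mapping of FIG. 8. It shows that the product c(a+b) is shared between the two output coefficients, so that three integer multiplications suffice.

```python
Q = 3329  # KyberKEM modulus


def gmul_eq15(s, t):
    """Multiply s = a*z + b by t = c*z + d following the form of (Eq. 15)."""
    a, b = s
    c, d = t
    shared = c * (a + b)                  # product reused by both outputs
    z_coeff = (a * (d - c) + shared) % Q  # equals (a*d + b*c) mod q
    const = (b * (c + d) - shared) % Q    # equals (b*d - a*c) mod q
    return (z_coeff, const)


def gmul_schoolbook(s, t):
    """Reference product using z^2 = -1 and four multiplications."""
    a, b = s
    c, d = t
    return ((a * d + b * c) % Q, (b * d - a * c) % Q)


if __name__ == "__main__":
    # (z + 2)(3z + 4) = 3z^2 + 10z + 8 = 10z + 5 because z^2 = -1.
    print(gmul_eq15((1, 2), (3, 4)))
    print(gmul_schoolbook((1, 2), (3, 4)))
```

Both functions return (10, 5) for the worked example, i.e., 10z + 5.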

Stream Permutation Network and Fully Pipelined Scheduling are provided herein.

In the present invention, the stream permutation network (SPN) and the data scheduling plan are designed to support two main goals for single SRDGT BFU (i.e., the SRDGT BFU 112 as afore described): (1) SPN can satisfy the bandwidth requirement of the SRDGT BFU; and (2) the schedule of SPN can ensure a fully pipelined working mode of DGT/IDGT.

The above goals can be achieved based on at least three features observable from FIG. 6, FIG. 7A, and FIG. 7B with the following statements:

- 1. The SRDGT BFU (i.e., the SRDGT BFU 112) can provide 4 active input ports in “DGT” and “IDGT” modes, and provide 6 active input ports in “CWM 0” mode, and provide 5 active input ports in “CWM 1” mode. In some embodiments, each SRDGT BFU has the minimum number of the input ports to support “DGT” and “IDGT” modes and “CWM 0” and “CWM 1” modes. For example, the SRDGT BFU has six input ports.
- 2. The 2 input data points from the ZETA_ROM 114 (i.e., the twiddle factors TW₁ and TW₂) use a specific datapath and do not depend on the SPN 116. That is, the twiddle factors TW₁ and TW₂ are transmitted via a datapath independent of the SPN 116. In some embodiments, the ZETA_ROM 114 electrically communicates with the SRDGT BFU 112 via two paths. One of the paths is from an output port of the ZETA_ROM 114 to an input port “in_e” of the SRDGT BFU 112 so as to transmit the twiddle factor TW₁ to the input port “in_e” of the SRDGT BFU 112. Another one of the paths is from an output port of the ZETA_ROM 114 to an input port “in_f” of the SRDGT BFU 112 so as to transmit the twiddle factor TW₂ to the input port “in_f” of the SRDGT BFU 112.
- 3. The pair of input ports that have the same data input (i.e., the port pairs (in_a, in_f) and (in_d, in_c) of the SRDGT BFU 112 in the “CWM 0” mode) can share one reading operation from the SPN 116.

Based on the above features, in some embodiments, the SPN 116 can support the required/desired numbers of data points reading/writing per cycle in the DGT/IDGT/CWM mode. For example, the SPN 116 can support 2/2/4 data points reading per cycle in the DGT/IDGT/CWM mode, respectively. Similarly, the SPN 116 can also support 2/2/4 data points writing per cycle in the DGT/IDGT/CWM mode, respectively.

In order to satisfy the SPN data width requirement, three true dual-port block RAMs (BRAMs), namely MEM 0, MEM 1, and MEM 2 as shown in FIG. 6, are arranged to work in parallel. In some embodiments, the true dual-port BRAMs MEM 0, MEM 1, and MEM 2 serve as memory caches configured to store polynomials. Such a configuration enables a maximum of 6 data points to be read or written simultaneously.

When the SRDGT BFU 112 works in DGT/IDGT mode, the two read ports of one BRAM and the two write ports of another BRAM are enabled. FIG. 9 depicts exemplary scheduling of memory operations for the SRDGT BFU 112 in KyberKEM. In some embodiments, two sources of the input polynomial are acceptable: the DGT_in stream, or a polynomial pre-stored in Mem 0 and Mem 1. When the input polynomial comes from DGT_in, the operation boxes “In stream*” are enabled and the “Read*” boxes are disabled. When the input polynomial is stored in Mem 0 and Mem 1, the operation boxes “Read*” are enabled and the “In stream*” boxes are disabled. Detailed scheduling of memory operations in the first two stages is also shown in FIG. 9. In the illustration of FIG. 9, the white boxes represent Read operations, and the black boxes represent Write operations. The address of data is presented inside the boxes if applicable.

As shown in FIG. 9, this design enables 2 data points to be read and 2 data points to be written simultaneously. It is noted that the coefficients of the intermediate polynomial are stored in the memory in order, and each coefficient occupies one address in the BRAM. Such a configuration allows the cryptosystem processor 100 to use the memory address generating method stated in Algorithms 1 and 2 straightforwardly.

In some embodiments, when the SRDGT BFU 112 works in CWM mode, there are 4 data points received by the SRDGT BFU 112 and 4 data points output from the SRDGT BFU 112 in each cycle. FIG. 10 depicts exemplary scheduling of memory operations for the CWM mode according to some embodiments of the present invention. The data points in (Eq. 14) are noted on the BRAM ports in the working modes “CWM 0” and “CWM 1”. The read ports of “MEM 0 port b” and “MEM 1 port a” are enabled, and the write ports of “MEM 0 port a”, “MEM 1 port b”, “MEM 2 port a”, and “MEM 2 port b” are enabled. As the data input port “CWM_in” also supports 2 data points input, the bandwidth requirement of the SPN is fulfilled.

The main challenge of implementing a fully pipelined iterative DGT/IDGT lies in the data dependency between adjacent transform stages. The fully pipelined scheduling plan of the present invention is specific for the DGT/IDGT in KyberKEM, consisting of two length-64 DGTs. The two length-64 DGTs are interspersed and processed alternately to eliminate the data dependency between adjacent transform stages. FIG. 9 provides a detailed example of memory scheduling for DGT in KyberKEM in the first two stages. This fully pipelined scheduling plan is also extended to compute IDGTs in KyberKEM.

The cycle counts of the fully pipelined SRDGT BFU (i.e., the SRDGT BFU 112) and the state-of-the-art LC NTT are then analyzed and compared. The SRDGT BFU requires 2×64/2×log₂64=384 cycles for the length-64 DGTs of the odd and even polynomials, and no pipeline bubble exists. To calculate the same length-128 NTT, the LC NTT requires 128/2×log₂128=448 cycles for the odd and even polynomials, with an additional 64 cycles of pipeline bubbles to write the results back to BRAMs. The above comparison demonstrates the advantages of the halved transform length of the DGT, the data scheduling plan, and the fully pipelined architecture of the present invention.
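The cycle-count arithmetic above can be reproduced with a short Python sketch. The helper name and its parameters are hypothetical; the model simply assumes that an iterative transform processes a fixed number of data points per cycle in each of its log₂(length) stages, which is the form of the expressions given in the text.

```python
def iterative_transform_cycles(length, transforms=1, points_per_cycle=2):
    """Cycles = transforms * (length / points_per_cycle) * log2(length)."""
    stages = length.bit_length() - 1  # log2(length) for a power of two
    return transforms * (length // points_per_cycle) * stages


# Two interleaved length-64 DGTs (odd and even polynomials), no bubbles.
srdgt_cycles = iterative_transform_cycles(64, transforms=2)

# LC NTT expression from the text, plus the 64 write-back bubble cycles.
lc_ntt_cycles = iterative_transform_cycles(128)
lc_ntt_total = lc_ntt_cycles + 64

if __name__ == "__main__":
    print(srdgt_cycles, lc_ntt_cycles, lc_ntt_total)
```

Under these assumptions the sketch gives 384 cycles for the SRDGT BFU and 448 + 64 = 512 cycles for the LC NTT.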

The extensions to multiple BFUs are provided herein. In KyberKEM, a higher security level requires more DGT computation tasks. In some embodiments, in order to support multiple tasks simultaneously, the extension to multiple SRDGT BFUs 112 is available, as shown in FIG. 6. In some embodiments, each port of the SRDGT BFU 112 or the SPN 116 can receive or send a 24-bit wide data stream. If the scalability coefficient is noted as k, each memory block in the SPN 116 will be expanded to k×24 bits wide, corresponding to k SRDGT BFUs 112 operating simultaneously for k independent DGT/IDGT/CWM tasks. Meanwhile, since the bit width of the DGT data points is a multiple of 8, the extended Split Radix DGT architecture of the cryptosystem processor 100 can use the byte write function of the Xilinx BRAM instances to specify a storage location for the input data. Thus, the extended Split Radix DGT architecture of the cryptosystem processor 100 accepts a single polynomial or k polynomials that need to be operated simultaneously as inputs, improving the flexibility of the schedule when applied in the upper-level modules.
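The k×24-bit memory word and the byte-write access described above can be modeled in software. The following Python sketch is a simplified illustration under stated assumptions: lane i is assumed to occupy bits 24i to 24i+23 of the word, and the byte-enable vector is assumed to carry one bit per byte. The function names and this layout are hypothetical and are not taken from the disclosure or from the Xilinx BRAM documentation.

```python
LANE_BITS = 24                # width of one DGT data point stream
LANE_BYTES = LANE_BITS // 8   # three bytes per lane
LANE_MASK = (1 << LANE_BITS) - 1


def byte_enable(lane: int, k: int) -> int:
    """Byte-enable vector (one bit per byte) selecting only `lane`."""
    assert 0 <= lane < k
    return 0b111 << (lane * LANE_BYTES)


def write_lane(word: int, lane: int, value: int, k: int) -> int:
    """Emulate a byte-write: update one 24-bit lane, keep the others."""
    assert 0 <= lane < k and 0 <= value <= LANE_MASK
    shift = lane * LANE_BITS
    return (word & ~(LANE_MASK << shift)) | (value << shift)


def read_lane(word: int, lane: int) -> int:
    """Extract the 24-bit data point stored in `lane`."""
    return (word >> (lane * LANE_BITS)) & LANE_MASK


if __name__ == "__main__":
    k = 4                     # hypothetical scalability coefficient
    word = 0
    for lane in range(k):
        word = write_lane(word, lane, 0x100 + lane, k)
    print([hex(read_lane(word, lane)) for lane in range(k)])
```

In this model a single polynomial can be written into one lane without disturbing the other k−1 lanes, which mirrors the flexibility described for single-polynomial and k-polynomial inputs.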

More features regarding hardware architecture of KyberKEM are provided herein.

In some embodiments, KyberKEM involves key generation, encapsulation, and decapsulation. In the present invention, hardware architecture is provided as shown in FIG. 11 which depicts the architecture of a KyberKEM-Split Radix DGT apparatus 200 according to some embodiments of the present invention.

Referring to FIG. 11, the KyberKEM-Split Radix DGT apparatus 200 is configured to support the KyberKEM algorithms and includes a Keccak module 210, a centered binomial distribution (CBD) module 212, a reject sampling module 214, a decode module 216, an encode module 218, a compress module 220, a decompress module 222, a RAM module 224, and an accumulator (ACC) module 226. The KyberKEM-Split Radix DGT apparatus 200 further includes a split radix DGT module 230, which is identical with or similar to the cryptosystem processor 100 as afore-described. The components and modules of the KyberKEM-Split Radix DGT apparatus 200 can electrically communicate with each other as shown by the arrows in FIG. 11. For example, the RAM module 224 electrically communicates with the split radix DGT module 230. Herein, electrical communication includes indirect coupling; for example, the split radix DGT module 230 indirectly coupled with the encode module 218 via the ACC module 226 can be stated as electrically communicating with the encode module 218.

The CBD module 212 and the reject sampling module 214 can be configured to perform sampling in the functions CBD_η and Parse, respectively. The compress and decompress modules 220, 222 are responsible for the compression and decompression of ciphertext, respectively. The decode module 216 is configured to transfer the data format from the byte array to the coefficients of a polynomial, and the encode module 218 transfers the coefficients of a polynomial back to the byte array. The encode and decode modules 218 and 216 are modified from the open-source code. The Keccak module 210 is configured to compute the functions of SHAKE128, SHAKE256, SHA3-256, and SHA3-512. The functionality of the Keccak module 210 is expanded from the open-source code, and it takes 24 clock cycles to execute the 24 rounds in the function KECCAK-f.

Bandwidth matching is carried through the architecture to increase the area-time efficiency. In addition, the entire structure is divided into three parts, with different data bit widths for different parts. The advantage of setting bandwidth matching in different parts is that the overall hardware latency and the consumed resources can be traded off based on the security level. The data bandwidth is 64 bits, 48 bits, and 48×k bits in the I/O part, the sample/serialization part, and the DGT part, respectively, where k is the security level parameter of KyberKEM and equals the scalability parameter in the split radix DGT module 230 (i.e., the cryptosystem processor 100) as afore defined. The I/O part includes the input and output FIFOs, working as the input/output buffer of the architecture. In the sample/serialization part, the byte array from the input FIFO can be sent to the Keccak module 210 for sampling and to the decode module 216 to be de-serialized into 48-bit width. The compress module 220 is able to accept the 48-bit-width data from the encode module 218 and serialize it to 64-bit width data for the output FIFO. In some embodiments, the KyberKEM-Split Radix DGT apparatus 200 may further include a data register 232, which electrically communicates with the compress module 220 and is configured to store the 64-bit width data from the compress module 220. The RAM module 224 is configured to store the sampling polynomials from the CBD module 212 and the reject sampling module 214 and the decompressed polynomials from the decompress module 222 (i.e., the decompress module 222 can be configured to decompress polynomials). The byte write function of the Xilinx BRAM instance can be used in the RAM module 224 to facilitate the flexibility of the write bandwidth.
In some embodiments, when k polynomials for DGT/IDGT/CWM are ready, the SRDGT module with the split radix DGT module 230 (i.e., the cryptosystem processor 100) will load these k polynomials and process them simultaneously. In some embodiments, the KyberKEM-Split Radix DGT apparatus 200 may further include control units, the input and output FIFOs.

In the present invention, the just-in-time strategy is applied to minimize the memory footprint. The just-in-time strategy means that the sampling polynomials are generated based on the requirement of the succeeding computation. For example, the strategy is applied to the data generated by the reject sampling module 214. The reject sampling module 214 samples the output from the Keccak module 210 under the uniform distribution. The output of the reject sampling module 214 is stored in the RAM module 224, including Â in key generation and Â^T in encryption, and is passed to the SRDGT module with the split radix DGT module 230 (i.e., the cryptosystem processor 100) once k polynomials are ready. Each of these polynomials is used only once in these cases. Thus, the memory space can be overwritten by the following k polynomials based on the just-in-time strategy, and the memory space reserved can be reduced from k² polynomials to k polynomials.

The implementation results and comparisons are provided herein.

The hardware design of KyberKEM-SRDGT of the present invention has been synthesized and implemented using Vivado 2019.2 design suite on Xilinx XC7A200 (Artix-7) FPGA device, with all the building blocks implemented in hardware.

Regarding split-radix DGT module results and comparisons, the hardware resource utilization and the latency specification of the SRDGT module are shown in Table V in FIG. 12. The detailed cycle counts of NTT (DGT), INTT(IDGT), and component-wise multiplication (CWM) are also compared with state-of-the-art implementations. The k=1 case of the present invention is only enclosed for a fair comparison since the design of the present invention with larger k is designed to process k independent polynomials simultaneously. The measurement of the efficiency of the hardware implementations is based on the area-time product (ATP), which is computed by the product of LUT, BRAM, and DSP resources and the computing time. The ATP for NTT and CWM for a detailed comparison are analyzed. Notably, the comparison of the total latency and ATP is not included for the polynomial multiplication in Table V, since it is noted that the KyberKEM does not use the complete polynomial multiplication (including 2 NTTs, 1 CWM, and 1 INTT) during the key generation, encapsulation and decapsulation. For simplicity, the ATP ratios are provided instead of the original ATP indices.

Due to the careful placement of registers and the usage of high-speed DSP48E1 slices in the Artix-7 FPGA, the SRDGT module is able to operate at a frequency of 239 MHz. Another merit of the SRDGT algorithm and the architecture is the relatively small cycle count. Specifically, the DGT, IDGT and CWM computations require 384, 384, and 132 cycles, respectively, for length-256 polynomial multiplication. The latencies of DGT, IDGT and CWM are 1.6 μs, 1.6 μs, and 0.55 μs, respectively.
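The stated latencies follow from dividing the cycle counts by the operating frequency. The following Python sketch is a simple arithmetic illustration; the helper name is hypothetical, and the only inputs are the cycle counts and the 239 MHz frequency given above.

```python
def latency_us(cycles: int, freq_mhz: float) -> float:
    """Latency in microseconds: cycles / (frequency in MHz)."""
    return cycles / freq_mhz


FREQ_MHZ = 239  # operating frequency of the SRDGT module stated in the text
CYCLES = {"DGT": 384, "IDGT": 384, "CWM": 132}

latencies = {name: latency_us(count, FREQ_MHZ) for name, count in CYCLES.items()}

if __name__ == "__main__":
    for name, value in latencies.items():
        print(f"{name}: {value:.2f} us")
```

The computed values are approximately 1.61 μs for DGT and IDGT and 0.55 μs for CWM, matching the rounded figures above.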

In comparison to the SW implementation, the cycle count of the SRDGT architecture achieves a speedup of 20.1×, 24.3×, 211.2× for NTT (DGT), INTT(IDGT), and CWM, respectively. In comparison to the HW/SW implementations, the SRDGT hardware achieves more than 32.4× speedup for NTT (DGT) computation. Besides, some related works use 1.86× and 1.81× more LUTs than design of the present disclosure in a similar FPGA platform, respectively.

The state-of-the-art HW implementations are divided into two groups depending on whether the CWM is supported. The hardware in some related works support CWM. One of the related works has a higher NTT ATP ratio in LUT, BRAM, and DSP compared to the architecture of the present invention, indicating the high efficiency of our architecture. The architecture of the present invention still has lower cycle counts because the transform size is halved, and only six stages are required in our split radix DGT and IDGT, with the full-pipelined working nature provided by the SPN. One of the related works also presented a unified butterfly unit for NTT, INTT, and CWM. However, taking advantage of the novel split-radix DGT algorithm of the present invention, the cycle count of the present invention is only 384/512=75% of the counts in the previous work for NTT (DGT), and only 132/256=51.6% of the counts in the previous work for CWM. The architecture of the present invention outperforms the NTT ATP and CWM ATP compared to the previous work except for the NTT-DSP ATP because of the compact design in the unified BFU of the previous work. One of the related works proposes three different configurations to trade off the hardware resources and speed. The architecture of the present invention outperforms all these configurations concerning the LUT-NTT and BRAM-NTT ATP, while their work can have a better DSP-NTT ATP. Besides, the architecture of the present invention outperforms the CWM ATP ratios for LUT, BRAM, and DSP compared to the related works.

For fairness, when the architecture of the present invention is compared with the related works that do not support CWM, the clock cycle counts of NTT (DGT) and INTT (IDGT) are used instead of the ATP of the NTT and CWM, since the architecture of the present invention uses additional hardware resources for CWM.

Regarding KyberKEM results and comparisons, an ATP ratio is the normalized product of FPGA resources and the total time by setting the architecture of the present invention as baseline. FIG. 13 depicts Table VI for showing the hardware resource utilization and the latency of the proposed KyberKEM hardware system. Different security level parameter sets, including Kyber-512-CCA, Kyber-768-CCA, and Kyber-1024-CCA, are implemented, and the results are compared with the state-of-the-art implementations concerning speed and hardware resource utilization. The speed of the hardware is obtained by taking the cycle counts and total time, including the key generation, encapsulation, and decapsulation. For simplicity, the total cycle ratio is provided in Table VI using results of the architecture of the present invention as the baseline. The overall efficiency of the hardware architecture is mainly measured by the ATP ratio, obtained by the product of LUT, BRAM, and DSP resources and the total time (noted as LUT-Time ATP, BRAM-Time ATP, and DSP-Time ATP, respectively), and normalized using our results as the baseline.
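The normalization behind the ATP ratio can be sketched in Python as follows. The function name, the dictionary layout, and all numeric inputs are hypothetical placeholders chosen only to exercise the arithmetic; they are not measured results and are not taken from Table V or Table VI.

```python
RESOURCES = ("LUT", "BRAM", "DSP")


def atp_ratios(design: dict, baseline: dict) -> dict:
    """Resource-time products of `design`, normalized to `baseline`."""
    ratios = {}
    for resource in RESOURCES:
        atp_design = design[resource] * design["time_us"]
        atp_baseline = baseline[resource] * baseline["time_us"]
        ratios[resource] = atp_design / atp_baseline
    return ratios


# Hypothetical placeholder figures (NOT results from the disclosure).
baseline = {"LUT": 1000, "BRAM": 4, "DSP": 2, "time_us": 10.0}
other = {"LUT": 2000, "BRAM": 4, "DSP": 1, "time_us": 20.0}

ratios = atp_ratios(other, baseline)

if __name__ == "__main__":
    print(ratios)
```

With these placeholder inputs, a ratio above 1 indicates a larger resource-time product than the baseline, and the baseline design itself always has a ratio of exactly 1 for each resource.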

In the KyberKEM architecture with the SRDGT module of the present invention, all the dimensions k defined in the KyberKEM specification are supported. The data bandwidth of the implementation of the present invention is set to 64 bits. The design of the present invention achieves more than 227.6× speedup when compared with one of the related works, which is a software implementation on ARM Cortex-M4. Compared with the HW/SW co-design in one of the related works, the architecture of the present invention achieves at least 43.8× speedup and 340.5/333.5/163.8× smaller LUT-Time ATP, BRAM-Time ATP, and DSP-Time ATP, respectively, among all the security levels of KyberKEM.

The architecture of the present invention is compared with the related pure hardware implementations. For all the security levels of KyberKEM, the hardware corresponding to the architecture of the present invention obtains at least 1.0×, 2.1×, 2.8×, and 10.7× speedup compared with some of the related works, respectively. Compared with one of the related works, the architecture of the present invention utilizes 1.0/0.6/0.4×ATP in Kyber-512-CCA, but only 0.7/0.5/0.3×ATP in Kyber-1024-CCA, in terms of LUT-Time ATP, BRAM-Time ATP, and DSP-Time ATP, respectively. The reason may be that the KyberKEM architecture of the present invention can benefit more from the Split Radix DGT module at a lower security level. Nevertheless, at a higher security level, the schedule bottleneck can be Keccak and Reject Sample (modules), but not the Split Radix DGT (module). This fact will cause the total cycle gap between the design of the architecture of the present invention and one of the related works to decrease gradually, namely from 1.3× to 1.0× total cycles from Kyber-512-CCA to Kyber-1024-CCA.

As discussed above, the development of quantum computers threatens the security of the conventional public-key cryptography algorithms. CRYSTALS-KyberKEM is one of the leading algorithms in the ongoing NIST Post-Quantum Cryptography (PQC) competition. As a lattice-based cryptographic scheme, the efficiency of CRYSTALS-KyberKEM is dependent on the polynomial multiplication over R_q, or equivalently, the NWC.

In the present disclosure, the implementation of DGT with the split-radix method is explored, thereby providing a higher level of parallelism compared to the LC NTT and less computational complexity compared to classic DGT. The architecture of split-radix DGT module of the present invention can support DGT, IDGT and CWM specific for KyberKEM, and outperforms the state-of-the-arts on NWC modules. In the meantime, KyberKEM architecture with split-radix DGT module of the present invention is configured to support all the security levels of KyberKEM.

The architecture of the present invention can achieve higher performance and hardware efficiency than the state of the art. In some experiments, only 35.7 μs, 47.6 μs, and 68.6 μs are required for Kyber-512-CCA, Kyber-768-CCA, and Kyber-1024-CCA, respectively.

ROM, RAM, and other logical components can be realized through a well-designed layout that incorporates a range of physical components. In some embodiments, this includes the incorporation of passive elements like resistors, inductors, and capacitors, which are crucial for regulating current flow and stabilizing voltage levels. In some embodiments, active elements such as transistors and integrated circuits are arranged in specific embodiments to amplify and control signals within the system. In some embodiments, the layout is further supported by the presence of interconnecting wires and conductive traces, which enable seamless transmission of signals between components.

The functional units and modules of the apparatuses and methods in accordance with the embodiments disclosed herein may be implemented using computing devices, computer processors, or electronic circuitries including but not limited to application specific integrated circuits (ASIC), field programmable gate arrays (FPGA), microcontrollers, and other programmable logic devices configured or programmed according to the teachings of the present disclosure. Computer instructions or software codes running in the computing devices, computer processors, or programmable logic devices can readily be prepared by practitioners skilled in the software or electronic art based on the teachings of the present disclosure.

All or portions of the methods in accordance with the embodiments may be executed in one or more computing devices including server computers, personal computers, laptop computers, and mobile computing devices such as smartphones and tablet computers.

The embodiments may include computer storage media, transient and non-transient memory devices having computer instructions or software codes stored therein, which can be used to program or configure the computing devices, computer processors, or electronic circuitries to perform any of the processes of the present invention. The storage media, transient and non-transient memory devices can include, but are not limited to, floppy disks, optical discs, Blu-ray Discs, DVDs, CD-ROMs, magneto-optical disks, ROMs, RAMs, flash memory devices, or any type of media or devices suitable for storing instructions, codes, and/or data.

Each of the functional units and modules in accordance with various embodiments may also be implemented in distributed computing environments and/or cloud computing environments, wherein the whole or portions of machine instructions are executed in a distributed fashion by one or more processing devices interconnected by a communication network, such as an intranet, Wide Area Network (WAN), Local Area Network (LAN), the Internet, and other forms of data transmission media.

The foregoing description of the present invention has been provided for the purposes of illustration and description. It is not intended to be exhaustive or to limit the invention to the precise forms disclosed. Many modifications and variations will be apparent to the practitioner skilled in the art.

The embodiments were chosen and described in order to best explain the principles of the invention and its practical application, thereby enabling others skilled in the art to understand the invention for various embodiments and with various modifications that are suited to the particular use contemplated.
