Various exemplary embodiments disclosed herein relate generally to masked decoding of polynomials.
Recent significant advances in quantum computing have accelerated research into post-quantum cryptography schemes: cryptographic algorithms which run on classical computers but are believed to remain secure even when faced with an adversary with access to a quantum computer. This demand is driven by interest from standardization bodies, such as the call for proposals for new public-key cryptography standards by the National Institute of Standards and Technology (NIST). The selection procedure for this new cryptographic standard has started and has further accelerated the research of post-quantum cryptography schemes.
There are various families of problems to instantiate these post-quantum cryptographic approaches. Constructions based on the hardness of lattice problems are considered to be promising candidates to become the next standard. A subset of approaches considered within this family are instantiations of the Learning With Errors (LWE) framework: the Ring-Learning With Errors problem. Another subset of approaches is based on recovering a quotient of polynomials in a ring. This means that the operations in these schemes involve arithmetic with polynomials with integer coefficients. Examples of the former include KYBER and NewHope; examples of the latter include NTRU-HRRS-KEM and Streamlined NTRU Prime.
When lattice-based cryptographic functions are implemented, the main computationally expensive operations are the arithmetic with polynomials. More precisely, computations are done in a ring $R_q = (\mathbb{Z}/q\mathbb{Z})[X]/(F)$: the ring where polynomial coefficients are in $\mathbb{Z}/q\mathbb{Z}$ and the polynomial arithmetic is performed modulo F.
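By way of a non-limiting illustration, arithmetic in such a ring may be sketched as follows. The choice F = X^n + 1 and the function name poly_mul_mod are assumptions made only for this example; the description above does not fix F.

```python
def poly_mul_mod(a, b, q, n):
    """Multiply two polynomials in (Z/qZ)[X]/(X^n + 1).

    a, b: coefficient lists of length n, lowest degree first.
    F = X^n + 1 is an assumption chosen for illustration only.
    """
    result = [0] * n
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            k = i + j
            if k < n:
                result[k] = (result[k] + ai * bj) % q
            else:
                # X^n = -1 modulo X^n + 1, so the term wraps with a sign flip.
                result[k - n] = (result[k - n] - ai * bj) % q
    return result


# (1 + 2X) * X^3 = X^3 + 2X^4 = X^3 - 2 in Z_17[X]/(X^4 + 1)
product = poly_mul_mod([1, 2, 0, 0], [0, 0, 0, 1], q=17, n=4)
```

The schoolbook double loop is used here for clarity; practical implementations typically use faster multiplication techniques.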
A summary of various exemplary embodiments is presented below. Some simplifications and omissions may be made in the following summary, which is intended to highlight and introduce some aspects of the various exemplary embodiments, but not to limit the scope of the invention. Detailed descriptions of an exemplary embodiment adequate to allow those of ordinary skill in the art to make and use the inventive concepts will follow in later sections.
Various embodiments relate to a method for masked decoding of a polynomial a using an arithmetic sharing a to perform a cryptographic operation in a data processing system using a modulus q, the method for use in a processor of the data processing system, including: subtracting an offset δ from each coefficient of the polynomial a; applying an arithmetic to Boolean (A2B) function on the arithmetic shares of each coefficient $a_i$ to produce Boolean shares $\hat{a}_i$ that encode the same secret value $a_i$; and performing in parallel for all coefficients a shared binary search to determine which of the coefficients $a_i$ are greater than a threshold t to produce a Boolean sharing value $\hat{b}$ of the bitstring b where each bit of b decodes a coefficient of the polynomial a.
Various embodiments are described, further comprising bitslicing $\hat{a}$ to produce bitsliced values $\hat{x}$ wherein the binary search is based upon the bitsliced values $\hat{x}$.
Various embodiments are described, further comprising initiating the Boolean sharing value $\hat{b}$ to (0, . . . , 0) wherein performing in parallel for all coefficients a binary search further includes: updating the Boolean sharing $\hat{b}$ by iterating through each bit value associated with the modulus q starting with the most significant bit to determine when coefficients $a_i$ are greater than a threshold t.
Further various embodiments relate to a method for masked decoding of a polynomial a using an arithmetic sharing a to perform a cryptographic operation in a data processing system using a modulus q, the method for use in a processor of the data processing system, including: calculating $a_i^{(0)} = a_i^{(0)} - \delta \bmod q$ for each coefficient i of the polynomial a having m coefficients; calculating $\hat{a}_i = \mathrm{A2B}(a_i)$ where A2B is an arithmetic to Boolean (A2B) function on the arithmetic shares of the coefficient $a_i$ of the polynomial a to produce Boolean shares $\hat{a}_i$ that encode the same secret value $a_i$; calculating $\hat{x} = \mathrm{Bitslice}(\hat{a})$ where $\hat{x}$ are bitsliced values; calculating $\hat{y} = \mathrm{Refresh}(2^m - 1, 0, \ldots, 0)$ where the Refresh function refreshes a given Boolean sharing of a variable; calculating $\hat{b} = (0, \ldots, 0)$ to initialize the Boolean sharing $\hat{b}$ of the bitstring b; initializing a first tracking variable d to 0; performing the following steps to perform a binary search for each value of i from 1 to k where k is the number of bits in the modulus q: calculating $e = d + 2^{k-i}$, where e is a second tracking variable; when $e \geq t$ calculating the following: $\hat{b} = \hat{b} \oplus \mathrm{SecAND}(\hat{y}, \hat{x}_{k-i})$ where the function SecAND computes the bit-wise AND of two given Boolean-shared inputs in a masked fashion and the function ⊕ computes the bit-wise XOR of two given Boolean-shared inputs; and $\hat{x}_{k-i}^{(0)} = \neg\hat{x}_{k-i}^{(0)}$, where the function ¬ computes the bit-wise negation of the input bitstring; when $e \leq t$ calculating d=e; when d=t returning the value $\hat{b}$ and ending the binary search; and calculating $\hat{y} = \mathrm{SecAND}(\hat{y}, \hat{x}_{k-i})$, where the function SecAND computes the bit-wise AND of two given Boolean-shared inputs in a masked fashion.
Further various embodiments relate to a data processing system comprising instructions embodied in a non-transitory computer readable medium, the instructions for masked decoding of a polynomial a using an arithmetic sharing a to perform a cryptographic operation in a processor, the instructions, including: instructions for subtracting an offset δ from each coefficient of the polynomial a; instructions for applying an arithmetic to Boolean (A2B) function on the arithmetic shares of the coefficient $a_i$ of the polynomial a to produce Boolean shares $\hat{a}_i$ that encode the same secret value $a_i$; and instructions for performing in parallel for all coefficients a shared binary search to determine which of the coefficients $a_i$ are greater than a threshold t to produce a Boolean sharing value $\hat{b}$ of the bitstring b where each bit of b decodes a coefficient of the polynomial a.
Various embodiments are described, further comprising instructions for bitslicing $\hat{a}$ to produce bitsliced values $\hat{x}$ wherein the binary search is based upon the bitsliced values $\hat{x}$.
Various embodiments are described, further comprising instructions for initiating the Boolean sharing value $\hat{b}$ to (0, . . . , 0) wherein performing in parallel for all shares a binary search further includes: instructions for updating the Boolean sharing $\hat{b}$ by iterating through each bit value associated with the modulus q starting with the most significant bit to determine when coefficients $a_i$ are greater than a threshold t.
Further various embodiments relate to a data processing system comprising instructions embodied in a non-transitory computer readable medium, the instructions for masked decoding of a polynomial a using an arithmetic sharing a to perform a cryptographic operation in a processor, the instructions, including: instructions for calculating $a_i^{(0)} = a_i^{(0)} - \delta \bmod q$ for each coefficient i of the polynomial a having m coefficients; instructions for calculating $\hat{a}_i = \mathrm{A2B}(a_i)$ where A2B is an arithmetic to Boolean (A2B) function on the arithmetic shares of the coefficient $a_i$ of the polynomial a to produce Boolean shares $\hat{a}_i$ that encode the same secret value $a_i$; instructions for calculating $\hat{x} = \mathrm{Bitslice}(\hat{a})$ where $\hat{x}$ are bitsliced values; instructions for calculating $\hat{y} = \mathrm{Refresh}(2^m - 1, 0, \ldots, 0)$ where the Refresh function refreshes a given Boolean sharing of a variable; instructions for calculating $\hat{b} = (0, \ldots, 0)$ to initialize the Boolean sharing $\hat{b}$ of the bitstring b; instructions for initializing a first tracking variable d to 0; instructions for performing the following steps to perform a binary search for each value of i from 1 to k where k is the number of bits in the modulus q: instructions for calculating $e = d + 2^{k-i}$, where e is a second tracking variable; instructions for when $e \geq t$ calculating the following: $\hat{b} = \hat{b} \oplus \mathrm{SecAND}(\hat{y}, \hat{x}_{k-i})$ where the function SecAND computes the bit-wise AND of two given Boolean-shared inputs in a masked fashion and the function ⊕ computes the bit-wise XOR of two given Boolean-shared inputs; and $\hat{x}_{k-i}^{(0)} = \neg\hat{x}_{k-i}^{(0)}$, where the function ¬ computes the bit-wise negation of the input bitstring; instructions for when $e \leq t$ calculating d=e; instructions for when d=t returning the value $\hat{b}$ and ending the binary search; and instructions for calculating $\hat{y} = \mathrm{SecAND}(\hat{y}, \hat{x}_{k-i})$, where the function SecAND computes the bit-wise AND of two given Boolean-shared inputs in a masked fashion.
In order to better understand various exemplary embodiments, reference is made to the accompanying drawings, wherein:
To facilitate understanding, identical reference numerals have been used to designate elements having substantially the same or similar structure and/or substantially the same or similar function.
The description and drawings illustrate the principles of the invention. It will thus be appreciated that those skilled in the art will be able to devise various arrangements that, although not explicitly described or shown herein, embody the principles of the invention and are included within its scope. Furthermore, all examples recited herein are principally intended expressly to be for pedagogical purposes to aid the reader in understanding the principles of the invention and the concepts contributed by the inventor(s) to furthering the art and are to be construed as being without limitation to such specifically recited examples and conditions. Additionally, the term, “or,” as used herein, refers to a non-exclusive or (i.e., and/or), unless otherwise indicated (e.g., “or else” or “or in the alternative”). Also, the various embodiments described herein are not necessarily mutually exclusive, as some embodiments can be combined with one or more other embodiments to form new embodiments.
A common element of multiple post-quantum cryptographic scheme proposals is the decoding of polynomials to a bitstring. Often this is achieved by decoding each coefficient of the polynomial to a bit given some criteria, e.g., a threshold. One family of attacks, so-called side-channel analysis, exploits data dependencies in physical measurements of the target device (e.g., power consumption) and can be thwarted with the help of masking the processed data. A popular approach is to mask this decoding step. However, previous techniques introduce a significant performance overhead or can only be applied to specific moduli (i.e., the modulus is limited to a power-of-two).
The decapsulation operation of a Key Encapsulation Mechanism (KEM) extracts an encapsulated key from a given ciphertext using a secret key. If this secret key were to be leaked, it would invalidate the security properties provided by the KEM. It has been shown that unprotected implementations of post-quantum schemes are vulnerable to implementation attacks, e.g., side-channel analysis. In particular, it was demonstrated that the secret key can be extracted from physical measurements of key-dependent parts in the decapsulation operation. For several post-quantum KEMs, the key-dependent operations include a decoding of polynomials to a bitstring. Commonly, the coefficients of the polynomial are in $\mathbb{Z}_q$ (integers modulo q) and are mapped to one bit depending on their value. For this operation, the space of $\mathbb{Z}_q$ is split into two intervals $I_0$ and $I_1$, which are contiguous and disjoint and whose union covers the complete space of $\mathbb{Z}_q$. Coefficients which are in $I_0$ may be mapped to 0. Coefficients which are in $I_1$ may be mapped to 1. Note that usually the intervals are of the same size. However, this is not a requirement for the embodiments described herein, which may process uneven intervals. While this decoding operation is trivial in the unmasked case, a secure implementation of these KEMs requires the integration of dedicated countermeasures for this step.
Masking is a common countermeasure to thwart side-channel analysis and has been utilized for various applications. Besides security, efficiency is also an important aspect when designing a masked algorithm. Important metrics for software implementations of masking are the number of operations and the number of fresh random elements required for the masking scheme.
The first dedicated masking scheme for the decoding of polynomials was presented in Oscar Reparaz, Sujoy Sinha Roy, Frederik Vercauteren, and Ingrid Verbauwhede, A masked ring-LWE implementation, Cryptographic Hardware and Embedded Systems - CHES 2015 - 17th International Workshop, Saint-Malo, France, Sep. 13-16, 2015, Proceedings (Tim Güneysu and Helena Handschuh, eds.), Lecture Notes in Computer Science, vol. 9293, Springer, 2015, pp. 683-702. It uses a probabilistic table-based approach to decode the coefficients of a masked polynomial to unmasked bits. The main drawbacks are that the solution is limited to first order security, produces unmasked output bits, introduces a high performance overhead, and, due to its probabilistic nature, increases the failure rate of the post-quantum scheme.
The next solution for masked decoding was presented in Tobias Oder, Tobias Schneider, Thomas Pöppelmann, and Tim Güneysu, Practical CCA2-secure and masked ring-LWE implementation, IACR Trans. Cryptogr. Hardw. Embed. Syst. 2018 (2018), no. 1, 142-174 (hereinafter Oder). It uses a sequence of one arithmetic-to-arithmetic (A2A) and one arithmetic-to-Boolean (A2B) sharing conversion per coefficient to decode the coefficients of a masked polynomial to masked bits. The main drawbacks are that this solution is limited to first order security and introduces a high performance overhead due to the two conversions.
The most recent and most efficient approach was proposed in Michiel Van Beirendonck, Jan-Pieter D'Anvers, Angshuman Karmakar, Josep Balasch, and Ingrid Verbauwhede, A side-channel resistant implementation of SABER, IACR Cryptol. ePrint Arch. 2020 (2020), 733 (hereinafter Van Beirendonck). In their case, the targeted post-quantum KEM uses a power-of-two modulus which reduces the decoding to a simple shift of the coefficients. The masked implementation of this step is realized with one call to a new table-based A2A. While this results in an efficient masked decoding step, their approach is most efficient for schemes with power-of-two moduli and cannot be easily used for prime moduli.
Besides the dedicated masked decoding schemes, it is also possible to implement it using generic masking of look-up-tables. In this case, the decoding step is completely implemented in a table and protected using a table masking scheme, e.g., Jean-Sébastien Coron, Higher order masking of look-up tables, Advances in Cryptology—EUROCRYPT 2014—33rd Annual International Conference on the Theory and Applications of Cryptographic Techniques, Copenhagen, Denmark, May 11-15, 2014. Proceedings (Phong Q. Nguyen and Elisabeth Oswald, eds.), Lecture Notes in Computer Science, vol. 8441, Springer, 2014, pp. 441-458 (hereinafter Coron). The main drawbacks are that it requires multiple tables whose size depends on the used modulus, and can introduce a high performance overhead especially for large moduli. In addition, it has been shown that this table-based approach suffers especially at higher orders from attacks that exploit multiple points in time. (See e.g., Nicolas Bruneau, Sylvain Guilley, Zakaria Najm, and Yannick Teglia, Multi-variate high-order attacks of shuffled tables recomputation, Cryptographic Hardware and Embedded Systems—CHES 2015—17th International Workshop, Saint-Malo, France, Sep. 13-16, 2015, Proceedings (Tim Güneysu and Helena Handschuh, eds.), Lecture Notes in Computer Science, vol. 9293, Springer, 2015, pp. 475-494.)
The embodiments described herein improve on the state of the art, enabling a significantly more efficient implementation of post-quantum schemes which include the decoding of polynomials to bitstrings. This is achieved by a masked decoding approach that requires only one A2B conversion per coefficient, works for arbitrary moduli, and does not necessarily require pre-computed tables. It reduces both the number of operations and the number of random elements as compared to prior methods, while not necessarily requiring the storage of large tables as the table-based approaches do.
The embodiments described herein use a new way of decoding masked polynomials which works for arbitrary moduli and without pre-computed tables. In particular, it avoids the costly A2A conversion of the approach from Oder and relies on bit-slicing to significantly reduce the number of masked AND operations. At its core, it includes an A2B conversion with a subsequent masked binary search to determine the decoded bit. In contrast to the majority of the prior art, which is usually limited to first order security, the embodiments disclosed herein are described in a way that allows them to be instantiated at any desired security order. Overall, this helps to reduce both the total number of operations and the number of random elements while allowing one to obtain higher-order protected implementations compared to the prior art.
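By way of a non-limiting illustration, bit-slicing may be sketched on unmasked values as follows. The function name bitslice and the sample values are assumptions for this example only. The sketch shows why one word-wide AND acts on the same bit position of every coefficient at once, which is what may allow one masked AND to serve all coefficients in parallel.

```python
def bitslice(values, k):
    """Transpose m k-bit integers into k m-bit words.

    Bit j of words[i] equals bit i of values[j].
    """
    words = [0] * k
    for j, v in enumerate(values):
        for i in range(k):
            words[i] |= ((v >> i) & 1) << j
    return words


values = [0b101, 0b011, 0b110, 0b111]  # four 3-bit sample values
x = bitslice(values, 3)

# A single AND of two words tests "bit 0 and bit 1 both set"
# for all four values in parallel; bit j of the result refers to values[j].
both_low_bits_set = x[0] & x[1]
```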
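Let $f \in \mathbb{Z}_q[X]$ be a polynomial of degree (at most) m−1, i.e., $f(X) = \sum_{j=0}^{m-1} x_j X^j$ with $x_j \in \mathbb{Z}_q$, which may be written as $x = (x_0, \ldots, x_{m-1})$. An arithmetic sharing of a polynomial x consists of n arithmetic shares $x^{(i)} = (x_0^{(i)}, \ldots, x_{m-1}^{(i)}) \in \mathbb{Z}_q^m$, $0 \leq i < n$, such that
$$x \equiv \sum_{i=0}^{n-1} x^{(i)} \pmod{q}.$$
A Boolean sharing of some value $x \in \mathbb{F}_2^k$ is written as $\hat{x}$ and consists of n Boolean shares $\hat{x}^{(i)} \in \mathbb{F}_2^k$, $0 \leq i < n$, such that
$$x = \bigoplus_{i=0}^{n-1} \hat{x}^{(i)}.$$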
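By way of a non-limiting illustration, the two kinds of sharing may be sketched as follows for a single value. The function names are assumptions for this example, and the sketch only shows how shares recombine; it is not a hardened implementation.

```python
import secrets
from functools import reduce
from operator import xor


def arithmetic_share(x, n, q):
    """Split x into n shares whose sum is x modulo q."""
    shares = [secrets.randbelow(q) for _ in range(n - 1)]
    shares.insert(0, (x - sum(shares)) % q)
    return shares


def arithmetic_unshare(shares, q):
    """Recombine arithmetic shares by addition modulo q."""
    return sum(shares) % q


def boolean_share(x, n, k):
    """Split a k-bit value x into n shares whose XOR is x."""
    shares = [secrets.randbits(k) for _ in range(n - 1)]
    shares.insert(0, reduce(xor, shares, x))
    return shares


def boolean_unshare(shares):
    """Recombine Boolean shares by XOR."""
    return reduce(xor, shares, 0)


a_shares = arithmetic_share(1234, n=3, q=3329)
b_shares = boolean_share(1234, n=3, k=12)
```

Any n−1 of the shares are uniformly random on their own; only the combination of all n shares reveals the value.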
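The goal is to decode a masked polynomial a to a masked bitstring $\hat{b}$ where:
$$\bigoplus_{i=0}^{n-1} \hat{b}^{(i)} = \mathrm{DecodePoly}\left(\sum_{i=0}^{n-1} a^{(i)} \bmod q\right).$$
The function Decode: $\mathbb{Z}_q \to \mathbb{F}_2$ is defined as follows
$$\mathrm{Decode}(x) = \begin{cases} 0 & \text{if } x \in I_0, \\ 1 & \text{if } x \in I_1, \end{cases}$$
where $I_0$ denotes a contiguous interval of values in $\mathbb{Z}_q$ which are mapped to 0 and $I_1$ denotes a contiguous interval of values in $\mathbb{Z}_q$ which are mapped to 1 for a given post-quantum scheme. Note that it is assumed that $I_0$ and $I_1$ are disjoint and cover the complete space of $\mathbb{Z}_q$ (i.e., $I_0 \cap I_1 = \emptyset$ and $I_0 \cup I_1 = \mathbb{Z}_q$). Further, the function DecodePoly: $\mathbb{Z}_q[X] \to \mathbb{F}_2^m$ is defined as
$$\mathrm{DecodePoly}(x) = \mathrm{Concat}(\mathrm{Decode}(x_i))_{0 \leq i < m},$$
where Concat denotes the concatenation of the bits in the vector, thus resulting in an element of $\mathbb{F}_2^m$.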
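By way of a non-limiting illustration, the unmasked Decode and DecodePoly functions may be sketched as follows. The representation of the interval mapped to 1 by two inclusive bounds, the toy modulus 17, and the bounds 5 and 12 are assumptions for this example only.

```python
def decode(x, i1_start, i1_end):
    """Return 1 if x lies in the interval [i1_start, i1_end], else 0.

    The interval mapped to 0 is taken to be the rest of Z_q.
    """
    return 1 if i1_start <= x <= i1_end else 0


def decode_poly(coeffs, i1_start, i1_end):
    """Decode every coefficient and collect the bits in order."""
    return [decode(c, i1_start, i1_end) for c in coeffs]


# Toy example with q = 17 and the interval [5, 12] mapped to 1.
bits = decode_poly([0, 4, 5, 12, 13, 16], 5, 12)
```

In the unmasked case this is a pair of comparisons per coefficient; the difficulty addressed herein is performing the same mapping without ever recombining the shares.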
A masked decoding method may be described using the following pseudo code:
The masked decoding method performs the decoding as follows. Initially in steps 1 to 3, δ is subtracted from each coefficient (i.e., for coefficients 0 to m−1) of the polynomial a, which is then transformed to Boolean shares using an A2B function. Due to this shifting, the coefficient values 0≤x<t should be decoded to 0, while x≥t are decoded to 1. This mapping cannot be easily computed on shared values directly. Instead, a masked binary-search-like approach is performed in a bitsliced fashion. Informally, the mapping can be represented by multiple logical formulas for all bit subsets $S \subseteq \{0, \ldots, k\}$ which add up to values of at least the threshold, i.e., $\sum_{i \in S} 2^i \geq t$. The binary search in the masked decoding method constructs these formulas more efficiently. However, the simplified representation in the masked decoding method may be optimized by unrolling the loop for specific k and performing common subexpression elimination. The process is exemplified for KYBER in the pseudo code. Note that for KYBER the resulting expression is already optimal and does not need to be optimized, i.e., the masked decoding method may be implemented as is.
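By way of a non-limiting illustration, the steps recited above may be sketched in Python as follows. Several points are assumptions made for this sketch: SecAND is instantiated with an ISW-style masked AND, Refresh with a simple re-randomization, and the A2B conversion is replaced by a placeholder that recombines the secret and is therefore NOT side-channel secure. The placeholder only keeps the example short and runnable; a real embodiment would substitute a secure A2B conversion. All function names are illustrative.

```python
import secrets
from functools import reduce
from operator import xor


def refresh(shares, width):
    """Re-randomize a Boolean sharing without changing the shared value."""
    out = list(shares)
    for i in range(1, len(out)):
        r = secrets.randbits(width)
        out[0] ^= r
        out[i] ^= r
    return out


def sec_and(a, b, width):
    """ISW-style masked AND of two Boolean sharings (assumed SecAND instance)."""
    n = len(a)
    c = [a[i] & b[i] for i in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            r = secrets.randbits(width)
            c[i] ^= r
            c[j] ^= (r ^ (a[i] & b[j])) ^ (a[j] & b[i])
    return c


def a2b_placeholder(arith_shares, q, k):
    """INSECURE stand-in for A2B: it recombines the secret before resharing."""
    value = sum(arith_shares) % q
    rest = [secrets.randbits(k) for _ in arith_shares[1:]]
    return [reduce(xor, rest, value)] + rest


def masked_decode(a_shares, q, k, delta, t):
    """a_shares[s][j] is arithmetic share s of coefficient j.

    Returns Boolean shares of the m-bit string b, where bit j is 1 exactly
    when (a_j - delta) mod q >= t. Requires 1 <= t < 2**k.
    """
    if not 1 <= t < (1 << k):
        raise ValueError("t must satisfy 1 <= t < 2**k")
    n, m = len(a_shares), len(a_shares[0])
    mask = (1 << m) - 1
    x = [[0] * n for _ in range(k)]  # x[bit][share], each an m-bit word
    for j in range(m):
        column = [a_shares[s][j] for s in range(n)]
        column[0] = (column[0] - delta) % q    # shift the first share only
        a_hat = a2b_placeholder(column, q, k)  # Boolean shares of a_j
        for s in range(n):                     # bitslice share by share
            for bit in range(k):
                x[bit][s] |= ((a_hat[s] >> bit) & 1) << j
    y = refresh([mask] + [0] * (n - 1), m)
    b = [0] * n
    d = 0
    for i in range(1, k + 1):
        e = d + (1 << (k - i))
        if e >= t:
            term = sec_and(y, x[k - i], m)
            b = [bs ^ ts for bs, ts in zip(b, term)]
            x[k - i][0] ^= mask                # negate via the first share
        if e <= t:
            d = e
        if d == t:
            return b
        y = sec_and(y, x[k - i], m)
    return b  # not reached when 1 <= t < 2**k


# Usage with the parameter values given for KYBER in this description.
q, k, delta, t = 3329, 12, 2497, 1664
coeffs = [0, 831, 832, 833, 1664, 2496, 2497, 3328]
n_shares = 3
tail = [[secrets.randbelow(q) for _ in coeffs] for _ in range(n_shares - 1)]
head = [(c - sum(col)) % q for c, col in zip(coeffs, zip(*tail))]
b_shares = masked_decode([head] + tail, q, k, delta, t)
decoded = reduce(xor, b_shares, 0)  # unmasked here only to inspect the result
```

Each loop iteration operates on m-bit words, so a single masked AND processes the same bit position of all m coefficients. The number of shares is taken from the input, which is how the sketch reflects instantiation at different security orders.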
An example of how the masked decoding method operates will now be described. A modulus value q=3329 is used as in the KYBER KEM with the parameters k=12, δ=2497, and t=1664. In the following, an equation is derived to compute the decoding operations Decode(x) using only XOR, AND, and negation. The resulting equation for Decode is computed by the masked decoding method iteratively using masked implementations for XOR and AND.
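By way of a non-limiting illustration, unrolling the binary search for k=12 and t=1664 (binary 011010000000) on a shifted value x with bits $x_{11}, \ldots, x_0$ gives, under the convention that x≥t decodes to 1, the expression $\mathrm{Decode}(x) = x_{11} \oplus (y \wedge x_8) \oplus (y \wedge \neg x_8 \wedge x_7)$ with $y = \neg x_{11} \wedge x_{10} \wedge x_9$. This expression is derived here from the recited steps and is offered as a sketch rather than as the exact equation of the example. The following unmasked check, with an illustrative function name, compares it against the threshold test for every 12-bit value.

```python
def decode_unrolled(x):
    """Decode a 12-bit shifted value for t = 1664 using only XOR, AND, NOT."""
    def bit(p):
        return (x >> p) & 1

    x11, x10, x9, x8, x7 = bit(11), bit(10), bit(9), bit(8), bit(7)
    not_x11, not_x8 = x11 ^ 1, x8 ^ 1
    y = not_x11 & x10 & x9  # set exactly when x is in [1536, 2047]
    # The three terms cover [2048, 4095], [1792, 2047] and [1664, 1791];
    # they are mutually exclusive, so XOR acts as OR.
    return x11 ^ (y & x8) ^ (y & not_x8 & x7)


matches_threshold = all(
    decode_unrolled(x) == (1 if x >= 1664 else 0) for x in range(1 << 12)
)
```

Bits 6 down to 0 never appear in the expression because the search stops once the tracking variable d reaches t.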
The output $\hat{b}$ is initialized to all zeros and will be updated as the binary search is carried out.
The above example illustrates the operation of the binary search of the masked decoding method. The loop starting at step 8 performs a binary search to determine all $a_j > t$ in parallel for all j, i.e., across all coefficients. This speeds up the decoding process.
As noted before, the solution in Oder is only valid at the first security order, contrary to the masked decoding method, which works at arbitrary orders. Still, any higher order extension based on the original ideas of Oder would require at least one A2A and one A2B conversion per coefficient. The masked decoding method is much more efficient in this regard, as it only uses one A2B per coefficient. The binary search of the masked decoding method is also performed in a bitsliced fashion, which should result in superior performance given common register sizes (e.g., 32 bits) and parameter sets (e.g., KYBER).
In Van Beirendonck, while their approach is very efficient and potentially applicable for higher security orders, it is limited to power-of-two moduli and at least for their implementation requires pre-computed tables. In contrast, the masked decoder may be used for arbitrary moduli and provides competitive performances even for power-of-two moduli, as in that case the binary-search requires only one iteration.
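By way of a non-limiting illustration, the number of loop iterations of the binary search can be counted from the tracking variables alone, since they do not depend on secret data. The function name and the power-of-two parameters k=13 and t=4096 are assumptions for this example.

```python
def search_iterations(k, t):
    """Count binary-search iterations until the tracking variable d equals t."""
    d = 0
    for i in range(1, k + 1):
        e = d + (1 << (k - i))
        if e <= t:
            d = e
        if d == t:
            return i
    return k


# Parameters given for KYBER in this description: t = 1664, k = 12.
kyber_iterations = search_iterations(12, 1664)

# Hypothetical power-of-two case: modulus 2**13 with threshold 2**12.
power_of_two_iterations = search_iterations(13, 1 << 12)
```

The loop ends at the lowest set bit of t, so a threshold that is a single power of two, as arises with a power-of-two modulus split into equal halves, needs one iteration.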
In Coron, the generic table-based approach requires tables depending on the number of shares and the size of the modulus. Consider the case of KYBER with q=3329, which would require multiple tables with $2^{12}$ entries. In contrast, the masked decode method may be instantiated without any pre-computed tables depending on how the A2B conversion is implemented.
The masked decode method described herein provides a more efficient decoding of polynomial coefficients that may be used with any modulus and instantiated at any security order to counter side-channel attacks. As described above, the masked decode method described herein provides various benefits over existing methods of using shares for the binary decoding of the coefficients of a polynomial.
The processor 120 may be any hardware device capable of executing instructions stored in memory 130 or storage 160 or otherwise processing data. As such, the processor may include a microprocessor, microcontroller, graphics processing unit (GPU), field programmable gate array (FPGA), application-specific integrated circuit (ASIC), or other similar devices.
The memory 130 may include various memories such as, for example L1, L2, or L3 cache or system memory. As such, the memory 130 may include static random-access memory (SRAM), dynamic RAM (DRAM), flash memory, read only memory (ROM), or other similar memory devices.
The user interface 140 may include one or more devices for enabling communication with a user as needed. For example, the user interface 140 may include a display, a touch interface, a mouse, and/or a keyboard for receiving user commands. In some embodiments, the user interface 140 may include a command line interface or graphical user interface that may be presented to a remote terminal via the network interface 150.
The network interface 150 may include one or more devices for enabling communication with other hardware devices. For example, the network interface 150 may include a network interface card (NIC) configured to communicate according to the Ethernet protocol or other communications protocols, including wireless protocols. Additionally, the network interface 150 may implement a TCP/IP stack for communication according to the TCP/IP protocols. Various alternative or additional hardware or configurations for the network interface 150 will be apparent.
The storage 160 may include one or more machine-readable storage media such as read-only memory (ROM), random-access memory (RAM), magnetic disk storage media, optical storage media, flash-memory devices, or similar storage media. In various embodiments, the storage 160 may store instructions for execution by the processor 120 or data upon which the processor 120 may operate. For example, the storage 160 may store a base operating system 161 for controlling various basic operations of the hardware 100. The storage 160 may also store instructions 162 for implementing the masked decoding method described above.
It will be apparent that various information described as stored in the storage 160 may be additionally or alternatively stored in the memory 130. In this respect, the memory 130 may also be considered to constitute a “storage device” and the storage 160 may be considered a “memory.” Various other arrangements will be apparent. Further, the memory 130 and storage 160 may both be considered to be “non-transitory machine-readable media.” As used herein, the term “non-transitory” will be understood to exclude transitory signals but to include all forms of storage, including both volatile and non-volatile memories.
While the host device 100 is shown as including one of each described component, the various components may be duplicated in various embodiments. For example, the processor 120 may include multiple microprocessors that are configured to independently execute the methods described herein or are configured to perform steps or subroutines of the methods described herein such that the multiple processors cooperate to achieve the functionality described herein. Further, where the device 100 is implemented in a cloud computing system, the various hardware components may belong to separate physical systems. For example, the processor 120 may include a first processor in a first server and a second processor in a second server.
As used herein, the term “non-transitory machine-readable storage medium” will be understood to exclude a transitory propagation signal but to include all forms of volatile and non-volatile memory. When software is implemented on a processor, the combination of software and processor becomes a single specific machine. Although the various embodiments have been described in detail, it should be understood that the invention is capable of other embodiments and its details are capable of modifications in various obvious respects.
Because the data processing implementing the present invention is, for the most part, composed of electronic components and circuits known to those skilled in the art, circuit details will not be explained in any greater extent than that considered necessary as illustrated above, for the understanding and appreciation of the underlying concepts of the present invention and in order not to obfuscate or distract from the teachings of the present invention.
Although the invention is described herein with reference to specific embodiments, various modifications and changes can be made without departing from the scope of the present invention as set forth in the claims below. Accordingly, the specification and figures are to be regarded in an illustrative rather than a restrictive sense, and all such modifications are intended to be included within the scope of the present invention. Any benefits, advantages, or solutions to problems that are described herein with regard to specific embodiments are not intended to be construed as a critical, required, or essential feature or element of any or all the claims.
Furthermore, the terms “a” or “an,” as used herein, are defined as one or more than one. Also, the use of introductory phrases such as “at least one” and “one or more” in the claims should not be construed to imply that the introduction of another claim element by the indefinite articles “a” or “an” limits any particular claim containing such introduced claim element to inventions containing only one such element, even when the same claim includes the introductory phrases “one or more” or “at least one” and indefinite articles such as “a” or “an.” The same holds true for the use of definite articles.
Unless stated otherwise, terms such as “first” and “second” are used to arbitrarily distinguish between the elements such terms describe. Thus, these terms are not necessarily intended to indicate temporal or other prioritization of such elements.
Any combination of specific software running on a processor to implement the embodiments of the invention constitutes a specific dedicated machine.
It should be appreciated by those skilled in the art that any block diagrams herein represent conceptual views of illustrative circuitry embodying the principles of the invention.