Modular multiplication is a core component of most cryptographic constructions, such as RSA (Rivest-Shamir-Adleman) and ECC (Elliptic-Curve Cryptography), which are used for public-key cryptography in TLS/web connections today. Emerging PQC (Post-Quantum Cryptography) and FHE (Fully Homomorphic Encryption) algorithm standards also rely on modular multiplication.
Examples of modular multiplication algorithms include Montgomery modular multiplication, more commonly referred to as Montgomery multiplication, and Barrett multiplication. Montgomery modular multiplication relies on a special representation of numbers called Montgomery form. The algorithm uses the Montgomery forms of a and b to efficiently compute the Montgomery form of ab mod N. The efficiency comes from avoiding expensive division operations. Classical modular multiplication reduces the double-width product ab by dividing by N and keeping only the remainder, a division that requires quotient digit estimation and correction. Montgomery multiplication, in contrast, depends on a constant R > N that is coprime to N, and the only division it requires is division by R. The constant R can be chosen so that division by R is easy, significantly improving the speed of the algorithm. In practice, R is always a power of two, since division by a power of two can be implemented by bit shifting.
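By way of non-limiting illustration, the following Python sketch shows the Montgomery form, the reduction step (commonly called REDC), and why choosing R as a power of two turns the division by R into masking and shifting. The modulus N, the constant R, the operands, and the helper name redc are arbitrary toy choices for illustration only.

# Illustrative only: toy parameters, not a secure or optimized implementation.
N = 97                              # odd modulus
R = 128                             # power of two with R > N and gcd(R, N) = 1
N_prime = (-pow(N, -1, R)) % R      # precomputed -N^(-1) mod R

def redc(T):
    # Montgomery reduction: returns T * R^(-1) mod N for 0 <= T < R*N.
    m = (T * N_prime) % R           # "mod R" is a bit mask because R is a power of two
    t = (T + m * N) // R            # division by R is a right shift
    return t - N if t >= N else t

a, b = 42, 61
a_bar = (a * R) % N                 # Montgomery form of a
b_bar = (b * R) % N                 # Montgomery form of b
ab_bar = redc(a_bar * b_bar)        # Montgomery form of a*b mod N, with no division by N
assert redc(ab_bar) == (a * b) % N  # converting back recovers the ordinary residue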
The foregoing aspects and many of the attendant advantages of this invention will become more readily appreciated as the same becomes better understood by reference to the following detailed description, when taken in conjunction with the accompanying drawings, wherein like reference numerals refer to like parts throughout the various views unless otherwise specified:
Embodiments of methods and apparatus for an optimization technique for modular multiplication algorithms are described herein. In the following description, numerous specific details are set forth to provide a thorough understanding of embodiments of the invention. One skilled in the relevant art will recognize, however, that the invention can be practiced without one or more of the specific details, or with other methods, components, materials, etc. In other instances, well-known structures, materials, or operations are not shown or described in detail to avoid obscuring aspects of the invention.
Reference throughout this specification to “one embodiment” or “an embodiment” means that a particular feature, structure, or characteristic described in connection with the embodiment is included in at least one embodiment of the present invention. Thus, the appearances of the phrases “in one embodiment” or “in an embodiment” in various places throughout this specification are not necessarily all referring to the same embodiment. Furthermore, the particular features, structures, or characteristics may be combined in any suitable manner in one or more embodiments.
For clarity, individual components in the Figures herein may also be referred to by their labels in the Figures, rather than by a particular reference number. Additionally, reference numbers referring to a particular type of component (as opposed to a particular component) may be shown with a reference number followed by “(typ)” meaning “typical.” It will be understood that the configuration of these components will be typical of similar components that may exist but are not shown in the drawing Figures for simplicity and clarity or otherwise similar components that are not labeled with separate reference numbers. Conversely, “(typ)” is not to be construed as meaning the component, element, etc. is typically used for its disclosed function, implementation, purpose, etc.
There are two well-known modular multiplication algorithms: Montgomery and Barrett. For large operand sizes, word-level software and hardware implementations of these algorithms are utilized. For these word-level algorithms, the multiplication and reduction portions have comparable performance costs.
Word-level modular multiplication involves many core multiplication operations, and most of these core multiplications can be parallelized. Modular multiplication consists of two main operations: multiplication and reduction.
There are several core multiplications in the multiplication operation. These core multiplications are highly parallelizable since they are independent multiplications. However, word-level reduction is a serial process and is hard to parallelize. The more serial steps the reduction includes (i.e., the larger the operands), the more inefficiency it introduces into the performance of modular multiplication.
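By way of non-limiting illustration, the Python sketch below contrasts these two behaviors for a schoolbook word-level multiply; the word size w, digit count L, and operand values are assumed for illustration. The L*L core word products are mutually independent and could be issued in parallel, whereas folding them into result digits is carry-serial.

# Illustrative sketch of parallel core multiplications vs. serial accumulation.
w, L = 16, 4
mask = (1 << w) - 1
A, B = 0x1234_5678_9ABC_DEF0, 0x0FED_CBA9_8765_4321
a = [(A >> (w * i)) & mask for i in range(L)]   # w-bit digits of A
b = [(B >> (w * i)) & mask for i in range(L)]   # w-bit digits of B

# All L*L core multiplications are independent of one another:
partials = {(i, j): a[i] * b[j] for i in range(L) for j in range(L)}

# Accumulation (and, likewise, word-level reduction) is inherently serial,
# because each output digit depends on the carry from the previous one:
acc, result = 0, 0
for n in range(2 * L):
    acc += sum(p for (i, j), p in partials.items() if i + j == n)
    result |= (acc & mask) << (w * n)
    acc >>= w
assert result == A * B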
Montgomery Multiplication
Montgomery Multiplication and Reduction operations can be defined and implemented in many ways. To demonstrate our approach, we utilize a word-level formulation. First, we define the preliminaries:
Montgomery Reduction can be utilized after full multiplication of the operands. This is defined as follows:
A schematic representation 100 of the foregoing is depicted in FIG. 1.
It should be noted that this algorithm does not exactly represent Montgomery Reduction; rather, it represents an Almost Montgomery Reduction operation. In the Almost Montgomery Reduction operation, the result is not necessarily less than the modulus M, but it is exactly k bits. If k/w is not an integer, the operation still guarantees an L-digit result, eliminating the need for further checking.
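As non-limiting background, a textbook word-level (Almost) Montgomery Reduction can be sketched in Python as follows. The word size w, the modulus, and the function name almost_mont_reduce are assumptions for illustration; the sketch is not intended to reproduce the particular listing used by the illustrated embodiments.

def almost_mont_reduce(C, M, w, L):
    # Textbook sketch: returns T with T * 2^(w*L) congruent to C (mod M) and T < M + C/2^(w*L).
    # One serial step per w-bit digit; each step depends on the previous one.
    mask = (1 << w) - 1
    m_prime = (-pow(M, -1, 1 << w)) & mask    # precomputed -M^(-1) mod 2^w (M must be odd)
    T = C
    for _ in range(L):                        # L serial digit-elimination steps
        q = (T & mask) * m_prime & mask       # q makes the low word of T + q*M zero
        T = (T + q * M) >> w                  # shift the zeroed word out
    return T                                  # "almost" result; may still be >= M

# Example with assumed parameters: w = 8-bit words, k = 32-bit modulus, L = k/w = 4 digits.
w, k = 8, 32
L = k // w
M = (1 << 31) + 11                            # an odd k-bit modulus, chosen arbitrarily
C = 0xDEADBEEF * 0x12345678                   # a double-width value to reduce
T = almost_mont_reduce(C, M, w, L)
assert T % M == (C * pow(2, -k, M)) % M       # T is congruent to C * 2^(-k) (mod M)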
The Montgomery Multiplication operation interleaves the multiplication and reduction components of the operation. The Montgomery Multiplication algorithm can be defined as follows:
A single step of the Montgomery Multiplication algorithm is depicted in FIG. 2.
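As non-limiting background, one common textbook formulation of interleaved word-level Montgomery multiplication, which consumes one w-bit word of A per iteration and may differ in detail from the illustrated algorithm, can be sketched in Python as follows; all names and parameters are illustrative.

def mont_mul_interleaved(A, B, M, w, L):
    # Textbook interleaved sketch: returns A * B * 2^(-w*L) mod M,
    # assuming 0 <= A, B < M < 2^(w*L) and M odd.
    mask = (1 << w) - 1
    m_prime = (-pow(M, -1, 1 << w)) & mask    # precomputed -M^(-1) mod 2^w
    T = 0
    for i in range(L):
        a_i = (A >> (w * i)) & mask           # next w-bit word of A
        T += a_i * B                          # multiplication portion of the step
        q = (T & mask) * m_prime & mask       # reduction digit (serially dependent)
        T = (T + q * M) >> w                  # cancel the low word and shift it out
    return T - M if T >= M else T             # single conditional correction

# Illustrative check with assumed parameters.
w, L = 8, 4
M = (1 << 31) + 11                            # odd modulus smaller than 2^(w*L)
A, B = 0x1234ABCD, 0x0FEDC987                 # operands smaller than M
R_inv = pow(1 << (w * L), -1, M)
assert mont_mul_interleaved(A, B, M, w, L) == (A * B * R_inv) % M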
To simplify the presentation, the Montgomery Multiplication algorithm can be defined over 2k-bit integers as follows:
A schematic representation 300 of the foregoing is shown in FIG. 3.
Step 2 of the Montgomery reduction begins with CH and CL in registers 310 and 312, where CH is the high 2*k bits of C (CH = C/R) and CL is the low 2*k bits of C (CL = C mod R). CH and CL are input to a block 314 in which Montgomery reduction is performed using 2*L serial steps using a known algorithm, with the particular algorithm being outside the scope of this disclosure. The output of block 314 is RES, as depicted in a block 316.
In terms of A and B, RES can be written as follows:
Assume we have a precomputed constant KM, which is derived from operand B:
When C is calculated in this manner, the value of the lower k bits of C is ‘0’, as shown in a block 413. Meanwhile, the upper 2*k bits (CH) are loaded into block 410 and the upper k bits of CL are loaded into a block 412. As before, CH and CL are fed into a Montgomery reduction block 414. However, since the lower k bits of CL are ‘0’, the Montgomery reduction can be performed in L serial steps, halving the 2*L serial steps required under the conventional Montgomery reduction in FIG. 3.
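The derivation of KM and the manner in which C is calculated are described with reference to the accompanying figures; the Python check below merely illustrates, with assumed toy parameters, the underlying arithmetic fact being exploited: when the lower k bits of C are ‘0’, Montgomery reduction by R = 2^(2k) collapses to a k-bit shift followed by only L serial word steps.

# Illustrative check with assumed toy parameters (w-bit words, k = w*L).
w, L = 8, 4
k = w * L
M = (1 << (2 * k - 1)) + 297                  # an arbitrary odd 2k-bit modulus
mask = (1 << w) - 1
m_prime = (-pow(M, -1, 1 << w)) & mask

def word_steps(T, steps):
    # 'steps' serial Montgomery word steps; result is congruent to T * 2^(-w*steps) mod M.
    for _ in range(steps):
        q = (T & mask) * m_prime & mask
        T = (T + q * M) >> w
    return T % M

C = 0xBADC0FFEE0DDF00D << k                   # any value whose lower k bits are zero
assert word_steps(C, 2 * L) == word_steps(C >> k, L)   # 2*L steps vs. a shift plus L steps

Because the first L word steps of the conventional path see a zero low word, they perform no useful work; discarding them is what reduces the serial latency from 2*L steps to L steps.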
An embodiment of the Montgomery Multiplication algorithm that interleaves multiplication and reduction components is defined as follows:
Barrett Multiplication
The Barrett Multiplication algorithm can be defined over 2k-bit integers as follows:
For the reduction step, the number of serial steps can be calculated as:
A representation 500 of an example of a conventional Barrett Multiplication algorithm is shown in FIG. 5.
CH, the upper half of C (CH = C/2^(2k)), is loaded into a block 510, while CL, the lower half of C (CL = C mod 2^(2k)), is loaded into a block 512. CH and CL are provided as inputs to a block 514 in which Barrett reduction is performed using 2*L serial steps using a known algorithm, with the particular Barrett reduction algorithm being outside the scope of this disclosure. The output of block 514 is RES, as shown in a block 516.
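As non-limiting background, the classical single-step form of Barrett reduction that word-level implementations build upon can be sketched in Python as follows; the modulus, the parameter k, and the precomputed constant mu below are assumptions for illustration only.

def barrett_reduce(C, M, k, mu):
    # Classical Barrett reduction sketch: returns C mod M without a hardware divide,
    # assuming 2^(k-1) <= M < 2^k, 0 <= C < 2^(2k), and precomputed mu = floor(2^(2k) / M).
    q = ((C >> (k - 1)) * mu) >> (k + 1)      # quotient estimate, too small by at most 2
    r = C - q * M
    while r >= M:                             # at most two correction subtractions
        r -= M
    return r

# Illustrative check with an assumed modulus.
k = 255
M = (1 << 255) - 19                           # a well-known 255-bit prime, used here only as an example
mu = (1 << (2 * k)) // M                      # precomputed once per modulus
C = (M - 5) * (M - 7)                         # a double-width product to reduce
assert barrett_reduce(C, M, k, mu) == C % M

As with Montgomery reduction, the word-level counterpart of this computation produces its quotient digits serially, which is the source of the 2*L serial steps noted above.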
In terms of A and B, RES can be written as follows:
Assume we have a precomputed constant KB, which is derived from operand B:
A schematic representation of an optimized Barrett multiplication algorithm using this approach for calculating RES is shown in FIG. 6.
In the Barrett reduction step, CH is loaded in block 612 and CL is loaded in block 614. In this case, the upper k bits of CH are ‘0’s, leaving the lower k bits of CH to be operated on. As a result, the Barrett reduction in a block 616 only requires L serial steps rather than 2*L serial steps. The output of block 616 is RES, as shown in a block 618. Under the optimized Barrett multiplication algorithm of FIG. 6, the number of serial reduction steps is thus halved relative to the conventional algorithm of FIG. 5.
Generally, the improved Montgomery and Barrett algorithms may be implemented via execution of instructions on processors used in various types of devices comprising communication endpoints. For example, such communication endpoints may include computers/servers or network devices installed in such computers/servers, such as NICs (Network Interface Controllers), SmartNICs, Infrastructure Processing Units (IPUs), Data Processing Units (DPUs), and Edge Processing Units (EPUs). Additional use cases include but are not limited to encrypted memory applications, encrypted network applications, authentication, and access control. In addition to execution of instructions on processors, all or part of the improved Montgomery and Barrett algorithms may be implemented using programmable or embedded logic, such as but not limited to Field Programmable Gate Arrays (FPGAs), Application Specific Integrated Circuits (ASICs), and other types of embedded logic.
FIGS. 7 and 7a show respective infrastructure processor units (IPUs) 700 and 700a, each including two optical modules 702 and 704.
Under an IPU 700a shown in
CPU/SoC 710 employs a System on a Chip including multiple processor cores. Various CPU/processor architectures may be used, including x86 and ARM architectures. In one non-limiting example, CPU/SoC 710 comprises an Intel® Xeon® processor. Software executed on the processor cores may be loaded into memory 718, either from a storage device (not shown), from a host, or received over a network coupled to enhanced optical modules 702 and 704.
As further shown in each of
In addition to IPUs and SmartNICs, embodiments of the improved Montgomery and Barrett algorithms may be implemented on various types of add-in cards and other devices. Examples of such devices and add-in cards include but are not limited to line cards, switches, routers, cellular equipment (like nano-cells, picocells, ethernet connected radios), radio access network (RAN) equipment, WiFi equipment, network appliances, storage devices, security devices, servers with network ports, and telecom equipment.
Processors 970 and 980 are shown including integrated memory controller (IMC) circuitry 972 and 982, respectively. Processor 970 also includes interface circuits 976 and 978; similarly, second processor 980 includes interface circuits 986 and 988. Processors 970, 980 may exchange information via the interface 950 using interface circuits 978, 988. IMCs 972 and 982 couple the processors 970, 980 to respective memories, namely a memory 932 and a memory 934, which may be portions of main memory locally attached to the respective processors.
Processors 970, 980 may exchange information with a network interface (NW I/F) 990 via individual interfaces 952, 954 using interface circuits 976, 994, 986, 998. The network interface 990 (e.g., one or more of an interconnect, bus, and/or fabric, and in some examples is a chipset) may optionally exchange information with a coprocessor 938 via an interface circuit 992. In some examples, the coprocessor 938 is a special-purpose processor, such as, for example, a high-throughput processor, a network or communication processor, compression engine, graphics processor, general purpose graphics processing unit (GPGPU), neural-network processing unit (NPU), embedded processor, or the like.
Generally, in addition to processors and CPUs, the teachings and principles disclosed herein may be applied to Other Processing Units (collectively termed XPUs), including one or more of Graphic Processor Units (GPUs) or General Purpose GPUs (GP-GPUs), Tensor Processing Units (TPUs), Data Processing Units (DPUs), Infrastructure Processing Units (IPUs), Edge Processing Units (EPUs), Artificial Intelligence (AI) processors or AI inference units and/or other accelerators, FPGAs and/or other programmable logic (used for compute purposes), etc. While some of the diagrams herein show the use of CPUs and/or processors, this is merely exemplary and non-limiting. Generally, any type of XPU may be used in place of a CPU or processor in the illustrated embodiments. Moreover, as used in the following claims, the term “processor” is used to generically cover CPUs and various forms of XPUs.
A shared cache (not shown) may be included in either processor 970, 980 or outside of both processors, yet connected with the processors via an interface such as a point-to-point (P-P) interconnect, such that either or both processors' local cache information may be stored in the shared cache if a processor is placed into a low power mode.
Network interface 990 may be coupled to a first interface 916 via interface circuit 996. In some examples, first interface 916 may be an interface such as a Peripheral Component Interconnect (PCI) interconnect, a PCI Express interconnect or another I/O interconnect, such as but not limited to COMPUTE EXPRESS LINK™ (CXL). In some examples, first interface 916 is coupled to a power control unit (PCU) 917, which may include circuitry, software, and/or firmware to perform power management operations with regard to the processors 970, 980 and/or coprocessor 938. PCU 917 provides control information to a voltage regulator (not shown) to cause the voltage regulator to generate the appropriate regulated voltage. PCU 917 also provides control information to control the operating voltage generated. In various examples, PCU 917 may include a variety of power management logic units (circuitry) to perform hardware-based power management. Such power management may be wholly processor controlled (e.g., by various processor hardware, and which may be triggered by workload and/or power, thermal or other processor constraints) and/or the power management may be performed responsive to external sources (such as a platform or power management source or system software).
PCU 917 is illustrated as being present as logic separate from the processor 970 and/or processor 980. In other cases, PCU 917 may execute on a given one or more of cores (not shown) of processor 970 or 980. In some cases, PCU 917 may be implemented as a microcontroller (dedicated or general-purpose) or other control logic configured to execute its own dedicated power management code, sometimes referred to as P-code. In yet other examples, power management operations to be performed by PCU 917 may be implemented externally to a processor, such as by way of a separate power management integrated circuit (PMIC) or another component external to the processor. In yet other examples, power management operations to be performed by PCU 917 may be implemented within BIOS or other system software.
Various I/O devices 914 may be coupled to first interface 916, along with a bus bridge 918 which couples first interface 916 to a second interface 920. In some examples, one or more additional processor(s) 915, such as coprocessors, high throughput many integrated core (MIC) processors, GPGPUs, accelerators (such as graphics accelerators, digital signal processing (DSP) units, and cryptographic accelerator units), FPGAs, XPUs, or any other processor, are coupled to first interface 916. In some examples, second interface 920 may be a low pin count (LPC) interface. Various devices may be coupled to second interface 920 including, for example, a keyboard and/or mouse 922, communication devices 927 and storage circuitry 928. Storage circuitry 928 may be one or more non-transitory machine-readable storage media, such as a disk drive, Flash drive, SSD, or other mass storage device which may include instructions/code and data 930. Further, an audio I/O 924 may be coupled to second interface 920. Note that other architectures than the point-to-point architecture described above are possible. For example, instead of the point-to-point architecture, a system such as system 900 may implement a multi-drop interface or other such architecture.
While various embodiments described herein use the term System-on-a-Chip or System-on-Chip (“SoC”) to describe a device or system having a processor and associated circuitry (e.g., Input/Output (“I/O”) circuitry, power delivery circuitry, memory circuitry, etc.) integrated monolithically into a single Integrated Circuit (“IC”) die, or chip, the present disclosure is not limited in that respect. For example, in various embodiments of the present disclosure, a device or system can have one or more processors (e.g., one or more processor cores) and associated circuitry (e.g., Input/Output (“I/O”) circuitry, power delivery circuitry, etc.) arranged in a disaggregated collection of discrete dies, tiles and/or chiplets (e.g., one or more discrete processor core die arranged adjacent to one or more other die such as memory die, I/O die, etc.). In such disaggregated devices and systems the various dies, tiles and/or chiplets can be physically and electrically coupled together by a package structure including, for example, various packaging substrates, interposers, active interposers, photonic interposers, interconnect bridges and the like. The disaggregated collection of discrete dies, tiles, and/or chiplets can also be part of a System-on-Package (“SoP”).
Although some embodiments have been described in reference to particular implementations, other implementations are possible according to some embodiments. Additionally, the arrangement and/or order of elements or other features illustrated in the drawings and/or described herein need not be arranged in the particular way illustrated and described. Many other arrangements are possible according to some embodiments.
In each system shown in a figure, the elements in some cases may each have a same reference number or a different reference number to suggest that the elements represented could be different and/or similar. However, an element may be flexible enough to have different implementations and work with some or all of the systems shown or described herein. The various elements shown in the figures may be the same or different. Which one is referred to as a first element and which is called a second element is arbitrary.
In the description and claims, the terms “coupled” and “connected,” along with their derivatives, may be used. It should be understood that these terms are not intended as synonyms for each other. Rather, in particular embodiments, “connected” may be used to indicate that two or more elements are in direct physical or electrical contact with each other. “Coupled” may mean that two or more elements are in direct physical or electrical contact. However, “coupled” may also mean that two or more elements are not in direct contact with each other, but yet still co-operate or interact with each other. Additionally, “communicatively coupled” means that two or more elements that may or may not be in direct contact with each other, are enabled to communicate with each other. For example, if component A is connected to component B, which in turn is connected to component C, component A may be communicatively coupled to component C using component B as an intermediary component.
An embodiment is an implementation or example of the inventions. Reference in the specification to “an embodiment,” “one embodiment,” “some embodiments,” or “other embodiments” means that a particular feature, structure, or characteristic described in connection with the embodiments is included in at least some embodiments, but not necessarily all embodiments, of the inventions. The various appearances “an embodiment,” “one embodiment,” or “some embodiments” are not necessarily all referring to the same embodiments.
Not all components, features, structures, characteristics, etc. described and illustrated herein need be included in a particular embodiment or embodiments. If the specification states a component, feature, structure, or characteristic “may”, “might”, “can” or “could” be included, for example, that particular component, feature, structure, or characteristic is not required to be included. If the specification or claim refers to “a” or “an” element, that does not mean there is only one of the element. If the specification or claims refer to “an additional” element, that does not preclude there being more than one of the additional element.
An algorithm is here, and generally, considered to be a self-consistent sequence of acts or operations leading to a desired result. These include physical manipulations of physical quantities. Usually, though not necessarily, these quantities take the form of electrical or magnetic signals capable of being stored, transferred, combined, compared, and otherwise manipulated. It has proven convenient at times, principally for reasons of common usage, to refer to these signals as bits, values, elements, symbols, characters, terms, numbers, or the like. It should be understood, however, that all of these and similar terms are to be associated with the appropriate physical quantities and are merely convenient labels applied to these quantities.
As discussed above, various aspects of the embodiments herein may be facilitated by corresponding software and/or firmware components and applications, such as software and/or firmware executed by an embedded processor or the like. Thus, embodiments of this invention may be used as or to support a software program, software modules, firmware, and/or distributed software executed upon some form of processor, processing core, or embedded logic, a virtual machine running on a processor or core, or otherwise implemented or realized upon or within a non-transitory computer-readable or machine-readable storage medium. A non-transitory computer-readable or machine-readable storage medium includes any mechanism for storing or transmitting information in a form readable by a machine (e.g., a computer). For example, a non-transitory computer-readable or machine-readable storage medium includes any mechanism that provides (e.g., stores and/or transmits) information in a form accessible by a computer or computing machine (e.g., computing device, electronic system, etc.), such as recordable/non-recordable media (e.g., read only memory (ROM), random access memory (RAM), magnetic disk storage media, optical storage media, flash memory devices, etc.). The content may be directly executable (“object” or “executable” form), source code, or difference code (“delta” or “patch” code). A non-transitory computer-readable or machine-readable storage medium may also include a storage or database from which content can be downloaded. The non-transitory computer-readable or machine-readable storage medium may also include a device or product having content stored thereon at a time of sale or delivery. Thus, delivering a device with stored content, or offering content for download over a communication medium, may be understood as providing an article of manufacture comprising a non-transitory computer-readable or machine-readable storage medium with such content described herein.
The operations and functions performed by various components described herein may be implemented by software running on a processing element, via embedded hardware or the like, or any combination of hardware and software. Such components may be implemented as software modules, hardware modules, special-purpose hardware (e.g., application specific hardware, ASICs, DSPs, etc.), embedded controllers, hardwired circuitry, hardware logic, etc. Software content (e.g., data, instructions, configuration information, etc.) may be provided via an article of manufacture including non-transitory computer-readable or machine-readable storage medium, which provides content that represents instructions that can be executed. The content may result in a computer performing various functions/operations described herein.
As used herein, a list of items joined by the term “at least one of” can mean any combination of the listed terms. For example, the phrase “at least one of A, B or C” can mean A; B; C; A and B; A and C; B and C; or A, B and C.
The above description of illustrated embodiments of the invention, including what is described in the Abstract, is not intended to be exhaustive or to limit the invention to the precise forms disclosed. While specific embodiments of, and examples for, the invention are described herein for illustrative purposes, various equivalent modifications are possible within the scope of the invention, as those skilled in the relevant art will recognize.
These modifications can be made to the invention in light of the above detailed description. The terms used in the following claims should not be construed to limit the invention to the specific embodiments disclosed in the specification and the drawings. Rather, the scope of the invention is to be determined entirely by the following claims, which are to be construed in accordance with established doctrines of claim interpretation.
This application claims the benefit of the filing date of U.S. Provisional Application No. 63/469,173, filed May 26, 2023, entitled “OPTIMIZATION TECHNIQUE FOR MODULAR MULTIPLICATION ALGORITHMS,” under 35 U.S.C. § 119(e). U.S. Provisional Application No. 63/469,173 is further incorporated herein in its entirety for all purposes.