The present invention relates to the field of processing units, and more particularly to a system and method for sharing a multiplier tree between a floating point unit and a cryptographic unit.
Present general purpose processing chips are not ideally suited to the task of public-key cryptographic operations. Accordingly, many computing systems include stand-alone cryptography units, which may be included on-chip along with general-purpose cores. However, cryptography units represent wasted space for those customers who do not need high cryptographic performance. Floating point units are often similarly included on-chip for performing specialized floating point processing. However, the set of customers who desire high cryptographic performance is typically disjoint from the set of customers who require high floating point performance.
Correspondingly, improvements in the integration of cryptographic and floating point units in processing systems would be desirable.
Various embodiments are presented of a system comprising a floating point unit and a cryptographic unit having a shared multiplier tree.
A device may include a multiplier tree, a floating point unit (FPU), and a cryptographic unit (CU). The device may also include a general purpose processing unit or processing core that utilizes the FPU and/or the CU. The FPU may be configured to perform floating point operations, and the CU may be configured to perform cryptographic operations. The FPU and the CU may share the multiplier tree.
The multiplier tree may include a feedback path and memory elements included in the feedback path. During the floating point operations of the FPU, the multiplier tree may be configured to perform multiply operations for the FPU. The feedback path and the memory elements may not be used when the FPU is performing floating point operations.
During cryptographic operations, the multiplier tree may be configured to perform multiply operations for the CU. The CU may be configured to use the feedback path and/or the memory elements in the multiplier tree during cryptographic operations. In one embodiment, the feedback path may be configured to provide data from a previous cycle to a current cycle. For example, the memory elements may be configured to save an upper portion (or other portion) of a multiplication result and provide that portion on the feedback path as an additive value for the lower portion (or other portion) of a subsequent multiply-add operation.
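As an illustration of this feedback behavior, the following C sketch models it in software (this is an analogy only, not the claimed hardware; the names are illustrative and the 128-bit type is a GCC/Clang extension):

/* Each 64x64 multiply adds in a carry word held from the previous step;
 * the carry word plays the role of the memory element on the feedback path. */
#include <stdint.h>
#include <stdio.h>

int main(void) {
    uint64_t x[4] = {0x1111111111111111ULL, 0x2222222222222222ULL,
                     0x3333333333333333ULL, 0x4444444444444444ULL};
    uint64_t y = 0xFEDCBA9876543210ULL;    /* single multiplier word */
    uint64_t result[5];
    uint64_t carry_word = 0;               /* models the feedback memory element */

    for (int i = 0; i < 4; i++) {
        unsigned __int128 p = (unsigned __int128)x[i] * y + carry_word;
        result[i]  = (uint64_t)p;          /* lower portion: one word of X*y */
        carry_word = (uint64_t)(p >> 64);  /* upper portion: fed back next cycle */
    }
    result[4] = carry_word;

    for (int i = 4; i >= 0; i--)
        printf("%016llx", (unsigned long long)result[i]);
    printf("\n");
    return 0;
}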
In some embodiments, the FPU and the CU may be configured to share the multiplier tree dynamically based on operations submitted for execution by the device. For example, in one embodiment, the FPU and the CU may be configured to share the multiplier tree on a per cycle basis, where the FPU may be configured to use the multiplier tree in a first cycle, and where the CU may be configured to use the multiplier tree in a next second cycle.
Alternatively, or additionally, the FPU and the CU may be configured to share the multiplier tree on a per thread basis, where the FPU may be configured to use the multiplier tree for instructions from a first thread, and where the CU may be configured to use the multiplier tree for instructions from a second thread.
In one embodiment, either the FPU or the CU may be configured to use the multiplier tree exclusively based on a configuration parameter. The configuration parameter may be determined at various times by various entities. For example, the configuration parameter may be determined by an operating system. In one embodiment, the configuration parameter may be determined during a boot up sequence of a computer comprising the device. Use of the multiplier tree by the FPU or the CU may also be assigned at other times or by other entities, as desired.
Accordingly, a method for performing operations in a processor system may include receiving a floating point instruction and correspondingly performing floating point operations in response to the floating point instruction. Performing floating point operations may include the multiplier tree performing multiply operations. The method may further include receiving a cryptographic instruction and correspondingly performing cryptographic operations in response to the cryptographic instruction.
As indicated above, the feedback path and memory elements may be used during cryptographic operations but may not be used during floating point operations. For example, performing cryptographic operations may include using the feedback path to provide data from a previous cycle to a current cycle. In a more specific example, performing cryptographic operations may include saving an upper portion of a multiplication result in one or more of the memory elements and providing that portion on the feedback path as an additive value for the lower portion of a subsequent multiply-add operation, although other embodiments are envisioned.
The method may further include reserving the multiplier tree for use during either performing floating point operations or performing cryptographic operations. Reserving the multiplier tree may be performed dynamically based on operations submitted for execution to the processor system. Alternatively, or additionally, the method may include reserving the multiplier tree for use during floating point operations in a first one or more cycles and reserving the multiplier tree for use during cryptographic operations in a next second one or more cycles.
In one embodiment, the floating point instruction(s) may be received from a first thread and the cryptographic instruction(s) may be received from a second thread. Accordingly, the method may further include performing floating point operations in response to future instructions from the first thread using the multiplier tree and performing cryptographic operations in response to future instructions from the second thread using the multiplier tree.
Finally, the method may include receiving a first configuration parameter assigning the multiplier tree for use during floating point operations and receiving a second configuration parameter assigning the multiplier tree for use during cryptographic operations.
A better understanding of the present invention can be obtained when the following detailed description of the preferred embodiment is considered in conjunction with the following drawings, in which:
While the invention is susceptible to various modifications and alternative forms, specific embodiments thereof are shown by way of example in the drawings and are herein described in detail. It should be understood, however, that the drawings and detailed description thereto are not intended to limit the invention to the particular form disclosed, but on the contrary, the intention is to cover all modifications, equivalents and alternatives falling within the spirit and scope of the present invention as defined by the appended claims.
The following references are hereby incorporated by reference in their entirety as though fully and completely set forth herein:
U.S. Publication No. 2004/0264693, titled “Method and Apparatus for Implementing Processor Instructions for Accelerating Public-Key Cryptography,” filed on Jul. 24, 2003 and published on Dec. 30, 2004.
The following is a glossary of terms used in the present application:
Memory Medium—Any of various types of memory devices or storage devices. The term “memory medium” is intended to include an installation medium, e.g., a CD-ROM, floppy disks 104, or tape device; a computer system memory or random access memory such as DRAM, DDR RAM, SRAM, EDO RAM, Rambus RAM, etc.; or a non-volatile memory such as magnetic media, e.g., a hard drive, or optical storage. The memory medium may comprise other types of memory as well, or combinations thereof. In addition, the memory medium may be located in a first computer in which the programs are executed, and/or may be located in a second different computer which connects to the first computer over a network, such as the Internet. In the latter instance, the second computer may provide program instructions to the first computer for execution. The term “memory medium” may include two or more memory mediums which may reside in different locations, e.g., in different computers that are connected over a network.
Computer System—any of various types of computing or processing systems, including a personal computer system (PC), mainframe computer system, workstation, network appliance, Internet appliance, personal digital assistant (PDA), television system, grid computing system, or other device or combinations of devices. In general, the term “computer system” can be broadly defined to encompass any device (or combination of devices) having at least one processor that executes instructions from a memory medium.
As shown in
As indicated above, the general processor 180 in the system 100 may use the FPU 120 and/or the CU 160 for more specific operations or instructions (e.g., floating point operations or cryptographic operations, respectively). As also indicated above, the processor 180 may include the FPU 120, the CU 160, and/or the multiplier tree 140, possibly within the same pipeline (e.g., the FMA multiplier pipeline). In some embodiments, the FPU 120 and/or the CU 160 may be coupled to the processor 180 as coprocessors internal or external to the processor 180. Furthermore, in some embodiments, the FPU 120, the multiplier tree 140, and/or the CU 160 may be associated with a single core (or with one or more cores) of a plurality of cores in the system 100. Other FPUs, CUs, and/or multiplier trees may be associated with each core (or with other cores) of the plurality of cores. In other words, in one embodiment, each processing core may have an associated FPU 120, CU 160, and multiplier tree 140 in the system 100.
In one embodiment, the CU 160 and/or the multiply tree 140 (e.g., the memory elements and feedback path of the multiply tree) may be protected from access by, or other interaction with, the FPU 120, the processor 180, and/or other elements. This may allow cryptographic information and operations to be handled more securely.
Note that the system 100 may further include other elements that are not shown, as desired. For example, the system 100 may include various memory mediums, registers, busses, caches, processors, cores, peripherals, timing devices, etc. In one embodiment, the system 100 may be a general purpose computer, such as a personal computer or server, a network device such as a router or switch, or a consumer electronic device (e.g., a mobile device, cell phone, personal digital assistant, portable media player, etc.) which requires processing of instructions, among other possible systems.
FIGS. 2A-3C—Exemplary Diagrams of the Multiplier Tree
h0=umulxhi x0, y0;
l0=mulx x0, y0;
h1=umulxhi x1, y0;
l1=mulx x1, y0;
. . .
h15=umulxhi x15, y0;
l15=mulx x15, y0;
r0=l0;
r1=addcc h0, l1; //set carryout bit
r2=addxccc h1, l2; //use, then set the carryout bit.
. . .
r15=addxccc h14, l15; //use, then set carryout bit
r16=addxc h15,0; //use carryout bit
Note that the upper 64 bits, for example, h0, of a 128-bit partial product x0*y0 may be manually propagated into the next partial product x1*y0 using an addcc instruction. That process is typically slow because the output is delayed by the multiplier latency, which may be, e.g., an 8-cycle latency in the case of an exemplary processor. The present invention provides a more efficient technique for handling the propagation of the upper 64 bits of a 128-bit product into a next operation.
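For illustration only, the following C sketch models what the sequence above computes in software: each word of X is multiplied by y0 twice (low half and high half), and the halves are then combined with a carry-propagating add chain. The function names are illustrative, not actual SPARC intrinsics, and the 128-bit type is a GCC/Clang extension.

/* Software model of the baseline mulx/umulxhi/addcc sequence shown above. */
#include <stdint.h>

static uint64_t mulx_lo(uint64_t a, uint64_t b) {           /* models mulx    */
    return (uint64_t)((unsigned __int128)a * b);
}
static uint64_t umulxhi_hi(uint64_t a, uint64_t b) {        /* models umulxhi */
    return (uint64_t)(((unsigned __int128)a * b) >> 64);
}
static uint64_t add_with_carry(uint64_t a, uint64_t b,
                               unsigned *carry) {            /* models addcc/addxccc */
    unsigned __int128 s = (unsigned __int128)a + b + *carry;
    *carry = (unsigned)(s >> 64);
    return (uint64_t)s;
}

/* r[0..16] = X[0..15] * y0, using 2*16 multiplies plus an add chain. */
void row_mul_baseline(const uint64_t x[16], uint64_t y0, uint64_t r[17]) {
    uint64_t h[16], l[16];
    for (int i = 0; i < 16; i++) {
        h[i] = umulxhi_hi(x[i], y0);
        l[i] = mulx_lo(x[i], y0);
    }
    unsigned c = 0;
    r[0] = l[0];
    for (int i = 1; i < 16; i++)
        r[i] = add_with_carry(h[i - 1], l[i], &c);
    r[16] = h[15] + c;                                       /* models addxc */
}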
In one embodiment, an unsigned multiplication using an extended carry register (the instruction umulxc, e.g., illustrated in
As shown in
As shown, result register rd (209) may receive the lower n bits [n−1:0] 201, i.e., the second portion 201 of the result 207, where the result 207 is rs1*rs2 plus the extended carry previously saved in the extended carry register (exc) 203. The upper n bits [2n−1:n] 205 of the result 207 (rs1*rs2+previous exc) may be stored in the extended carry register (exc) 203 for use in subsequent computations. The exc value, saved from the most significant n bits of the result of one operation, may thus be added into the least significant n bits of the next operation. Note that in the implementation illustrated in
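For illustration only, the umulxc behavior described above may be modeled in C as follows (the names are illustrative and the 128-bit type is a GCC/Clang extension). With such an instruction, a row X*y0 may be computed with one umulxc per word of X plus a final zero multiply to output the remaining extended carry, roughly halving the number of multiply instructions relative to the mulx/umulxhi sequence shown earlier.

/* umulxc model: rd receives the low n bits of rs1*rs2 + exc, and the high
 * n bits become the new extended carry.  Here n = 64. */
#include <stdint.h>

static uint64_t exc;                      /* models the extended carry register (exc) 203 */

static uint64_t umulxc_model(uint64_t rs1, uint64_t rs2) {
    unsigned __int128 t = (unsigned __int128)rs1 * rs2 + exc;
    exc = (uint64_t)(t >> 64);            /* upper n bits saved for the next operation */
    return (uint64_t)t;                   /* lower n bits written to rd */
}

/* Example use: r[0..16] = X[0..15] * y0, one umulxc per word of X plus a
 * final zero multiply that outputs the remaining extended carry. */
void row_mul_umulxc(const uint64_t x[16], uint64_t y0, uint64_t r[17]) {
    exc = 0;
    for (int i = 0; i < 16; i++)
        r[i] = umulxc_model(x[i], y0);
    r[16] = umulxc_model(0, 0);
}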
According to another embodiment, the multiply tree may implement a umulxck instruction, which may effectively combine both multiply and accumulate operations. In some embodiments, umulxck multiplies its first input register, rs1, by k and adds both the second input register, rs2, and the previous carry to produce both an integer result and a new carry out. That is, umulxck computes (rs1*k)+rs2+previous exc to produce both rd and a new exc. In addition to computing a row y0*X, the umulxck instruction also allows for accumulating an additional row S=(s15, . . . , s0) implicitly without requiring additional add (e.g., addxccc) operations. The umulxck instruction is illustrated in
As shown in
As shown in
In the embodiment illustrated in
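For illustration only, the umulxck behavior described above may be modeled in C as follows; how the multiplier k is supplied to the actual instruction is not specified here, so the model simply passes it as an argument (names are illustrative and the 128-bit type is a GCC/Clang extension).

/* umulxck model: rd = low 64 bits of (rs1*k) + rs2 + exc; the high 64 bits
 * become the new extended carry. */
#include <stdint.h>

static uint64_t exc;   /* models the extended carry / internal state */

static uint64_t umulxck_model(uint64_t rs1, uint64_t rs2, uint64_t k) {
    unsigned __int128 t = (unsigned __int128)rs1 * k + rs2 + exc;
    exc = (uint64_t)(t >> 64);
    return (uint64_t)t;
}

/* Example use: r[0..16] = y0*X + S, accumulating the additional row S with
 * no separate add (e.g., addxccc) operations. */
void row_muladd_umulxck(const uint64_t x[16], const uint64_t s[16],
                        uint64_t y0, uint64_t r[17]) {
    exc = 0;
    for (int i = 0; i < 16; i++)
        r[i] = umulxck_model(x[i], s[i], y0);
    r[16] = umulxck_model(0, 0, y0);   /* flush the remaining carry */
}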
Finally,
Multiple word multiplications are needed in public-key encryption systems such as the Rivest-Shamir-Adleman (RSA) public-key algorithm and the Diffie-Hellman (DH) key exchange schemes. These schemes require modular exponentiation with operands of at least 512 bits. Modular exponentiation is computed using a series of modular multiplications and squarings. A newly standardized public-key system, Elliptic Curve Cryptography (ECC), also uses large integer arithmetic, even though it requires smaller key sizes. Elliptic Curve public-key cryptographic systems operate in both integer and binary polynomial fields. A typical RSA operation requires a 1024 bit modular exponentiation (or two 512 bit modular exponentiations using the Chinese Remainder Theorem). RSA key sizes are expected to grow to 2048 bits in the near future. A 1024 bit modular exponentiation includes a sequence of large integer modular multiplications, each of which is in turn broken up into many word size multiplications. In total, a 1024 bit modular exponentiation requires over 1.6 million 64 bit multiplications. Thus, public-key algorithms are compute intensive with relatively few data movements.
In order to better support cryptographic applications, it is desirable to enhance the capability of general purpose processors to accelerate public-key computations. Moreover, any multiplication of multiple word values will benefit from this method, not just cryptographic applications.
The storage of integer values with more than 64 bits requires multiple computer words. The multiplication of such values is tedious. The SPARC opcodes provide some support. There are two 64 bit multiplication instructions, mulx and umulxhi. The mulx instruction multiplies two 64 bit values and returns the lower order 64 bits of the product. The umulxhi instruction multiplies two 64 bit values and returns the upper order 64 bits of the product. Thus, to multiply n 64 bit words by m 64 bit words requires n*m executions of the mulx instruction and also n*m executions of the umulxhi instruction. This produces 2*n*m 64 bit words that need to be added together. For example, consider n=4 and m=3. Represent the 4 word value as the 64 bit words D, C, B, and A, where D is the most significant 64 bits and A is the least significant 64 bits of the 256 bit value. Represent the 3 word value as the 64 bit words T, S, and R, where T is the most significant 64 bits and R is the least significant 64 bits of the 192 bit value. Represent the result of the mulx instruction of, say, A and R by ARl (l for lower) and the result of the umulxhi instruction of A and R by ARu (u for upper). The initial partial products for this multiplication are shown in
The lower order 64 bits, N, of the result is ARl. The next 64 bits, M, is the sum of BRl, ARu, and ASl. Then the next 64 bits, L, is the sum of CRl, BRu, BSl, ASu, and ATl plus the carry out from the sum of BRl, ARu, and ASl. As an aid in adding the carries from one column to the next, the addxccc instruction includes the xcc.c bit in an addition and sets the carry out bit xcc.c.
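For illustration only, the column sums described above may be checked with the following scaled-down C sketch, which uses 8-bit words in place of 64-bit words so that the entire seven-word result fits in a single uint64_t; mul_lo and mul_hi play the roles of mulx and umulxhi, and all names and operand values are illustrative.

#include <assert.h>
#include <stdint.h>

static uint8_t mul_lo(uint8_t a, uint8_t b) { return (uint8_t)(a * b); }
static uint8_t mul_hi(uint8_t a, uint8_t b) { return (uint8_t)((a * b) >> 8); }

int main(void) {
    uint8_t xs[4] = {0xF3, 0x51, 0xC2, 0x99};   /* A, B, C, D (A least significant) */
    uint8_t ys[3] = {0x7E, 0xA4, 0x3B};         /* R, S, T (R least significant)    */
    uint64_t result = 0;
    uint32_t carry = 0;

    /* Column sums as in the text: N = ARl; M = BRl + ARu + ASl; and so on,
     * with the carry out of each column added into the next (the role played
     * by addxccc and the xcc.c bit). Column k collects the "lower" products
     * with i+j == k and the "upper" products with i+j == k-1. */
    for (int k = 0; k < 7; k++) {               /* result words N, M, L, K, J, I, H */
        uint32_t col = carry;
        for (int i = 0; i < 4; i++)
            for (int j = 0; j < 3; j++) {
                if (i + j == k)     col += mul_lo(xs[i], ys[j]);
                if (i + j == k - 1) col += mul_hi(xs[i], ys[j]);
            }
        result |= (uint64_t)(col & 0xFF) << (8 * k);
        carry = col >> 8;
    }

    /* The assembled columns match the full product. */
    uint64_t xval = 0, yval = 0;
    for (int i = 3; i >= 0; i--) xval = (xval << 8) | xs[i];
    for (int j = 2; j >= 0; j--) yval = (yval << 8) | ys[j];
    assert(result == xval * yval);
    return 0;
}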
This section presents a hardware organization for a multiplier that enables multiple word multiplies to be carried out with greater efficiency. The number of multiplies is cut nearly in half, to n*(m+1)+1. No separate addition operations may be needed. The number of clock cycles to perform the multiple word multiply is n*(m+1) plus the pipeline latency of the multiplier. This is a speed-up by a factor of two to four. Furthermore, the amount of memory space required is reduced to only the input operands and the result location, as all intermediate partial products that need to be stored are contained within the result storage area.
A typical organization of multiply hardware may include the following pipeline stages:
1. Form the partial products and start the carry save adder (CSA), reducing the number of partial product terms.
2. Finish the CSA, further reducing the number of partial product terms to two.
3. Carry lookahead add (CLA) the two partial product terms to get the result.
Note that the result contains twice as many bits as each input. Thus, the output is either the lower or upper half of the result, but not both.
The pipeline stages of the multiple word multiply organization contain the following (see
1. Form the partial products and start the CSA, reducing the number of partial product terms. A third input is included as an additional partial product term.
2. Finish the CSA, further reducing the number of partial product terms to two. This stage also includes as input into the lower half two more partial product terms that are the upper half of the resulting two partial product terms from the previous multiple word multiply opcode.
3. Carry lookahead add the lower half of the two partial product terms plus the carry in, which is the carry out from the addition in the previous multiple word multiply opcode. The output is the result of this addition, without the carry out.
The feedbacks to the compressors and adder may only occur during the opcode for multiple word multiplies. At all other times, the values may be held and zeros are fed back. This may allow for other operations and interrupts to take place interspersed within the computation.
For the four word by three word example, the instruction sequence is shown below. Note that the result is placed in locations N, M, L, K, J, I, and H where N contains the least significant 64 bits and H contains the most significant 64 bits.
(any*zero)+zero→discard
R*A+zero→N
R*B+zero→M
R*C+zero→L
R*D+zero→K
R*zero+zero→J
S*A+M→M
S*B+L→L
S*C+K→K
S*D+J→J
S*zero+zero→I
T*A+L→L
T*B+K→K
T*C+J→J
T*D+I→I
T*zero+zero→H
The above sequence is repeated here with the intermediate values shown:
?*0+input 0+internal unknown→internal 0, output (discard=unknown)
R*A+input 0+internal 0→internal ARu, output (N=ARl)
R*B+input 0+internal ARu→internal BRu, output (M=BRl+ARu)
R*C+input 0+internal BRu→internal CRu, output (L=CRl+BRu)
R*D+input 0+internal CRu→internal DRu, output (K=DRl+CRu)
R*0+input 0+internal DRu→internal 0, output (J=DRu)
S*A+input (M=BRl+ARu)+internal 0→internal ASu, output (M=BRl+ARu+ASl)
S*B+input (L=CRl+BRu)+internal ASu→internal BSu, output (L=CRl+BRu+BSl+ASu)
S*C+input (K=DRl+CRu)+internal BSu→internal CSu, output (K=DRl+CRu+CSl+BSu)
S*D+input (J=DRu)+internal CSu→internal DSu, output (J=DRu+DSl+CSu)
S*0+input 0+internal DSu→internal 0, output (I=DSu)
T*A+input (L=CRl+BRu+BSl+ASu)+internal 0→internal ATu, output (L=CRl+BRu+BSl+ASu+ATl)
T*B+input (K=DRl+CRu+CSl+BSu)+internal ATu→internal BTu, output (K=DRl+CRu+CSl+BSu+BTl+ATu)
T*C+input (J=DRu+DSl+CSu)+internal BTu→internal CTu, output (J=DRu+DSl+CSu+CTl+BTu)
T*D+input (I=DSu)+internal CTu→internal DTu, output (I=DSu+DTl+CTu)
T*0+input 0+internal DTu→internal 0, output (H=DTu)
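For illustration only, the following C sketch is an arithmetically equivalent software model of the sequence above; the internal feedback (which the hardware holds in redundant form across the two partial product terms and a carry) is modeled as a single word equal to the upper half of the previous result, the 128-bit type is a GCC/Clang extension, and the operand values are arbitrary examples.

#include <assert.h>
#include <stdint.h>

static uint64_t internal;   /* models the fed-back upper half */

static uint64_t mwm(uint64_t a, uint64_t b, uint64_t add_in) {
    unsigned __int128 t = (unsigned __int128)a * b + add_in + internal;
    internal = (uint64_t)(t >> 64);   /* fed back into the next operation */
    return (uint64_t)t;               /* written to the destination word */
}

int main(void) {
    /* Four-word multiplicand (A least significant) and three-word multiplier. */
    uint64_t A = 0x0123456789ABCDEFULL, B = 0xFEDCBA9876543210ULL,
             C = 0x0F0F0F0F0F0F0F0FULL, D = 0xDEADBEEFDEADBEEFULL;
    uint64_t R = 0x1111111111111111ULL, S = 0x2222222222222222ULL,
             T = 0x3333333333333333ULL;
    uint64_t N, M, L, K, J, I, H;

    (void)mwm(0, 0, 0);            /* (any*zero)+zero -> discard; clears internal */
    N = mwm(R, A, 0);  M = mwm(R, B, 0);  L = mwm(R, C, 0);
    K = mwm(R, D, 0);  J = mwm(R, 0, 0);
    M = mwm(S, A, M);  L = mwm(S, B, L);  K = mwm(S, C, K);
    J = mwm(S, D, J);  I = mwm(S, 0, 0);
    L = mwm(T, A, L);  K = mwm(T, B, K);  J = mwm(T, C, J);
    I = mwm(T, D, I);  H = mwm(T, 0, 0);

    /* Cross-check against a straightforward word-by-word multiply. */
    uint64_t x[4] = {A, B, C, D}, y[3] = {R, S, T}, ref[7] = {0};
    for (int j = 0; j < 3; j++) {
        uint64_t carry = 0;
        for (int i = 0; i < 4; i++) {
            unsigned __int128 t = (unsigned __int128)x[i] * y[j] + ref[i + j] + carry;
            ref[i + j] = (uint64_t)t;
            carry = (uint64_t)(t >> 64);
        }
        ref[j + 4] += carry;
    }
    uint64_t out[7] = {N, M, L, K, J, I, H};
    for (int i = 0; i < 7; i++) assert(out[i] == ref[i]);
    return 0;
}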
Consider a k by k multiply. The largest value each input can be is 2^k−1. If we add to this two additional k bit values, then the largest value that can result is:
(2^k−1)^2+2(2^k−1)=2^(2k)−2(2^k)+1+2(2^k)−2=2^(2k)−1
and that is also the largest value that can be in a 2k bit result. So, to the product of two k bit values, two more k bit values may be added and the result still fits in the 2k bit result. Thus, the carry out of the CSA and the carry out of the 4 to 2 compressors are both zero. If one of the k bit values that is added to the product is the most significant half (e.g., the upper) k bits of the previous such operation, then even though this value may be contained in two k bit registers (as shown in
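For illustration only, the identity above can be checked numerically for word sizes k small enough that 2k bits fit in a native 64-bit integer:

#include <assert.h>
#include <stdint.h>

int main(void) {
    for (unsigned k = 1; k <= 32; k++) {
        uint64_t max_k  = (UINT64_C(1) << k) - 1;                  /* 2^k - 1    */
        uint64_t max_2k = (k < 32) ? (UINT64_C(1) << (2 * k)) - 1  /* 2^(2k) - 1 */
                                   : UINT64_MAX;
        /* (2^k - 1)^2 + 2(2^k - 1) = 2^(2k) - 1 */
        assert(max_k * max_k + 2 * max_k == max_2k);
    }
    return 0;
}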
When Booth encoding is used to form the partial products for the CSA, the carry out of the CSA and/or the 4 to 2 compressors may not be zero. This is because Booth encoding uses negative multipliers. Booth encoding considers bits of the multiplier in pairs instead of one at a time. Zero, one, or two times the multiplicand is obtained with a mux, but three times the multiplicand cannot be obtained quickly. So instead, we use 3=4−1, with the value of 4 saved for the next pair and −1 used with this pair. Negative numbers (in twos complement form) have an infinite number of one bits going off to the left (the most significant bit positions). If everything is added up, a carry will propagate through this infinite row of ones, setting them all to zero. However, in the CSA and the 4 to 2 compressors, the summation is not yet complete, so the propagation may or may not have reached the carry out position. So, when the two values are fed back, if both carry outs are zero, then the carry that removes the leading ones has not yet reached the carry out position, and so k ones may need to be concatenated to the left of one of the terms being fed back. However, if either carry out is one, then the carry that removed the leading ones has reached the carry out position, and so k zeros may need to be concatenated to the left of one of the terms being fed back.
When there is a change of context, it may be necessary to save the current internal value (for the context that is being suspended) and restore a previously saved internal value (for the context that is being resumed). The current internal value may be obtained by executing the multiple word multiply opcode with zero times zero, plus zero. The current internal value is then the output of the operation, and that value can be saved just as the register values are saved when there is a change of context. To restore an internal value that has been previously saved, the multiple word multiply operation may again be used. Let V be the saved value that is to be restored to the internal state. The multiple word multiply opcode may be executed with V and 2^k−1 as the multiply operands and V as the additive input, computing V*(2^k−1)+V (note that 2^k−1 is the value with all the bits turned on). Because V*(2^k−1)+V=V*2^k, the upper k bits of the result equal V, which becomes the new internal value, while the lower k bits of the result equal the previous internal value, which is output. Thus, the saved value V may be restored and the current internal value may be obtained at the same time with just one execution of the multiple word multiply operation. If the internal state is only accessible by software (e.g., supervisor or hypervisor) that may not be subject to context switching, then saving and restoring the internal state may not be necessary.
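For illustration only, the save/restore technique above may be modeled with the same multiply-add-with-feedback sketch used earlier; here k=64, so 2^k−1 is UINT64_MAX, and the state values are arbitrary examples.

#include <assert.h>
#include <stdint.h>

static uint64_t internal;                       /* fed-back upper half, as before */

static uint64_t mwm(uint64_t a, uint64_t b, uint64_t add_in) {
    unsigned __int128 t = (unsigned __int128)a * b + add_in + internal;
    internal = (uint64_t)(t >> 64);
    return (uint64_t)t;
}

int main(void) {
    internal = 0xAAAAAAAAAAAAAAAAULL;           /* pretend this is mid-computation state */
    uint64_t V = 0x1234567812345678ULL;         /* previously saved state to restore     */

    uint64_t old_state = mwm(V, UINT64_MAX, V); /* V*(2^64 - 1) + V = V*2^64             */
    assert(old_state == 0xAAAAAAAAAAAAAAAAULL); /* old internal value is output (saved)  */
    assert(internal  == V);                     /* V becomes the new internal value      */
    return 0;
}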
If the option of an integer multiply-add (without internal feedback) is desired, then it can be implemented in the same way as the multiple word multiply operation, except that the internal feedback is turned off, as it is for all other instructions. Note that these instructions may not use the internal state and can be freely intermixed with those that do.
Thus,
In 400, a first instruction may be received by a processing system and may be appropriately routed. In some embodiments, the first instruction may be received from an operating system or hypervisor, thread, or other sources.
In some embodiments, the first instruction may be identified (e.g., by an opcode or other label) as a floating point or cryptographic instruction. Accordingly, the first instruction may be routed for execution using floating point operations (e.g., by an FPU, such as the FPU 120 described above) in 402 or cryptographic operations (e.g., by a CU, such as the CU 160 described above) in 406. In one embodiment, a multiply tree (such as the multiply tree 140 described above) may be reserved for the floating point operations or cryptographic operations, respectively. Thus, in one example, the nature of the first instruction may be determined, and, if the first instruction is a floating point instruction, the multiplier tree may be reserved for the FPU, and the first instruction may be executed using the FPU and the multiplier tree. Similarly, the multiplier tree may be reserved for the CU if the first instruction is a cryptographic instruction and requires multiplication. Thus, instructions may be routed on an instruction-by-instruction or cycle-by-cycle basis.
In another embodiment, the first instruction may be routed on a thread by thread basis. For example, if the first instruction is received from a first thread that is associated (or has been previously associated) with floating point operations, the first instruction may be routed and/or labeled (e.g., by an opcode or other labeling method) for floating point operations in 402 (e.g., to an FPU, such as the FPU 120 described above). Alternatively, if the first instruction is received from a second thread that is associated (or has been previously associated) with cryptographic operations, the first instruction may be routed and/or labeled for cryptographic operations in 406 (e.g., to a CU, such as the CU 160 described above). Thus, in one embodiment, the instructions may be routed to various processing units on a thread basis, where instructions from a first thread are routed to the FPU and instructions from a second thread are routed to the CU. The multiplier tree may be reserved for the FPU or the CU accordingly.
Routing of the first instruction may be determined according to one or more parameters, as desired. For example, in one embodiment, a parameter may be set which determines whether the multiply tree is reserved for the FPU or the CU. Correspondingly, the first instruction may be routed according to the setting of the parameter. For example, if the parameter indicates that the multiplier tree is reserved for use by the FPU, no instructions (or possibly no instructions requiring the multiply tree) may be assigned to the CU. Similarly, if the parameter indicates that the multiplier tree is reserved for use by the CU, no instructions (or possibly no instructions requiring the multiply tree) may be assigned to the FPU. In such embodiments, instructions that would have been destined for the FPU or CU may instead be executed by another processor, such as a general processor.
In some embodiments, the parameter may be assigned or determined at various times. For example, the parameter may be assigned during initial set up of a processing system including the FPU, CU, and multiplier tree (e.g., a computer or other processing device), during boot up of the system, at various time intervals during operation of the system, by the operating system of the system (e.g., on a thread by thread basis, cycle basis, instruction basis, or otherwise), and/or during other times.
In one embodiment, a first parameter may be received assigning the multiplier tree for use during floating point operations (e.g., by the FPU), and subsequently a second parameter may be received assigning the multiplier tree for use during cryptographic operations (e.g., by the CU). Thus, in one embodiment, the first parameter may indicate that the FPU use the multiply tree for a first time period, and after the second parameter is received, the CU may use the multiply tree for a second time period. The first parameter may be received according to any of the times described above, and similarly, the second parameter may be received according to any subsequent time described above. Note that receiving the first parameter and receiving the second parameter may refer to receiving the same parameter, but with different values, or simply overwriting the value of an existing parameter stored in memory, among other possibilities. Thus, sharing or reserving of the multiplier tree (and correspondingly, routing of instructions) may be determined according to the parameter.
In 402, a floating point instruction may be received. In some embodiments, the floating point instruction may be received by the FPU. Additionally, the floating point instruction may be transmitted by a processor or general processing core (e.g., of a computer). In one embodiment, the floating point instruction may be provided from an operating system or hypervisor of a computer, an execution thread, or others, e.g., according to the reception and routing described in 400.
In 404, floating point operations may be performed in response to the floating point instruction. The floating point operations may be performed by the FPU. The floating point operations (or at least a portion of them) may be performed using a multiply tree (e.g., the multiply tree 140 described above). In other words, the multiply tree may perform multiply operations for the FPU. As noted above, the multiply tree may include a feedback path and memory elements (e.g., for storing previous results); however, during floating point operations involving the multiplier tree, the feedback path and the memory elements may not be used. It should also be noted that there may be instructions that are executed by the FPU but do not necessarily use the multiply tree.
In 406, a cryptographic instruction may be received. The cryptographic instruction may be received by the CU. Additionally, similar to above, the cryptographic instruction may be transmitted by a processor or processing core, an operating system or hypervisor, etc., e.g., according to the reception and routing described in 400.
In 408, cryptographic operations may be performed in response to the cryptographic instruction. The cryptographic operations may be performed by the CU. At least a portion of the cryptographic operations may be performed using the multiplier tree.
During cryptographic operations involving the multiplier tree, the feedback path and the memory elements may be used. More specifically, performing the cryptographic operations may include using the feedback path to provide data from a previous cycle to a current cycle, e.g., using the memory elements. For example, the memory elements may save a previous result of an immediately preceding operation or cycle (which may not use a holding flop memory element), or may save a previous result of a cycle before the immediately preceding operation or cycle (e.g., using a holding flop memory element). The memory elements may be any of a variety of memory elements, such as, for example, flip flops (e.g., one bit flip flops, holding flip flops, etc.), registers, etc.
In one embodiment, the upper portion of a multiplication result may be stored in one or more of the memory elements and provided on the feedback path as an additive value for a subsequent multiply-add operation. However, it should be noted that there may be cryptographic instructions that are executed by the CU but do not necessarily use the multiply tree.
Cryptographic operations, the feedback path, the memory elements, and/or the entirety of the CU may be protected from other elements of the processing system, e.g., the general processor, the FPU, and/or others. For example, the values stored in the memory elements may not be accessible by other elements in the processing system. In some embodiments, this may allow for higher security in the cryptographic operations.
Although the embodiments above have been described in considerable detail, numerous variations and modifications will become apparent to those skilled in the art once the above disclosure is fully appreciated. It is intended that the following claims be interpreted to embrace all such variations and modifications.