The present disclosure describes a system and method for carry/borrow handling.
Some processors may be capable of performing extensive mathematical operations on vectors of arbitrary length. For example, conventional processors may be able to perform arithmetic logic unit (ALU) operations such as add, subtract, multiply, divide, shift, etc. Many of these operations may produce a carry out of one register and/or a borrow from another. As large amounts of data continue to be processed at an increasing rate, efficient arithmetic/shift operations and carry-handling techniques may be required.
Features and advantages of the claimed subject matter will be apparent from the following detailed description of embodiments consistent therewith, which description should be considered with reference to the accompanying drawings, wherein:
Although the following Detailed Description will proceed with reference being made to illustrative embodiments, many alternatives, modifications, and variations thereof will be apparent to those skilled in the art.
Some arithmetic operations may require the addition of multiple vectors. In these instances, a carry generated in one location may need to propagate an arbitrary distance into the high part of the result, although such long propagation may be very unlikely when using a random bit pattern. Using conventional techniques, the carry correction operations required to solve this problem may require an excessive amount of microprogram code.
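The carry propagation described above can be illustrated with a minimal sketch, assuming 64-bit words stored least-significant word first. The names `WORD_BITS`, `WORD_MASK`, and `add_vectors` are hypothetical and do not appear in the disclosure. When only a carry is added into a word, the carry continues past that word only if the word is all ones, which for a uniformly random 64-bit word happens once in 2^64 cases.

```python
WORD_BITS = 64                      # assumed word width
WORD_MASK = (1 << WORD_BITS) - 1    # all-ones 64-bit word


def add_vectors(a, b):
    """Add two equal-length word vectors (least-significant word first).

    Returns (sum_words, carry_out), where carry_out is 0 or 1.
    """
    if len(a) != len(b):
        raise ValueError("vectors must have the same number of words")
    result = []
    carry = 0
    for x, y in zip(a, b):
        total = x + y + carry
        result.append(total & WORD_MASK)   # keep the low 64 bits
        carry = total >> WORD_BITS         # 1 if the word overflowed
    return result, carry
```

For example, adding `[1, 0]` to `[WORD_MASK, 0]` produces a carry out of the low word that is absorbed by the second word, whereas adding `[1, 0]` to `[WORD_MASK, WORD_MASK]` carries out of the whole vector.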
Generally, the embodiments described herein provide a more efficient approach for carry/borrow handling in a processor. In accordance with the present disclosure, carry and borrow handling may be performed using a single, separate subroutine (i.e., an independent subsection of microcode), which may be conditional upon the status of a boolean condition flag. The embodiments described herein may provide an efficient implementation (i.e., having little or no delay) in situations where the branch to the conditional subroutine is selected, as well as in situations where the branch is not selected. This conditional subroutine may act as a delayed branch on a carry/borrow flag from a previous instruction to enable the branch to hide in the shadow of a later executed instruction.
MMP 100 may also include a number of memory devices, such as first data RAM 102 and second data RAM 104. Data RAMs 102 and 104 may be capable of receiving data from a First-In, First-Out (FIFO) unit 110 and may store operands before they are sent to other components associated with MMP 100, such as ALU 106. ALU 106 may be a general purpose ALU such as a standard 64-bit ALU found in some general purpose processors. ALU 106 may also include condition codes that may allow a programmer to compare results, use conditional branches, etc. The output of ALU 106 may be provided to other components within MMP 100 such as shift circuitry 108. Shift circuitry 108 may be configured to perform shifting operations using either a right-to-left or a left-to-right mode. In some embodiments, shift circuitry 108 may be configured to perform a conditional shift. For example, a conditional right shift may be used depending upon whether the output from ALU 106 is odd (i.e., do not shift) or even (i.e., shift), which may be useful for speeding up certain operations (e.g., greatest common divisor algorithms). As mentioned above, MMP 100 may include a number of FIFO units, such as input FIFO 110 and output FIFO 112. Input FIFO 110 may be configured to transmit data through multiplexers (MUXes) 124 a, b (e.g., 2-to-1 MUXes) to Data RAMs 102 and 104 for storage. Moreover, some data may be transmitted through MUXes 124 c, d to ALU 106 as necessary. MMP 100 may execute a variety of different instructions depending upon the source and destination operand configuration. Some instructions may write the contents of input FIFO 110 directly into Data RAMs 102, 104. Additionally or alternatively, some instructions may allow for the contents of input FIFO 110 to be written directly into ALU 106.
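The usefulness of a conditional right shift for greatest common divisor algorithms can be sketched in software as follows. This is only an illustrative model, not the behavior of shift circuitry 108 itself. The function names `conditional_right_shift`, `make_odd`, and `binary_gcd` are hypothetical, and the binary (Stein) GCD algorithm is chosen here as one well-known algorithm that repeatedly halves even values.

```python
def conditional_right_shift(value):
    """Shift right by one bit only if value is even; odd values pass through."""
    return value >> 1 if value & 1 == 0 else value


def make_odd(value):
    """Remove all factors of two from a nonzero value."""
    while value & 1 == 0:
        value = conditional_right_shift(value)
    return value


def binary_gcd(a, b):
    """Greatest common divisor of two non-negative integers (binary GCD)."""
    if a == 0:
        return b
    if b == 0:
        return a
    common_twos = 0
    while (a | b) & 1 == 0:      # both even: factor out a shared power of two
        a >>= 1
        b >>= 1
        common_twos += 1
    a = make_odd(a)
    while b != 0:
        b = make_odd(b)          # even intermediate results are halved
        if a > b:
            a, b = b, a
        b -= a                   # odd minus odd is even (or zero)
    return a << common_twos
```

Each subtraction of two odd values yields an even value, so a shifter that halves even results and leaves odd results untouched fits naturally into the loop.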
MMP 100 may further include windowing circuitry 114, which may be configured to calculate variable windows used in certain operations, such as those required for modular exponentiation. Windowing circuitry 114 may be configured to perform sliding or fixed exponent windowing, depending upon the selected mode. Windowing circuitry 114 may calculate windows on long exponents for the purpose of reducing the number of multiplications required in modular exponentiation. In some embodiments, the exponent may be treated as a binary string and the bits may be scanned in either a left-to-right or right-to-left orientation. Modular exponentiation operations may scan the bits of the exponent to determine the next group (i.e., window) to be multiplied as the window slides from left to right.
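How windowing reduces the number of multiplications can be sketched with a fixed-window, left-to-right modular exponentiation. This is a software illustration under stated assumptions, not a description of windowing circuitry 114: the function name `fixed_window_modexp`, the default window size of 4 bits, and the precomputed table are all hypothetical choices, and only the fixed mode (not the sliding mode) is shown.

```python
def fixed_window_modexp(base, exponent, modulus, window_bits=4):
    """Compute (base ** exponent) % modulus using fixed-size exponent windows.

    Assumes exponent >= 0 and modulus >= 1.
    """
    if modulus == 1:
        return 0
    # Precompute base**i mod modulus for every possible window value.
    table = [1] * (1 << window_bits)
    for i in range(1, len(table)):
        table[i] = (table[i - 1] * base) % modulus

    window_mask = (1 << window_bits) - 1
    bit_length = max(exponent.bit_length(), 1)
    num_windows = (bit_length + window_bits - 1) // window_bits

    result = 1
    # Scan the exponent from the most significant window to the least.
    for w in range(num_windows - 1, -1, -1):
        for _ in range(window_bits):
            result = (result * result) % modulus   # one squaring per bit
        window = (exponent >> (w * window_bits)) & window_mask
        if window:
            result = (result * table[window]) % modulus  # one multiply per window
    return result
```

With a 4-bit window, at most one table multiplication is needed per four exponent bits, compared with up to one per bit for plain square-and-multiply, at the cost of precomputing the table.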
MMP 100 may include additional components, such as control circuitry 118, which may be configured to direct the operation of MMP 100. Control circuitry 118 may be in communication with numerous components within MMP 100, including, but not limited to, windowing circuitry 114, variable RAM 116, control store memory 120, global variables 122 and multiplexers (MUXes) 124 (a-f). Variable RAM 116 may be in communication with Data RAMs 102, 104 and may be used to make references to variables in first and second data RAMs 102, 104. Variable RAM 116 may be divided into a number of different scopes or modules, which may define where a given variable is defined as well as where it may be used. Control store memory 120 may be configured to contain the microprogram code for the CPU, which may be accessed and controlled via control circuitry 118. Global variables 122 may be accessed from anywhere in the program via control circuitry 118. MMP 100 may further include a temporary register 126 capable of storing carry/borrow data as well as decode circuitry 128, which will each be discussed in greater detail below.
In one exemplary embodiment, MMP 100 may be configured to perform efficient carry/borrow handling when adding or subtracting vectors of unequal length by calling a single separate carry/borrow handling subroutine. For example, when adding two vectors of unequal length, a carry generated in one location may need to propagate an arbitrary distance into the higher part of the result. This conditional subroutine call may act like a delayed branch and may evaluate the carry/borrow flag from a next-to-last instruction from MMP 100. The MMP instructions may be used, inter alia, to perform vector addition and subtraction in MMP 100. Since the subroutine may be called infrequently, the branch to the subroutine may have a very fast execution path when it is not selected. In some embodiments, this conditional subroutine may be activated using a state bit that may be set via the next-to-last MMP instruction, which may also indicate whether an addition or subtraction operation has occurred. Thus, a single conditional subroutine call may be used for both carry and borrow correction code.
When the conditional subroutine is selected, global variables 122 (e.g., global variable G0) may hold a pointer to the location in data RAMs 102 and/or 104 where the carry/borrow propagation needs to resume. The program may continue from this point in the subroutine by setting up a local variable using global variable G0. This pointer may correspond to the next-to-last MMP instruction's destination address for ALU write-back incremented by 1 (i.e., representing a resume address where carry/borrow would need to be propagated).
In some embodiments, an instruction (e.g., previous MMP instruction (N)) requiring a mathematical operation may be sent to ALU 206. For example, such an instruction may be configured to execute A=A+B as shown in
An additional example of carry/borrow handling in accordance with an exemplary embodiment of the present disclosure is provided below. This embodiment provides a method for adding two vectors of unequal length using a reference count of the smaller vector incremented by one. Assume a first instruction that adds B[7:0] to A[7:0] with a reference count (RC)+1 (i.e., 8+1=9, giving a 9-word vector) with the last word of B zeroed. The last word of B may be added to the corresponding word of the larger vector. Any additional carry propagation has a very low probability, as it may require all 64 bits of that word in the larger vector to be 1. Upon the execution of the first instruction, carry/borrow flag 250 and the pointer address may be stored in temporary register 126 or 226. Temporary register 126 may be a 10-bit register, which may include carry/borrow flag 250, state bit 260 and the next pointer address indicating the next ALU write-back location for a previous MMP instruction. State bit 260 may include data indicating whether the previous MMP instruction was an addition or subtraction operation. In some embodiments, the addresses may be real physical addresses to first data RAM (A) 102 and second data RAM (B) 104. This first instruction may run for a number of cycles (e.g., 9) and may have the carry/borrow flag 250 set. This may imply that the correct result is in A[8:0]; however, carry flag 250 may need to be added to the sub-vector A[15:9].
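The contents of the 10-bit temporary register can be modeled with a short sketch. The bit layout below is an assumption made purely for illustration: the disclosure states only that the register may hold the carry/borrow flag, the state bit and the next pointer address, so the field positions, the inferred 8-bit pointer width, the encoding of the state bit (1 for subtraction), and the names `pack_temp_register` and `unpack_temp_register` are all hypothetical.

```python
CARRY_BIT = 9        # assumed position of the carry/borrow flag
STATE_BIT = 8        # assumed position of the add/subtract state bit
POINTER_MASK = 0xFF  # assumed 8-bit field for the next write-back pointer


def pack_temp_register(carry_borrow, is_subtract, next_pointer):
    """Pack the three fields into a 10-bit value (assumed layout)."""
    if not 0 <= next_pointer <= POINTER_MASK:
        raise ValueError("pointer does not fit in the assumed 8-bit field")
    return (
        (int(bool(carry_borrow)) << CARRY_BIT)
        | (int(bool(is_subtract)) << STATE_BIT)
        | next_pointer
    )


def unpack_temp_register(value):
    """Return (carry_borrow, is_subtract, next_pointer) from a packed value."""
    carry_borrow = bool((value >> CARRY_BIT) & 1)
    is_subtract = bool((value >> STATE_BIT) & 1)
    next_pointer = value & POINTER_MASK
    return carry_borrow, is_subtract, next_pointer
```

Under this assumed layout, the 9-word addition in the example, which ends with the carry flag set and a resume pointer of 9, would be recorded as `pack_temp_register(True, False, 9)`, i.e. `0x209`.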
A second instruction may be executed before the conditional subroutine for carry/borrow correction may be issued. At the start of the second instruction, the addition carry flag, state bit and pointer value (9) may be saved to temporary register 126 and a number of cycles may be executed (e.g., 16). The state bit may indicate that the first instruction was an addition operation (i.e., the hardware knows the opcode of the first instruction was an addition operation). The conditional subroutine may begin executing once the second instruction is in progress. However, the conditional subroutine may need to delay for a couple of cycles until the second instruction has finished updating temporary register 126. If the preserved carry-bit=1, the carry/borrow subroutine may begin at the target address since the first instruction was an addition operation.
In some embodiments, prior to the execution of the first instruction of the conditional subroutine, MMP 100 may copy the pointer from temporary register 126 into global variable 122 (e.g., G0). At the start of the subroutine, the program may initialize a local variable (e.g., within first or second data RAM 102, 104) relative to global variable G0 and may then begin adding with carry in a loop until no carry remains.
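The addition example can be traced in software with the following minimal sketch, assuming 64-bit words stored least-significant word first. It models only the arithmetic, not the instruction timing or the delayed branch. The names `add_short_into_long` and `carry_correction` are hypothetical, and the Python list stands in for the data RAM holding vector A.

```python
WORD_BITS = 64
WORD_MASK = (1 << WORD_BITS) - 1


def add_short_into_long(a, b):
    """Add b into the low words of a, in place, over len(b) + 1 words (RC+1).

    The extra word of b is treated as zero. Returns (carry_flag, resume_pointer),
    where resume_pointer is the last written index incremented by 1.
    """
    count = len(b) + 1
    if count > len(a):
        raise ValueError("a must be at least one word longer than b")
    padded_b = list(b) + [0]
    carry = 0
    for i in range(count):
        total = a[i] + padded_b[i] + carry
        a[i] = total & WORD_MASK
        carry = total >> WORD_BITS
    return carry, count


def carry_correction(a, resume_pointer):
    """Add a pending carry into a, starting at resume_pointer, until none remains.

    Returns the carry out of the top word (0 unless the whole vector overflowed).
    """
    carry = 1
    i = resume_pointer
    while carry and i < len(a):
        total = a[i] + carry
        a[i] = total & WORD_MASK
        carry = total >> WORD_BITS
        i += 1
    return carry


# Worked case: A[15:0] has ten all-ones low words, so the carry survives word 8.
A = [WORD_MASK] * 10 + [5] + [0] * 5
B = [1] + [0] * 7
flag, pointer = add_short_into_long(A, B)   # 9 words processed
if flag:                                    # correction runs only when flagged
    carry_correction(A, pointer)            # resumes at word 9
```

In this deliberately unlikely case the first pass leaves the flag set with a resume pointer of 9, and the correction loop touches only words 9 and 10 before the carry is absorbed. In the common case the flag is clear and the correction code is never entered.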
In some embodiments, during a subtraction operation, the program may need to perform a computation of the form A[15:0]-B[7:0]. As expected, MMP 100 may need to store the subtraction carry/borrow flag of temporary register 126 at the beginning of the second instruction (e.g., in data RAMs 102 and/or 104). The pointer address may refer to the last address that was written from ALU 106 by the vector operation, incremented by 1. This may provide the next word requiring carry/borrow correction.
The MMP instructions may copy the carry/borrow flag, state bit and the next ALU write-back pointer into temporary register 126 at the beginning of a vector operation. Once the carry/borrow flag is set in temporary register 126 the branch to subroutine may be taken. If the branch to subroutine is taken, the pointer from temporary register 126 may be copied into global variables 122 (e.g., G0).
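The overall flow, in which one conditional subroutine serves both carry and borrow correction, can be sketched as follows. This is a software model under stated assumptions rather than the microcode itself: the dictionary standing in for temporary register 126, the variable `g0` standing in for global variable G0, and the function names are all hypothetical, and words are assumed to be 64 bits wide and stored least-significant word first.

```python
WORD_BITS = 64
WORD_MASK = (1 << WORD_BITS) - 1


def vector_op(a, b, subtract):
    """Apply a[i] += b[i] or a[i] -= b[i], in place, over len(b) + 1 words.

    Returns a dict modeling the saved flag, state bit and resume pointer.
    """
    count = len(b) + 1
    if count > len(a):
        raise ValueError("a must be at least one word longer than b")
    padded_b = list(b) + [0]
    flag = 0
    for i in range(count):
        if subtract:
            total = a[i] - padded_b[i] - flag
        else:
            total = a[i] + padded_b[i] + flag
        a[i] = total & WORD_MASK                      # wrap to 64 bits
        flag = 1 if (total < 0 or total > WORD_MASK) else 0
    return {"flag": flag, "subtract": subtract, "pointer": count}


def carry_borrow_subroutine(a, g0, subtract):
    """Single correction routine: propagate a carry or a borrow from index g0."""
    flag = 1
    i = g0
    while flag and i < len(a):
        total = a[i] - 1 if subtract else a[i] + 1
        a[i] = total & WORD_MASK
        flag = 1 if (total < 0 or total > WORD_MASK) else 0
        i += 1
    return flag


def run_with_correction(a, b, subtract):
    """Run the vector operation, then call the subroutine only if flagged."""
    temp_register = vector_op(a, b, subtract)
    if temp_register["flag"]:
        g0 = temp_register["pointer"]     # pointer copied on a taken branch
        carry_borrow_subroutine(a, g0, temp_register["subtract"])
    return a
```

For instance, `run_with_correction([0, 0, 0, 7], [1], subtract=True)` borrows through three zero words and returns `[WORD_MASK, WORD_MASK, WORD_MASK, 6]`, while an operation that leaves the flag clear returns without entering the subroutine at all.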
One exemplary embodiment depicting public key encryption (PKE) circuitry 400 is shown in
Referring now to
The methodology of
The IC 600 may include media/switch interface circuitry 602 (e.g., a common switch interface (CSIX)) capable of sending and receiving data to and from devices connected to the integrated circuit such as physical or link layer devices, a switch fabric, or other processors or circuitry. The IC 600 may also include hash and scratch circuitry 604 that may execute, for example, polynomial division (e.g., 48-bit, 64-bit, 128-bit, etc.), which may be used during some packet processing operations. The IC 600 may also include bus interface circuitry 606 (e.g., a peripheral component interconnect (PCI) interface) for communicating with another processor such as a microprocessor (e.g., Intel Pentium®, etc.) or to provide an interface to an external device such as a public-key cryptosystem (e.g., a public-key accelerator) to transfer data to and from the IC 600 or external memory. The IC may also include core processor circuitry 608. In some embodiments, core processor circuitry 608 may comprise circuitry that may be compatible and/or in compliance with the Intel® XScale™ Core micro-architecture described in “Intel® XScale™ Core Developers Manual,” published December 2000 by the Assignee of the subject application. Of course, core processor circuitry 608 may comprise other types of processor core circuitry without departing from this embodiment. Core processor circuitry 608 may perform “control plane” tasks and management tasks (e.g., look-up table maintenance, etc.). Alternatively or additionally, core processor circuitry 608 may perform “data plane” tasks (which may be typically performed by the packet engines included in the packet engine array 612, described below) and may provide additional packet processing threads.
Integrated circuit 600 may also include a packet engine array 612. Packet engine array 612 may include a plurality of packet engines. Each packet engine may provide multi-threading capability for executing instructions from an instruction set, such as a reduced instruction set computing (RISC) architecture. Each packet engine in the array 612 may be capable of executing processes such as packet verifying, packet classifying, packet forwarding, and so forth, while leaving more complicated processing to the core processor circuitry 608. Each packet engine in the array 612 may include, e.g., eight threads that interleave instructions, meaning that as one thread is active (executing instructions), other threads may retrieve instructions for later execution. Of course, one or more packet engines may utilize a greater or fewer number of threads without departing from this embodiment. The packet engines may communicate among each other, for example, by using neighbor registers in communication with an adjacent engine or engines or by using shared memory space.
Integrated circuit 600 may also include memory interface circuitry 610. Memory interface circuitry 610 may control read/write access to external memory. Machine readable firmware program instructions may be stored in external memory, and/or other memory internal to the IC 600. These instructions may be accessed and executed by the integrated circuit 600. When executed by the integrated circuit 600, these instructions may result in the integrated circuit 600 performing the operations described herein, for example, those described below in
Referring now to
As used in any embodiment described herein, “circuitry” may comprise, for example, singly or in any combination, hardwired circuitry, programmable circuitry, state machine circuitry, and/or firmware that stores instructions executed by programmable circuitry. It should be understood at the outset that any of the operations and/or operative components described in any embodiment herein may be implemented in software, firmware, hardwired circuitry and/or any combination thereof.
In some embodiments, the embodiments of
Embodiments of the methods described herein may be implemented in a computer program that may be stored on a storage medium having instructions to program a system to perform the methods. The storage medium may include, but is not limited to, any type of disk including floppy disks, optical disks, compact disk read-only memories (CD-ROMs), compact disk rewritables (CD-RWs), and magneto-optical disks, semiconductor devices such as read-only memories (ROMs), random access memories (RAMs) such as dynamic and static RAMs, erasable programmable read-only memories (EPROMs), electrically erasable programmable read-only memories (EEPROMs), flash memories, magnetic or optical cards, or any type of media suitable for storing electronic instructions. Other embodiments may be implemented as software modules executed by a programmable control device.
Accordingly, at least one embodiment described herein may provide numerous advantages over the prior art. For example, the carry/borrow techniques described herein may be used to conserve code-space by having a single subroutine per memory device that may be called upon to correct operations upon various sub-vectors via a saved carry/borrow flag and a resume pointer address. The carry/borrow flags and resume-pointer addresses may be stored in a register prior to the beginning of vector MMP operations to enable the conditional branch subroutine to execute quickly. Moreover, if the branch to the subroutine is not selected there is no delay in the critical path of normal execution.
The terms and expressions which have been employed herein are used as terms of description and not of limitation, and there is no intention, in the use of such terms and expressions, of excluding any equivalents of the features shown and described (or portions thereof), and it is recognized that various modifications are possible within the scope of the claims. Accordingly, the claims are intended to cover all such equivalents.