1. Field of the Invention
The present invention relates to circuitry for computer systems, and more specifically, to microprocessor shifter circuits utilizing butterfly and inverse butterfly circuits, and control circuits therefor.
2. Related Art
Bit manipulation operations are important features of modern microprocessors. Unfortunately, bit manipulation operations carried out by existing microprocessors are limited in terms of flexibility and ease of implementation at the hardware level. For example, bit manipulation operations performed by existing microprocessors are often limited to shift and rotate operations. Since a microprocessor is typically optimized around the processing of words (i.e., fixed-length groups of binary bits of information), it is not surprising that bit-level operations are not well-supported by current word-oriented microprocessors. Simple “bit-parallel” Boolean operations such as AND, OR, XOR, and NOT are usually supported as the “logical” operations of the Arithmetic-Logic Unit (ALU), the most fundamental functional unit of a microprocessor. However, only very simple non-bit-parallel operations are supported, such as shift and rotate operations, in which all bits of an operand move by the same amount. These operations are usually supported by separate shifter functional units.
A few microprocessor Instruction Set Architectures (ISAs) have more advanced bit operations. However, these operations are implemented by complex shifter functional unit circuitry, which appreciably adds to the size and complexity of the microprocessor. Examples of such operations include subword extract and deposit operations (e.g., “pextrw” and “pinsrw” operations in the INTEL IA-32 ISA), field extract and deposit operations (e.g., “extr” and “dep” operations in HEWLETT PACKARD PA-RISC or INTEL IA-64 ISAs), or rotate and mask operations (e.g., “rldimi” in POWERPC ISA). These can be viewed as variants of the basic shift or rotate operations, with certain bits masked out and set to zeros, or sign bits replicated, or bits from a second operand merged into the result. Additionally, some instruction sets have multimedia permute operations that rearrange the subwords packed into one or more registers (e.g., “mix” operation in HEWLETT PACKARD PA-RISC 2.0 and INTEL IA-64 architectures).
There are also many emerging applications, such as cryptography, imaging, and bioinformatics, where even more advanced bit manipulation operations are needed. While circuitry to achieve these operations can be built by assembling simple logical and shift operation circuits, or by implementing them in firmware, such approaches often result in very large circuits or slow execution speeds. Applications using these advanced bit manipulation operations would thus be significantly sped up if the processor were able to support more powerful bit manipulation instructions. Such operations include arbitrary bit permutations, bit gather operations (performing multiple bit-field extract operations in parallel), and bit scatter operations (performing multiple bit-field deposit operations in parallel).
Accordingly, what would be desirable, but has not yet been provided, are shifter circuits utilizing butterfly and inverse butterfly circuits, and control circuits therefor, which address the foregoing shortcomings of existing microprocessors.
The present invention relates to microprocessor shifter circuits utilizing butterfly and inverse butterfly circuits, and control circuits therefor. The shifter circuits can be implemented in existing microprocessors, and allow for complex bit manipulations to be performed by such microprocessors at high speeds. The shifter circuits can perform butterfly and inverse butterfly operations, parallel extract and parallel deposit operations, group operations, mix operations, bit permutation operations, as well as instructions executed by existing microprocessors. The shifter circuits can replace existing shifter circuits in microprocessors, so as to provide new ways for conducting existing shift, rotate, extract, deposit, and mix operations, as well as more advanced bit manipulation instructions. The shifter circuits can be implemented with a reduced amount of circuitry, thus conserving chip space. User applications relating to steganography, binary image morphology, transfer coding, bioinformatics, imaging, and integer compression techniques, among other applications, can be implemented using the shifter circuits of the present invention. The shifter circuits can be provided in various combinations to provide microprocessor functional units which perform a plurality of bit manipulation processes.
The foregoing features of the invention will be apparent from the following Detailed Description of the Invention, taken in connection with the accompanying drawings, in which:
The present invention relates to microprocessor shifter circuits utilizing butterfly and inverse butterfly circuits, and control circuits therefor, for performing complex bit manipulations at high speeds. The shifter circuits can perform butterfly and inverse butterfly operations, parallel extract and parallel deposit operations, group operations, mix operations, permutation operations, as well as instructions executed by existing microprocessors. The shifter circuits can be provided in various combinations to provide microprocessor functional units which perform a plurality of bit manipulation processes.
The shifter circuits 10 and 22 could be implemented in a microprocessor individually, or they could be connected to each other to form a Benes network, which is a general permutation network. The butterfly configuration of the circuit 10 is referred to herein as “bfly,” and the inverse butterfly configuration of the circuit 22 is referred to herein as “ibfly.” A single execution of bfly followed by ibfly (or vice versa) can achieve any of the n! permutations of n bits in, at most, 2 instruction cycles of a microprocessor.
The circuits shown in
Each of the circuit stages 14-18 of
An embodiment of the present invention provides an architecture that has only 2 source operands per instruction, and which utilizes 3 Application Registers (ar.b1, ar.b2, ar.b3) associated with the functional unit to supply the control bits during the execution of these instructions. Application Registers are registers which are already available in some Instruction Set Architectures (also abbreviated ISAs), such as the INTEL IA-64 ISA. The control bits are determined either statically by a compiler, or dynamically by software. Other permutation primitives like “grp,” discussed below, do not require Application Registers. An arbitrary permutation of the n bits within a register can be performed by a sequence of at most lg(n) grp instructions, or by a sequence of at most 2 instructions using bfly and ibfly instructions. The latter (bfly and ibfly) achieve arbitrary n-bit permutations in O(1) cycles, rather than O(lg(n)) cycles.
The present invention can support a number of bit manipulation operations, which could be implemented as instructions which are added to the existing instruction set of a microprocessor. These instructions are summarized in Table 1 below:
The conventional shifter instructions executed by existing microprocessors and listed in the top half of Table 1 above (i.e., above the dotted line in Table 1) can also be implemented by the present invention, since they are all based upon rotation operations. The additional functions implemented by the present invention are listed in the bottom half of Table 1 (i.e., below the dotted line in Table 1). An inverse butterfly circuit can achieve any rotation (cyclic shift) of its input, and can perform the following on its input: right and left shifts, extract operations, deposit operations, and mix operations. These operations can be modeled as a rotate operation with additional logic handling zeroing or sign extension from an arbitrary position, or merging bits from the second source operand (for deposit). Mix operations can be modeled as a rotate of one operand by the subword size and then a merge of subwords alternating between the two operands. Since inverse butterfly circuits only perform permutations without zeroing and without replication, an extra 2:1 multiplexer stage at the end of the shifter circuits of the present invention either selects the rotated bits as-is or other bits which are computed as either zero, or the sign bit (replicated), or the bits of the second source operand, depending on the operation.
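The rotate-plus-multiplexer model described above can be illustrated in software. The following Python sketch is an illustration only, not the circuit of the invention. The function names (`rotr`, `masked_merge`, `shift_right`) are hypothetical, and the convention that a select bit of 1 picks the merge bit while 0 picks the rotated bit is an assumption made for the example.

```python
def rotr(x, s, n):
    """Right-rotate the n-bit value x by s positions (0 <= s < n)."""
    full = (1 << n) - 1
    return ((x >> s) | (x << (n - s))) & full


def masked_merge(rotated, merge, select, n):
    """Software model of an extra 2:1 multiplexer stage.

    Assumed convention: a select bit of 1 takes the merge bit,
    a select bit of 0 takes the rotated bit.
    """
    full = (1 << n) - 1
    return ((rotated & ~select) | (merge & select)) & full


def shift_right(x, s, n, arithmetic=False):
    """Right shift modeled as a rotation followed by a masked merge."""
    select = ((1 << s) - 1) << (n - s)      # s ones on the left, zeros elsewhere
    sign = (x >> (n - 1)) & 1
    # Merge bits are all zeros (logical shift) or the replicated sign bit.
    merge = (1 << n) - 1 if (arithmetic and sign) else 0
    return masked_merge(rotr(x, s, n), merge, select, n)
```

For instance, under these assumptions `shift_right(0b10010000, 4, 8)` evaluates to `0b00001001`, and the arithmetic variant fills the four leftmost positions with the sign bit instead of zeros.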
The generation of control bits for the shifter circuits of the present invention is now discussed with reference to
An n-bit inverse butterfly circuit can be viewed as two (lg(n)−1)-stage circuits followed by a stage that swaps or passes through paired bits that are n/2 positions apart. To right_rotate the input $\{in_{n-1} \ldots in_0\}$ by s positions, the two (lg(n)−1)-stage circuits must have right_rotated their half inputs by $s' = s \bmod n/2$, and the input to stage lg(n) must be of the form:
$\{in_{n/2+s'-1} \ldots in_{n/2}\; in_{n-1} \ldots in_{n/2+s'} \parallel in_{s'-1} \ldots in_0\; in_{n/2-1} \ldots in_{s'}\}$ (1)
which holds both for s < n/2 and for s ≥ n/2.
As shown at 56 in
$\{in_{s'-1} \ldots in_0\; in_{n-1} \ldots in_{n/2+s'} \parallel in_{n/2+s'-1} \ldots in_{n/2}\; in_{n/2-1} \ldots in_{s'}\}$ (2)
When the rotation amount is greater than or equal to n/2 then the bits that do not wrap in the (lg(n)−1)-stage circuits (solid) must be swapped in the final stage to yield the input right_rotated by s (as illustrated at 58 in
$\{in_{n/2+s'-1} \ldots in_{n/2}\; in_{n/2-1} \ldots in_{s'} \parallel in_{s'-1} \ldots in_0\; in_{n-1} \ldots in_{n/2+s'}\}$ (3)
For example, consider the 8-bit inverse butterfly circuit with right rotation amount s=5, depicted in
One can mathematically derive recursive equations for the control bits, $cb_j$, $j = 1, 2, \ldots, \lg(n)$, for achieving rotations on an inverse butterfly datapath. These equations yield the compact circuit shown in
From Equations 1-3 and
where $a^k$ is a string of k “a”s, “1” means “swap” and “0” means “pass through.” Note that $s = s \bmod n/2$ when $s < n/2$ and $s - n/2 = s \bmod n/2$ when $s \geq n/2$:
where ˜ indicates negation.
Due to the recursive structure of the inverse butterfly circuit, Equation 5 can be generalized by substituting j for lg(n), $2^j$ for n and $2^{j-1}$ for n/2:
There are j bits in $s \bmod 2^j$, with the most significant bit denoted $s_{j-1}$. The condition $s \bmod 2^j < 2^{j-1}$ is equivalent to $s_{j-1}$ being equal to 0 and the condition $s \bmod 2^j \geq 2^{j-1}$ is equivalent to $s_{j-1}$ being equal to 1:
Equation 7 can be rewritten as the pattern XORed with $s_{j-1}$:
$cb_j = (1^{s \bmod 2^{j-1}} \parallel 0^{2^{j-1}-(s \bmod 2^{j-1})}) \oplus s_{j-1}$ (8)
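Since $s \bmod k \leq k-1$, $k-(s \bmod k) \geq 1$ and hence the length of the string of zeros in Equation 8 is always ≥ 1 ($k = 2^{j-1}$). Consequently, the least significant bit of the pattern (prior to XOR with $s_{j-1}$) is always “0”:
$cb_j = (1^{s \bmod 2^{j-1}} \parallel 0^{2^{j-1}-1-(s \bmod 2^{j-1})} \parallel 0) \oplus s_{j-1}$
$cb_j = ((1^{s \bmod 2^{j-1}} \parallel 0^{2^{j-1}-1-(s \bmod 2^{j-1})}) \oplus s_{j-1}) \parallel s_{j-1}$ (9)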
The bit pattern inside the inner parentheses of the bottom Equation 9 is referred to as f(s, j), a string of $2^{j-1}-1$ bits with the $s \bmod 2^{j-1}$ leftmost bits set to “1” and the remaining bits set to “0.” This pattern is non-empty only for j ≥ 2; for j=1 the function returns the empty string:
Note that one can derive f(s, j+1) from f(s, j):
$f(s, j+1) = 1^{s \bmod 2^j} \parallel 0^{2^j-1-(s \bmod 2^j)}$ (12)
If bit $s_{j-1} = 0$ then $s \bmod 2^j = s \bmod 2^{j-1}$:
If bit $s_{j-1} = 1$ then $s \bmod 2^j = 2^{j-1} + (s \bmod 2^{j-1})$:
Combining Equations 13 and 14, we get:
Since f(s, j) is a string of $2^{j-1}-1$ bits, we can replace the string of ones in the lower Equation 15 by f(s, j) ORed (+) with 0 or 1, respectively. Similarly, we can replace the string of zeros in Equation 15 by f(s, j) ANDed (·) with 0 or 1, respectively. The value of $s_{j-1}$ determines whether 0 or 1 is used:
From Equations 10 and 16 we obtain a simple recursive expression for f(s, j):
Equation 11 shows a method for getting the $2^j/2$ control bits $cb_j$ for the switches in any stage j of the inverse butterfly circuit using a recursive function f(s, j) and a bit $s_{j-1}$ from the shift amount s. This can also be seen in
Referring to
Equation 11 provides that the control bits $cb_j$ are generated by taking the recursive function f(s, j), Exclusive-ORing it with a bit from the shift amount, $s_{j-1}$, and concatenating the result with $s_{j-1}$. Equation 17 then describes how the function f(s, j) can be recursively generated from the same function for the previous stage, f(s, j−1). This uses the OR function, +, and the AND function, ·, with f(s, j−1). The recursive function f(s, j) is derived as f(s, j−1) ORed with $s_{j-2}$, concatenated with $s_{j-2}$, then concatenated with f(s, j−1) ANDed with $s_{j-2}$. This can also be seen in the hardware realization in
Discussion of the use of these equations is now provided with the example of
$cb_1 = (f(5,1) \oplus s_0) \parallel s_0 = (\{\,\} \oplus s_0) \parallel s_0 = s_0 = 1$ (18)
The second stage control bits, cb2, replicated for the two 4-bit circuits, are given by:
Note that f(5,2)=1 in the above. The final stage control bits, cb3, are given by:
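As a software cross-check of the recursion for f(s, j) and the control bits $cb_j$ described in words above, the following Python sketch models the rotation control bit generation and a simple inverse butterfly datapath. It is a minimal sketch under stated assumptions, not the hardware of the invention: bit strings are written with the most significant bit on the left, “1” means swap, the leftmost control bit of a stage governs the leftmost pair of each sub-circuit, and the same control string is replicated across all sub-circuits of a stage. The function names are hypothetical.

```python
def f(s, j):
    """f(s, j) as a bit string of length 2**(j-1) - 1, built recursively."""
    if j == 1:
        return ""                                  # empty string for stage 1
    prev = f(s, j - 1)
    bit = (s >> (j - 2)) & 1                       # s_{j-2}
    ored = "".join("1" if (c == "1" or bit) else "0" for c in prev)
    anded = "".join("1" if (c == "1" and bit) else "0" for c in prev)
    return ored + str(bit) + anded


def control_bits(s, j):
    """cb_j = (f(s, j) XOR s_{j-1}) concatenated with s_{j-1}."""
    bit = (s >> (j - 1)) & 1                       # s_{j-1}
    return "".join(str(int(c) ^ bit) for c in f(s, j)) + str(bit)


def ibfly_rotate(bits, s):
    """Right-rotate the string `bits` by s using inverse butterfly stages."""
    n = len(bits)                                  # assumed to be a power of two
    data = list(bits)
    for j in range(1, n.bit_length()):
        width, half = 1 << j, 1 << (j - 1)
        cb = control_bits(s, j)
        for start in range(0, n, width):           # each sub-circuit of this stage
            for k in range(half):                  # k-th pair from the left
                if cb[k] == "1":
                    a, b = start + k, start + k + half
                    data[a], data[b] = data[b], data[a]
    return "".join(data)
```

With these conventions, `control_bits(5, 1)`, `control_bits(5, 2)` and `control_bits(5, 3)` evaluate to `"1"`, `"10"` and `"0111"`, and `ibfly_rotate("abcdefgh", 5)` returns `"defghabc"`, the input right-rotated by five positions.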
The other operations (shifts, extract, deposit, and mix) shown in Table 1 above can be modeled as a rotation part plus a masked-merge part with zeroes, sign bits or second source operand bits. The rotation part can use the same rotation control bit generator described above to configure the inverse butterfly circuit datapath. The masked-merge part can be achieved by using an enhanced inverse butterfly datapath with an extra multiplexer stage added as the final stage. The mask control bits are “0” when selecting the rotated bits and “1” when selecting the merge bits.
For a right shift by s, the s sign or zero bits on the left are merged in. This requires a control string $1^s \parallel 0^{n-s}$ for the extra multiplexer stage. From the definition of f(s, j), it can be seen that f(s, lg(n)+1) is the string $1^s \parallel 0^{n-1-s}$. Thus, the desired control string is given by f(s, lg(n)+1)∥0 (recall that s<n; therefore the least significant bit is always “0”, i.e., the least significant bit is always selected from the inverse butterfly datapath). f(s, lg(n)+1) can easily be produced by extending the rotation control bit generator by one extra stage. For left shift, which can be viewed as the left-to-right reversal of right shift, the control bits for the extra stage are obtained by reversing left-to-right the right shift control string to yield $0^{n-s} \parallel 1^s$.
For extract operations, which are like right shift operations with the left end replaced by the sign bit of the extracted field or zeros, the inverse butterfly circuit of the present invention selects in its extra multiplexer stage the rotated bits or zeros or the sign bit of the extracted field i.e., the bit in position pos+len−1 in the source register (see
For deposit operations, which are like left shift operations with the right and left ends replaced by zeros or bits from the second operand, the inverse butterfly circuit of the present invention selects in its extra multiplexer stage the rotated bits or zeros or bits from the second input operand. The correct pattern is a string of n−pos−len “1”s followed by len “0”s followed by s=pos “1”s ($1^{n-pos-len} \parallel 0^{len} \parallel 1^{pos}$), which merges in bits on the right and left around the deposited field. $cb_{\lg(n)+1}(pos+len) = \{(f(pos+len, \lg(n)+1) \oplus (pos+len)_{\lg(n)}) \parallel (pos+len)_{\lg(n)}\}$ is $1^{pos+len} \parallel 0^{n-pos-len}$ (as pos+len ranges from 0 to n, it has lg(n)+1 bits, $(pos+len)_{\lg(n) \ldots 0}$). Reversing this string left-to-right yields $0^{n-pos-len} \parallel 1^{pos+len}$, and negating the result produces $1^{n-pos-len} \parallel 0^{pos+len}$. Bitwise ORing this with the left shift control string, $0^{n-pos} \parallel 1^{pos}$, yields $1^{n-pos-len} \parallel 0^{len} \parallel 1^{pos}$, which is the correct pattern for the masked-merge part of the deposit operation.
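The reverse, negate and OR steps for the deposit pattern can be traced in a few lines of Python. This is a sketch for illustration, with hypothetical helper names; it assumes n-bit values held in Python integers, a select bit of 1 choosing the second operand, and a left rotation by pos as the rotation part.

```python
def ones_left(n, k):
    """Return the n-bit pattern of k ones followed by n-k zeros."""
    return ((1 << k) - 1) << (n - k)


def reverse_bits(v, n):
    """Reverse the n-bit value v left-to-right."""
    return int(format(v, "0{}b".format(n))[::-1], 2)


def deposit_select(n, pos, length):
    """Build the pattern: n-pos-len ones, len zeros, pos ones."""
    full = (1 << n) - 1
    # Reverse and negate the control string for pos+len.
    upper = ~reverse_bits(ones_left(n, pos + length), n) & full
    # Left shift control string: n-pos zeros followed by pos ones.
    lower = reverse_bits(ones_left(n, pos), n)
    return upper | lower


def deposit(x, y, pos, length, n):
    """Deposit the low `length` bits of x into y at bit position `pos`."""
    full = (1 << n) - 1
    rotated = ((x << pos) | (x >> (n - pos))) & full   # rotation part
    select = deposit_select(n, pos, length)
    return ((rotated & ~select) | (y & select)) & full  # masked-merge part
```

For n=8, pos=2 and len=3, `deposit_select(8, 2, 3)` gives `0b11100011`, and `deposit(0b101, 0xFF, 2, 3, 8)` gives `0b11110111`: the three-bit field lands in positions 4 down to 2 and all other bits come from the second operand.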
For mix operations, the enhanced inverse butterfly circuit of the present invention selects in its extra multiplexer stage the rotated bits or the bits from the second input operand. The control bit pattern is simply a pattern of alternating strings of “0”s and “1”s, the precise pattern depending on the subword size and whether mix left or mix right is executed. These patterns can be hard coded in the circuit for the 12 mix operations (6 operand sizes×2 directions).
The foregoing mask-merged bit patterns and mask control are summarized in Table 2 below:
The inverse butterfly circuit implemented by the present invention can perform a parallel extract (pex) operation. The inverse butterfly circuit is decomposed into even and odd subcircuits. As shown in
Fact 1. Any single data bit can be moved to any result position by just moving it to the correct R or L subcircuit of the intermediate result at every stage of the inverse butterfly circuit.
Proof. This can be proved by induction on the number of stages. At stage 1, the data bit is moved to its final position mod 2 (i.e., to R or L). At stage 2, it is moved to its final position mod 4, and so on. At stage lg(n), it is moved to its final position mod $2^{\lg(n)} = n$, which is its final result position.
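The induction in this proof can be followed with a short Python sketch. It assumes that stage j of the inverse butterfly pairs positions that differ only in bit j−1 of their index (positions $2^{j-1}$ apart within a $2^j$-bit sub-circuit); the function name `route_single_bit` is hypothetical.

```python
def route_single_bit(src, dst, n):
    """Positions of one data bit after each stage of an n-bit inverse butterfly.

    At stage j the bit is swapped only if bit j-1 of its current position
    differs from bit j-1 of the destination, so after stage j the position
    agrees with the destination mod 2**j.
    """
    pos = src
    path = [pos]
    for j in range(1, n.bit_length()):      # stages 1 .. lg(n)
        weight = 1 << (j - 1)
        if (pos ^ dst) & weight:
            pos ^= weight                    # swap with the paired position
        path.append(pos)
    return path
```

For example, `route_single_bit(5, 2, 8)` returns `[5, 4, 6, 2]`: the position matches the destination mod 2 after stage 1, mod 4 after stage 2, and mod 8 after stage 3.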
Fact 2. A permutation is routable on an inverse butterfly circuit if the destinations of the bits constitute a complete set of residues mod m (i.e., the destinations equal 0, 1, . . . , m−1 mod m) for each subcircuit of width m.
Proof. Based on Fact 1, bits are routed on the inverse butterfly circuit by moving them to the correct position mod 2 after the first stage, mod 4 after the second stage, etc. Consequently, if the two bits entering stage 1, (with 2-bit wide inverse butterfly circuits as shown in
Theorem 1. Any Parallel Extract instruction on $n = 2^{\lg(n)}$ bits can be implemented with one pass through an inverse butterfly circuit of lg(n) stages without path conflicts (with the unselected bits on the left zeroed out).
Proof. The pex operation compresses bits in their original order into adjacent bits in the result. Consequently, two adjacent selected data bits that enter the same stage 1 subcircuit must be adjacent in the output. In other words, one bit has a destination equal to 0 mod 2 and the other has a destination equal to 1 mod 2—the destinations constitute a complete set of residues mod 2 and thus are routable through stage 1. The selected data bits that enter the same stage 2 subcircuit must be adjacent in the output and thus form a set of residues mod 4 and are routable through stage 2. A similar situation exists for the subsequent stages, up to the final n-bit wide stage. No matter what the bit mask of the overall pex operation is, the selected data bits will be adjacent in the final result. Thus the destination of the selected data bits will form a set of residues mod n and the bits will be routable through all lg(n) stages of the inverse butterfly circuit.
Parallel deposit can be performed using the butterfly circuit of the present invention, since the parallel deposit operation is the inverse of the parallel extract operation. Also, the grp operation discussed above can be mapped to two parallel inverse butterfly circuits, since grp is a combination of a grp_right (or pex) and a grp_left (a mirrored pex).
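For reference, the bit-level behavior of parallel extract and parallel deposit can be written as plain loops. The Python functions below are a behavioral sketch only (they say nothing about the butterfly datapath), and the names `pex_reference` and `pdep_reference` are hypothetical.

```python
def pex_reference(x, mask, n):
    """Gather the bits of x selected by mask, right-justified, in order."""
    out, k = 0, 0
    for i in range(n):
        if (mask >> i) & 1:
            out |= ((x >> i) & 1) << k
            k += 1
    return out


def pdep_reference(x, mask, n):
    """Scatter the low bits of x to the positions where mask has a 1."""
    out, k = 0, 0
    for i in range(n):
        if (mask >> i) & 1:
            out |= ((x >> k) & 1) << i
            k += 1
    return out
```

With these definitions `pex_reference(0b10110010, 0b11110000, 8)` is `0b1011` and `pdep_reference(0b1011, 0b11110000, 8)` is `0b10110000`; depositing an extracted value under the same mask recovers `x & mask`, which reflects the inverse relationship noted above.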
A decoder for the parallel extract (and parallel deposit) instruction takes as its input the n-bit mask and produces the n/2×lg(n) control bits for the inverse butterfly (or butterfly) circuit. The decoder can be designed to consist of only two types of operations that can be performed in software or implemented as circuits: a parallel prefix population counter, which counts the ones from position 0 (on the right) to every bit position from 0 to n−2, and a set of left rotators that complement bits upon wraparound (LROTC—Left ROTate and Complement). The theoretical underpinnings for a parallel extract decoder are discussed below in connection with
X is the rightmost bits of R, as the output of pex is the selected data bits compressed and right justified in the result. Y is rotated left from the midpoint by the size of X, |X|, at the input to the final stage so that when it is swapped into R in the final stage it is contiguous with X, which lies immediately to its right. Z is the rightmost bits in L at the input to the final stage, so that when it is passed through in L, it is contiguous with the bits in R. Thus, the control bit pattern for the final stage is $1^{n/2-|X|} \parallel 0^{|X|}$, where “1” denotes swap, and “0” denotes pass-through of the paired bits in L and R, in the last stage. |X| is equal to the count of “1”s in the right half of the bit mask, as all the selected data bits in the right half of the input have already been compressed and right justified prior to the input to the final stage. This pattern can be generated by a left rotate and complement on wraparound (LROTC) operation applied to the string of n/2 ones. For every position in the rotation, a “1” is wrapped around from the left and complemented to obtain a “0” on the right. The end result, after a LROTC by |X| bits, is a string with the |X| rightmost bits set to “0” and the rest set to “1.”
Referring to the top row of
Rather than rotating the data bits explicitly, it is possible to compensate for the rotation by modifying the routing through the subsequent stages. This can be achieved by rotating the control bits by the same number of positions, complementing upon wraparound (LROTC). These rotations are propagated forward. Consequently, the control bits for a stage, obtained by a LROTC of the string of ones by the count of “1”s in the right half of the local bit mask, are further modified by a LROTC operation by the count of “1”s in the right half of the next larger local bit mask, and so on. The result is a single LROTC operation of the string of ones by the count of “1”s from the right half of the local bit mask through bit 0, the least significant bit. Overall, the counts needed are those of the “1”s in bit positions 0 through k, for k=0 to n−2, which form the set of prefix population counts.
The counting of the number of “1”s is done in parallel by the Parallel Prefix Population Count operation, while the rotation is done by the LROTC operation at each stage. The full decoding algorithm (which could be performed in hardware or software) is given below:
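One way such a decoder could be organized in software is sketched below in Python. This is a sketch under stated assumptions rather than the decoding algorithm of the invention: “1” means swap, control bit i of a sub-circuit governs the pair at offsets i and i + width/2 counted from the right, and the population count is taken from the right half of each local mask through bit 0, as described above. All function names are hypothetical.

```python
def lrotc(v, k, width):
    """Left-rotate v by k positions, complementing each bit that wraps around."""
    full = (1 << width) - 1
    for _ in range(k):
        wrapped = (v >> (width - 1)) & 1
        v = ((v << 1) & full) | (wrapped ^ 1)
    return v


def pex_control_bits(mask, n):
    """Control words for every sub-circuit of every inverse butterfly stage."""
    controls = []
    for j in range(1, n.bit_length()):            # stages 1 .. lg(n)
        width, half = 1 << j, 1 << (j - 1)
        stage = []
        for b in range(n // width):               # sub-circuit index, from the right
            # Prefix population count up to the top of the local right half.
            prefix = mask & ((1 << (b * width + half)) - 1)
            count = bin(prefix).count("1")
            # LROTC has period `width`, so the count can be reduced mod width.
            stage.append(lrotc((1 << half) - 1, count % width, half))
        controls.append(stage)
    return controls


def inverse_butterfly(x, controls):
    """Apply the stages in order; a control bit of 1 swaps its pair."""
    for j, stage in enumerate(controls, start=1):
        width, half = 1 << j, 1 << (j - 1)
        for b, cb in enumerate(stage):
            for i in range(half):
                lo, hi = b * width + i, b * width + i + half
                if (cb >> i) & 1 and ((x >> lo) ^ (x >> hi)) & 1:
                    x ^= (1 << lo) | (1 << hi)
    return x


def pex(x, mask, n):
    """Parallel extract: route through the datapath, then zero unselected bits."""
    routed = inverse_butterfly(x, pex_control_bits(mask, n))
    return routed & ((1 << bin(mask).count("1")) - 1)
```

As a small check, `lrotc(0b1111, 1, 4)` is `0b1110`, and `pex(0b1000, 0b1010, 4)` is `0b10`: the selected bits in positions 3 and 1 are compressed into positions 1 and 0.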
It is noted that three classes of parallel extract and parallel deposit instructions can be handled by the present invention. The first consists of static versions of these instructions, where software or the compiler “pre-decodes” the mask in the second source register into control bits for the datapath, and moves the control bits into the Application Registers (ARs), or any registers that may be used to control the ibfly circuit. This uses the mov instruction in Table 1, followed by the pex or pdep instructions. The second class is dynamic mask decoding by a hardware implementation of the decoder of
The merge and mask control bits implemented by masked merge unit 102 of
Table 3 below presents a high level comparison of the present invention to known barrel and log shifter designs:
The first two lines of Table 3 contain the components that contribute to circuit area. Both the log shifter and the inverse butterfly version of the present invention have n×lg(n) elements, while the barrel shifter has $n^2$ elements. The log shifter also has the fewest control lines (lg(n)), while the present invention has the most, as each switch, or pair of elements, requires an independent control bit. The next two lines pertain to latency. The datapath of the barrel shifter has a single gate delay, while the log shifter and the present invention have lg(n) gate delays. However, both the log shifter and the present invention utilize narrow multiplexers with lower capacitance at output nodes.
The method of logical effort was first used to compare the delay along the critical paths for the barrel shifter, the log shifter, and the inverse butterfly shifter. This estimated the critical path in terms of FO4 gate equivalents, which is the delay of an inverter driving 4 similar inverters. The latency of only the basic shifter operations on these datapaths was compared. As the 64-bit barrel shifter is impractical due to the capacitance on the output lines, a 64-bit shifter was implemented as an 8-byte barrel shifter followed by an 8-bit barrel shifter, which limits the number of transmission gates tied together to 8. The delay only from the input to the decoder through the two shifter levels for the barrel shifter and through the three shifter levels for the log shifter was considered.
For the present invention, the delay from the input to the control bit generator (i.e., the rotation control bit generator of
Since the log shifter is the faster and more compact of the two current shifter designs, it was implemented for testing purposes along with the present invention, using a standard cell library. All designs were synthesized to gate level, optimizing for shortest latency, using Design Compiler mapping to a TSMC 90 nm standard cell library. The results are summarized in Tables 4-6 below.
Table 4 shows the result of a basic shifter that only implements shift and rotate instructions. For the log shifter, parallel datapaths were implemented for left and right shifts. The present invention has 1.18× the latency of the log shifter, which is similar to the logical effort calculation. The present invention is also smaller than the log shifter, at approximately 70% of the area (note that a single datapath log shifter would have smaller area; additionally, accounting for the wires will increase the area for the ibfly-based shifter relative to the log shifter, as mentioned). For comparison, an ALU (supporting add, subtract, and, or, not, xor with register or immediate operands) was also implemented and synthesized using the same standard cell library. The present invention is faster (92% latency) and smaller (52% area) than this ALU.
Table 5 shows the result when both shifter architectures are enhanced to support extract, deposit, and mix instructions. The critical path of the log shifter is now through the extract sign bit propagation, so the latency is now comparable to that of the inverse butterfly-based shifter of the present invention. The present invention is still only 83% of the area of the log shifter. The results for an ALU of similar latency are also included (which has a comparable circuit area, in NAND gate equivalents).
Table 6 shows the results when support is added to the present invention for advanced bit manipulation operations, but not to the log shifter. The first line in Table 6 represents the log shifter circuit from Table 5, included as the baseline. The second line is a unit that supports the ibfly and static pex instructions. The functionality of the butterfly circuit can be emulated using inverse butterfly, albeit with a multi-cycle penalty. The latency increases are due to extra multiplexing for the control bits and output. The area increases due to the ARs, the extra multiplexers, and the pex masking. This unit has 1.18× the latency and 1.29× the area of the log shifter. The third line in Table 6 is a unit that also supports butterfly and static pdep. The latency increases slightly due to output multiplexing and the area increases due to the second (butterfly) datapath and second set of three ARs. This unit has 1.20× the latency and 1.87× the area of the log shifter. Alternatively, a separate unit can be added to perform just bfly and pdep (line 4 in Table 6), thereby enabling simultaneous superscalar execution with the ibfly-pex-shifter unit (line 2 in Table 6). It can be seen that the shift-permute unit of the present invention (line 3 in Table 6) can be split into two units at no additional increase in area. The results for an ALU of similar latency are also included, wherein the ALU is smaller than the log shifter due to the relaxed latency constraint. The shift-permute unit (line 3) now has 2.25× the area of the ALU, but smaller sizes are possible.
In each half of stage 2 we transfer from the local R to the local L the bits whose final destination is in the local L. So in R, we transfer g to RL and in L we transfer d to LL. Prior to stage 3, we right rotate the bits to right justify them in their original order in their new subnetworks. So d0 is right rotated by 1, the number of bits that stayed in LR, to yield 0d, and gf is right rotated by 1, the number of bits that stayed in RR, to yield fg.
In each subnetwork of stage 3 we again transfer from the local R to the local L the bits whose final destination is in the local L. So in LL we transfer d and in LR we transfer e. After stage 3 we have transferred each bit to its correct final destination: d0e0fg0h. Note that we use a control bit of “0” to indicate a swap, and a control bit of “1” to indicate a pass through operation. Rather than explicitly right rotating the data bits in the L half after each stage, we can compensate by modifying the control bits. This is shown in
Provided below is an explanation as to how the pdep operation can be mapped to the butterfly circuit:
Fact 3: Any single data bit can be moved to any result position by moving it to the correct half of the intermediate result at every stage of the butterfly circuit.
This can be proved by induction on the number of stages. At stage 1, the data bit is moved within n/2 positions of its final position. At stage 2, it is moved within n/4 positions of its final result, and so on. At stage lg(n), it is moved within $n/2^{\lg(n)} = 1$ position of its final result, which is its final result position. Referring back to
Fact 4: If the mask has k “1”s in it, the k rightmost data bits are selected and moved, i.e., the selected data bits are contiguous. They never cross each other in the final result.
This fact is by definition of the pdep instruction. See the example of
Fact 5: If a data bit in the right half (R) is swapped with its paired bit in the left half (L), then all selected data bits to the left of it will also be swapped to L (if they are in R) or stay in L (if they are in L).
Since the selected data bits never cross each other in the final result (Fact 4), once a bit swaps to L, the selected bits to the left of it must also go to L. Hence, if there is one “1” in the mask, the one selected data bit, d0, can go to R or L. If there are two “1”s in the mask, the two selected data bits, d1d0, can go to RR or LR or LL. That is, if the data bit on the right stays in R, then the next data bit can go to R or L, but if the data bit on the right goes to L, the next data bit must also go to L. If there are three “1”s, the three selected data bits, d2d1d0, can go to RRR, LRR, LLR or LLL. For example, in stage 1 of
Fact 6: The selected data bits that have been swapped from R to L, or stayed in L, are all contiguous mod n/2 in L.
From Fact 5, the destinations of the k selected data bits $d_{k-1} \ldots d_0$ must be of the form L . . . LR . . . R, a string of zero or more L's followed by zero or more R's. Define X as the bits staying in R, Y as the bits going to L that start in R, and Z as the bits going to L that start in L. It is possible that:
When X alone exists (i), there are no bits that go to L, so Fact 6 is irrelevant.
The structure of the butterfly circuit requires that when bits are moved in a stage, they all move by the same amount. Fact 4 states that the selected data bits are contiguous. Together these imply that when Y alone exists or X and Y exist (ii and iii), Y is moved as a contiguous block from R to L and Fact 6 is trivially true.
When X and Z exist (iv), Z is a contiguous block of bits that does not move so again Fact 6 is trivially true.
When X, Y and Z exist (v), Y comprises the leftmost bits of R, and Z the rightmost bits in L since they are contiguous across the midpoint of the stage (Fact 4). When Y is swapped to L, since the butterfly circuit moves the bits by an amount equal to the size of L or R in a given stage, Y becomes the leftmost bits of L. Thus Y and Z are now contiguous mod n/2, i.e., wrapped around, in L, as shown in
For example, in
Fact 7: The selected data bits in L can be rotated so that they are the rightmost bits of L, and in their original order.
From Fact 6, the selected data bits are contiguous mod n/2 in L. At the output of stage 1 in
At the end of this step, we have two half-sized butterfly circuits, L and R, with the selected data bits right-aligned and in order in each of L and R (last row of
The selected data bits emerge from stage 1 in
Fact 8: If the data bits are rotated by x positions left (or right) at the input to a stage of a butterfly circuit, then at the output of that stage we can obtain a rotation left (or right) by x positions of each half of the output bits by rotating left (or right) the control bits by x positions and complementing upon wrap around.
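Fact 8 can be checked numerically on a single stage. The Python sketch below assumes a stage that pairs bit i with bit i + n/2 and swaps the pair when control bit i is 1; because the control bits are complemented on wraparound, the check gives the same outcome if the opposite swap convention is used. The helper names are hypothetical.

```python
def rotl(v, k, width):
    """Left-rotate the width-bit value v by k positions."""
    k %= width
    full = (1 << width) - 1
    return ((v << k) | (v >> (width - k))) & full


def lrotc(v, k, width):
    """Left-rotate v by k positions, complementing each bit that wraps around."""
    full = (1 << width) - 1
    for _ in range(k):
        wrapped = (v >> (width - 1)) & 1
        v = ((v << 1) & full) | (wrapped ^ 1)
    return v


def butterfly_stage(data, control, n):
    """One stage pairing bit i with bit i + n/2; control bit 1 swaps the pair."""
    half = n // 2
    out = data
    for i in range(half):
        lo, hi = (data >> i) & 1, (data >> (i + half)) & 1
        if (control >> i) & 1 and lo != hi:
            out ^= (1 << i) | (1 << (i + half))
    return out


def fact8_holds(data, control, x, n):
    """Compare the rotated-input stage with per-half rotation of the plain output."""
    half = n // 2
    low = (1 << half) - 1
    plain = butterfly_stage(data, control, n)
    shifted = butterfly_stage(rotl(data, x, n), lrotc(control, x, half), n)
    expected = (rotl(plain >> half, x, half) << half) | rotl(plain & low, x, half)
    return shifted == expected
```

Calling `fact8_holds(d, c, x, 8)` for every 8-bit input, every 4-bit control word and every rotation amount x from 0 to 7 returns `True` in each case under these assumptions.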
Consider again the example of
An explanation is now provided as to why the control bits are complemented when they wrap around. The goal is to keep the data bits in the half they were originally routed to at each stage of the butterfly circuit, in spite of the rotation of the input.
Similarly, if a and b were originally swapped (see
Thus, complementing the control bit when it wraps (see
Provided below is a theorem to formalize the overall result:
Theorem 2: Any parallel deposit instruction on n bits can be implemented with one pass through a butterfly circuit of lg(n) stages without path conflicts (with the bits that are not selected zeroed out externally).
Proof: Assume there are k “1”s in the right half of the bit mask. Then, based on Fact 3, the k rightmost data bits (block X) must be kept in the right half (R) of the butterfly circuit and the remaining contiguous selected data bits must be swapped (block Y) or passed through (block Z) to the left half (L). This can be accomplished in stage 1 of the butterfly circuit by setting the k rightmost configuration bits to “1” (to pass through X and Z), and the configuration bits of the remaining selected bits to “0” (to swap Y).
At this point, the selected data bits in the right subnetwork (R) are right-aligned but those in the left subnetwork (L) are contiguous mod n/2, but not right aligned (Fact 4,
Now the process above can be repeated on the left and right subnets, which are themselves butterfly networks: count the number of “1”s in the local right half of the mask and then keep that many bits in the right half of the subnetwork, and swap the remaining selected data bits to the left half. Account for the rotation of the left half by modifying subsequent control bits. This can be repeated for each subnetwork in each subsequent stage until the final stage is reached, where the final parallel deposit result will have been achieved (see
Various functional units can be constructed in accordance with the present invention, for implementing any desired combinations of parallel extract, parallel deposit, butterfly permutations, and inverse butterfly permutations. Examples of such functional units will now be discussed with reference to
The decoder circuit 182 can be used for both pex and pdep with the caveat that the ordering of the stages that control bits are routed to is reversed (the circular arrow in
The various functional units discussed above in connection with
For bfly and ibfly, the data bits in GR r2 are permuted and placed in the destination register GR r1. Application registers ar.bi and ar.ibi, i=1, 2, 3, are used to hold the configuration bits for the butterfly or inverse butterfly datapath, respectively, and these registers must first be loaded by the mov_ar instruction. The mov_ar instruction in Table 7 is used to move the contents of two general-purpose registers to the application registers. The sub-opcode, x, indicates which application register, or pair of application registers, is written.
Static versions of pex and pdep are used when desired mask patterns are known at compile time. In the static version of the pex instruction, GR r2 is and'ed with mask GR r3, then permuted using inverse butterfly application registers ar.ib1-3, with the result placed in GR r1. For static pdep, GR r2 is permuted using butterfly application registers ar.b1-3, then and'ed with mask GR r3, with the result placed in GR r1.
Dynamic or variable versions of pex and pdep are used when desired mask patterns are only known at runtime. In the pex.v instruction, the data bits in GR r2 selected by the “1” bits in the mask GR r3 are placed, in the same order, in GR r1. In the pdep.v instruction, the right justified bits in GR r2 are placed in the same order in GR r1, in the positions selected by “1”s in mask GR r3. For both instructions, the mask r3 is translated dynamically by a decoder into control bits for an inverse butterfly or butterfly circuit.
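By way of illustration only, the bit-level behavior of the pex.v and pdep.v instructions described above can be modeled in software as follows. This is a minimal reference sketch of the semantics, one bit per loop iteration, and not a description of the hardware datapath; the function names pex and pdep and the word-width parameter n are chosen here for convenience.

```python
def pex(x: int, mask: int, n: int = 64) -> int:
    """Gather the bits of x selected by the 1s in mask, right-justified, in order."""
    result, out_pos = 0, 0
    for i in range(n):
        if (mask >> i) & 1:
            result |= ((x >> i) & 1) << out_pos
            out_pos += 1
    return result


def pdep(x: int, mask: int, n: int = 64) -> int:
    """Scatter the right-justified bits of x to the positions of the 1s in mask."""
    result, in_pos = 0, 0
    for i in range(n):
        if (mask >> i) & 1:
            result |= ((x >> in_pos) & 1) << i
            in_pos += 1
    return result
```

In this model, pdep with a given mask undoes pex with the same mask for the selected bit positions, which is the bit gather and bit scatter relationship relied on throughout the description.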
Suppose the particular pattern of bit scatter or gather is determined at execution time, but this pattern remains the same over many iterations of a loop. This is referred to as a loop-invariant pex or pdep operation. The setib and setb instructions invoke a hardware decoder to dynamically translate the bitmask GR r3 to control bits for the datapath stages; these control bits are written to the inverse butterfly or butterfly application registers, respectively, for later use in static pex and pdep instructions. Table 7 also shows the grp instruction which can perform arbitrary n-bit permutations. It is noted that the grp instruction can be emulated by a short sequence of pex and Boolean instructions.
The last column of Table 7 shows the expected number of cycles taken for the execution of the instruction. All static instructions (with pre-loaded ARs) take a single cycle, comparable to the time taken for an add instruction. The hardware decoder takes about 2 cycles, hence the setib and setb instructions take 2 cycles each. The variable pex.v, pdep.v and grp instructions each take up to 3 cycles because they have to go through the hardware decoder first and then incur some additional datapath latency through the inverse butterfly or butterfly circuits and the output multiplexers. It is noted that the cycle counts depend on what a given processor uses to determine the cycle time. Here, it is assumed, without loss of generality, that the latency of an ALU is used to determine the cycle time. Also, it is noted that these cycle counts can change depending on the circuit optimization performed.
Table 8 summarizes applications in which the present invention can be implemented. Check marks indicate definite usage, check marks in parentheses indicate usage in alternative algorithms, and question marks indicate potential usage. The use of the mov_ar instruction is assumed (but not shown in Table 8) whenever the bfly, ibfly, or static pex and pdep instructions are used. Note that these are just representative applications and do not constrain the use of the proposed instructions and shifter circuits in other applications or in different ways in these applications.
Table 8 shows that static pex and pdep are the most frequently used. The variable pdep.v is not used at all, and the variable pex.v is only used twice. The loop-invariant pex and pdep instructions, indicated by the use of setib and setb instructions, are only used for the LSB Steganography application. The grp instruction is only used in a block cipher.
The SSE instruction pmovmskb serves a similar purpose; it creates an 8- or 16-bit mask from the most significant bit of each byte of an MMX or SSE register and stores the result in a general-purpose register. However, pex offers greater flexibility than the fixed pmovmskb, allowing the mask, for example, to be derived from larger subwords, or from subwords of different sizes packed in the same register.
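As a rough sketch of this flexibility, the following Python model gathers the most significant bit of each byte of a 64-bit word (the pmovmskb-like case) and, with a different mask, the most significant bit of each 16-bit subword. The pex function is a software model of the instruction semantics; the constant and function names are illustrative assumptions and do not correspond to any existing API.

```python
def pex(x: int, mask: int, n: int = 64) -> int:
    """Software model of parallel extract: gather bits of x selected by mask."""
    result, out_pos = 0, 0
    for i in range(n):
        if (mask >> i) & 1:
            result |= ((x >> i) & 1) << out_pos
            out_pos += 1
    return result


BYTE_MSB_MASK = 0x8080808080808080      # top bit of each of 8 bytes
HALFWORD_MSB_MASK = 0x8000800080008000  # top bit of each of 4 halfwords


def msb_mask_from_bytes(word: int) -> int:
    """8-bit result: one bit per byte, similar in spirit to pmovmskb."""
    return pex(word, BYTE_MSB_MASK)


def msb_mask_from_halfwords(word: int) -> int:
    """4-bit result: one bit per 16-bit subword, which pmovmskb cannot produce directly."""
    return pex(word, HALFWORD_MSB_MASK)
```

Only the mask changes between the two cases, which is the point made above: the subword size is a property of the mask rather than of the instruction.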
Similarly, binary image compression performed by MATLAB's bwpack function benefits from pex. Binary images in MATLAB are typically represented and processed as byte arrays—a byte represents a pixel and has permissible values 0x00 and 0x01. However, certain optimized algorithms are implemented for a bitmap representation, in which a single bit represents a pixel.
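The conversion from the byte-per-pixel representation to the bitmap representation can be sketched as follows. This is an illustrative software model only: it assumes little-endian packing (pixel 0 lands in bit 0 of the output word), models pex with a bit loop, and uses a shift-and-OR where a deposit instruction would be used; the names pack_pixels and PIXEL_LSB_MASK are hypothetical and are not taken from MATLAB.

```python
def pex(x: int, mask: int, n: int = 64) -> int:
    """Software model of parallel extract: gather bits of x selected by mask."""
    result, out_pos = 0, 0
    for i in range(n):
        if (mask >> i) & 1:
            result |= ((x >> i) & 1) << out_pos
            out_pos += 1
    return result


PIXEL_LSB_MASK = 0x0101010101010101  # least significant bit of each byte


def pack_pixels(byte_pixels: bytes) -> int:
    """Pack 64 byte-sized pixels (0x00 or 0x01) into one 64-bit bitmap word."""
    if len(byte_pixels) != 64:
        raise ValueError("expected exactly 64 byte-sized pixels")
    word = 0
    for chunk_index in range(8):
        start = 8 * chunk_index
        chunk = int.from_bytes(byte_pixels[start:start + 8], "little")
        bits = pex(chunk, PIXEL_LSB_MASK)   # 8 pixel bits from 8 bytes
        word |= bits << (8 * chunk_index)   # stands in for a deposit
    return word
```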
To produce one 64-bit output word requires only 8 static pex instructions to extract 8 bits in parallel from 8 bytes and 7 dep instructions to pack these eight 8-bit chunks into one output word (see
LSB steganography is an example of an application that utilizes the loop-invariant versions of the pex and pdep instructions. The sample size and the number of bits replaced are not known at compile time, but they are constant across a single message.
If the images are processed in bitmap form, a single pex instruction extracts the entire index at once (assuming a 64-bit word contains an 8×8 block of 1-bit pixels, illustrated at 290 in
A strand of DNA is a double helix—there are really two strands with the complementary nucleotides, A⇄T and C⇄G, aligned. When performing analysis on a DNA string, often the complementary string is analyzed as well. To obtain the complementary string, the bases are complemented and the entire string is reversed, as the complement string is read from the other end. The reversal of the DNA string amounts to a reversal of the ordering of the pairs of bits in a word. This is a straightforward bfly or ibfly permutation.
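A minimal sketch of this reverse-complement operation is given below. It assumes a 2-bit encoding A=00, C=01, G=10, T=11 (an assumption made here so that complementing a base is a bitwise NOT of its pair) and 32 bases per 64-bit word. The pair reversal is modeled with shift-and-mask swaps at distances 32, 16, 8, 4 and 2, which corresponds to a butterfly pass in which those stages swap and the final stage passes through; all names are illustrative.

```python
WORD_BITS = 64
WORD_MASK = (1 << WORD_BITS) - 1
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}  # assumed encoding
BASES = "ACGT"


def reverse_bit_pairs(word: int) -> int:
    """Reverse the order of the 32 two-bit fields of a 64-bit word."""
    distance = WORD_BITS // 2
    while distance >= 2:
        # Build a mask selecting the lower half of every 2*distance-bit block.
        low = 0
        for base in range(0, WORD_BITS, 2 * distance):
            low |= ((1 << distance) - 1) << base
        # Swap the two halves of every block (an all-swap butterfly stage).
        word = ((word & low) << distance) | ((word >> distance) & low)
        distance //= 2
    return word


def reverse_complement(word: int) -> int:
    """Complement every base, then reverse the order of the bases."""
    return reverse_bit_pairs(~word & WORD_MASK)


def encode(seq: str) -> int:
    """Place base i of seq in bits 2i+1..2i."""
    return sum(CODE[b] << (2 * i) for i, b in enumerate(seq))


def decode(word: int, length: int = 32) -> str:
    return "".join(BASES[(word >> (2 * i)) & 3] for i in range(length))
```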
The DNA sequence is transcribed by the cell into a sequence of amino acids or a protein. Often, the analysis of the genetic data is more accurate when performed on a protein basis, such as is done by the BLASTX program. A set of three bases, or 6 bits of data, corresponds to a protein codon. Translating the nucleotides to a codon requires a table lookup operation using each set of 6 bits as an index. An efficient algorithm can use pdep to distribute eight 6-bit fields on byte boundaries, and then use the result as a set of table indices for a parallel table lookup (ptlu) instruction to translate the bytes, as shown in
For example, to match a pattern comprising many features where “1” indicates a match and “0” indicates no match, arbitrary subsets of the binary “1's” can be efficiently collected and compressed with this parallel extract instruction. The resulting compressed bits can be used as indices into the parallel table lookup instruction. This enables very fast action depending on which patterns are matched.
When aligning two DNA sequences, certain algorithms such as the known BLASTZ program use a “spaced seed” as the basis of the comparisons. This means that n out of m nucleotides are used to start a comparison rather than a string of n consecutive nucleotides. The remaining slots effectively function as wild cards, often causing the comparison to yield better results. For example, BLASTZ uses 12 of 19 (or 14 of 22 nucleotides) as the seed for comparison. The program compresses the 12 bases and uses the seed as an index into a hash table. This compression is a pex operation selecting 24 of 38 bits, as illustrated in
The present invention can also be used in random number generation and cryptology applications. Random numbers are very important in cryptographic computations for generating nonces, keys, random values, etc. Random number generators contain a source of randomness (such as a thermal noise generator) and a randomness extractor that transforms the randomness so that it has a uniform distribution. The INTEL random number generator uses a von Neumann extractor. This extractor breaks the input random bits, X=x1x2x3 . . . , into a sequence of pairs. If the bits in the pair differ, the first bit is output. If the bits are the same, nothing is output. This operation is equivalent to using a pex.v instruction on each word X from the randomness pool with the mask:
Mask = (x1 ⊕ x2) ∥ 0 ∥ (x3 ⊕ x4) ∥ 0 ∥ . . .   (21)
or equivalently,
Mask = (X ⊕ (X << 1)) & 0xAAA . . . A   (22)
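The mask construction of Equation (22) can be illustrated with the following sketch, which treats x1 as the most significant bit of the word and models pex.v with a bit loop. The function von_neumann_extract and its return convention (extracted bits plus the number of valid bits) are assumptions made for this example only.

```python
def pex(x: int, mask: int, n: int = 64) -> int:
    """Software model of parallel extract: gather bits of x selected by mask."""
    result, out_pos = 0, 0
    for i in range(n):
        if (mask >> i) & 1:
            result |= ((x >> i) & 1) << out_pos
            out_pos += 1
    return result


def von_neumann_extract(x: int, n: int = 64) -> tuple[int, int]:
    """Return (extracted bits, count of valid bits) for one n-bit word."""
    alternating = int("10" * (n // 2), 2)   # 0xAAA...A for the given width
    # A 1 at the position of the first bit of each pair whose bits differ.
    mask = (x ^ (x << 1)) & alternating
    return pex(x, mask, n), bin(mask).count("1")
```

For the 8-bit word 10 01 11 00, only the first two pairs differ, so the mask selects the first bit of each of those pairs and two output bits are produced.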
A number of popular ciphers, such as DES, have permutations as primitive operations. Inclusion of permutation instructions such as bfly, ibfly (or grp) can greatly improve the performance of the inner loop of these functions. Also, these instructions can be used as powerful primitive operations in the design of the next generation of ciphers and hash functions (especially for the Cryptographic Hash Algorithm Competition (SHA-3)). Since all the pex and pdep instructions—static, loop-invariant and variable—are also very likely to be useful for cryptanalysis algorithms, they are indicated as “?” in Table 8.
Kernels for the above binary compression and decompression, steganography, transfer coding, bioinformatics translate, integer compression, and random number generation applications were coded and simulated using the SimpleScalar Alpha simulator enhanced to recognize our new instructions. The latencies of the instructions in the simulator are as given in Table 7.
Random number generation exhibited the greatest speedup because a variable pex.v operation is performed. A single pex.v instruction replaces a very long sequence of instructions that loops through the data and mask and conditionally shifts each bit of the input to the correct position of the output based on whether the corresponding bit of the mask is ‘0’ or ‘1’. Of the benchmarks for static pex and pdep, the simple bit compression and decompression functions exhibited the greatest speedup, as these operations combine many basic instructions into one pex or pdep. The speedup is lower in the steganography encoding case because there are only 4 fields per word, and also in the uudecode and BLASTX translate cases because there are fewer fields overall. The lowest speedups were for the integer compression cases, as a smaller fraction of the runtime is spent on compressing or decompressing bit fields.
To evaluate the cost of implementing the advanced bit manipulation functional units described above in connection with
The steps in the proof of Theorem 2 give an outline for how to decode the n-bit bitmask into controls for each stage of a butterfly datapath. For each right half of a stage of a subnetwork, we count the number of “1”s in that local right half of the mask, say k “1”s, and then set the k rightmost control bits to “1”s and the remaining bits to “0”s. This serves to keep block X in the local R half and export Y to the local L half. We then assume that we explicitly rotate Y and Z to be the rightmost bits in order in the local L half. Then, we iterate through the stages and come up with an initial set of control bits. After this, we eliminate the need for explicit rotations of Y and Z by modifying the control bits instead. This is accomplished by a left rotate and complement upon wrap around (LROTC) operation, rotating the control bits by the same amount obtained when assuming explicit rotations.
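For clarity, the LROTC operation can be modeled in software as shown below. This is a sketch under the assumption that control strings are held as Python integers with the rightmost control bit in bit 0; the name lrotc and its parameters are illustrative. Because every bit is complemented once per full turn, the operation has a period of twice the string width, so the rotation amount is reduced modulo 2·width.

```python
def lrotc(bits: int, width: int, amount: int) -> int:
    """Left-rotate a width-bit string, complementing each bit that wraps around."""
    full = (1 << width) - 1
    for _ in range(amount % (2 * width)):
        wrapped = (bits >> (width - 1)) & 1          # bit leaving the left end
        bits = ((bits << 1) & full) | (wrapped ^ 1)  # re-enters on the right, inverted
    return bits
```

Applied to an all-zero string, lrotc produces a string whose k rightmost bits are “1”s for any k up to the width, which is the initial control pattern described above; applying it again rotates that pattern while complementing the bits that wrap.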
This process can be considerably simplified, as follows. First, note that when we modify control bits to compensate for a rotation in a given stage, we do so by propagating the rotation through all the subsequent stages. This means that when the control bits of a local L are modified, they are rotated and complemented upon wrap around by the number of “1”s in the local R, and by the number of “1”s in the local R of the preceding stage, and by the number of “1”s in all the local R's of all preceding stages up to the R in the first stage. In other words, the control bits of the local L are rotated by the total number of “1”s to its right in the bitmask.
Consider the example of
Second, it is necessary to produce a string of k “1”s from a count (in binary) of k, to derive the initial control bits assuming explicit rotations. This can also be done with a LROTC operation, as illustrated in
It is possible to combine these two facts: the initial control bits are obtained by a LROTC of a zero string the length of the local R by the PP_POPCNT of the bits in the bitmask in the local R and all bits to the right of it. We denote a string of k “0”s as 0^k. We specify a bitfield from bit h to bit v as {h:v}, where v is to the right of h. So,
Verification of the correctness of this approach is provided with reference to
One interesting point is that, for stage 1, the population count of the odd multiples of n/2^1 bits is needed; for stage 2, the population counts of the odd multiples of n/2^2 bits are needed; for stage 3, the population counts of the odd multiples of n/2^3 bits are needed; and so on. Overall, the counts are needed of the k rightmost bits, for k=0 to n−2. Such counts can be provided using the prefix population counter of the present invention, discussed above.
The control bits for the inverse butterfly for a pdep or a pex operation can be obtained using Algorithm 1, with the one caveat that the controls for stage i of the butterfly datapath are routed to stage lg(n)−i+1 in the inverse butterfly datapath. This can be shown using an approach similar to that shown earlier, except for working backwards from the final stage. Algorithm 1 decodes the n mask bits into the n·lg(n)/2 control bits for pdep and pex.
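As a hedged illustration of the decoding rule summarized above, the following Python sketch derives, for every subnetwork of every stage, a control word obtained by an LROTC of a zero string by the number of “1”s in the mask from the top of the local R half down to bit 0, and then simulates a butterfly pass (for pdep) and an inverse butterfly pass with the stage order reversed (for pex). It assumes the convention used in the proof of Theorem 2 (“1” passes through, “0” swaps) and that stage 1 pairs bits n/2 apart. The function names are illustrative only; this is a software model for checking the idea, not the hardware decoder.

```python
def lrotc(bits: int, width: int, amount: int) -> int:
    """Left-rotate a width-bit string, complementing bits that wrap around."""
    full = (1 << width) - 1
    for _ in range(amount % (2 * width)):
        wrapped = (bits >> (width - 1)) & 1
        bits = ((bits << 1) & full) | (wrapped ^ 1)
    return bits


def decode_mask(mask: int, n: int) -> list[list[int]]:
    """controls[stage][subnetwork] is a control word of n/2^(stage+1) bits."""
    controls, size = [], n
    while size >= 2:
        half = size // 2
        stage = []
        for base in range(0, n, size):
            # Count the 1s in the local R half and in all bits to its right.
            count = bin(mask & ((1 << (base + half)) - 1)).count("1")
            stage.append(lrotc(0, half, count))
        controls.append(stage)
        size = half
    return controls


def apply_stage(x: int, stage_controls: list[int], size: int, n: int) -> int:
    """Apply one stage: control bit 1 passes a pair through, 0 swaps it."""
    half = size // 2
    for subnet, base in enumerate(range(0, n, size)):
        ctl = stage_controls[subnet]
        for j in range(half):
            lo, hi = base + j, base + j + half
            if not (ctl >> j) & 1 and ((x >> lo) & 1) != ((x >> hi) & 1):
                x ^= (1 << lo) | (1 << hi)  # swap two differing bits
    return x


def pdep_butterfly(x: int, mask: int, n: int = 64) -> int:
    """Butterfly pass followed by masking of the unselected positions."""
    size = n
    for stage_controls in decode_mask(mask, n):
        x = apply_stage(x, stage_controls, size, n)
        size //= 2
    return x & mask


def pex_inverse_butterfly(x: int, mask: int, n: int = 64) -> int:
    """Mask first, then apply the same controls with the stage order reversed."""
    x &= mask
    size = 2
    for stage_controls in reversed(decode_mask(mask, n)):
        x = apply_stage(x, stage_controls, size, n)
        size *= 2
    return x
```

Since each stage is its own inverse for a fixed control word, running the stages in reverse order undoes the pdep routing, which is why the same decoded controls can serve both operations in this model.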
The execution time of Algorithm 1 in software is approximately 1200 cycles on an Intel Pentium-D processor. This software routine is useful for static pex or pdep operations and perhaps for loop-invariant pex or pdep if the amount of processing in the loop dwarfs the 1200-cycle execution time. However, for dynamic pex.v and pdep.v, a hardware decoder that implements Algorithm 1 is required in order to achieve high performance. Fortunately, Algorithm 1 contains just two basic operations, population_count and LROTC, both of which have straightforward hardware implementations.
The first stage of the decoder is the parallel prefix population counter, discussed above. This is a circuit that computes in parallel all the population counts of step 1 of Algorithm 1. The circuit is a parallel prefix network with each node performing carry-save addition (i.e., a set of full adders). The counters resemble carry shower counters in which the inputs are grouped into sets of three lines which are input into full adders. The sum and carry outputs of the full adders are each grouped into sets of three lines which are input to another stage of full adders and so on. The parallel prefix architecture resembles radix-3 Han-Carlson, a parallel prefix look-ahead carry adder that has lg(n)+1 stages with carries propagated to the odd positions in the extra final stage. The radix-3 nature stems from the carry shower counter design, as we group 3 lines to input to a full adder at each level. The similarity to Han-Carlson is due to the 1- and 2-bit counts being deferred to the end, similar to odd carries being deferred in the Han-Carlson adder. Thus, the counter has log_3(n)+2 stages.
One simplification of the counter is based on a property of rotations: they are invariant when the rotation amount differs by the period of rotation. Thus, for the ith stage of the butterfly network, the PP_POPCNTs are only computed mod n/2^(i−1). For example, for the 64-bit hardware decoder, for the 32 butterfly stage 6 PP_POPCNTs corresponding to the odd multiples of n/64, it is necessary to compute only the PP_POPCNTs mod 2, i.e., only the least significant bit; for the 16 butterfly stage 5 PP_POPCNTs, we need only compute the PP_POPCNTs mod 4, i.e., the two least significant bits; and so on. Only the PP_POPCNT of 32 bits for stage 1 requires the full lg(n)-bit PP_POPCNT.
The outputs from the decoders shown in
The functional units shown in
The various functional units were coded in Verilog and synthesized using Synopsys Design Compiler mapping to a TSMC 90 nm standard cell library. The designs were compiled to optimize timing. The decoder circuit was initially compiled as one stage and then Design Compiler automatically pipelined the subcircuit. Timing and area figures are as reported by Design Compiler. We also synthesized a reference ALU using the same technology library as a reference for latency and area comparisons.
Table 9 below summarizes the timing and area for the circuits. This shows that the 64-bit functional unit in
Table 10 shows the number of different circuit types, to give a sense for why the functional units supporting variable pex.v, pdep.v and grp are so much larger. It shows that supporting variable operations comes at a high price. The added complexity is due to the complex decoder combinational logic and to the additional pipeline registers and multiplexer logic. This explains why in Table 10, the variable circuits (
Having thus described the invention in detail, it is to be understood that the foregoing description is not intended to limit the spirit or scope thereof. What is desired to be protected is set forth in the following claims.
This application is a continuation application of and claims the benefit of priority to U.S. patent application Ser. No. 12/126,616 filed May 23, 2008, now U.S. Pat. No. 8,285,766, which claims the priority of U.S. Provisional Application Ser. No. 60/931,493 filed May 23, 2007, the entire disclosures of which are expressly incorporated herein by reference.
The present invention was made with government support under Department of Defense Grant No. H98230-04-C-0496. Accordingly, the Government has certain rights to the present invention.
Number | Name | Date | Kind |
---|---|---|---|
4396994 | Kang et al. | Aug 1983 | A |
4653019 | Hodge et al. | Mar 1987 | A |
4829460 | Ito | May 1989 | A |
5481749 | Grondalski | Jan 1996 | A |
5670900 | Worrell | Sep 1997 | A |
5673321 | Lee | Sep 1997 | A |
5729482 | Worrell | Mar 1998 | A |
5961575 | Hervin et al. | Oct 1999 | A |
5978822 | Muwafi et al. | Nov 1999 | A |
6098087 | Lemay | Aug 2000 | A |
6381690 | Lee | Apr 2002 | B1 |
6715066 | Steele, Jr. | Mar 2004 | B1 |
6738793 | Lin et al. | May 2004 | B2 |
6877019 | Bandy et al. | Apr 2005 | B2 |
6917955 | Botchev | Jul 2005 | B1 |
6922472 | Lee et al. | Jul 2005 | B2 |
6952478 | Lee et al. | Oct 2005 | B2 |
7035887 | Ziegler et al. | Apr 2006 | B2 |
7092526 | Lee | Aug 2006 | B2 |
7174014 | Lee et al. | Feb 2007 | B2 |
7689635 | Gupta et al. | Mar 2010 | B2 |
7951404 | Gomori | May 2011 | B2 |
8285766 | Lee et al. | Oct 2012 | B2 |
20080243974 | Paumier et al. | Oct 2008 | A1 |
20090138534 | Lee et al. | May 2009 | A1 |
20130103730 | Lee et al. | Apr 2013 | A1 |
Entry |
---|
Office Action dated Dec. 12, 2011, from U.S. Appl. No. 12/126,616. |
Lee, Ruby B., “Accelerating Multimedia with Enhanced Microprocessors,” IEEE Micro, vol. 15, No. 2, pp. 22-32, Apr. 1995. |
Lee, et al., “Efficient Permutation Instructions for Fast Software Cryptography,” IEEE Micro, vol. 21, No. 6, pp. 56-69, Nov.-Dec. 2001. |
Shi, et al., “Bit Permutation Instructions for Accelerating Software Cryptography,” Proceedings of the IEEE International Conf. on Application—Specific Systems, Architectures and Processors, pp. 138-148, Jul. 2000. |
Yang, et al. “Fast Subword Permutation Instructions Using Omega and Flip Network Stages,” Proceedings of the International Conference on Computer Design (ICCD 2000), pp. 15-22, Sep. 2000. |
Lee, et al., “How a Processor Can Permute n bits in O (1) cycles,” Proceedings of Hot Chips 14-A symposium on High Performance Chips, Aug. 2002. |
Shi, et al., “Arbitrary Bit Permutations in One or Two Cycles,” Proceedings of the IEEE International Conference on Application-Specific Systems, Architectures and Processors, Jun. 2003. |
R. Lee, “Precision Architecture”, IEEE Computer, vol. 22, No. 1, pp. 78-91, Jan. 1989. |
Hilewitz, et al., “Comparing Fast Implementations of Bit Permutation Instructions,” Proceedings of the 38th Annual Asilomar Conference on Signals, Systems, and Computers, Nov. 2004. |
Sun Microsystems, The VIS™ Instruction Set Version 1.0, Jun. 2002. |
Franz, et al., “Computer Based Steganography: How It Works and Why Therefore Any Restrictions on Cryptography are Nonsense, at Best,” Information Hiding, Springer Lecture Notes in Computer Science, vol. 1174, pp. 7-21, 1996. |
V.E. Benes, “Optimal Rearrangeable Multistage Connecting Networks,” The Bell System Technical Journal, vol. 43, No. 4, Jul. 1964, pp. 1641-1656. |
“Uuencoding,” Wikipedia: The Free Encyclopedia, http://en.wikipedia.org/wiki/Uuencode, printed from website on Sep. 1, 2009. |
National Center for Biotechnology Information, BLAST, http://www.ncbi.nlm.nih.gov/BLAST, printed from website on Sep. 1, 2009. |
Fiskiran, et al., “On-Chip Lookup Tables for Fast Symmetric—Key Encryption,” Proceedings of the IEEE International Conf. on Application-Specific Systems, Architectures and Processors, pp. 356-363, Jul. 2005. |
Burger, et al., “The SimpleScalar Tool Set, Version 2.0,” University of Wisconsin-Madison Computer Sciences Department Technical Report #1342, 1997. |
Ruby B. Lee, “Subword Parallelism with Max-2,” IEEE Micro, vol. 16, No. 4, pp. 51-59, Aug. 1996. |
Shi, et al., “Subword Sorting with Versatile Permutation Instructions,” Proceedings of the International Conference on Computer Design (ICCD 2002) pp. 234-241, Sep. 2002. |
The Mathworks, Inc., “Image Processing Toolbox User's Guide,”: http://www.mathworks.com/access/helpdesk/help/toolbox/images/images.html, printed from website on Sep. 10, 2009. |
Hilewitz, et al., “Performing Advanced Bit Manipulations Efficiently in General-Purpose Processors,” Proceedings of the 18th IEEE Symposium on Computer Arithmetic, Jun. 25-27, 2007. |
Hilewitz, et al., “Fast Bit Compression and Expansion with Parallel Extract and Parallel Deposit Instructions,” Proceedings of the IEEE 17th International Conference on Application-Specific Systems, Architectures and Processors (ASAP), pp. 65-72, Sep. 11-13, 2006. |
Number | Date | Country | |
---|---|---|---|
20130103730 A1 | Apr 2013 | US |
Number | Date | Country | |
---|---|---|---|
60931493 | May 2007 | US |
Number | Date | Country | |
---|---|---|---|
Parent | 12126616 | May 2008 | US |
Child | 13647861 | US |