The instruction sets of modern computer Central Processing Units (CPUs) typically include a variety of commands to manipulate data stored in operand registers. Each operand register may hold a “word.” A “word” is a fixed-sized piece of data, such as a quantity of data handled as a unit by the instruction set and/or the processor core of the CPU, and can vary from CPU to CPU. For example, a “word” might be 32 bits on one type of CPU, whereas a “word” might be 64 bits on another type of CPU. In some applications such as encryption, it is desirable to go inside a “word” and individually manipulate the stored bits.
For a more complete understanding of the present disclosure, reference is now made to the following description taken in conjunction with the accompanying drawings.
Reduced instruction set computing (RISC) instruction sets typically include instructions that support basic “word” manipulation operations such as masks, bit shifts and bitwise logic operations. Instructions to apply “masks” to words are used to set some bits in the word while leaving others untouched. “Shift” operations may shift bits within the words toward the beginning or end of the word. Examples of basic bitwise logic operations include logic operations such as “AND,” inclusive “OR,” or “Exclusive OR” (XOR). In comparison, complex instruction set computing (CISC) instruction sets typically include instructions that can perform a series of these basic operations as multiple steps using a single instruction call.
Unfortunately, using conventional techniques, rearranging the order of bits stored in a source register, when copying the bits to a destination register, can require executing a relatively large number of instructions, particularly if the new order is arbitrary. This act of rearranging bits into a different sequence or order is called “permuting.” On a typical processor (RISC or CISC), a simple permutation operation may require around twenty different instructions to be executed to permute a single byte of data.
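As a rough illustration of why the conventional approach is costly, the following minimal sketch permutes a byte using only shifts, masks, and ORs. The function name and the `targets` list are hypothetical and are not part of any instruction set; the point is only that each bit requires several primitive operations.

```python
def permute_byte_shift_mask(src, targets):
    """Move source bit n to position targets[n] using shifts, masks and ORs."""
    result = 0
    for n, t in enumerate(targets):
        bit = (src >> n) & 1   # shift and mask to isolate source bit n
        result |= bit << t     # shift into place and OR into the result
    return result & 0xFF

# Reverse the bit order of a byte: source bit n goes to position 7 - n.
reversed_byte = permute_byte_shift_mask(0b00010011, [7, 6, 5, 4, 3, 2, 1, 0])
```

Each loop iteration corresponds to several instructions on a processor without a dedicated permutation instruction, so the total grows with the number of bits moved.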
The reordering of the bits illustrated in
An “AND” is a logical operation where the binary inputs are multiplied, such that the output of an AND is true (1) if and only if all of the inputs are true (1). Otherwise, if any input is false (0), an AND outputs a false (0). An “OR” is a logical operation where the binary inputs are added, such that the output of an OR is false (0) if and only if all of the inputs are false (0). Otherwise, if any input is true (1), an OR outputs a true (1). An “XOR” is a logical operation that outputs a true (1) if and only if an odd number of inputs are true (1). Otherwise, an XOR outputs a false (0).
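The multi-input behavior of these three operations can be sketched as follows. The helper names `and_n`, `or_n`, and `xor_n` are illustrative only; each takes a list of bits (0 or 1).

```python
from functools import reduce
from operator import and_, or_, xor

def and_n(bits):
    """True (1) only if every input is true (1)."""
    return reduce(and_, bits, 1)

def or_n(bits):
    """False (0) only if every input is false (0)."""
    return reduce(or_, bits, 0)

def xor_n(bits):
    """True (1) only if an odd number of inputs are true (1)."""
    return reduce(xor, bits, 0)
```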
There have been past attempts at improving the performance of permutation operations, but the results have had various shortcomings. One well-known solution was based upon bit-matrix multiplication. An example is described in “Bit Matrix Multiplication in Commodity Processors” by Yedidya Hilewitz, Cedric Lauradoux, and Ruby B. Lee, IEEE International Conference on Application-Specific Systems, Architectures and Processors (ASAP), 2008, pages 7-12, and in Yedidya Hilewitz's related 2008 Princeton PhD dissertation entitled “Advanced Bit Manipulation Instructions: Architecture, Implementation and Applications.”
However, bit-matrix multiplication solutions need to store tables of data that specify how to manipulate bits. This need for storage is a major drawback, and limits the practicality of bit-matrix multiplication as a solution on most processors. In a typical implementation, the stored tables are estimated to require approximately 1 Kilobyte (KB) of memory (i.e., about 8,000 bits). While 1 KB may seem tiny relative to the amount of memory in today's computers, it can be huge relative to the amount of data storage available within a processor core's internal registers, which are what a core uses when executing instructions. An alternative might be to store the tables in memory outside the processor's core and load data from the tables as needed. However, the use of multiple memory swap transactions will considerably increase the time required to perform bit-matrix multiplication, due to the added latency introduced by the extra transactions.
Disclosed herein are permutation instructions, and circuits for executing the instructions, that can perform arbitrary permutations on a source byte in a single clock cycle. Each bit in the source byte is permuted in accordance with a permutation map. The only storage within the processor core required to execute these instructions is the register holding the source byte “rs”, the register “rd” that will receive the result of the permutation, and the register or registers storing the permutation map. With just two new permutation instructions and a byte swap instruction, it becomes possible to do almost any kind of permutation operation on a word, which in the example system is 32 bits.
A first new instruction is a Permute Bits with XOR (“PERBX”) and a second new instruction is a Permute Bits with AND (“PERBA”). With both instructions, each source bit is mapped independently, such that it is possible to map more than one source bit to a single destination bit (e.g., as illustrated in
With the PERBX instruction, if no bit is mapped to a target bit, the target bit is set to zero (0). If a single source bit is mapped to a target bit, then the target bit is set to the state of the source bit. If multiple source bits are mapped to a target bit, then the target bit is set to a logical Exclusive-OR (XOR) of those mapped source bits. As noted above, an XOR is a logical operation that outputs a true (1) only when an odd number of inputs are true (1). Otherwise, an XOR outputs a false (0).
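The PERBX behavior described above can be modeled in software as a minimal sketch. The function name and the `mapping` list (one entry per source bit, holding a target bit index or `None` for an unmapped bit) are hypothetical conveniences and do not reflect the register encoding of the permutation map.

```python
def perbx(src, mapping):
    """Behavioral model: mapping[n] is the target bit for source bit n, or None."""
    result = 0
    for n, target in enumerate(mapping):
        if target is not None:
            # XOR-accumulate: unmapped targets stay 0, a single source is copied,
            # and several sources landing on one target are XORed together.
            result ^= ((src >> n) & 1) << target
    return result

# Source bits 0, 1 and 2 all map to target bit 0; the other bits are unmapped.
three_to_one = [0, 0, 0, None, None, None, None, None]
```

With this mapping, a source byte of 0b00000111 yields a result bit of 1 (three true inputs), while 0b00000011 yields 0 (two true inputs).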
Applying XOR transforms to data is believed to be particularly advantageous for cryptography. A byte can be permuted to be completely unrecognizable, but if the original permutation map is known, then depending in part on the number of bits copied to the same bit and the duplication of those bits in the result, the original source byte may be recoverable, making the process reversible. This reversibility is possible because each output bit that is written to by multiple source bits will only be true if an odd number of inputs are true, such that the original bit states may be recovered in a manner similar to data recovered using a parity bit.
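The simplest reversible case is a one-to-one map, where every source bit goes to a distinct target bit. The sketch below covers only that case, not the parity-style recovery mentioned above; `perbx`, `invert_mapping`, and the example map are illustrative assumptions.

```python
def perbx(src, mapping):
    """Behavioral model: mapping[n] is the target bit for source bit n, or None."""
    result = 0
    for n, target in enumerate(mapping):
        if target is not None:
            result ^= ((src >> n) & 1) << target
    return result

def invert_mapping(mapping):
    """Build the inverse of a one-to-one mapping (every target used exactly once)."""
    inverse = [None] * len(mapping)
    for n, target in enumerate(mapping):
        inverse[target] = n
    return inverse

mapping = [3, 0, 7, 1, 6, 2, 5, 4]          # hypothetical one-to-one map
scrambled = perbx(0b10110010, mapping)
recovered = perbx(scrambled, invert_mapping(mapping))
```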
With the PERBA instruction, if no bit is mapped to a target bit, the target bit is set to zero (0). If a single source bit is mapped to a target bit, then the target bit is set to the state of the source bit. If multiple source bits are mapped to a target bit, then the target bit is set to a logical AND of those mapped source bits. As noted above, an AND is a logical operation that outputs a true (1) only when all of the inputs are true (1). Otherwise, an AND outputs a false (0).
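A corresponding behavioral sketch of PERBA follows. As before, the function name and the `mapping` list (target bit index per source bit, or `None` if unmapped) are hypothetical and stand in for the encoded permutation map.

```python
def perba(src, mapping):
    """Behavioral model: mapping[n] is the target bit for source bit n, or None."""
    result = 0
    for target in range(8):
        # Collect every source bit that is mapped to this target bit.
        sources = [n for n, t in enumerate(mapping) if t == target]
        # An unmapped target stays 0; otherwise the target is the AND of its sources.
        if sources and all((src >> n) & 1 for n in sources):
            result |= 1 << target
    return result

# Source bits 0, 1 and 2 all map to target bit 0; the other bits are unmapped.
three_to_one = [0, 0, 0, None, None, None, None, None]
```

The explicit check that `sources` is non-empty matters: an AND over no inputs would otherwise evaluate as true, whereas the instruction sets an unmapped target bit to zero.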
As can be inferred from the example instruction format 420 in
As illustrated in
The “E” data field 858 of each nibble pm[n] 854n specifies whether a source bit rs1[n] is or is not to be mapped to the destination register rd 760. If “E” is equal to true (1), the source bit is not mapped. Otherwise, if “E” is equal to false (0), the source bit is mapped as specified by the offset in the TBO data field. For example, referring back to
To provide context for details relating to the execution of the PERBA and PERBX operations,
The processor core 900 includes a plurality of execution registers 980 that are used by the core 900 to perform operations. The registers 980 may include, for example, instruction registers 982, operand registers 984, and various special purpose registers 986. These registers 980 are ordinarily for the exclusive use of the core 900 for the execution of operations. Instructions and data are loaded into the execution registers 980 to “feed” an instruction pipeline 992. While a processor core 900 may experience no latency (or a latency of one-or-two cycles of the clock controlling timing of a micro-sequencer 991) when accessing its own execution registers 980, accessing memory that is external to the core 900 may produce a larger latency due to (among other things) the physical distance between the core 900 and the memory.
The instruction registers 982 store instructions loaded into the core (e.g., via bus(es) 999) that are being/will be executed by an instruction pipeline 992. The operand registers 984 have the structure 530 and store data that has been loaded into the core 900 that is to be processed by an executed instruction (e.g., registers serving as the source byte register rs1 640 and permutation map source register rs2 850). The operand registers 984 also receive the results of operations executed by the core (e.g., a register serving as the destination register rd 760). The special purpose registers 986 may be used for various “administrative” functions, such as being set to indicate divide-by-zero errors, to increment or decrement transaction counters, to indicate core interrupt “events,” etc.
Referring to
The instruction pipeline 992 comprises a plurality of “stages,” such as an instruction decode stage, an operand fetch stage, an instruction execute stage, and an operand write-back stage. Each stage corresponds to circuitry.
The instruction fetch circuitry 1020 provides the fetched instruction to instruction decode circuitry 1030 of an instruction pipeline 992. The decode circuitry 1030 decodes (1130) the instruction, and determines the addresses of any source operands that need to be fetched, such as the source byte rs1 specified by the source byte register address 424 and the permutation map rs2 specified by the permutation map register address 426.
The instruction decode circuitry 1030 provides the addresses of the operands that need to be fetched to operand fetch circuitry 1040 of the instruction pipeline 992. The operand fetch circuitry 1040 fetches (1140) the required source operands (e.g., zero, one, or two operands) from the operand registers 984. The operand fetch circuitry 1040 provides the fetched operands to instruction execute circuitry 1050 of the instruction pipeline 992. The instruction execute circuitry 1050 executes (1150) the decoded instruction, using the fetched operands. Certain instructions may be presented by the instruction execute circuitry 1050 to an arithmetic logic unit (ALU) 994 for execution. The ALU may be configured to execute arithmetic and logic operations using the source operands. Typically, execution by the ALU 994 may be performed in a single cycle of the clock, with extended instructions requiring two or more cycles. The instruction execute circuitry 1050 may also use other specialized components to execute instructions, such as a floating point unit (FPU) 996.
Results from the execution (1150) of the decoded instruction (if any) are provided to operand write circuitry 1060 of the instruction pipeline 992. The operand write circuitry 1060 performs (1160) a “write back,” providing the result(s) and the address(es) of the operand register(s) to which the result(s) are to be written to an operand write-back unit 998. The operand write-back unit 998 then writes (1164) the results into the specified operand registers 984. Depending upon the size of the resulting operand(s) and the size of the operand registers, extended operands that are longer than a single register may require more than one clock cycle to write back.
Register forwarding may also be used to forward an operand result back into the instruction execute circuitry 1050 for a next or subsequent instruction in the instruction pipeline 992, to be used as a source operand for execution of that instruction. For example, a compare circuit may compare the register source address of a next instruction with the register result destination address of the preceding instruction, and if they match, the execution result operand may be forwarded between pipeline stages to be used as the source operand for execution of the next instruction, such that the execution of the next instruction does not need to fetch the operand from the registers 984.
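The address comparison behind register forwarding can be sketched in software as follows. This is a simplified model under stated assumptions: the function and parameter names are hypothetical, the register file is modeled as a plain list, and only the immediately preceding instruction is considered.

```python
def select_operand(src_addr, registers, prev_dest_addr, prev_result):
    """Return the operand for src_addr, forwarding the previous result on a match."""
    if prev_dest_addr is not None and src_addr == prev_dest_addr:
        return prev_result          # forwarded between stages, no register fetch
    return registers[src_addr]      # ordinary operand fetch from the registers

registers = [0] * 32                # hypothetical register file with stale contents
forwarded = select_operand(5, registers, prev_dest_addr=5, prev_result=0xAB)
fetched = select_operand(6, registers, prev_dest_addr=5, prev_result=0xAB)
```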
The “E” data field bit 858 is inverted by an inverter 1314. An inverter performs a “NOT” operation, with the output of a NOT being the opposite of its input, such that a true (1) becomes false (0), and a false (0) becomes true (1). The output of a NOT operation may be noted by an exclamation point “!” added to its input, such that NOT rs1[0] may be expressed as !rs1[0].
The outputs Y0 to Y7 of the decoder 1312 are each connected to one input of a corresponding two-input AND gate (1316a to 1316h). For example, output Y0 is input into AND gate 1316a, output Y1 is input into AND gate 1316b, output Y2 is input into AND gate 1316c, and so on, with output Y7 being input into AND gate 1316h. The other input of each AND gate 1316a-h receives the output of the inverter 1314 (i.e., the inverted “E” data field value). The eight outputs M0 1320 to M7 1327 of the map decoder 1310m are the outputs of the eight AND gates 1316a-h, where the output of AND gate 1316a is decoder output M0 1320, the output of AND gate 1316b is decoder output M1 1321, and so on, with the output of AND gate 1316h being decoder output M7 1327.
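The gating of the decoder outputs by the inverted “E” value can be modeled as a minimal sketch. Several details here are assumptions made for illustration: that decoder 1312 is a 3-to-8 decoder driven by the TBO field, that nibble pm[n] occupies bits 4n+3 down to 4n of the permutation map, and that “E” is the top bit of each nibble. The function names are hypothetical.

```python
def split_nibble(pm, n):
    """Extract (E, TBO) for source bit n from a 32-bit map. Layout is ASSUMED."""
    nibble = (pm >> (4 * n)) & 0xF
    return (nibble >> 3) & 1, nibble & 0x7   # assumed: E = bit 3, TBO = bits 2..0

def map_decoder(e, tbo):
    """Return [M0, ..., M7] for one permutation map nibble."""
    y = [1 if tbo == k else 0 for k in range(8)]   # decoder outputs Y0..Y7 (one-hot)
    not_e = e ^ 1                                  # inverter on the E field
    return [yk & not_e for yk in y]                # one two-input AND gate per output
```

When “E” is true (1), every output M0 to M7 is forced to zero, so the source bit is mapped to no target bit; when “E” is false (0), exactly one output is true.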
Each M0 output 1320a to 1320h serves as one of the inputs into a corresponding two-input AND gate 1532a to 1532h. The other input of each AND gate 1532a to 1532h receives a corresponding bit rs1[0] 644a to rs1[7] 644h of the source byte rs1[7:0] 642. So the inputs into AND gate 1532a are M0 1320a and rs1[0] 644a, the inputs into AND gate 1532b are M0 1320b and rs1[1] 644b, the inputs into AND gate 1532c are M0 1320c and rs1[2] 644c, and so on, with the inputs into AND gate 1532h being M0 1320h and rs1[7] 644h.
The outputs of all the AND gates 1532a to 1532h are input into an eight-input XOR gate 1534. The output of XOR gate 1534 is the least-significant bit perbx[0] 1540a of the PERBX permutation result. The operand write circuitry 1060 provides perbx[0] 1540a to the operand write-back unit 998 to be written to the destination register rd 760 as the least-significant bit rd[0] 764a of the result byte rd[7:0] 762.
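The gate arrangement for one PERBX result bit can be mirrored in a short software model. The function name `bxor` follows the circuit label used in this description, but the argument layout is an assumption: `m_k[n]` is taken to be the decoder output for the result bit in question, as produced by the map decoder for source bit n.

```python
def bxor(m_k, rs1):
    """One PERBX result bit: m_k[n] is the map decoder output for source bit n."""
    # Eight two-input AND gates: pass source bit n only if it is mapped here.
    and_outputs = [m_k[n] & ((rs1 >> n) & 1) for n in range(8)]
    # Eight-input XOR gate: true when an odd number of the gated bits are true.
    out = 0
    for a in and_outputs:
        out ^= a
    return out
```

If no decoder output is true, every AND gate outputs zero and the XOR gate outputs zero, which matches the rule that an unmapped target bit is set to zero.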
The AND gates 1532a to 1532h and the XOR gate 1534 form a circuit bxor[0] 1530a that outputs one bit of the permutation perbx[0] 1540a. This circuit bxor[n] 1530 is duplicated for each of the bits [7:0] of the PERBX result byte. This is further illustrated in
In
The permutation map nibble decoders 1310a-h and the circuits bxor[7:0] 1530a-h may be part of the execute circuitry 1050 of the instruction pipeline 992, or may be included in circuitry associated with the execute circuitry 1050 of the instruction pipeline 992, such as in an ALU 994. In this way, the execute stage (1150) may execute an entirety of a PERBX instruction within a single clock cycle.
Each M0 output 1320a to 1320h serves as one of the inputs into a corresponding two-input AND gate 1732a to 1732h. Eight inverters 1731a-h invert the bits rs1[0] 644a to rs1[7] 644h of the source byte rs1[7:0] 642. The inverted source byte bits output by the inverters 1731a-h are each input into a corresponding AND gate 1732a to 1732h. So the inputs into AND gate 1732a are M0 1320a and !rs1[0], where the exclamation point indicates that the state of the bit is inverted by the NOT operation of the inverter. Likewise, the inputs into AND gate 1732b are M0 1320b and !rs1[1], the inputs into AND gate 1732c are M0 1320c and !rs1[2], and so on, with the inputs into AND gate 1732h being M0 1320h and !rs1[7].
The outputs of all the AND gates 1732a to 1732h are input into an eight-input NOR gate 1734. A “NOR” operation corresponds to an OR with an inverted output, such that the output of a NOR is true (1) if and only if all of the inputs are false (0). Otherwise, if any input is true (1), a NOR outputs a false (0).
All of the outputs M0 1320a-h from the permutation map nibble decoders 1310a-h are also input into an eight-input OR gate 1736. The output of the OR gate 1736 will be true (1) if any of the bits of the source byte rs1[7:0] 642 are mapped to the result bit rd[0] 764a.
The outputs of the OR gate 1736 and the NOR gate 1734 are input into an AND gate 1738. The output of AND gate 1738 is the least-significant bit perba[0] 1740a of the PERBA permutation result. The operand write circuitry 1060 provides bit perba[0] 1740a to the operand write-back unit 998 to be written to the destination register rd 760 as the least-significant bit rd[0] 764a of the result byte rd[7:0] 762.
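The following sketch mirrors the gates for one PERBA result bit. The function name `mapped_band` follows the circuit label used in this description; the argument layout is an assumption, with `m_k[n]` taken to be the decoder output for the result bit in question, as produced by the map decoder for source bit n.

```python
def mapped_band(m_k, rs1):
    """One PERBA result bit: m_k[n] is the map decoder output for source bit n."""
    inverted = [((rs1 >> n) & 1) ^ 1 for n in range(8)]      # eight inverters
    and_outputs = [m_k[n] & inverted[n] for n in range(8)]   # mapped AND source-is-0
    nor_out = 0 if any(and_outputs) else 1   # NOR: 1 only if no mapped source bit is 0
    or_out = 1 if any(m_k) else 0            # OR: 1 if any source bit is mapped here
    return or_out & nor_out                  # final AND forces 0 when nothing is mapped
```

The NOR of the inverted, gated source bits is equivalent (by De Morgan's law) to an AND over the mapped source bits alone, since unmapped bits contribute a zero to the NOR regardless of their state. The OR gate supplies the remaining case, forcing the result to zero when no source bit is mapped.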
The inverters 1731a-h, the AND gates 1732a-h, the NOR gate 1734, the OR gate 1736, and the AND gate 1738 form a circuit mapped_band[0] 1730a that outputs one bit of the permutation perba[0] 1740a. This circuit mapped_band[n] 1730 is duplicated for each of the bits [7:0] of the PERBA result byte. This is further illustrated in
In
The permutation map nibble decoders 1310a-h and the circuits mapped_band[7:0] 1730a-h may be part of the execute circuitry 1050 of the instruction pipeline 992, or may be included in circuitry associated with the execute circuitry 1050 of the instruction pipeline 992, such as in an ALU 994. In this way, the execute stage (1150) may execute an entirety of a PERBA instruction within a single clock cycle.
Used in conjunction with other bit-field permute instructions or other bit-field swap and rotate instructions, an entire word may be permuted. A “bit-field” is a contiguous block of “r” bit(s), where r>0. Each bit-field of a plurality of bit-fields to be permuted consists of a same number of “r” bits. A bit is a bit-field of one bit, a nibble is a bit-field of four bits, a byte is a bit-field of eight bits, etc. Bit-field permute instructions may be implemented for bit-fields where r>1 in a similar manner to the illustrated bit-wise (i.e., r=1) permutation operations. To support such instructions, the execute circuitry 1050 and/or ALU 994 may include additional versions of the circuits in
For example, a nibble-permute instruction operating on a thirty-two bit word may permute eight bit-fields of four bits each (i.e., r=4). An instruction format like that in
As permuted, the transfer of bits within a source bit-field to a result bit-field maintains the “significance” of each bit. Continuing with the nibble-permute example, if only source nibble rs[3:0] is permuted to result nibble rd[7:4], then rd[7] is set to the state of rs[3], rd[6] is set to the state of rs[2], rd[5] is set to the state of rs[1], and rd[4] is set to the state of rs[0]. If both rs[7:4] and rs[11:8] are permuted to rd[3:0] using a PERBX operation, then rd[3] is set to an XOR of the states of rs[7] and rs[11], rd[2] is set to an XOR of the states of rs[6] and rs[10], rd[1] is set to an XOR of the states of rs[5] and rs[9], and rd[0] is set to an XOR of the states of rs[4] and rs[8]. If no source bit-field is permuted to result bit-field rd[11:8], then each bit of rd[11:8] is set to a false state.
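The nibble-permute example above can be worked through with a small behavioral sketch. The function name and the `mapping` list (target nibble index per source nibble, or `None` if unmapped) are hypothetical; the source word 0x593 is chosen arbitrarily for illustration.

```python
def permute_nibbles_xor(rs, mapping):
    """mapping[n] is the target nibble for source nibble n of a 32-bit word, or None."""
    rd = 0
    for n, target in enumerate(mapping):
        if target is not None:
            nibble = (rs >> (4 * n)) & 0xF
            # Shifting the whole nibble keeps each bit's significance within the field.
            rd ^= nibble << (4 * target)
    return rd & 0xFFFFFFFF

# rs[3:0] -> rd[7:4]; rs[7:4] and rs[11:8] -> rd[3:0]; nothing maps to rd[11:8].
mapping = [1, 0, 0, None, None, None, None, None]
rd = permute_nibbles_xor(0x00000593, mapping)
```

Here rs[3:0] is 0x3, rs[7:4] is 0x9, and rs[11:8] is 0x5, so rd[7:4] becomes 0x3, rd[3:0] becomes 0x9 XOR 0x5 = 0xC, and rd[11:8] remains zero, giving rd = 0x3C.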
As noted above, other bit-field swap and rotate instructions may also be used in conjunction with the PERBA and PERBX instructions. Such swap instructions may be configured to rearrange bit-fields in a specific manner, such as reducing the significance of each byte in a word, while moving the least-significant byte to the most significant byte in a circular manner. So, for example, applying such a swap/rotate instruction to an input word in[31:0] to obtain an output word out[31:0], the contents of in[31:24] would be copied to out[23:16], the contents of in[23:16] would be copied to out[15:8], the contents of in[15:8] would be copied to out[7:0], and the contents of in[7:0] would be copied to out[31:24].
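The example byte rotation can be expressed as a single shift-and-merge on a 32-bit word. The function name below is hypothetical and the sketch models only the data movement, not any particular instruction encoding.

```python
def rotate_bytes_down(word):
    """Move each byte one position lower; the lowest byte wraps to the top."""
    low_byte = word & 0xFF
    return ((word >> 8) | (low_byte << 24)) & 0xFFFFFFFF

rotated = rotate_bytes_down(0x11223344)
```

For the input 0x11223344, in[31:24] = 0x11 moves to out[23:16], in[23:16] = 0x22 moves to out[15:8], in[15:8] = 0x33 moves to out[7:0], and in[7:0] = 0x44 wraps around to out[31:24], giving 0x44112233.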
As is known in the art, “states” in binary logic may be represented by two voltage levels: high or low. The example circuits herein are discussed in the context of a positive logic convention, sometimes referred to as “active high,” where a “true” equals high, and “false” equals low. However, the principles disclosed herein are equally applicable to a negative logic convention, sometimes referred to as “active low,” where a “true” equals low and a “false” equals high.
In the discussion of
The processor 900 may use any architecture, and may use any instruction set (e.g., RISC or CISC), with the addition of the permutation instructions and circuit enhancements described herein, to add the PERBX and PERBA operations to the architecture's execute circuitry 1050 and/or ALU 994. Also, although the operand registers 984 and instruction format 420 in the examples are 32 bits, other bit widths may be used.
Although the example source and result permutations are of a byte (8 one-bit bit-fields), a smaller permutation (e.g., two bit-fields or four bit-fields) or a larger permutation (e.g., 16 bit-fields) may be used, decreasing or increasing the number of TBO bits 856 accordingly (e.g., one TBO bit for two bit-field permutations, two TBO bits for four bit-field permutations, four TBO bits for sixteen bit-field permutations).
Depending upon the number of the bit-fields permuted and the width of the operand registers 984, more than one operand register 984 may be used to store the permutation map. If more than one operand register is used to store the permutation map, the instruction format 420 may include a single permutation map rs2 register address (e.g., 426), with the register address indicating a first operand register of a series of operand registers containing the permutation map to be fetched for the permutation operation.
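The relationship between the number of bit-fields, the TBO width, and the number of operand registers needed for the map can be estimated with a short calculation. This sketch assumes each map entry consists of one “E” bit plus the TBO bits and that entries are packed into registers without padding; both the packing and the function names are assumptions made for illustration.

```python
def tbo_bits(num_fields):
    """Bits needed to address any one of num_fields target bit-fields."""
    return (num_fields - 1).bit_length()

def map_registers(num_fields, register_width=32):
    """Registers needed for the whole map, assuming tightly packed entries."""
    bits = num_fields * (1 + tbo_bits(num_fields))   # one E bit plus TBO per entry
    return -(-bits // register_width)                # ceiling division
```

Under these assumptions, an eight bit-field map needs 8 x 4 = 32 bits and fits in one 32-bit register, while a sixteen bit-field map needs 16 x 5 = 80 bits and would span three 32-bit registers.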
Also, as an alternative to including a permutation map register address 426 in the instruction format, depending upon the size of the permutation map and the number of bits afforded by the instruction format, the permutation map may be directly encoded into the instruction as a series of binary values consisting of the E data field values 858 and TBO data field values 856.
Also, although the least-significant bit of the source bits rs1 642 in
The above aspects of the present disclosure are meant to be illustrative. They were chosen to explain the principles and application of the disclosure and are not intended to be exhaustive or to limit the disclosure. Many modifications and variations of the disclosed aspects may be apparent to those of skill in the art. Persons having ordinary skill in the field of computers, microprocessor design, and pipeline architectures should recognize that components and process steps described herein may be interchangeable with other components or steps, or combinations of components or steps, and still achieve the benefits and advantages of the present disclosure. Moreover, it should be apparent to one skilled in the art, that the disclosure may be practiced without some or all of the specific details and steps disclosed herein.
As used in this disclosure, the term “a” or “one” may include one or more items unless specifically stated otherwise. Further, the phrase “based on” is intended to mean “based at least in part on” unless specifically stated otherwise.