1. Technical Field of the Invention
This invention relates generally to programmable microprocessors, and more specifically to instructions for a digital signal processor which use bit-wise and byte-wise data movements to accomplish a variety of data manipulations.
2. Background Art
The data elements are conventionally addressed from 0 to N−1, where N is the number of data elements. Conventionally, bits within a byte are addressed 0-7 from the least significant bit to the most significant bit, and are shown ordered right to left. In the conventional little-endian data arrangement, the least significant byte within a multi-byte data element is stored at the lowest address and the most significant byte is stored at the highest address. In the less common big-endian data arrangement, the bytes within a multi-byte data element are stored in the opposite order; however, those skilled in the art know how to handle these differences, and the remainder of this disclosure will be in little-endian terms, for simplicity and consistency. In this disclosure, the data elements will be addressed as indicated by the hexadecimal digits shown above the register in the respective figure. The byte positions will be addressed as indicated by the hexadecimal digits shown in
Microprocessors, microcontrollers, digital signal processors, ASICs, and other programmable digital logic devices are commonly adapted to execute a variety of instruction types, such as addition, subtraction, multiplication, and so forth. One such type of operation is data movement instructions, such as shifts, rotates, and the like. Some data movement instructions are “bit-wise”, meaning that they are capable of moving data on single bit granularity, rather than e.g. byte granularity. Some data movement instructions are “byte-wise”, meaning that they move bytes around but keep the eight bits of any given byte intact, together, and in the same order, as the bytes are moved around. Other data movement instructions operate on larger data elements, such as words, doublewords, or quadwords, and move intact chunks of that size around without reordering the bits within any given chunk.
In general, the wider a shifter or rotator is made, the more complex its logic becomes, and the more time it takes to complete its operation.
Applicant has realized that, by combining byte-wise operations with bit-wise operations, many data manipulation operations can be simplified. Or, more precisely, the hardware required to perform them can be simplified. Additionally, Applicant has realized that a generalized byte-wise data manipulation operation can be used as a powerful, fundamental operation, to implement a wide variety of specific data movement operations upon a variety of element sizes.
The invention will be understood more fully from the detailed description given below and from the accompanying drawings of embodiments of the invention which, however, should not be taken to limit the invention to the specific embodiments described, but are for explanation and understanding only.
A processor according to the present invention may, in one embodiment, include a dedicated byte permutation unit as one of the execution units. It may also include a dedicated bit manipulation unit. Alternatively, the byte permute functionality and/or bit manipulation functionality can be implemented within one or more of the other execution units.
The present invention is centered on two capabilities, namely the ability to perform byte-wise permute operations and the ability to perform bit-wise data manipulation operations, together with the processor's ability to use one or both of them in implementing a variety of instructions.
The processor may additionally include a permute value table which provides predefined control values for some byte-wise permute operations, and/or permute value calculation logic which generates e.g. operand-dependent control values for other byte-wise permute operations.
The reader should make continued reference to
In the example shown, src3[0] contains the hexadecimal value 00, causing src1[0] to be copied to dest[0]; src3[1] contains the hexadecimal value 15, causing src2[1] to be copied to dest[1]; and so forth. In this disclosure, dashed lines will be used to show data flow from src2 to dest, and solid lines will be used to show data flow from src1 to dest, to help the reader correctly trace the arrows in the drawings. The dashed lines from src2 to dest traverse behind src1 in the drawings.
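The byte-wise permute operation can be modeled in software roughly as follows. This is only an illustrative sketch, not the circuitry itself. It assumes 16-byte registers held as Python lists with index 0 as the least significant byte, and it assumes that the high-order nibble of each control byte selects the source (0 for src1, 1 for src2) while the low-order nibble selects the byte within that source. The function name byte_permute is hypothetical.

```python
def byte_permute(src1, src2, ctrl):
    """Model of a byte-wise permute of two 16-byte sources.

    Assumed control encoding: bit 4 of each control byte picks the source
    register, and the low-order nibble picks the byte within that register.
    """
    assert len(src1) == len(src2) == len(ctrl) == 16
    dest = []
    for sel in ctrl:
        source = src2 if sel & 0x10 else src1  # bit 4 selects src2
        dest.append(source[sel & 0x0F])        # low nibble selects the byte
    return dest


src1 = list(range(0xA0, 0xB0))   # arbitrary illustrative data
src2 = list(range(0xC0, 0xD0))
ctrl = [0x00, 0x1F] + [0x05] * 14
dest = byte_permute(src1, src2, ctrl)
```

Under these assumptions, a control value of 00 copies src1[0], and a control value of 1F copies src2[F].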
Other processors, such as those implementing the AltiVec instruction set extension from IBM, Motorola, and Apple, have had such a byte-wise permute instruction in their instruction set architecture (ISA). Applicant is not the originator of this instruction nor its functionality. Applicant believes he is, however, the first to recognize that it may be used, alone or in combination with bit-wise operations, to implement a wide variety of other data manipulation instructions on a variety of data element sizes.
In the example shown, src3[0] contains the hexadecimal value 01, specifying that dest[word 0] should be loaded with src1[word 1], or, in other words, that dest[1:0] should be loaded with src1[3:2]. In one implementation, the processor generates a temporary control word temp from src3. The value 01 in src3[word 0] specifies src1[word 1], so the processor loads temp[1] with the hex value 03 and temp[0] with the hex value 02. The remaining bytes of temp are loaded appropriately. In one embodiment, the instruction decoder determines that the instruction's opcode specifies a word-wise permute, and the permute value calculation logic generates the values in temp according to the values in src3.
The permute value generation logic which generates temp from src3 for this instruction can be represented as follows (although it would typically be implemented as parallel circuitry rather than any sort of looping software).
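One possible software rendering of this control value generation is sketched below. It is an illustration under stated assumptions rather than the logic of any particular embodiment: words are taken to be 16 bits, the word selector for destination word w is assumed to sit in the low-order byte of word w of src3, and selectors 8 through F are assumed to address the words of src2. The name word_permute_control is hypothetical.

```python
def word_permute_control(src3):
    """Expand eight word selectors into sixteen byte-wise control values.

    src3 is a 16-byte list, index 0 being the least significant byte.
    Assumption: the selector for destination word w is in src3[2 * w].
    """
    temp = [0] * 16
    for w in range(8):
        sel = src3[2 * w] & 0x0F                # word selector, 0 through F
        temp[2 * w] = (2 * sel) & 0x1F          # low-order byte of that word
        temp[2 * w + 1] = (2 * sel + 1) & 0x1F  # high-order byte of that word
    return temp


src3 = [0] * 16
src3[0] = 0x01                 # dest[word 0] takes src1[word 1]
temp = word_permute_control(src3)
```

With the selector 01 in word 0, this yields temp[0] = 02 and temp[1] = 03, matching the example given above.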
With temp appropriately loaded with byte-wise permute values, the processor can simply execute the byte-wise permute instruction's operation, using temp instead of src3 as its control source.
The processor implements this functionality using the permute facility. The value from src2 (typically in src2[0]) is copied into each element temp[i]. In one implementation, the facility relies on the programmer to have loaded a valid (less than hexadecimal 10) value into src2. In another implementation, the processor forces each temp[i] value to be valid by performing
The processor then executes the byte-wise permute operation, and the specified byte of src1 is copied into each byte of dest.
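The replication of one source byte into every destination byte may be sketched as follows. The sketch assumes the control encoding in which the low-order nibble of a control byte selects a byte of src1. The optional masking step is only a guess at one way of forcing a valid value and is not taken from this disclosure; the names broadcast_control and byte_permute are hypothetical.

```python
def byte_permute(src1, src2, ctrl):
    # Assumed encoding: bit 4 selects src2, low nibble selects the byte.
    return [(src2 if c & 0x10 else src1)[c & 0x0F] for c in ctrl]


def broadcast_control(src2, force_valid=True):
    """Copy the selector held in src2[0] into every temp[i]."""
    sel = src2[0]
    if force_valid:
        sel &= 0x0F  # hypothetical way to keep the selector below hex 10
    return [sel] * 16


src1 = list(range(0x40, 0x50))           # arbitrary illustrative data
src2 = [0x03] + [0] * 15                 # selector: byte 3 of src1
temp = broadcast_control(src2)
dest = byte_permute(src1, src1, temp)    # every byte of dest is src1[3]
```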
The processor implements this functionality using the permute facility. The bytes of a temporary control register, temp[0] through temp[F], are loaded with the values 00 through 0F, except that the byte temp[0X] is loaded with the value 1Y, where X is the low-order nibble of the high-order quadword of src3 and Y is the low-order nibble of the low-order quadword of src3.
The processor then simply executes the permute operation, using temp instead of src3 as the control register.
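A software sketch of this control register setup follows. It takes X and Y directly as arguments rather than extracting them from src3, and it assumes the control encoding in which a high-order nibble of 1 selects src2. The names insert_control and byte_permute are hypothetical.

```python
def byte_permute(src1, src2, ctrl):
    # Assumed encoding: bit 4 selects src2, low nibble selects the byte.
    return [(src2 if c & 0x10 else src1)[c & 0x0F] for c in ctrl]


def insert_control(x, y):
    """Identity control values 00..0F, except temp[X] = 1Y."""
    temp = list(range(16))
    temp[x & 0x0F] = 0x10 | (y & 0x0F)
    return temp


src1 = list(range(0x20, 0x30))   # arbitrary illustrative data
src2 = list(range(0x80, 0x90))
temp = insert_control(4, 9)      # place src2[9] into position 4
dest = byte_permute(src1, src2, temp)
```

Every byte of dest then equals the corresponding byte of src1, except dest[4], which holds src2[9].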
The processor implements this functionality by loading a temporary control register temp with the values shown. The values have the following pattern. Each pair, from the low-order pair to the high-order pair, gets a next even value in its bytes' low-order nibbles. Each even-numbered byte gets a 0 in its high-order nibble, and each odd-numbered byte gets a 1 in its high-order nibble. When the processor then executes the byte-wise permute operation using temp as the control source, this picks the low-order (even-numbered) bytes alternately from src1, src2, src1, src2, and so on.
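The described pattern of control values can be generated as in the following sketch, which is illustrative only and assumes 16-byte registers with index 0 as the least significant byte. The name pack_low_bytes_control is hypothetical.

```python
def pack_low_bytes_control():
    """Build the control pattern 00 10 02 12 ... 0E 1E (low to high order)."""
    temp = []
    for pair in range(8):
        even = 2 * pair            # next even value for this pair
        temp.append(0x00 | even)   # even-numbered byte: high nibble 0 (src1)
        temp.append(0x10 | even)   # odd-numbered byte: high nibble 1 (src2)
    return temp


temp = pack_low_bytes_control()
```

Read from the low-order end, the resulting values are 00, 10, 02, 12, and so on up to 0E, 1E.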
Upon encountering this instruction, the processor performs sign bit replication (not shown by arrows) of the sign bits of src1 into temp2, as explained re
The processor loads the indicated values into the temp3 register, then uses it as the permute control for extracting bytes from the temp2 and temp1 registers and writing the extracted bytes to the dest register.
In the embodiment shown, the instruction performs an “interleaved pack”—the low-order bytes from the two respective sources' words are written to the destination in alternating order, e.g. even-numbered destination bytes come from src1, and odd-numbered destination bytes come from src2. In another embodiment, the instruction performs a “concatenated pack” in which e.g. destination bytes 0 through 7 come from src1, and destination bytes 8 through F come from src2. The difference is simply that in the latter case, the processor will put different permute control values into temp3.
The processor implements this functionality by loading the temp control register with the values shown. The pattern of the values is that they count upward from 01 by twos. After the temp register is loaded, the processor can then simply execute the byte-wise permute instruction using temp as the control register.
The processor loads the temp control register as shown, then executes the byte-wise permute instruction.
The processor includes a 256-bit shifter (shown as “sh”) which, for ease of implementation, has been constructed such that it is not necessarily able to perform a full-width shift within the available time (e.g. clock cycle). In the implementation shown, the 256-bit shifter is capable of up to a 7-bit-position shift. The processor uses the low-order three bits src3[2:0] to control the shifter. In the particular case shown, 101 (decimal) in src3 equals 12*8+5, and src3[2:0] will contain the decimal value 5 (with the remaining 96 represented in the higher-order bits of src3).
The processor writes the shifted 256-bit value to 256-bit temporary register temp3, then copies the high-order 16 bytes into temp2 and the low-order 16 bytes into temp1. Alternatively, the shifter output could be written directly into temp2 and temp1 as indicated.
The processor then writes the value src3[7:3], which happens to be 0C in the case of src3 = 101 decimal, into permute control register location temp4[0], and sequentially higher values into temp4[1] through temp4[F]. More specifically, it writes the low-order 5 bits of sequentially higher values into those locations, zeroing the high-order 3 bits of the values written; this accommodates wrap-around if the src3 value was greater than 128.
The processor then executes the byte-wise permute operation, writing the results to dest. Thus, a fine-grain (sub-byte) shift is used to get the operand data into a configuration in which a coarse-grain (byte-wise) permute can be used to effect a shift that is significantly greater (in terms of the shift count) than the shifter can itself perform. This enables the shifter to be significantly simplified and sped up and its area and power consumption reduced.
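The two-step sequence can be modeled as in the sketch below. It is a software illustration only. It assumes a rightward shift of the 256-bit src2:src1 conglomerate, models the 128-bit sources as Python integers, and uses the hypothetical name shift_right_256. The fine step shifts by at most 7 bit positions, and the coarse step is a byte selection over the 32 shifted bytes.

```python
MASK128 = (1 << 128) - 1
MASK256 = (1 << 256) - 1


def shift_right_256(src1, src2, count):
    """Return the low-order 128 bits of (src2:src1) shifted right by count."""
    wide = (src2 << 128) | src1
    fine = count & 0x7                     # src3[2:0] drives the limited shifter
    coarse = (count >> 3) & 0x1F           # src3[7:3] seeds the permute control
    shifted = (wide >> fine) & MASK256     # fine-grain shift, 0 to 7 positions
    temp = shifted.to_bytes(32, "little")  # temp2:temp1 viewed as 32 bytes
    temp4 = [(coarse + i) & 0x1F for i in range(16)]  # low-order 5 bits only
    dest = bytes(temp[c] for c in temp4)   # coarse-grain byte-wise permute
    return int.from_bytes(dest, "little")


src1 = 0x0123456789ABCDEF_FEDCBA9876543210   # arbitrary illustrative data
src2 = 0x0F1E2D3C4B5A6978_8796A5B4C3D2E1F0
result = shift_right_256(src1, src2, 101)    # 101 = 12*8 + 5
```

For a count of 101, the fine shift is 5 bit positions and the control values run from 0C upward, so the result equals a direct 101-bit right shift of the conglomerate truncated to 128 bits.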
Rotate instructions can be similarly implemented.
Upon encountering the rotate left byte data instruction, the processor loads the temporary control register temp with the sequential values as shown. Each value is simply the number of its byte position within the register. The processor then executes the byte data left rotate by passing each src[i] byte to its corresponding rotator, and the result from each rotator is written to a respective, corresponding byte of a temporary destination register temp2. The processor then executes the byte-wise permute operation using temp as the control, temp2 as the source, and dest as the destination. With the sequential values in temp, no byte-wise movement is caused.
In one embodiment, the one 256-bit-wide shifter of
Assume that the instruction set architecture (ISA) of the processor mandates that the processor be able to execute up to 32-bit rotates on doubleword (32-bit) data elements. In one implementation, not using this invention, the processor could be provided with four 32-bit rotators each capable of rotating any number of bit positions between 0 and 32. Such a rotator is somewhat complex and its design may limit the maximum clock speed of the processor.
More advantageously, the processor can be constructed to utilize the present invention's byte-wise permute operation in combination with a less capable, simplified rotator. For example, as illustrated, each rotator may be capable of no more than 16-bit rotation.
The processor loads the temporary control register temp with the values shown, and provides each doubleword value from src1 to its respective rotator. The rotator is 32 bits wide, but is capable of only 16 bit positions' rotation at a time. The processor takes the rotate count supplied by the instruction, and provides it modulo 16 (by sending only the low-order four bits) to each of the rotators. The outputs of the rotators are written to respective doublewords in temp2. This is a “fine grain rotate” operation.
The processor then performs a “coarse grain rotate” operation to complete the rotate instruction. In one implementation, the processor may include a set of multiplexers each wired to receive values from two byte positions in temp2, as shown; one is a straight pass-through, and one is two bytes removed within the doubleword. The processor can then use the fifth bit position of the rotate count specified by the instruction, to control which of these two values is muxed through to the corresponding byte position in dest. The fifth bit position is the “16's value”, and is 1 if the shift count is between 16 and 31.
Alternatively, the processor can use this fifth bit position in determining whether to load temp with the values shown, or with sequential “00 01 02 . . . 0F” (from lsb to msb, right to left) values. Then, after the fine-grain rotate, the processor can simply invoke the byte-wise permute operation. In this implementation, the coarse rotate multiplexers are not needed and can be omitted from the machine.
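The fine grain and coarse grain steps can be modeled together as in the following sketch. It is illustrative only and assumes a leftward rotate of four little-endian doublewords held in a 16-byte list. The half-swapping control values are an assumption consistent with selecting the byte two positions removed within each doubleword. The function names are hypothetical.

```python
def rotl32_limited(value, count):
    """Simplified rotator: rotates a 32-bit value left by 0 to 15 positions."""
    count &= 0xF                                   # rotate count modulo 16
    return ((value << count) | (value >> (32 - count))) & 0xFFFFFFFF


def coarse_control(count):
    """Identity control, or half-swapping control if the 16's bit is set."""
    if count & 0x10:
        # Within each doubleword, take the byte two positions removed.
        return [(i & ~3) | ((i + 2) & 3) for i in range(16)]
    return list(range(16))


def rotate_left_dwords(src1, count):
    temp2 = bytearray()
    for d in range(4):                             # fine grain rotate
        dword = int.from_bytes(bytes(src1[4 * d:4 * d + 4]), "little")
        temp2 += rotl32_limited(dword, count).to_bytes(4, "little")
    temp = coarse_control(count)                   # coarse grain permute
    return [temp2[c] for c in temp]


dest = rotate_left_dwords(list(range(1, 17)), 21)  # 21 = 16 + 5
```

A rotate count of 21 is thus carried out as a 5-position rotate in the simplified rotators followed by a swap of the 16-bit halves of each doubleword.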
The src1 data are fine grain rotated by the 32-bit rotators, using the rotate count modulo 16, and the results are written to an intermediate destination register temp3. The processor then invokes the permute operation using temp3 as the source and coarse ctrl as the control, and writes the results to dest.
If, at various points in this disclosure, the inventors state that e.g. “src1 contains the value 128” or simply “src1 is 128” or the like, it is really meant that “src1 points at a register or memory location which contains the value 128”. This looseness in terminology is commonplace in the industry and well understood by those of skill in the art.
The value held in src3 indicates which bit position in the source identified by src1 is to be copied into the least significant bit (LSB) of the result. The remaining bits of the result are taken from consecutive adjacent bits in the registers identified by src1 and src2, as shown.
For convenience of illustration and clarity of explanation, when a register or a bit position in a register is indicated as containing a specified value, the value is preceded with “VAL”. For example, in the example given, register R3 is shown to hold the value 7, indicated as “VAL 7”. And when a bit position's particular value is unspecified, it is identified simply by its bit position and is preceded with “b”. For example, bits 0 through 127 of register R1 are identified as “b000” through “b127” respectively, and bits 0 through 127 of register R2 are identified as “b128” through “b255” respectively (because, in this form of the instruction, the registers identified by src1 and src2 are treated as one 256-bit conglomerate source). The vertical split between src1 and src2 is merely for visual clarity of the illustration.
Thus, in the example given, 128 bits are selected from the R2:R1 conglomerate source, with the LSB selected from bit position b007 (as specified by R3), and the result written to destination register R4 includes bits b127:b007 from R1 and bits b134:b128 from R2.
It should be noted that src3 can be used to perform a conditional selection between src1 and src2. If src3 is 0, src1 will be copied in its entirety to dest, but if src3 is 128, src2 will be copied in its entirety to dest.
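The bit field selection can be modeled in a few lines, as sketched below. The sketch treats the 128-bit registers as Python integers, assumes src3 lies between 0 and 128 inclusive, and uses the hypothetical name bit_field_select.

```python
MASK128 = (1 << 128) - 1


def bit_field_select(src1, src2, src3):
    """Select 128 bits from the src2:src1 conglomerate, starting at bit src3."""
    wide = (src2 << 128) | src1
    return (wide >> src3) & MASK128


r1 = 0x0123456789ABCDEF_FEDCBA9876543210   # arbitrary illustrative data
r2 = 0x0F1E2D3C4B5A6978_8796A5B4C3D2E1F0
r4 = bit_field_select(r1, r2, 7)           # LSB taken from bit 7 of r1
```

With an offset of 7, the low-order 121 bits of the result come from bits 127 through 7 of the first source and the high-order 7 bits come from the low-order bits of the second source. Offsets of 0 and 128 return the first and second source respectively.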
With both source operands pointed at the same register R1, any 128-bit field selection effectively performs a rotate right, because the LSB of the src2 source is adjacent the MSB of the src1 source, implicitly performing the wrap-around of the rotate operation.
In the example given, the rotate count is 6 in R2, and the destination register is written with the values shown, with b006 in the LSB and b005 in the MSB, with the wrap-around occurring 6 bits from the MSB.
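The rotate right behavior follows directly from the same selection model, as the sketch below suggests. It is an illustration only, with 128-bit registers treated as Python integers and the hypothetical names bit_field_select and rotate_right.

```python
MASK128 = (1 << 128) - 1


def bit_field_select(src1, src2, src3):
    """Select 128 bits from the src2:src1 conglomerate, starting at bit src3."""
    return (((src2 << 128) | src1) >> src3) & MASK128


def rotate_right(r1, count):
    # Both source designators point at the same register.
    return bit_field_select(r1, r1, count)


r1 = 0x0123456789ABCDEF_FEDCBA9876543210   # arbitrary illustrative data
rotated = rotate_right(r1, 6)
```

With a count of 6, bit 6 of the source lands in the LSB of the result and bit 5 lands in the MSB.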
Because this is a leftward operation rather than a rightward operation, the compiler has to do a bit of setup to get the desired result. For a left shift count N specified by the source code, the compiler loads the value 128−N into the src3 register. In the example given, the shift count was 10, and the compiler has placed the value 118 into src3. This is because the bit field selection needs to select N bits from the MSB end of src1, rather than from the LSB end. When the instruction is then executed, 128 bits are copied from src2[(src3−1):0] and src1[127:src3].
Alternatively, if the machine does not include a special ZeroReg, the compiler can in a previous instruction load the value 0 into some register, then point src1 at that register in the bit field selection instruction that is to perform the shift left.
128 bits are selected from src2[(127−src3):0] and src1[127:src3]. The destination will then contain a value having the number of leading zeroes specified as the shift count.
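Both shift directions can be expressed with the same bit field selection model, as in the following sketch. It assumes that a left shift points src1 at a zero-valued register and src2 at the data with an offset of 128 minus the shift count, and that a logical right shift points src1 at the data and src2 at a zero-valued register; the latter arrangement is an inference from the leading zeroes described above. The function names are hypothetical.

```python
MASK128 = (1 << 128) - 1


def bit_field_select(src1, src2, src3):
    """Select 128 bits from the src2:src1 conglomerate, starting at bit src3."""
    return (((src2 << 128) | src1) >> src3) & MASK128


def shift_left(value, n, zero_reg=0):
    # src1 -> zero register, src2 -> data, src3 = 128 - n (compiler setup)
    return bit_field_select(zero_reg, value, 128 - n)


def shift_right_logical(value, n, zero_reg=0):
    # src1 -> data, src2 -> zero register, src3 = n (assumed arrangement)
    return bit_field_select(value, zero_reg, n)


value = 0x0123456789ABCDEF_FEDCBA9876543210   # arbitrary illustrative data
left = shift_left(value, 10)                  # offset 118, as in the example
right = shift_right_logical(value, 10)
```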
The compiler sets up the bit field selection instruction such that src1 points at the source register R1, src2 points at this replicated sign bit register R2, and the shift count is loaded into a register pointed at by src3. Then, when the bit field selection instruction is executed, 128 bits are copied from R2[R3−1:0] and R1[127:R3]. The result written to R4 will thus be right shifted by R3 bit positions, and will contain R3+1 copies of the original sign bit (including the original sign bit itself).
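An arithmetic right shift built this way can be sketched as follows. The sketch is illustrative only; it models the replicated sign bit register as an all-ones or all-zeroes 128-bit integer and uses the hypothetical names bit_field_select and arithmetic_shift_right.

```python
MASK128 = (1 << 128) - 1


def bit_field_select(src1, src2, src3):
    """Select 128 bits from the src2:src1 conglomerate, starting at bit src3."""
    return (((src2 << 128) | src1) >> src3) & MASK128


def arithmetic_shift_right(r1, count):
    sign = (r1 >> 127) & 1
    r2 = MASK128 if sign else 0            # replicated sign bit register
    return bit_field_select(r1, r2, count)


r1 = (1 << 127) | 0x1234                   # negative value, illustrative data
r4 = arithmetic_shift_right(r1, 8)
```

For a negative source and a count of 8, the nine high-order bits of the result are all copies of the sign bit.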
If the previous instruction had, instead, determined the sign bit to be a zero, the code would then have branched to the bit field instruction of
A multiplexer is coupled to receive the value from the register R2 pointed at by src3, and also the output of a subtractor which subtracts that value from 128. The multiplexer is controlled by a signal which specifies whether the instruction is performing a leftward or a rightward operation. In some embodiments, this control signal is generated as a function of the opcode (not shown) of the instruction, specifically those bit(s) which indicate that it is a shift left. The destination register R3 is written with src2[(127−src3):0] and ZeroReg[127:(128−src3)].
The 8-bit first SIMD element of src1 and the 8-bit first SIMD element of src2 are treated as a 16-bit value. The bit field instruction selects an 8-bit field from that 16-bit value and writes it into the first 8-bit SIMD element of dest. In the example shown, this includes, from the LSB toward the MSB, b03 through b07 from src1 and b32 through b34 of src2. If src3 had specified a value larger than 7, the bit field selection operation would have written low order dest bits from src2 and then “wrapped around” to continue picking higher order bits from src1.
The same operation is performed simultaneously for corresponding sets of the second, third, and fourth SIMD elements of the sources and destination, as shown.
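The per-element selection, including the wrap-around case, can be modeled as in the sketch below. It is an illustration under stated assumptions: each source is a list of 8-bit elements, the same offset applies to every element, and the offset is reduced modulo 16. The name simd_bit_field_select is hypothetical.

```python
def simd_bit_field_select(src1, src2, src3):
    """Select an 8-bit field from each 16-bit src2:src1 element pair."""
    k = src3 % 16                       # assumed: offset taken modulo 16
    dest = []
    for lo, hi in zip(src1, src2):
        pair = (hi << 8) | lo           # 16-bit value, src1 element low
        rotated = ((pair >> k) | (pair << (16 - k))) & 0xFFFF  # wrap-around
        dest.append(rotated & 0xFF)     # keep the low-order 8 bits
    return dest


src1 = [0xA8, 0x00, 0xFF, 0x12]         # arbitrary illustrative data
src2 = [0x05, 0xFF, 0x00, 0x34]
dest = simd_bit_field_select(src1, src2, 3)
```

With an offset of 3, the first destination element takes bits 3 through 7 of the first src1 element followed by bits 0 through 2 of the first src2 element.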
It is not necessary to provide the user with an exhaustive list detailing every possible way that the flexible permute operation can be used to perform other, more rigid data movement operations. Nor is it necessary to provide the user with an exhaustive list detailing every possible way in which bit-wise data manipulations can be combined with the flexible permute operation to perform bit-wise data movements in the absence of complex, dedicated hardware. After reading this disclosure and studying the examples given in the various drawings, the reader will appreciate these principles and understand how to apply them to any data movement operation that happens to be required in his application at hand. The invention has been discussed in terms of various implementations in which the smallest “coarse grain” data element is the 8-bit byte, but the invention is not so limited; in other implementations, the smallest coarse grain data element might be, for example, a 12-bit pixel value, or a 16-bit floating point value, or what have you. The smallest coarse grain data element can, regardless of its size, be referred to as a “base element” or an element having a “base size”. Rotates, shifts, shuffles, merges, explodes, permutes, and the like may collectively be termed “data rearrangement instructions”. Registers, memory locations, latches, gates, and the like may collectively be termed “data storage locations”.
The invention has been described with reference to its use in implementing a machine adapted for performing instructions such as rotate, shift, permute, pack, unpack, bit field selection, merge, expand, and so forth. It may also be used in performing other instructions, such as move, insert, and so forth. The invention may be used in a processor of any type of architecture, whether RISC, CISC, VLIW, or what have you. It may be used in processors that are microcoded, as well as those which are not. It may be used in processors which are primarily designed for digital signal processing, as well as those adapted for more general purpose use. It may be used in any particular type of system, such as embedded control systems, cell phones, personal digital assistants, computers, consumer electronic devices, automotive systems, and so forth. It may be used in a processor which is adapted to execute instructions from exactly one single ISA, or in a processor which is adapted to execute instructions from two or more ISAs.
Instructions may be encoded in a wide variety of manners. In some instances, each instruction field (e.g. opcode, first operand designator, second operand designator, destination designator, immediate data, and so forth) occupies a contiguous group of bits in the instruction. In other embodiments, the bits may be scattered and the designators interleaved with each other. In some, various ones of the designators may be implicit rather than explicit; for example, certain instructions may always use the src1 as the dest, overwriting src1; as another example, certain other instructions may always write to a predetermined dest such as R1.
While the invention has been described with reference to embodiments in which e.g. the src3 value indicates an offset from the low order end of src1, in other embodiments the src3 value could indicate an offset from the high order end of src2.
When one component is said to be adjacent another component, it should not be interpreted to mean that there is absolutely nothing between the two components, only that they are in the order indicated. The various features illustrated in the figures may be combined in many ways, and should not be interpreted as though limited to the specific embodiments in which they were explained and shown. Those skilled in the art having the benefit of this disclosure will appreciate that many other variations from the foregoing description and drawings may be made within the scope of the present invention. Indeed, the invention is not limited to the details described above. Rather, it is the following claims including any amendments thereto that define the scope of the invention.
In the following claims, designators such as “first” and “second” (e.g. in “first operand designator” and the like) are not intended to imply any particular order of their bit fields or bits within the instruction.
This application is a continuation-in-part of application Ser. No. 11/270,213 “Bit-Wise Operation Followed by Byte-Wise Permutation for Implementing DSP Data Manipulation Instructions” filed Nov. 08, 2005 by Gregory M. Thornton. Both applications are commonly assigned to Stexar Corporation.
Relation | Number | Date | Country |
---|---|---|---|
Parent | 11270213 | Nov 2005 | US |
Child | 11400434 | Apr 2006 | US |