Many processing systems execute instructions. The ability to generate, store, and/or access instructions is thus desirable.
In some processing systems, a Single Instruction, Multiple Data (SIMD) instruction is simultaneously executed for multiple operands of data in a single instruction period. For example, an eight-channel SIMD execution engine might simultaneously execute an instruction for eight 32-bit operands of data, each operand being mapped to a unique compute channel of the SIMD execution engine. An ability to generate, store and/or access such instructions may thus be desirable.
Some embodiments described herein are associated with a “processing system.” As used herein, the phrase “processing system” may refer to any system that processes data. In some embodiments, a processing system includes one or more devices. In some embodiments, a processing system is associated with a graphics engine that processes graphics data and/or other types of media information. In some cases, the performance of a processing system may be improved with the use of a SIMD execution engine. For example, a SIMD execution engine might simultaneously execute a single floating point SIMD instruction for multiple channels of data (e.g., to accelerate the transformation and/or rendering of three-dimensional geometric shapes). Other examples of processing systems include a Central Processing Unit (CPU) and a Digital Signal Processor (DSP).
The memory unit 115 may store instructions and/or data (e.g., scalars and vectors associated with a two-dimensional image, a three-dimensional image, and/or a moving image). In some embodiments, the memory unit 115 includes an instruction memory unit 130 and data memory unit 140, which may store instructions and data, respectively. The instruction memory unit 130 and/or the data memory unit 140 might be associated with separate instruction and data caches, a shared instruction and data cache, separate instruction and data caches backed by a common shared cache, or any other cache hierarchy. In some embodiments, the instruction memory unit 130 and/or the data memory unit 140 comprise one or more RAM units. In some embodiments, the memory unit 115, or one or more portions thereof (e.g., the instruction memory unit 130 and/or the data memory unit 140) comprises a hard disk drive (e.g., to store and provide media information) and/or a non-volatile memory such as FLASH memory (e.g., to store and provide instructions and data).
The memory unit 115 may be coupled to the processor 110 through one or more communication links. In the illustrated embodiment, for example, the instruction memory unit 130 and the data memory unit 140 are coupled to the processor through a first communication link 150 and a second communication link 160, respectively.
As used herein, a processor may be implemented in any manner. For example, a processor may be programmable or non-programmable, general purpose or special purpose, dedicated or non-dedicated, distributed or non-distributed, shared or not shared, and/or any combination thereof. If the processor has two or more distributed portions, the two or more portions may communicate with one another through a communication link. A processor may include, for example, but is not limited to, hardware, software, firmware, hardwired circuits and/or any combination thereof.
Also, as used herein, a communication link may comprise any type of communication link, for example, but not limited to, wired (e.g., conductors, fiber optic cables) or wireless (e.g., acoustic links, electromagnetic links or any combination thereof including, for example, but not limited to microwave links, satellite links, infrared links), and/or combinations thereof, each of which may be public or private, dedicated and/or shared (e.g., a network). A communication link may or may not be a permanent communication link. A communication link may support any type of information in any form, for example, but not limited to, analog and/or digital (e.g., a sequence of binary values, i.e. a bit string) signal(s) in serial and/or in parallel form. The information may or may not be divided into blocks. If divided into blocks, the amount of information in a block may be predetermined or determined dynamically, and/or may be fixed (e.g., uniform) or variable. A communication link may employ a protocol or combination of protocols including, for example, but not limited to the Internet Protocol.
As stated above, many processing systems execute instructions. The ability to generate, store and/or access instructions is thus desirable.
In some embodiments, a first processing system is used in generating instructions for a second processing system.
According to some embodiments, the first processing system 210 is used in generating instructions for the second processing system 220. In that regard, in some embodiments, the system 200 may receive an input or first data structure indicated at 240. The first data structure 240 may be received through a second communication link 250 and may include, but is not limited to, a first plurality of instructions, which may include instructions in a first language, e.g., a high level language or an assembly language.
The first data structure 240 may be supplied to an input of the first processing system 210, which may include a compiler and/or assembler that compiles and/or assembles one or more parts of the first data structure 240 in accordance with one or more requirements associated with the second processing system 220. An output of the first processing system 210 may supply a second data structure indicated at 260. The second data structure 260 may include, but is not limited to, a second plurality of instructions, which may include instructions in a second language, e.g., a machine language.
The second data structure 260 may be supplied through the first communication link 230 to an input of the second processing system 220. The second processing system 220 may execute one or more of the second plurality of instructions and may generate data indicated at 270. The second processing system 220 may be coupled to one or more external devices (not shown) through one or more communication links, e.g., a third communication link 280, and may supply some or all of the data 270 to one or more of such external devices through one or more of such communication links.
In some embodiments, the first processing system 210 and/or the second processing system 220 may have a configuration that is the same as and/or similar to one or more of the processing systems disclosed herein, for example, the processing system 100 illustrated in
In some embodiments, the first processing system 210 and/or the second processing system 220 may be used without the other. For example, the first processing system 210 may be used without the second processing system 220. The second processing system 220 may be used without the first processing system 210.
In some embodiments, one or more instructions for the second processing system 220 are stored in one or more memory units (e.g., one or more portions of memory unit 115 (
At 302, a data structure is received in a first processing system. The data structure represents a plurality of instructions for a second processing system. The first processing system may be, for example, an assembler, a compiler and/or a combination thereof. The plurality of instructions might be, for example, a plurality of machine code instructions to be executed by an execution engine of the second processing system. The plurality of instructions may include more than one type of instruction.
At 304, it is determined, for at least one of the plurality of instructions, whether the instruction can be replaced by a compact instruction (e.g., an instruction that represents the instruction and is more compact than the instruction) for the second processing system. According to some embodiments, a criterion is employed in determining whether the instruction can be replaced by a compact instruction. In such embodiments, determining whether the instruction can be replaced by a compact instruction may include determining whether the instruction satisfies the criterion. At 306, if the instruction can be replaced by a compact instruction, a compact instruction is generated based at least in part on the instruction. The compact instruction may have a length that is less than a length of the instruction replaced by such compact instruction. Thus, in some embodiments, less memory may be needed to store the compact instruction. In some embodiments, the compact instruction may include a field indicating that the compact instruction is a compact instruction.
In some embodiments, it may be determined, for each of the plurality of instructions, whether the instruction can be replaced by a compact instruction (e.g., an instruction that represents the instruction and is more compact than the instruction) for the second processing system. In some such embodiments, if the instruction can be replaced by a compact instruction, a compact instruction is generated based at least in part on the instruction.
According to some embodiments, the method may further include replacing the instruction with the compact instruction. For example, the instruction may be removed from the data structure and the compact instruction may be added to the data structure. The position of the compact instruction might be the same as the position at which the instruction resided, prior to removal of such instruction.
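The determine-and-replace flow at 304 and 306 might be sketched as follows in Python. This is only an illustration under stated assumptions: the `Instr` record, the `compact_pass` function, and the toy criterion in `toy_try_compact` (a 64-bit instruction whose upper 32 bits are zero) are hypothetical and are not taken from the embodiments described herein.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass(frozen=True)
class Instr:
    bits: int      # encoded instruction value
    length: int    # instruction length in bits
    compact: bool  # stands in for the field marking a compact instruction


def compact_pass(
    instrs: List[Instr],
    try_compact: Callable[[Instr], Optional[Instr]],
) -> List[Instr]:
    """Replace each compactable instruction at its original position."""
    result = []
    for instr in instrs:
        replacement = try_compact(instr)  # None: criterion not satisfied
        result.append(instr if replacement is None else replacement)
    return result


def toy_try_compact(instr: Instr) -> Optional[Instr]:
    # Hypothetical criterion: upper 32 bits of a 64-bit instruction are zero.
    if instr.length == 64 and instr.bits >> 32 == 0:
        return Instr(bits=instr.bits, length=32, compact=True)
    return None


program = [
    Instr(0x0000_0000_0000_1234, 64, False),
    Instr(0xABCD_0000_0000_0001, 64, False),
    Instr(0x0000_0000_0000_00FF, 64, False),
]
compacted = compact_pass(program, toy_try_compact)
```

In this sketch the output sequence has the same number of entries as the input, so each compact instruction occupies the position previously held by the instruction it replaces.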
In some embodiments, the first processing system 210 may receive the first data structure 240 through the communication link 250. As stated above, the first data structure 240 may include, but is not limited to, a first plurality of instructions, which may include instructions in a first language, e.g., a high level language or an assembly language.
The first data structure 240 may be supplied to an input of the compiler and/or assembler 410. The compiler and/or assembler 410 includes a compiler, an assembler, and/or a combination thereof, that compiles and/or assembles one or more parts of the first data structure 240 in accordance with one or more requirements associated with the second processing system 220.
The compiler and/or assembler 410 may generate a data structure indicated at 440. The data structure 440 may include, but is not limited to, a plurality of instructions, which may include instructions in a second language, e.g., a machine language. In some embodiments, the plurality of instructions may be a plurality of machine code instructions to be executed by an execution engine of the second processing system 220. In some embodiments, the plurality of instructions may include more than one type of instruction.
The data structure 440 may be supplied to an input of the compactor 420, which may process each instruction in the data structure 440 to determine whether such instruction can be replaced by a compact instruction for the second processing system 220. If the instruction can be replaced, the compactor 420 may generate a compact instruction to replace such instruction. In some embodiments, the compactor 420 generates the compact instruction based at least in part on the instruction to be replaced. In some embodiments, the compact instruction includes a field indicating that the compact instruction is a compact instruction.
In accordance with some embodiments, the compactor 420 may replace the instruction with the compact instruction. In that regard, the plurality of instructions may represent a sequence of instructions. The instruction may be removed from its position in the sequence and the compact instruction may be inserted at such position in the sequence such that the position of the compact instruction in the sequence is the same as the position of the instruction replaced thereby, prior to removal of such instruction from the sequence.
In some embodiments, the position of each instruction within a sequence of instructions may be defined in any of various ways, for example, but not limited to, by a physical ordering of the instructions, by use of pointers that define the position or ordering of the instructions in the sequence, or any combination thereof. An instruction may be removed from a sequence by, for example, but not limited to, physically removing the instruction from a physical ordering, by updating any pointer(s) that may define the position or ordering, by creating another data structure that includes the sequence of instructions less the instruction being removed, or any combination thereof. An instruction may be added to a sequence by, for example, but not limited to, physically adding the instruction to a physical ordering, by updating any pointer(s) that may define the position or ordering, by creating another data structure that includes the sequence of instructions plus the instruction being added, or any combination thereof.
The data structure may further have a length and a width. The length may indicate the number of locations and/or addresses in the data structure. The width may indicate the number of bits provided at each location and/or address in the data structure. In some embodiments, each location may include one or more sections, e.g., section 0 through section 1.
In some embodiments, each of the plurality of instructions has the same length as one another, which may or may not be equal to the width of the data structure. In some embodiments, one or more of the plurality of instructions may have a length that is different than the length of one or more other instructions of such plurality of instructions.
The plurality of instructions may define a sequence of instructions, e.g., instruction 1, instruction 2, instruction 3, instruction 4, instruction 5, instruction 6. Each instruction in the sequence of instructions may be disposed at a respective position in the sequence, e.g., instruction 1 may be disposed at a first position in the sequence, instruction 2 may be disposed at a second position in the sequence, instruction 3 may be disposed at a third position in the sequence, and so on.
The data structure may further have a length and a width. The length may indicate the number of locations and/or addresses in the data structure. The width may indicate the number of bits provided at each location and/or address in the data structure. In some embodiments, each location may include one or more sections, e.g., section 0 through section 1.
One or more of the plurality of instructions may be a compact instruction. In the illustrated embodiment, for example, instruction 1, instruction 3 and instruction 6 are compact instructions that have replaced instruction 1, instruction 3 and instruction 6, respectively, of the data structure 440 (
Each compact instruction, e.g., instruction 1, instruction 3 and instruction 6, may have a length that is less than that of the non-compact instruction replaced by such compact instruction. In some embodiments, each of the compact instructions has the same length as one another. In some embodiments, one or more of the compact instructions has a length equal to one half the width of the data structure. In the illustrated embodiment, for example, each of the compact instructions has a length equal to one half the width of the data structure 260. However, compact instructions may or may not have the same length as one another. In some embodiments, one or more of the compact instructions has a length that is different than the length of one or more other compact instructions. Moreover, in some embodiments, one or more of the compact instructions has a length that is not equal to one half the width of the data structure.
The plurality of instructions may define a sequence of instructions, e.g., instruction 1, instruction 2, instruction 3, instruction 4, instruction 5, instruction 6, instruction 7, instruction 8. Each instruction in the sequence of instructions may be disposed at a respective position in the sequence, e.g., instruction 1 may be disposed at a first position in the sequence, instruction 2 may be disposed at a second position in the sequence, instruction 3 may be disposed at a third position in the sequence, and so on.
In some embodiments, the position of each instruction, e.g., instruction 1 through instruction 6, in the sequence of instructions is the same as the position of the corresponding instruction, e.g., instruction 1 through instruction 6, respectively, in the data structure 440 (
Thus, the data structure 260 may be able to store additional instructions, e.g., instruction 7 through instruction 9. For example, instruction 7, which may be a compact instruction, may be stored in section 0 of location 604. Instruction 8, which may be a compact instruction, may be stored in section 1 of location 604. Instruction 9 may be stored in section 0 and section 1 of location 605.
In that regard, instruction 1 may be stored in section 0 of location 600. Instruction 2 may be partitioned into two parts. One part of instruction 2 may be stored in section 1 of location 600. The other part of instruction 2 may be stored in section 0 of location 601. Instruction 3 may be stored in section 1 of location 601. Instruction 4 may be stored in section 0 of location 602. Instruction 5 may be stored in section 0 and section 1 of location 603. Instruction 6 may be stored in section 0 of location 604. Instruction 7 may be stored in section 0 of location 605. Instruction 8 may be stored in section 1 of location 605.
In some such embodiments, one or more sections of the data structure 260 may have no instruction. For example, because it is desired to store the first bit of instruction 5 in the first bit of a location, there may not be an instruction stored in section 1 of location 602. Similarly, because it is desired to store the first bit of instruction 7 in the first bit of a location, there may not be an instruction stored in section 1 of location 604.
For example, rather than having no instruction or a no-op instruction stored in section 1 of location 602, a stuff instruction may be stored in section 1 of location 602. Similarly, rather than having no instruction stored in section 1 of location 604, a stuff instruction may be stored in section 1 of location 604. As used herein, a stuff instruction is an instruction that will not be executed by the second processing system.
An example of a stuff instruction that uses the instruction format of
In some embodiments, a stuff instruction is stored in one or more sections of the data structure such that such sections of the data structure are filled and/or not empty. In some embodiments, the availability of a stuff instruction may avoid the need for a no-op instruction, which may thereby increase the speed and/or level of performance of a processor.
At 1304, it is determined, for each of the plurality of instructions, whether the instruction is a type of instruction to be aligned with a location in which the instruction is to be stored. According to some embodiments, a criterion is employed in determining whether the instruction is a type of instruction to be so aligned. In such embodiments, determining whether the instruction is a type of instruction to be so aligned may include determining whether the instruction satisfies the criterion.
At 1305, the instruction is added at a free position in a current location if the instruction is not a type of instruction to be so aligned.
At 1306, the method may further include determining if the instruction can be aligned in a current location. At 1308, the instruction is added to the current location if the instruction can be aligned therewith. At 1310, if the instruction cannot be aligned with the current location, the instruction is added to a subsequent location.
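The alignment flow at 1304 through 1310 might be sketched as follows in Python. The sketch assumes two sections per location, represents each instruction as a hypothetical `(name, size_in_sections, must_align)` tuple, and fills any skipped section with a placeholder for a stuff instruction; the function name `pack` and the `STUFF` marker are illustrative only.

```python
from typing import List, Tuple

STUFF = "stuff"  # placeholder for a stuff instruction, which is never executed


def pack(
    instrs: List[Tuple[str, int, bool]],
    sections_per_location: int = 2,
) -> List[List[str]]:
    """Pack (name, size_in_sections, must_align) entries into locations."""
    sections: List[str] = []
    for name, size, must_align in instrs:
        if must_align:
            # If the next free section is not the start of a location, the
            # instruction cannot be aligned there; fill the remainder of the
            # current location and move on to the subsequent one.
            while len(sections) % sections_per_location != 0:
                sections.append(STUFF)
        # Unaligned instructions go to the next free position and may
        # straddle two locations.
        sections.extend([name] * size)
    while len(sections) % sections_per_location != 0:
        sections.append(STUFF)  # fill the tail of the last location
    return [
        sections[i:i + sections_per_location]
        for i in range(0, len(sections), sections_per_location)
    ]


layout = pack([
    ("i1", 1, False), ("i2", 2, False), ("i3", 1, False), ("i4", 1, False),
    ("i5", 2, True), ("i6", 1, False), ("i7", 1, True), ("i8", 1, False),
])
```

With these inputs, `layout` places a stuff placeholder in the second section of the third and fifth locations, which corresponds to section 1 of location 602 and section 1 of location 604 in the arrangement discussed above.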
At 1404, the method may further include identifying one or more bit patterns to compact in each of the one or more portions. In some such embodiments, four, eight, sixteen and/or some other number of bit patterns (but fewer than all patterns that occur) are identified to compact in each of the one or more portions. In some embodiments, one or more of the bit patterns to compact are identified by analyzing bit patterns of instructions in one or more sample programs. In some embodiments, a compiler and/or assembler may be employed in identifying the one or more bit patterns to compact in each portion to compact.
In one such embodiment, the eight most frequently occurring bit patterns are identified for each portion to be compacted, i.e., the eight most frequently occurring bit patterns for the first portion to compact, the eight most frequently occurring bit patterns for the second portion to compact, etc.
At 1406, each of the one or more bit patterns may be assigned a code (or compact bit code). If eight bit patterns are identified for a portion, the codes assigned to such bit patterns might have three bits. For example, a first bit pattern may be assigned a first code (e.g., “000”). A second bit pattern may be assigned a second code (e.g., “001”). A third bit pattern may be assigned a third code (e.g., “010”). A fourth bit pattern may be assigned a fourth code (e.g., “011”). A fifth bit pattern may be assigned a fifth code (e.g., “100”). A sixth bit pattern may be assigned a sixth code (e.g., “101”). A seventh bit pattern may be assigned a seventh code (e.g., “110”). An eighth bit pattern may be assigned an eighth code (e.g., “111”).
In some embodiments, the one or more bit patterns may be stored in one or more tables. For example, a table may be generated for each portion to be compacted. Each table may store the one or more bit patterns to be compacted for that portion.
In some embodiments, the code assigned to a bit pattern may identify an address at which the bit pattern is to be stored in the table. The code may also be used as an index to retrieve the bit pattern from the table.
In some embodiments, the bit patterns may be assigned to the tables in a manner that helps to minimize loading on the memory. In some embodiments, for example, power consumption may be reduced by reducing the number of logic “1” bit states within a memory. Thus, in some embodiments, codes having the least number of logic “1” bit states may be assigned to those bit patterns that occur most frequently in the instructions.
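One way the identification at 1404 and the code assignment at 1406 might be combined is sketched below in Python. The sketch assumes that the patterns observed for one portion in sample programs are available as a list of integers, and it assigns the codes with the fewest logic "1" bits to the most frequent patterns. The function name `build_table` and the sample data are hypothetical.

```python
from collections import Counter
from typing import Iterable, List, Optional


def build_table(patterns: Iterable[int], code_bits: int = 3) -> List[Optional[int]]:
    """Return a table in which table[code] is the bit pattern for that code."""
    n_codes = 1 << code_bits
    # Most frequent patterns first (at most n_codes of them).
    frequent = [p for p, _ in Counter(patterns).most_common(n_codes)]
    # Codes ordered by number of logic "1" bits, fewest first.
    codes = sorted(range(n_codes), key=lambda c: (bin(c).count("1"), c))
    table: List[Optional[int]] = [None] * n_codes
    for pattern, code in zip(frequent, codes):
        table[code] = pattern
    return table


# Hypothetical sample: bit patterns seen in one portion across sample programs.
samples = [0xA] * 5 + [0xB] * 3 + [0xC] * 2 + [0xD]
table = build_table(samples)
# Reverse mapping used when compacting: bit pattern -> code.
encode = {p: c for c, p in enumerate(table) if p is not None}
```

Here the code doubles as the table address, so the same list serves both to store each pattern and to retrieve it by index. Unused codes are left as `None` in this sketch.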
In some embodiments, each portion may have any form. A portion may comprise one or more bits. The bits may or may not be adjacent to one another in the instruction. Portions may overlap or not overlap. Thus, although the portions may be shown as approximately equally sized and non-overlapping, there are no such requirements.
If so, at 1504, each bit pattern to be compacted in each portion to be compacted is replaced by a corresponding compact code. If any of the at least one portion to be compacted does not include a bit pattern to be compacted, then the instruction is not compacted and execution jumps to 1506.
One or more portions of the first instruction may be portions to be compacted. In some embodiments, for example, the second portion 1604, the third portion 1606, the fifth portion 1610 and the seventh portion 1614 may be portions to be compacted. One or more other portions may not be portions to be compacted. For example, the first portion 1602, the fourth portion 1608, the sixth portion 1612 and the eighth portion 1616 may not be portions to be compacted.
A compact instruction may also include one or more portions. For example, a second instruction 1630 may include a first portion 1632, a second portion 1634, a third portion 1636, a fourth portion 1638, a fifth portion 1640, a sixth portion 1642, a seventh portion 1644 and an eighth portion 1646.
One or more portions of the compact instruction may be compacted portions. For example, in some embodiments, the second portion 1634, the third portion 1636, the fifth portion 1640 and the seventh portion 1644 may be compacted portions. The first portion 1632, the fourth portion 1638, the sixth portion 1642 and the eighth portion 1646 may be noncompacted portions and may be the same as or similar to the first portion 1602, the fourth portion 1608, the sixth portion 1612 and the eighth portion 1616, respectively, of the first instruction 1600.
In some embodiments, the first instruction 1600 may include a field 1620 to indicate that the first instruction is not a compact instruction. In some embodiments, the second instruction 1630 may include a field 1650 to indicate that the second instruction is a compact instruction.
The compact instruction may have fewer bits than the non-compact instruction. That is, the original instruction may have a first number of bits and the compact instruction may have a second number of bits less than the first number of bits. In some embodiments, the second number of bits is less than or equal to one half the first number of bits.
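The all-or-nothing replacement of portions described above might be sketched as follows in Python. The sketch assumes an instruction is given as a sequence of eight portion values and that each portion to be compacted has a hypothetical pattern-to-code dictionary; the field that marks an instruction as compact is not modeled, and all names and table contents are illustrative.

```python
from typing import Dict, Optional, Sequence, Tuple


def compact_instruction(
    portions: Sequence[int],
    encode_tables: Dict[int, Dict[int, int]],
) -> Optional[Tuple[int, ...]]:
    """Return the compacted portions, or None if the instruction is left as is.

    encode_tables maps the index of each portion to be compacted to a
    {bit pattern: compact code} dictionary for that portion.
    """
    out = list(portions)
    for index, table in encode_tables.items():
        code = table.get(portions[index])
        if code is None:
            return None  # a portion holds a pattern with no code
        out[index] = code
    return tuple(out)


# Hypothetical tables for the second, third, fifth and seventh portions
# (zero-based indices 1, 2, 4 and 6).
tables = {
    1: {0x2AAAA: 0b000, 0x15555: 0b001},
    2: {0x00001: 0b000},
    4: {0xFFF: 0b000, 0x0F0: 0b001},
    6: {0xFFF: 0b000, 0x0F0: 0b001},
}
full = (0x1, 0x15555, 0x00001, 0x7, 0x0F0, 0x3, 0xFFF, 0x0)
other = (0x1, 0x3FFFF, 0x00001, 0x7, 0x0F0, 0x3, 0xFFF, 0x0)
compacted = compact_instruction(full, tables)
not_compacted = compact_instruction(other, tables)
```

In this sketch `other` is left uncompacted because its second portion holds a pattern that has no assigned code, even though its remaining portions could be encoded.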
The instruction cache (or other memory) 1710 may store a plurality of instructions, which may define one, some or all parts of one or more programs being executed and/or to be executed by the processing system. In some embodiments, the plurality of instructions may include, but is not limited to, one or more of the plurality of instructions represented by the data structure 260 (
An output of the instruction queue 1720 may supply an instruction, which may be supplied to the decompactor 1730. In accordance with some embodiments, the decompactor 1730 may determine whether the instruction is a compact instruction. One or more criteria may be employed in determining whether the instruction is a compact instruction. In some embodiments, a compact instruction includes a field indicating that the instruction is a compact instruction.
If the instruction is not a compact instruction, the instruction may be supplied to an input of the decoder 1740, which may decode the instruction to provide a decoded instruction. An output of the decoder 1740 may supply the decoded instruction to the execution unit 1750, which may execute the decoded instruction.
If the instruction is a compact instruction, the decompactor 1730 may generate a decompacted instruction, based at least in part on the compact instruction. The decompacted instruction may be supplied to the input of the decoder 1740, which may decode the decompacted instruction to generate a decoded instruction. The output of the decoder 1740 may supply the decoded instruction, which may be supplied to the execution unit 1750, which may execute the decoded instruction.
In some embodiments, if the decompacted instruction is a stuff instruction, such decompacted instruction may not be sent to the decoder and/or the execution unit.
In some embodiments, the processing system includes a SIMD execution engine. The instruction may be, for example, a machine code instruction to be executed by the SIMD execution engine. According to some embodiments, the instruction may specify one or more source operands and/or one or more destinations. The one or more of the source operands and/or one or more of the destinations might be, for example, encoded in the instruction. According to some embodiments, one or more of the plurality of instructions may have a format that is the same as or similar to one or more of the instructions described herein.
At 1804, it is determined whether the instruction is a compact instruction. One or more criteria may be employed in determining whether the instruction is a compact instruction. In some embodiments, a compact instruction includes a field indicating that the instruction is a compact instruction.
At 1806, if the instruction is a compact instruction, a decompacted instruction is generated based at least in part on the compact instruction.
In some embodiments, the method further includes replacing the compact instruction with the decompacted instruction if the instruction is a compact instruction. For example, the compact instruction may be removed from an instruction pipeline and the decompacted instruction may be added to the instruction pipeline. The position of the decompacted instruction may be the same as the position of the compact instruction prior to removal of such instruction.
According to some embodiments, the method may further include decoding the instruction to provide a decoded instruction if the instruction is not a compact instruction and decoding the decompacted instruction to provide a decoded instruction if the instruction is a compact instruction. In some embodiments, the method may further include executing the decompacted instruction and/or a decoded instruction.
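The fetch path described above (check for a compact instruction, decompact it, skip a stuff instruction, then decode and execute) might be sketched as follows in Python. The `Instr` record, the `run` function, and the toy stages are hypothetical stand-ins for the decompactor, decoder, and execution unit; the "c:" payload prefix is purely an artifact of this illustration.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass(frozen=True)
class Instr:
    payload: str
    compact: bool = False  # stands in for the field marking a compact instruction
    stuff: bool = False    # marks a stuff instruction


def run(
    queue: Iterable[Instr],
    decompact: Callable[[Instr], Instr],
    decode: Callable[[Instr], str],
    execute: Callable[[str], None],
) -> None:
    for instr in queue:
        if instr.compact:
            instr = decompact(instr)
            if instr.stuff:
                continue  # a stuff instruction is not decoded or executed
        execute(decode(instr))


# Toy stages: compact payloads carry a "c:" prefix in this illustration.
def toy_decompact(instr: Instr) -> Instr:
    name = instr.payload[2:]
    return Instr(payload=name, stuff=(name == "stuff"))


trace: List[str] = []
run(
    [Instr("mov"), Instr("c:stuff", compact=True), Instr("c:add", compact=True)],
    decompact=toy_decompact,
    decode=lambda instr: instr.payload.upper(),
    execute=trace.append,
)
```

After the call, `trace` holds only the two instructions that reached the execution stage; the stuff instruction is dropped after decompaction.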
One or more other portions of the compact instruction may be noncompacted portions. For example, the second portion 1634, the third portion 1636, the fifth portion 1640 and the seventh portion 1644 may be compacted portions. The first portion 1632, the fourth portion 1638, the sixth portion 1642 and the eighth portion 1646 may be noncompacted portions.
The decompacted instruction may also include one or more portions. For example, the decompacted instruction 1600 may include a first portion 1602, a second portion 1604, a third portion 1606, a fourth portion 1608, a fifth portion 1610, a sixth portion 1612, a seventh portion 1614, and an eighth portion 1616.
One or more portions of the decompacted instruction 1600 may be decompacted portions. For example, in some embodiments, the second portion 1604, the third portion 1606, the fifth portion 1610 and the seventh portion 1614 may be decompacted portions.
In some embodiments, one of the compacted portions of the compacted instruction 1630, e.g., the second portion 1634, may be supplied to an input of a first portion 1910 of the decompactor 1730, which may decompact such compacted portion to provide the decompacted portion 1604 of decompacted instruction 1600. A second one of the compacted portions of the compacted instruction 1630, e.g., the third portion 1636, may be supplied to an input of a second portion 1920 of the decompactor 1730, which may decompact such compacted portion to provide the decompacted portion 1606 of the decompacted instruction 1600.
A third one of the compacted portions of the compacted instruction 1630, e.g., the fifth portion 1640, may be supplied to an input of a third portion 1930 of the decompactor 1730, which may decompact such compacted portion to provide the decompacted portion 1610 of the decompacted instruction 1600.
A fourth one of the compacted portions of the compacted instruction 1630, e.g., the seventh portion 1644, may also be supplied to an input of the third portion 1930 of the decompactor 1730, which may decompact such compacted portion to provide the decompacted portion 1614 of the decompacted instruction 1600.
One or more other portions of the decompacted instruction 1600, e.g., the first portion 1602, the fourth portion 1608, the sixth portion 1612 and the eighth portion 1616 may be the same as or similar to the first portion 1632, the fourth portion 1638, the sixth portion 1642 and the eighth portion 1646, respectively, of the compact instruction 1630.
In some embodiments, the second portion 1634, the third portion 1636, the fifth portion 1640 and the seventh portion 1644 of the compact instruction 1630 each comprise three bits.
In some embodiments, the second portion 1604 and the third portion 1606 of the decompacted instruction 1600 each comprise a total of eighteen bits and the fifth portion 1610 and the seventh portion 1614 of the decompacted instruction 1600 each comprise a total of twelve bits.
In some embodiments, each of the compacted portions may define a code that may be used as an index to retrieve the appropriate bit pattern from the associated table. For example, the code may define an address (in the associated table) at which the bit pattern corresponding to the code is stored.
For example, the second portion 1634 of the compacted instruction 1630 may define a first code that may be used as an index (e.g., an address in the look-up table storing bit patterns associated with the second portion 1634) to retrieve a bit pattern that defines the second portion 1604 of the decompacted instruction 1600. The third portion 1636 of the compacted instruction 1630 may define a second code that may be used as an index (e.g., an address in the look-up table storing bit patterns associated with the third portion 1636) to retrieve a bit pattern that defines the third portion 1606 of the decompacted instruction 1600. The fifth portion 1640 of the compacted instruction 1630 may define a third code that may be used as an index (e.g., an address in the look-up table storing bit patterns associated with the fifth portion 1640) to retrieve a bit pattern that defines the fifth portion 1610 of the decompacted instruction 1600. The seventh portion 1644 of the compacted instruction 1630 may define a fourth code that may be used as an index (e.g., an address in the look-up table storing bit patterns associated with the seventh portion 1644) to retrieve a bit pattern that defines the seventh portion 1614 of the decompacted instruction 1600.
Although four compacted portions and three look-up tables are shown, other embodiments may also be employed.
In some embodiments, the second processing system 220 may include one or more processing systems that include a SIMD execution engine, for example as illustrated in
In some applications, it may be helpful to access information in a register file in various ways. For example, in a graphics application it might at some times be helpful to treat portions of the register file as a vector, a scalar, and/or an array of values. Such an approach may help reduce the amount of instruction and/or data moving, packing, unpacking, and/or shuffling and improve the performance of the system.
Each region description includes a register identifier and a “sub-register identifier” indicating a location of a first data element in the register file 2420 (illustrated in
Note that an origin might be defined in other ways. For example, the register file 2420 may be considered as a contiguous 40-byte memory area. In this case, a single 6-bit address origin could point to a byte within the register file 2420. Note that a single 6-bit address origin can point to any byte within a register file of up to 64 bytes. As another example, the register file 2420 might be considered as a contiguous 320-bit memory area. In this case, a single 9-bit address origin could point to a bit within the register file 2420.
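The relationship between a register/sub-register origin and a single linear origin might be sketched as follows. The sketch assumes a 40-byte register file organized as five 8-byte registers (R0 through R4); the function names are hypothetical.

```python
REG_BYTES = 8  # assumption: five 8-byte registers form the 40-byte file
NUM_REGS = 5


def byte_origin(reg, subreg):
    """Linear byte address of register `reg`, byte `subreg`."""
    return reg * REG_BYTES + subreg


def bit_origin(reg, subreg, bit=0):
    """Linear bit address when the file is viewed as one run of bits."""
    return byte_origin(reg, subreg) * 8 + bit


def bits_needed(locations):
    """Smallest address width able to name `locations` distinct positions."""
    return (locations - 1).bit_length()
```

Under these assumptions, `bits_needed(40)` gives 6 and `bits_needed(320)` gives 9, matching the 6-bit byte origin and 9-bit bit origin discussed above.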
Each region description may further include a “width” of the region. The width might indicate, for example, a number of data elements associated with the described region within a register row. For example, the DEST region illustrated in
Similarly, the SRC0 region is described as being four bytes wide (and therefore two rows or registers high) and the SRC1 region is described as being eight bytes wide (and therefore has a vertical height of one data element). Note that a single region may span different registers in the register file 2420 (e.g., some of the DEST region illustrated in
Although some embodiments discussed herein describe a width of a region, according to other embodiments a vertical height of the region is instead described (in which case the width of the region may be inferred based on the total number of data elements). Moreover, note that overlapping register regions may be defined in the register file 2420 (e.g., the region defined by SRC0 might partially or completely overlap the region defined by SRC1). In addition, although some examples discussed herein have two source operands and one destination operand, other types of instructions may be used. For example, an instruction might have one source operand and one destination operand, three source operands and two destination operands, etc.
According to some embodiments, a described region origin and width might result in a region “wrapping” to the next register in the register file 2420. For example, a region of byte-size data elements having an origin of R2.6 and a width of eight would include the last two bytes of R2 along with the first six bytes of R3. Similarly, a region might wrap from the bottom of the register file 2420 to the top (e.g., from R4 to R0).
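The wrapping behavior might be sketched as follows, assuming byte-size data elements, 8-byte registers, and a five-register file (R0 through R4). The function name and parameters are hypothetical.

```python
def region_bytes(reg, subreg, width, reg_bytes=8, num_regs=5):
    """List the (register, byte) locations of a one-row region of bytes.

    Addresses advance linearly, so a row that runs past the end of one
    register continues in the next, and a row that runs past the last
    register wraps around to R0.
    """
    total = reg_bytes * num_regs
    start = reg * reg_bytes + subreg
    locations = []
    for i in range(width):
        addr = (start + i) % total  # wrap from the bottom of the file to the top
        locations.append((addr // reg_bytes, addr % reg_bytes))
    return locations
```

For the example above, `region_bytes(2, 6, 8)` yields bytes 6 and 7 of R2 followed by bytes 0 through 5 of R3.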
The SIMD execution engine may add each byte in the described SRC1 region to a corresponding byte in the described SRC0 region and store the results in the described DEST region in the register file 2420. For example,
In this case, a horizontal stride of two has been described. As a result, each data element in a row is offset from its neighboring data element in that row by two bytes. For example, the data element associated with channel 5 of the execution engine is located at byte 3 of R2 and the data element associated with channel 6 is located at byte 5 of R2. In this way, a described region may not be contiguous in the register file 2620. Note that when a horizontal stride of one is described, the result would be a contiguous 4×2 array of bytes beginning at R1.1 in the two dimensional map of the register file 2620.
The region described in
According to some embodiments, the value of a horizontal stride may be encoded in an instruction. For example, a 3-bit field might be used to describe the following eight potential horizontal stride values: 0, 1, 2, 4, 8, 16, 32, and 64. Moreover, a negative horizontal stride may be described according to some embodiments.
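One way such a 3-bit field might be encoded and decoded is sketched below. The assignment of codes to stride values (code 0 for stride 0, code 1 for stride 1, and so on in ascending order) is an assumption; the description lists only the eight values.

```python
# Assumed ordering: code n selects the n-th value in this tuple.
HSTRIDE_VALUES = (0, 1, 2, 4, 8, 16, 32, 64)


def encode_hstride(stride):
    """Return the 3-bit code for a supported horizontal stride."""
    try:
        return HSTRIDE_VALUES.index(stride)
    except ValueError:
        raise ValueError(f"unsupported horizontal stride: {stride}") from None


def decode_hstride(code):
    """Return the horizontal stride selected by a 3-bit code."""
    if not 0 <= code < len(HSTRIDE_VALUES):
        raise ValueError(f"code {code} does not fit in 3 bits")
    return HSTRIDE_VALUES[code]
```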
Note that a region may be described for data elements of various sizes. For example,
The region described in
According to some embodiments, a vertical stride might be defined as a number of data elements in a register file (instead of a number of register rows). For example,
Note that different types of descriptions may be provided for different instructions. For example, a first instruction might define a destination region as a 4×4 array while the next instruction defines a region as a 1×16 array. Moreover, different types of regions may be described for a single instruction.
Consider, for example, the register file 3220 illustrated in
In this example, regions are described for an operand within an instruction as follows:
SRC1 is two bytes wide, and therefore four data elements high, and begins in byte 17 of R2 (illustrated in
SRC0 is four bytes wide, and therefore two data elements high, and begins at R1.14. Because the horizontal stride is zero, the value at location R1.14 (e.g., “2” as illustrated in
DEST is four words wide, and therefore two data elements high, and begins at R5.3. Thus, the execution channel will add the value “1” (the first data element of the SRC0 region) to the value “2” (the data element of the SRC1 region that will be used by the first four execution channels) and the result “3” is stored into bytes 3 and 4 of R5 (the first word-size data element of the DEST region).
The horizontal stride of DEST is three data elements, so the next data element is the word beginning at byte 9 of R5 (e.g., offset from byte 3 by three words), the element after that begins at byte 15 of R5 (shown broken across two rows in
The vertical stride of DEST is eighteen data elements, so the first data element of the second “row” of the DEST array begins at byte 7 of R6. The result stored in this DEST location is “6”, representing the “3” from the fifth data element of the SRC0 region added to the “3” from the SRC1 region that applies to execution channels 4 through 7.
Because information in the register files may be efficiently and flexibly accessed in different ways, the performance of a system may be improved. For example, machine code instructions may efficiently be used in connection with a replicated scalar, a vector of a replicated scalar, a replicated vector, a two-dimensional array, a sliding window, and/or a related list of one-dimensional arrays. As a result, the number of data move, packing, unpacking, and/or shuffling instructions may be reduced, which can improve the performance of an application or algorithm, such as one associated with a media kernel.
Note that in some cases, restrictions might be placed on region descriptions. For example, a sub-register origin and/or a vertical stride might be permitted for source operands but not destination operands. Moreover, physical characteristics of a register file might limit region descriptions. For example, a relatively large register file might be implemented using embedded Random Access Memory (RAM), and the cost and power associated with the embedded RAM might depend on the number of read and write ports that are provided. Thus, the number of read and write ports (and the arrangement of the registers in the RAM) might restrict region descriptions.
The system 3300 may also include an instruction memory unit 3330 to store SIMD instructions and a data memory unit 3340 to store data (e.g., scalars and vectors associated with a two-dimensional image, a three-dimensional image, and/or a moving image). The instruction memory unit 3330 and the data memory unit 3340 may comprise, for example, RAM units. Note that the instruction memory unit 3330 and/or the data memory unit 3340 might be associated with separate instruction and data caches, a shared instruction and data cache, separate instruction and data caches backed by a common shared cache, or any other cache hierarchy. According to some embodiments, the system 3300 also includes a hard disk drive (e.g., to store and provide media information) and/or a non-volatile memory such as FLASH memory (e.g., to store and provide instructions and data).
The following illustrates various additional embodiments. These do not constitute a definition of all possible embodiments, and those skilled in the art will understand that many other embodiments are possible. Further, although the following embodiments are briefly described for clarity, those skilled in the art will understand how to make any changes, if necessary, to the above description to accommodate these and other embodiments and applications.
Although various ways of describing source and/or destination operands have been discussed, note that embodiments may use any subset or combination of such descriptions. For example, a source operand might be permitted to have a vertical stride while a vertical stride might not be permitted for a destination operand.
Note that embodiments may be implemented in any of a number of different ways. For example, the following code might compute the addresses of data elements assigned to execution channels when the destination register is aligned to a 256-bit register boundary:
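A minimal sketch of one such address computation (not the original listing) is given below. It assumes 32-byte (256-bit) registers, vertical and horizontal strides expressed in data elements, and channels assigned row by row; the function and parameter names are hypothetical.

```python
REG_BYTES = 32  # 256-bit registers


def channel_addresses(reg, subreg, vstride, width, hstride, elem_bytes, channels):
    """Return the (register, byte) location of the element for each channel.

    Channels fill a row of `width` elements before moving to the next row.
    Strides are counted in data elements of `elem_bytes` bytes each.
    """
    origin = reg * REG_BYTES + subreg
    locations = []
    for ch in range(channels):
        row, col = divmod(ch, width)
        addr = origin + (row * vstride + col * hstride) * elem_bytes
        locations.append((addr // REG_BYTES, addr % REG_BYTES))
    return locations
```

Applied to the DEST region discussed earlier (origin R5.3, word-size elements, width four, horizontal stride three, vertical stride eighteen), the first row falls at bytes 3, 9, 15 and 21 of R5 and the second row begins at byte 7 of R6.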
According to some embodiments, a register region is encoded in an instruction word for each of the instruction's operands. For example, the register number and sub-register number of the origin may be encoded. In some cases, the value in the instruction word may represent a different value in terms of the actual description. For example, three bits might be used to encode the width of a region, and “011” might represent a width of eight elements while “100” represents a width of sixteen elements. In this way, a larger range of descriptions may be available as compared to simply encoding the actual value of the description in the instruction word.
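The two width codes mentioned above are consistent with a power-of-two encoding, in which the field value n represents a width of 2^n elements. The sketch below assumes that scheme for the whole 3-bit range, which the description does not confirm; the function names are hypothetical.

```python
def decode_width(field):
    """Decode a 3-bit width field given as a bit string such as "011"."""
    if len(field) != 3 or any(c not in "01" for c in field):
        raise ValueError(f"not a 3-bit field: {field!r}")
    return 1 << int(field, 2)  # assumed: width = 2 ** field value


def encode_width(width):
    """Encode a power-of-two width (1 through 128) as a 3-bit field."""
    if width <= 0 or width & (width - 1) or width > 128:
        raise ValueError(f"width not encodable in 3 bits: {width}")
    return format(width.bit_length() - 1, "03b")
```

Under this assumption three bits cover widths from 1 to 128, whereas storing the width directly in three bits would reach only 7.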
Execution of the first, third, fifth, seventh, ninth and eleventh instructions may each move data (e.g., data stored in an indirectly-addressed register) to a buffer (e.g., a temporary register buffer). Execution of the second, fourth, sixth, eighth, tenth and twelfth instructions may each provide interpolation.
Operands for the instructions may be described as follows:
As can be seen, the list of instructions may include a plurality of portions, e.g., portions 3402, 3406, 3408, with a repeating pattern, which will result in binary language instructions with a repeating bit pattern.
In some embodiments, compaction and/or decompaction may be employed in association with a processing system having instructions with a length of 128 bits.
In some embodiments, compaction may be employed in association with a processing system having one or more instructions with operands that may be described as follows:
As shown above, in some embodiments, such instructions may have one or more portions with a bit pattern that is found in two or more instructions.
In some embodiments, a first instruction 4000 includes a first portion 4002, a second portion 4004, a third portion 4006, a fourth portion 4008, a fifth portion 4010, a sixth portion 4012, a seventh portion 4014, an eighth portion 4016 and a ninth portion 4020. The first portion may specify an op code, the second portion may specify a plurality of control bits (e.g., thread, mask, etc.), the third portion may specify a register file and data types, the sixth portion may specify a first source operand description and swizzle, and the eighth portion may specify a second source operand description and swizzle. The ninth portion may specify whether the instruction is a compact instruction.
In some embodiments, the second portion and the third portion each comprise a total of eighteen bits and the sixth portion and the eighth portion each comprise a total of twelve bits.
A compact instruction 4030 may also have nine portions. In some embodiments, the second, third, fifth and seventh portions may be compacted portions, e.g., as shown. The first, fourth, sixth and eighth portions may be noncompacted portions.
In some embodiments, the data structure has a width equal to four double words, e.g., double word 0-double word 3. Each of the six instructions may have a length equal to four double words. The compact instruction may have fewer bits than the non-compact instruction. That is, the original instruction may have a first number of bits and the compact instruction may have a second number of bits less than the first number of bits. In some embodiments, the second number of bits is less than or equal to one half the first number of bits. In some such embodiments, the original instruction comprises a total of 128 bits and the compact instruction comprises a total of 64 bits. In some embodiments, each of the compacted portions comprises three bits.
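The compaction direction might be sketched as follows: each compactable portion of the original instruction is searched for in its table, and the instruction is compacted only if every such portion has a matching entry. The field names, the table contents, and the fall-back of leaving the instruction uncompacted are all assumptions made for illustration.

```python
CODE_BITS = 3  # each compacted portion is replaced by a three-bit code

# Placeholder tables; the values are illustrative only.
SOURCE_TABLE = [0x000, 0x001, 0x010, 0x011, 0x100, 0x101, 0x110, 0xFFF]
TABLES = {
    "control": [0x00000, 0x00001, 0x00100, 0x00101,
                0x10000, 0x10001, 0x10100, 0x3FFFF],
    "datatype": [0x00000, 0x00022, 0x00444, 0x02222,
                 0x08888, 0x11111, 0x22222, 0x3FFFF],
    "src0": SOURCE_TABLE,
    "src1": SOURCE_TABLE,
}


def compact(fields):
    """Return a code per compactable field, or None if any pattern is absent.

    A None result means the instruction would stay in its full-length form.
    """
    codes = {}
    for name, table in TABLES.items():
        assert len(table) <= (1 << CODE_BITS)  # every index fits in the code
        pattern = fields[name]
        if pattern not in table:
            return None
        codes[name] = table.index(pattern)
    return codes
```

Because a code can only name a pattern already stored in a table, the benefit depends on how often the same bit patterns recur across instructions.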
In some embodiments, decompaction may be employed in association with a processing system having one or more instructions with operands that may be described as follows:
In some embodiments, for example, such decompaction may correspond to and/or be used in association with the compaction described hereinabove with respect to
Unless otherwise stated, terms such as, for example, “based on” mean “based at least on”, so as not to preclude being based on more than one thing. In addition, unless stated otherwise, terms such as, for example, “comprises”, “has”, “includes”, and all forms thereof, are considered open-ended, so as not to preclude additional elements and/or features. In addition, unless stated otherwise, terms such as, for example, “a”, “one”, “first”, are considered open-ended, and do not mean “only a”, “only one” and “only a first”, respectively. Moreover, unless stated otherwise, the term “first” does not, by itself, require that there also be a “second”.
Some embodiments have been described herein with respect to a SIMD execution engine. Note, however, that embodiments may be associated with other types of execution engines, such as a Multiple Instruction, Multiple Data (MIMD) execution engine.
The several embodiments described herein are solely for the purpose of illustration. Persons skilled in the art will recognize from this description that other embodiments may be practiced with modifications and alterations limited only by the claims.