1. Technical Field
This invention relates generally to digital data processor architectures and, more specifically, relates to program instruction decoding and execution hardware.
2. Description of Related Art
A number of data processor instruction set architectures (ISAs) operate with fixed length instructions. For example, several Reduced Instruction Set Computer (RISC) architecture data processors feature instruction words that have a fixed width of 32 bits. One such example is the PowerPC™, which is a product available from International Business Machines Corporation (IBM). Another conventional architecture, known as IA-64 EPIC (Explicitly Parallel Instruction Computing), uses a fixed format of three operations per 128 bits. In other architectures, such as the IBM System/360 and zSeries architectures, the Intel 8086 architecture, the Advanced Micro Devices AMD64 architecture, or the Digital Equipment VAX architecture, each instruction is of variable length, the length being specified by a length field which is part of the instruction word.
As instruction pipelines become deeper and memory latencies become longer, more instructions must be executing simultaneously so as to keep data processor execution units well utilized. However, in order to increase the number of non-memory operations in flight, it is generally necessary to increase the number of registers in the data processor, so that independent instructions may read their inputs and write their outputs without interfering with the execution of other instructions. Unfortunately, in most RISC architectures there is not sufficient space in a 32-bit instruction word for operands to specify more than 32 registers, i.e., 5 bits per operand, with most operations requiring three operands and some requiring two or four operands. Other architectures, such as the MIPS architecture (a product of MIPS Technologies, Inc.) and the ARM architecture (a product of ARM Ltd.), offer a mode that allows for selecting between two different instruction encoding formats. For example, in one mode, all instructions are of a first width (e.g., 32 bits for the MIPS32 and ARM architectures, respectively), and in another mode, all instructions are of a second width (e.g., 16 bits for the MIPS16 and Thumb architectures, respectively). The Thumb architecture is an extension to the 32-bit ARM architecture. The Thumb instruction set features a subset of the most commonly used 32-bit ARM instructions which have been compressed into 16-bit wide opcodes.
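By way of a non-limiting illustration, the operand space constraint described above can be worked through numerically. The following Python sketch is not part of any cited architecture; the function names `specifier_bits` and `opcode_bits_left` are hypothetical and chosen only for this example. It computes how many bits of an instruction word remain for opcodes and other fields once the register specifiers have been allocated.

```python
def specifier_bits(num_registers: int) -> int:
    """Bits needed to name one register out of num_registers."""
    return max(1, (num_registers - 1).bit_length())


def opcode_bits_left(word_bits: int, num_registers: int, operands: int) -> int:
    """Bits left in an instruction word after all register specifiers."""
    return word_bits - operands * specifier_bits(num_registers)


# Illustrative sweep: a 32-bit word with three register operands.
budget = {regs: opcode_bits_left(32, regs, 3) for regs in (32, 64, 128)}
```

Under these assumptions, a 32-bit word with three 5-bit specifiers leaves 17 bits for everything else, whereas four operands drawn from 128 registers would leave only 4 bits.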
In addition, as conventional fixed-width data processor architectures age, new applications become important, and these new applications may require new types of instructions to run efficiently. For example, in the last few years, multimedia vector extensions have been made to several ISAs, such as the MMX, SSE, SSE2, and SSE3 extensions for the Intel 8086 architecture and Altivec/VMX for the PowerPC™ architecture. However, with only a fixed number of bits in an instruction word, it has become increasingly difficult or impossible to add new instructions and specifically operation code encodings (opcodes) and wide register specifiers to many architectures.
Several techniques for extending instruction word length have been proposed and used in the prior art. For example, Complex Instruction Set Computer (CISC) architectures generally allow the use of a variable length instruction. However, traditional variable instruction lengths, e.g., as those employed by the Intel 8086 architecture, have at least three significant drawbacks. A first drawback to the use of variable length instructions is that they complicate the decoding of instructions, as the instruction length is generally not known until at least a part of the instruction has been read, and because the positions of all operands within an instruction are likewise not generally known until at least part of the instruction is read. A second drawback to the use of variable length instructions is that instructions of variable width are not compatible with the existing code for fixed width data processor architectures. A third drawback is that conventional variable length instructions require complex decoders which can start at arbitrary instruction addresses, complicating and slowing down instruction decode logic.
Although the use of a fixed width 64-bit instruction word (or other higher powers of two) may allow for avoiding the first and third problems mentioned above, the use of a fixed width 64-bit instruction word still does not overcome the second problem. In addition, the use of 64-bit instructions introduces the further difficulty that the additional 32-bits beyond the current 32-bit instruction words are far more than what is needed to specify the numbers of additional registers required by deeper instruction pipelines, or the number of additional opcodes likely to be needed in the foreseeable future. The use of excess instruction bits wastes space in main memory and in instruction caches, thereby slowing the performance of the data processor.
An approach of encoding instructions in a first fixed width (e.g., 2 bytes) and a second double fixed width (e.g., 4 bytes) has been previously used in the IBM RT PC ROMP processor and is disclosed by P. Hester et al. “The IBM RT PC ROMP and Memory Management Unit Architecture”, IBM RT Personal Computer Technology, 1986. To prevent crossing of page boundaries for doublewide instructions, the encoded instructions can further be required to start at a doublewide instruction address boundary (e.g., an instruction byte address being an integral multiple of 4) or an address not within 3 bytes before a boundary not to be crossed.
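By way of illustration only, the second placement rule described above (no doublewide instruction starting within 3 bytes before a boundary that must not be crossed) can be sketched as follows. The 4096-byte page size and the function names are assumptions made for this example; they are not taken from the cited reference.

```python
PAGE_SIZE = 4096  # assumed page size in bytes; not specified in the text


def crosses_boundary(addr: int, length: int, boundary: int = PAGE_SIZE) -> bool:
    """True if an instruction of `length` bytes at `addr` spans a boundary."""
    return addr // boundary != (addr + length - 1) // boundary


def may_start_doublewide(addr: int, boundary: int = PAGE_SIZE) -> bool:
    """A 4-byte instruction may start at `addr` only if it does not
    straddle the boundary, i.e. it does not start within the last
    3 bytes before that boundary."""
    return not crosses_boundary(addr, 4, boundary)
```

Note that the stricter first rule (start addresses that are integral multiples of 4) trivially satisfies this check whenever the boundary is itself a multiple of 4.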
For example, the XL2067 and XL8220, products of Weitek Corporation, use a method that subdivides a 4-byte space into a 1-byte instruction and a 3-byte instruction. This is a means to embed multiple short instructions efficiently in an instruction stream.
In addition, U.S. Pat. No. 5,625,784, entitled “Variable Length Instructions Packed in a Fixed Length Double Instruction”, also discloses a method to subdivide the number of bits used by two instructions to provide up to 4 variable length instructions. Optionally, two short “flexible” instructions can be present. This method is undesirable as variable length instructions are inherently slow and hard to decode. In one aspect of the cited invention, an extended variable length instruction can be generated by concatenating one of a first and second base instruction with additional instruction bytes distributed over two adjacent instruction words. The teachings of this patent require base instructions to be aligned at instruction word boundaries, leading to restrictions in possible instructions to be used. The encoding is undesirable for hardware implementations because it requires performing alignment of instruction bits. Such signal crossing is costly in modern designs. Finally, while this encoding allows for the insertion of one long instruction in a double instruction space, it requires the second instruction to be shorter. Thus, this invention is directed at packing multiple variable length instructions and not at supporting the pervasive use of wide instructions.
Having described instruction word oriented architectures such as RISC and CISC architectures, we now describe bundle-oriented architectures wherein an instruction consists of several operations.
The above-mentioned IA-64 EPIC architecture packs three operations into 16 bytes (128-bits), for an average of 42.67 bits per operation. While this type of instruction encoding avoids problems with page and cache line crossing, this type of instruction encoding also exhibits several problems, both on its own, and as a technique for extending other fixed instruction width ISAs. First, without incurring significant implementation difficulty (likely slowing the execution speed and requiring significantly more integrated circuit die area), this instruction encoding technique permits branches to go only to instructions starting with an operation encoded as the first of the three operations in a 128 b instruction word, whereas most other architectures allow branches to any instruction. Second, this technique also “wastes” bits for specifying the interaction between instructions. For example, instruction stops are used to indicate if all three operations can be executed in parallel, or whether they must be executed sequentially, or whether some combination of the two is possible. This approach is known as “variable length very long instruction word (VLIW)” or “variable width VLIW”. In one particular encoding used by the IA-64 architecture, the stop information and issue logic data is encoded in an instruction header, as described by Intel in “IA-64 Application Developer's Architecture Guide”. In another form of VLIW instruction encoding used by IBM's Binary-translation Optimized Architecture (BOA) processor, the stop bits are explicit, as described by Gschwind et al., “Dynamic and Transparent Binary Translation”, IEEE Computer, March 2000. Third, the three operation packing technique also forces additional complexity in the implementation in order to deal with three instructions at once. Finally, the three operation packing format for IA-64 has no requirement to be compatible with existing 32-bit instruction sets. 
As a result, there is no obvious mechanism to achieve compatibility with other fixed width instruction encodings, such as the conventional 32-bit RISC encodings.
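For illustration, the three-operations-per-128-bits packing discussed above can be sketched in Python. The field widths used here (a 5-bit template in the low-order bits followed by three 41-bit operation slots) are taken from the publicly documented IA-64 bundle format rather than from the text above, and the function names are hypothetical.

```python
from typing import Sequence, Tuple

BUNDLE_BITS = 128
TEMPLATE_BITS = 5   # assumption: published IA-64 bundle format
SLOT_BITS = 41      # assumption: published IA-64 bundle format
SLOT_MASK = (1 << SLOT_BITS) - 1


def pack_bundle(template: int, slots: Sequence[int]) -> int:
    """Pack a template and three operation slots into one 128-bit value."""
    if len(slots) != 3:
        raise ValueError("a bundle holds exactly three operations")
    if not 0 <= template < (1 << TEMPLATE_BITS):
        raise ValueError("template out of range")
    bundle = template
    for i, slot in enumerate(slots):
        if not 0 <= slot <= SLOT_MASK:
            raise ValueError("slot out of range")
        bundle |= slot << (TEMPLATE_BITS + i * SLOT_BITS)
    return bundle


def unpack_bundle(bundle: int) -> Tuple[int, Tuple[int, ...]]:
    """Split a 128-bit bundle back into its template and three slots."""
    template = bundle & ((1 << TEMPLATE_BITS) - 1)
    slots = tuple(
        (bundle >> (TEMPLATE_BITS + i * SLOT_BITS)) & SLOT_MASK
        for i in range(3)
    )
    return template, slots


def average_bits_per_operation() -> float:
    """Bundle width divided by the three operations it carries."""
    return BUNDLE_BITS / 3
```

The sketch makes the overhead visible: only 41 of the 42.67 bits budgeted per operation encode the operation itself, with the remainder spent on the shared template.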
Several VLIW instruction sets use an instruction format specifier to specify the internal format of operations. Examples of these architectures include the DAISY architecture described by Ebcioglu et al. in “Dynamic Binary Translation and Optimization”, IEEE Transactions on Computers, 2002, the IA-64 architecture described by Intel, and the IBM eLite DSP architecture described by Moreno et al. in “An Innovative Low-Power High-Performance Programmable Signal Processor for Digital Communications,” IBM Journal of Research and Development, vol. 47, No. 2/3, pp. 299-326, 2003.
Another operation encoding technique for variable width VLIW architectures is disclosed by Moreno in U.S. Pat. No. 5,669,001 entitled, “Object Code Compatible Representation of Very Long Instruction Word Programs”, and U.S. Pat. No. 5,951,674 entitled, “Object Code Compatible Representation of Very Long Instruction Word Programs”. This encoding technique is similarly not applicable to maintaining object code compatibility with fixed width RISC ISAs. Rather, it maintains compatibility between several generations of VLIW architectures, being specifically directed towards the encoding of operations in a long instruction word.
In addition, a copending application entitled, “Method and Apparatus to Extend the Number of Instruction Bits in Processors with Fixed Length Instructions, in a Manner Compatible with Existing Code”, Ser. No. ______, attorney docket no. YOR920030405US1, filed on Nov. 24, 2003, assigned to the same assignee as the present application, describes a mechanism that allows for extending all instructions by a fixed amount. The mechanism operates by allocating an extension area, wherefrom each instruction derives several extension bits. The mechanism allows for maintaining the traditional 32-bit instruction boundaries of the PowerPC™ architecture, and for broadly maintaining compatibility with the pre-existing environment. However, because the presence of the extensions in accordance with the mechanism is indicated by a bit in the page table, all instructions on a page must be extended when even a single instruction uses the extension. This has at least two drawbacks. The first drawback stems from the fact that all instructions must be extended, even when only a few instructions on a page require the extension, leading to possibly significant inefficiency of such a page. The second drawback limits the free interlinking of binary object modules compiled with and without this extension, and specifically requires the link editor to either separate functions compiled employing the extensions from those not employing those extensions, or to patch the precompiled object modules not using the extensions to employ the extensions.
Another way to embed longer instructions is the use of indirection, that is, by storing a long instruction in a separate memory, or memory region, and referring to such instruction word by an indexing means embedded in the instruction stream. An example of an architecture employing indirection is the Billions of Operations Per Second (BOPS) architecture. BOPS has ‘indirect’ VLIW instructions that can also access all the processing elements inside the core via a 32-bit instruction path. These “indirect” instructions allow longer instruction words to be accessed by specifying which long instruction to access with a short indirect pointer fitting in a narrower instruction word, e.g., as those present in the PowerPC™ architecture. However, this architecture is optimized for such applications as digital signal processing (DSP), and thus is limited to DSP and similar applications.
Specifically, indirect methods in instruction words suffer from the following drawbacks. First, link editing must merge indirect tables and adjust indirect pointers during the final linkage step. Second, when the indirect table overflows, no straightforward resolution is possible which allows for preserving high performance. Third, in a multiprocessing system, different applications may require separate indirect tables, requiring indirect tables to be loaded and unloaded on each context switch, thereby significantly degrading achievable performance by increasing context switch time. Finally, not all code points can be accessed using an indirect pointer, or the pointer would have to be the same size as the expanded code space, thereby defeating the compression advantage given by the indirect approach.
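The link-editing and overflow drawbacks of indirection can be illustrated with a small model. The following Python sketch is a hypothetical simplification and does not describe the BOPS architecture or any actual linker; the class `IndirectTable` and the function `merge` are names introduced only for this example.

```python
from typing import Dict, List, Tuple


class IndirectTable:
    """Table of long instruction words addressed by a short index."""

    def __init__(self, index_bits: int) -> None:
        self.capacity = 1 << index_bits   # entries reachable by the pointer
        self.entries: List[int] = []
        self._index: Dict[int, int] = {}

    def intern(self, long_word: int) -> int:
        """Return the short pointer for long_word, adding it if absent."""
        if long_word in self._index:
            return self._index[long_word]
        if len(self.entries) >= self.capacity:
            raise OverflowError("indirect table full")
        idx = len(self.entries)
        self.entries.append(long_word)
        self._index[long_word] = idx
        return idx


def merge(first: IndirectTable, second: IndirectTable,
          index_bits: int) -> Tuple[IndirectTable, List[int], List[int]]:
    """Merge two modules' tables, returning old-to-new pointer maps that
    a link editor would use to patch each module's instruction stream."""
    merged = IndirectTable(index_bits)
    remap_first = [merged.intern(word) for word in first.entries]
    remap_second = [merged.intern(word) for word in second.entries]
    return merged, remap_first, remap_second
```

In this model every indirect pointer in the second module may change during the merge, and a merge that exceeds the pointer's reach simply fails, which mirrors the absence of a straightforward overflow resolution noted above.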
For example, U.S. Patent Application No. 20030023960A1 entitled, “Microprocessor Instruction Format Using Combination Opcodes and Destination Prefixes”, describes an indirect method wherein a combination opcode is used to obtain two opcodes for two instructions from a table using the combination opcode to perform a table access.
Another existing mechanism that uses an instruction format specifier to specify the internal format of operations is found in Jani et al., “Long Words and Wide Ports: Reinventing the Configurable Processor”, Proc. of Hot Chips 16, August 2004, which describes a method of inserting a VLIW instruction in a scalar instruction stream; this method was publicly described after the invention date of the present invention. A 32-bit or 64-bit VLIW instruction consisting of a format specifier and several operations can be embedded in a CISC instruction set containing 16-bit and 24-bit scalar instructions, based on the Flexible Length Instruction Xtensions (FLIX) extension technology, a product of Tensilica, Inc. However, while each FLIX instruction can be independently encoded and scheduled, the VLIW format requires that slots be properly coordinated, and that globally shared functions between several execution operation types not be encoded in a single FLIX instruction. As all operations are executed in parallel, this would create a resource conflict, and hence it is illegal to bundle multiple operations that use the same globally shared functions. Thus, because the FLIX instruction words encode operations which must be executed in parallel, and not instructions which can be scheduled and executed independently from each other, the encoding is unsuitable for dynamically scheduled machines that require the instruction scheduler to resolve execution resource dependences and to serialize resource and data dependent instructions. The Tensilica instruction set does not use fixed width instructions, yielding an instruction stream consisting of 16-bit, 24-bit, 32-bit, and 64-bit variable length instructions with arbitrary 8-bit alignment for any instruction address, resulting in the same instruction alignment issues as traditional variable length (CISC) instruction sets. This limitation makes this approach unsuitable for inclusion in a fixed length RISC ISA.
Therefore, in view of the above, it would be advantageous to have a mechanism for allowing the use of wide instruction words in an instruction set in conjunction with instruction sets that use fixed width instructions.
The present invention provides a method, apparatus, and computer instructions for including wide instruction words in an instruction set in conjunction with instruction sets that use fixed width instructions. The extra instruction word bits are added in a manner that is designed to minimally interfere with the encoding, decoding, and instruction processing environment, and that is compatible with existing conventional fixed instruction width code. The mechanism of the present invention permits the mixing of conventional and augmented instructions within an instruction encoding group, wherein control may be directly transferred, without operating system intervention, from one type of instruction to another.
The present invention provides many advantages over existing encoding methods. With the present invention, the number of bits that are added to an instruction set as an extension is not excessive compared to what is required to specify a reasonable number of additional registers and/or opcodes. The extension may be performed only locally to a small set of instructions, where at least one instruction uses the feature, as opposed to requiring an entire page of code to be encoded in a wider encoding. The mechanism of the present invention also allows for encoding instruction addresses with the current instruction addressing infrastructure (specifically, a 32-bit or 64-bit value), and does not require additional words to store instruction addresses for purposes of indicating exceptions, function call return addresses, and register-indirect branch targets. This functionality may be combined with a preferred branch target alignment for relative and absolute addressed branches of at least the instruction encoding group size.
In addition, the mechanism of the present invention provides an encoding format where an extended instruction of the present invention may be wider in basic instruction width than the basic instruction unit size. A feature of this invention is a group-centered decoding approach for instruction encoding groups, wherein groups of instructions are decoded. A still further feature of this extension is that an instruction encoding group is an integral multiple of the original instruction size. A still further feature is that an extended instruction can be wider than the basic instruction unit size, but is not required to be an integral multiple of the basic instruction size, to avoid excessive instruction footprint growth. For example, in one embodiment, the instruction encoding group includes an extended width instruction paired with another extended width instruction of the same size, wherein the extended width instructions correspond to three fixed width instructions. In this example, the instruction encoding group is an integral multiple of the original fixed width instruction size.
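The example embodiment in which two extended width instructions occupy the space of three fixed width instructions can be sketched as follows. This is a minimal sketch, assuming 48-bit extended instructions, 32-bit base words, and a most-significant-word-first ordering; the ordering and the function names are assumptions of this example and are not prescribed by the description.

```python
from typing import List, Sequence, Tuple

BASE_BITS = 32                 # fixed width instruction size
EXT_BITS = 48                  # assumed extended instruction size
GROUP_BITS = 2 * EXT_BITS      # 96 bits, i.e. three base words
WORD_MASK = (1 << BASE_BITS) - 1
EXT_MASK = (1 << EXT_BITS) - 1


def pack_group(first: int, second: int) -> List[int]:
    """Pack two 48-bit instructions into three 32-bit words."""
    for instr in (first, second):
        if not 0 <= instr <= EXT_MASK:
            raise ValueError("extended instruction does not fit in 48 bits")
    group = (first << EXT_BITS) | second
    return [(group >> shift) & WORD_MASK for shift in (64, 32, 0)]


def unpack_group(words: Sequence[int]) -> Tuple[int, int]:
    """Recover the two 48-bit instructions from three 32-bit words."""
    if len(words) != 3:
        raise ValueError("an encoding group is three base words")
    group = 0
    for word in words:
        group = (group << BASE_BITS) | (word & WORD_MASK)
    return group >> EXT_BITS, group & EXT_MASK
```

Because `GROUP_BITS` is an integral multiple of `BASE_BITS`, a group of this kind occupies exactly three instruction slots even though neither extended instruction is itself a multiple of the base width.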
Another feature of the present invention is that widened instructions may be placed within the instruction stream to integrate with the fixed width instructions without permanently changing the alignment of all following instructions (e.g., even after a 48-bit instruction, a 32-bit instruction stream will remain aligned at 32-bit boundaries). For example, in one embodiment, the instruction encoding group includes an extended width instruction paired with a fixed width instruction. The fixed width instructions are padded with bit groups in order to align the fixed width instructions within the extended instruction encoding group. In this manner, extended width instructions are allowed to integrate with fixed width instructions without the alignment problems associated with variable width instruction words. In one embodiment, the bit groups used for padding are unused. In another embodiment, they extend the meaning of the included base instruction, e.g., including but not limited to providing additional bits for one or more instruction fields.
Another feature of the present invention is that an instruction encoding group may encode shared information across several instructions, or a modifier can be applied to several instructions. The shared field may be used to encode an instruction or indicate the selection of a specific rounding mode for all floating point instructions encoded in such a group. For example, a shared field may be an address space identifier to be used by all memory access instructions encoded in the group. In another embodiment of the present invention, at least one of a predicate and a predicate condition can be specified in a shared field.
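The alignment property described above can be illustrated with a short sketch. It assumes a 96-bit encoding group holding one 48-bit extended instruction, one 32-bit base instruction, and 16 padding bits, laid out in that order; this particular split and the function names are assumptions of the example rather than requirements of the description.

```python
from typing import List, Tuple

WORD_BYTES = 4                               # 32-bit base instruction
GROUP_WORDS = 3                              # assumed 96-bit encoding group
EXT_BITS, BASE_BITS, PAD_BITS = 48, 32, 16   # assumed split of the group


def pack_mixed_group(ext: int, base: int, pad: int = 0) -> int:
    """Pack an extended instruction, a base instruction, and padding
    bits into a single 96-bit group value."""
    for value, bits in ((ext, EXT_BITS), (base, BASE_BITS), (pad, PAD_BITS)):
        if not 0 <= value < (1 << bits):
            raise ValueError("field does not fit its allotted bits")
    return (ext << (BASE_BITS + PAD_BITS)) | (base << PAD_BITS) | pad


def unpack_mixed_group(group: int) -> Tuple[int, int, int]:
    """Return (extended instruction, base instruction, padding bits)."""
    pad = group & ((1 << PAD_BITS) - 1)
    base = (group >> PAD_BITS) & ((1 << BASE_BITS) - 1)
    ext = group >> (BASE_BITS + PAD_BITS)
    return ext, base, pad


def layout(stream: List[str], start: int = 0) -> List[int]:
    """Byte address of each unit in a stream of 'base' and 'group' items."""
    addresses = []
    addr = start
    for kind in stream:
        addresses.append(addr)
        addr += WORD_BYTES if kind == "base" else GROUP_WORDS * WORD_BYTES
    return addresses
```

For a stream consisting of a base instruction, a group, and two further base instructions, `layout` yields the byte addresses 0, 4, 16, and 20, so every unit following the group still begins on a 32-bit boundary. The padding field may be left at zero or, as in the second embodiment above, carry additional bits for the base instruction.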
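A shared field of the kind described above can be modeled as follows. This is a minimal sketch, assuming a group-level rounding mode that applies only to floating point instructions; the data structures, field names, and mnemonics are hypothetical and serve only to illustrate the idea.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Instr:
    mnemonic: str
    is_float: bool
    rounding_mode: Optional[str] = None   # filled in during group decode


@dataclass
class EncodingGroup:
    shared_rounding_mode: str             # one field shared by the group
    instructions: List[Instr]


def decode_group(group: EncodingGroup) -> List[Instr]:
    """Apply the group's shared rounding mode to each floating point
    instruction; other instructions are passed through unchanged."""
    return [
        Instr(
            instr.mnemonic,
            instr.is_float,
            group.shared_rounding_mode if instr.is_float else None,
        )
        for instr in group.instructions
    ]
```

The same pattern would apply to a shared address space identifier, in which case the field would be attached to the memory access instructions of the group instead.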
In addition, the present invention provides a group-centered decoding approach, wherein groups of instructions (“instruction encoding groups”, or “encoding groups”) are decoded. While previous ISAs have supported bundles, they have not supported the concept of instruction encoding groups. Thus, instruction extensions such as the FLIX instructions require supporting the start of instructions at arbitrary byte addresses. Furthermore, FLIX bundles are VLIW instructions which encode multiple operations to be executed in parallel, restricting the freedom of the instruction scheduler, as well as of microarchitects in choosing what resources to share in a specific implementation of a processor. In contrast, the instruction encoding groups of the present invention do not imply the presence or absence of parallelism, as previous uses of bundles do. Instead, instruction encoding groups allow the efficient encoding of fixed width and extended width instructions in a fixed width ISA coding system without specifying a required parallel or non-parallel execution, the presence of stop bits, or other information restricting the instruction scheduler of a RISC processor.
The novel features believed characteristic of the invention are set forth in the appended claims. The invention itself, however, as well as a preferred mode of use, further objectives and advantages thereof, will best be understood by reference to the following detailed description of an illustrative embodiment when read in conjunction with the accompanying drawings, wherein:
It is noted at the outset that this invention will be described below in the context of an extension of 32-bit instruction words, of a type commonly employed in RISC architectures, to include extended instruction words. However, instruction width augmentation for other fixed width instruction sizes (e.g., 64 bits or 128 bits) is also within the scope of this invention. Similarly, the extension configurations used for exemplary exposition are an encoding group of two instructions of 48 b width, or an encoding group of 128 b width containing three instructions. Again, other widths of encoding groups are within the scope of the present invention, which can be practiced using any instruction width and group width. Examples are also made using a variety of instruction sets, and particularly the IBM PowerPC™ instruction set architecture. Again, extensions of other ISAs are within the scope of the present invention. Thus, those skilled in the art should realize that the ensuing description, and specific references to numbers of bits, instruction widths, and code systems, are not intended to be read in a limiting sense upon the practice of this invention.
The present invention may be implemented in a computer system. Therefore, the following
With reference now to
In the depicted example, local area network (LAN) adapter 110, small computer system interface (SCSI) host bus adapter 112, and expansion bus interface 114 are connected to PCI local bus 106 by direct component connection. In contrast, audio adapter 116, graphics adapter 118, and audio/video adapter 119 are connected to PCI local bus 106 by add-in boards inserted into expansion slots. Expansion bus interface 114 provides a connection for a keyboard and mouse adapter 120, modem 122, and additional memory 124. SCSI host bus adapter 112 provides a connection for hard disk drive 126, tape drive 128, and CD-ROM drive 130. Typical PCI local bus implementations will support three or four PCI expansion slots or add-in connectors.
An operating system runs on processor 102 and coordinates and provides control of various components within data processing system 100 in
Those of ordinary skill in the art will appreciate that the hardware in
The processes of the present invention are performed by processor 102 using computer implemented instructions, which may be located in a memory such as, for example, main memory 104, memory 124, or in one or more peripheral devices 126-130.
Turning next to
In a preferred embodiment, processor 210 is a single integrated circuit superscalar microprocessor, preferably implementing the PowerPC architecture. Accordingly, as discussed further herein below, processor 210 includes various units, registers, buffers, memories, and other sections, all of which are formed by integrated circuitry. As shown in
BIU 212 connects to an instruction cache 214 for storing instruction words in accordance with the present invention and to data cache 216 of processor 210. Instruction cache 214 outputs instructions encoded in accordance with the present invention to sequencer unit 218. In response to such instructions from instruction cache 214, sequencer unit 218 selectively outputs instructions to other execution circuitry of processor 210.
In addition to sequencer unit 218, in the preferred embodiment, the execution circuitry of processor 210 includes multiple execution units, namely a branch unit 220, a fixed-point unit A (“FXUA”) 222, a fixed-point unit B (“FXUB”) 224, a complex fixed-point unit (“CFXU”) 226, a load/store unit (“LSU”) 228, and a floating-point unit (“FPU”) 230. FXUA 222, FXUB 224, CFXU 226, and LSU 228 input their source operand information from general-purpose architectural registers (“GPRs”) 232 and fixed-point rename buffers 234. In the prior art, these are addressed by a number of bits encoded in the instruction word of a fixed width RISC ISA. In accordance with the present invention, wide instruction words can be embedded in the instruction stream to optionally address more architected GPRs. Moreover, FXUA 222 and FXUB 224 input a “carry bit” from a carry bit (“CA”) register 242. FXUA 222, FXUB 224, CFXU 226, and LSU 228 output results (destination operand information) of their operations for storage at selected entries in fixed-point rename buffers 234. Also, CFXU 226 inputs and outputs source operand information and destination operand information to and from special-purpose register processing unit (“SPR unit”) 237.
FPU 230 inputs its source operand information from floating-point architectural registers (“FPRs”) 236 and floating-point rename buffers 238. FPU 230 outputs results (destination operand information) of its operation for storage at selected entries in floating-point rename buffers 238. In prior art, these are addressed by a number of bits encoded in the instruction word of a fixed width RISC ISA. In accordance with the present invention, wide instruction words can be embedded in the instruction stream to optionally address more architected FPRs.
In response to a Load instruction, LSU 228 inputs information from data cache 216 and copies such information to selected ones of rename buffers 234 and 238. If such information is not stored in data cache 216, then data cache 216 inputs (through BIU 212 and system bus 211) such information from a system memory 239 connected to system bus 211. Moreover, data cache 216 is able to output (through BIU 212 and system bus 211) information from data cache 216 to system memory 239 connected to system bus 211. In response to a Store instruction, LSU 228 inputs information from a selected one of GPRs 232 and FPRs 236 and copies such information to data cache 216.
Sequencer unit 218 inputs and outputs information to and from GPRs 232 and FPRs 236 by decoding instruction words. In accordance with the present invention, instruction words can either have a fixed width instruction length, or contain embedded wide instruction words. From sequencer unit 218, branch unit 220 inputs instructions and signals indicating a present state of processor 210. In response to such instructions and signals, branch unit 220 outputs (to sequencer unit 218) signals indicating suitable memory addresses storing a sequence of instructions for execution by processor 210. In response to such signals from branch unit 220, sequencer unit 218 inputs the indicated sequence of instructions from instruction cache 214. If one or more of the sequence of instructions is not stored in instruction cache 214, then instruction cache 214 inputs (through BIU 212 and system bus 211) such instructions from system memory 239 connected to system bus 211.
In response to the instructions input from instruction cache 214, sequencer unit 218 selectively dispatches the instructions to selected ones of execution units 220, 222, 224, 226, 228, and 230. Each execution unit executes one or more instructions of a particular class of instructions. For example, FXUA 222 and FXUB 224 execute a first class of fixed-point mathematical operations on source operands, such as addition, subtraction, ANDing, ORing and XORing. CFXU 226 executes a second class of fixed-point operations on source operands, such as fixed-point multiplication and division. FPU 230 executes floating-point operations on source operands, such as floating-point multiplication and division.
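The class-based selection of execution units described above can be sketched as a simple lookup. The unit names follow the description; the operation names, the table layout, and the `select_unit` function are assumptions introduced for illustration and do not represent the actual dispatch logic of sequencer unit 218.

```python
from typing import FrozenSet, Optional

# Execution units able to handle each class of instruction.
UNITS_BY_CLASS = {
    "simple_fixed": ("FXUA", "FXUB"),   # add, subtract, AND, OR, XOR
    "complex_fixed": ("CFXU",),         # fixed-point multiply and divide
    "float": ("FPU",),                  # floating-point operations
    "load_store": ("LSU",),
    "branch": ("BRANCH",),
}

# Hypothetical operation names mapped to their instruction class.
CLASS_BY_OP = {
    "add": "simple_fixed", "sub": "simple_fixed", "and": "simple_fixed",
    "or": "simple_fixed", "xor": "simple_fixed",
    "mul": "complex_fixed", "div": "complex_fixed",
    "fmul": "float", "fdiv": "float",
    "load": "load_store", "store": "load_store",
    "b": "branch",
}


def select_unit(op: str, busy: FrozenSet[str] = frozenset()) -> Optional[str]:
    """Pick the first free unit that executes this op's class, or None
    if every candidate unit is busy."""
    for unit in UNITS_BY_CLASS[CLASS_BY_OP[op]]:
        if unit not in busy:
            return unit
    return None
```

In this sketch a simple fixed-point operation falls over from FXUA to FXUB when the former is busy, whereas classes served by a single unit must wait.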
As information is stored at a selected one of rename buffers 234, such information is associated with a storage location (e.g., one of GPRs 232 or carry bit (CA) register 242) as specified by the instruction for which the selected rename buffer is allocated. Information stored at a selected one of rename buffers 234 is copied to its associated one of GPRs 232 (or CA register 242) in response to signals from sequencer unit 218. Sequencer unit 218 directs such copying of information stored at a selected one of rename buffers 234 in response to “completing” the instruction that generated the information. Such copying is called “writeback.”
As information is stored at a selected one of rename buffers 238, such information is associated with one of FPRs 236. Information stored at a selected one of rename buffers 238 is copied to its associated one of FPRs 236 in response to signals from sequencer unit 218. Sequencer unit 218 directs such copying of information stored at a selected one of rename buffers 238 in response to “completing” the instruction that generated the information.
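The association between rename buffer entries and architectural registers, and the copying performed at writeback, can be modeled in a few lines. The following sketch is a simplified software model under assumed names (`RenamedRegisterFile`, `allocate`, `write_result`, `writeback`); it is not a description of the circuitry of processor 210.

```python
from typing import Dict, List, Optional, Tuple


class RenamedRegisterFile:
    """Architectural registers plus rename buffer entries keyed by tag."""

    def __init__(self, num_regs: int) -> None:
        self.arch: List[int] = [0] * num_regs
        # tag -> (destination register, result or None if not yet produced)
        self.rename: Dict[int, Tuple[int, Optional[int]]] = {}
        self._next_tag = 0

    def allocate(self, dest_reg: int) -> int:
        """Reserve a rename entry for an instruction's destination."""
        tag = self._next_tag
        self._next_tag += 1
        self.rename[tag] = (dest_reg, None)
        return tag

    def write_result(self, tag: int, value: int) -> None:
        """An execution unit stores its result in the rename entry."""
        dest_reg, _ = self.rename[tag]
        self.rename[tag] = (dest_reg, value)

    def writeback(self, tag: int) -> None:
        """On completion, copy the result to the architectural register
        and free the rename entry."""
        dest_reg, value = self.rename.pop(tag)
        if value is None:
            raise RuntimeError("writeback before the result was produced")
        self.arch[dest_reg] = value
```

The architectural register keeps its old value until `writeback` is called, which is what allows results to be produced out of order without disturbing architectural state.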
Processor 210 achieves high performance by processing multiple instructions simultaneously at various ones of execution units 220, 222, 224, 226, 228, and 230. Accordingly, each instruction is processed as a sequence of stages, each being executable in parallel with stages of other instructions. Such a technique is called “pipelining.”
In the fetch stage, sequencer unit 218 selectively inputs (from instruction cache 214) one or more instructions from one or more memory addresses storing the sequence of instructions discussed further hereinabove in connection with branch unit 220 and sequencer unit 218. In the decode stage, sequencer unit 218 decodes up to four fetched instructions.
In the dispatch stage, sequencer unit 218 selectively dispatches up to four decoded instructions to selected (in response to the decoding in the decode stage) ones of execution units 220, 222, 224, 226, 228, and 230 after reserving rename buffer entries for the dispatched instructions' results (destination operand information). In the dispatch stage, operand information is supplied to the selected execution units for dispatched instructions. Processor 210 dispatches instructions in order of their programmed sequence.
In the execute stage, execution units execute their dispatched instructions and output results (destination operand information) of their operations for storage at selected entries in rename buffers 234 and rename buffers 238 as discussed further hereinabove. In this manner, processor 210 is able to execute instructions out-of-order relative to their programmed sequence.
In the completion stage, sequencer unit 218 indicates an instruction is “complete.” Processor 210 “completes” instructions in order of their programmed sequence.
In the writeback stage, sequencer unit 218 directs the copying of information from rename buffers 234 and 238 to GPRs 232 and FPRs 236, respectively. Sequencer unit 218 directs such copying of information stored at a selected rename buffer. Likewise, in the writeback stage of a particular instruction, processor 210 updates its architectural states in response to the particular instruction. Processor 210 processes the respective “writeback” stages of instructions in order of their programmed sequence. Processor 210 advantageously merges an instruction's completion stage and writeback stage in specified situations.
In the illustrative embodiment, each instruction requires one machine cycle to complete each of the stages of instruction processing. Nevertheless, some instructions (e.g., complex fixed-point instructions executed by CFXU 226) may require more than one cycle. Accordingly, a variable delay may occur between a particular instruction's execution and completion stages in response to the variation in time required for completion of preceding instructions.
Completion buffer 248 is provided within sequencer unit 218 to track the completion of the multiple instructions that are being executed within the execution units. Upon an indication that an instruction or a group of instructions has been completed successfully, in an application-specified sequential order, completion buffer 248 may be utilized to initiate the transfer of the results of those completed instructions to the associated general-purpose registers.
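By way of illustration only, the relationship between out-of-order execution and in-order completion described above may be sketched in software. The following is a minimal model, not a description of the hardware of processor 210; the function name `writeback_batches` and the instruction tags are hypothetical and do not appear in the original description.

```python
from collections import deque
from typing import List


def writeback_batches(program_order: List[str], finish_order: List[str]) -> List[List[str]]:
    """Model in-order completion of instructions that finish executing out of order.

    program_order: instruction tags in programmed sequence (hypothetical labels).
    finish_order:  the order in which execution units report results.
    Returns, for each finish event, the instructions that may then be
    completed and written back, always in programmed sequence.
    """
    pending = deque(program_order)   # oldest not-yet-completed instruction first
    finished = set()                 # results held in rename buffers, not yet written back
    batches = []
    for tag in finish_order:
        finished.add(tag)
        batch = []
        # Complete only from the head of the queue, so completion stays in order.
        while pending and pending[0] in finished:
            batch.append(pending.popleft())
        batches.append(batch)
    return batches
```

In this sketch, an instruction that finishes early simply waits until every preceding instruction has finished, which corresponds to the variable delay between the execution and completion stages noted above.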
An encoding scheme for a two-byte instruction 312 is also shown. The first three bits in instruction 312 comprise the opcode for identifying the instruction width. As opcode 314 in instruction 312 is “001”, instruction 312 is two bytes long. The remaining bits in instruction 312, such as bits 4 through 16 (316), are used to encode the two-byte instruction.
An encoding scheme for a three-byte instruction 320 is provided. In a similar manner to two-byte instruction 312, the first three bits in instruction 320 comprise the opcode for identifying the instruction width. However, as the first three bits in opcode 322 are “000”, instruction 320 is indicated to be three bytes long. The remaining bits in instruction 320, such as bits 4 through 24 (324), are used to encode the three-byte instruction.
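The width determination described for instructions 312 and 320 may be sketched as follows. This is a minimal sketch under stated assumptions: the three width bits are taken to be the most significant bits of the first byte, and only the two prefixes named above (“001” and “000”) are populated. The names `WIDTH_BY_PREFIX`, `instruction_width`, and `split_instructions` are hypothetical.

```python
from typing import List

# Only the two prefixes given in the text are known; other prefixes are left undefined.
WIDTH_BY_PREFIX = {0b001: 2, 0b000: 3}


def instruction_width(first_byte: int) -> int:
    """Return the instruction width in bytes from the first three bits.

    Assumption: the first three bits are the top three bits of the first byte.
    """
    prefix = (first_byte >> 5) & 0b111
    if prefix not in WIDTH_BY_PREFIX:
        raise ValueError(f"width prefix {prefix:03b} is not described in the text")
    return WIDTH_BY_PREFIX[prefix]


def split_instructions(stream: bytes) -> List[bytes]:
    """Split a byte stream into variable length instructions."""
    instructions = []
    pos = 0
    while pos < len(stream):
        width = instruction_width(stream[pos])
        if pos + width > len(stream):
            raise ValueError("truncated instruction at end of stream")
        instructions.append(stream[pos:pos + width])
        pos += width  # the next instruction may start at any byte address
    return instructions
```

Because the start of each instruction depends on the width of the one before it, such a decoder cannot locate the second instruction until the first has been examined.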
However, conventional variable length instructions, such as those instructions described above, are not compatible with the existing code for fixed width data processor architectures. Conventional variable length instructions also require complex decoders that can start at arbitrary instruction addresses, complicating and slowing down instruction decode logic. For example,
In particular,
Turning now to
For example,
With regard to instructions 504 and 506, instruction 504 represents a format typically used for instructions requiring immediates. An immediate is a constant value stored in the instruction itself. In addition to opcode field 510 and first and second source register operands 524 and 526, instruction 504 includes immediate field 528 that codes an immediate operand, a branch target offset, or a displacement for a memory operand. Instruction 506 represents a format typically used for jump instructions. These instructions require a memory address to specify the target of the jump 530.
Although the use of fixed width instructions by RISC processors may overcome some of the issues in using variable length instructions, fixed width instructions still have many disadvantages. As more instructions must be executing at the same time so as to keep data processor execution units well utilized, it is generally necessary to increase the number of registers in the data processor, so that independent instructions may read their inputs and write their outputs without interfering with the execution of other instructions. Yet in most RISC architectures, there is not sufficient space in a 32-bit instruction word for operands to specify more than 32 registers. In addition, with only a fixed number of bits in an instruction word, it has become increasingly difficult or impossible to add new instructions, and specifically opcode encodings and wide register specifiers, to many architectures.
Turning next to
For example, U.S. Pat. No. 5,922,065 entitled, “Processor Utilizing a Template Field for Encoding Instruction Sequences in a Wide-Word Format”, discloses the format used in the IA-64 architecture. It should be noted that this patent uses a different naming scheme, referring to operations as used in this application as “instructions”, and to instructions as used in this application as “instruction group”. That an instruction group is in fact a group of operations to be executed concurrently is specified in the description and claims of the U.S. Pat. No. 5,922,065, such as claim 17 which specifies that an instruction group is “comprising a set of statically contiguous instructions that are executed concurrently”. The specific bundle architecture described in this patent further limits certain instruction slots to specific execution units based on a limited amount of template codes as shown in
Finally, in operation bundle based ISAs, all instructions follow this encoding scheme and thus cannot be properly integrated into a pre-existing fixed width RISC ISA.
For example, instruction bundle 702 comprises a memory operation (M) 704 and two integer (I) operations, 706 and 708. Stop bit 710 is positioned after integer operation 708, terminating a single instruction consisting at least of operations 704, 706 and 708; thus, instruction bundle 712 is executed in the next clock cycle for a program having a sequence of operation bundles corresponding to those shown in
However, this type of instruction encoding also exhibits several problems, both on its own, and as a technique for extending other fixed instruction width ISAs. First, this coding technique is used for encoding operations which are part of a long instruction word which is to be scheduled in parallel, not as part of independent instructions as used in RISC processors. Secondly, this instruction encoding technique permits branches to go only to instructions beginning with the first of the three operations without incurring significant implementation difficulty, and “wastes” bits for specifying the interaction between instructions (i.e., instruction stop bits). Thirdly, this three operation bundle format not only forces additional complexity in the implementation in order to deal with three operations at once, but it has no requirement to be compatible with existing fixed width instruction encodings, such as the conventional 32-bit RISC encodings.
Turning back to step 1004, if the information in the format specifier field indicates that the instruction bundle contains two operations of 30 bits each, the processor decodes the first 30-bit operation of the instruction bundle (step 1008), and then decodes the second 30-bit operation (step 1010). The processor then shifts the instruction buffer by 64 bits (step 1024), and the process returns to step 1002 if additional instruction words are to be decoded.
The information in the format specifier field may also indicate that the instruction bundle contains three operations. If the format specifier indicates that the three operations are of equal length, the processor decodes the first 20-bit operation of the instruction bundle (step 1012), decodes the second 20-bit operation (step 1014), and then decodes the third 20-bit operation (step 1016). The processor then shifts the instruction buffer by 64 bits (step 1024), and the process returns to step 1002 if additional instruction words are to be decoded.
If the format specifier indicates that the three operations are of varying length, the processor decodes each operation according to its length. For example, the processor may decode the first operation in the instruction bundle (e.g., 20 bits) (step 1018), decode the second operation (e.g., 24 bits) (step 1020), and then decode the third operation (e.g., 16 bits) (step 1022). The processor then shifts the instruction buffer by 64 bits (step 1024), and the process returns to step 1002 if additional instruction words are to be decoded.
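The bundle decoding of steps 1004 through 1024 may be sketched as follows. This is a minimal sketch under stated assumptions: since each described layout occupies 60 bits of the 64-bit bundle, the format specifier is assumed to be the remaining 4 bits and is assumed to sit in the most significant position. The specifier code values in `FORMATS` and all names are hypothetical.

```python
from typing import List

BUNDLE_BITS = 64
SPEC_BITS = 4  # assumption: 64 bits minus the 60 bits of operations

# Hypothetical specifier codes mapped to the three layouts described in the text.
FORMATS = {
    0x0: (30, 30),       # two 30-bit operations
    0x1: (20, 20, 20),   # three equal 20-bit operations
    0x2: (20, 24, 16),   # three operations of varying length
}


def decode_bundle(bundle: int) -> List[int]:
    """Extract the raw operation fields from one 64-bit bundle."""
    spec = (bundle >> (BUNDLE_BITS - SPEC_BITS)) & ((1 << SPEC_BITS) - 1)
    if spec not in FORMATS:
        raise ValueError(f"format specifier {spec:#x} is not defined in this sketch")
    remaining = BUNDLE_BITS - SPEC_BITS
    operations = []
    for width in FORMATS[spec]:
        remaining -= width
        operations.append((bundle >> remaining) & ((1 << width) - 1))
    return operations


def decode_stream(bundles: List[int]) -> List[List[int]]:
    """Decode successive bundles; advancing by one list entry models the 64-bit shift."""
    return [decode_bundle(b) for b in bundles]
```

Whatever the layout, the buffer always advances by 64 bits, so the next bundle is found without reference to the lengths of the operations inside the current one.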
As with other LIW or VLIW instruction formats, this format is designed to encode multiple operations to be executed in parallel, and not independent instructions to be issued dynamically by the instruction issue logic of a RISC processor. Furthermore, the specific encoding format is to be used for all instruction words executed by an LIW or VLIW processor, and thus cannot be included compatibly in a fixed width RISC ISA.
An exemplary diagram of an ARM instruction set format is shown in
Next, the microprocessor determines if there is a mode switch to another instruction mode (step 1208), such as, for example, to a 16-bit instruction mode. Switching to another instruction format mode occurs with an instruction mode switching instruction, i.e., an instruction specifying a switch between instruction modes. If not, the process returns to step 1202, and the microprocessor selects another 32-bit instruction to decode.
If a switch is detected in step 1208, the microprocessor selects the bytes of the next single 16-bit instruction for decoding (step 1210). The microprocessor decodes the single 16-bit instruction (step 1212), and then shifts the instruction buffer by 16 bits (step 1214) to allow the microprocessor to view the next instruction.
Next, the microprocessor determines if there is a mode switch to the 32-bit instruction mode (step 1216). If not, the process returns to step 1210, and the microprocessor selects another 16-bit instruction to decode. If a switch is detected in step 1216, the microprocessor returns to step 1202 and selects the next 32-bit instruction bytes for decoding.
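The mode-switched decode loop of steps 1202 through 1216 may be sketched as follows. This is a minimal sketch only: the two switch encodings `SWITCH_TO_16` and `SWITCH_TO_32` are hypothetical placeholders and are not the actual mode switching instructions of any architecture, and big-endian byte order is assumed for illustration.

```python
from typing import List, Tuple

SWITCH_TO_16 = 0xFFFFFFFF  # hypothetical 32-bit mode switching instruction
SWITCH_TO_32 = 0xFFFF      # hypothetical 16-bit mode switching instruction


def decode_stream(buf: bytes) -> List[Tuple[int, int]]:
    """Return (width_in_bits, raw_instruction) pairs, starting in 32-bit mode."""
    decoded = []
    width = 4  # bytes per instruction in the current mode
    pos = 0
    while pos + width <= len(buf):
        word = int.from_bytes(buf[pos:pos + width], "big")
        decoded.append((width * 8, word))
        pos += width  # shift the instruction buffer by the current width
        # The mode changes only when a mode switching instruction is decoded.
        if width == 4 and word == SWITCH_TO_16:
            width = 2
        elif width == 2 and word == SWITCH_TO_32:
            width = 4
    return decoded
```

Within either mode every instruction has the same width, so the two widths never mix without an intervening switch.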
Turning now to
According to the PowerPC instruction encoding scheme, PowerPC instruction 1300 includes a first primary opcode (POP) 1302. Primary opcode 1302 comprises 6 bits, numbered bits 0 to 5. The primary opcode establishes the broad encoding format for the remaining instruction bits. Several instruction formats exist, with the format shown in
In contrast,
In addition, another feature of the present invention shown in
In this exemplary process, the RISC processor first selects instruction bytes for decoding (step 1402). A determination is then made as to whether the opcode for the instruction indicates that an encoding group exists (step 1404). If not, the processor decodes the single 32-bit instruction (step 1406), and then shifts the instruction buffer by 32 bits (step 1408) to allow the processor to view the next instruction. The process then returns to step 1402.
If it is determined in step 1404 that the opcode indicates that an encoding group is present, the processor decodes the first instruction in the encoding group (step 1410). The processor then decodes the second instruction in the encoding group (step 1412), and then shifts the instruction buffer by 96 bits (step 1414) to allow the processor to view the next instruction words in the instruction stream. The process then returns to step 1402.
While previous ISAs have supported bundles, they have not supported the concept of encoding groups which represent instructions which can be executed sequentially, or in parallel, in accordance with data dependences established by the instruction scheduler of a processor. Thus, instruction extensions such as the FLIX instructions require supporting the start of instructions at arbitrary byte addresses. Furthermore, FLIX bundles represent VLIW instructions encoding multiple operations to be executed in parallel, restricting the freedom of the instruction scheduler, as well as of microarchitects in choosing what resources to share. On the other hand, instruction encoding groups do not imply the presence or absence of parallelism, as is the case in previous encoding formats such as operation bundles. Instead, they allow the efficient encoding of fixed width and extended width instructions in a fixed width ISA coding system.
The 6-bit base ISA opcode 1502 is allocated to indicate the presence of an encoding group. In
With the present invention, a set of extended width instructions may be allocated beginning at an appropriate fixed width instruction boundary and ending at such a boundary. Thus, while longer instruction words may be added, the overall architecture, and specifically aspects such as the branch architecture, continues to operate on word boundaries. In one embodiment using instruction encoding groups, branch targets must branch to the beginning of an encoding group having an extended width instruction. In another embodiment, the unused two lower bits of instruction addresses (indicating byte addresses which are not a multiple of 4, and which are currently unused) are used to indicate a branch target of a second instruction (wi1) 1506 or a third instruction (wi2) 1508, rather than a specific address.
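The second branch-target embodiment may be sketched as follows. This is a minimal sketch under an assumption not stated in the description: the value of the two lower address bits is taken directly as the index of the instruction within the encoding group (0 for the first instruction, 1 for wi1, 2 for wi2). The function name `split_branch_target` is hypothetical.

```python
from typing import Tuple


def split_branch_target(address: int) -> Tuple[int, int]:
    """Split a branch target into (word-aligned group address, slot index).

    Assumption: low bits 0, 1, 2 select the first, second (wi1), and
    third (wi2) instruction of the encoding group, respectively.
    """
    group_address = address & ~0b11  # the fetch still happens on a word boundary
    slot = address & 0b11            # reinterpreted as an instruction index
    if slot > 2:
        raise ValueError("slot 3 is not assigned in this sketch")
    return group_address, slot
```

The fetch address remains a multiple of 4 in every case, which is why the branch architecture can continue to operate on word boundaries.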
In one implementation of group instruction encodings, shared field 1520 may comprise a facility selector and facility bits. Thus, one encoding group may contain a selector indicating the shared resource modifies the floating point rounding mode, and the facility bits would indicate the rounding mode. Another encoding group in the same program may have a facility selector indicating the shared resource modifies the address space selection for memory access instructions, and the facility bits would specify the specific address space, and so forth. In this manner, the shared resource can be used to select from a variety of shared facilities, based on the programmer's wishes on how to modify the specific instructions in a specific instruction encoding group.
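The facility selector arrangement may be sketched as follows. This is a minimal sketch only: the selector values, the 4-bit width of the facility bits, and the names `Facility` and `decode_shared_field` are hypothetical, since the description does not fix the layout of shared field 1520.

```python
from enum import IntEnum
from typing import Tuple


class Facility(IntEnum):
    """Hypothetical selector values for the two facilities named in the text."""
    ROUNDING_MODE = 0   # floating point rounding mode
    ADDRESS_SPACE = 1   # address space selection for memory access instructions


FACILITY_BITS = 4  # assumption: width of the facility bits


def decode_shared_field(shared: int) -> Tuple[Facility, int]:
    """Split a shared field into (facility selector, facility bits)."""
    selector = shared >> FACILITY_BITS
    value = shared & ((1 << FACILITY_BITS) - 1)
    return Facility(selector), value  # raises ValueError for an unknown selector
```

For example, under these assumptions `decode_shared_field(0b10011)` selects the address space facility with facility bits equal to 3, and that setting would apply to the instructions of that encoding group.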
The process begins with the RISC processor selecting the instruction bytes to decode (step 1602). The process then determines if the opcode in the instruction indicates that the selected instruction bytes are part of an encoding group (step 1604). If not, the RISC processor decodes the single 32-bit instruction (step 1606), and shifts the instruction buffer by 32 bits (step 1608), with the process returning to step 1602.
Turning back to step 1604, if the opcode in the instruction indicates that the selected instruction bytes are part of an encoding group, the RISC processor processes and skips the encoding group header (step 1610). Next, the RISC processor decodes the first instruction in the encoding group (step 1612). The RISC processor then decodes the second instruction in the encoding group (step 1614), and then decodes the third instruction in the encoding group (step 1616). Once each instruction in the encoding group is decoded, the RISC processor shifts the instruction buffer by 128 bits (step 1618), with the process returning to step 1602.
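The decode process of steps 1602 through 1618 may be sketched as follows. This is a minimal sketch under stated assumptions: the opcode value `GROUP_OPCODE`, the 8-bit header, and the three equal 40-bit instruction slots are hypothetical choices made so that the fields total 128 bits; the description does not specify them. Big-endian byte order is likewise assumed.

```python
from typing import List, Tuple

GROUP_OPCODE = 0b111111  # hypothetical 6-bit opcode marking an encoding group
HEADER_BITS = 8          # assumption: 6-bit opcode plus 2 further header bits
SLOT_BITS = 40           # assumption: (128 - 8) / 3 bits per extended instruction
GROUP_BYTES = 16         # 128 bits


def decode(buf: bytes) -> List[Tuple[str, int]]:
    """Decode a stream of 32-bit instructions and 128-bit encoding groups."""
    out = []
    pos = 0
    while pos + 4 <= len(buf):
        opcode = buf[pos] >> 2  # top six bits of the first byte
        if opcode != GROUP_OPCODE:
            # Single 32-bit instruction: decode, then shift by 32 bits.
            out.append(("single", int.from_bytes(buf[pos:pos + 4], "big")))
            pos += 4
            continue
        if pos + GROUP_BYTES > len(buf):
            raise ValueError("truncated encoding group")
        group = int.from_bytes(buf[pos:pos + GROUP_BYTES], "big")
        remaining = 128 - HEADER_BITS  # process and skip the header
        for _ in range(3):
            remaining -= SLOT_BITS
            out.append(("extended", (group >> remaining) & ((1 << SLOT_BITS) - 1)))
        pos += GROUP_BYTES  # shift by 128 bits, a multiple of the 32-bit word
    return out
```

Because an encoding group occupies a whole number of 32-bit words, the instruction following it again begins on a word boundary, unlike in the byte-granular variable length schemes discussed earlier.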
Although the example process in
While the aspects of this present invention have been presented in the context of fixed width RISC instruction set architectures, some aspects of instruction encoding groups may be advantageously practiced in conjunction with other ISAs. In one such use, instruction encoding groups may be used to specify shared fields. In one such advantageous use of instruction encoding groups for other instruction set architectures, a predicate field may be shared between several instructions.
The foregoing description has provided by way of exemplary and non-limiting examples a full and informative description of the best method and apparatus presently contemplated by the inventors for carrying out the invention. However, various modifications and adaptations may become apparent to those skilled in the relevant arts in view of the foregoing description, when read in conjunction with the accompanying drawings and the appended claims. As but some examples, and as was noted above, this invention is not limited to the use of any specific instruction widths, instruction extension widths, code page memory sizes, specific sizes of partitions or allocations of code page memory and the like, nor is this invention limited for use with any one specific type of hardware architecture or programming model, nor is this invention limited to a particular instruction pipeline. The use of other and similar or equivalent embodiments may be attempted by those skilled in the art. However, all such and similar modifications of the teachings of this invention will still fall within the scope of this invention.
Further, some of the features of the present invention could be used to advantage without the corresponding use of other features. As such, the foregoing description should be considered as merely illustrative of the principles of the present invention, and not in limitation thereof.