1. Technical Field of the Invention
This invention relates generally to programmable processing apparatus, and more specifically to microarchitectural details of instruction handling between decoding and execution.
2. Background Art
For convenience, the various machines are illustrated herein in a generally top-to-bottom data flow orientation, such that instructions flow more or less from the top of the drawing to the bottom. The reader should note that this means that instructions appear in bottom-to-top order when shown within an executable code block, with the earliest (oldest) instructions shown at the bottom, closest to the machine, and the latest (newest) instructions shown at the top, generally closer to the compiler.
The term “ISA instruction” will be used when referring to an instruction which is in the native terms of an Instruction Set Architecture (ISA). The terms “micro-instruction” or “μop” will be used when referring to an instruction which results from decoding an ISA instruction into one or more instructions which are in the native terms of a microarchitecture or other characterization of a low-level implementation of a processor. The term “instruction” will be used when referring generically to an ISA instruction and/or a μop. The term “sequential instructions” will be used to refer to instructions which are not organized as VLIW instruction words, such as RISC/CISC code in ISA or μop form.
The VLIW processor includes an instruction word fetcher which fetches a VLIW instruction word from the executable code, and a dispatcher which issues the fetched instruction word to a plurality of execution units. The execution units may include, for example, two Add/Sub units for performing addition and subtraction operations, a Mul/Div unit for performing multiplication and division operations, a Shifter unit for performing shift and rotate operations, a Logical unit for performing bitwise operations such as AND, OR, and XOR, and a Branch unit for performing control flow branching operations such as jumps and conditional branches.
The VLIW compiler must know certain architectural details of the VLIW processor, such as how many execution units it has, what types of instructions each is capable of executing, which “slots” each instruction occupies across the machine, whether certain instructions can or cannot coexist within the same VLIW instruction word, and so forth. It must also be capable of determining certain things about the source code it is compiling, such as identifying data dependencies, to ensure that it generates valid code that will correctly execute to produce the intended result.
In the interest of clarity of illustration, many well-known features of the VLIW processor have been omitted, such as the register file, as showing them would not add to the skilled reader's understanding of the present invention.
The six instructions of each VLIW instruction word are issued in lock-step to their respective slots' execution units. Virtually all of the scheduling intelligence is in the VLIW compiler; the VLIW processor itself makes no decisions about data dependencies (other than waiting to issue a decoded VLIW instruction word until all of its input data operands are ready), code reordering, and the like. As soon as the longest-latency instruction in the prior VLIW instruction word has completed execution and the next VLIW instruction word's operand data are available, the scheduler ships the next VLIW instruction word to the execution units.
The hardware of the VLIW processor can be significantly simplified, because the instruction scheduling intelligence has been incorporated into the compiler. A significant and unfortunate side-effect of this is that VLIW code suffers greatly from “NOP code bloat”, with typically 25% to 50% of the instruction slots being occupied with “NOP” (no-operation or null operation) instructions that were not present in the source code but were, for any of a variety of scheduling reasons, injected by the compiler.
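The NOP code bloat described above can be illustrated with a small sketch. The slot names and the three-cycle schedule below are purely illustrative assumptions, not taken from any particular VLIW machine; the point is only that every cycle must emit a full-width word, so unfillable slots become NOPs.

```python
# Hypothetical sketch of VLIW "NOP code bloat": the compiler must emit a
# full-width instruction word every cycle, padding unused slots with NOPs.
# Slot names and the example schedule are illustrative assumptions.

SLOTS = ["add_sub0", "add_sub1", "mul_div", "shift", "logical", "branch"]

def pad_word(scheduled_ops):
    """Build one VLIW word from a dict {slot_name: opcode}."""
    return [scheduled_ops.get(slot, "NOP") for slot in SLOTS]

# Three cycles of a toy schedule; most slots cannot be filled every cycle.
schedule = [
    {"add_sub0": "ADD", "mul_div": "MUL"},
    {"shift": "ROR"},
    {"add_sub0": "SUB", "logical": "AND", "branch": "BNZ"},
]

words = [pad_word(ops) for ops in schedule]
nops = sum(w.count("NOP") for w in words)
total = len(words) * len(SLOTS)
print(f"NOP slots: {nops}/{total}")  # 12 of 18 slots wasted on NOPs
```

Here two thirds of the slots carry NOPs, even worse than the typical 25% to 50% cited above, because the toy schedule is deliberately sparse.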
The processor operates upon executable code which has been generated from source code by a compiler. The executable code differs from conventional VLIW code in two respects. First, the compiler does not pad the executable code with “NOP” instructions. And second, the instruction slots are not strictly aligned with the execution unit slots.
The compiler constructs “fetch packets” which are 256 bits and 8 instructions wide (although, for ease of illustration, they are shown as only 6 wide). Similarly, the processor is 8 execution units wide (although it is shown as only 6 wide).
The processor includes a packet fetcher and dispatcher which retrieves a next fetch packet of the executable code, and 8 instruction decoders. The packet fetcher and dispatcher uses the instructions' LSB (least significant bit) markers to dispatch exactly one execution packet's instructions simultaneously to the decoders.
The output of the decoders is presumably fed to some sort of steering logic, which routes the decoded instructions of the current execution packet to their appropriate execution units. This is necessary because the execution packet is not a full-width, slot-aligned VLIW instruction word.
If the current fetch packet includes a second (or subsequent) execution packet, that execution packet's instructions are dispatched together at the following clock cycle, after the previously-dispatched instructions have been executed.
For example, the current fetch packet may include: (1) a first execution packet comprising the instructions in slot0, slot1, and slot2; (2) a second execution packet comprising the instructions in slot3 and slot4; and (3) a third execution packet comprising the instruction in slot5. The LSBs b2, b4, and b5 will be “1” and the others will be “0”. The dispatcher will first send the instructions in slot0, slot1, and slot2 to the decoders, which will determine what kinds of instructions they are, and the steering logic will route them to their appropriate execution units. After that first execution packet completes execution, the dispatcher will likewise send the instructions in slot3 and slot4 through the decoders and steering logic, and after that second execution packet completes execution, it will do the same for the instruction in slot5.
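The dispatch behavior just described can be sketched as follows. Per the example above, an LSB of “1” marks the last slot of an execution packet; the instruction mnemonics in the example fetch packet are illustrative assumptions.

```python
# A sketch of how a dispatcher might use per-instruction LSB markers to
# carve a fetch packet into execution packets. Per the example in the text,
# an LSB of 1 marks the *last* slot of an execution packet.

def split_execution_packets(slots, lsbs):
    """slots: list of instructions; lsbs: parallel list of 0/1 markers."""
    packets, current = [], []
    for instr, bit in zip(slots, lsbs):
        current.append(instr)
        if bit == 1:            # end of an execution packet
            packets.append(current)
            current = []
    if current:                 # trailing instructions if the last marker is 0
        packets.append(current)
    return packets

fetch_packet = ["ADD", "MUL", "ROR", "SUB", "AND", "BNZ"]
lsbs         = [0, 0, 1, 0, 1, 1]   # b2, b4, and b5 set, as in the example
packets = split_execution_packets(fetch_packet, lsbs)
print(packets)  # three packets of sizes 3, 2, and 1
```

Each returned sub-list would then be dispatched to the decoders in a separate clock cycle, in program order.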
At each cycle, the steering logic will presumably indicate to the unused execution units that they are unused, enabling them to remain idle and reduce power consumption.
Thus, this processor enables the use of what is essentially VLIW executable code and a VLIW processor, without “NOP” padding. Execution packets are executed in program order, just as they would have been in a conventional, NOP-padded VLIW processor.
A RISC/CISC compiler generates RISC/CISC executable code according to the source code. The compiler knows about the processor's instruction set architecture (ISA), which includes e.g. the number and identities of registers and the available instructions. The compiler generates sequential instructions, rather than multi-instruction words (like a VLIW compiler would).
The processor may include a prefetcher which is used to bring instructions and/or data into an instruction cache and a data cache, respectively. The processor typically utilizes a microarchitecture which is somewhat different from the ISA. The processor includes execution units which execute microinstructions or “μops”, which are typically of a very different format than the ISA instructions, especially in a CISC architecture. It also includes a register file for holding data.
An instruction fetcher sequentially retrieves instructions from the executable code, usually via the instruction cache, which are then decoded by an instruction decoder. Some instructions, typically the more “RISCy” ones, are directly decoded into “μops”. Other instructions, typically the more “CISCy” ones, are not directly decoded into μops, but trigger the processor to retrieve a sequence of μops from a microcode read-only memory (ROM). Regardless of whether the μops come from the instruction decoder, from the microcode ROM, or from elsewhere, a micro-instruction scheduler controls their issuance to the appropriate execution units.
If the processor is an “in-order” machine, it executes the ISA executable code instructions' corresponding μops strictly in the order specified by the compiler. For example, the “ADD” instruction shown in the first (bottommost) position in the executable code is executed before any subsequent instruction.
However, if the processor is an “out-of-order” machine, it will further include a reordering mechanism enabling the processor to, under certain conditions, execute the μops in a somewhat different order than that specified by the ISA code. The compiler may have applied some level of intelligence to the source code already, for example moving long-latency instructions (e.g. memory reads) to positions earlier in the code stream than the source code would indicate; it can do this as long as it does not e.g. cause a data dependency error by moving a consumer instruction ahead of a producer instruction, where the consumer instruction uses the producer instruction's result as an input operand. The compiler may also apply other types of optimizations, such as loop unrolling.
The processor's reordering mechanism adds some additional intelligence to the processor, enabling it to reorder instructions (still without violating data dependencies and the like) under certain other conditions. For example, the compiler might not be able to know, for certain, whether the processor will hit or miss the cache when executing a particular instruction. By executing out of programmatic order, the processor can get work done during such instances which would otherwise stall the execution pipeline.
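The data-dependency constraint described above, whether enforced by the compiler or by the processor's reordering mechanism, can be sketched as a legality check. The instruction encoding here (a destination register plus a tuple of source registers) is an illustrative assumption.

```python
# A minimal sketch of the reordering constraint described above: a consumer
# instruction may not be moved ahead of the producer instruction whose
# result it uses as an input operand. WAR and WAW hazards are checked too.

def can_hoist_above(consumer, producer):
    """True if `consumer` may legally be reordered ahead of `producer`.
    Each instruction is a (dest_register, (source_registers...)) pair."""
    dest_c, srcs_c = consumer
    dest_p, srcs_p = producer
    if dest_p in srcs_c:        # RAW: consumer reads the producer's result
        return False
    if dest_c in srcs_p:        # WAR: producer still needs the old value
        return False
    if dest_c == dest_p:        # WAW: final register value would change
        return False
    return True

load = ("r1", ("r9",))          # r1 <- MEM[r9], a long-latency memory read
add  = ("r2", ("r1", "r3"))     # r2 <- r1 + r3, consumes the loaded r1
print(can_hoist_above(add, load))  # False: would violate the RAW dependency
```

An independent instruction, such as one reading only r5, could legally be hoisted above the load to hide its latency.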
Some out-of-order processors also perform “speculative execution”, in which they execute instructions along both the “taken” and “not taken” paths of a conditional branch instruction, without retiring those instructions' results to “machine state”. Then, when it becomes known whether the branch is or is not taken, the results of the instructions along the wrong path can simply be discarded, and those along the correct path can be committed to machine state and retired.
The hardware necessary for maintaining correct program functionality in such machines is generally quite significant, both in die area and design complexity.
Operation of the back end begins (160) with the scheduler waiting (162) until it receives an instruction from the decoder. Then, the scheduler waits (164) until that instruction's input operand data are all available, and (166) an appropriate execution unit is available. Then, the scheduler issues (168) the instruction to that execution unit, which executes (170) the instruction. The scheduler then returns to waiting (162) for an instruction, which may have already been received.
Eventually, the front end reaches (152) the end of the executable code, and its operation ends (154), at which point the back end will be left waiting (162) for another instruction to execute.
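The back-end loop described above (reference numerals 162 through 170) can be sketched as follows. The queue-based model and the always-ready predicates are illustrative assumptions; in hardware these would be wakeup and availability signals.

```python
# A sketch of the back-end loop described above: wait for an instruction
# (162), wait for its operand data (164) and a free execution unit (166),
# then issue (168) and execute (170) it, and return to waiting.

from collections import deque

def run_back_end(instr_queue, operands_ready, unit_free):
    executed = []
    while instr_queue:                    # (162) wait for an instruction
        instr = instr_queue.popleft()
        while not operands_ready(instr):  # (164) wait for operand data
            pass
        while not unit_free(instr):       # (166) wait for an execution unit
            pass
        executed.append(instr)            # (168) issue and (170) execute
    return executed                       # (154) front end done, queue empty

order = run_back_end(deque(["MUL", "ADD", "ROR"]),
                     operands_ready=lambda i: True,
                     unit_free=lambda i: True)
print(order)  # instructions execute strictly in the order received
```

Because each instruction is fully handled before the next is dequeued, this sketch models the in-order case; an out-of-order back end would select from the queue rather than always taking its head.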
In order to increase performance by exploiting instruction level parallelism (ILP), conventional RISC/CISC processors are made “wider”, with multiple execution pipelines, multiple instruction decoders, and so forth. But at some relatively small width (typically in the range of 2 to 4, depending upon the architecture and the implementation), the performance increase from going wider quickly approaches zero in an in-order machine. An out-of-order execution machine is better able to keep a wider set of execution units busy. Unfortunately, out-of-order implementations are much more complicated, take more die area, consume more power, and are harder to scale in frequency than in-order machines. Many manufacturers are now going to dual-core and multi-core devices, in essence pushing ILP exploitation back to the software writers and the compiler.
What is desirable is a hybrid machine which offers the simple, efficient, fast, and scalable advantages of a VLIW execution engine, without suffering from VLIW NOP code bloat, and which can execute conventional RISC/CISC code and thereby decouple the VLIW-like aspects of the implementation from the compiler's view, such that the code does not need to be recompiled for each implementation of the architecture. In other words, what is desirable is a machine whose software and front end offer the advantages of a RISC/CISC machine, and whose back end offers the advantages of a VLIW machine.
The invention will be understood more fully from the detailed description given below and from the accompanying drawings of embodiments of the invention which, however, should not be taken to limit the invention to the specific embodiments described, but are for explanation and understanding only.
The DSP includes a cache which interfaces to the external memory/storage system (not shown), and one or more instruction decoders which decode incoming ISA instructions into their respective corresponding μop(s). An instruction packer receives the μops from the instruction decoders, packs them into an instruction packet (described below) which an instruction scheduler receives and schedules for execution by a plurality of execution units. A register file provides data storage for instruction results.
The DSP includes a μop buffer which receives the μops from the decoder and provides them to the instruction packer. The μop buffer decouples the instruction packer from the instruction decoder, and can be constructed as a FIFO, ring buffer, etc.
The instruction packer includes a packing rules engine which determines whether each new μop can be packed into the same instruction packet as previously packed μops, or whether there is a packet breaking condition which prevents it from being packed with them.
An instruction packet is, in essence, a VLIW instruction word, for execution by the DSP's execution units in VLIW fashion, meaning that each “slot” or μop in the instruction packet is aligned with and uniquely bound to a particular, corresponding execution unit. The instruction packer constructs an instruction packet referred to as the UCPacket (for “Under Construction Packet”), which it eventually passes on to the instruction scheduler.
The packing rules determine which of the μops can be packed into the UCPacket. The packing rules can be any constraints whatsoever, depending upon the architecture, microarchitecture, and design implementation of the particular DSP. Exemplary rules for an in-order implementation may include constraints such as those reflected in the packet breaking conditions described below.
The impending breakage of any packing rule is a “packet breaking condition”. The packer stops packing the UCPacket when any rule would otherwise be broken. Any unfilled slots in the UCPacket are then filled with “NOP” instructions, either literally by being filled with the NOP opcode bit pattern, or effectively by having a flag bit or valid bit cleared or the like.
The instruction packer also includes a resource binder which controls the slot positioning of the μops as they pass through the packing rules engine. The resource binder determines which type of execution unit the particular μop calls for, and also determines whether there is one of those slots still available in the UCPacket. The absence of a suitable slot is a packet breaking condition, which the resource binder signals to a packet accumulation engine and the packing rules engine.
The instruction packer includes a packet accumulation engine which determines whether the instruction packer should continue trying to pack more μops into the UCPacket, or whether the UCPacket should be shipped off to the packet storage of the instruction scheduler “as is”. If the packing rules engine or the resource binder indicates a packet breaking condition, the packet accumulator attempts to ship the UCPacket to the instruction scheduler. Even if there is no packet breaking condition, the packet accumulation engine may decide to end packing of the current UCPacket, for example if the instruction scheduler is about to run out of previous instruction packets. (It may typically prove more beneficial to keep the scheduler fed with even sub-optimally-packed packets, than to let it starve.) The packet storage of the instruction scheduler decouples the instruction packer from the execution units.
The DSP includes a plurality of execution units, each in a predetermined “slot”. For example, the DSP may include two Add/Sub (addition and subtraction) units, a Mult/Div (multiplication and division) unit, a shifter, a logical unit for performing AND, OR, etc. instructions, and a branch unit for performing branch instructions. The DSP may include any number of execution units. For ease of illustration, it is shown with six, but in other embodiments there may be e.g. eight execution units or sixteen execution units, or any suitable number. The UCPacket includes corresponding instruction slots—corresponding in number, location, and functionality type.
In one embodiment, as long as there is at least one packet waiting in the scheduler, the packer is allowed to continue packing the currently under-construction packet. This will, in many instances, enable overall performance to be increased by reducing the number of “NOP” instructions in the packets when they arrive at the execution units.
However, when the packer encounters a “packet-breaking” condition, it cannot perform any further packing, and, as long as there is at least one empty entry in the ring buffer, the packet accumulation engine sends the UCPacket to the scheduler. For example, if all packet slots have been filled with non-NOP instructions, no further packing is possible. Or, if the programmatically-next instruction is e.g. a conditional branch which cannot share a packet with other instructions, no further packing is possible. Or, if all of the Add/Sub slots have been filled and the next instruction is another ADD instruction, no further packing is possible.
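The packing loop and the packet-breaking conditions enumerated above can be sketched as follows. The slot capacities follow the six-unit example; the opcode-to-unit mapping and the rule that a conditional branch cannot share a packet are illustrative assumptions drawn from the examples in the text.

```python
# A sketch of the instruction packer: μops are packed greedily until a
# packet-breaking condition occurs (no free slot of the required unit type,
# which also covers a completely full packet, or an instruction that cannot
# share a packet), at which point the UCPacket is shipped "as is".

CAPACITY = {"add_sub": 2, "mul_div": 1, "shift": 1, "logical": 1, "branch": 1}
UNIT_FOR = {"ADD": "add_sub", "SUB": "add_sub", "MUL": "mul_div",
            "ROR": "shift", "AND": "logical", "BNZ": "branch"}
SOLO = {"BNZ"}   # assumed: conditional branches cannot share a packet

def pack(uops):
    """Greedily pack a μop stream; returns the list of shipped packets."""
    packets, packet, used = [], [], {}
    for op in uops:
        unit = UNIT_FOR[op]
        breaking = (op in SOLO and packet) or \
                   used.get(unit, 0) >= CAPACITY[unit]
        if breaking:                       # ship the UCPacket "as is"
            packets.append(packet)
            packet, used = [], {}
        packet.append(op)
        used[unit] = used.get(unit, 0) + 1
        if op in SOLO:                     # a solo instruction ends its packet
            packets.append(packet)
            packet, used = [], {}
    if packet:                             # ship whatever remains at the end
        packets.append(packet)
    return packets

print(pack(["ADD", "SUB", "ADD", "MUL", "BNZ"]))
```

In this trace the third ADD breaks the first packet because both Add/Sub slots are already occupied, and the conditional branch ends up alone, exactly as in the conditions described above. Unfilled slots would then be NOP-filled (literally or via cleared valid bits) before shipment.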
The DSP issues and executes instructions in VLIW fashion. The DSP is an in-order machine. One reason that this is significant is that, because the executable code is constructed as in-order code and not VLIW instruction words, the DSP must be able to correctly handle precise exceptions.
For example, in the code example given, the MUL, ADD, and ROR instruction sequence may be packed into a single UCPacket, and the DSP must be able to take a precise exception on any one of those instructions.
The UCPacket includes six instructions in slot0 through slot5. These slots correspond to the physical positioning of the various execution units, and do not necessarily correspond to the order of the instructions in the program. In the example given above, the MUL would be in slot2, the ADD in slot0, and the ROR in slot3; the ADD comes before the MUL in the UCPacket in slot order, even though the MUL comes before the ADD in the program order.
The slot further includes an “age” field which indicates the relative age of that instruction within the UCPacket. For example, the MUL may be assigned an age value of 0, the ADD an age value of 1, and the ROR an age value of 2. Thus, the age field simply indicates the programmatic order of the instructions in the UCPacket. In one embodiment, age fields of slots holding packer-generated NOP instructions may be assigned sequential values greater than the largest age value assigned to an actual instruction.
The slot further includes an issued flag bit which indicates whether the instruction has been issued for execution. The slot further includes a complete flag bit which indicates that the instruction has been completely executed, including the handling of any events.
The slot includes a μopcode field which indicates the opcode of the μop. The slot further includes one or more source identifier fields (e.g. src1, src2, src3), each of which identifies a source from which operand data will be taken in executing the instruction, and a destination identifier field (dest) which identifies a destination to which result data will be written. The sources may include immediate data.
When an instruction causes an event, each instruction whose age field has a value larger (indicating that it is programmatically younger) than that of the instruction which caused the event, will need to be prevented from committing state and from setting the complete flag. After the event condition is resolved, the valid and/or issued and/or completed bits of all older instructions in the same packet, including the one that caused the exception, can be cleared, to prevent those from being re-executed—thus they will be treated as though they were NOPs, by their execution units. Valid, non-complete μops can then be re-executed to finish execution of the packet.
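The recovery sequence just described can be sketched as follows. The dict representation of a slot and the specific flag manipulations shown are illustrative assumptions; a hardware implementation would operate on the valid/issued/complete bits directly.

```python
# A sketch of precise-exception recovery using the age field: μops younger
# (larger age) than the faulting one are blocked from completing; after the
# event is resolved, the faulting μop and its elders are invalidated (so
# their execution units treat them as NOPs) and the packet is replayed.

def recover(packet, faulting_age):
    """packet: list of dicts with 'age', 'valid', and 'complete' keys.
    Returns the μops that must be re-executed to finish the packet."""
    for slot in packet:
        if slot["age"] > faulting_age:
            slot["complete"] = False   # block younger μops from committing
        else:
            slot["valid"] = False      # elders and the fault become NOPs
    return [s for s in packet if s["valid"] and not s["complete"]]

packet = [{"op": "MUL", "age": 0, "valid": True, "complete": True},
          {"op": "ADD", "age": 1, "valid": True, "complete": False},
          {"op": "ROR", "age": 2, "valid": True, "complete": False}]
replay = recover(packet, faulting_age=1)   # the ADD caused the event
print([s["op"] for s in replay])  # only the younger ROR is re-executed
```

The already-complete MUL and the handled ADD are not re-executed, preserving precise-exception semantics despite the VLIW-style packed issue.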
The following segments of pseudo-code illustrate two different methods of operation of the packer. The primary difference between the two is this. If the first method reaches the end of the group of μops received by the packer without shipping the UCPacket to the scheduler, it starts over, attempting to do better packing, with a newly received group of instructions which may be larger. Any μops that were packed the first time will simply be re-packed the second time. If the second method reaches the end of the group of μops received by the packer without shipping the UCPacket to the scheduler, it continues by sliding to a new group of μops retrieved from the μop buffer, leaving the previously-packed μops in their slots in the UCPacket.
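The distinction between the two methods can be illustrated with a toy sketch. The fixed packet width of 3 and the grouping of μops shown are illustrative assumptions; this is not a reconstruction of the pseudo-code segments themselves.

```python
# A toy contrast of the two packing methods described above. Method one
# restarts packing from scratch over a refreshed (possibly larger) group
# that re-includes the previously packed μops; method two keeps the packed
# μops and slides on to the next group from the μop buffer.

WIDTH = 3  # assumed packet width for illustration

def method_one(groups):
    """Restart: re-pack from scratch each time a group ends unshipped."""
    pool = []
    for group in groups:
        pool = list(group)       # refreshed group; earlier μops re-packed
        if len(pool) >= WIDTH:
            return pool[:WIDTH]  # ship a full packet
    return pool                  # best effort at end of code

def method_two(groups):
    """Slide: keep already-packed μops, continue with the next group."""
    packet = []
    for group in groups:
        for uop in group:
            packet.append(uop)
            if len(packet) == WIDTH:
                return packet    # ship a full packet
    return packet

# Method one sees overlapping groups (the second re-includes the ADD);
# method two sees disjoint groups. Both arrive at the same packet.
m1 = method_one([["ADD"], ["ADD", "MUL", "ROR"]])
m2 = method_two([["ADD"], ["MUL", "ROR"]])
print(m1, m2)
```

Both methods converge on the same packed result here; they differ in how much re-work is done when the μop buffer refills between packing attempts.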
These and a variety of other algorithms may be used in implementing the instruction packer's method of operation.
When one component is said to be “adjacent” to another component, it should not be interpreted to mean that there is absolutely nothing between the two components, only that they are in the order indicated.
The various features illustrated in the figures may be combined in many ways, and should not be interpreted as though limited to the specific embodiments in which they were explained and shown.
Those skilled in the art having the benefit of this disclosure will appreciate that many other variations from the foregoing description and drawings may be made within the scope of the present invention. Indeed, the invention is not limited to the details described above. Rather, it is the following claims including any amendments thereto that define the scope of the invention.