The instant application claims priority to Chinese Patent Application No. 201010624755.0, filed Dec. 30, 2010, which application is incorporated herein by reference in its entirety.
This Summary is provided to introduce, in a simplified form, a selection of concepts that are further described below in the Detailed Description. This Summary is not intended to identify key features or essential features of the claimed subject matter, nor is it intended to be used to limit the scope of the claimed subject matter.
An embodiment of an instruction pipeline includes first and second sections. The first section is operable to provide first and second ordered instructions, and the second section is operable, in response to the second instruction, to read first data from a data-storage location, is operable, in response to the first instruction, to write second data to the data-storage location after reading the first data, and is operable, in response to writing the second data after reading the first data, to cause the flushing of some, but not all, of the pipeline.
In an embodiment, such an instruction pipeline may reduce the processing time lost and the energy expended due to a pipeline flush by flushing only a portion of the pipeline instead of flushing the entire pipeline. For example, a superscalar processor may perform such a partial pipeline flush in response to a mis-speculative load instruction, that is, a load instruction that is executed relative to a memory location before the execution of a store instruction relative to the same memory location, where the store instruction comes before the load instruction in the instruction order. The processor may perform such a partial pipeline flush by reloading the instruction-issue queue from the reorder buffer such that a fetch-decode section of the pipeline need not be, and therefore is not, flushed.
A superscalar processor may include an instruction pipeline that is operable to simultaneously execute multiple (e.g., four) program instructions out of order, i.e., in an order other than the sequence in which the instructions are ordered in a program. By simultaneously executing multiple instructions out of order, a superscalar processor may be able to execute a software or firmware program faster than a processor that is operable to execute instructions only in order or only one at a time.
The instruction pipeline 10 includes an instruction-fetch-decode section 12, an instruction-queue section 14, an instruction-issue section 16, and an instruction-execute section 18.
The instruction-fetch-decode section 12 includes an instruction-fetch (IF) stage 20, an instruction-decode (ID) stage 22, and a register-mapping (RM) stage 24.
The IF stage 20 fetches program instructions from a program memory (not shown in
The ID stage 22 decodes the fetched instructions in the order received from the IF stage 20.
The RM stage 24 prevents potential physical-register conflicts by remapping the processor's physical register(s) (not shown in
The instruction-queue section 14 includes an instruction enter-queue (EQ) stage 26, which includes one or more instruction queues that are further discussed below in conjunction with
The instruction-issue section 16 includes an instruction-issue (IS) stage 28, which issues instructions from the EQ stage 26 to the instruction-execute section 18. The IS stage 28 may issue multiple instructions simultaneously, and may issue an instruction out of the program order if the instruction is ready to be executed before a previous instruction in the program order. For example, an add instruction may sum together two values that are presently available, but a previous subtract instruction may subtract one value from another value that is not yet available. Therefore, to speed up the instruction execution, instead of waiting for the other subtraction value to become available before issuing any subsequent instructions, the IS stage 28 may issue the add instruction to the instruction-execute section 18 before issuing the subtract instruction to the instruction-execute section, even though the subtract instruction comes before the add instruction in the program order.
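The readiness check described above can be sketched as follows. This is a minimal illustrative model, not the disclosed implementation; the class and function names, and the register-set representation, are assumptions introduced for clarity:

```python
# Illustrative sketch: an instruction may issue as soon as all of its source
# operands are available, even if an older instruction is still stalled.
# Names (Instr, issue_ready) are assumptions, not from the embodiment.

class Instr:
    def __init__(self, name, sources):
        self.name = name
        self.sources = sources   # register names this instruction reads

def issue_ready(queue, available_regs):
    """Return the instructions whose operands are all available, oldest first."""
    return [i for i in queue if all(s in available_regs for s in i.sources)]

# The subtract (older in program order) waits on r3, which is not yet
# available; the younger add is ready and may issue first.
queue = [Instr("sub r5, r2, r3", {"r2", "r3"}),
         Instr("add r4, r0, r1", {"r0", "r1"})]
ready = issue_ready(queue, available_regs={"r0", "r1", "r2"})
```

Here only the add instruction is returned as ready, illustrating why the IS stage may issue it ahead of the older subtract.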
The instruction-execute section 18 includes one or more instruction-execution branches 30₁-30ₙ, which are each operable to execute a respective instruction in parallel with the other branches, and to retire instructions in parallel. For example, if the pipeline 10 is operable to simultaneously execute four instructions, then the pipeline may include four or more instruction-execution branches 30. Furthermore, each branch 30 may be dedicated to a particular type of instruction. For example, one branch 30 may be dedicated to executing instructions that call for mathematical operations (e.g., add, subtract, multiply, divide) to be performed on data, and another branch 30 may be dedicated to executing instructions (e.g., data load, data store) that call for access to cache or to other memory. Furthermore, each branch 30 may retire an executed instruction after all of the instructions that come before the executed instruction in the program order are also retired or ready to be retired. As part of retiring an instruction, a branch 30 removes the instruction from all of the queues in the EQ stage 26.
Still referring to
During a first cycle of the pipeline 10, the IF stage 20 fetches one or more instructions from a program-instruction memory (not shown in
During a next cycle of the pipeline 10, the ID stage 22 decodes the one or more instructions received from the IF stage 20.
During a next cycle of the pipeline 10, the RM stage 24 remaps the physical registers of the one or more decoded instructions received from the ID stage 22 as is appropriate.
During a next cycle of the pipeline 10, the EQ stage 26 receives and stores, in one or more queues, the one or more remapped instructions from the RM stage 24.
During a next cycle of the pipeline 10, the IS stage 28 issues one or more instructions from the EQ stage 26 to one or more respective instruction-execution branches 30.
During a next cycle of the pipeline 10, each instruction-execution branch 30 that receives a respective instruction from the IS stage 28 executes that instruction.
Then, during a subsequent cycle of the pipeline 10, each of the branches 30 that executed a respective instruction retires that instruction.
The above-described sequence generally repeats until the processor 8, e.g., stops running the program, takes a branch, or encounters a pipeline-flush condition.
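The stage-by-stage sequence described above can be sketched as a simple schedule. The stage names follow the text; the dictionary-based model, and the assumption of one new instruction entering per cycle with no stalls, are illustrative simplifications:

```python
# Illustrative schedule for the stage sequence described above (IF, ID, RM,
# EQ, IS, execute, retire), assuming one instruction enters per cycle and no
# stalls occur. The dict-based model is an assumption, not the embodiment.

STAGES = ["IF", "ID", "RM", "EQ", "IS", "EX", "CM"]

def schedule(num_instrs):
    """Map each instruction index to the cycle in which it occupies each stage."""
    return {i: {stage: i + c for c, stage in enumerate(STAGES, start=1)}
            for i in range(num_instrs)}

s = schedule(2)
# While instruction 0 is in the ID stage (cycle 2), instruction 1 is in IF,
# illustrating how the stages overlap across consecutive cycles.
```

This also shows why a full flush is costly: an instruction spends one cycle per stage, so refilling an emptied pipeline takes several cycles before any instruction can retire again.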
The EQ stage 26 includes the following five queues/buffers that may have any suitable lengths: an instruction-issue queue (ISQ) 40, a store-instruction queue (SQ) 42, a load-instruction queue (LQ) 44, a reorder buffer (ROB) 46, and a branch-instruction queue (BRQ) 48.
The ISQ 40 receives all of the instructions provided by the RM stage 24, and stores these instructions until they are issued by the IS stage 28 to one of the execution sections 30. As discussed above in conjunction with
The SQ 42 receives from the RM stage 24 only store instructions—a store instruction is an instruction that writes data to a memory location, such as a cache location—but holds these store instructions in the program order. The SQ 42 holds a store instruction until the store instruction is both executed and retired by the load/store execution section 30ₙ. The operation of an embodiment of the SQ 42 is further discussed below in conjunction with
The LQ 44 receives from the RM stage 24 only load instructions—a load instruction is an instruction that reads data from a memory location, such as a cache location, and then writes this data to another memory location, such as a physical register R of the processor 8—and stores these load instructions in the program order. The LQ 44 stores a load instruction until the load instruction is both executed and retired by the load/store execution section 30ₙ. The operation of an embodiment of the LQ 44 is further discussed below in conjunction with
The ROB 46 receives from the RM stage 24 all instructions, and stores these instructions in the program order. The ROB 46 stores an instruction until the instruction is both executed and retired by one of the execution sections 30. The operation of an embodiment of the ROB 46 is further discussed below in conjunction with
The BRQ 48 receives from the RM stage 24 only branch instructions—a branch instruction is an instruction that causes the program counter (not shown in
The load/store execution section 30ₙ includes an operand-address-generator (AG) stage 50, a data-access (DA) stage 52, a data-write-back (DW) stage 54, and an instruction-retire/commit (CM) stage 56. The load/store execution section 30ₙ executes only instructions that read data from or write data to a memory location. Therefore, in an embodiment, the load/store execution section 30ₙ executes only load and store instructions of the type that are stored in the LQ 44 and SQ 42, respectively.
The AG stage 50 receives a load or store instruction from the IS stage 28, and generates the physical address or addresses of the memory location or locations specified in the instruction. For example, a store instruction may specify writing data to a memory location, but the instruction may include only a relative address for the memory location. The AG stage 50 converts this relative address into an actual address, for example, the actual address of a cache location. And if the data to be written is obtained from another memory location specified in the instruction, then the AG stage 50 also generates the actual address for this other memory location in a similar manner. The AG stage 50 may use a memory-mapping look-up table (not shown in
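The relative-to-actual address conversion performed by the AG stage might be sketched as a base-plus-offset computation. The base-register model below is an assumption introduced for illustration; as the text notes, an actual embodiment may instead consult a memory-mapping look-up table:

```python
# Minimal sketch of operand-address generation: resolve a relative
# (base register + offset) operand into an actual memory address.
# The base-register dictionary is a hypothetical model, not the embodiment.

def generate_address(base_regs, base, offset):
    """Return the actual address for a base-register-relative operand."""
    return base_regs[base] + offset

base_regs = {"r7": 0x1000}                      # assumed register contents
addr = generate_address(base_regs, "r7", 0x24)  # actual address 0x1024
```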
The DA stage 52 accesses the destination memory location specified by a store instruction (using the actual address generated by the AG stage 50), and accesses the source memory location specified by a load instruction (also using the actual address generated by the AG stage). In a first example, suppose a store instruction specifies writing data D1 from a physical register R1 to a cache location C1 (D1, R1, and C1 not shown in
The DW stage 54 effectively ignores a store instruction, and performs the second operation (e.g., the “write-back” portion) of a load instruction. For example, although the DW stage 54 may receive a store instruction from the DA stage 52, it performs no operation relative to the store instruction except to provide the store instruction to the CM stage 56. For a load instruction, continuing the second example from the preceding paragraph, the DW stage 54 writes the data D2 from its temporary storage location to its destination, which is the memory location M1.
The CM stage 56 monitors the other execution sections 30₁-30ₙ₋₁, and retires a load or store instruction only when all of the instructions preceding the load or store instruction in the program order have been executed and retired. For example, suppose a load instruction is fifteenth in the program order. The CM stage 56 retires the load instruction only after the first through fourteenth instructions in the program have been executed and retired. Furthermore, as part of retiring an instruction, the CM stage 56 removes the instruction from all of the queues/buffers in the EQ stage 26 where the instruction was stored. The CM stage 56 may perform such removal by actually erasing the instruction from a queue/buffer, or by moving a head or tail pointer associated with the queue/buffer such that the instruction is in a portion of the queue/buffer where it will be overwritten by a subsequently received instruction.
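The pointer-based removal described above can be sketched with a circular buffer: retiring an entry simply advances the head pointer, leaving the retired slot to be overwritten later. The fixed-size buffer and the class below are illustrative assumptions consistent with the text, not the disclosed implementation:

```python
# Sketch of retiring by moving a head pointer instead of erasing entries:
# slots behind the head are dead and will be overwritten by later pushes.
# The circular-buffer model is an assumption, not the embodiment.

class Queue:
    def __init__(self, size):
        self.slots = [None] * size
        self.head = 0    # index of the oldest live entry
        self.tail = 0    # index of the next free slot

    def push(self, instr):
        self.slots[self.tail % len(self.slots)] = instr
        self.tail += 1

    def retire(self):
        """Remove the oldest entry by advancing the head pointer (no erase)."""
        instr = self.slots[self.head % len(self.slots)]
        self.head += 1
        return instr

    def live(self):
        """Entries between head and tail, in program order."""
        return [self.slots[i % len(self.slots)] for i in range(self.head, self.tail)]

q = Queue(4)
for name in ["store", "load", "mul"]:
    q.push(name)
q.retire()   # the oldest instruction ("store") retires first, in program order
```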
Referring to
Referring to block 60 of
Referring to block 62, the DA stage 52 stores (writes) a data value D2 into the memory location at M1.
Referring to block 64, the DA and DW stages 52 and 54 cooperate to load the contents (the data value D2 in this example) of the memory location at M1 into another memory location at an actual address M2. That is, the DA stage 52 reads D2 from the memory location at M1, and the DW stage 54 writes D2 into the memory location at M2. Therefore, after the load operation of block 64 is executed, the data value D2 is stored in the memory location at M2.
Referring to block 66, one of the execution sections 30₁-30ₙ₋₁ multiplies the contents (the data value D2 in this example) of the memory location at M2 by a data value D3. Therefore, the multiply operation of the block 66 generates a correct result, D2×D3, as shown in block 68.
Referring to
Referring to block 70 of
Referring to block 72, because the pipeline 10 executes the store and load instructions out of order, the DA and DW stages 52 and 54 cooperate to load the contents (the data value D1 in this example) of the memory location at M1 into the memory location at M2.
Referring to block 74, the DA stage 52 writes the data value D2 into the memory location at M2. But because this store instruction is executed after the load instruction, the DA and DW stages 52 and 54 do not load D2 into the memory location at M1 as indicated by the program.
Referring to block 76, one of the execution sections 30₁-30ₙ₋₁ multiplies the contents (the data value D1 in this example) of the memory location at M2 by a data value D3. Therefore, in this example, the multiply operation of the block 76 generates an incorrect result, D1×D3, as shown in block 78, instead of generating the correct result of D2×D3 per the block 68 of
Therefore, by executing load and store instructions out of program order, the pipeline 10 may generate an erroneous result.
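The two execution orders contrasted above can be sketched as follows. Memory is modeled as a dictionary and the data values and addresses (D1, D2, M1, M2) follow the text; the function itself is an illustrative assumption, not the embodiment:

```python
# Sketch of the store/load ordering hazard described above: the program
# stores D2 to M1 and then loads M1 into M2; executing the load before the
# older store leaves stale data D1 in M2. The dict memory is a model only.

def run(mem, order):
    """Execute the store (D2 -> M1) and the load (M1 -> M2) in the given order."""
    for op in order:
        if op == "store":
            mem["M1"] = "D2"
        elif op == "load":
            mem["M2"] = mem["M1"]
    return mem

in_order  = run({"M1": "D1"}, ["store", "load"])  # M2 receives D2: correct
reordered = run({"M1": "D1"}, ["load", "store"])  # M2 receives stale D1: incorrect
```

In the reordered case the subsequent multiply would consume D1 from M2 and produce D1×D3 rather than the correct D2×D3, matching blocks 76 and 78.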
Still referring to
In more detail, when the DA stage 52 executes a load instruction, it may “look back” at the SQ 42 and ISQ 40 to determine whether there are any unexecuted store instructions that come before the load instruction in the program order, and may look back to the AG stage 50 to determine whether there is a store instruction being executed concurrently with the load instruction. For example, referring to
If such a store instruction exists, then the DA stage 52 determines whether the actual memory address corresponding to the memory address specified by the store instruction has already been resolved, and, thus, is available. For example, the AG stage 50 may have resolved the actual memory address specified by the store instruction in conjunction with executing a prior load or store instruction involving the same memory address. For example, continuing the example from the preceding paragraph with reference to
If the actual memory address corresponding to the store instruction is available, then the DA stage 52 next determines whether this actual memory address is the same as the actual memory address corresponding to the load instruction. For example, continuing the example from the preceding paragraph, the DA stage 52 determines that the actual address M1 is specified by both the load and store instructions.
If the actual memory address corresponding to the store instruction is the same as the actual memory address corresponding to the load instruction, then the DA stage 52 may, in response to the load instruction, not read the data from the actual memory address, but instead read the data directly from the store instruction. For example, continuing the example from the preceding paragraph, instead of reading the incorrect data D1 from the location at M1 in response to the load instruction, the DA stage 52 reads the data D2 from the store instruction (or from the memory location where D2 is currently stored, this memory location being specified by the store instruction). Consequently, the pipeline 10 still generates the correct result of D2×D3 per block 68 of
Unfortunately, this technique may work only when the actual memory address corresponding to the store instruction is available to the DA stage 52 while the DA stage is executing a load instruction corresponding to the same address.
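The look-back and forwarding check described above, including its failure mode when the store address is unresolved, might be sketched as follows. The function name, the pending-store representation, and the use of `None` for an unresolved address are all assumptions introduced for illustration:

```python
# Sketch of the store-to-load check described above: if an older, unexecuted
# store targets the same resolved address as the load, the load takes its
# data from the store instruction rather than from memory. Names and the
# None-for-unresolved convention are assumptions, not the embodiment.

def execute_load(load_addr, mem, pending_stores):
    """pending_stores: older stores in program order as (resolved_addr, data);
    a None address means the store's actual address is not yet resolved."""
    for store_addr, data in pending_stores:
        if store_addr is None:
            return None          # unresolved address: cannot forward safely
        if store_addr == load_addr:
            return data          # same address: forward the store data
    return mem[load_addr]        # no matching older store: read from memory

mem = {"M1": "D1"}
forwarded = execute_load("M1", mem, pending_stores=[("M1", "D2")])
```

The `None` result in the unresolved case corresponds to the situation the text addresses next: the check cannot help, and the mis-speculation is only detected later, when the store finally executes.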
But if the actual memory address corresponding to the store instruction is unavailable (e.g., the actual address M1 corresponding to the store instruction is unavailable to the DA stage 52 at the time it is executing the load instruction corresponding to M1), then the processor may flush the entire pipeline 10 in response to the pipeline “realizing” that it has executed a store instruction relative to a memory location after it has executed a load instruction relative to the same memory location, where the load instruction comes after the store instruction in the program order. For example, when the DA stage 52 detects, in block 74, that it has executed the store instruction after it and the DW stage 54 have executed the load instruction in block 72, and detects that the actual address corresponding to the store instruction was not available at the time that the load instruction was executed in block 72, it may signal the processor 8 to flush the entire pipeline 10, to reload the program counter (not shown in
But flushing the entire pipeline 10 may increase the processing time required to execute the program, and may also increase the amount of energy that the processor consumes—the latter may be particularly undesirable in battery-powered devices.
Referring to
Referring to
Next, during the operating state of the pipeline 10 represented in
Referring to
Referring to
Still referring to
Referring to
Still referring to
Referring to
Referring to
In the next operating states one and two cycles after the operating state represented in
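At a high level, the partial flush walked through above replays the mis-speculative load and all younger instructions from the reorder buffer back into the issue queue, so the fetch-decode section keeps its contents. The list-based model and names below are assumptions for illustration, not the disclosed implementation:

```python
# Sketch of the partial flush described above: the issue queue is reloaded
# from the reorder buffer with the mis-speculative load and every younger
# instruction, leaving the fetch-decode section untouched. The list-based
# ROB model and names are assumptions, not the embodiment.

def partial_flush(rob, bad_load_index):
    """Return the replayed issue-queue contents: the mis-speculative load and
    all instructions after it in program order, taken from the ROB."""
    return list(rob[bad_load_index:])

rob = ["store D2 -> M1", "load M1 -> M2", "mul M2 x D3"]
isq = partial_flush(rob, bad_load_index=1)
```

Because only the issue queue is refilled, instructions already fetched and decoded need not be fetched again, which is the source of the time and energy savings relative to a full flush.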
The system 60 includes computing circuitry 62, which, in addition to the processor 8, includes a memory 64 coupled to the processor, and the system also includes an input device 66, an output device 68, and a data-storage device 70.
The processor 8 may process data in response to program instructions stored in the memory 64, and may also store data to the memory and load data from the memory, or may load data from one location of the memory to another location of the memory. In addition, the processor 8 may perform any functions that a processor or controller may perform.
The memory 64 may be on the same die as, or on a different die relative to, the processor 8, and may store program instructions or data as discussed above. Where disposed on the same die as the processor 8, the memory 64 may be a cache memory. Furthermore, the memory 64 may be a non-volatile memory, a volatile memory, or may include both non-volatile and volatile memory cells.
The input device (e.g., keyboard, mouse) 66 allows, e.g., a human operator to provide data, programming, and commands to the computing circuitry 62.
The output device (e.g., display, printer, speaker) 68 allows the computing circuitry 62 to provide data in a form perceivable by, e.g., a human operator.
And the data-storage device (e.g., flash drive, hard disk drive, RAM, optical drive) 70 allows for the non-volatile storage of, e.g., programs and data.
From the foregoing it will be appreciated that, although specific embodiments have been described herein for purposes of illustration, various modifications may be made without deviating from the spirit and scope of the disclosure. Furthermore, where an alternative is disclosed for a particular embodiment, this alternative may also apply to other embodiments even if not specifically stated.
Number | Date | Country | Kind |
---|---|---|---|
201010624755.0 | Dec 2010 | CN | national |