Embodiments of the invention relate to microprocessor architecture. More specifically, at least one embodiment of the invention relates to reducing latency within a microprocessor.
“Pipelining” is a term used to describe a technique in processors for performing various aspects of instructions concurrently (“in parallel”). A processor “pipeline” may consist of a sequence of logic circuits for performing tasks, such as decoding an instruction and performing micro-operations (“uops”) corresponding to one or more instructions. Typically, an instruction contains one or more uops, each of which is responsible for performing a sub-task of the instruction when executed. Multiple pipelines may be used within a microprocessor, such that a correspondingly greater number of instructions may be performed concurrently within the processor, thereby providing greater processor throughput.
In pipelining, a task associated with an instruction or instructions can be performed in several stages by a number of functional units within a number of pipeline stages. For example, a processor pipeline may include stages for performing tasks, such as fetching an instruction, decoding an instruction, executing an instruction, and storing the results of executing an instruction. In general, each pipeline stage may receive input information relating to an instruction, from which the pipeline stage can generate output information, which may serve as inputs to a subsequent pipeline stage. Accordingly, pipelining enables multiple operations associated with multiple instructions to be performed concurrently, thereby enabling improved processor performance, at least in some cases, over non-pipelined processor architectures.
In some prior art pipeline architectures, synchronization among the pipeline stages can be achieved by using a common clock signal for each pipeline. The frequency of the common clock signal may be set according to a critical path delay, including some safety margin. However, the critical path delay may not remain constant throughout the operation of the pipeline, due in part to variation in semiconductor manufacturing process parameters, device operating voltage, device temperature, and pipeline stage input values (PVTI). In order to account for PVTI variations, some prior art architectures set the common clock frequency to accommodate the worst-case critical path delay, which may result in a clock frequency slightly or significantly lower than the pipeline could otherwise sustain under more typical, less-than-worst-case conditions.
As semiconductor device sizes continue to shrink, PVTI-related variability, and the corresponding safety margins needed to accommodate the worst-case critical path delay, may increase. For example, in semiconductor process technology in which a minimum device dimension is below 90 nanometers (nm), PVTI variations may contribute substantially to the critical path delay between pipeline stages. However, the delay experienced by information propagating among the various pipeline stages is typically smaller than the worst-case critical path delay, due in part to the fact that worst-case PVTI conditions occur less frequently than less-than-worst-case PVTI conditions. Therefore, pipelined processing architectures in which a clock for synchronizing the pipeline stages is set according to a worst-case critical path delay may operate at relatively low performance levels.
Furthermore, prior art architectures in which the clock synchronizing the various pipeline stages is set according to a more common-case delay through the pipeline must typically operate two copies of the pipeline at half speed, with the two copies operating asynchronously with respect to each other. Unlike architectures that use the worst-case critical path delay as the basis for the common clock frequency, an input to a pipeline stage of one pipeline in such a “common-case clock” pipeline architecture does not typically depend upon the output of a previous pipeline stage of the other pipeline (i.e., there is typically no “bypass” from one stage to another). Therefore, the common-case clocked pipeline architecture may use two clocks of the same frequency, out of phase with each other, to synchronize the two pipelines, respectively. Moreover, common-case clock pipeline architectures typically incur greater cost in terms of die real estate and power consumption, as they require the processor pipeline to be duplicated.
The preferred embodiments of the invention will hereinafter be described in conjunction with the appended drawings provided to illustrate and not to limit the invention, wherein like designations denote like elements, and in which:
At least one embodiment of the invention relates to a processor having a number of pipeline stages and a technique for processing one or more operations prescribed by an instruction, instructions, or portion of an instruction within the processor using one or more processing pipelines having one or more pipeline stages. Advantageously, at least some embodiments of the invention can reduce latency of performing an operation within a processor pipeline.
Moreover, embodiments of the invention may reduce latency within one or more processing pipelines by exploiting the fact that a common-case delay of an instruction, instructions, or portion of an instruction in propagating among the stages of a processor pipeline is typically less than the corresponding worst-case critical path delay of the pipeline. In one embodiment of the invention, the frequency of the clock or clocks used to synchronize the pipeline stages may be set according to the worst-case critical path delay of a processing pipeline, while enabling stages of the pipeline to yield a correct result, or “output”, in less than a full period of the clock.
In at least one embodiment of the invention, a pipeline stage may speculatively generate an output result (“speculative output”) based on input information to the pipeline stage within one clock period. Furthermore, in at least one embodiment, a mis-speculated output of a pipeline stage may be corrected. In one embodiment, speculative processing in a pipeline stage may be performed by using intermediately generated output results (“intermediate outputs”) of the pipeline stage, which may be observed within one period, or “cycle”, of the clock signal, typically around half of a clock cycle.
Further at 106, the subsequent pipeline stage may re-process the most recent output of the first pipeline stage (e.g., the worst-case delay output), if an error is detected in the earlier intermediate output of the first stage.
In one embodiment, an error may be detected by comparing the most recent output of the first stage to the earlier intermediate output provided to the subsequent pipeline stage for speculative processing. If the most recent output and the intermediate output of the first stage do not match, an error is detected. If an error is detected, the error is corrected, in one embodiment, by providing the most recent output of the first stage, which is expected to be correct, to the input of the subsequent stage. In one embodiment, the most recent output of the first stage may be stored to compare with subsequent outputs of the first stage. Operation 106 may be performed a number of times for a number of intermediate outputs of the first stage. However, in one embodiment, the operation described in 106 is performed only until an output is received by the subsequent stage that is deemed to be the correct output (e.g., the worst-case delay output).
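For illustration only, and not as part of any claimed embodiment, the compare-and-correct behavior of operation 106 may be sketched as a software model; the function name and values below are hypothetical:

```python
# Hypothetical model of operation 106: 'intermediate' is the speculative
# output of the first stage provided early to the subsequent stage;
# 'final' is the most recent (worst-case delay) output of the first stage.

def resolve_stage_output(intermediate, final):
    """Return (output_for_subsequent_stage, error_detected)."""
    if intermediate != final:
        # Mismatch: the speculation was wrong. Correct the error by
        # providing the worst-case delay output, expected to be correct.
        return final, True
    return intermediate, False

# Speculation succeeded: the intermediate output is kept, no error.
assert resolve_stage_output(0x1A, 0x1A) == (0x1A, False)
# Speculation failed: the error is detected and the final output is used.
assert resolve_stage_output(0x1A, 0x1B) == (0x1B, True)
```

As in the text, the comparison is repeated only until an output deemed correct is received by the subsequent stage.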
Some embodiments of the invention described herein relate to a multiple instruction issue, in-order pipeline architecture. In one embodiment, in particular, an in-order pipeline architecture has five stages: a fetch stage, a decode stage, an execute stage, a memory access stage, and a memory writeback stage. However, other embodiments of the invention may also be used in other processor architectures, such as those using an out-of-order processing pipeline, in which instructions or uops are executed out of program order.
Various implementations of the embodiment described in conjunction with
The first and second latches may store a logical value presented to the latch inputs with enough setup and hold time to be latched by a clock signal. Furthermore, the first and second latches may output a logical value when triggered by a clock signal and thereafter maintain the value for a subsequent circuit to receive, until a new value is presented to the latch with enough setup and hold time to be latched by a clock signal. In one embodiment of the invention, the latches are triggered by a rising edge of a clock signal, such as the clock signal shown in
In one embodiment, the first storage circuit 210 stores the output of the processing logic and provides the output to a subsequent pipeline stage so that the subsequent pipeline stage may speculatively process the output of the processing logic. The second storage circuit 212 may store the most recent output of the processing logic, which in some embodiments may correspond to the correct output (e.g., worst-case delay output).
In one embodiment, error detection logic 214 compares the values stored in first storage circuit 210 and second storage circuit 212 in order to detect the occurrence of an error in the output of the pipeline stage. Error detection logic 214 may also provide an error signal (not shown) to selection logic 208. Accordingly, while no error is detected in the output of the pipeline stage, selection logic 208 provides the output of processing logic 204 to first storage circuit 210. However, if an error in the output of the pipeline stage is detected, selection logic 208 provides the value stored in second storage circuit 212 to first storage circuit 210, in one embodiment.
In one embodiment of the invention, pipeline stage 200 uses clock signals CK1 and CK2 to synchronize the various latches illustrated in
In one embodiment, input logic 202, first storage circuit 210, and second storage circuit 212 are triggered on the rising edge of a clock signal. In other embodiments, any of the input logic, first storage circuit, and second storage circuit may be triggered by the falling edge of a clock signal. In one embodiment, input logic 202 provides the input to processing logic 204 with enough setup and hold time to be latched on a first rising edge of CK1 (denoted by CK11). Processing logic 204 may process the input to produce a correct output before the second rising edge of CK1 (denoted by CK12). First storage circuit 210 stores an intermediate output of processing logic 204 when triggered by a rising edge of CK2 (denoted by CK21) that succeeds CK11. The intermediate output is provided to the subsequent pipeline stage in the pipeline array for further processing. However, the intermediate output is a speculative output that may be determined to be incorrect. Second storage circuit 212 stores the output of processing logic 204 that is expected to be correct (e.g., the worst-case delay output) when second storage circuit 212 is triggered by CK12. In one embodiment, error detection logic 214 compares the intermediate output stored in first storage circuit 210 with the expected correct output stored in second storage circuit 212 to detect the occurrence of an error in the generation of the intermediate output by processing logic 204. If no error is detected, the error signal may be set to a value that causes selection logic 208 to continue to provide the output of processing logic 204 to first storage circuit 210. On the other hand, if an error is detected by error detection logic 214, the error signal may be set to instruct selection logic 208 to provide the expected correct output stored in second storage circuit 212 to first storage circuit 210.
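For illustration only, the clock-edge behavior of pipeline stage 200 described above may be modeled in software. The class below is a hypothetical sketch, not a claimed implementation; attribute names follow the reference numerals in the text:

```python
# Illustrative model of pipeline stage 200: first storage circuit 210
# latches a speculative output on the CK2 edge (mid-cycle), second
# storage circuit 212 latches the settled output on the next CK1 edge,
# and error detection logic 214 compares the two.

class SpeculativeStage:
    def __init__(self):
        self.first = None    # first storage circuit 210 (speculative output)
        self.second = None   # second storage circuit 212 (settled output)

    def ck2_edge(self, intermediate):
        # CK21: latch whatever processing logic 204 has produced so far.
        self.first = intermediate
        return self.first    # forwarded to the subsequent pipeline stage

    def ck1_edge(self, settled):
        # CK12: latch the worst-case delay output of processing logic 204.
        self.second = settled
        error = (self.first != self.second)  # error detection logic 214
        if error:
            # Selection logic 208 routes the settled value into 210.
            self.first = self.second
        return error

stage = SpeculativeStage()
stage.ck2_edge(7)
assert not stage.ck1_edge(7)      # speculation was correct: no error
stage.ck2_edge(3)
assert stage.ck1_edge(5)          # mismatch: error flagged
assert stage.first == 5           # circuit 210 now holds the corrected value
```

In the hardware described, the correction also stalls the pipeline for a full cycle, which this simplified model does not represent.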
In one embodiment, the error signal also causes the processing pipeline to stall in order to recover from the error. In one embodiment, the pipeline is stalled for a full cycle, allowing the speculatively generated intermediate value to be removed (“squashed”) from the pipeline, including from the processing logic and storage circuits, and the expected correct value to be delivered to the appropriate pipeline stage. At the second rising edge of CK2 (denoted by CK22), the expected correct value is stored in first storage circuit 210 and provided to the subsequent pipeline stage for processing. After the expected correct output is stored in first storage circuit 210, error detection logic 214 ceases to detect the error resulting from the mis-speculated intermediate output, and the processing pipeline may resume operation.
Although embodiments discussed in reference to
Pipeline stage 200 described above may double the processing throughput of the stage, in some embodiments of the invention, by using two clocks differing in phase by 180 degrees. In other embodiments, pipeline stage 200 achieves even greater throughput by decreasing the phase difference between the clocks or by using more clocks shifted in phase by smaller amounts. For example, in one embodiment, the throughput is quadrupled when CK1 and CK2 differ in phase by 90 degrees. In this case, the intermediate output can be provided to the next pipeline stage for speculative processing in one-fourth of the clock period of CK1 or CK2, although the expected correct output (e.g., worst-case delay output) may be available only after the full clock cycle. Therefore, pipeline stage 200 may operate at four times the throughput when there are no errors in the intermediate outputs. If an error occurs, pipeline stage 200 may be stalled for a full cycle, as described earlier.
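As an illustration only, the relationship between clock phase offset and error-free throughput described above may be sketched as follows; the function name is hypothetical, and the model assumes an error-free stream (stall penalties are ignored):

```python
# Hypothetical throughput model: with a phase offset of P degrees between
# CK1 and CK2, a speculative output becomes available every P/360 of a
# clock cycle, so error-free throughput scales by 360/P.

def speedup(phase_degrees):
    """Error-free throughput multiplier for a given clock phase offset."""
    return 360 / phase_degrees

assert speedup(180) == 2.0   # half-cycle speculation: throughput doubles
assert speedup(90) == 4.0    # quarter-cycle speculation: throughput quadruples
```

This matches the text's examples: a 180-degree offset doubles throughput, and a 90-degree offset quadruples it.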
Embodiments previously described may reduce pipeline latency and increase the throughput of the pipeline. Furthermore, in embodiments previously described, errors in pipeline stage output due to delays within the pipeline stages being greater than some common-case delay may be detected and corrected. Other subsequent pipeline stages may be coupled to pipeline stage 200 and the techniques previously described may be extended to the other subsequent pipeline stages, such that the same benefits described above may be achieved for the other subsequent pipeline stages.
For example,
If no errors occur (i.e., the value latched in R1S at CK12 is equal to the value latched in R1 at CK21), then I2/21 is latched in R2S at CK22, and I1/22 is latched in R1 at CK22. However, if an error occurs (i.e., the value latched in R1S at CK12 does not equal the value latched in R1 at CK21), the error is detected and corrected by stalling the pipeline by a full clock cycle, such that I1/21 may be latched in R1 at CK22.
The operation of each pipeline stage of
However, if an error occurs in the output of processing logic 204, the expected correct output stored in second storage circuit 212 of first pipeline stage 702 is passed to input logic 202 of second pipeline stage 704 through second selection logic 708 of the pipeline array. First selection logic 706 may operate in a similar manner, enabling pipeline array 700 to function as described earlier. Further, third selection logic 710 can select any one of the outputs from among the outputs of all the storage circuits of
In one embodiment, the third selection logic, illustrated in
In some embodiments of the invention, a pipeline or pipeline array may operate without using selection logic 208, second storage circuit 212, or error detection logic 214 if there is no phase difference between CK2 and CK1. Furthermore, in one embodiment, a pipeline may use arithmetic logic unit (ALU) result value loopback buses to provide the output of one stage to another, thereby enabling relatively expedient movement of data through the pipeline stages. In an embodiment of the invention, the number of errors in a pipeline array is monitored, and if the number of errors is found to be greater than a particular threshold number of errors, the pipeline array may be reconfigured to operate in a manner such that output data from each pipeline stage is latched after a worst-case delay through the stage logic. In an embodiment in which the pipeline or pipeline array is reconfigured to latch data after a worst-case delay, each reconfigured pipeline stage may comprise input logic 202 and first storage circuit 210, both of which are clocked by the same clock.
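The error-threshold reconfiguration described above may be sketched, for illustration only, as the following hypothetical software model (the class name, threshold value, and mode flag are not part of the specification):

```python
# Hypothetical monitor: count detected errors and, once a threshold is
# exceeded, reconfigure the pipeline to latch each stage's output only
# after the worst-case delay (i.e., disable speculation).

class PipelineMonitor:
    def __init__(self, threshold):
        self.threshold = threshold
        self.errors = 0
        self.speculative = True   # start in common-case (speculative) mode

    def record(self, error_detected):
        if error_detected:
            self.errors += 1
        if self.speculative and self.errors > self.threshold:
            # Reconfigure: each stage now uses only input logic 202 and
            # first storage circuit 210, clocked by the same clock.
            self.speculative = False

mon = PipelineMonitor(threshold=2)
for err in (True, False, True, True):
    mon.record(err)
assert mon.speculative is False   # threshold exceeded: worst-case mode
```

A design choice left open by the text, and by this sketch, is whether the error count is ever reset, e.g., per time window.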
For the sake of illustration, only two stages are shown in pipeline 400 and pipeline array 700. In general, however, the number of stages may be higher, depending on the number of instructions to be executed simultaneously or other considerations. Further, both pipeline 400 and pipeline array 700 make use of two clocks in one embodiment. However, the number of clocks may be higher, depending on the desired pipeline throughput. In an embodiment of the invention, the throughput through each pipeline stage is up to four times the clock frequency.
In the embodiment illustrated in
For example, an intermediate output may be stored in first storage circuit 210 of second pipeline stage 804 at the triggering edge of CK3. The intermediate output may also be provided as input to input logic 202 of third pipeline stage 806 at the triggering edge of CK3. The intermediate output is provided by selection logic 814. In one embodiment, instructions are bypassed to a subsequent stage every one-fourth of a clock cycle if no errors occur, and the throughput is quadrupled. If an error occurs, the pipeline may be stalled for three clock cycles at four times the clock frequency, or until the error is resolved.
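Assuming four clocks offset by 90 degrees, an illustrative reading of the arrangement above, the error-free forwarding schedule may be sketched as follows (the function name is hypothetical):

```python
# Hypothetical error-free schedule for a four-clock (90-degree offset)
# pipeline: a speculative result is forwarded to the subsequent stage
# at every quarter-cycle clock edge.

def forwarding_times(n_results, phase_fraction=0.25):
    """Times, in cycles of the base clock, at which successive
    speculative results are handed to the subsequent stage."""
    return [i * phase_fraction for i in range(n_results)]

assert forwarding_times(5) == [0.0, 0.25, 0.5, 0.75, 1.0]
```

In the error-free case this yields four results per base clock cycle, consistent with the quadrupled throughput described above; on an error, the text indicates the affected results are squashed and the schedule resumes after the stall.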
Although various embodiments of the invention have been described with respect to two and four storage circuits, the number of storage circuits, clocked by successively phase-delayed clock signals, can vary depending on the difference between the common-case delay and the worst-case delay.
Embodiments of the invention may reduce latency in one or more processor pipelines. Furthermore, throughput of a pipeline stage may be increased by varying the number of clocks in some embodiments. In at least one embodiment, errors in a speculative pipeline stage output due to worst-case delays through a processing stage or processing stage delays otherwise greater than a more common-case delay may be detected and subsequently corrected by using a worst-case delay output from the erroneous stage.
Some embodiments of the invention may be implemented in hardware logic, such as a microprocessor, application-specific integrated circuits, programmable logic devices, field-programmable gate arrays, printed circuit boards, or other circuits. Furthermore, various components in various embodiments of the invention may be coupled in various ways, including through a hardware interconnect or via a wireless interconnect, such as a radio frequency carrier wave or other wireless means.
Further, at least some aspects of some embodiments of the invention may be implemented by using software or some combination of software and hardware. In one embodiment, the software may include a machine-readable medium having stored thereon a set of instructions which, if performed by a machine such as a processor, cause the machine to perform a method comprising operations commensurate with an embodiment of the invention.
While the various embodiments of the invention have been illustrated and described, it will be clear that the invention is not limited to these embodiments only. Numerous modifications, changes, variations, substitutions and equivalents will be apparent to those skilled in the art without departing from the spirit and scope of the invention as described in the claims.