Computing devices may include one or more processors to execute instructions of software and/or firmware. Such processors commonly include a pipeline to execute a single instruction in a series of pipeline stages. Each stage may perform a separate sub-operation during the execution of a given instruction. Due to the division of labor across the series of stages, the processor may execute several instructions simultaneously with each instruction being processed by a different stage. The stages may be driven by a clock signal in order to control the flow of an instruction from one stage to the next stage of the pipeline. Further, each stage of the pipeline consumes substantial power due to synchronous logic of the stages being clocked by the clock signal.
The invention described herein is illustrated by way of example and not by way of limitation in the accompanying figures. For simplicity and clarity of illustration, elements illustrated in the figures are not necessarily drawn to scale. For example, the dimensions of some elements may be exaggerated relative to other elements for clarity. Further, where considered appropriate, reference labels have been repeated among the figures to indicate corresponding or analogous elements.
The following description describes operating pipeline stages of a processor in a manner that attempts to reduce power consumption. In the following description, numerous specific details such as logic implementations, resource partitioning/sharing/duplication implementations, types and interrelationships of system components, and logic partitioning/integration choices are set forth in order to provide a more thorough understanding of the present invention. However, one skilled in the art will appreciate that the invention may be practiced without such specific details. In other instances, control structures, gate level circuits, and full software instruction sequences have not been shown in detail in order not to obscure the invention. The included descriptions are submitted to be sufficient to enable those of ordinary skill in the art to implement appropriate functionality without undue experimentation.
References in the specification to “one embodiment”, “an embodiment”, “an example embodiment”, and other similar phrases indicate that the embodiment described may include a particular feature, structure, or characteristic, but every embodiment may not necessarily include the particular feature, structure, or characteristic. Moreover, such phrases are not necessarily referring to the same embodiment. Further, when a particular feature, structure, or characteristic is described in connection with an embodiment, it is submitted that it is within the knowledge of one skilled in the art to effect such feature, structure, or characteristic in connection with other embodiments whether or not explicitly described.
Embodiments of the invention may be implemented in hardware, firmware, software, or any combination thereof. Embodiments of the invention may also be implemented as instructions stored on a machine-readable medium, which may be read and executed by one or more processors. A machine-readable medium may include any mechanism for storing or transmitting information in a form readable by a machine (e.g., a computing device). For example, a machine-readable medium may include read only memory (ROM); random access memory (RAM); magnetic disk storage media; optical storage media; flash memory devices; electrical, optical, acoustical or other forms of propagated signals (e.g., carrier waves, infrared signals, digital signals, etc.), and others. Further, firmware, software, routines, instructions may be described herein as performing certain actions. However, it should be appreciated that such descriptions are merely for convenience and that such actions in fact result from computing devices, processors, controllers, or other devices executing the firmware, software, routines, instructions, etc.
The following description may refer to various signals as being asserted or de-asserted to indicate at least two distinct states of the respective signal. Whether a particular signal is asserted or de-asserted via a high signal, a low signal, a positive differential signal, a negative differential signal, or some other signaling technique is implementation dependent. An embodiment may use one or more of these signaling techniques to assert and de-assert various signals.
The following description may reference similar components using a reference label and subscript (e.g. REFSUB). When referring to a specific component of the similar components, a reference label with a numeric subscript (e.g. REF1) will generally be used. A group of similar components that may include a variable number of members may be identified with a list of reference labels having numeric subscripts and a last reference label having an alphabetic subscript to represent the variable number (e.g. REF1, REF2 . . . REFX). Finally, for brevity purposes, the reference label (REF) alone associated with similar components may be used to generally refer to such similar components as a whole or may be used to generally refer to a component of the similar components where pointing out a specific component does not aid in understanding. However, such designations are merely to aid the description and are not meant to limit the scope of the appended claims. Embodiments may have multiple components of a component described in the singular, only a single component of components described in the plural, and may not include some components whether described in the singular or plural.
An embodiment of a computing device 100 such as for example, a network router, network switch, a laptop computer system, a desktop computer system, a server computer system, a set-top device, a hand phone, a hand-held computing device, or other similar device is illustrated in
The network interface 130 may provide an interface between the computing device 100 and a network to facilitate data communication between the computing device 100 and other devices coupled to the network. In particular, the network interface 130 may comprise analog circuitry, digital circuitry, antennae, and/or other components that provide physical, electrical, and protocol interfaces to transfer packets between the computing device 100 and a wired and/or wireless network.
The memory 140 may comprise dynamic random access memory (DRAM), static random access memory (SRAM), read only memory (ROM), flash memory, and/or other types of memory devices. The memory 140 may store instructions and data to be executed and processed by the processor 150. In particular, the memory 140 may store multi-threaded applications, operating systems, services, and/or other multi-threaded software. The memory 140 may further store single-threaded applications, operating systems, services, and/or other single-threaded software.
The processor 150 may comprise one or more pipelines 160 to process instructions. For example, the processor 150 may comprise an Intel® IXP2400 network processor, an Intel® Pentium® 4 processor, an Intel® Itanium® 2 processor, an Intel® Xeon® processor, an NVIDIA® GeForce™ graphics processor, and/or some other type of pipelined processor. The pipeline 160 may execute or process a single instruction in a series of pipeline stages 1700, 1701 . . . 170N such as 5 stages, 10 stages, 20 stages, or some other implementation dependent number of stages. Each stage 170 may perform a separate sub-operation during the execution of a given instruction. For example, an instruction may pass through a fetch instruction phase, an instruction decode phase, a fetch operands phase, an execution phase, and a write data phase where each phase may be implemented by one or more of stages 170 of the pipeline 160.
Due to the division of labor across the series of stages 1700, 1701 . . . 170N, the processor 150 may execute several instructions simultaneously with each instruction being processed by a different stage 170. The stages 170 may be driven by a clock signal clk of the oscillator 120 or a gated clock signal gclk derived from the clock signal of the oscillator 120 in order to control the flow of an instruction from one stage 170X to the next stage 170X+1. Due to interdependencies between stages 170, the frequency of the clock signal may be based upon the stage 170 having the longest execution time to ensure each stage 170 completes its phase of an instruction before processing its phase of the next instruction in the pipeline 160.
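The relationship between the slowest stage and the clock frequency may be illustrated with a minimal sketch. The per-stage latencies below are hypothetical values chosen only for illustration, and the sketch ignores latch setup time and clock skew, which a real design would also have to budget for.

```python
# Hypothetical worst-case latency of each pipeline stage, in nanoseconds.
# These numbers are illustrative assumptions, not values from the description.
stage_latency_ns = [0.8, 1.0, 1.25, 0.9, 0.7]


def max_clock_frequency_mhz(latencies_ns):
    """Return the highest clock frequency (MHz) at which the slowest
    stage still completes its phase within one clock period."""
    period_ns = max(latencies_ns)  # the slowest stage sets the clock period
    return 1000.0 / period_ns      # 1/ns is GHz; scale by 1000 to get MHz


print(max_clock_frequency_mhz(stage_latency_ns))  # 800.0
```

With these assumed latencies, the 1.25 ns stage limits the whole pipeline to 800 MHz even though the other stages could run faster.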
Further, the stages 170 may generate signals and update values of various registers in response to processing instructions. In particular, the stages 170 may assert a kill signal k to flush partially executed instructions from the pipeline 160. For example, an execution stage 170 may assert the kill signal k in response to determining to branch to another address and/or in response to determining that the destination of a branch was mispredicted. Other components may also assert the kill signal k. Further, the kill signal k may be asserted to flush the pipeline 160 in response to other stimuli such as execution of other instructions or receipt of various interrupt and/or control signals.
The stages 170 may also assert an idle signal id to indicate an idle condition of the pipeline 160. For example, the stages 170 in one embodiment may assert the idle signal id in response to a swap instruction that causes the processor 150 to change to another thread of instructions at a time when no other thread is ready to be executed. Other components may also assert the idle signal id. Further, the idle signal id may be asserted in response to other stimuli such as execution of other instructions or receipt of various interrupt and/or control signals.
Pseudo code that introduces a “bubble” into the pipeline 160 due to a branch in a thread of instructions is depicted in
In clock cycle T3, an execution stage 1703 may receive and execute the branch instruction that was loaded in clock cycle T0. In response to processing the branch instruction, the execution stage 1703 may determine the current thread of execution is to branch to a multiply instruction at an address identified by label @NEW. As a result of such a determination, the execution stage 1703 may assert a kill signal and/or some other signals to inform the other stages 170 of the pipeline 160 that execution of the current thread is branching or jumping to an address identified by label @NEW. In response to assertion of the kill signal, the stages 1700, 1701 . . . 1702 preceding the execution stage 1703 flush to prevent the partially executed add, shift and add instructions of stages 1700, 1701, 1702 from completing. Since the flushed partially executed instructions occur after the branch instruction, proper execution of the thread dictates that such instructions only complete if the branch instruction determines not to branch to address @NEW.
As a result of branching to address @NEW, the fetch instruction stage 1700 loads the multiply instruction at address @NEW in clock cycle T4. However, due to flushing of the pipeline 160 in clock cycle T3, each of stages 1701, 1702, 1703, 1704 has no instruction to process and thus each is idle in clock cycle T4. Further, each of stages 1702, 1703, and 1704 is idle in clock cycle T5. In particular, the stages 170 of the pipeline 160 will not all be filled with an instruction to process until clock cycle T8 or possibly later. Even though some stages are idle, conventional processors continue to drive the synchronous logic of all stages 170 with a common clock signal, which causes the synchronous logic of both idle and non-idle stages 170 to consume power each time the logic is triggered by the clock signal. Accordingly, power may be conserved if idle pipeline stages such as stages 1701, 1702, 1703, 1704 in clock cycle T4 are gated from the clock signal until such time as the respective stage 170 has an instruction to process.
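The refill of the pipeline after the flush may be sketched as follows. This is a simplified model that assumes five stages, a refill that starts in clock cycle T4, and one additional stage receiving an instruction per cycle with no stalls; the names NUM_STAGES, REFILL_CYCLE, and idle_stages are hypothetical.

```python
NUM_STAGES = 5    # stages 0..4, as in the five-stage example above
REFILL_CYCLE = 4  # cycle T4: the fetch stage loads the instruction at @NEW


def idle_stages(cycle):
    """Return the indices of stages that have no instruction to process
    in the given clock cycle (valid for cycle >= REFILL_CYCLE)."""
    filled = cycle - REFILL_CYCLE + 1  # one more stage fills each cycle
    return [s for s in range(NUM_STAGES) if s >= filled]


for t in range(REFILL_CYCLE, 9):
    print(f"T{t}: idle stages {idle_stages(t)}")
```

Under these assumptions the model reports four idle stages in T4, three in T5, and none from T8 onward, for a total of ten stage-cycles in which an ungated clock would drive logic that has nothing to process.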
To gate pipeline stages 170 that have no instruction to execute from the clock signal of the oscillator 120, the processor 150 as depicted in
The decision logic 220 may comprise circuitry such as, for example, the depicted AND gate, OR gates, and latches of
The pipeline clock logic 230 may comprise circuitry such as, for example, the depicted AND gates and latches that respectively generate gated clock signals gclk0, gclk1, gclk2, and gclk3 for the pipeline stages 1700, 1701, 1702 and 1703. In particular, the pipeline clock logic 230 may receive the control signals ctrl and the clock signal clk. The pipeline clock logic 230 may gate the clock signal clk from each stage 170 having a corresponding asserted control signal ctrl and may permit the clock signal clk to drive each stage 170 having a corresponding de-asserted control signal ctrl.
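The gating relationship described above may be summarized as gclk = clk AND NOT ctrl for each stage. The following is a purely combinational sketch of that relationship; it omits the latches mentioned above, which in hardware would typically hold the enable steady while the clock is high to avoid glitches. The function name gated_clocks is hypothetical.

```python
def gated_clocks(clk, ctrl):
    """Return one gated clock level per stage.

    clk  -- current level of the clock signal (True = high)
    ctrl -- list of control signals, one per stage; an asserted (True)
            control signal gates the clock from that stage
    """
    return [clk and not c for c in ctrl]


# Stages 0 and 2 are gated; stages 1 and 3 still receive the clock.
print(gated_clocks(True, [True, False, True, False]))
# [False, True, False, True]
```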
In one embodiment, the decision logic 220 may determine to assert all the control signals ctrl while the kill signal k is asserted and may determine to sequentially de-assert each control signal ctrl in response to the kill signal k being de-asserted. As depicted in
In one embodiment, the gated clock logic 200 may further comprise a local clock logic 250 to generate the local clock signal lclk used to drive synchronous logic of the decision logic 220. The local clock logic 250 may generate the local clock signal lclk as a gated version of the clock signal clk. The local clock logic 250 may gate the clock signal clk in response to determining that the decision logic 220 may maintain the current state of control signals ctrl generated by the decision logic 220. Gating the clock signal clk from the decision logic 220 may reduce power consumption of the gated clock logic 200 by not driving synchronous circuitry of the decision logic 220 when the decision logic 220 maintains the current state of the control signals ctrl despite being driven by a clock signal.
Further, the local clock logic 250 may permit the clock signal clk to drive the decision logic 220 in response to determining that the decision logic 220 may change one or more control signals ctrl. In particular, the local clock logic 250 may determine that the decision logic 220 may change one or more control signals ctrl in response to (i) a new assertion of the kill signal k, or (ii) an indication that gating the clock signal clk in response to a previous assertion of the kill signal k has ceased.
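The interaction between the kill signal, the control signals, and the local clock may be sketched behaviorally. The model below is an assumption-laden illustration, not the depicted circuit: it assumes four stages, releases gated stages one per clock starting from the first stage, and enables the local clock exactly when the next state of the control signals would differ from the current state. The class and method names are hypothetical.

```python
class KillDecisionLogic:
    """Behavioral model: assert every control signal while kill is
    asserted, then de-assert them one per clock after kill is released."""

    def __init__(self, num_stages=4):
        self.ctrl = [False] * num_stages  # False = stage receives the clock

    def next_state(self, kill):
        if kill:
            return [True] * len(self.ctrl)  # gate every stage
        nxt = list(self.ctrl)
        if True in nxt:
            nxt[nxt.index(True)] = False    # release the earliest gated stage
        return nxt

    def local_clock_enable(self, kill):
        # The local clock is needed only if the state would change.
        return self.next_state(kill) != self.ctrl

    def tick(self, kill):
        if self.local_clock_enable(kill):
            self.ctrl = self.next_state(kill)
        return list(self.ctrl)


logic = KillDecisionLogic()
for kill in [True, False, False, False, False, False]:
    print(kill, logic.tick(kill), logic.local_clock_enable(False))
```

In this model the local clock stays disabled once every control signal has been de-asserted and no new kill arrives, which mirrors the power saving attributed above to gating the decision logic itself.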
Referring now to
As mentioned above, the processor 150 may comprise gated clock logic 180 to gate pipeline stages 170 that have no instruction to execute from the clock signal clk of the oscillator 120. Another embodiment of gated clock logic 180 is depicted in
The decision logic 620 may comprise circuitry such as, for example, the depicted AND gate and latches of
In one embodiment, the decision logic 620 may determine to sequentially assert each control signal ctrl in response to the idle signal id being asserted and may determine to sequentially de-assert each control signal ctrl in response to the idle signal id being de-asserted. As depicted in
A method of gating a clock signal from stages of a pipeline is depicted in
In block 830, the gated clock logic 180 may permit a clock signal clk to drive active stages 170 and may gate the clock signal clk from driving idle stages 170. In one embodiment, the pipeline clock logic 230 may receive control signals from the decision logic 220, 620. Further, the pipeline clock logic 230 may drive stages 170 associated with de-asserted control signals with the clock signal clk and may gate the clock signal clk from stages 170 associated with asserted control signals.
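The idle-driven variant of the method may be sketched as a chain of latches through which the idle signal propagates one stage per clock. This is an illustrative assumption about how sequential assertion and de-assertion might behave, using four stages and the convention stated earlier for the pipeline clock logic that an asserted control signal gates the clock; the names step_idle_controls and driven_stages are hypothetical.

```python
def step_idle_controls(ctrl, idle):
    """Advance the control signals by one clock: the idle signal enters
    at the first stage and earlier values move one stage down the chain."""
    return [idle] + ctrl[:-1]


def driven_stages(ctrl):
    """Return the indices of stages that still receive the clock
    (those whose control signal is de-asserted)."""
    return [i for i, gated in enumerate(ctrl) if not gated]


ctrl = [False] * 4
for idle in [True, True, False, False, False, False]:
    ctrl = step_idle_controls(ctrl, idle)
    print(f"idle={idle!s:5} ctrl={ctrl} driven={driven_stages(ctrl)}")
```

In this sketch the gated region follows the idle condition through the pipeline, so each stage is cut off from the clock only during the cycles in which it would otherwise have nothing to process.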
Certain features of the invention have been described with reference to example embodiments. However, the description is not intended to be construed in a limiting sense. Various modifications of the example embodiments, as well as other embodiments of the invention, which are apparent to persons skilled in the art to which the invention pertains are deemed to lie within the spirit and scope of the invention.