The invention relates generally to reducing power consumption in microprocessors, including both load-store architectures (i.e., RISC-based machines) and memory-oriented architectures (i.e., CISC-based machines). More specifically, the invention provides a technique and method for avoiding unnecessary read operations from a register file, thereby lowering the power dissipated by the microprocessor.
Many modern computing systems use a processor having a pipelined architecture to increase instruction throughput. In theory, a pipelined processor can complete one instruction per machine cycle when executing a well-ordered sequential instruction stream. Pipelined processors operate by breaking the execution of an instruction into several stages, each stage requiring one machine cycle to complete. In a typical system, an instruction may require many machine cycles to complete (e.g., fetch, decode, ALU operations, etc.). However, pipelined processors reduce the time needed to execute a sequence of instructions by initiating the processing of a second instruction before execution of the first instruction has completed. Consequently, multiple instructions can be in various stages of processing at any given time. Thus, the overall instruction execution latency of the system (which may be considered as the delay between the time a sequence of instructions is initiated and the time execution of the sequence is completed) can be significantly reduced, even though each individual instruction still passes through every stage.
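To illustrate how several instructions occupy different stages at once, the following Python sketch models an ideal, stall-free pipeline with one machine cycle per stage. The five stage names are an assumption. The description later names the instruction decode and register file read stage 107, the execute stage 111, the memory access stage 115 and the writeback stage 117. The fetch stage and the `occupancy` helper are hypothetical.

```python
from typing import Dict, List

# Assumed stage list: ID (107), EX (111), MEM (115) and WB (117) are named in
# the description; the IF (fetch) stage is an assumption.
STAGES = ["IF", "ID", "EX", "MEM", "WB"]


def occupancy(num_instructions: int) -> List[Dict[str, int]]:
    """Map each clock cycle to {stage: instruction index} for a stall-free pipeline.

    Instruction i enters the first stage in cycle i and occupies stage s in
    cycle i + s, so in any one cycle each stage holds a different instruction.
    """
    total_cycles = len(STAGES) + num_instructions - 1
    table = []
    for cycle in range(total_cycles):
        row = {}
        for s, name in enumerate(STAGES):
            instr = cycle - s
            if 0 <= instr < num_instructions:
                row[name] = instr
        table.append(row)
    return table


if __name__ == "__main__":
    for cycle, row in enumerate(occupancy(3)):
        print(cycle, row)
```

Under these assumptions, n instructions finish in 5 + (n - 1) cycles. By contrast, 5n cycles are needed if each instruction runs to completion before the next one begins. For example, 100 instructions take 104 cycles instead of 500.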
Most modern microprocessors use pipelined datapaths to allow higher clock frequencies and to prevent or reduce the number of pipeline stalls. As stated supra, the principle behind pipelining is to divide an instruction into several smaller operations and to execute each operation in a subsequent clock cycle on hardware dedicated to that sub-operation. Such a system may be modeled as a linear pipeline in which instructions flow through successive hardware units. A typical pipeline implements the following operations, each operation being performed by dedicated hardware:
Furthermore, in a pipeline, a result may be ready long before its instruction has reached the writeback stage 117. One way to increase execution speed through the pipeline is to incorporate a forwarding technique. A forwarding pipeline 200 of
The ID forward control unit 201A forwards data being written into the register file 109 by the writeback stage 117 directly to the outputs of the register file 109 when the register being read from the register file 109 is the same register that the writeback stage 117 is writing. The EX forward control unit 201B monitors readrega and readregb from the pipeline registers of the instruction decode and register file read stage 107. It also monitors write_addr from the memory access stage 115 and the writeback stage 117. From these signals it determines whether the instruction in the execute stage 111 reads a register written by the instruction in the memory access stage 115 or the writeback stage 117. If so, the result of the instruction in the memory access stage 115 or the writeback stage 117 is input to the ALU 113. The EX forward control unit 201B selects whether to use the values read from the register file 109 or the values forwarded from the memory access stage 115 or the writeback stage 117 by controlling the fwda and fwdb signals, which serve as the selectors for the two forwarding multiplexers 203.
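The forwarding decisions described above can be sketched in Python as follows. Several details are assumptions made for illustration, because the description does not specify them: the numeric encodings of fwda/fwdb, the function names, and the rule that the memory access stage 115 takes priority over the writeback stage 117 when both write the same register.

```python
from typing import Optional, Tuple

# Hypothetical encodings for the fwda/fwdb selectors of multiplexers 203.
FROM_REGFILE = 0  # use the value read from register file 109
FROM_MEM = 1      # use the result held in memory access stage 115
FROM_WB = 2       # use the result held in writeback stage 117


def id_forward(read_reg: int, regfile_value: int,
               wb_write_addr: Optional[int], wb_data: int) -> int:
    """ID forward control 201A: bypass writeback data to the register file
    output when the register being read is the one being written."""
    if wb_write_addr is not None and wb_write_addr == read_reg:
        return wb_data
    return regfile_value


def select_operand(read_reg: int, mem_write_addr: Optional[int],
                   wb_write_addr: Optional[int]) -> int:
    """Choose the source of one ALU operand for the instruction in stage 111.

    Assumption: stage 115 wins over stage 117 because it holds the younger
    result when both write the same register.
    """
    if mem_write_addr is not None and mem_write_addr == read_reg:
        return FROM_MEM
    if wb_write_addr is not None and wb_write_addr == read_reg:
        return FROM_WB
    return FROM_REGFILE


def ex_forward_control(readrega: int, readregb: int,
                       mem_write_addr: Optional[int],
                       wb_write_addr: Optional[int]) -> Tuple[int, int]:
    """EX forward control 201B: return the (fwda, fwdb) selectors."""
    return (select_operand(readrega, mem_write_addr, wb_write_addr),
            select_operand(readregb, mem_write_addr, wb_write_addr))


if __name__ == "__main__":
    # r3 is being produced in stage 115 and r4 in stage 117.
    print(ex_forward_control(3, 4, mem_write_addr=3, wb_write_addr=4))  # (1, 2)
    print(id_forward(5, regfile_value=10, wb_write_addr=5, wb_data=42))  # 42
```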
As forwarding pipelines grow deeper, many instructions obtain their operands by forwarding rather than by reading them from a register file. This follows from a sequential property of most programs: instructions produce data that are used by the instructions directly following them. The typical prior-art data forwarding scheme reads the register file for operands in every instruction decode cycle. This register read occurs regardless of whether data forwarding is possible, and even when the forwarded data, rather than the values read, will be used. Therefore, what is needed is a way to retain the benefits of forwarded operands while eliminating unnecessary register file reads and the additional power they consume.
An exemplary embodiment of the present invention includes a register file access method that reduces power consumption. In accordance with the exemplary embodiment, if one or more registers to be read out of the register file are written by instructions located further downstream in the pipeline, the register file read of each such forwardable register is not initiated. Rather, the forwarded register value is used directly.
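The sketch below gives a rough illustration of the savings by counting operand reads over a short instruction trace. It counts them twice: once for the prior-art scheme, which reads every source operand, and once for the approach above, which skips operands written by instructions still downstream in the pipeline. It assumes a stall-free five-stage pipeline in which the three preceding instructions (in the execute, memory access and writeback stages) can supply any operand by forwarding. It also ignores load-use hazards. `Instr`, `count_reads` and `FORWARD_WINDOW` are hypothetical names.

```python
from typing import NamedTuple, Optional, Sequence, Tuple

# Assumption: in a stall-free five-stage pipeline, an instruction in decode can
# obtain the results of the three preceding instructions (now in EX, MEM and
# WB) by forwarding instead of reading the register file.
FORWARD_WINDOW = 3


class Instr(NamedTuple):
    dest: Optional[int]    # destination register, or None if nothing is written
    srcs: Tuple[int, ...]  # source registers named by the instruction


def count_reads(trace: Sequence[Instr]) -> Tuple[int, int]:
    """Return (reads_always, reads_needed) over all source operands of a trace.

    reads_always: one read per source operand, as in the prior-art scheme.
    reads_needed: reads left after skipping operands written downstream.
    """
    always = needed = 0
    for i, ins in enumerate(trace):
        downstream = {trace[j].dest for j in range(max(0, i - FORWARD_WINDOW), i)
                      if trace[j].dest is not None}
        for reg in ins.srcs:
            always += 1
            if reg not in downstream:
                needed += 1
    return always, needed


if __name__ == "__main__":
    trace = [
        Instr(1, (2, 3)),  # r1 = r2 + r3
        Instr(4, (1, 2)),  # r4 = r1 + r2
        Instr(5, (4, 1)),  # r5 = r4 + r1
        Instr(6, (5, 7)),  # r6 = r5 + r7
    ]
    print(count_reads(trace))  # (8, 4)
```

In this made-up trace, half of the register file reads are avoidable. Real savings depend on the workload and on which stages can actually forward.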
The present invention is therefore a system and method for conserving power in a microprocessor pipeline. The system includes a register file read control unit configured to monitor one or more outputs from a control/decode unit of the pipeline and to monitor write addresses from one or more other stages of the pipeline. The system also includes one or more read inhibit units, each having an input, an output, and an enable terminal. The output of each read inhibit unit is coupled to a unique register port of a register file within the pipeline. The input of each read inhibit unit is coupled to the control/decode unit, and the enable terminal of each read inhibit unit is coupled to a unique output of the read control unit.
The method includes providing a read inhibit unit and a read control unit, the read inhibit unit being coupled to read the content of at least one register in a register file contained in the pipelined architecture. The read control unit provides a control signal to the read inhibit unit. Based on the control signal, a determination is made whether a register file read operation should occur. If a determination is made to read the content of the at least one register, an enabling signal is sent from the read control unit to the read inhibit unit. After the enabling signal is received, the content of the at least one register in the register file is read.
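The interaction between the read control unit and the read inhibit units described in the two preceding paragraphs can be modeled behaviorally as shown below. This is a sketch under stated assumptions, not the patented circuit. `rcu_enables` and `ReadInhibitUnit` are hypothetical names. A port that the decoded instruction does not use (address `None`) is assumed to be disabled, and the state-keeping element is modeled simply as a stored address.

```python
from typing import List, Optional, Sequence


def rcu_enables(read_addrs: Sequence[Optional[int]],
                downstream_write_addrs: Sequence[Optional[int]]) -> List[bool]:
    """Read control unit: compute one enable per register file read port.

    read_addrs come from the control/decode unit (None = port unused);
    downstream_write_addrs come from later pipeline stages (None = no write).
    A port is enabled only if it is used and its register is not written
    downstream (in which case the value will be forwarded instead).
    """
    pending = {w for w in downstream_write_addrs if w is not None}
    return [addr is not None and addr not in pending for addr in read_addrs]


class ReadInhibitUnit:
    """Read inhibit unit for one read port, with a state-keeping element."""

    def __init__(self) -> None:
        self.port_addr: Optional[int] = None  # address currently driven on the port
        self.reads = 0                        # register file reads actually performed

    def step(self, addr: Optional[int], enable: bool) -> Optional[int]:
        # When enabled, drive the new address and perform the read;
        # otherwise keep the old address so the read port does not toggle.
        if enable:
            self.port_addr = addr
            self.reads += 1
        return self.port_addr


if __name__ == "__main__":
    ria, rib = ReadInhibitUnit(), ReadInhibitUnit()
    # Decoded instruction reads r1 and r2; the memory access stage writes r1.
    en = rcu_enables([1, 2], downstream_write_addrs=[1, None])
    print(en)                                      # [False, True]
    print(ria.step(1, en[0]), rib.step(2, en[1]))  # None 2
```

Here only port b performs a read. The operand on port a is expected to arrive through the forwarding path instead.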
An exemplary embodiment of a pipeline 300 that does not require access to a register file in each clock cycle of
Most modern central processing units (CPUs) are implemented in CMOS logic, and most of the power dissipated in CMOS logic is drawn when a logic value toggles (i.e., from “1” to “0” or from “0” to “1”). A primary function of the read inhibit units ria 301, rib 303 is therefore to prevent logic inside the register file 109 from toggling when no read access is needed, so that the register file 109 draws minimal power. To prevent internal logic (not shown) of the register file 109 from toggling, the read inhibit units ria 301, rib 303 include a state-keeping element (discussed in more detail with respect to
The read inhibit units ria 301, rib 303 may be implemented in one of several ways, depending in part on how the register file 109 is implemented. In some register file implementations, the state-keeping element is built into a register file macro. In that case, the RCU 305 may control the state-keeping element in the register file macro directly, and the separate read inhibit units ria 301, rib 303 are not needed.
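Dynamic power in CMOS tracks the number of logic transitions, so the effect of the state-keeping element can be approximated by counting bit flips on a read-address bus. The sketch below is only a proxy. It assumes that toggles on the address inputs drive the internal toggling of the register file. The helpers `toggles` and `held_bus` and the initial bus value of 0 are hypothetical.

```python
from typing import List, Sequence


def toggles(bus_values: Sequence[int]) -> int:
    """Count bit transitions on a bus between consecutive cycles."""
    return sum(bin(a ^ b).count("1") for a, b in zip(bus_values, bus_values[1:]))


def held_bus(addresses: Sequence[int], enables: Sequence[bool],
             initial: int = 0) -> List[int]:
    """Bus values when a state-keeping element holds the last enabled address."""
    bus, current = [], initial
    for addr, en in zip(addresses, enables):
        if en:
            current = addr
        bus.append(current)
    return bus


if __name__ == "__main__":
    addrs = [1, 2, 5, 6, 7]
    enables = [True, False, False, True, True]
    print(toggles(addrs))                     # 8: address driven every cycle
    print(toggles(held_bus(addrs, enables)))  # 4: held while reads are inhibited
```

In this made-up sequence, holding the last enabled address halves the address-bus transitions. The internal savings in an actual register file would depend on its circuit design.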
A skilled artisan will recognize that other delays, both larger and smaller, may be used by replacing “clk” with a signal passed through one or more delay elements having different propagation delay times. Consequently, the read address “readregi” propagates to the register file 401 port only when “rix” is high and only during the last half period of the clock cycle. If “rix” is low, the level-sensitive latch 405 is locked (i.e., not enabled) and the inputs to the register file 401 are kept static. The register file 401 read port does not toggle in this case; thus, minimal power is consumed. In a specific exemplary embodiment, there is one RIU 403 per register file read port. The register file of
In another exemplary embodiment (not shown), a latch is built into the register file read port. In such cases, no latch is required in the RIU 403, and the RCU 305 controls the latch inside the register file 401 read port directly.
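The gating of the level-sensitive latch 405 in the RIU 403 can be simulated at half-clock-period granularity, as shown below. The sketch assumes that “clk” is high during the first half of each cycle and low during the second half. Under that assumption, gating the latch with “rix” AND NOT “clk” opens it only in the last half period. The class and function names are hypothetical. The same model applies if the latch is instead built into the register file read port.

```python
from typing import List, Sequence, Tuple


class LevelSensitiveLatch:
    """Transparent while its gate is high; holds its output while the gate is low."""

    def __init__(self, initial: int = 0) -> None:
        self.q = initial

    def evaluate(self, d: int, gate: bool) -> int:
        if gate:
            self.q = d
        return self.q


def simulate_riu(cycles: Sequence[Tuple[int, bool]], initial: int = 0) -> List[int]:
    """Return the register file port address after each half clock period.

    Each entry of cycles is (readreg, ri) for one clock cycle. Assumption: clk
    is high in the first half of a cycle and low in the second, so gating the
    latch with (ri and not clk) opens it only in the last half period.
    """
    latch = LevelSensitiveLatch(initial)
    port = []
    for readreg, ri in cycles:
        for clk in (True, False):  # first half, then second half of the cycle
            port.append(latch.evaluate(readreg, ri and not clk))
    return port


if __name__ == "__main__":
    # The second cycle is inhibited (ri low), so the port keeps address 3.
    print(simulate_riu([(3, True), (9, False), (5, True)]))  # [0, 3, 3, 3, 3, 5]
```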
In the foregoing specification, the present invention has been described with reference to specific embodiments thereof. It will, however, be evident to a skilled artisan that various modifications and changes can be made without departing from the broader spirit and scope of the invention as set forth in the appended claims. Skilled artisans will appreciate that, although the methods have been presented with reference to a specific architecture, a similar result may be achieved in various ways that remain within the scope of the described specification. For example, a skilled artisan will recognize other embodiments (not shown) in which it may be desirable to use an edge-triggered flip-flop rather than a level-sensitive latch. In such embodiments, the RCU 305 described supra may still be used with appropriate connections and delays. Given the complexity of an actual microprocessor pipeline, the specification and drawings are accordingly to be regarded in an illustrative rather than a restrictive sense.