1. Technical Field
The present invention relates generally to an improved data processing system and method. More specifically, the present invention provides an apparatus and method for dependency tracking and register file bypass controls using a scannable register file.
2. Description of Related Art
The basic structure of a conventional computer system includes one or more processing units connected to various input/output devices for the user interface (such as a display monitor, keyboard and graphical pointing device), a permanent memory device (such as a hard disk, or a floppy diskette) for storing the computer's operating system and user programs, and a temporary memory device (such as random access memory or RAM) that is used by the processor(s) in carrying out program instructions. The evolution of computer processor architectures has transitioned from the now widely-accepted reduced instruction set computing (RISC) configurations, to so-called superscalar computer architectures, wherein multiple and concurrently operable execution units within the processor are integrated through a plurality of registers and control mechanisms.
An illustrative embodiment of a conventional processing unit is shown in
BIU 30 is connected to an instruction cache 32 and a data cache 34. The output of instruction cache 32 is connected to a sequencer unit 36. In response to the particular instructions received from instruction cache 32, sequencer unit 36 outputs instructions to other execution circuitry of microprocessor 12, including six execution units, namely, a branch unit 38, a fixed-point unit A (FXUA) 40, a fixed-point unit B (FXUB) 42, a complex fixed-point unit (CFXU) 44, a load/store unit (LSU) 46, and a floating-point unit (FPU) 48.
The inputs of FXUA 40, FXUB 42, CFXU 44 and LSU 46 also receive source operand information from general-purpose registers (GPRs) 50 and fixed-point rename buffers 52. The outputs of FXUA 40, FXUB 42, CFXU 44 and LSU 46 send destination operand information for storage at selected entries in fixed-point rename buffers 52. CFXU 44 further has an input and an output connected to special-purpose registers (SPRs) 54 for receiving and sending source operand information and destination operand information, respectively. An input of FPU 48 receives source operand information from floating-point registers (FPRs) 56 and floating-point rename buffers 58. The output of FPU 48 sends destination operand information to selected entries in rename buffers 58.
Microprocessor 12 may include other registers, such as configuration registers, memory management registers, exception handling registers, and miscellaneous registers, which are not shown. Microprocessor 12 carries out program instructions from a user application or the operating system, by routing the instructions and data to the appropriate execution units, buffers and registers, and by sending the resulting output to the system memory device (RAM), or to some output device such as a display console.
A high-level schematic diagram of a typical general-purpose register 50 is further shown in
There are five read ports in this particular prior art GPR. Read ports 70a-70e (0 through 4) are accessed through read decoders 72a-72e (RD0_DEC through RD4_DEC), respectively. Select lines 74a-74e (rd0_sel<0:79> through rd4_sel<0:79>) for each decoder are generated as described for the write address decoders above. Read data for each port 76a-76e (rd0_data<0:63> through rd4_data<0:63>) follows the same format as the write data. The data to be read is driven by the content of the entry selected by the corresponding read select line.
In a microprocessor such as the one shown in
To support forwarding of results, a multiplexer is used to either select the register file output or select the forwarding data from stages six, seven and eight. A method to generate the multiplexer control efficiently is needed.
One possible solution is to compare the source register of an instruction against the destination target of the executing instruction. In a processor where an instruction has several sources, many instructions, or several possible stages of bypass, the amount of compares may be large and thus, the corresponding area on the chip used by the comparison circuitry may be large.
Another possible solution is to track the location of instructions as they flow through the pipeline. An array of master-slave flip-flops (MSFFs) can be used to store the location of the pipeline stage an instruction is currently executing in. Since instructions move to a different stage every cycle, every entry in the array needs to be updated to reflect this. By using an array of MSFF, the area is larger than a register file of the same size. This approach may also have a slow read path causing timing problems.
Yet another approach may be to use a register file and read each entry, update the entry, and then write it back into the array. A read-update-write path would take at least one extra cycle causing performance degradation. Therefore, it would be beneficial to have an improved apparatus and method for tracking dependency and providing register file bypass controls.
The present invention provides an apparatus and method for dependency tracking and register file bypass controls using a scannable register file, also referred to herein as a “target vector.” The “target vector” consists of a scannable register file array and stores data identifying the bypass control information of the instructions in the pipeline. The location of this bypass control information in the “target vector” is indicative of a position of the instruction within the execution pipeline.
Based on this position within the execution pipeline, a dependent instruction can determine whether or not it can bypass the remaining pipeline and obtain a result of the instruction for use by the dependent instruction. If the dependent instruction is not able to bypass the pipeline, the dependent instruction is halted in the pipeline while the instruction upon which it is dependent continues to progress through the pipeline. Once the instruction upon which the dependent instruction is dependent, is at a stage of the pipeline where the bypass is able to be performed, the halted instruction is allowed to proceed through the pipeline.
With the apparatus and method of the present invention, a scannable register file array, e.g., in GPR/FPRs, is provided and used to track the stage of any instruction in the execution unit. Every entry in the target vector is updated every cycle to stay synchronized with the instructions in the execution unit. By using existing circuitry present for scan, the cells in the register file array grow a minimal amount in order to facilitate the present invention. The present invention has the advantage of being smaller and faster than using master-slave flip-flops (MSFFs).
With a standard register file, the register file must read, and then update to keep the contents of the register file synchronized with the instructions in the execution unit. Since the data in the entries of the register file array need to be updated every cycle, a standard register file is unsuitable for this application.
To keep the register file array synchronized with the instructions in the execution unit, a right shift of all the data in each entry of the register file array occurs every cycle. The scan port of the register file array cells is used as the shift function. In known systems, the scan port is only used during register initialization or test mode but not during functional mode. The present invention uses the scan port during a functional mode to facilitate the shifting of the data in the register file array every cycle in order to maintain synchronization of the register file array with the instructions in the execution unit.
These and other features and advantages of the present invention will be described in, or will become apparent to those of ordinary skill in the art in view of, the following detailed description of the preferred embodiments.
The novel features believed characteristic of the invention are set forth in the appended claims. The invention itself, however, as well as a preferred mode of use, further objectives and advantages thereof, will best be understood by reference to the following detailed description of an illustrative embodiment when read in conjunction with the accompanying drawings, wherein:
The present invention provides a mechanism for using the scan port of register file array cells to shift data contents of the register file array cells during each instruction cycle of a pipelined processor. With such a mechanism the register file array is maintained synchronized with the instructions being processed by the pipelined processor.
Stage D1 is the instruction selection stage in which a determination is made as to whether a source is ready for the instruction. Stage E0 is the register file access stage in which the register file array is accessed based on read addresses from the source registers and bypass controls are provided to the output multiplexers of the register file array. Stages E1-E7 are the execute stages in which the data output from the register file array is processed by the instruction. Stage E8 is the write-back stage in which data is written back to the register file array.
As stated above, in stage D1, the bypass control values stored in the registers 412-416 are used by the source ready logic 420-424 to determine if a source is ready. The source ready logic checks if the output of the bypass controls equals “00000xxx” and if the destination in stage E0 does not match the source. Bits 5 to 7 of the bypass controls are passed to stage E0 for use as bypass controls by storing these bits in bypass control registers 430-434 which then provides these bits as input to the bypass multiplexers 470-474.
If the instruction is not ready, the instruction will stall the pipeline and stages D0 and D1 are held. When the pipeline is held, the data in the bypass control latches 430-434 of D1 are shifted right to stay synchronized with the corresponding instruction in the execution unit. For example, when an instruction is in E0, the instruction writes bit0 of the corresponding entry in the target vector and writes bit0 of the bypass controls of any dependent source.
The target vector tracks which stage an instruction is currently located in. For example, if there is an Add instruction with a destination target of register 1 and if the Add instruction is in E0, then bit0 (which corresponds to E0) of Row 1 (which corresponds to register 1) is set. If there is a younger instruction, such as a Subtract that has register 1 as a source, then the Subtract instruction must not issue because it is dependent upon the Add. The Subtract instruction must wait until the Add has progressed far enough down the pipeline such that the Add can bypass the result to the Subtract instruction. The bypass control bits of the Subtract instruction indicate which stage the Add is currently residing in. Thus, in the previous example where bit0 of the bypass controls of any dependent source are set is performed because if the Subtract instruction reads the target vector before the Add instruction has a chance to write the target vector, the Subtract instruction will not have the latest information about which registers are used by older instructions. Thus when the Add instruction is in E0 and a Subtract instruction is in D1, the Add instruction needs to set bit0 of the bypass controls to prevent the Subtract instruction from issuing.
During the register file access stage, E0, the contents of the register file array 460 are read based on the read addresses from sources 402-406 obtained via registers 440-454. The output data from the register file array 460 is provided to bypass multiplexers 470-474 which multiplex the data with data fed-back from stages E6, E7 and E8. Bypasses are selected with bypass control (bit 5) having the highest priority and bypass control (bit 7) having the lowest priority. Bypass control (bit 5) selects E6 data, bypass control (bit 6) selects E7 data, and bypass control (bit 7) selects E8 data. If all three bypass controls are inactive, the output of the register file array is selected.
The bypass multiplexers 470-474 output the source data corresponding to sources 402-406 as read from the register file array 460 or the bypass data as received from execution units 494-496 of stages E6-E8. The source data is stored in registers 480-484 prior to being provided to execution units 490-496. The execution units 490-496 perform instruction execution on the source data. The output of the execution units 494-496 is fed-back to the bypass multiplexers 470-474 for use in determining whether to bypass the data read from the register file array 460. Also, in stage E8, the stage E8 data is written back to the register file array 460.
It should be appreciated that the pipeline depicted in
The register file array 460 of the present invention uses the scan path to perform the shifting of the necessary fields to maintain the register file array 460 consistent with the instructions being executed by the execution units of the pipeline. The scan path is essentially a shift register by nature. Thus, the present invention leverages the power of this shift register with minimal extra logic. A master-slave flip-flop is essentially what is needed to perform the shift function. The scan path is composed of master-slave flip-flops, thus the need to add extra latches to the register file array 460 is eliminated by using the scan path to perform the shifting in accordance with the present invention.
The scan clocks of the present invention are controlled through gating logic to prevent shifting during a write operation or during scan mode. During regular operation, the c2 clock is permitted to run free and causes the cells to shift their data values via the scan input/scan output ports from one cell to the next in the register file array. The c1 clock is used to control writing of input values to the first cell of the register file array. Thus, data may be written only to the first register file array cell and is shifted into each of the other cells with each clock c2 cycle. In this way, the scan path through the cells of the register file array resemble the flow of an instruction through the pipeline and may be used to keep track of the instruction as it progresses through the pipeline.
The following description of the circuitry of an exemplary embodiment of the present invention permits the novel register file array implementation using a target vector according to the present invention. The novel register file array contains all the necessary circuitry to implement the requirements of the target vector which is used to keep track of the present location of the instructions in the pipeline and its bypass controls.
P-channel pass gate transistors Q5510 and Q6515 are turned on by the write word line wr_wl going low, passing true, i.e. 1 or “high,” and complement, i.e. 0 or “low,” into the cell storage nodes blb 525 and bl 520, respectively. Likewise, wr_wl_b going low turns on p-channel pass transistors Q7530 and Q8535 and the opposite of the previous write is written into the cell. That is bl node 520 is set low and blb node 525 is set high.
P-channel pass gates are chosen in the depicted implementation to take advantage of CMOS NAND gates as write word line drivers. However, other implementations of the present invention may make use of different logic elements which may be organized to achieve a similar purpose as the present invention without departing from the spirit and scope of the present invention.
The cell storage circuitry is made up from two back to back inverters consisting of p-channel device Q4521, n-channel device Q3522 and p-channel device Q2540, n-channel device Q1545 (which function like an inverter because n-channel QSC13550 and p-channel QSC14555 are on during functional operation).
The scan_in input 560 is connected to the input of the transmission gate consisting of n-channel QSC1565 and p-channel QSC2570. The gates of QSC1565 and QSC2 570 are connected to scan clocks scan_c1 and its complement scan_c1_b (only issued in scan mode). The output of the transmission gate is connected to the true storage node bl 520. The scan clocks are connected to p-channel QSC14555 and n-channel QSC13550. When scan clocks scan_c1 and its complement scan_c1_b are active, both QSC14555 and QSC13550 are off to block the feedback path through Q1545 and Q2540. Hence the scan operation functions like a single ended write into the register file cell.
The inverter consisting of Q40523 and Q41524 inverts the complement latch node blb 525, which then drives the true value (rd_data) into the gate of n-channel Q51575. The source of Q51575 is connected to the drain of n-channel Q52580. The gate of Q52580 is connected to the read word line (rd_wl) which goes high when the row this register file cell is located in is accessed by a read operation. The drain of Q52580 is connected to a standard domino bit-line (bl_cell), whose pre-charge device, half latch and output inverter are shown in
The output of the primary register file cell (read_data) is also connected to the transmission gate consisting of n-channel device QSC3590 and p-channel device QSC4591. The gates of QSC3590 and QSC4591 are connected to clocks c2 and its complement c2_n which are free-running. The output of the transmission gate (scan_data) is connected to the scan latch consisting of QSC5592 through QSC8596, QSC11597 and QSC12596. This collection of transistors functions like the master register file cell consisting of Q1545 through Q4521, QSC13550 and QSC14555, where the scan_c1 clocks (test only) are in phase with functional write word line wr_wl but out of phase with the c2 clock. The inverter consisting of p-channel QSC9598 and QSC10599 inverts and connects the scan latch to the scan_in input of the adjacent register file cell.
As shown in
The following description of the operation of the register file array shown in
With reference to
When data<0>is high and wr_wl<0> is high, the output of NAND0—1 670 is low. Consequently, wr_wl_b is high and a zero may not be written into CELL0—0 610. When c1 goes high the output of NAND0—0 671 goes low (wr_wl of CELL0—0 610), a one is written into CELL0—0 610.
The shift function is implemented with the shift clock steering logic (block CLK_CNTRL 680) which is shown in greater detail in
Referring again to
As described above, whenever c1 is high, data is written into the cell as described above. That is, when wr_wl<0> is high, a write operation writes the corresponding data_in<0> into CELL0—0 610. If no write operation is desired (wr_wl<0> is low) a default zero is written into CELL0—0 610 when c1 is active. Since the shift clock steering logic pulses c1_scan_c1 and c1_scan_c1_b high and low, respectively, for the duration of the c1 clock, data is shifted from CELL0—0 610 to CELL0—1 620 and from CELL0—1 620 to CELL0—2 630. Hence the data in CELL0—2 630 is overwritten. Simultaneously, new data is written into CELL0—0 610 as described above. Thus, this master/slave relationship between the cells in the register file array permits the data to be shifted from cell to cell in the register file array.
For detailed internal cell master slave operation, the description of
The read is performed during the c2 phase. Only representative read operations for CELL0—0 610 are described, however the other cells of the register file array operate in a similar manner. When rd_wl<0> goes high and c2 goes active (high) rd_wl signals of CELL0—0 610 through CELL0—2 630 go high. The drains of CELL0—0 610 and CELL1—0 640 are dotted together forming the bit-line for column 0. Likewise, bit lines are formed for each column between all cells 620 and 650, and 630 and 660, in the column. If a one is stored in CELL0—0 610, the bit line is pulled low and data_out<0> goes high. If a zero was stored in CELL0—0 610, the bit line remains at its pre-charge state with data_out<0> remaining low.
The bit line (bl_b) is pre-charged high, when c2 is low and p-channel QR1710 is on, all read operations are disabled. Half latch p-channel device QS1720 turns on when bl_b goes high as latch (lat) goes low inverted by the inverter formed by QF1730 and QF2740. After c2 goes high, the half latch maintains the high pre-charge state of the bit line until a cell with a one is read and the bit line is pulled low.
The inverter 810 receives the scan_enable signal and inverts the signal to generate a shift_enable signal which is also provided as an input to AND gate 830. The outputs from the AND gates 820 and 830 are provided to NOR gate 840. The output of NOR gate 840 is scan_c1_clk_b. The output of the NOR gate 840 is also provided to inverter 850 which inverts the scan_c1_clk_b signal to generate the scan_c1_clk signal.
With the circuitry of
When the scan_enable signal is high and the scan_c1 signal is high, the output of AND gate 820 is high. When the scan_enable signal is high and the scan_c1 signal is low, then the output of the AND gate 820 is low. Similalry, when the scan_enable signal is low and the scan_c1 signal is high, the output of AND gate 820 is low. When the scan_enable signal is low and the scan c1 signal is low, the output of the AND gate 820 is low.
With regard to the outputs of the AND gates 820 and 830, when the output of AND gate 820 is low and the output of AND gate 830 is high, or vice versa, the NOR gate 840 output is low. When the output of AND gate 820 is low and the output of AND gate 830 is low, the output of NOR gate 840 is high. The inverter 850 inverts the output of the NOR gate 840 to generate the scan_c1 _clk signal.
Using the circuitry shown in
With the circuitry described above, when an instruction enters the pipeline, bit0 of the first cell of the register file array is written to indicating that the instruction is present at the first stage of the pipeline. With each cycle, the bit0 value is shifted to a next cell of the register file array via the scan path. If a dependent instruction is to be issued into the pipeline, the dependent instruction first reads the register file array to determine the location of the instruction from which it is dependent. The location of the bit0 value in the register file array is indicative of which stage of the pipeline the instruction is currently in. Based on this location, it is determined whether the remainder of the pipeline for the instruction may be bypassed and the result of the instruction provided to the dependent instruction. If the remainder of the pipeline cannot be bypassed, the dependent instruction is halted in the pipeline while the instruction upon which it is dependent continues to progress through the pipeline with each cycle. Once the instruction is at a stage where the result may be utilized by the dependent instruction, the dependent instruction is permitted to progress through the pipeline.
Thereafter, a bypass control bit is set in a first cell of the register file array (step 980). For the next clock cycle, the bypass control bit is shifted to a next cell in the register file array (step 990). A determination is made as to whether the instruction's execution in the pipeline has completed (step 1000). If so, the operation terminates. Otherwise, the operation returns to step 990 for the next cycle of pipeline execution.
Thus, the present invention provides a mechanism for tracking the progress of an instruction through an execution pipeline using a scannable register file array. With the mechanism of the present invention, the scan path through a register file array is used to write a bypass control bit into the cells of the register file. This bypass control bit is right shifted from cell to cell in the register file array with each pipeline clock cycle. In this way, the position of the bypass control bit in the scannable register file array is indicative of the position of the instruction within the pipeline. The location of the bypass control bit for an instruction in the scannable register file array may be used by a dependent instruction to determine if the dependent instruction may bypass the remainder of the pipeline and obtain the result of the instruction from which it is dependent.
The description of the present invention has been presented for purposes of illustration and description, and is not intended to be exhaustive or limited to the invention in the form disclosed. Many modifications and variations will be apparent to those of ordinary skill in the art. The embodiment was chosen and described in order to best explain the principles of the invention, the practical application, and to enable others of ordinary skill in the art to understand the invention for various embodiments with various modifications as are suited to the particular use contemplated.