IBM® is a registered trademark of International Business Machines Corporation, Armonk, N.Y., U.S.A. S/390, Z900 and z990 and other names used herein may be registered trademarks, trademarks or product names of International Business Machines Corporation or other companies.
This invention relates to computer systems that execute floating point instructions, and more particularly, to a method and system for processing limited out-of-order execution of floating point loads.
A floating point unit typically consists of several pipeline stages, such as multiple pipeline stages for arithmetic computation (e.g., addition and multiplication), a normalization stage, and a rounding stage. Each pipeline stage may contain a separate instruction and the stages are connected in an ordered manner. As an instruction enters the pipeline, the necessary input data operands are accessed and are put into the first stage of the pipeline. The instruction advances from stage to stage within the pipeline as permitted. An instruction is considered to “stall” within the pipeline when forward progress is not allowed. An instruction is not permitted to advance to a new stage in the pipeline when the next pipeline stage contains an earlier instruction that itself cannot advance. In addition, an instruction cannot begin to operate until it has data to operate on. It may not have that data when an earlier instruction will update the data that the later instruction operates upon. This is referred to as a data dependency. For this reason, the later instruction will “stall” at the entrance to the pipeline until it receives the updated data. When each instruction is executed in the order in which it is received in the pipeline, the system may be referred to as an “in-order” processing system. In-order execution simplifies the design of the microprocessor, but it may result in poorer performance than that achieved by “out-of-order” processing systems, which allow instructions to be executed in a different order than they are received in the pipeline.
It would be desirable to not only be able to utilize the simplified design of an in-order processing system but to also allow an out-of-order execution of floating point loads to provide data to arithmetic instructions as early as possible. This would result in a smaller elapsed time for the arithmetic instruction because the arithmetic instruction would not have to wait for the previous load instruction to complete before beginning execution.
Exemplary embodiments of the present invention include a system for performing limited out-of-order execution of floating point loads. The system includes a plurality of stages making up a pipeline, the stages including an early stage. The system also includes a mechanism for inputting an arithmetic instruction into the pipeline, the arithmetic instruction including a result address. The mechanism also determines if the arithmetic instruction causes a write after write (WAW) condition to occur before writing a result of the arithmetic instruction to the result address. The determining includes comparing the result address to a load address associated with a load instruction subsequent to the arithmetic instruction in the pipeline. The load data associated with the load instruction was written to the load address in the early stage of the pipeline. A WAW condition occurs if the result address is equal to the load address. Writing a result of the arithmetic instruction is suppressed in response to the WAW condition occurring.
Additional exemplary embodiments include a method for performing floating point arithmetic operations. The method includes inputting an arithmetic instruction into a pipeline. The arithmetic instruction includes a result address and the pipeline includes a plurality of stages including an early stage. Before writing a result of the arithmetic instruction to the result address, a determination is made as to whether the arithmetic instruction causes a write after write (WAW) condition to occur. The determining includes comparing the result address to a load address associated with a load instruction subsequent to the arithmetic instruction in the pipeline. The load data associated with the load instruction was written to the load address in the early stage of the pipeline. A WAW condition occurs if the result address is equal to the load address. Writing a result of the arithmetic instruction is suppressed in response to the WAW condition occurring.
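The determination described above may be illustrated with a minimal software sketch. The sketch is not the claimed hardware; the names EarlyLoad, causes_waw and write_result, and the use of a Python list as a register file, are assumptions introduced only for illustration.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class EarlyLoad:
    """A load that follows the arithmetic instruction in the pipeline and
    whose data was already written to its load address in the early stage.
    (Hypothetical type, for illustration only.)"""
    load_address: int  # number of the target floating point register


def causes_waw(result_address: int, early_loads: Iterable[EarlyLoad]) -> bool:
    """A WAW condition occurs if the result address equals a load address."""
    return any(result_address == load.load_address for load in early_loads)


def write_result(fpr: List[float], result_address: int, result: float,
                 early_loads: Iterable[EarlyLoad]) -> bool:
    """Write the arithmetic result unless a WAW condition occurs.

    Returns True if the write was performed, False if it was suppressed.
    """
    if causes_waw(result_address, early_loads):
        return False  # suppressed: the later load's data must be preserved
    fpr[result_address] = result
    return True
```

In this sketch, a suppressed write leaves the register holding the data of the later load, which is the value an in-order machine would have left in the register after both instructions completed.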
Additional features and advantages are realized through the techniques of the present invention. Other embodiments and aspects of the invention are described in detail herein and are considered a part of the claimed invention. For a better understanding of the invention with advantages and features, refer to the description and to the drawings.
The subject matter which is regarded as the invention is particularly pointed out and distinctly claimed in the claims at the conclusion of the specification. The foregoing and other objects, features, and advantages of the invention are apparent from the following detailed description taken in conjunction with the accompanying drawings in which:
The detailed description explains the preferred embodiments of the invention, together with advantages and features, by way of example with reference to the drawings.
Exemplary embodiments of the present invention detect dependencies between loads and arithmetic operations, and allow load dependencies to be immediately resolved. Dependencies are immediately resolved by bypass paths or by allowing subsequent instructions to read directly from the floating point register and not marking the dependency. Allowing load dependencies to be immediately resolved creates a problem in ordering the loads with the arithmetic instructions, because the load instructions are issued in order but complete out-of-order. In particular, there is a “write after write” (WAW) hazard. For example, a multiply instruction that writes to FPR5 (floating point register five) may be issued before a load to FPR5, but the load writes to FPR5 early. In this case, exemplary embodiments of the present invention include an issue queue that detects the WAW hazard and sends a signal to the floating point unit (FPU) to block the write of the multiply instruction. The multiply instruction still updates the floating point status and control register (FPSCR) but does not update the register file.
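The FPR5 example above may be sketched in software as follows. This is a simplified model under stated assumptions: the class FloatingPointUnit, its method names, the block_write argument standing in for the signal from the issue queue, and the use of an overflow flag as the only recorded status are all hypothetical and are not taken from the described hardware.

```python
import math


class FloatingPointUnit:
    """Toy model: a register file plus a set of status flags (hypothetical)."""

    def __init__(self, num_fprs: int = 32) -> None:
        self.fpr = [0.0] * num_fprs
        self.fpscr_flags = set()  # stands in for status bits of the FPSCR

    def early_load(self, target: int, value: float) -> None:
        """A load writes its target register in an early pipeline stage."""
        self.fpr[target] = value

    def complete_multiply(self, target: int, a: float, b: float,
                          block_write: bool) -> None:
        """Finish a multiply at the end of the pipeline.

        block_write models the signal sent by the issue queue when it has
        detected a WAW hazard against a later load to the same register.
        """
        product = a * b
        # Status is recorded whether or not the register write is blocked.
        if math.isinf(product) and not (math.isinf(a) or math.isinf(b)):
            self.fpscr_flags.add("overflow")
        if not block_write:
            self.fpr[target] = product


fpu = FloatingPointUnit()
fpu.early_load(5, 1.5)  # later load to FPR5, written early
fpu.complete_multiply(5, 1e308, 10.0, block_write=True)  # earlier multiply
```

After these calls, FPR5 in the model still holds the loaded value, while the status recorded by the multiply is retained.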
Exemplary embodiments of the present invention include limited out-of-order execution of floating point loads. The term limited out-of-order execution of floating point loads, as used herein, refers to an in-order processing system with loads being written to an FPR in an early cycle (i.e., not waiting until the end of the pipeline to write to the FPR). All other instructions in the pipeline are executed in order. The mechanism to perform this limited out-of-order execution of floating point loads resolves dependencies in the issue queue early and also detects WAW hazards. The FPU writes loads in an early pipeline stage and also has a mechanism for blocking writes due to WAW hazards.
A sample instruction stream for input to a seven stage pipeline follows:
In the above instruction stream it would be desirable for the load instructions (e.g., lfd instruction 1.1) to be performed early, for example in the first stage of the pipeline rather than the seventh, so that an arithmetic instruction (e.g., a fused multiply add instruction such as fmadd instruction 1.2) may utilize the data from the load without having to wait seven cycles for the load instruction to complete. In addition, it would be desirable to avoid having the fmadd instruction 1.2 overwrite (e.g., in cycle 7) the value loaded into FPR5 by the second load instruction (i.e. lfd instruction 2.1). Several approaches may be taken to executing floating point loads in a pipeline while still being able to start dependent instructions.
In this manner, an arithmetic instruction does not have to wait for the load instruction to complete all of the pipeline stages and actually write to the FPR in order to access the loaded data. Referring back to the previous sample instruction stream, the fmadd instruction 1.2 may receive the data loaded by the load instruction 1.1 via the first feedback path going from the output of the data one register 104 to the input of the multiplexer 102. If an arithmetic instruction and a load instruction were separated by one intervening instruction, then the second feedback path, going from the output of the data two register 108 to the input of the multiplexer 102, would be utilized.
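The selection between the two feedback paths may be sketched as a simple function of the distance, in instructions, between the load and the dependent arithmetic instruction. The function name select_operand, the distance parameter and the fall-back read from the FPR for larger distances are assumptions made for illustration; only the two feedback paths themselves are taken from the description above.

```python
def select_operand(distance: int, data_one: float, data_two: float,
                   fpr_value: float) -> float:
    """Model of the operand choice made at the multiplexer.

    distance is 1 when the arithmetic instruction immediately follows the
    load, and 2 when one instruction intervenes between them.
    """
    if distance == 1:
        return data_one   # first feedback path (data one register)
    if distance == 2:
        return data_two   # second feedback path (data two register)
    return fpr_value      # assumed: otherwise read the register file
```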
As depicted in
Another approach to executing floating point loads in a pipeline while still being allowed to start dependent instructions includes register renaming with early write. This would solve the problem of forwarding load data to the fmadd instruction, as well as the WAW hazard of the second load. However, this approach is very complex and requires hardware to scoreboard instruction execution. It is better suited to a full out-of-order execution design and is not suited to a limited out-of-order execution design as described herein.
As depicted in
Several comparators (e.g., Comparator 1 421, Comparator 2 422, Comparator 3 423, Comparator 4 424, Comparator 5 425 and Comparator 6 426 in
Referring back to the sample instruction stream described previously, this allows the fmadd instruction 1.2 to immediately follow the lfd instruction 1.1 and allows the second lfd instruction 2.1 to be started before the previous fmadd instruction 1.2 has completed. The lfd instruction 1.1 stores the value of mem1 into FPR5 during the first stage in the pipeline as depicted in
Exemplary embodiments of the present invention may be extended to include pipelines of other sizes, the early load to the register occurring in a cycle other than the first cycle, and the write by the arithmetic instruction occurring in a cycle other than the last cycle.
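The effect of choosing different write stages may be illustrated with a small timing model. The model is a sketch under stated assumptions: the names Instr and run, the representation of an instruction by a precomputed value, and the issue cycles and values used below are hypothetical; only the ordering (a load to FPR5, an arithmetic instruction writing FPR5, and a second load to FPR5) follows the sample instruction stream discussed previously.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Instr:
    kind: str         # "load" or "arith"
    target: int       # target floating point register number
    value: float      # value the instruction would write
    issue_cycle: int  # cycle in which the instruction enters the pipeline


def run(instrs: List[Instr], load_write_stage: int = 1,
        arith_write_stage: int = 7,
        num_fprs: int = 32) -> Tuple[List[float], List[int]]:
    """Return the final register contents and the indices of suppressed writes."""

    def write_cycle(ins: Instr) -> int:
        stage = load_write_stage if ins.kind == "load" else arith_write_stage
        return ins.issue_cycle + stage - 1

    fpr = [0.0] * num_fprs
    suppressed = []
    # Process register writes in the order in which they occur in time.
    order = sorted(range(len(instrs)),
                   key=lambda i: (write_cycle(instrs[i]), i))
    for i in order:
        ins = instrs[i]
        if ins.kind == "arith":
            # WAW: a later-issued load to the same register already wrote.
            hazard = any(
                other.kind == "load" and j > i
                and other.target == ins.target
                and write_cycle(other) <= write_cycle(ins)
                for j, other in enumerate(instrs))
            if hazard:
                suppressed.append(i)
                continue
        fpr[ins.target] = ins.value
    return fpr, suppressed


stream = [Instr("load", 5, 1.0, 1),   # first load to FPR5
          Instr("arith", 5, 2.0, 2),  # arithmetic result destined for FPR5
          Instr("load", 5, 3.0, 3)]   # second load to FPR5
final_fpr, blocked = run(stream)
```

With the default stages (load write in stage one, arithmetic write in stage seven), the arithmetic write is suppressed and FPR5 ends with the value of the second load, which matches program order. Passing other stage numbers to run models the variations described above.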
Exemplary embodiments of the present invention assist in optimizing the execution of floating point loads in a floating point pipeline. The design modifies an in-order execution machine to be slightly out-of-order and provides a mechanism for suppressing the writes that would otherwise cause WAW hazards. Exemplary embodiments of the present invention allow only loads to be executed out-of-order and reduce the wiring that would be required in an in-order machine with many bypasses. In addition, the WAW hazard mechanism suppresses arithmetic instruction writes but allows their feedback paths to dependent instructions to be maintained.
The capabilities of the present invention can be implemented in software, firmware, hardware or some combination thereof.
As one example, one or more aspects of the present invention can be included in an article of manufacture (e.g., one or more computer program products) having, for instance, computer usable media. The media has embodied therein, for instance, computer readable program code means for providing and facilitating the capabilities of the present invention. The article of manufacture can be included as a part of a computer system or sold separately.
Additionally, at least one program storage device readable by a machine, tangibly embodying at least one program of instructions executable by the machine to perform the capabilities of the present invention, can be provided.
The flow diagrams depicted herein are just examples. There may be many variations to these diagrams or the steps (or operations) described therein without departing from the spirit of the invention. For instance, the steps may be performed in a differing order, or steps may be added, deleted or modified. All of these variations are considered a part of the claimed invention.
While the invention has been described with reference to exemplary embodiments, it will be understood by those skilled in the art that various changes may be made and equivalents may be substituted for elements thereof without departing from the scope of the invention. In addition, many modifications may be made to adapt a particular situation or material to the teachings of the invention without departing from the essential scope thereof. Therefore, it is intended that the invention not be limited to the particular embodiment disclosed as the best mode contemplated for carrying out this invention, but that the invention will include all embodiments falling within the scope of the appended claims. Moreover, the use of the terms first, second, etc. does not denote any order or importance; rather, the terms first, second, etc. are used to distinguish one element from another.