The subject matter, which is regarded as the invention, is particularly pointed out and distinctly claimed in the claims at the conclusion of the specification. The foregoing and other objects, features, and advantages of the invention are apparent from the following detailed description taken in conjunction with the accompanying drawings in which:
One aspect of the exemplary embodiments is a dual structure for stores. Another aspect of the exemplary embodiments is a mechanism for tracking store order and for allowing stores to forward their data to loads.
Specifically, the exemplary embodiments of the present application divide the Store Reorder Queue (SRQ) into two parts. The first part is the RSTQ (Retirement Store Queue), which is a list of in-flight stores, sorted by the program order of the stores. However, each entry in the RSTQ can be smaller than an SRQ entry, and in particular need not contain the address to which the store writes its data. Instead, the addresses to which stores write their data are kept in a second structure called the FSTQ. In order to mitigate the problems with area, power, and cycle time described above, the FSTQ has a structure similar to a cache. In particular, the FSTQ is divided into a set of congruence classes, each congruence class being able to hold information concerning a small number (e.g., 4 or 8) of stores at any one time. With these congruence classes, a load need only check a small number of stores (e.g., 4 or 8) in order to determine if there is an in-flight store from which the load should have data forwarded. As noted above, the traditional solution must check 16, 32, 64, or more entries in the SRQ to achieve the same ends. In the exemplary embodiments of the present application, because far fewer stores must be checked, less area and power are required, and a smaller cycle time can be achieved, approximately 30-35% improved over previous approaches to tracking in-flight stores in out-of-order processors.
The congruence class into which each store is placed in the FSTQ depends on some subset of the bits in the address to which the store writes. Typically the bits determining congruence class are from the lower order bits of the address, as these tend to be more random and help spread entries around, and avoid over-subscribing any particular congruence class. Stores retiring (in program order) from the RSTQ inform the FSTQ that entries can be eliminated. If a congruence class in the FSTQ is full with other store instructions when attempting to add a new store instruction, then this new store instruction may be stalled or rejected, and reissued.
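The selection of a congruence class from low order address bits can be illustrated with a minimal sketch. The function name, the class count of 16, and the 16-byte block size are assumptions chosen for illustration only. Skipping the byte-offset bits within a block is likewise an assumption; the description requires only that some subset of the address bits be used.

```python
def congruence_class(address: int, num_classes: int = 16, block_bytes: int = 16) -> int:
    """Pick an FSTQ congruence class from low-order address bits.

    num_classes and block_bytes are illustrative assumptions; both must be
    powers of two so that the selection reduces to a shift and a mask.
    """
    for value in (num_classes, block_bytes):
        if value <= 0 or value & (value - 1):
            raise ValueError("num_classes and block_bytes must be powers of two")
    # Assumption: drop the bits that address a byte inside one block of data,
    # then use the next-lowest bits as the class index.
    offset_bits = block_bytes.bit_length() - 1
    return (address >> offset_bits) & (num_classes - 1)
```

Under these assumptions, two stores to the same 16-byte block always map to the same class, while stores to consecutive blocks are spread across consecutive classes.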
Also, the FSTQ and the RSTQ need to be kept synchronized. The description below discusses mechanisms by which this synchronization is achieved. The detailed solution also discusses how the exemplary embodiments of the present application behave during different phases of load and store execution.
The purpose of the dual structure of the exemplary embodiments of the present application is (1) to track store order and (2) to allow stores to forward their data to loads. The FSTQ is a cache-like structure used to forward data from in-flight stores to load instructions. Like a cache, it has congruence classes determined in the preferred exemplary embodiment by some subset of low order address bits. Below is one embodiment of an FSTQ. Variations on this embodiment for fine tuned control, error detection/correction, etc. would be obvious to anyone skilled in the art.
SSQN: If an FSTQ entry holds only one store, then this field would have only one value. If an FSTQ entry can merge values from multiple stores, this field could have one entry for each byte in the block of data (e.g., 16 SSQNs). These SSQN values can be used as indices into the other major structure, the RSTQ.
Valid Bit(s): Like the SSQN, if an FSTQ entry holds only one store, then this field would be one bit. If an FSTQ entry can merge values from multiple stores, this field could have up to one entry for each byte in the block of data (e.g., 16 valid bits).
Thread Number: Like the SSQN and the “Valid Bit(s)”, if an FSTQ entry can hold only one store, then this field would be ceil[log2(MAX_THREADS)] bits, e.g., log2(4) = 2 bits. If an FSTQ entry can merge values from multiple stores, this field could have up to one entry for each byte in the block of data (e.g., 16*log2(MAX_THREADS) = 16*2 = 32 bits).
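The field widths discussed above can be checked with a short calculation. This is a sketch only: the function name and its parameters are hypothetical, and sizing each SSQN as ceil(log2) of the number of RSTQ entries is an assumption based on the SSQN being usable as an index into the RSTQ.

```python
def fstq_field_bits(max_threads: int, rstq_entries: int,
                    block_bytes: int = 16, merging: bool = False) -> dict:
    """Return the bit width of the SSQN, valid, and thread fields of one FSTQ entry.

    A non-merging entry keeps one copy of each field; a merging entry keeps
    up to one copy per byte of the data block.
    """
    copies = block_bytes if merging else 1
    # (n - 1).bit_length() equals ceil(log2(n)) for n >= 1, without floating point.
    ssqn_bits = (rstq_entries - 1).bit_length()    # assumption: SSQN indexes the RSTQ
    thread_bits = (max_threads - 1).bit_length()
    return {
        "ssqn": copies * ssqn_bits,
        "valid": copies,
        "thread": copies * thread_bits,
    }
```

For example, with MAX_THREADS of 4 and a 16-byte block, a merging entry needs 16 valid bits and 32 thread-number bits, matching the figures given in the text.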
Furthermore, unlike a traditional cache, the same address could appear multiple times in the same congruence class of the FSTQ. This situation would occur if multiple stores to the same address are simultaneously in flight. The SSQN, thread number, and valid bits indicate which, if any, of the entries should have its value forwarded to a given load.
As far as the structure of the RSTQ is concerned, the RSTQ behaves as a true First-In First-Out (FIFO) queue: each of the plurality of stores enters it in program order, and only after being decoded. Unlike traditional store queues, the RSTQ has no associative search capability. Instead, the searching is done via the FSTQ.
The RSTQ serves as a place to hold store data until the store completes, as a retirement queue of stores for in-order completion, and as a FIFO queue to determine stores that need to be flushed due to mispredicted branches or other reasons.
Below is one embodiment of an RSTQ. Variations on this embodiment for fine tuned control, error detection/correction, etc., would be obvious to anyone skilled in the art.
Index to FSTQ: If the FSTQ has N entries, then this pointer need not have more than ceil[log2(N)] bits. For example, if the FSTQ has 64 entries, this pointer could require up to log2(64) = 6 bits. (Note that the RSTQ entry can point directly to the FSTQ entry holding data for the store, and avoid the need for any associative search.)
Global Instruction ID: Useful for flushes due to branch mispredicts and other events.
Moreover, in a processor with Simultaneous Multi-Threading (SMT), the RSTQ could be partitioned among the threads in a manner obvious to anyone skilled in the art, and in much the same manner that a traditional store queue could be partitioned.
Referring to
Referring to
As far as additional micro-architectural registers are concerned, a power and area efficient implementation of the RSTQ could be implemented as a circular buffer. A circular buffer avoids the need to shift or compact entries. To manage the RSTQ as a circular buffer, at least two micro-architectural registers are useful. One is the RSTQ_TAIL: the location in the RSTQ into which store instructions are initially placed. The other is the RSTQ_HEAD: the location in the RSTQ from which store instructions are removed, with their data placed into the memory hierarchy. Other means of managing a circular buffer or of implementing the RSTQ are obvious to anyone skilled in the art. Likewise, having N RSTQ_TAIL registers and N RSTQ_HEAD registers in an SMT processor with N threads, so as to manage a partitioned RSTQ, is obvious to anyone skilled in the art.
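One possible software model of such a circular buffer is sketched below. The class name CircularRSTQ and the occupancy counter used to distinguish a full queue from an empty one are assumptions made for illustration; the description itself names only the RSTQ_HEAD and RSTQ_TAIL registers.

```python
class CircularRSTQ:
    """Fixed-size FIFO managed by head and tail pointers, with no shifting or compaction."""

    def __init__(self, size: int):
        self.slots = [None] * size
        self.head = 0    # models RSTQ_HEAD: oldest store, next to be removed
        self.tail = 0    # models RSTQ_TAIL: slot into which the next store is placed
        self.count = 0   # assumption: occupancy counter to tell full from empty

    def is_full(self) -> bool:
        return self.count == len(self.slots)

    def push(self, store) -> int:
        """Place a store at the tail and return the slot it occupies."""
        if self.is_full():
            raise OverflowError("RSTQ full")
        slot = self.tail
        self.slots[slot] = store
        self.tail = (self.tail + 1) % len(self.slots)   # bump, wrapping around
        self.count += 1
        return slot

    def pop(self):
        """Remove and return the oldest store from the head."""
        if self.count == 0:
            raise IndexError("RSTQ empty")
        store = self.slots[self.head]
        self.slots[self.head] = None
        self.head = (self.head + 1) % len(self.slots)
        self.count -= 1
        return store
```

Because both pointers wrap modulo the buffer size, entries are never moved once written, which is the property that makes the slot number usable as a stable tag for a store.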
In addition, a definition of the actions of each of the structures just defined at key points during execution is provided.
DISPATCH means the placement—in program order—into (issue) queue(s), of an instruction or set of microinstructions corresponding to one architectural instruction.
ISSUE means the launch—not necessarily in program order—of an instruction or microinstruction from an (issue) queue into a function unit capable of executing the instruction. This “launch” includes actual execution of the instruction.
RETIRE means the completion—in program order—of an instruction whose execution has finished, and for which the execution of all prior instructions has finished. Thus, the architected state visible to the programmer or other entity viewing program execution is updated at RETIRE time.
When a store instruction is DISPATCHED, the following process is followed: (1) If the RSTQ is full, stall dispatch of the store. (2) If the RSTQ is not full, put the store instruction at the RSTQ_TAIL position. Remember this value of RSTQ_TAIL, and then bump the RSTQ_TAIL pointer. The remembered RSTQ_TAIL value serves as the Store Sequence Number (SSQN), and provides a means of ordering store instructions (as well as load instructions, as described below). (3) Include the RSTQ_TAIL/SSQN with the store instruction in the Issue Queue into which the store is placed. The Issue Queue should also pass this SSQN as a tag to the portion of the store that generates the data to be stored.
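The three dispatch steps can be sketched as a single function. The function name, the dictionary layout of an RSTQ entry, and the use of an explicit occupancy count are hypothetical and serve only to make the example self-contained.

```python
def dispatch_store(rstq: list, tail: int, count: int, store_name: str):
    """Model store DISPATCH.

    Returns (ssqn, new_tail, new_count), or None to signal that dispatch
    must stall because the RSTQ is full.
    """
    if count == len(rstq):          # step (1): RSTQ full, stall dispatch
        return None
    ssqn = tail                     # step (2): remember RSTQ_TAIL as the SSQN
    rstq[ssqn] = {"name": store_name, "data": None, "fstq_index": None}
    new_tail = (tail + 1) % len(rstq)   # bump RSTQ_TAIL
    # Step (3): the caller carries ssqn into the issue queue with the store.
    return ssqn, new_tail, count + 1
```

The returned SSQN is what the issue queue would later hand to both the address-generation and data-generation portions of the store.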
When a store instruction is ISSUED, the following process is followed: (a) Compute the address to which this store writes its data. This address could be a real address or an effective/virtual address. The preferred embodiment is to use a real address, as it avoids problems of synonyms (the same data being available at more than one address). However, management of these structures using effective/virtual addresses is obvious to anyone skilled in the art.
Using the address for this store, and using the SSQN value received from the issue queue (which received it during store DISPATCH, as described above):
If there is no room for a new entry in the FSTQ congruence class, stall the issue of the store or cause it to be reissued later, when room may have become available in the FSTQ. In most modern processors, loads expect to be able to receive forwarded data from any store that has issued, but not yet RETIRED.
If an FSTQ entry was created, update the RSTQ entry with the FSTQ index.
(b) When the data for the store arrives, accompanied by the SSQN value as a tag (as described in the discussion of store DISPATCH above):
Use the SSQN value to find where the data should go in the RSTQ.
Set the Valid bit for this data in the FSTQ.
Moreover, the SSQN value gives a direct address into the RSTQ, and the “Index to FSTQ” field in the RSTQ gives direct access to the corresponding FSTQ entry.
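The store ISSUE steps above can be sketched as follows. The sizes NUM_CLASSES and WAYS, the function names, and the entry layouts are hypothetical. Two further assumptions are made: the store data is recorded in both the RSTQ and the FSTQ entry, and the address portion of a store issues before its data arrives.

```python
NUM_CLASSES = 4   # assumption: number of FSTQ congruence classes
WAYS = 2          # assumption: stores held per congruence class

def new_fstq():
    return [[None] * WAYS for _ in range(NUM_CLASSES)]

def issue_store_address(fstq, rstq, ssqn, address):
    """Step (a): allocate an FSTQ entry; return its index, or None to stall/reissue."""
    cls = address % NUM_CLASSES              # low-order address bits pick the class
    for way in range(WAYS):
        if fstq[cls][way] is None:
            fstq[cls][way] = {"address": address, "ssqn": ssqn,
                              "valid": False, "data": None}
            rstq[ssqn]["fstq_index"] = (cls, way)   # RSTQ now points at the FSTQ entry
            return (cls, way)
    return None                              # class full: stall or reissue later

def store_data_arrives(fstq, rstq, ssqn, data):
    """Step (b): the SSQN tag locates the RSTQ entry, which locates the FSTQ entry."""
    rstq[ssqn]["data"] = data
    cls, way = rstq[ssqn]["fstq_index"]      # assumes step (a) already ran
    fstq[cls][way]["data"] = data            # assumption: FSTQ also keeps the data
    fstq[cls][way]["valid"] = True
```

Neither function searches associatively: the SSQN is a direct index into the RSTQ, and the stored index is a direct pointer into the FSTQ.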
When a store instruction is RETIRED, the following process is followed:
Pass the “Index to FSTQ” field of the retiring RSTQ entry to invalidate the corresponding FSTQ entry. (The FSTQ must have a corresponding entry, as the mechanism of this invention keeps the RSTQ and FSTQ contents in lockstep.)
Pass the store address and data to the memory hierarchy, just as is done in traditional store queues at retire time.
Bump the RSTQ_HEAD pointer.
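The retirement steps can be sketched as below. The function name, the entry layouts, and the use of a dictionary to stand in for the memory hierarchy are hypothetical. Because an RSTQ entry need not contain the store address, this sketch assumes the address is read from the FSTQ entry before that entry is invalidated.

```python
def retire_store(rstq: list, fstq: list, head: int, memory: dict) -> int:
    """Model store RETIRE for the entry at RSTQ_HEAD; return the new head."""
    entry = rstq[head]
    cls, way = entry["fstq_index"]        # the "Index to FSTQ" field
    fstq_entry = fstq[cls][way]
    # Pass the address and data to the memory hierarchy (modeled as a dict).
    memory[fstq_entry["address"]] = entry["data"]
    fstq[cls][way] = None                 # invalidate the corresponding FSTQ entry
    rstq[head] = None
    return (head + 1) % len(rstq)         # bump RSTQ_HEAD
```

Since stores leave only from the head, memory is updated in program order even though the stores may have issued out of order.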
When a load instruction is DISPATCHED, the following process is followed:
Note the value of the RSTQ_TAIL register, and include it as the SSQN with the load in the issue queue. Later, when the load issues and checks if any store value should be forwarded from the FSTQ, the check examines stores in priority order, starting with stores at SSQN and moving to progressively older stores.
When a load instruction is ISSUED, the following process is followed:
Using the address for this load, and using the SSQN value received from the issue queue (which received it during load DISPATCH, as described above):
The address dictates one congruence class in the FSTQ.
Check entries in that congruence class with matching addresses.
Forward the youngest store value that is at least as old as SSQN.
Furthermore, there may be multiple matching addresses in the congruence class. The rule above selects the proper value if there are one or multiple matching addresses. Also, if there are no matching addresses in the FSTQ, the load should obtain data from the caches in the memory hierarchy in the “normal” fashion.
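The forwarding rule can be sketched as a search over one congruence class. The function name and entry layout are hypothetical. The sketch follows the rule as worded (a store qualifies if its SSQN is at least as old as the SSQN of the load) and assumes SSQN values that increase monotonically, so that wraparound of a circular RSTQ is not modeled. Entries whose data is not yet valid are simply skipped, which is a simplification.

```python
def forward_to_load(congruence_class: list, load_address: int, load_ssqn: int):
    """Return the data to forward to a load, or None if the caches should be used."""
    candidates = [
        e for e in congruence_class
        if e is not None
        and e["valid"]                      # simplification: skip stores without data
        and e["address"] == load_address    # matching address
        and e["ssqn"] <= load_ssqn          # at least as old as the load's SSQN
    ]
    if not candidates:
        return None                         # no match: read the caches as normal
    # Among the qualifying stores, the youngest has the largest SSQN.
    return max(candidates, key=lambda e: e["ssqn"])["data"]
```

With three in-flight stores to the same address, a load positioned between the second and third receives the value of the second, since the third is younger than the load.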
The capabilities of the present invention can be implemented in software, firmware, hardware or some combination thereof.
As one example, one or more aspects of the present invention can be included in an article of manufacture (e.g., one or more computer program products) having, for instance, computer usable media. The media has embodied therein, for instance, computer readable program code means for providing and facilitating the capabilities of the present invention. The article of manufacture can be included as a part of a computer system or sold separately.
Additionally, at least one program storage device readable by a machine, tangibly embodying at least one program of instructions executable by the machine to perform the capabilities of the present invention can be provided.
The flow diagrams depicted herein are just examples. There may be many variations to these diagrams or the steps (or operations) described therein without departing from the spirit of the invention. For instance, the steps may be performed in a differing order, or steps may be added, deleted or modified. All of these variations are considered a part of the claimed invention.
While the preferred embodiment to the invention has been described, it will be understood that those skilled in the art, both now and in the future, may make various improvements and enhancements which fall within the scope of the claims which follow. These claims should be construed to maintain the proper protection for the invention first described.
This invention was made with Government support under contract No.: NBCH3039004 awarded by Defense Advanced Research Projects Agency (DARPA). The government has certain rights in this invention.