The present invention relates to memory ordering, and more particularly to processing of memory operations according to a memory order model.
Memory instruction processing must act in accordance with a target instruction set architecture (ISA) memory order model. For reference, Intel Corporation's two main ISAs: Intel® architecture (IA-32 or x86) and Intel's ITANIUM processor family (IPF) have very different memory order models. In IA-32, load and store operations must be visible in program order. In the IPF architecture, they do not in general, but there are special instructions by which a programmer can enforce ordering when necessary (e.g., load acquire (referred to herein as “load acquire”), store release (referred to herein as “store release”), memory fence, and semaphores).
One simple, but low-performance strategy for keeping memory operations in order is to not allow a memory instruction to access a memory hierarchy until a prior memory instruction has obtained its data (for a load) or gotten confirmation of ownership via a cache coherence protocol (for a store).
However, software applications increasingly rely upon ordered memory operations, that is, memory operations which impose an ordering of other memory operations and themselves. While executing parallel threads in a chip multiprocessor (CMP), ordered memory instructions are used in synchronization and communication between different software threads or processes of a single application. Transaction processing and managed run-time environments rely on ordered memory instructions to function effectively. Further, binary translators that translate from a stronger memory order model ISA (e.g., x86) to a weaker memory order ISA (e.g., IPF) assume that the application being translated relies on the ordering enforced by the stronger memory order model. Thus when the binaries are translated, they must replace loads and stores with ordered loads and stores to guarantee program correctness.
With increasing utilization of ordered memory operations, the performance of ordered memory operations is becoming more important. In current x86 processors, processing ordered memory operations out-of-order is already crucial to performance, as all memory operations are ordered operations. Out-of-order processors implementing a strong memory order model can speculatively execute loads out-of-order, and then check to ensure that no ordering violation has occurred before committing the load instruction to machine state. This can be done by tracking executed, but not yet committed load addresses in a load queue, and monitoring writes by other central processing units (CPUs) or cache coherent agents. If another CPU writes to the same address as a load in the load queue, the CPU can trap or replay the matching load (and eradicate all subsequent non-committed loads), and then re-execute that load and all subsequent loads, ensuring that no younger load is satisfied before an older load.
In-order CPU's however can commit load instructions before they have returned their data into the register file. In such a CPU, loads can commit as soon as they have passed all their fault checks (e.g., data translation buffer (DTB) miss, and unaligned access), and before data is retrieved. Once load instructions retire, they cannot be re-executed. Therefore, it is not an option to trap and refetch or re-execute loads after they have retired based upon monitoring writes from other CPUs as described above.
A need thus exists to improve performance of ordered memory operations, particularly in a processor with a weak memory order model.
Referring to
A processor including such processor resources may use them as temporary storage for various memory operations that may be executed within the system. For example, load queue 20 may be used to temporarily store entries of particular memory operations, such as load operations and to track prior loads or other memory operations that must be completed before the given memory operation itself can be completed. Similarly, store queue 30 may be used to store memory operations, for example, store operations and to track prior memory operations (usually loads) that must be completed before a given memory operation itself can commit. In various embodiments, a merge buffer 40 may be used as a buffer to temporarily store data corresponding to a memory operation until such time as the memory operation (e.g., a store or sempahore) can be completed or committed.
An ISA with a weak memory order model (such as IPF processors) may include explicit instructions that require stringent memory ordering (e.g., load acquire, store release, memory fence, and semaphores), while most regular loads and stores do not impose stringent memory ordering. In an ISA having a strong memory order model (e.g., an IA-32 ISA), every load or store instruction may follow stringent memory ordering rules. Thus a program translated from an IA-32 environment to an IPF environment, for example, may impose strong memory ordering to ensure proper program behavior by substituting all loads with load acquires and all stores with store releases.
When a processor in accordance with an embodiment of the present invention processes a load acquire, it ensures that the load acquire has achieved global visibility before subsequent loads and stores are processed. Thus, if the load acquire misses in a first level data cache, subsequent loads may be prohibited from updating the register file, even if they would have hit in the data cache, and subsequent stores must test for ownership of the block they are writing only after the load acquire has returned its data to the register file. To accomplish this, the processor may force all loads younger than an outstanding load acquire to miss in the data cache and enter a load queue, i.e., a miss request queue (MRQ), to ensure proper ordering.
When a processor in accordance with an embodiment of the present invention processes a store release, it ensures that all prior loads and stores have achieved global visibility. Thus, before the store release can make its write globally visible, all prior loads must have returned data to the register file, and all prior stores must have achieved ownership visibility via a cache coherence protocol.
Memory fence and semaphore operations have elements of both load acquire and store release semantics.
Still referring to
Also associated with MRQ entry 25 is an order bit (O-bit) 27 that may be used to indicate that succeeding memory operations stored in load queue 20 should be ordered with respect to MRQ entry 25. Furthermore, a valid bit 28 may also be present. As still further shown in
Similarly, store queue 30 (also referred to herein as “STB 30”) may include a plurality of entries. For purposes of illustration, only a single STB entry 35 is shown in
In various embodiments, merge buffer 40 may transmit a signal 45 (i.e., an “All Prior Writes Visible” signal) that may be used to indicate that all prior write operations have achieved visibility. In such an embodiment, signal 45 may be used to notify that a release-semantic memory operation in STB 30 (usually a store release, memory fence or semaphore release) that has delayed committing may now commit upon receipt of signal 45. Use of signal 45 will be discussed further below.
Together, these mechanisms may enforce memory ordering as needed by the semantics of the memory operations issued. The mechanisms may facilitate high performance, as a processor in accordance with certain embodiments may only enforce ordering constraints when desired to take advantage of native binaries that use a weak memory order model.
Further, in various embodiments, order vector checks for loads may be deferred as late as possible. This has two implications. First, with respect to pipelined memory accesses, loads that require ordering constraints access the cache hierarchy normally (aside from being forced to miss the primary data cache). This allows a load to access second and third level caches and other processor socket caches and memory before its ordering constraints are checked. Only when the load data is about to write into the register file is the order vector checked to ensure that all constraints are met. If a load acquire misses the primary data cache, for example, a subsequent load (which must wait for the load acquire to complete) may launch its request in the shadow of the load acquire. If the load acquire returns data before the subsequent load returns data, the subsequent load suffers no performance penalty due to the ordering constraint. Thus in the best case, ordering can be enforced while load operations are completely pipelined.
Second, with respect to data prefetching, if a subsequent load tries to return data before a preceding load acquire, it will have effectively prefetched its accessed block into the CPU cache. After the load acquire returns data, the subsequent load may retry out of the load queue and get its data from the cache. Ordering may be maintained because an intervening globally visible write causes the cache line to be invalidated, resulting in the cache block being refetched to obtain an updated copy.
Referring now to
Still referring to
If instead at diamond 105, it is determined that no prior ordered operations are outstanding in the load queue, it may be determined whether data is present in a data cache (diamond 110). If so, the data may be obtained from the data cache (block 118) and normal processing may continue.
At diamond 115, it may be determined whether the instruction is a load acquire operation. If it is not, control may pass to
While not shown in
Referring now to
Then, an order vector corresponding to the load instruction may be analyzed (block 220). For example, an MRQ entry in a load queue corresponding to the load instruction may have an order vector associated therewith. The order vector may be analyzed to determine whether the order vector is clear (diamond 230). In the embodiment of
If instead the order vector is determined to be clear at diamond 230, control may pass to block 250 in which the data may be written to a register file. Next, the entry corresponding to the load instruction may be deallocated (block 260). Finally, at block 270, the order bit corresponding to the completed (i.e., deallocated) load operation may be column cleared from all subsequent entries in the load queue and store queue. In such manner, these order vectors may be updated with the completed status of the current operation.
If a store operation is about to attempt to attain global visibility (e.g., copy out from the store buffer to the merge buffer, and request ownership for its cache block), it may first check to ensure that its order vector is clear. If it is not, the operation may be deferred until the order vector is completely clear.
Referring now to
Still referring to
If instead at diamond 415 it is determined that a store release operation is present, next, an order vector for the entry may be generated based on information regarding all prior outstanding orderable operations in the load queue (block 420). As discussed above, such an order vector may include bits corresponding to pending memory operations (e.g., outstanding loads in an MRQ, as well as memory fences and other such operations).
At diamond 430, it may be determined whether the order vector is clear. If the order vector is not clear, a loop may be executed until the order vector becomes clear. When the order vector does become clear, it may be determined whether the operation is a release operation (diamond 435). If it is not, control may pass directly to block 445, as discussed below. If instead it is determined that a release operation is present, it may then be determined whether all prior writes have achieved visibility (diamond 440). For example, in one embodiment stores may be visible when data corresponding to the instruction is present in a given buffer or other storage location. If not, diamond 440 may loop back upon itself until all the prior writes have achieved visibility. When such visibility is achieved, control may pass to block 445.
There, the store may request visibility for the write to its cache block (block 445). While not shown in
In certain embodiments, a cache block for a store release operation may already be in the merge buffer (MGB), owned, when the store release is ready to attain visibility. The MGB may maintain high performance for streams of store releases (e.g., in code segments where all stores are store releases), if there is a reasonable amount of merging in the MGB for these blocks.
If the store has attained visibility, an acknowledgement bit may be set for store data in the merge buffer. The MGB may include this acknowledgment bit, also referred to as an ownership or dirty bit, for each valid cache block. In such embodiments, the MGB may then perform an OR operation across all of its valid entries. If any valid entries are not acknowledged, the “all prior writes visible” signal may be deasserted. Once this acknowledgement bit is set, the entry may become globally visible. In such manner, visibility may be achieved for the store or store release instruction (block 460). It is to be understood that at least certain actions set forth in
Referring now to
As shown in
Then it may be determined whether all prior stores are visible and whether the order vector for the entry in the store queue is now clear (diamond 535). If not, a loop may be executed until such stores have become visible and the order vector clears. When this occurs, control may pass to block 550 where the memory fence entry may be deallocated from the store queue.
As in store release processing, the STB may prevent the MF from deallocating until its order vector is clear and it receives an “all prior writes visible” signal from the merge buffer. Once the memory fence deallocates from the STB, the store order queue ID of the memory fence may be transmitted to the load queue (block 560). Accordingly, the load queue may see the store queue ID of the deallocated store, and perform a content addressable memory (CAM) operation across all entries' order store queue ID fields. Further, the memory fence entry in the load queue may be awoken from a sleep state.
Then, the order bit corresponding to the load and queue entries may be column cleared from all other entries (i.e., subsequent loads and stores) in the load queue and store queue (block 570), allowing them to complete, and the memory fence may be deallocated from the load queue.
Ordering hardware in accordance with an embodiment of the present invention may also control the order of memory or other processor operations for other reasons. For example, it can be used to order a load with a prior store that can provide some, but not all, of the load's data (partial hit); it can be used to enforce read-after-write (RAW), write-after-read (WAR), and write-after-write (WAW) data dependency hazards through memory; and it can be used to prevent local bypassing of data from certain operations to others (e.g., from a semaphore to a load, or from a store to a semaphore). Further, in certain embodiments semaphores may use the same hardware to enforce proper ordering.
Referring now to
It is to be understood that in other embodiments additional such processors may be coupled to coherent memory 630. Furthermore in certain embodiments, coherent memory 630 may be implemented in parts and spread out such that a subset of processors within system 600 communicate to some portions to coherent memory 630 and other processors communicate to other portions of coherent memory 630.
As shown in
Coherent memory 630 may also be coupled (via a hub link) to an input/output (I/O) hub 635 that is coupled to an I/O expansion bus 655 and a peripheral bus 650. In various embodiments, I/O expansion bus 655 may be coupled to various I/O devices such as a keyboard and mouse, among other devices. Peripheral bus 650 may be coupled to various components such as peripheral device 670 which may be a memory device such as a flash memory, add-in card, and the like. Although the description makes reference to specific components of system 600, numerous modifications of the illustrated embodiments may be possible.
Embodiments may be implemented in a computer program that may be stored on a storage medium having instructions to program a computer system to perform the embodiments. The storage medium may include, but is not limited to, any type of disk including floppy disks, optical disks, compact disk read-only memories (CD-ROMs), compact disk rewritables (CD-RWs), and magneto-optical disks, semiconductor devices such as read-only memories (ROMs), random access memories (RAMs) such as dynamic and static RAMs, erasable programmable read-only memories (EPROMs), electrically erasable programmable read-only memories (EEPROMs), flash memories, magnetic or optical cards, or any type of media suitable for storing electronic instructions. Other embodiments may be implemented as software modules executed by a programmable control device.
While the present invention has been described with respect to a limited number of embodiments, those skilled in the art will appreciate numerous modifications and variations therefrom. It is intended that the appended claims cover all such modifications and variations as fall within the true spirit and scope of this present invention.