The invention may be understood by reference to the following description taken in conjunction with the accompanying drawings, in which like reference numerals identify like elements, and in which:
Illustrative embodiments of the invention are described below. In the interest of clarity, not all features of an actual implementation are described in this specification. It will of course be appreciated that in the development of any such actual embodiment, numerous implementation-specific decisions must be made to achieve the developers' specific goals, such as compliance with system-related and business-related constraints, which will vary from one implementation to another. Moreover, it will be appreciated that such a development effort might be complex and time-consuming, but would nevertheless be a routine undertaking for those of ordinary skill in the art having the benefit of this disclosure.
While the invention is susceptible to various modifications and alternative forms, specific embodiments thereof have been shown by way of example in the drawings and are herein described in detail. It should be understood, however, that the description herein of specific embodiments is not intended to limit the invention to the particular forms disclosed, but on the contrary, the intention is to cover all modifications, equivalents, and alternatives falling within the spirit and scope of the disclosure.
According to an embodiment of the present disclosure, self-reconciled data is used to reduce cache coherence overhead in multiprocessor systems. A cache line is self-reconciled if the cache itself is responsible for maintaining the coherence of the data, in the sense that, if the data is modified in another cache, cache coherence can be maintained without an invalidate request being sent to invalidate the self-reconciled cache line.
When a cache needs to obtain a shared copy, the cache can obtain either a regular copy or a self-reconciled copy. The difference between a regular copy and a self-reconciled copy is that, if the data is later modified in another cache, that cache needs to send an invalidate request to invalidate the regular copy, but does not need to send an invalidate request to invalidate the self-reconciled copy. Software, executed by a processor, can provide heuristic information indicating whether a regular copy or a self-reconciled copy should be used. For example, such heuristic information can be associated with a memory load instruction, indicating whether a regular copy or a self-reconciled copy should be retrieved if a cache miss is caused by the memory load operation.
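By way of illustration only, the following C sketch shows one way such a software-supplied hint might be expressed; the load_with_hint wrapper and the hint encoding are assumptions for illustration and are not part of any particular instruction set.

    #include <stdint.h>

    /* Hypothetical hint values that software could attach to a memory load;
     * the encoding is assumed for illustration only. */
    enum load_hint {
        LOAD_HINT_REGULAR,         /* on a miss, retrieve a regular copy         */
        LOAD_HINT_SELF_RECONCILED  /* on a miss, retrieve a self-reconciled copy */
    };

    /* Illustrative wrapper: a real system would encode the hint in the load
     * instruction itself (for example, in an otherwise unused opcode bit). */
    static inline uint64_t load_with_hint(const uint64_t *addr, enum load_hint hint)
    {
        (void)hint;   /* in hardware, the hint steers the miss handling */
        return *addr; /* a plain load stands in for the hinted load     */
    }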
Alternatively, the underlying cache coherence protocol of a multiprocessor system can be enhanced with a self-reconciled data prediction mechanism, wherein the self-reconciled data prediction mechanism determines, when a requesting cache needs to retrieve data of an address, whether a regular copy or a self-reconciled copy should be supplied to the requesting cache. With snoopy cache coherence, the self-reconciled data prediction can be implemented at the requesting cache side or at the sourcing cache side; with directory-based cache coherence, the self-reconciled data prediction can be implemented at the requesting cache side or at the home side.
Referring now to
In a multiprocessor system that uses a snoopy cache coherence protocol, self-reconciled data may be used if the snoopy protocol is augmented with proper filtering information so that an invalidate request does not always need to be broadcast to all the caches in the system.
An exemplary self-reconciled data prediction mechanism is implemented at the sourcing cache side. When a sourcing cache receives a cache request for a shared copy, the sourcing cache predicts that a self-reconciled copy should be supplied if (a) the snoop filtering information shows that no regular data copy is cached in the requesting cache (so that if a self-reconciled copy is supplied, an invalidate operation can be avoided in the future according to the snoop filtering information), and (b) a network traffic monitor indicates that network bandwidth consumption is high due to cache coherence messages.
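By way of illustration only, the sourcing cache side predictor described above may be summarized by the following C sketch; the helper functions standing for the snoop filter and the network traffic monitor are assumed interfaces.

    #include <stdbool.h>

    /* Assumed interfaces to the snoop filter and the network traffic monitor. */
    bool snoop_filter_shows_no_regular_copy(int requester_id, unsigned long addr);
    bool coherence_traffic_is_high(void);

    /* Supply a self-reconciled copy only if conditions (a) and (b) above
     * both hold. */
    bool source_predicts_self_reconciled(int requester_id, unsigned long addr)
    {
        return snoop_filter_shows_no_regular_copy(requester_id, addr)  /* (a) */
            && coherence_traffic_is_high();                            /* (b) */
    }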
Another exemplary self-reconciled data prediction mechanism is implemented via proper support at both the requesting cache side and the sourcing cache side. In case of a read cache miss, the requesting cache predicts that a self-reconciled copy should be provided if the corresponding address is not found in the requesting cache. The requesting cache predicts that a regular copy should be provided if the corresponding address is found in an invalid state in the requesting cache. The requesting cache side prediction result is attached to the corresponding cache request issued from the requesting cache. When a sourcing cache receives the cache request, the sourcing cache predicts that a self-reconciled copy should be provided if the snoop filtering information shows that (a) no regular data copy is cached in the requesting cache, and (b) the requesting cache is far away from other caches in which a regular data copy may be cached at the time. The sourcing cache supplies a self-reconciled copy if both the requesting cache side prediction result and the sourcing cache side prediction result indicate that a self-reconciled copy should be supplied. It should be noted that, if no sourcing cache exists, the memory can supply a regular copy to the requesting cache.
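By way of illustration only, the two-sided prediction described above may be sketched in C as follows; the lookup and distance predicates are assumed interfaces.

    #include <stdbool.h>

    typedef enum { LINE_NOT_PRESENT, LINE_INVALID, LINE_VALID } lookup_result_t;

    /* Requesting cache side: predict a self-reconciled copy when the address
     * has never been cached; predict a regular copy when an invalid copy of
     * the line is still resident. */
    bool requester_predicts_self_reconciled(lookup_result_t lookup)
    {
        return lookup == LINE_NOT_PRESENT;
    }

    /* Assumed interfaces standing for the snoop filtering information. */
    bool no_regular_copy_in_requester(int requester_id, unsigned long addr);
    bool requester_is_far_from_sharers(int requester_id, unsigned long addr);

    /* Sourcing cache side: supply a self-reconciled copy only if both the
     * requester's prediction (carried with the cache request) and the local
     * prediction agree. */
    bool source_supplies_self_reconciled(bool requester_prediction,
                                         int requester_id, unsigned long addr)
    {
        return requester_prediction
            && no_regular_copy_in_requester(requester_id, addr)
            && requester_is_far_from_sharers(requester_id, addr);
    }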
In a multiprocessor system that uses a directory-based cache coherence protocol, the self-reconciled data prediction can be implemented at the requesting cache side or at the home side. An exemplary self-reconciled data prediction mechanism is implemented at the home side. When the home of an address receives a read cache request, the home determines that a self-reconciled copy should be supplied if the communication latency between the home and the requesting cache is significantly larger than that between the home and other caches in which a regular data copy may be cached at the time according to the corresponding directory information.
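By way of illustration only, the home side predictor may be sketched as follows; the latency functions and the factor-of-two threshold standing in for "significantly larger" are assumptions.

    #include <stdbool.h>

    /* Assumed interfaces: communication latency from the home to a cache, and
     * the largest such latency over the caches recorded as sharers in the
     * directory entry of the address. */
    unsigned latency_from_home(int cache_id);
    unsigned max_sharer_latency(unsigned long addr);

    /* "Significantly larger" is modeled here by an assumed factor of two. */
    bool home_predicts_self_reconciled(int requester_id, unsigned long addr)
    {
        return latency_from_home(requester_id) > 2 * max_sharer_latency(addr);
    }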
Another exemplary self-reconciled data prediction mechanism is implemented via proper support at both the requesting cache side and at the home side. In case of a read cache miss, the requesting cache predicts that a self-reconciled copy should be provided if the corresponding address is not found in the requesting cache. The requesting cache predicts that a regular copy should be provided if the corresponding address is found in an invalid state in the requesting cache. The requesting cache side prediction result is included in the corresponding cache request sent from the requesting cache to the home. When the home receives the cache request, the home predicts that a self-reconciled copy should be supplied if the communication latency between the home and the requesting cache is significantly larger than that between the home and other caches in which a regular data copy may be cached according to the corresponding directory information. Finally, the home determines that a self-reconciled copy should be supplied if both the requesting cache side prediction result and the home side prediction result indicate that a self-reconciled copy should be supplied.
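By way of illustration only, carrying the requesting cache side prediction in the cache request and combining it at the home may be sketched as follows; the message layout is an assumption.

    #include <stdbool.h>
    #include <stdint.h>

    /* Assumed cache request layout: one extra bit carries the requesting
     * cache side prediction to the home. */
    struct cache_request {
        uint64_t addr;
        uint16_t requester_id;
        uint8_t  wants_self_reconciled;
    };

    bool home_predicts_self_reconciled(int requester_id, unsigned long addr);

    /* The home supplies a self-reconciled copy only if both predictions agree. */
    bool home_supplies_self_reconciled(const struct cache_request *req)
    {
        return req->wants_self_reconciled
            && home_predicts_self_reconciled(req->requester_id, req->addr);
    }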
A directory-based cache coherence protocol can choose to use limited directory space to reduce overhead of directory maintenance, wherein a limited number of cache identifiers can be recorded in a directory. An exemplary self-reconciled data prediction mechanism implemented at the home side determines that a self-reconciled copy should be supplied if the limited directory space has been used up and no further cache identifier can be recorded in the corresponding directory. Alternatively, the home can supply a regular data copy to the requesting cache, and downgrade a regular data copy cached in another cache to a self-reconciled data copy (so that the corresponding cache identifier no longer needs to be recorded in the directory).
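By way of illustration only, a limited directory entry and the two handling options described above may be sketched as follows; the four-slot limit and the choice of which sharer to downgrade are assumptions.

    #include <stdbool.h>
    #include <stdint.h>

    #define DIR_SLOTS 4  /* assumed limit on recorded cache identifiers */

    /* A limited directory entry: only regular sharers consume a slot;
     * self-reconciled sharers need not be recorded at all. */
    struct dir_entry {
        uint16_t sharer[DIR_SLOTS];
        uint8_t  num_sharers;
    };

    /* Assumed interface: downgrade a regular copy in a peer cache to a
     * self-reconciled copy so its identifier can be dropped. */
    void downgrade_to_self_reconciled(uint16_t cache_id, uint64_t addr);

    /* Returns true if a self-reconciled copy should be supplied. */
    bool serve_read_request(struct dir_entry *e, uint16_t requester,
                            uint64_t addr, bool prefer_downgrade)
    {
        if (e->num_sharers < DIR_SLOTS) {
            e->sharer[e->num_sharers++] = requester;  /* record regular copy */
            return false;
        }
        if (prefer_downgrade) {
            /* alternative: free a slot by downgrading a recorded sharer */
            downgrade_to_self_reconciled(e->sharer[0], addr);
            e->sharer[0] = requester;
            return false;
        }
        return true;  /* directory full: supply a self-reconciled copy */
    }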
In an illustrative embodiment of the present invention, a cache coherence protocol is extended with new cache states to allow self-reconciled data to be used. For a shared cache line, in addition to the regular shared (S) cache state, we introduce two new cache states, shared-transient (ST) and shared-transient-speculative (STS). If a cache line is in the regular shared state, the data is a regular shared copy. Consequently, if the data is modified in a cache, that cache needs to issue an invalidate request so that the regular shared copy can be invalidated in time.
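By way of illustration only, the shared-line states may be represented in C as follows; the names and encoding are illustrative.

    /* Illustrative encoding of the shared-line states used below. */
    typedef enum {
        CSTATE_INVALID,
        CSTATE_SHARED,                       /* S:   regular shared copy         */
        CSTATE_SHARED_TRANSIENT,             /* ST:  self-reconciled, single use */
        CSTATE_SHARED_TRANSIENT_SPECULATIVE  /* STS: possibly out-of-date        */
    } cache_state_t;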
If a cache line is in the shared-transient state, the data is a self-reconciled shared copy that would not be invalidated should the data be modified in another cache. It should be noted that the data of a cache line in the shared-transient state can be used only once without performing a self-reconcile operation to ensure that the data is indeed up-to-date. The exact meaning of being used only once depends on the semantics of the memory model. With sequential consistency, the data is guaranteed to be up-to-date for one read operation; with a weak memory model, the data can be guaranteed to be up-to-date for read operations before the next synchronization point.
For a cache line in the shared-transient state, once data of the cache line is used, the cache state of the cache line becomes shared-transient-speculative. The shared-transient-speculative state indicates that the data of the cache line can be up-to-date or out-of-date. As a result, the cache itself, rather than its peer caches or the memory, is ultimately responsible for maintaining the data coherence. It should be noted that the data of the shared-transient-speculative cache line can be used as speculative data so that the corresponding processor accessing the data can continue its computation speculatively. Meanwhile, the corresponding cache needs to issue appropriate coherence messages to its peer caches and the memory to ensure that up-to-date data is obtained if the data is modified elsewhere. Computation using speculative data typically needs to be rolled back if the speculative data turns out to be incorrect.
It should be appreciated by those skilled in the art that, when data of an address is cached in multiple caches, the data can be cached in the regular shared state, the shared-transient state and the shared-transient-speculative state in different caches at the same time. Generally speaking, the data is cached in the shared-transient state in a cache if the cached data will be used only once or very few times before it is modified by another processor, or the invalidate latency of the shared copy is larger than that of other shared copies. The self-reconciled data prediction mechanisms described above can be used to predict whether requested data of a cache miss should be cached in a regular shared state or in a shared-transient state.
When data of a shared cache line needs to be modified, the cache only needs to send an invalidate request to those peer caches in which the data is cached in the regular shared state. If bandwidth allows, the cache can also send an invalidate request to the peer caches in which the data is cached in the shared-transient state or the shared-transient-speculative state. This allows data cached in the shared-transient state or the shared-transient-speculative state to be invalidated quickly to avoid speculative use of out-of-date data. It should be noted that invalidate operations of shared-transient and shared-transient-speculative copies do not need to be acknowledged. It should also be noted that the proposed mechanism works even if invalidate requests to shared-transient or shared-transient-speculative caches are lost. The net effect is that some out-of-date data would be used in speculative executions (which would be rolled back eventually) since the cache lines are not invalidated in time.
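By way of illustration only, the invalidate fan-out on a write may be sketched as follows, reusing the cache_state_t encoding above; the peer-state lookup, the send primitive, and the bandwidth check are assumed interfaces.

    #include <stdbool.h>

    cache_state_t peer_state(int peer_id, unsigned long addr);
    void send_invalidate(int peer_id, unsigned long addr, bool needs_ack);
    bool spare_bandwidth_available(void);

    void invalidate_peers(int num_peers, int self_id, unsigned long addr)
    {
        for (int p = 0; p < num_peers; p++) {
            if (p == self_id)
                continue;
            cache_state_t s = peer_state(p, addr);
            if (s == CSTATE_SHARED) {
                send_invalidate(p, addr, true);   /* must be acknowledged   */
            } else if ((s == CSTATE_SHARED_TRANSIENT ||
                        s == CSTATE_SHARED_TRANSIENT_SPECULATIVE) &&
                       spare_bandwidth_available()) {
                send_invalidate(p, addr, false);  /* best effort: no ack,   */
            }                                     /* may safely be dropped  */
        }
    }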
For a cache line in the shared-transient-speculative state, the cache state can be augmented with a so-called access counter (A-counter), wherein the A-counter records the number of times that data of the cache line has been accessed since the data was cached. The A-counter can be used to determine whether a shared-transient-speculative cache line should be upgraded to a regular shared cache line. For example, the A-counter can be a 2-bit counter with a pre-defined limit of 3.
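By way of illustration only, the A-counter may be represented as a saturating two-bit field:

    #include <stdint.h>

    #define A_COUNTER_LIMIT 3u  /* pre-defined limit from the example above */

    /* Per-line metadata for a shared-transient-speculative line: a 2-bit
     * counter of accesses since the data was cached. */
    struct sts_metadata {
        uint8_t a_counter : 2;
    };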
When a processor reads data from a shared-transient cache line, the cache state is changed to shared-transient-speculative (with a weak memory model, this state change can be postponed to the next proper synchronization point). The A-counter is set to 0.
When a processor reads data from a shared-transient-speculative cache line, it uses the data speculatively. The processor typically needs to maintain sufficient information so that the system state can be rolled back if the speculation turns out to be incorrect. The cache needs to perform a self-reconcile operation by sending a proper coherence message to check whether the speculative data is up-to-date, and retrieves the most up-to-date data if the speculative data maintained in the cache is out-of-date.
If the A-counter is below the pre-defined limit, the cache performs a self-reconcile operation by issuing a shared-transient read request. Meanwhile, the A-counter is incremented by 1. When the cache receives the data, the cache compares the received data with the shared-transient-speculative data. If there is a match, the computation continues, and the cache state remains as shared-transient-speculative (with a weak memory model, the cache state can be set to shared-transient until the next synchronization point). However, if there is a mismatch, the speculative computation is rolled back, and the received data is cached in the shared-transient-speculative state (with a weak memory model, the received data can be cached in the shared-transient state until the next synchronization point).
On the other hand, if the A-counter reaches the pre-defined limit, the cache performs a self-reconcile operation by issuing a shared read request. When the cache receives the data, the cache compares the received data with the shared-transient-speculative data. If there is a match, the cache state is changed to regular shared; otherwise the speculative execution is rolled back, and the received data is cached in the shared state.
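By way of illustration only, the read path of the three preceding paragraphs may be gathered into a single C sketch (sequentially consistent variant; with a weak memory model the state changes would be deferred to the next synchronization point). The sketch reuses the cache_state_t encoding and A_COUNTER_LIMIT from the sketches above; the reconcile-request and rollback helpers are assumed interfaces.

    #include <stdbool.h>
    #include <stdint.h>

    struct cache_line {
        cache_state_t state;
        uint8_t       a_counter;  /* the 2-bit A-counter */
        uint64_t      data;
    };

    /* Assumed interfaces: the two flavors of self-reconcile request, each
     * returning the up-to-date data, and a rollback of mis-speculated work. */
    uint64_t issue_shared_transient_read(uint64_t addr);
    uint64_t issue_shared_read(uint64_t addr);
    void roll_back_speculation(void);

    uint64_t read_shared_line(struct cache_line *ln, uint64_t addr)
    {
        if (ln->state == CSTATE_SHARED_TRANSIENT) {
            /* ST data is up-to-date for exactly this one use */
            ln->state = CSTATE_SHARED_TRANSIENT_SPECULATIVE;
            ln->a_counter = 0;
            return ln->data;
        }
        if (ln->state == CSTATE_SHARED_TRANSIENT_SPECULATIVE) {
            uint64_t speculative = ln->data;  /* used speculatively */
            bool upgrade = (ln->a_counter >= A_COUNTER_LIMIT);
            uint64_t fresh = upgrade ? issue_shared_read(addr)
                                     : issue_shared_transient_read(addr);
            if (!upgrade)
                ln->a_counter++;
            if (fresh == speculative) {
                if (upgrade)
                    ln->state = CSTATE_SHARED;  /* promote to regular shared */
                /* otherwise the state remains shared-transient-speculative */
            } else {
                roll_back_speculation();
                ln->data  = fresh;
                ln->state = upgrade ? CSTATE_SHARED
                                    : CSTATE_SHARED_TRANSIENT_SPECULATIVE;
            }
            return fresh;
        }
        return ln->data;  /* regular shared: an ordinary cache hit */
    }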
It is to be understood that the systems and methods described herein may be implemented in various forms of hardware, software, firmware, special purpose processors, or a combination thereof.
Referring to
The computer platform (701) also includes an operating system and micro instruction code. The various processes and functions described herein may either be part of the micro instruction code or part of the application program (or a combination thereof) which is executed via the operating system. In addition, various other peripheral devices may be connected to the computer platform such as an additional data storage device and a printing device.
It is to be further understood that, because some of the constituent system components and method steps depicted in the accompanying figures may be implemented in software, the actual connections between the system components (or the process steps) may differ depending upon the manner in which the present invention is programmed. Given the teachings of the present invention provided herein, one of ordinary skill in the related art will be able to contemplate these and similar implementations or configurations of the present disclosure.
The particular embodiments disclosed above are illustrative only, as the invention may be modified and practiced in different but equivalent manners apparent to those skilled in the art having the benefit of the teachings herein. Furthermore, no limitations are intended to the details of construction or design herein shown, other than as described in the claims below. It is therefore evident that the particular embodiments disclosed above may be altered or modified and all such variations are considered within the scope and spirit of the invention.