The present invention relates generally to the data processing field, and more particularly, relates to a method and apparatus for eliminating silent store invalidation propagation in shared memory cache coherency protocols, and a design structure on which the subject circuit resides.
Computers have become increasingly faster, and one way to increase the speed of a computer is to minimize storage access time. In order to reduce data access time, a special purpose high-speed memory space of static random access memory (RAM) called a cache is used to temporarily store data that are currently in use. For example, a processor cache typically is positioned near or integral with the processor. Data stored in the cache advantageously may be accessed by the processor, for example, in only one processor cycle, retrieving the data necessary to continue processing rather than having to stall and wait for the retrieval of data from a secondary slower memory or main memory.
Multiprocessing computer systems include multiple processors, each processor employed to perform computing tasks. A particular computing task may be performed upon one processor while other processors perform other unrelated computing tasks. Alternatively, components of a particular computing task are distributed among the multiple processors to decrease the time required to perform the computing task as a whole. One commercially available multiprocessing computer system is a symmetric multiprocessor (SMP) system. An SMP computer system typically includes multiple processors connected through a cache hierarchy to a shared bus. A memory connected to the shared bus is shared among the processors in the system.
In today's microprocessor systems, billions of loads and stores potentially occur every few seconds. Inevitably, some of these stores place the same data value to a memory location that already contains the exact same value. This is referred to as a silent store; that is to say that if location X in memory holds the value Y, and a store operation puts the same value Y to that memory location X, the store is considered silent.
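As an illustration of the definition above, the following minimal Python sketch models memory as a dictionary and classifies a store as silent when the value written equals the value already held. The names `is_silent_store` and `memory` are hypothetical and are used only for illustration; they are not part of the described embodiments.

```python
def is_silent_store(memory, address, value):
    """Return True when storing `value` at `address` would not change memory."""
    # Assumption of this sketch: a location that was never written does not match.
    return address in memory and memory[address] == value


memory = {0x1000: 0}  # location X = 0x1000 holds the value Y = 0

silent = is_silent_store(memory, 0x1000, 0)      # same value Y stored to X
not_silent = is_silent_store(memory, 0x1000, 7)  # a different value stored to X
```

Here `silent` is true because the store leaves location X unchanged, while `not_silent` is false because the store would change the contents.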
These silent stores are inherently inefficient in any computer system, but multiprocessor systems stand to benefit much more from the removal of these stores than single processor systems.
In a multiprocessor system, cache coherency protocols enable keeping the copies of data in more than one cache coherent. The cache coherency protocols ensure that each cache contains the most up to date information. Such cache coherency is easily manageable for loads, but it becomes exceedingly more complex when stores are considered. A single write to a piece of data on one processor must be reflected in the caches of every other processor that holds a copy of that data.
Known solutions to the problem have all focused on trying to identify and eliminate silent stores from an instruction stream. This is extremely difficult to do, and has been shown to require a significant amount of overhead. Another approach is to precede every store operation with a load-and-compare, to see if the value being stored is already in that location. Again, this approach has the obvious drawback of doubling the number of required memory transactions.
A solution is required that does not require significant overhead, yet benefits from the effect of eliminating or “squashing” a silent store. This would eliminate a significant amount of work in a multiprocessor system when one processor goes to store a value to a memory location that is shared amongst a non-trivial number of other processors, but does not actually end up changing the value with its store. It has been shown that this happens quite often, and very frequently in particular for the value zero, where a zero is being written to a location where a zero exists. For example, a page is zeroed out when the page is brought in from main memory.
A need exists for an effective mechanism for eliminating silent store invalidation propagation in shared memory cache coherency protocols.
Principal aspects of the present invention are to provide a method and apparatus for eliminating silent store invalidation propagation in shared memory cache coherency protocols, and a design structure on which the subject circuit resides. Other important aspects of the present invention are to provide such method and apparatus for eliminating silent store invalidation propagation in shared memory cache coherency protocols substantially without negative effect and that overcome many of the disadvantages of prior art arrangements.
In brief, a method and circuit for eliminating silent store invalidation propagation in shared memory cache coherency protocols, and a design structure on which the subject circuit resides are provided. A received data value is compared with a stored cache data value. When the received data value matches the stored cache data value, a first squash signal is generated. A received write address is compared with a reservation address. When the received write address matches the reservation address, a reservation signal is generated and inverted. The first squash signal and the inverted reservation signal are combined to selectively produce a silent store squash signal. The silent store squash signal cancels sending an invalidation signal.
In accordance with features of the invention, the inverted reservation signal overrides the first squash signal to cancel the silent store squash signal when the write address matches the reservation address.
In accordance with features of the invention, the first squash signal is applied to an AND gate. The reservation signal is inverted and applied to the AND gate. The ANDed output provides the silent store squash signal only when the write address does not match the reservation address.
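The gating described above can be sketched in software as a Boolean function. This is only an illustrative model of the AND gate with one inverted input; the function name `silent_store_squash` and its parameter names are hypothetical.

```python
def silent_store_squash(first_squash, reservation):
    """AND the first squash signal with the inverted reservation signal."""
    inverted_reservation = not reservation
    return first_squash and inverted_reservation


# Enumerate all four input combinations of (first squash, reservation).
truth_table = {
    (fs, rs): silent_store_squash(fs, rs)
    for fs in (False, True)
    for rs in (False, True)
}
```

In this model, the squash output is asserted for only one input combination: the data values match and the write address does not match the reservation address.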
The present invention together with the above and other objects and advantages may best be understood from the following detailed description of the preferred embodiments of the invention illustrated in the drawings, wherein:
In accordance with features of the invention, a method and circuit are provided to eliminate unnecessary invalidations caused by stores that do not change the state of the data in the cache. The method prevents one processor from invalidating other processors' shared cache lines if the store taking place is found to be silent. The optimization occurs at the hardware level, completely transparent to any software, which is an attractive feature. Also, the overhead involved is minimal, providing an attractive optimization.
Having reference now to the drawings, in
Multiprocessor system 100 is shown in simplified form sufficient for understanding the invention. It should be understood that the present invention is not limited to use with the illustrated multiprocessor system 100 of
Referring also to
In accordance with features of the invention, the inverted reservation signal overrides the first squash signal to cancel the silent store squash signal. The squash signal is applied to a first input of the AND gate 214. The inverted reservation signal is applied to a second input of the AND gate 214. The ANDed output silent store squash signal indicated by SILENT STORE SQUASH SIGNAL is provided only when the write address does not match the reservation address.
In accordance with features of the invention, the silent store squash signal cancels sending an invalidation signal typically called a dclaim signal. The invalidation or dclaim signal is used for invalidating other copies of the cache line on other processors.
In prior art arrangements, an invalidation or dclaim signal typically is sent out onto the interconnect fabric without reading, writing, or modifying the cache. This invalidation or dclaim signal has the effect of invalidating other copies of the cache line on other processors. For example, consider a prior art implementation with two processors, A and B, each holding the same cache line X in its cache in a Shared state, where processor A wishes to write to line X. In traditional Modified Exclusive Shared Invalid (MESI) and MOESI cache coherency protocols, the required operations are as follows:
1) To write a value to memory, Processor A checks its cache and identifies a hit when the value to be written is loaded into its cache already.
2) Processor A sees that the cache hit is for a cache line in the Shared state, and Processor A must send a signal out on the fabric to see if another processor has the cache line. Processor A then waits for an ACK signal, which signifies that Processor B has invalidated its copy of cache line X and that Processor A may go forward.
3) Processor A then writes to cache line X and changes the state to Modified.
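The three prior art steps above can be sketched as a small Python model. This is a simplified illustration under stated assumptions, not an implementation of any particular protocol: the `Cache` class, the `baseline_store` function, and the log strings are all hypothetical, and the sketch assumes the write always hits in the writer's cache.

```python
class Cache:
    """Hypothetical per-processor cache: address -> [state, value]."""

    def __init__(self, name):
        self.name = name
        self.lines = {}  # states are "M", "E", "S", or "I"


def baseline_store(writer, others, address, value, log):
    """Prior art store: a hit on a Shared line always sends an invalidation."""
    state, _ = writer.lines[address]           # step 1: cache hit (assumed)
    if state == "S":                           # step 2: line is Shared
        log.append("dclaim")                   # invalidation goes out on the fabric
        for other in others:
            if address in other.lines and other.lines[address][0] != "I":
                other.lines[address][0] = "I"  # other copy is invalidated
                log.append("ack:" + other.name)
    writer.lines[address] = ["M", value]       # step 3: write, state becomes Modified


cache_a, cache_b = Cache("A"), Cache("B")
cache_a.lines["X"] = ["S", 5]
cache_b.lines["X"] = ["S", 5]
events = []
baseline_store(cache_a, [cache_b], "X", 5, events)  # store of the value already held
```

In this model the store writes the value 5 over an existing 5, yet Processor B's copy of line X is still invalidated, which is the unnecessary work the invention targets.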
In accordance with features of the invention, with logic circuit 200 provided with the multiprocessor system 100, when a processor 101 is writing to local L2 cache 104, the compare 208 compares the value being written with the data 206 in the local L2 cache 104 in order to eliminate unnecessary invalidations caused by stores that do not change the state of the data in the cache. The silent store squash signal is generated to prevent a processor 101, such as processor 101, #1, from invalidating shared cache lines of other processors 102, #2, #3, #4 when the store taking place is found to be silent. When the cache line is in a Modified or Exclusive state, the compare 208 is ignored. When the cache line is in the Shared state, however, the match identified by compare 208 prevents the processor 101, #1 from sending out the dclaim signal, and operations continue as if the store had gone through. The cache line is then maintained in the Shared state until it is actually changed or invalidated by another processor, and program behavior is conserved.
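The state-dependent behavior described above can be sketched as follows. This is a minimal software model under stated assumptions: the function `store_with_squash` and the dictionary layout are hypothetical, the store is assumed to hit a line in the Modified, Exclusive, or Shared state, and the reservation address check is left out of this sketch.

```python
def store_with_squash(local, remotes, address, value):
    """Store to the local cache model; return True if a dclaim was sent."""
    line = local[address]  # dict with "state" and "data"; a hit is assumed
    if line["state"] in ("M", "E"):
        # No other valid copies exist, so no dclaim is needed
        # and the data compare result is ignored.
        line["state"], line["data"] = "M", value
        return False
    if line["state"] == "S" and line["data"] == value:
        # Silent store: squash the dclaim and keep the line Shared.
        return False
    # Non-silent store to a Shared line: invalidate other copies, then write.
    for remote in remotes:
        if address in remote:
            remote[address]["state"] = "I"
    line["state"], line["data"] = "M", value
    return True


proc1 = {"X": {"state": "S", "data": 0}}
proc2 = {"X": {"state": "S", "data": 0}}

sent_for_silent = store_with_squash(proc1, [proc2], "X", 0)  # zero written over zero
states_after_silent = (proc1["X"]["state"], proc2["X"]["state"])

sent_for_change = store_with_squash(proc1, [proc2], "X", 9)  # value actually changes
states_after_change = (proc1["X"]["state"], proc2["X"]["state"])
```

In this model the silent store leaves both copies in the Shared state and sends no invalidation, while the later store of a different value invalidates the remote copy and moves the local line to Modified.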
In accordance with features of the invention, with logic circuit 200 provided with the multiprocessor system 100, the functionality of load reserve and store conditional (larx/stcx) instructions is not adversely affected. When squashing a silent store, the larx/stcx atomicity is preserved by simultaneously comparing the reservation or larx/stcx address 210 with the write address 202. When the reservation or larx/stcx address 210 and the write address 202 match, this overrides the silent store squashing compare 208, so that the store goes on as normal. This ensures that larx/stcx program functionality is not violated. The output of the reservation larx/stcx compare is inverted, so that the silent store squash signal is generated only if the reservation address 210 does not match the current address 202. Thus, a reservation or larx/stcx address match cancels the silent store squash signal.
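The two compares and their combination can be modeled together in a few lines of Python. This is an illustrative sketch only: the function `squash_circuit` and its parameter names are hypothetical, and representing "no reservation held" as `None` is an assumption of the sketch rather than something stated in the description.

```python
def squash_circuit(write_address, write_data, cache_data, reservation_address):
    """Model the data compare, the reservation address compare, and the AND."""
    # Data compare: does the value being written match the cached value?
    first_squash = write_data == cache_data
    # Reservation compare: does the write address match the larx/stcx address?
    # Assumption: None means that no reservation is currently held.
    reservation = (
        reservation_address is not None and write_address == reservation_address
    )
    # The inverted reservation signal gates the first squash signal.
    return first_squash and not reservation


squash_plain = squash_circuit(0x40, 0, 0, 0x80)     # silent, reservation elsewhere
squash_reserved = squash_circuit(0x40, 0, 0, 0x40)  # silent, but address is reserved
squash_changed = squash_circuit(0x40, 1, 0, None)   # value changes, no reservation
```

In this model a silent store to the reserved address is not squashed, so the store proceeds as normal, which matches the override behavior described above.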
Design process 304 may include using a variety of inputs; for example, inputs from library elements 308 which may house a set of commonly used elements, circuits, and devices, including models, layouts, and symbolic representations, for a given manufacturing technology, such as different technology nodes, 32 nm, 45 nm, 90 nm, and the like, design specifications 310, characterization data 312, verification data 314, design rules 316, and test data files 318, which may include test patterns and other testing information. Design process 304 may further include, for example, standard circuit design processes such as timing analysis, verification, design rule checking, place and route operations, and the like. One of ordinary skill in the art of integrated circuit design can appreciate the extent of possible electronic design automation tools and applications used in design process 304 without deviating from the scope and spirit of the invention. The design structure of the invention is not limited to any specific design flow.
Design process 304 preferably translates an embodiment of the invention as shown in
While the present invention has been described with reference to the details of the embodiments of the invention shown in the drawing, these details are not intended to limit the scope of the invention as claimed in the appended claims.
Number | Date | Country | |
---|---|---|---|
20090210633 A1 | Aug 2009 | US |