Computer systems typically utilize a cache memory system to improve the performance and throughput of the computer system by reducing the apparent time delay or latency normally associated with a processor accessing data in a main memory. A cache memory system employs one or more caches, each including a cache memory in conjunction with control logic. Generally, each of the cache memories is smaller and faster than the main memory, so that a processor may access a copy of data from a cache more quickly and readily than from the main memory. Moreover, many cache memory systems use more than one level of cache between a processor and the main memory to further enhance computer system operation.
One important function of the cache memory system is to provide “cache coherency.” In other words, each copy of the same memory address of the main memory should hold the same value so that the entire address space remains consistent throughout the cache memory system. To maintain cache coherency, the cache memory system utilizes a cache coherency protocol involving the transfer of messages or some other form of communication between the various caches. This communication may occur among caches of the same level, as well as between caches residing at different levels of the cache memory system. Unfortunately, use of the protocol normally results in a significant amount of communication overhead between caches.
In some systems, the amount of communication overhead is reduced by enforcing “cache-inclusiveness” between cache levels, meaning the entire contents of a higher-level cache are replicated in the next lower-level cache. As a result, the higher-level cache propagates any changes therein to the next-lowest cache level, thus reducing the amount of negotiation, and hence communication, between the levels. Unfortunately, cache-inclusiveness also requires significant amounts of redundant storage in lower-level caches to duplicate the data contents of the caches at the next higher level. As a consequence, the lowest cache level must hold the data contents residing in all other (i.e., higher) cache levels. Also, the more levels of cache that are implemented in a system, the greater the amount of content replicated. Thus, since cache memories tend to be relatively expensive on a per-byte basis, cache-inclusiveness is an expensive method for reducing cache coherency protocol communication overhead.
Similarly,
Another embodiment of the invention—a cache memory system 300—is shown within a computer system 301. The computer system 301 includes four central processing units (CPUs) 302, 304, 306, 308 and a main memory 326, with the cache memory system 300 positioned therebetween. Other components, such as I/O devices, device interfaces, user interfaces, cache controllers, and the like, are not shown to simplify and facilitate the discussion of the cache memory system 300 presented below.
The cache memory system 300 includes a set of higher-level caches 310, 312, 314, 316, each of which is accessible to one of the CPUs 302, 304, 306, 308, respectively. In one implementation, each of the higher-level caches 310-316 is a level-three (L3) cache included within its respective CPU 302-308. Higher-level caches L1 and L2 (not shown) are also included within each of the CPUs 302-308, but are not discussed below.
Also included in the cache memory system 300 are two lower-level caches 318, 320. In the embodiment of
In
Also in
The specific computer system 301 of
Each of the lower-level caches 318, 320 includes, and is coupled with, a directory array 330, 332, respectively. In one embodiment, each directory array 330, 332 is stored within the cache memory (not shown explicitly in
The first directory array 330 includes a number of directory entries 334, while the second directory array 332 holds a number of directory entries 336. Each entry 334 of the first directory array 330 includes the system memory space address of a cache line stored within one or both of the first two higher-level caches 310, 312 coupled via the first bus 322 with the first lower-level cache 318. Also, the capacity of the first directory array 330 should be large enough to hold a number of directory entries 334 equal to the number of cache lines storable in the first two higher-level caches 310, 312 combined. In an analogous manner, each entry 336 within the second directory array 332 includes the memory space address of a cache line located within one or both of the second two higher-level caches 314, 316 coupled via the third bus 324 with the second lower-level cache 320. Similar to the first directory array 330, the capacity of the second directory array 332 should be large enough to hold a number of directory entries 336 equal to the number of cache lines storable in the second two higher-level caches 314, 316 combined.
Typically, the number of bits required to represent a memory space address for a cache line is less than the number of bits for the complete address of a particular location in the main memory 326, since each cache line normally includes a number of contiguous memory locations. For example, if each cache line comprises 128 bytes, and each byte is individually addressable in the main memory 326, then each cache line address is seven bits less in width than a full memory address, since 2⁷ equals 128. Typically, the cache line address represents the most significant bits of the memory address, so that the bottom seven bits of the memory address are not represented in the cache line address in this example.
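The address arithmetic described above can be sketched as follows. This is a minimal illustration assuming the example's 128-byte cache lines and byte-addressable main memory; the function name is illustrative only:

```python
LINE_SIZE = 128                            # bytes per cache line, per the example
OFFSET_BITS = LINE_SIZE.bit_length() - 1   # log2(128) = 7 low-order offset bits

def cache_line_address(memory_address: int) -> int:
    """Drop the seven low-order offset bits; the remaining most
    significant bits identify the cache line in a directory entry."""
    return memory_address >> OFFSET_BITS

# Any two addresses within the same 128-byte line share one line address,
# while addresses in adjacent lines do not.
same_line = cache_line_address(0x1000) == cache_line_address(0x107F)
next_line = cache_line_address(0x1080)
```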
Each entry 334, 336 of a directory array 330, 332 may also include other information, such as one or more status bits describing the associated directory entry 334, 336. For example, in the case of the first lower-level cache 318, the status bits of a particular entry 334 may indicate which of the higher-level caches 310, 312 coupled with the first lower-level cache 318 over the first bus 322 includes a copy of the cache line indicated in the entry 334. In other embodiments, other status bits may be associated with each entry 334, 336 as well.
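One way to model a directory entry 334, 336 with per-cache presence status bits is sketched below. The class and method names are hypothetical; the two-bit presence field matches the two higher-level caches 310, 312 of the example:

```python
from dataclasses import dataclass

@dataclass
class DirectoryEntry:
    line_address: int    # most significant bits of the memory address
    # One presence bit per higher-level cache on the bus: bit 0 for the
    # first higher-level cache, bit 1 for the second, in this sketch.
    presence_bits: int = 0

    def mark_present(self, cache_index: int) -> None:
        self.presence_bits |= 1 << cache_index

    def mark_absent(self, cache_index: int) -> None:
        self.presence_bits &= ~(1 << cache_index)

    def held_only_by(self, cache_index: int) -> bool:
        """True if exactly this one higher-level cache holds the line."""
        return self.presence_bits == 1 << cache_index
```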
Upon initialization of the computer system 301, each of the caches 310-320 presumably is empty. Thus, the directory array 330 of the first lower-level cache 318 effectively contains no entries 334, as neither of the first two higher-level caches 310, 312 contains any valid cache lines. Presuming then that the first lower-level cache 318 receives a read request from the first CPU 302 through its higher-level cache 310 over the first bus 322 (operation 402), the empty lower-level cache 318 forwards the read request to the main memory 326 over the second bus 328 (operation 404). Once the first lower-level cache 318 receives a cache line including the requested data from the main memory 326 over the second bus 328 in response to the read request (operation 406), the first lower-level cache 318 forwards the cache line to the first higher-level cache 310 over the first bus 322 for the first CPU 302 to access (operation 408), and also creates a new directory entry 334 in its directory array 330 for the new cache line (operation 410). If status bits for the new entry 334 are also available, the entry 334 may also indicate which of the higher-level caches 310, 312 contains the cache line (operation 412).
If, at some later time, the first lower-level cache 318 receives a read request for data in the same cache line from the second CPU 304 through its higher-level cache 312 (operation 414) over the first bus 322, the first lower-level cache 318 forwards the cache line to the second higher-level cache 312 over the bus 322 (operation 416). In addition, the first lower-level cache 318 updates the directory entry 334 of the cache line to indicate that the line is now present in both the first higher-level cache 310 and the second higher-level cache 312, if status bits are available within the entry 334 (operation 418).
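The read-request handling of operations 402-418 may be sketched as follows. This is a simplified model under stated assumptions: the class and parameter names are illustrative, the directory is a mapping from line addresses to the set of higher-level caches holding the line, and bus transfers are abstracted into a supplied fetch callable:

```python
class LowerLevelCache:
    """Sketch of a lower-level cache serving reads while tracking, in a
    directory, which higher-level caches hold each cache line."""

    def __init__(self, fetch_from_memory):
        self.fetch_from_memory = fetch_from_memory
        self.lines = {}       # line_address -> cached line data
        self.directory = {}   # line_address -> set of higher-cache indices

    def handle_read(self, line_address, requester_index):
        if line_address not in self.directory:
            # Miss: forward the request to main memory and create a new
            # directory entry for the returned line (operations 404-410).
            self.lines[line_address] = self.fetch_from_memory(line_address)
            self.directory[line_address] = set()
        # Record the requesting higher-level cache in the entry's status
        # bits, whether this is the first or a later holder (412, 418).
        self.directory[line_address].add(requester_index)
        return self.lines[line_address]
```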
At times, cache lines are purged or invalidated from either or both of the higher-level cache memories 310, 312, thus causing the lower-level cache 318 to update the appropriate entries 334 of its directory array 330. For example, if the first lower-level cache 318 receives a “dirty,” or modified, cache line from the first higher-level cache 310 over the first bus 322 to ultimately be written back to the main memory 326 (operation 420), the lower-level cache 318 deletes the directory entry 334 for that cache line from its directory array 330 (operation 422). Typically, the lower-level cache 318 may delete the entry 334 unconditionally, as cache coherency protocols normally require that only one cache at any level possess a dirty cache line.
In another example, the first higher-level cache 310 may instead invalidate a “clean,” or unmodified, cache line which may also be held within the second higher-level cache 312. Such an event may occur, for example, in response to a capacity fault in the first higher-level cache 310. As part of this operation, the first lower-level cache 318 receives an invalidate message for the cache line in the first higher-level cache 310 as part of the cache coherency protocol (operation 424). The response of the first lower-level cache 318 to the invalidate message may then depend on the type of information maintained in the status bits of the directory entry 334 associated with that cache line. Presuming these status bits possess the capacity to identify which of the first two higher level caches 310, 312 hold the cache line, the response may depend on whether the second higher-level cache 312 also holds a copy of the cache line. For example, if the status bits of the associated directory entry 334 indicate that only the first higher-level cache 310 held the cache line (operation 426), the first lower-level cache 318 deletes the entry 334 from the directory array 330 (operation 428). Otherwise, if both of the first two higher-level caches 310, 312 hold the cache line prior to the invalidate message, the second higher-level cache 312 may be able to ignore the message. Under this scenario, the first lower-level cache 318 may employ the status bits of the entry 334 for the cache line to indicate that the cache line is no longer held in the first higher-level cache 310, but is still present in the second higher-level cache 312 (operation 430). However, if the second higher-level cache 312 is not configured to ignore such an invalidate message, the first lower-level cache 318 is free to delete the entry 334 for the cache line after the associated cache line in the second higher-level cache 312 is invalidated. 
Alternatively, if the status bits of the associated directory entry 334 only indicate whether the cache line is stored in a higher-level cache 310, 312, but do not indicate which higher-level cache 310, 312 holds the line, the entry 334 is deleted and the second higher-level cache 312 invalidates its corresponding cache line, if valid.
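The invalidation handling of operations 424-430 can be sketched as below, under the assumption that the status bits identify which higher-level caches hold the line (the function name and set-based directory representation are illustrative):

```python
def handle_invalidate_from_higher(directory, line_address, cache_index):
    """On an invalidate from one higher-level cache, delete the
    directory entry only if no other higher-level cache still holds
    the line; otherwise just clear that cache's presence bit."""
    holders = directory.get(line_address)
    if holders is None:
        return
    holders.discard(cache_index)
    if not holders:
        # Only the invalidating cache held the line (operations 426-428).
        del directory[line_address]
    # Otherwise the entry remains, with its status now indicating the
    # line is still present in the other higher-level cache (430).
```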
Presuming the first lower-level cache 318 has maintained its directory array 330 in such a manner, the lower-level cache 318 may use the information therein to efficiently process protocol messages from the second lower-level cache 320 received over the second bus 328. For example, the first lower-level cache 318 may receive an invalidate request over the second bus 328 (operation 432), indicating that a particular cache line should be invalidated in the first lower-level cache 318 and its corresponding higher-level caches 310, 312. In response, the first lower-level cache 318 checks for an entry 334 for the cache line in its directory array 330 (operation 434). If an entry 334 is present, the first lower-level cache 318 transmits an invalidate request referencing the cache line over the first bus 322 to the first and second higher-level caches 310, 312 (operation 436); otherwise, the transmission of an invalidate request over the first bus 322 is unnecessary, thus reducing the amount of protocol communication traffic over the first bus 322. If the particular cache line is dirty, the higher-level cache 310, 312 storing the cache line will issue an implicit writeback command back over the first bus 322 to be received by the first lower-level cache 318 (operation 438), which the first lower-level cache 318 forwards to the main memory 326 (operation 440). In any event, the first lower-level cache 318 then deletes the entry 334 from its directory array 330 (operation 442).
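The filtering behavior of operations 432-442 can be sketched as follows. This is a minimal model under stated assumptions: the forwarding of the request over the first bus is abstracted into a callable, and the dirty-line writeback path is omitted for brevity:

```python
def handle_peer_invalidate(directory, line_address, send_to_higher_caches):
    """Forward a peer lower-level cache's invalidate request to the
    higher-level caches only when the directory shows the line is
    actually held there, then delete the entry."""
    if line_address in directory:
        send_to_higher_caches(line_address)   # operation 436
        del directory[line_address]           # operation 442
    # No entry: no invalidate traffic on the first bus is needed,
    # which is the communication-overhead saving described above.
```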
In another situation, the first lower-level cache 318 may receive a read request for a particular cache line over the second bus 328 from the second lower-level cache 320 to access the most recent copy of the data (operation 444). In response, the first lower-level cache 318 checks its directory array 330 for an entry 334 for the cache line (operation 446). If such an entry exists, a copy of the data resides in one or both of the first and second higher-level caches 310, 312, and thus the first lower-level cache 318 forwards the read request over the first bus 322 (operation 448). Thereafter, the first lower-level cache 318 receives the cache line returned in response to the read request by either the first or second higher-level cache 310, 312 (operation 450), and forwards the data to the second lower-level cache 320 over the second bus 328 (operation 452). However, if no entry 334 exists in the directory array 330 for the requested cache line, the first lower-level cache 318 need not request the data from the first two higher-level caches 310, 312, again reducing communication overhead over the first bus 322. Instead, the first lower-level cache 318 may determine, such as by way of its own cache tags, that the cache line is present therein (operation 454). If so, the first lower-level cache 318 accesses the data from its own cache memory (operation 456), and transfers the data to the second lower-level cache 320 via the second bus 328 (operation 458).
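The peer read handling of operations 444-458 may be sketched as below. As before, this is an illustration under stated assumptions: the round trip to the higher-level caches is abstracted into a callable, and the directory and the cache's own contents are plain mappings:

```python
def handle_peer_read(directory, own_lines, line_address, read_from_higher):
    """Satisfy a peer lower-level cache's read from a higher-level
    cache when the directory shows one holds the line, else from this
    cache's own memory if its tags show the line is present."""
    if line_address in directory:
        # Operations 448-452: forward the request over the first bus
        # and relay the returned line to the peer.
        return read_from_higher(line_address)
    if line_address in own_lines:
        # Operations 454-458: serve the line from this cache's own
        # memory, with no traffic on the first bus.
        return own_lines[line_address]
    return None   # line not present at this node
```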
As described above, various embodiments of the present invention, by way of the interaction between a lower-level cache and its associated directory array, reduce the amount of protocol communication within a cache memory system, especially between the lower-level and next-higher-level caches, without incurring the full extent of the cache memory consumption penalty normally associated with a cache-inclusiveness solution. Using the computer system 301 of
While several embodiments of the invention have been discussed herein, other embodiments encompassed by the scope of the invention are possible. For example, while some embodiments of the invention are described above in reference to a specific computer system architecture, many other computer architectures, including multiprocessor schemes, such as symmetric multiprocessor (SMP) systems, may employ various aspects of the invention. Also, while specific numbers, levels, and sizes of caches are presumed above for illustrative purposes, each of these characteristics may be varied greatly in other embodiments. Also, aspects of one embodiment may be combined with those of alternative embodiments to create further implementations of the present invention. Thus, while the present invention has been described in the context of specific embodiments, such descriptions are provided for illustration and not limitation. Accordingly, the proper scope of the present invention is delimited only by the following claims.