The invention relates to computers and data processing systems, and in particular to eviction algorithms for caches utilized in such computers and data processing systems.
Computer technology continues to advance at a remarkable pace, with numerous improvements being made to the performance of both processors—the “brains” of a computer—and the memory that stores the information processed by a computer.
In general, a processor operates by executing a sequence of instructions that form a computer program. The instructions are typically stored in a memory system having a plurality of storage locations identified by unique memory addresses. The memory addresses collectively define a “memory address space,” representing the addressable range of memory addresses that can be accessed by a processor.
Both the instructions forming a computer program and the data operated upon by those instructions are often stored in a memory system and retrieved as necessary by the processor when executing the computer program. The speed of processors, however, has increased relative to that of memory devices to the extent that retrieving instructions and data from a memory can often become a significant bottleneck on performance. To decrease this bottleneck, it is desirable to use the fastest available memory devices possible, e.g., static random access memory (SRAM) devices or the like. However, both memory speed and memory capacity are typically directly related to cost, and as a result, many computer designs must balance memory speed and capacity with cost.
A predominant manner of obtaining such a balance is to use multiple “levels” of memories in a memory system to attempt to decrease costs with minimal impact on system performance. Often, a computer relies on a relatively large, slow and inexpensive mass storage system such as a hard disk drive or other external storage device, an intermediate main memory that uses dynamic random access memory devices (DRAM's) or other volatile memory storage devices, and one or more high speed, limited capacity cache memories, or caches, implemented with SRAM's or the like. One or more memory controllers are then used to swap the information from segments of memory addresses, often known as “cache lines”, between the various memory levels to attempt to maximize the frequency that requested memory addresses are stored in the fastest cache memory accessible by the processor. Whenever a memory access request attempts to access a memory address that is not cached in a cache memory, a “cache miss” occurs. As a result of a cache miss, the cache line for a memory address typically must be retrieved from a relatively slow, lower level of memory, often with a significant performance hit.
One type of multi-level memory architecture that has been developed is referred to as a Non-Uniform Memory Access (NUMA) architecture, whereby multiple main memories are essentially distributed across a computer and physically grouped with sets of processors and caches into physical subsystems or modules, also referred to herein as “nodes”. The processors, caches and memory in each node of a NUMA computer are typically mounted to the same circuit board or card to provide relatively high speed interaction between all of the components that are “local” to a node. Often, a “chipset”, including one or more integrated circuit chips, is used to manage data communications between the processors and the various components in the memory architecture. The nodes are also coupled to one another over a network such as a system bus or a collection of point-to-point interconnects, thereby permitting processors in one node to access data stored in another node, thus effectively extending the overall capacity of the computer. In addition, one or more levels of caches are utilized in the processors as well as in each chipset. Memory access is referred to as “non-uniform” as the access time for data stored in a local memory (i.e., a memory resident in the same node as a processor) is often significantly shorter than for data stored in a remote memory (i.e., a memory resident in another node).
A typical cache utilizes a cache directory that maps cache lines into one of a plurality of sets, where each set includes a cache directory entry and the cache line referred to thereby. In addition, a tag stored in the cache directory entry for a set is used to determine whether there is a cache hit or miss for that set—that is, to verify whether the cache line in the set to which a particular memory address is mapped contains the information corresponding to that memory address.
Often each directory entry in a cache also includes state information that indicates the state of the cache line referred to by the entry, and that is used in connection with maintaining coherency among different memories in the memory architecture. One common coherence protocol is referred to as the MESI coherence protocol, which tags each entry in a cache as being in one of four states: Modified, Exclusive, Shared, or Invalid. The Modified state indicates that the entry contains a valid cache line, and that the entry has the most recent copy thereof—i.e., all other copies, if any, are no longer valid. The Exclusive state is similar to the Modified state, but indicates that the cache line in the entry has not yet been modified. The Shared state indicates that a valid copy of a cache line is stored in the entry, but that other valid copies of the cache line may also exist in other devices. The Invalid state indicates that no valid cache line is stored in the entry.
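By way of a non-limiting illustration, the directory entry and MESI state information described above might be modeled as follows; the C representation, field names and widths are illustrative assumptions rather than a definition of any particular embodiment.

```c
#include <stdint.h>

/* The four MESI coherence states described above. */
typedef enum {
    MESI_INVALID,   /* entry holds no valid cache line */
    MESI_SHARED,    /* valid copy; other devices may hold valid copies too */
    MESI_EXCLUSIVE, /* only valid copy, not yet modified */
    MESI_MODIFIED   /* most recent copy; all other copies are invalid */
} mesi_state_t;

/* One cache directory entry: a tag plus coherence state. */
typedef struct {
    uint64_t     tag;   /* high-order address bits identifying the cached line */
    mesi_state_t state; /* MESI state of the cached line */
} dir_entry_t;

/* Hit check: the entry holds the line for a given address only if the
 * stored tag matches and the entry is in a valid (non-Invalid) state. */
static inline int is_hit(const dir_entry_t *e, uint64_t addr_tag)
{
    return e->state != MESI_INVALID && e->tag == addr_tag;
}
```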
Caches may also have different degrees of associativity, and are often referred to as being N-way set associative. Each “way” or class represents a separate directory entry and cache line for a given set in the cache directory. Therefore, in a one-way set associative cache, each memory address is mapped to one directory entry and one cache line in the cache. Multi-way set associative caches, e.g., four-way set associative caches, provide multiple directory entries and cache lines to which a particular memory address may be mapped, thereby decreasing the potential for performance-limiting hot spots that are more commonly encountered with one-way set associative caches.
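As a sketch of the mapping just described (assuming, purely for illustration, 128-byte cache lines and 4096 associativity sets), a memory address may be decomposed into a set index that selects an associativity set and a tag that is compared against each of the N ways of that set:

```c
#include <stdint.h>

/* Illustrative geometry only: 128-byte lines, 4096 associativity sets. */
#define LINE_BYTES 128u
#define NUM_SETS   4096u

/* Set index: selects which associativity set the address maps to. */
static inline uint32_t set_index(uint64_t addr)
{
    return (uint32_t)((addr / LINE_BYTES) % NUM_SETS);
}

/* Tag: the remaining high-order bits, stored in the directory entry
 * and compared on lookup to detect a hit or miss. */
static inline uint64_t tag_of(uint64_t addr)
{
    return addr / ((uint64_t)LINE_BYTES * NUM_SETS);
}
```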
In addition, some caches may be “inclusive” in nature, as these caches maintain redundant copies of any cache lines that are cached by any higher level caches to which such caches are coupled. While an inclusive cache has a lower effective capacity than an “exclusive” cache due to the storage of redundant copies of cache lines that are cached in higher level caches, an inclusive cache provides a performance benefit in that the status of a cache line that is cached by any higher level cache coupled to an inclusive cache can be determined simply through a check of the status of the cache line in the inclusive cache.
One potential operation of a cache that may have an impact on system performance is that of cache line eviction. Any cache, being of limited size, is frequently required to cast out, or evict, a cache line from the cache whenever space for a new cache line is needed. In the case of a one-way set associative cache, the eviction of a cache line is straightforward, as each cache line is mapped to a single entry in the cache, so an incoming cache line will necessarily replace any existing cache line that is stored in the single entry to which the incoming cache line is mapped.
On the other hand, in a multi-way set associative cache, an incoming cache line may potentially be stored in one of multiple entries mapped to the same set. It has been found that the selection of which entry to store the incoming cache line in, which often necessitates the eviction of a cache line previously stored in the selected entry, can have a significant impact on system performance. As a result, various selection algorithms, often referred to as eviction algorithms, have been developed to attempt to minimize the impact of cache line evictions on system performance.
Many conventional eviction algorithms select an empty entry in a set (e.g., an entry with an Invalid MESI state) if possible. However, where no empty entry exists, various algorithms may be used, including selecting the Least Recently Used (LRU) entry, selecting the Most Recently Used (MRU) entry, selecting randomly, selecting in a round robin fashion and variations thereof. Often, different algorithms work better in different environments.
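For example, a conventional selection of the type described above, which prefers an empty entry and otherwise falls back to LRU, might be sketched as follows, building on the dir_entry_t sketch above; the per-entry lru_age counter is an assumed simplification of LRU bookkeeping:

```c
#define NUM_WAYS 4 /* four-way set associative, for illustration */

/* One way of an associativity set, pairing a directory entry with
 * assumed LRU bookkeeping (larger lru_age = less recently used). */
typedef struct {
    dir_entry_t entry;
    unsigned    lru_age;
} way_t;

/* Conventional victim selection: an Invalid (empty) entry wins
 * outright; otherwise the least recently used way is evicted. */
static int pick_victim_conventional(const way_t set[NUM_WAYS])
{
    int victim = 0;
    for (int w = 0; w < NUM_WAYS; w++)
        if (set[w].entry.state == MESI_INVALID)
            return w;
    for (int w = 1; w < NUM_WAYS; w++)
        if (set[w].lru_age > set[victim].lru_age)
            victim = w;
    return victim;
}
```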
One drawback associated with some conventional eviction algorithms such as LRU and MRU-based algorithms is that such algorithms are required to track the accesses to various entries in a set to determine which entry has been most recently or least recently used. In some caches, however, it may not be possible to determine a cache line's actual reference pattern. In particular, inclusive caches typically are not provided with the reference patterns for cache lines that are also cached in higher level caches.
As an example, in one implementation of the aforementioned NUMA memory architecture, each node in the architecture may include multiple processors coupled to a node controller chipset by one or more processor buses, with each processor having one or more dedicated cache memories that are accessible only by that processor, e.g., level one (L1) data and/or instruction caches, a level two (L2) cache and a level three (L3) cache. An additional level four (L4) cache may be implemented in the node controller itself and shared by all of the processors.
Where the L4 cache is implemented as an inclusive cache, the L4 cache typically does not have full visibility into a given cache line's actual reference pattern. In particular, an external L4 cache, being coupled to each processor over a processor bus, typically can only determine that a cache line has been accessed when the L4 cache detects the access on the processor bus. However, a cache line that is frequently used by the same processor may never result in any operations being performed on the processor bus after the cache line is initially loaded into a dedicated cache for that processor. As a result, any cache eviction algorithm in the L4 cache that relies on tracked accesses to cache lines may make incorrect assumptions about the reference patterns for such cache lines, and thus select the wrong cache line to evict.
Therefore, a significant need exists in the art for an improved eviction algorithm for use with inclusive caches.
The invention addresses these and other problems associated with the prior art by utilizing a state-based cache eviction algorithm for an inclusive cache that determines which among a plurality of cache lines may be evicted from the inclusive cache based at least in part upon the state of the cache lines in a higher level cache. In particular, a cache eviction algorithm consistent with the invention determines, from an inclusive cache directory for a lower level cache, whether a cache line is cached in the lower level cache but not cached in any of a plurality of higher level caches for which cache directory information is additionally stored in the cache directory, and evicts the cache line from the lower level cache based upon determining that the cache line is cached in the lower level cache but not cached in any of the plurality of higher level caches.
These and other advantages and features, which characterize the invention, are set forth in the claims annexed hereto and forming a further part hereof. However, for a better understanding of the invention, and of the advantages and objectives attained through its use, reference should be made to the Drawings, and to the accompanying descriptive matter, in which there are described exemplary embodiments of the invention.
The embodiments discussed and illustrated hereinafter implement a state-based cache eviction algorithm for an inclusive cache that is based at least in part upon the state of cache lines in a higher level cache. Specifically, cache eviction algorithms consistent with the invention attempt to identify cache lines that are cached in an inclusive cache, but not cached in any higher level caches coupled thereto. As a result, cache lines that are no longer present in higher level caches, and presumably not being used by any of the processors served by such caches, will be selected for eviction over cache lines that are still cached in a higher level cache, and thus presumably still being used by a processor. By doing so, the likelihood of a processor needing to access the evicted cache line in the immediate future is reduced, thus minimizing the likelihood of a cache miss and the consequent impact on performance.
In addition, in many implementations additional performance gains are realized due to minimizing the overhead associated with notifying higher level caches to invalidate their copies of evicted cache lines, since evicted cache lines that are not cached in any higher level cache do not require that any higher level cache be notified of the eviction of such cache lines. Particularly in environments where an inclusive cache is coupled to higher level caches via a bandwidth limited interface such as a processor bus, the elimination of such back-invalidate traffic reduces processor bus utilization and frees bandwidth for use in other operations. In addition, in pipelined processor architectures, eliminating the back-invalidate traffic can also minimize internal processor pipeline disruptions resulting from such traffic.
A cache eviction algorithm consistent with the invention typically determines, from an inclusive cache directory for a lower level cache, whether a cache line is cached in the lower level cache but not cached in any of a plurality of higher level caches for which cache directory information is additionally stored in the cache directory. As will be discussed in greater detail below, the determination may be based upon state information maintained in the lower level cache directory that indicates whether a cache line is cached in a higher level cache. Such state information may be combined with state information for the cache line in the lower level cache, or may be separately maintained. Moreover, the state information may indicate which higher level cache has a valid copy of a cache line, or the state information may simply indicate that some higher level cache that is coupled to the lower level cache has a valid copy of the cache line without identifying which higher level cache has the valid copy. The state information for multiple higher level caches may be grouped together, e.g., by processor or by processor bus, or the state information for each cache may be separately maintained. The state information may also identify the actual state of a cache line in a higher level cache, or alternatively may only indicate that a higher level cache has a copy of the cache line in a non-invalid state. As an example, a cache directory for a lower level cache may require only a single bit that indicates whether or not a valid copy of an associated cache line is cached in a higher level cache. It will be appreciated, however, that in other embodiments additional state information may be stored in a lower level cache directory.
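As a minimal sketch of the single-bit variant just mentioned, a lower level directory entry might carry one additional bit (the name cached_above is a hypothetical label) indicating whether any higher level cache holds a valid copy:

```c
/* Inclusive lower level directory entry, extending the dir_entry_t
 * sketch above with the single bit discussed in the text. The field
 * name is an illustrative assumption. */
typedef struct {
    uint64_t     tag;
    mesi_state_t state;        /* state of the line in the lower level cache */
    uint8_t      cached_above; /* 1 = some higher level cache has a valid copy */
} incl_dir_entry_t;
```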
As will also become more apparent below, the eviction of a cache line based upon the state of the cache line in a higher level directory may be incorporated into various known eviction algorithms consistent with the invention. For example, as described in more detail below, it may be desirable in a multi-way set associative inclusive cache to implement an eviction algorithm that first selects an empty entry in an associativity set, then selects an entry for a cache line that is cached in the inclusive cache but not cached in any higher level cache if no empty entry exists, and finally selects an entry via an MRU, LRU, random, round robin, or other conventional algorithm if no cache line is found that is cached in the inclusive cache but not in any higher level cache. In addition, it may be desirable in some embodiments to apply MRU, LRU, random, round robin, or other techniques in connection with a determination that multiple entries in an associativity set have cache lines that are not cached in a higher level cache.
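The staged priority just outlined might be sketched as follows, reusing the incl_dir_entry_t and lru_age conventions from the sketches above; this is an illustration of one possible ordering, not a definitive implementation:

```c
typedef struct {
    incl_dir_entry_t entry;
    unsigned         lru_age; /* assumed LRU bookkeeping, as before */
} incl_way_t;

/* Stage 1: any empty (Invalid) entry.
 * Stage 2: any entry whose line is held here but in no higher level
 *          cache, breaking ties among several such entries by LRU.
 * Stage 3: otherwise fall back to a conventional policy (LRU here). */
static int pick_victim_inclusive(const incl_way_t set[NUM_WAYS])
{
    int victim = -1;
    for (int w = 0; w < NUM_WAYS; w++)
        if (set[w].entry.state == MESI_INVALID)
            return w;
    for (int w = 0; w < NUM_WAYS; w++)
        if (!set[w].entry.cached_above &&
            (victim < 0 || set[w].lru_age > set[victim].lru_age))
            victim = w;
    if (victim >= 0)
        return victim;
    victim = 0;
    for (int w = 1; w < NUM_WAYS; w++)
        if (set[w].lru_age > set[victim].lru_age)
            victim = w;
    return victim;
}
```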
It will be appreciated that a cache is a “higher level cache” relative to an inclusive lower level cache whenever the lower level cache is coupled intermediate the higher level cache and the main memory of the computer. In the illustrated embodiment below, for example, the lower level cache is an L4 cache in the node controller of a multi-node computer, while the higher level caches are the L1, L2 and L3 caches disposed within the processors that are coupled to the node controller. It will be appreciated that a higher level cache and a lower level cache may be directly coupled to one another, or may be coupled to one another via an intermediate memory or cache. In addition, higher level caches may be dedicated to specific processors, or may be shared by multiple processors. Furthermore, a higher level cache may be multi-way set associative or one-way set associative, may itself be inclusive or exclusive, and may be only a data or instruction cache. Other variations will be apparent to one of ordinary skill in the art having the benefit of the instant disclosure.
Now turning to the Drawings, wherein like numbers denote like parts throughout the several views, computer 50 is implemented as a multinode computer including a plurality of nodes 52. Each node 52 generally includes one or more processors 54, each of which includes one or more caches 55 and is coupled to one or more system or processor buses 56. Also coupled to each processor bus 56 is a chipset 58 incorporating a chipset cache 59, a processor bus interface 60, and a memory interface 62, which connects to a memory subsystem 64 over a memory bus 66. Memory subsystem 64 typically includes a plurality of memory devices, e.g., DRAM's 68, which provide the main memory for each node 52.
For connectivity with peripheral and other external devices, chipset 58 also includes an input/output interface 70 providing connectivity to an I/O subsystem 72. Furthermore, to provide internodal connectivity, an internodal interface, e.g., a scalability port interface 74, is provided in each node to couple via a communications link 75 to one or more other nodes 52. Chipset 58 also typically includes a number of buffers resident therein, e.g., a central buffer 77, as well as one or more dedicated buffers 61, 75 respectively disposed in processor bus interface 60 and scalability port interface 74. Chipset 58 also includes control logic referred to herein as a coherency unit 76 to manage the processing of memory requests provided to the chipset by processors 54 and/or remote nodes 52 over a scalability port interconnect 75.
It will be appreciated that multiple ports or interfaces of any given type may be supported in chipset 58.
Chipset 58 may be implemented using one or more integrated circuit devices, and may be used to interface with additional electronic components, e.g., graphics controllers, sound cards, firmware, service processors, etc. It should therefore be appreciated that the term chipset may describe a single integrated circuit chip that implements the functionality described herein, and may even be integrated in whole or in part into another electronic component such as a processor chip.
Computer 50, or any subset of components therein, may be referred to hereinafter as an “apparatus”. It should be recognized that the term “apparatus” may be considered to incorporate various data processing systems such as computers and other electronic devices, as well as various components within such systems, including individual integrated circuit devices or combinations thereof. Moreover, within an apparatus may be incorporated one or more logic circuits, referred to herein as circuit arrangements, typically implemented on one or more integrated circuit devices, and optionally including additional discrete components interfaced therewith.
It should also be recognized that circuit arrangements are typically designed and fabricated at least in part using one or more computer data files, referred to herein as hardware definition programs, that define the layout of the circuit arrangements on integrated circuit devices. The programs are typically generated in a known manner by a design tool and are subsequently used during manufacturing to create the layout masks that define the circuit arrangements applied to a semiconductor wafer. Typically, the programs are provided in a predefined format using a hardware definition language (HDL) such as VHDL, Verilog, EDIF, etc. Thus, while the invention has and hereinafter will be described in the context of circuit arrangements implemented in fully functioning integrated circuit devices, those skilled in the art will appreciate that circuit arrangements consistent with the invention are capable of being distributed as program products in a variety of forms, and that the invention applies equally regardless of the particular type of computer readable media used to actually carry out the distribution. Examples of computer readable media include but are not limited to tangible, recordable type media such as volatile and non-volatile memory devices, floppy disks, hard disk drives, CD-ROM's, and DVD's, among others, and transmission type media such as digital and analog communications links.
In addition, in this exemplary architecture, four levels of caches are provided, with L1, L2, and L3 caches 55A, 55B, and 55C being provided on each processor chip 54, and with the chipset cache 59 being implemented as an L4 cache. L1 cache 55A is implemented as separate instruction and data caches, while L2 and L3 caches 55B and 55C cache both instructions and data.
L4 cache 59 includes a cache directory 80 and a data array 82, which may or may not be disposed on the same integrated circuit. L4 cache 59 is implemented as an inclusive 4-way set associative cache including N associativity sets 0 to N-1, with each associativity set 84 in directory 80 including four entries 86, 88, 90 and 92 respectively associated with four associativity classes 0, 1, 2 and 3. Each entry 86-92 in directory 80 includes a tag field 94, which stores the tag of a currently cached cache line, and a state field 96 that stores the state of a currently cached cache line, e.g., using the MESI protocol or another state protocol known in the art. Each entry 86-92 has an associated slot 98 in data array 82 where the data for each cached cache line is stored.
The state field 96 in each entry 86-92 stores state information both for the L4 cache and for the higher level L1-L3 caches 55A, 55B and 55C. In the illustrated embodiment, state information for the higher level caches is maintained on a processor bus by processor bus basis, and moreover, the state information for each processor bus, as well as for the L4 cache, is encoded into a single field. For example, in one embodiment consistent with the invention, the state information for the L4 cache, the processor bus A (PBA) caches, and the processor bus B (PBB) caches is encoded into a 5-bit field, as shown below in Table 1. Moreover, in the illustrated embodiment, the L4 cache is not notified by a processor whenever that processor modifies its copy of a cache line, so the L4 cache does not distinguish between the Exclusive and Modified states for each processor bus. In other embodiments, a processor may notify the L4 cache of a state change from Exclusive to Modified such that the L4 cache will update the appropriate PBA or PBB state for the cache line.
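Table 1 itself is not reproduced here; as a hedged illustration of what the encoded field conveys, the decoded content of the state field 96 might be represented as follows, with the per-bus state folding Exclusive and Modified together for the reason given above (all names are illustrative assumptions):

```c
/* Decoded view of state field 96; the actual 5-bit encoding of
 * Table 1 is not reproduced. Because the L4 is not notified when a
 * processor silently moves a line from Exclusive to Modified, the
 * per-bus state does not distinguish those two states. */
typedef enum {
    BUS_INVALID,    /* no valid copy held behind this processor bus */
    BUS_SHARED,     /* shared copy held behind this processor bus */
    BUS_EXCL_OR_MOD /* exclusive copy held, possibly since modified */
} bus_state_t;

typedef struct {
    mesi_state_t l4_state;  /* state of the line in the L4 cache itself */
    bus_state_t  pba_state; /* caches on processor bus A */
    bus_state_t  pbb_state; /* caches on processor bus B */
} l4_line_state_t;
```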
It will be appreciated by one of ordinary skill in the art that other state protocols may be used, as may other mappings or encodings. Furthermore, state information may be partitioned on a processor-by-processor basis, or the state information may simply indicate whether any processor has a valid copy of a cache line. Other variations of storing state information that indicates whether a higher level cache has a valid copy of a cache line will be appreciated by one of ordinary skill in the art having the benefit of the instant disclosure.
Returning to block 104, if no cache hit occurs, the data must be fetched from an alternate source (e.g., node memory, a remote node, etc.). In addition, space for the new cache line must be allocated in the L4 cache. As such, control passes to block 106 to determine whether an available or unused entry exists in the associativity set for the requested cache line, e.g., by determining whether any entry in the associativity set has an invalid state. If so, control passes to block 108 to access the requested data from the node memory or a remote node (as appropriate). Once the data is retrieved, the data is then written into the empty entry, along with updating the MESI state and LRU information for the entry accordingly. Processing of the cache line request is then complete.
Returning to block 106, if no available entry is found, control passes to block 112 to determine whether any entry in the associativity set for the requested cache line is associated with a cache line that is not currently cached in any higher level cache, e.g., by determining whether any entry has an invalid state for all processor buses. If so, control passes to block 114 to access the requested data from the node memory or a remote node (as appropriate). Once the data is retrieved, the existing data in the identified entry is removed and replaced with the retrieved data, along with updating the MESI state and LRU information for the entry accordingly. Processing of the cache line request is then complete.
Returning to block 112, if no entry is found to be associated with a cache line that is not cached in a higher level cache, control passes to block 116 to select an entry according to a replacement algorithm, e.g., the aforementioned LRU algorithm. As such, block 116 accesses the requested data from the node memory or a remote node (as appropriate) and selects an entry according to the replacement algorithm, e.g., the least recently used entry. In addition, an invalidate request is sent to the relevant processor bus or buses for the cache line associated with the selected entry, and the existing data in the selected entry is removed and replaced with the retrieved data, along with updating the MESI state and LRU information for the entry accordingly. Processing of the cache line request is then complete.
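Pulling the three branches of routine 100 together, the miss-path selection described in the preceding paragraphs might be sketched as follows, using the decoded l4_line_state_t above; the helper send_back_invalidate is a hypothetical placeholder for the invalidate request sent on the relevant processor bus or buses:

```c
#include <stdint.h>

#define L4_WAYS 4 /* the illustrated four-way set associative L4 */

typedef struct {
    uint64_t        tag;
    l4_line_state_t st;
    unsigned        lru_age; /* assumed LRU bookkeeping */
} l4_way_t;

/* Hypothetical helper: posts an invalidate request on whichever
 * processor bus(es) the decoded state says still hold the line. */
extern void send_back_invalidate(uint64_t tag, const l4_line_state_t *st);

/* True when no higher level cache on either bus holds a valid copy. */
static int not_cached_above(const l4_line_state_t *s)
{
    return s->pba_state == BUS_INVALID && s->pbb_state == BUS_INVALID;
}

/* Selects the way that receives the incoming line on an L4 miss,
 * mirroring blocks 106, 112 and 116 of routine 100. */
static int allocate_l4_way(l4_way_t set[L4_WAYS])
{
    int victim = -1;
    /* Block 106: use an unused (Invalid) entry if one exists. */
    for (int w = 0; w < L4_WAYS; w++)
        if (set[w].st.l4_state == MESI_INVALID)
            return w;
    /* Block 112: prefer a line held in the L4 but in no higher level
     * cache; evicting such a line needs no back-invalidate traffic. */
    for (int w = 0; w < L4_WAYS; w++)
        if (not_cached_above(&set[w].st) &&
            (victim < 0 || set[w].lru_age > set[victim].lru_age))
            victim = w;
    if (victim >= 0)
        return victim;
    /* Block 116: fall back to LRU and notify the relevant bus(es) to
     * invalidate their copies before replacing the entry. */
    victim = 0;
    for (int w = 1; w < L4_WAYS; w++)
        if (set[w].lru_age > set[victim].lru_age)
            victim = w;
    send_back_invalidate(set[victim].tag, &set[victim].st);
    return victim;
}
```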
It will be appreciated that other logic may be implemented in routine 100 in the alternative. For example, in the event of finding multiple available entries in block 106 or multiple entries associated with cache lines that are not cached in higher level caches in block 112, a replacement algorithm that is the same or different from that used in block 116 may be used to select from among the multiple entries.
It will be appreciated that various modifications may be made to the illustrated embodiments consistent with the invention. It will also be appreciated that implementation of the functionality described above within logic circuitry disposed in a chipset or other appropriate integrated circuit device, would be well within the abilities of one of ordinary skill in the art having the benefit of the instant disclosure.