The present invention relates to computer data storage, and more particularly to cache memory subsystems.
In order to take advantage of the ever-increasing speed of microprocessors, data storage must either use expensive high-speed memory throughout or provide appropriate cache memory subsystems at appropriate points in the computer system processing the data. A cache memory is conventionally smaller than the computer system memory and operates at a higher speed. The purpose of the cache is to hold the information, both instructions and data, that the computer processor is expected to use next, so that the information can be made available to the processor more quickly than from the system memory. In most cache systems, when the system processor requests information, the request is first presented to the cache memory. If the cache contains the information, a “hit” signal is issued, and the requested information is sent to the appropriate function under processor control. If the requested information is not in the cache, a signal indicative of a “miss” is returned to the processor, and the information is then retrieved from the slower system memory.
In the discussion that follows, when the term data is used with respect to caches, it is meant to cover both instructions and data for storage.
A cache is a collection of cache lines; each line includes a tag identifying the line and the data content of the line. A successful match of a tag is a hit; otherwise, there is a miss. The cache lines are arranged in sets. The address of the requested data includes an index that is used to access the correct set in the cache; the address also includes an address tag that is compared to the cache line tags in that set. If a tag matches, there is a hit, and the cache line data is returned to the requester. If none of the tags in the set match, there is a miss, and the requested line must be sought from a lower level of storage, which might be another cache or system memory. If there is only one cache line in each set, the cache is called direct-mapped. If there is more than one cache line in each set, the cache is called n-way set associative, where n is the number of cache lines in each set; the n locations in each set are called ways. If the whole cache is a single set, i.e. the number of ways equals the number of cache lines in the cache, the cache is called fully-associative. When a new cache line is brought in from a lower level of storage, it makes space for itself by evicting an existing line. The candidate for eviction is chosen according to a selected replacement policy or protocol. Standard eviction protocols are usually variations of an LRU (least recently used) policy, i.e. the cache line that has not been used for the longest time has the highest probability of being evicted.
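By way of illustration only, the following C++ sketch models the lookup just described: the address supplies an index that selects a set, the address tag is compared against every way of that set, and on a miss the least recently used way is evicted. The structure and field names are hypothetical and do not correspond to any element of the drawings.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical software model of an n-way set associative cache with LRU
// replacement; data payloads are omitted.
struct CacheLine {
    bool     valid = false;
    uint64_t tag = 0;
    uint64_t lruStamp = 0;            // larger value = more recently used
};

struct SetAssociativeCache {
    unsigned ways;                    // n, the associativity
    unsigned numSets;
    unsigned lineBytes;
    std::vector<std::vector<CacheLine>> sets;
    uint64_t clock = 0;

    SetAssociativeCache(unsigned n, unsigned s, unsigned lb)
        : ways(n), numSets(s), lineBytes(lb),
          sets(s, std::vector<CacheLine>(n)) {}

    // The address supplies an index (selecting the set) and an address tag
    // (compared against the tag of every way in that set).
    unsigned indexOf(uint64_t addr) const { return unsigned((addr / lineBytes) % numSets); }
    uint64_t tagOf(uint64_t addr)   const { return (addr / lineBytes) / numSets; }

    // Returns true on a hit; on a miss the least recently used way of the
    // selected set is evicted and refilled with the requested line.
    bool access(uint64_t addr) {
        std::vector<CacheLine>& set = sets[indexOf(addr)];
        const uint64_t tag = tagOf(addr);
        ++clock;
        for (CacheLine& line : set)
            if (line.valid && line.tag == tag) { line.lruStamp = clock; return true; }  // hit
        CacheLine* victim = &set[0];                   // miss: pick the LRU way
        for (CacheLine& line : set)
            if (line.lruStamp < victim->lruStamp) victim = &line;   // invalid ways have stamp 0
        victim->valid = true;
        victim->tag = tag;
        victim->lruStamp = clock;
        return false;
    }
};
```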
Direct-mapped or low-associativity caches are subject to interference misses, also known as conflict misses. These occur when accesses to a relatively small number of lines, larger in number than the associativity, map to the same set. The address tags differ, but there is not enough space in the set to hold all of the lines simultaneously. If the accesses to these lines are repetitive and in a round-robin fashion, the accesses may always result in a miss. This behavior pattern is known as thrashing. While there may be space in the whole cache to store all of these lines, there is not enough space in any one set to do so.
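As a simple illustration of such thrashing (with a hypothetical cache geometry), addresses spaced apart by the product of the number of sets and the line size all map to the same set; touching one more such line than the associativity, repeatedly and in round-robin order, misses on every access under an LRU policy:

```cpp
#include <cstdint>
#include <cstdio>

// Hypothetical geometry: a 4-way cache with 256 sets of 64-byte lines.
int main() {
    const unsigned lineBytes = 64, numSets = 256, ways = 4;
    const uint64_t stride = uint64_t(numSets) * lineBytes;   // 16 KiB between conflicting lines
    for (unsigned i = 0; i <= ways; ++i) {                   // ways + 1 conflicting lines
        const uint64_t addr = 0x100000 + i * stride;
        const unsigned index = unsigned((addr / lineBytes) % numSets);   // identical for every i
        std::printf("addr %#llx -> set %u\n", (unsigned long long)addr, index);
    }
    // Sweeping these ways + 1 lines repeatedly in round-robin order under an
    // LRU policy evicts each line just before it is needed again, so every
    // access misses (thrashing), even though the other 255 sets are untouched.
    return 0;
}
```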
The problem is further aggravated when multiple hardware threads share a cache. In one situation, the threads are running and sharing the same workload, and the associativity of the cache is just enough for one thread but falls short when multiple threads share the cache. In another situation, the threads could be running workloads with very different cache access patterns. One thread might not reuse any of the data it brings into the cache, thereby polluting or overloading the cache, while another thread, potentially needing more space in the cache, is not afforded that space because the first thread's data, although never reused, is occupying valuable space.
Increasing the associativity of the cache has been considered but does not necessarily solve this problem. In fact, increased associativity could aggravate the problem, particularly in the case of multiple threads sharing a cache, because there is no expedient to identify or weight the value of a line before allocating it space in a cache set. If a thread is streaming through data, it could potentially use up most of the associativity of a set even if the data is never reused. Another drawback is that higher associativity leads to a super-linearly higher power requirement in the cache, because multiple simultaneous tag comparisons are required to identify a hit or a miss. Such comparisons, if done serially, would significantly increase the access latency of the cache.
Many solutions for improving the utilization of cache associativity, or for providing extra associativity as required, have been tried. However, most of these schemes do not evaluate the worthiness or weight of a cache line before affording it space in the cache. One solution for increasing associativity while keeping the power requirement low, the average latency low and the associativity flexible is to provide a small fully-associative buffer in addition to the usual low-associativity cache. This buffer, called a victim buffer or victim cache, is searched upon the occurrence of a miss in the main cache. The limitation of this approach is that the victim buffer can handle associativity extension only up to a relatively small total amount of extra associativity. Also, there might still be associativity lying unused in other parts of the cache.
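By way of illustration only, the following sketch models the victim-buffer idea: a small, fully associative structure probed on a main-cache miss and filled with lines evicted from the main cache. The structure and its FIFO replacement are hypothetical simplifications.

```cpp
#include <cstddef>
#include <cstdint>
#include <deque>

// Hypothetical sketch of a victim buffer: a small, fully associative store
// of recently evicted lines (data payload omitted, FIFO replacement shown).
struct VictimBuffer {
    std::size_t capacity;
    std::deque<uint64_t> lineAddrs;

    explicit VictimBuffer(std::size_t cap) : capacity(cap) {}

    // Probed with the full line address on a main-cache miss; returns true
    // (and removes the entry) if the line is present, in which case it is
    // re-installed in the main cache instead of being fetched from the next level.
    bool probeAndRemove(uint64_t lineAddr) {
        for (auto it = lineAddrs.begin(); it != lineAddrs.end(); ++it)
            if (*it == lineAddr) { lineAddrs.erase(it); return true; }
        return false;
    }

    // Receives the line evicted from the main cache on a fill.
    void insertEvicted(uint64_t lineAddr) {
        if (lineAddrs.size() == capacity) lineAddrs.pop_front();  // oldest victim leaves
        lineAddrs.push_back(lineAddr);
    }
};
```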
An idea similar to the victim cache is a micro-cache that provides one or more extra sets in the cache that adaptively associate themselves with, and thus extend, one or more of the existing sets in the cache. The main drawback of such a scheme is that the size of the micro-cache must be limited so as not to increase the overall cache area drastically. Control logic complexity and latency increases are other concerns with the micro-cache scheme. Schemes to reduce the chances of thrashing due to repetitive, uniformly spaced addresses have included index-hashing, Column-Associative caches and Skewed-Associative caches. In simple address-hashing schemes, the bits of the address that select the index are hashed, and the hashed value is then used to index into the cache sets. The disadvantage of this technique is that the hashing is static and can still suffer from the same problems described above. Hash-Rehash caches and Column-Associative caches use two hash functions on the index bits of the address to evaluate the index. The first hash function is applied first and, upon a miss, the second hash function is applied; the existing storage in the cache is used to place a conflicting address. Column-Associative caches extend Hash-Rehash caches with a few relatively minor optimizations. The drawback of these schemes is that they have been described for direct-mapped caches only. The Skewed-Associative cache reduces the chance of set interference by using different hashes for indexing into different ways of the cache. These hashes are applied simultaneously rather than serially as in the earlier schemes, so lines that would originally all map to the same set typically get mapped to different sets. The disadvantage of this scheme is that extra mapping hardware is required.
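The indexing variations above may be illustrated, purely hypothetically, with two index hash functions: hash-rehash and column-associative schemes probe the two resulting locations serially, while a skewed-associative cache applies a different hash to each way within a single lookup. The hash functions below are illustrative stand-ins, not those of any published design.

```cpp
#include <cstdint>

// Hypothetical index hashes for illustration; real designs choose
// hardware-friendly functions for their conflict properties.
static unsigned hash1(uint64_t lineAddr, unsigned numSets) {
    return unsigned(lineAddr % numSets);
}
static unsigned hash2(uint64_t lineAddr, unsigned numSets) {
    return unsigned((lineAddr ^ (lineAddr / numSets)) % numSets);
}

// Hash-rehash / column-associative style: the location given by hash1 is
// probed first and, only on a miss, the location given by hash2 is probed,
// so the two lookups are serial.
// Skewed-associative style: within a single lookup, way 0 is indexed with
// hash1 and way 1 with hash2, so two lines that conflict under hash1 can
// still coexist, one in each way, at the cost of extra mapping hardware.
struct SkewedIndex {
    unsigned numSets;
    unsigned indexForWay(uint64_t lineAddr, unsigned way) const {
        return (way == 0) ? hash1(lineAddr, numSets) : hash2(lineAddr, numSets);
    }
};
```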
There has also been proposed an MRU (Most Recently Used) bit array that eliminates the need for data swapping between the primary and secondary locations for a line in a multiple-access cache; the MRU bit array is accessed beforehand to determine which location should be probed first. LRU-Based Column-Associative Caches extend the Column-Associative Cache to more than two (2) locations for a line, but require even longer sequential searches through the cache: if the primary location results in a miss, the secondary location is searched; if the secondary location results in a miss, a tertiary location is searched, and so on. The disadvantages of this scheme are the long latency to access the cache and the limited overall performance gain it can provide relative to the additional hardware overhead required to implement it.
The problem of sharing the storage in a cache is arguably even more important when there are multiple threads that share the cache. This problem has only recently come into prominence with the design of semiconductor chips with multiple processing units. Such multi-core (multi-CPU) chips typically let caches be shared by more than one thread. Often, for Level 2 caches, the number of threads sharing the cache is sixteen (16) or more. Under such circumstances, it is highly likely that a few "bad" threads could hijack the space in the cache by being aggressive in accessing the cache while not being very efficient in using the data fetched. An example of such a thread might be one that is running a streaming benchmark. The workload accesses a lot of data, regularly spaced, randomly spaced or a mixture of the two, and brings the accessed data into the cache, but only rarely reuses the data in the cache. In such a scenario, all the other threads that bring in less data, but which would have actually reused that data, might suffer at the expense of the few "bad" threads.
The v-way cache addresses a problem similar to the problem that will hereinafter be addressed by the present invention by using a tag-array independent of the data-array. The tag-array is organized as a set-associative array, whereas the data-array is organized as a direct-mapped array. The tag-array has forward pointers to the entry in the data-array containing the data corresponding to the tag; the data can be anywhere in the data-array and is not necessarily tied one-to-one to the tag-array entry. The tag-array trades utilization for flexibility: it is typically only half full or less, but a given set can grow in associativity, if necessary, at the expense of another set or sets in the cache shrinking in associativity. The benefits of this scheme are a reduction in conflict misses due to the higher associativity and global replacement of data due to keeping the data-array separate. The disadvantage is that every access to data requires an initial access to the tag-array followed by another, sequential access to the data-array as directed by the forward pointer; this detached tag-array and data-array design therefore leads to a longer best-case latency. Similarly, on a replacement, after a data line is evicted from the data-array, the reverse pointer is followed to the tag-array entry to invalidate it.
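By way of illustration only, the decoupled arrays of the v-way scheme may be sketched as follows; the field names and pointer widths are hypothetical.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical sketch of the decoupled arrays: the tag array is set
// associative and deliberately kept only partly full; each valid tag holds
// a forward pointer to the data entry storing its line, and each data entry
// holds a reverse pointer back to its tag.
struct TagEntry {
    bool     valid = false;
    uint64_t tag   = 0;
    uint32_t fwd   = 0;       // forward pointer: index into the data array
};

struct DataEntry {
    bool     valid  = false;
    uint32_t revSet = 0;      // reverse pointer: owning tag set ...
    uint32_t revWay = 0;      // ... and way within that set
    // data payload omitted
};

struct VWayArrays {
    std::vector<std::vector<TagEntry>> tagSets;   // set-associative tag array
    std::vector<DataEntry> dataArray;             // single global data pool

    // Every access is two serial steps: a tag lookup, then a data lookup
    // through the forward pointer, which lengthens the best-case latency.
    // On replacement, a victim is chosen globally in the data array and its
    // reverse pointer is followed to invalidate the matching tag entry.
    void invalidateTagFor(const DataEntry& d) {
        tagSets[d.revSet][d.revWay].valid = false;
    }
};
```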
Several other solutions to the problem of efficiently and fairly allocating storage in a cache shared by multiple threads (fair partitioning) have been proposed. Many of these schemes use way-partitioning, which reallocates the existing ways in a set among the threads sharing the cache. The drawback of these schemes is that they do not scale as the number of threads sharing a cache grows, because the set-associativity of the cache could be smaller than the number of threads. Partitioning the ways among the threads works well only when there are fewer threads than ways, so that at least one way in a set can be allocated to each thread.
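A minimal, hypothetical sketch of such way-partitioning is shown below: each thread is given a mask of the ways into which it may allocate, which is workable only while the associativity is at least as large as the number of threads.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical sketch of way-partitioning: each thread receives a bit-mask
// of the ways into which it may allocate on a miss (hits are unrestricted).
// With 8 ways and up to 8 threads every thread can be guaranteed a way;
// with 16 or more threads sharing an 8-way cache, some threads cannot be
// given any way at all, which is the scalability drawback noted above.
struct WayPartition {
    unsigned ways = 8;                    // associativity of each set
    std::vector<uint32_t> allowedWays;    // one mask per thread id

    bool mayAllocate(unsigned threadId, unsigned way) const {
        return (allowedWays[threadId] >> way) & 1u;
    }
};
```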
In addition, Utility-based Cache Partitioning relies on hardware structures called UMONs (Utility Monitors) that count the "utility" characteristics of each thread for each cache set (or over all cache sets) over a large number of clocks (5 million) and adjusts the partition at that interval. This might be too coarse a granularity for the system to be reactive enough.
Accordingly, there is a need for an efficient way to share cache associativity in a processor system without relying on extraneous storage (like a victim-cache or a micro-cache), without being limited to direct-mapped caches (like the hash-rehash or column-associative caches), without relying on way-partitioning (which works only when there are fewer threads than the associativity) and without reacting slowly to the dynamics of cache utilization, especially when a large number of threads share the cache (as in Utility-based Cache Partitioning).
In its broadest aspects, the present invention provides for the improved utilization of cache storage by determining which lines of data are worthy of the cache space, i.e. have sufficient value to be provided cache space, and then judiciously utilizing space in a cache set different from the set that the cache line indexes into.
More particularly, the present invention relates to a cache memory including a plurality of sets of cache lines and provides an implementation for increasing the associativity of selected sets of cache lines, including the combination of: providing a group of parameters for determining the worthiness of a cache line stored in a basic set of cache lines; providing partner sets of cache lines, in the cache memory, associated with the basic set; applying the group of parameters to determine the worthiness level of a cache line in the basic set; and, responsive to a determination of a worthiness in excess of a predetermined level for a cache line, storing that cache line in said partner set.
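By way of illustration only, the combination summarized above may be sketched as follows. The particular worthiness rule (a reuse count and thread priority compared with a threshold), the fixed pairing of each basic set with a partner set, and all names are hypothetical stand-ins; the preferred embodiment is described in detail hereinafter.

```cpp
#include <cstdint>

// Hypothetical sketch only: the worthiness rule and the set pairing shown
// here are illustrative stand-ins for the group of parameters and the
// partner-set association described in the specification.
struct LineState {
    uint64_t tag = 0;
    unsigned basicSet = 0;       // the set the line's address indexes into
    unsigned reuseCount = 0;     // one possible worthiness parameter
    unsigned threadPriority = 0; // another possible parameter
};

constexpr unsigned WORTHINESS_THRESHOLD = 1;   // the predetermined level

// Apply the group of parameters to derive a single worthiness level.
unsigned worthinessOf(const LineState& line) {
    return line.reuseCount + line.threadPriority;
}

// Each basic set is associated with a partner set elsewhere in the same
// cache; pairing set i with set i ^ 1 is shown purely as an example.
unsigned partnerSetOf(unsigned basicSet) { return basicSet ^ 1u; }

// A line whose worthiness exceeds the predetermined level qualifies to be
// stored in the partner set rather than being discarded from the basic set.
bool qualifiesForPartnerSet(const LineState& line) {
    return worthinessOf(line) > WORTHINESS_THRESHOLD;
}
```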
In accordance with one operative aspect of the invention, the cache memory is an n-way set associative cache, and the number of threads providing access input to the cache is greater than n. In providing for the partnering, an implementation is provided enabling said basic set to borrow ways from said partner set, whereby the number of ways available to the basic set of cache lines is increased. A function associated with said basic set is provided for indicating the cache lines stored in said partner set.
For best results, the means for determining the worthiness level and the means for storing cache lines in the partner set are dynamically operative while data lines are being input into the cache memory.
In accordance with another aspect of the invention, an implementation is provided for evicting selected cache lines from said basic set in order to prevent exceeding the capacity of said basic set, wherein the means for determining said worthiness level determine the worthiness of an evicted cache line. The worthiness of a cache line may be determined by the reuse potential of the cache line, and the reuse of an evicted cache line may already have been tracked prior to its eviction.
Apparatus for applying the cache line set partnering of the present invention in a cache memory system may be embodied in the combination of a data array for storing said basic and partner sets of cache lines, a tag array for storing tags of said respective cache lines and a ghost tag array for storing tags respectively indicating the cache lines stored in said partner set.
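By way of illustration only, the three arrays of such an apparatus may be sketched as follows; the field names and array shapes are hypothetical and are not taken from the drawings.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical sketch of the three arrays; widths and field names are
// illustrative only.
struct TagEntry {
    bool     valid = false;
    uint64_t tag   = 0;
};

struct GhostTagEntry {
    bool     valid = false;
    uint64_t tag   = 0;     // a line of this basic set now held in its partner set
};

struct SetPartneringArrays {
    unsigned ways, numSets, lineBytes;
    std::vector<std::vector<TagEntry>>      tagArray;      // tags of lines resident in each set
    std::vector<std::vector<uint8_t>>       dataArray;     // line payloads for each set
    std::vector<std::vector<GhostTagEntry>> ghostTagArray; // per basic set: lines kept in the partner set

    SetPartneringArrays(unsigned w, unsigned s, unsigned lb)
        : ways(w), numSets(s), lineBytes(lb),
          tagArray(s, std::vector<TagEntry>(w)),
          dataArray(s, std::vector<uint8_t>(w * lb)),
          ghostTagArray(s, std::vector<GhostTagEntry>(w)) {}
};
```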
The present invention will be better understood and its numerous objects and advantages will become more apparent to those skilled in the art by reference to the following drawings, in conjunction with the accompanying specification, in which:
Conventionally, processors supporting memory caches have two main storage arrays (FIG. 1)—a Tag Array 11 and a Data Array 10. The structure of a 4-way set associative cache is shown in
In the illustration shown in
When a cache is shared by many cores, e.g. on multi-CPU semiconductor chips that have many more threads (simultaneously running execution sequences that may each access the cache), the cache appears as a uniform resource to all of the threads. However, the thread that uses the space in the cache most aggressively tends to take up more space in the cache. Aggressively making requests to the cache, and thereby using more of the cache, does not necessarily imply that the thread is using the cache efficiently. Overall throughput, performance or both may suffer because the other threads could be starved for cache space.
To more efficiently use the cache space, whether a single thread or multiple threads use it, it is important to recognize which cache lines are making effective use of the cache space and which lines are not. A cache line that is used one or more times after the first access brings it in might be considered one that uses the cache space more effectively than a cache line that, after the initial access, is never used again before it is evicted. Several schemes to measure reuse effectiveness use counters that count how often a line was accessed after being brought into the cache.
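By way of illustration only, such a reuse counter may be sketched as follows; the counter width and names are hypothetical.

```cpp
#include <cstdint>

// Hypothetical per-line reuse counter of the kind used by reuse-measurement
// schemes: cleared when the line is filled, incremented on every later hit.
// A line evicted with a zero count was never reused after the access that
// brought it in.
struct LineReuseInfo {
    uint64_t tag = 0;
    bool     valid = false;
    uint16_t reuseCount = 0;

    void onFill(uint64_t newTag) { tag = newTag; valid = true; reuseCount = 0; }
    void onHit()                 { if (reuseCount != 0xFFFF) ++reuseCount; }  // saturating
    bool reusedSinceFill() const { return reuseCount > 0; }
};
```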
It should be noted that reuse potential is not the only attribute that could determine worthiness. For example, the threads providing cache lines could be given weights or priorities that could be used to determine the worthiness of cache lines from such threads.
A preferred embodiment of the present invention, to be hereinafter described in detail, uses an extra Tag Array that may be referred to as a “Ghost Tag Array” or “Shadow Tag Array”. A purpose of this Ghost Tag Array is to retain information about cache lines that are no longer in the main (basic) data array.
This implementation is shown in
Now, with respect to
As an ancillary consideration, in the case of a store-back cache (also known as a write-back cache), the miss handling described above applies to both load and store misses (in most cases). In the case of a store-through cache (also known as a write-through cache), the miss handling described above applies only to load misses, since store misses do not bring any data into the cache.
Continuing with respect to
As has been previously mentioned, the means for determining the worthiness level and the means for storing cache lines in the partner set are, preferably, dynamically operative while data lines are being input into the cache memory. In such a dynamic environment, an implementation is provided for evicting selected cache lines from said basic set in order to prevent exceeding the capacity of said basic set, wherein the means for determining said worthiness level determine the worthiness of an evicted cache line. The worthiness of a cache line may be determined by the reuse potential of the cache line, and the reuse of an evicted cache line may already have been tracked prior to its eviction. An embodiment of this will be described with respect to the Handle Replacement In Main-Set (HRIM) flowchart of
If the replacement candidate in the Main Set is marked “worthy”, or high associativity eligible, step 63, Yes, then additional associativity must be borrowed in the cache so as not to lose the data from the Data Array. An attempt is then made to move the tag and data corresponding to this replacement candidate to the Partner Set, step 64. An example of such a move-to-partner procedure will be described with respect to the Move To Partner Set (MTPS) flowchart of
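By way of illustration only, the HRIM flow as summarized above may be sketched as follows. The helper routines are hypothetical stand-ins for the flowchart steps, and the assumption that a failed move falls back to conventional eviction is an illustrative one.

```cpp
#include <cstdint>

// Hypothetical sketch of the Handle Replacement In Main-Set (HRIM) flow.
struct CandidateLine {
    uint64_t tag = 0;
    bool     worthy = false;   // marked "worthy" / high associativity eligible
    bool     dirty  = false;
};

// Stub helpers standing in for the flowchart steps.
CandidateLine selectReplacementCandidate(unsigned /*mainSet*/) {
    return CandidateLine{};    // real logic would apply the replacement policy (e.g. LRU)
}
bool moveToPartnerSet(unsigned /*mainSet*/, const CandidateLine& /*line*/) {
    return false;              // see the MTPS sketch following the next paragraph
}
void evictConventionally(const CandidateLine& /*line*/) {
    // write back if dirty, then drop the line
}

// Makes room in the main (basic) set for an incoming line.  A candidate
// marked worthy is offered to the partner set so that its data is not lost;
// otherwise (or if the move fails, an assumption made here) the candidate
// is evicted in the conventional way.
void handleReplacementInMainSet(unsigned mainSet) {
    CandidateLine victim = selectReplacementCandidate(mainSet);
    if (victim.worthy && moveToPartnerSet(mainSet, victim))
        return;                // data preserved in a borrowed way of the partner set
    evictConventionally(victim);
}
```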
When a cache data line is found to be sufficiently worthy to be moved to the partner set, an additional determination must be made as to space available in the partner set.
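By way of illustration only, the space-available determination may be sketched as follows. The particular test shown (a free way, or failing that a resident line of lower worthiness) is a hypothetical policy; the embodiment's actual policy follows the MTPS flowchart.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical sketch of the space-available test in the partner set; the
// specific policy shown is illustrative only.
struct PartnerWay {
    bool     valid = false;
    uint64_t tag = 0;
    bool     borrowed = false;   // currently holding a line on behalf of another set
    unsigned worthiness = 0;
};

struct PartnerSet {
    std::vector<PartnerWay> ways;
};

// Returns the index of a way the incoming worthy line may occupy, or -1 if
// the partner set has no space it is willing to give up, in which case the
// move fails and the candidate is handled by conventional eviction.
int findSpaceInPartnerSet(const PartnerSet& partner, unsigned incomingWorthiness) {
    for (unsigned w = 0; w < partner.ways.size(); ++w)
        if (!partner.ways[w].valid) return int(w);            // a free way is available
    for (unsigned w = 0; w < partner.ways.size(); ++w)
        if (partner.ways[w].worthiness < incomingWorthiness)
            return int(w);                                    // displace a less worthy resident
    return -1;
}
```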
A flowchart, MTPS flowchart (
Since the steps laid out in the HRIM flowchart,
Although certain preferred embodiments have been shown and described, it will be understood that many changes and modifications may be made therein without departing from the scope and intent of the appended claims.