More generally, this invention is a method and apparatus for maintaining information about membership in a set, wherein membership in the set determines how items in the set are to be handled in a computer system. More specifically, an embodiment of the present invention relates to a method and apparatus for storing and removing data from cache memory in a computer system according to the data's membership in a set.
A special very high-speed memory is sometimes used to increase the speed of processing within a data processing system by making current programs and data available to a processor (“CPU”) at a rapid rate. Such a high-speed memory is known as a cache and is sometimes employed in large computer systems to compensate for the speed differential between main memory access time and processor logic. Processor logic is usually faster than main memory access, with the result that processing speed is mostly limited by the speed of main memory. A technique used to compensate for this mismatch in operating speeds is to employ one or more extremely fast, small memory arrays between the CPU and main memory, whose access time is close to processor logic propagation delays. Such a cache is used to store segments of programs currently being executed in the CPU and temporary data frequently needed in the present calculations. By making programs (instructions) and data available at a rapid rate, it is possible to increase the performance rate of the processor.
If the active portions of the program and data are placed in a fast small memory such as a cache, the average memory access time can be reduced, thus reducing the total execution time of the program. The cache memory access time is often less than the access time of main memory by a factor of five to ten. The cache is the fastest component in the memory hierarchy and approaches the speed of CPU components.
The fundamental idea of cache organization is that by keeping the most frequently accessed instructions and data in one or more fast cache memory arrays, the average memory access time will approach the access time of the cache. Although the cache is only a small fraction of the size of main memory, a large fraction of memory requests will be found in the fast cache memory because of the locality of reference property of programs.
The basic operation of the cache is as follows. When the CPU needs to access memory, the cache is examined. If the word is found in the cache, it is read from the fast memory. If the word addressed by the CPU is not found in the cache, the main memory is accessed to read the word. A block of words containing the one just accessed is then transferred (prefetched) from main memory to cache memory. In this manner, some data is transferred to cache so that future references to memory find the required words in the fast cache memory.
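By way of illustration only, the basic cache operation just described may be sketched in C as follows; the direct-mapped geometry, the parameter values, and the main_memory_read placeholder are assumptions made for the sketch and are not part of this disclosure.

    #include <stddef.h>
    #include <stdint.h>

    #define LINE_SIZE 64     /* assumed bytes per cache line  */
    #define NUM_LINES 256    /* assumed number of cache lines */

    typedef struct {
        int      valid;
        uint64_t tag;
        uint8_t  data[LINE_SIZE];
    } cache_line_t;

    static cache_line_t cache[NUM_LINES];

    /* Placeholder for a main memory access; it fabricates bytes so that the
     * sketch is self-contained. */
    static void main_memory_read(uint64_t byte_addr, uint8_t *dst, size_t len)
    {
        for (size_t i = 0; i < len; i++)
            dst[i] = (uint8_t)(byte_addr + i);
    }

    uint8_t cache_read_byte(uint64_t addr)
    {
        uint64_t line_addr = addr / LINE_SIZE;
        uint64_t index     = line_addr % NUM_LINES;
        uint64_t tag       = line_addr / NUM_LINES;
        cache_line_t *line = &cache[index];

        if (!line->valid || line->tag != tag) {
            /* Miss: transfer the whole block containing the addressed word
             * from main memory into the cache. */
            main_memory_read(line_addr * LINE_SIZE, line->data, LINE_SIZE);
            line->valid = 1;
            line->tag   = tag;
        }
        /* Hit (or just-filled line): the word is read from the fast memory. */
        return line->data[addr % LINE_SIZE];
    }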
Prefetching techniques are often implemented to try to supply memory data to the cache ahead of time to reduce latency. Ideally, a program would prefetch data and instructions far enough in advance that a copy of the memory data would always be in the cache when it was needed by the processor.
Data prefetching is a promising way of bridging the gap between the faster pipeline and the slower cache hierarchy. Most prefetch engines in vogue today try to detect repeated patterns among memory references. According to the detected patterns, they speculatively bring possible future references into caches closer to the pipeline. Different prefetch engines use different methods for detecting reference patterns and speculating upon which references to prefetch. A prefetch engine often needs to accumulate historical information observed from the reference stream and base its predictions upon it. However, it is also important to periodically age out stale information in an efficient manner.
A common paradigm in prefetch engines in vogue today is to have a learning phase and a tracking phase. During the learning phase, a prefetch engine detects a possible pattern exhibited by a sequence of memory accesses. Having detected a pattern, the prefetch engine switches to a tracking phase to track the progress of the pattern and issue prefetches as long as the pattern continues. For example, to detect strided references, a state machine is instituted that remembers the base address and the constant stride between two references. Each reference made thereafter is compared to see if it forms the next term in the strided sequence. If so, the state advances, remembering the number of terms identified in the sequence. After a sequence of sufficient length is recognized, the state machine is disbanded, and the tracking phase is started to issue prefetches for future reference items in the sequence, ahead of time.
In general, the learning phase involves some finite tables to remember information in a local time window and possibly some associative searches to determine whether a suitable match has occurred. Scarcity of hardware resources often limits the table size and the type of searches, thereby forcing the stored information to be discarded after some time and the same information to be re-learned when the pattern re-appears at a later time during execution. Prefetching starts paying dividends only during the tracking phase that follows each learning phase.
Furthermore, it is not necessary to go through an entire re-learning phase before triggering the next tracking phase. It is sufficient to remember the occurrence of the first term of a pattern. In the above example, suppose that there is a strided sequence of references, “a, a+d, a+2d, . . . ”. If we remember the address “a” and recognize the next occurrence of “a”, then we can trigger the tracking phase much more quickly without having to go through an elaborate re-learning phase. However, program behavior often changes, and the same pattern may not repeat at the term that is remembered. Hence, there is a need for a simple mechanism to phase out old information over a period of time, so that the information will either be re-confirmed or be replaced by new information the next time a re-learning phase is performed.
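By way of illustration only, the learning-phase state machine for strided references described above may be sketched as follows; the confirmation threshold and all names are assumptions made for the sketch.

    #include <stdint.h>

    #define CONFIRMATIONS_NEEDED 3   /* assumed sequence length for "learned" */

    typedef struct {
        uint64_t last_addr;   /* most recent reference in the candidate sequence */
        int64_t  stride;      /* constant stride between two references          */
        int      matches;     /* number of terms identified so far               */
        int      tracking;    /* set once the learning phase completes           */
    } stride_detector_t;

    void stride_observe(stride_detector_t *sd, uint64_t addr)
    {
        int64_t delta = (int64_t)(addr - sd->last_addr);

        if (sd->matches == 0) {
            /* First observation: remember the base address only. */
            sd->last_addr = addr;
            sd->matches   = 1;
            return;
        }
        if (sd->matches == 1 || delta == sd->stride) {
            /* The reference forms the next term in the strided sequence. */
            sd->stride    = delta;
            sd->last_addr = addr;
            sd->matches++;
            if (sd->matches >= CONFIRMATIONS_NEEDED)
                sd->tracking = 1;   /* learning done; switch to tracking */
        } else {
            /* The pattern is broken: restart learning from this reference. */
            sd->last_addr = addr;
            sd->stride    = 0;
            sd->matches   = 1;
            sd->tracking  = 0;
        }
    }

    /* During the tracking phase, the next expected address is
     * last_addr + stride, which the prefetch engine can request ahead of time. */
    uint64_t stride_next_prefetch(const stride_detector_t *sd)
    {
        return sd->last_addr + (uint64_t)sd->stride;
    }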
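By way of illustration only, the fast re-trigger idea discussed above may be sketched as follows; remembering only the first term “a” and the stride of a previously learned sequence allows tracking to resume immediately when “a” is seen again. The structure and function names are assumptions made for the sketch.

    #include <stdint.h>

    typedef struct {
        uint64_t first_term;   /* address "a" of a previously learned sequence */
        int64_t  stride;       /* its constant stride d                        */
        int      valid;
    } learned_pattern_t;

    /* Returns 1 and resumes the tracking phase immediately if this reference
     * matches the remembered first term; otherwise the normal (and slower)
     * learning path must be taken. */
    int try_fast_retrigger(const learned_pattern_t *p, uint64_t addr,
                           uint64_t *next_prefetch)
    {
        if (p->valid && addr == p->first_term) {
            *next_prefetch = addr + (uint64_t)p->stride;  /* start at a + d */
            return 1;
        }
        return 0;
    }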
There is, therefore, a need to provide for a low cost and high performance mechanism to phase out aging membership information in a prefetching mechanism for caching data or instructions.
More generally, there is a need to provide for a low cost and high performance mechanism to phase out aging membership information for items in a computer system to determine the handling of the items.
Accordingly, it is an object of this invention to provide a low cost and high performance mechanism to delete aging information in a set of items, such as data or instructions.
It is a more specific object of this invention to age out stale information in a membership engine of a data prefetcher in cache management systems, so that the right data or instructions are in the cache when they are needed for further processing or execution.
This invention provides a mechanism to accomplish easy aging in a set of items to be used by a computer system by maintaining a primary and a secondary vector, which preferably have the same size and interface. An item is declared to be a member of a set when a representation of the item is found in the primary vector. When an item is inserted in a set, its representation is entered in both the primary and secondary vectors. Periodically, the two vectors are switched; that is, the primary vector becomes the secondary vector, and the secondary vector becomes the primary vector. Then, at least some of the components of the new secondary bit vector are set to zeroes. Membership in the set is determined by examining the components in the primary vector, and the items in the set are then used in a predetermined manner.
In a more specific embodiment of this invention the set of items to be used by the computer system represent addresses of frequently used data or instructions to be stored in a cache memory. When there is a cache line miss for data or instructions, the primary vector is examined to see if the entry corresponding to the data or instructions is in the primary vector. If the entry corresponding to the data or instructions is found in the primary vector, then the corresponding data or instructions are prefetched into the cache.
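By way of illustration only, the two-vector membership mechanism of the preceding paragraphs, as applied to page addresses, may be sketched as follows; the vector size, the single modulo hash, the page size, and the prefetch placeholder are assumptions made for the sketch and are not part of the claims.

    #include <stdint.h>
    #include <string.h>

    #define M     1024                 /* assumed bits per representation vector */
    #define WORDS (M / 64)

    typedef struct {
        uint64_t vec[2][WORDS];        /* two vectors of identical size        */
        int      primary;              /* index (0 or 1) of the primary vector */
    } membership_t;

    /* Exemplary hash: map an address x to x mod M. */
    static unsigned hash_mod(uint64_t x) { return (unsigned)(x % M); }

    static void set_bit(uint64_t *v, unsigned b) { v[b / 64] |= 1ULL << (b % 64); }
    static int  get_bit(const uint64_t *v, unsigned b) { return (int)((v[b / 64] >> (b % 64)) & 1); }

    /* Insertion enters the item's representation in both vectors. */
    void membership_insert(membership_t *m, uint64_t item)
    {
        unsigned b = hash_mod(item);
        set_bit(m->vec[m->primary], b);
        set_bit(m->vec[1 - m->primary], b);
    }

    /* Membership is determined by examining the primary vector only. */
    int membership_test(const membership_t *m, uint64_t item)
    {
        return get_bit(m->vec[m->primary], hash_mod(item));
    }

    /* Periodic switch: the secondary becomes the primary, and the new
     * secondary is cleared so that stale entries fade out unless they are
     * re-inserted during the next interval. */
    void membership_age(membership_t *m)
    {
        m->primary = 1 - m->primary;
        memset(m->vec[1 - m->primary], 0, sizeof m->vec[0]);
    }

    /* Placeholder for the prefetch action; not a real interface. */
    static void prefetch_page_lines(uint64_t page_addr) { (void)page_addr; }

    /* On a cache line miss, the primary vector is examined and, if the
     * entry for the page is present, the page's lines are prefetched. */
    void on_cache_line_miss(membership_t *m, uint64_t miss_addr)
    {
        uint64_t page_addr = miss_addr & ~(uint64_t)0xFFF;   /* assume 4 KiB pages */
        if (membership_test(m, page_addr))
            prefetch_page_lines(page_addr);
    }

In this sketch the role switch is realized by toggling an index rather than copying any bits, so the periodic aging step costs only a clear of the new secondary vector.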
It should be noted that the embodiment described below is only one example of usage of the invented apparatus, and does not constrain the generality of the claims in any manner.
Referring now to
As an example,
A cache line miss occurs when the corresponding line is not in the cache and the hardware automatically fetches the line at that time. A prefetch mechanism anticipates future uses of a line by the processor and issues commands to prefetch lines into the cache ahead of time, so that the transfer latency can be masked. As an example, a page is marked as hot if the number of cache line misses exceeds a predefined threshold. When a page becomes “hot”, a hash of its address will be used to update the vectors in the membership engine as described below. When the next line miss occurs in the hot page, the update in the vectors will be detected, thereby causing the prefetcher to initiate the transfer of all lines of the hot page into the cache, as shown in steps 262 and 263 of
The modules of the ME 200 and the DE 300 can be implemented, for example, in hardware using well known devices such as registers and latches on a semiconductor chip.
In an alternative embodiment,
It should be appreciated by those skilled in the art that although four hash function modules are depicted, the multi-phase membership system can comprise any number of hash function modules as long as at least one hash function module is included. It should also be appreciated by those skilled in the art that the multi-phase membership system has no specific requirements for the hash functions implemented by the hash function modules, although different hash functions may result in different performance. An exemplary hash function simply maps an address x to the remainder of x divided by M, wherein M is the number of bits in each representation vector.
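By way of illustration only, the exemplary hash function named above may be written out as follows; the value of M is an assumption made for the sketch. In an arrangement with several hash function modules, each module could compute its own bit position in a similar manner.

    #include <stdint.h>

    enum { M_BITS = 1024 };   /* assumed number of bits in each representation vector */

    /* Map an address x to the remainder of x divided by M, giving a bit
     * position within the representation vector. */
    static inline unsigned hash_mod(uint64_t addr)
    {
        return (unsigned)(addr % M_BITS);
    }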
Also, in the alternative,
As stated above, according to an embodiment of the present disclosure, the primary representation vector and the secondary representation vector need to periodically switch their roles. When this happens, the original secondary representation vector becomes the new primary representation vector, and the original primary representation vector becomes the new secondary representation vector. Meanwhile, each bit in the new secondary representation vector is cleared to 0.
A detection engine is used to detect a pattern among memory references. As described above, a detection engine consists of two architectural components, a storage component 310 and a detection algorithm 320. The storage is a table-like structure, which may be indexed by reference addresses. Each entry should have enough machinery to accomplish what the detection algorithm needs to do. The detection engine described above inserts members in a set on the basis of a threshold number of cache line misses. In an example alternative algorithm used in the DE 300, the detection engine may track references to a page and label the page as “hot” if the number of such references exceeds a threshold. In this example, each entry in the table needs to have a counter to keep the information about how many times the relevant page is accessed. The detection algorithm simply compares a threshold value with the counter value stored in the table. The threshold value can be a predefined constant or a dynamically adjustable value. When the threshold is reached, the membership insert signal 228 is generated, thereby causing membership engine updates as described above.
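By way of illustration only, the alternative detection algorithm just described may be sketched as follows; the table size, the threshold, the page size, and the callback used to model the membership insert signal 228 are assumptions made for the sketch.

    #include <stdint.h>

    #define DE_ENTRIES    64   /* assumed size of the storage component 310     */
    #define HOT_THRESHOLD  8   /* assumed references needed to label a page hot */

    typedef struct {
        uint64_t page;         /* page address tracked by this entry     */
        unsigned count;        /* how many times the page was referenced */
        int      valid;
    } de_entry_t;

    typedef struct {
        de_entry_t table[DE_ENTRIES];                /* storage component */
        void (*membership_insert)(uint64_t page);    /* models signal 228 */
    } detection_engine_t;

    void de_observe(detection_engine_t *de, uint64_t ref_addr)
    {
        uint64_t page = ref_addr >> 12;                  /* assume 4 KiB pages */
        de_entry_t *e = &de->table[page % DE_ENTRIES];   /* index by reference */

        if (!e->valid || e->page != page) {
            /* A new page displaces the old entry in this slot. */
            e->valid = 1;
            e->page  = page;
            e->count = 0;
        }
        if (++e->count >= HOT_THRESHOLD) {
            /* Threshold reached: the page is hot; signal the membership engine. */
            de->membership_insert(page);
            e->count = 0;    /* restart counting for this page */
        }
    }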