Not applicable.
Not applicable.
For decades, improvements in semiconductor design and manufacturing have dramatically increased processor performance and main memory density. As clock speeds for processors increase and main memory becomes larger, longer latency periods may occur when a processor accesses main memory. Cache hierarchies (e.g. different cache levels) may be implemented to reduce latency and performance bottlenecks caused by frequent access to main memory. A cache may be one or more small, high-speed associative memories that reduce the average time to access main memory by providing copies of frequently referenced main memory locations. When a processor reads or writes a location in main memory, the processor first checks whether a copy of the data already resides in the cache memory. When present, the processor is directed to the cache memory rather than the slower main memory.
For cache to be effective, a processor needs to continually access the cache rather than main memory. Unfortunately, the cache is typically smaller than main memory and limited to storing a subset of the data within the main memory. The size limitation may inherently limit the “hit” rate within the cache. A “hit” occurs when the cache holds a valid copy of the data requested by the processor, while a “miss” occurs when the cache does not hold a valid copy of the requested data. When a “miss” occurs within the cache, the processor may subsequently access the slower main memory. Hence, frequent “misses” within a cache negatively impact latency and processor performance. One method to reduce the “miss” rate is to increase the size of the cache and the amount of information stored within the cache. However, as the cache size increases and becomes more complex, cache performance (e.g. the time required to access the cache) generally decreases. As a result, a design balance for the cache is typically struck between minimizing the “miss” rate and maximizing cache performance.
A victim cache may be implemented in conjunction with a cache to minimize the impact of “misses” that occur within the cache. For instance, when a cache replaces old data stored in the cache with new data, the cache may evict the old data and transfer the old data to the victim cache for storage. After the eviction of the old data, a “miss” may occur within the cache when the processor requests the old data. The processor may subsequently access the victim cache to determine whether the old data is stored in the victim cache. A victim cache may be beneficial because accessing the victim cache instead of main memory reduces the time to reference missing data evicted from the cache. However, a victim cache may be somewhat inflexible and limited in applicability. For example, the victim cache is typically smaller and stores less information than the cache to avoid compromising the processor clock rate. Additionally, an increase in latency occurs when the processor accesses the victim cache subsequent to a “miss” within the cache. In other words, the processor may wait at least one clock cycle before accessing the victim cache. Hence, a solution is needed to increase the flexibility and usability of the victim cache, and thereby increase processor performance.
In one embodiment, the disclosure includes an apparatus for accessing a primary cache and an overflow cache, comprising a core logic unit configured to perform a first instruction that accesses the primary cache and the overflow cache in parallel, determine whether the primary cache stores a requested data, determine whether the overflow cache stores the requested data, and access a main memory when the primary cache and the overflow cache do not store the requested data, wherein the overflow cache stores data that overflows from the primary cache.
In another embodiment, the disclosure includes an apparatus for concurrently accessing a primary cache and an overflow cache, comprising a primary cache that is divided into a plurality of primary cache blocks, an overflow cache that is divided into a plurality of overflow cache blocks, and a memory management unit (MMU) configured to perform memory management for the primary cache and the overflow cache, wherein the primary cache and the overflow cache are accessed within a same clock cycle.
In yet another embodiment, the disclosure includes a method for concurrently accessing a primary cache and an overflow cache, wherein the method comprises determining whether a primary cache miss has occurred within a primary cache, determining whether an overflow cache miss has occurred within an overflow cache, selecting a primary cache entry using a first cache replacement policy when a primary cache miss has occurred within the primary cache, and selecting an overflow cache entry using a second cache replacement policy when an overflow cache miss has occurred within the overflow cache, wherein determining whether the primary cache miss has occurred and determining whether the overflow cache miss has occurred are performed within a same clock cycle.
These and other features will be more clearly understood from the following detailed description taken in conjunction with the accompanying drawings and claims.
For a more complete understanding of this disclosure, reference is now made to the following brief description, taken in connection with the accompanying drawings and detailed description, wherein like reference numerals represent like parts.
It should be understood at the outset that although illustrative implementations of one or more embodiments are provided below, the disclosed systems and/or methods may be implemented using any number of techniques, whether currently known or in existence. The disclosure should in no way be limited to the illustrative implementations, drawings, and techniques described below, including the exemplary designs and implementations illustrated and described herein, but may be modified within the scope of the appended claims along with their full scope of equivalents.
Disclosed herein are a method, an apparatus, and a system to access an overflow cache concurrently with a primary cache. When a core logic unit (e.g. a processor) performs an application that accesses the primary cache, the core logic unit may also access the overflow cache in parallel and/or within the same clock cycle of the core logic unit. The primary cache may be configured as an M-way set associative cache, while the overflow cache may be configured as an N-way set associative cache, where M and N are integers. By concurrently accessing the primary cache and the overflow cache, the core logic unit may be able to access an (M+N)-way set associative memory element. The overflow cache may be a separate memory element that may be configured to implement the same replacement policies as, or different replacement policies than, the primary cache. “Hits” within an overflow cache may be promoted to the primary cache to avoid evicting data to the main memory and/or to the remaining memory subsystem (e.g. next level of cache). In one embodiment, a single MMU may be used to perform memory management functions, such as address translation and/or memory protection, for both the primary cache and the overflow cache.
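As a purely illustrative aid (not a description of the disclosed hardware), the following Python sketch models the concurrent lookup described above: both caches are probed for the same address, and main memory is consulted only when both caches miss. The class and function names are hypothetical.

```python
class SetAssociativeCache:
    """Toy model of a set associative cache (hypothetical, for illustration)."""

    def __init__(self, num_sets, num_ways):
        self.num_sets = num_sets
        self.num_ways = num_ways
        # Each set maps a tag to its cached data, holding at most num_ways entries.
        self.sets = [dict() for _ in range(num_sets)]

    def lookup(self, address):
        index = address % self.num_sets   # set index from the low-order bits
        tag = address // self.num_sets    # tag from the remaining high-order bits
        return self.sets[index].get(tag)  # None indicates a "miss"


def access(primary, overflow, main_memory, address):
    # Both caches are probed for the same address; in hardware the two lookups
    # would complete within the same clock cycle.
    primary_data = primary.lookup(address)
    overflow_data = overflow.lookup(address)
    if primary_data is not None:
        return primary_data               # "hit" in the primary cache
    if overflow_data is not None:
        return overflow_data              # "hit" in the overflow cache (may be promoted)
    return main_memory[address]           # double "miss": access main memory
```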
The general-purpose computer system 100 may also comprise a core logic unit 102 coupled to the Rx 108 and the Tx 110, where the core logic unit 102 may be configured to implement any of the schemes described herein, such as accessing the primary cache 104, overflow cache 106, main memory 116, and other layers of memory subsystem 118. The core logic unit 102 may also be configured to implement methods 500, 600, 700, and 800 which will be described in more detail later. The core logic unit 102 may comprise one or more central processing unit (CPU) chips, field-programmable gate arrays (FPGAs), application specific integrated circuits (ASICs), and/or digital signal processors (DSPs), and/or may be part of one or more ASICs. In one embodiment, the core logic unit 102 may comprise one or more processors, where each processor is a multi-core processor.
The memory subsystem 118 may comprise a primary cache 104, an overflow cache 106, and main memory 116. The primary cache 104 may be a data cache that may be organized into one or more cache levels (e.g. level 1 (L1) cache and level 2 (L2) cache). The primary cache 104 may store the actual data fetched from the main memory 116. The primary cache 104 may typically be accessed faster and/or may have less storage capacity than the main memory 116. The primary cache 104 may be configured to store and/or load physical addresses or virtual addresses. For example, when core logic unit 102 is a single processor, the primary cache 104 may store virtual addresses. Alternatively, the primary cache 104 may store physical addresses when the core logic unit 102 is a multi-processor. Overflow cache 106 may be a separate memory element configured to store data evicted from the primary cache 104. The overflow cache 106 may act as overflow storage of data when the primary cache 104 is full and unable to store the data. The size of the overflow cache 106 and the configuration of the overflow cache 106 will be discussed in more detail below. As discussed above, the primary cache 104 and overflow cache 106 may be RAM memory components (e.g. SRAM).
The main memory 116 may be accessed after “misses” occur in the primary cache 104 and/or the overflow cache 106. In one embodiment, main memory 116 may be the next level of memory subsequent to the primary cache 104 and the overflow cache 106. The main memory 116 may have a larger capacity and may operate slower than both the primary cache and the overflow cache 106. A store queue (not shown in
The memory subsystem 208 may be external to the processing chips 206 and may include portions of the memory subsystem 118 that were discussed in
Other embodiments of the primary cache 302 may be a directly mapped cache or a fully associative cache. A directly mapped cache may map one memory location within main memory 300 to one memory location of the primary cache 302. In other words, a directly mapped cache may be a one-way set associative version of the primary cache 302. A fully associative cache is one in which each entry in the main memory 300 may be mapped to any of the memory locations of the primary cache 302. Using
Cache parameters for the overflow cache, such as the mapping of addresses to main memory, capacity, and cache replacement policies, may be flexibly adjusted depending on the overflow cache performance and the “miss” rate of the primary cache. Similar to the primary cache 402, the overflow cache may be configured to map to main memory 400 as a fully associative, set associative, or directly mapped cache, as discussed above. The mapping associativity for the overflow cache may be the same as or differ from that of the primary cache 402. For example, the primary cache 402 and the overflow cache may both be a four-way associative cache and have a 1:1 ratio in the number of associative “ways.” In other embodiments, the primary cache 402 may be an M-way associative cache, while the overflow cache is an N-way associative cache, where the value of M differs from the value of N. Moreover, the capacity of the overflow cache may be adjusted and may not be a fixed size. For example, the overflow cache may initially have a capacity of about eight kilobytes (KB). The capacity of the overflow cache may be increased to 32 KB when the “miss” rate is too high for the primary cache. The capacity of the primary cache may also be the same as or differ from the capacity of the overflow cache.
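The snippet below is a hypothetical configuration sketch of these adjustable parameters, using the capacities from the example above and an assumed 64-byte cache line size (the line size is not specified in the text).

```python
# All values below are illustrative; the 64-byte line size is an assumption.
LINE_SIZE_BYTES = 64

primary_ways = 4                  # M-way set associative primary cache 402
overflow_ways = 4                 # N-way overflow cache; M and N may differ

primary_capacity = 32 * 1024      # 32 KB primary cache
overflow_capacity = 8 * 1024      # 8 KB overflow cache, which could be raised to
                                  # 32 KB if the primary cache "miss" rate is too high

primary_sets = primary_capacity // (LINE_SIZE_BYTES * primary_ways)      # 128 sets
overflow_sets = overflow_capacity // (LINE_SIZE_BYTES * overflow_ways)   # 32 sets
```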
A variety of cache replacement policies, such as Belady's algorithm, least recently used (LRU), most recently used (MRU), random replacement, and first-in-first-out (FIFO), may be used to determine which cache entry (e.g. cache line) is evicted from the overflow cache and/or the primary cache 402. The overflow cache may also be configured with cache replacement policies that differ from those of the primary cache 402. For example, the overflow cache may be configured with the random replacement cache replacement policy, while the primary cache 402 may be configured with an LRU cache replacement policy. The cache replacement policy for the overflow cache may be adjusted to minimize the “miss” rate for the primary cache 402 and the overflow cache.
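For illustration, the following sketch shows two victim-selection policies that could be assigned independently to the two caches, e.g. LRU for the primary cache 402 and random replacement for the overflow cache; the helper names and data structures are hypothetical.

```python
import random

# Each cache set is modeled as a dict ordered from least to most recently used
# (Python dicts preserve insertion order). These helpers are hypothetical.

def lru_victim(cache_set):
    # Evict the least recently used tag (the oldest key in the dict).
    return next(iter(cache_set))

def random_victim(cache_set):
    # Evict any resident tag with equal probability.
    return random.choice(list(cache_set))

def replace(cache_set, num_ways, new_tag, new_data, choose_victim):
    # Insert new_data, evicting an entry chosen by the supplied policy if the set is full.
    if len(cache_set) >= num_ways:
        cache_set.pop(choose_victim(cache_set))
    cache_set[new_tag] = new_data

# Example: LRU for the primary cache, random replacement for the overflow cache.
# replace(primary_set, 4, tag, data, lru_victim)
# replace(overflow_set, 4, tag, data, random_victim)
```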
Method 600 may start at block 602. Blocks 602, 604, 606, and 608 may be substantially similar to blocks 502, 504, 506, and 508 of method 500. Moreover, blocks 602 and 604 may be performed in parallel by method 600, similar to blocks 502 and 504 for method 500. At block 610, method 600 may select an entry (e.g. a cache line) within the primary cache to write data. In contrast to the write-through policy, an entry within the primary cache may be selected because the write-back policy initially writes into the primary cache and not to the main memory. Method 600 may use any of the cache replacement policies (e.g. FIFO) well known in the art at block 610. Method 600 then moves to block 612 and determines if the entry is “dirty” within the primary cache. If the entry is “dirty” (e.g. data has not been written into the main memory), then the method 600 may move to block 614. Conversely, if the entry is not “dirty”, the method 600 moves to block 622. At block 622, method 600 writes data into the selected entry within the primary cache. Afterwards, method 600 may proceed to block 624 to mark the entry within the primary cache as “dirty,” and subsequently ends.
Returning to block 614, method 600 determines if the overflow cache is full. The overflow cache is full when all of the allocated overflow cache entries for the “dirty” entry within the primary cache already store data. For example, for an N-way set associative overflow cache, the overflow cache is full when all N overflow cache locations allocated for the “dirty” entry within the primary cache already store data. If the overflow cache is full, then method 600 moves to block 616 and selects an overflow cache entry to receive the data from the “dirty” entry of the primary cache. As discussed above, method 600 may use any cache replacement policy that is well known in the art when selecting the overflow cache entry. Subsequently, method 600 moves to block 618 and writes the data located within the selected overflow cache entry into main memory. Method 600 then moves to block 620. Returning back to block 614, when the overflow cache is not full, then method 600 continues to block 620. At block 620, method 600 writes the data within the “dirty” entry of the primary cache into the selected overflow cache entry. After method 600 completes block 620, method 600 moves to block 610 and performs the block functions as described above.
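The following simplified Python model approximates the write-back flow of blocks 610-624 under the assumptions stated in the comments; it is a sketch of the described behavior, not the actual implementation, and all names are hypothetical.

```python
def write_back(primary_set, overflow_set, primary_ways, overflow_ways,
               main_memory, tag, data,
               choose_primary_victim, choose_overflow_victim):
    # Make room in the primary set when the incoming line is not already resident.
    if tag not in primary_set and len(primary_set) >= primary_ways:
        victim_tag = choose_primary_victim(primary_set)                  # block 610
        victim = primary_set.pop(victim_tag)
        if victim["dirty"]:                                              # block 612
            if len(overflow_set) >= overflow_ways:                       # block 614
                ov_tag = choose_overflow_victim(overflow_set)            # block 616
                main_memory[ov_tag] = overflow_set.pop(ov_tag)["data"]   # block 618
            overflow_set[victim_tag] = {"data": victim["data"],
                                        "dirty": True}                   # block 620
    primary_set[tag] = {"data": data, "dirty": True}                     # blocks 622, 624
```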
At block 704, if method 700 determines that there is no overflow cache “hit,” method 700 may move to block 706 to select a replacement entry within the primary cache. Method 700 may use any cache replacement policy well known in the art. Afterwards, method 700 may proceed to block 708 and read the data from the main memory. Method 700 reads the data from the main memory because no “hits” occurred within the primary cache and the overflow cache. Method 700 may then continue to block 710 and load the data read from main memory into the replacement entry within the primary cache. Method 700 loads the data read from main memory because “misses” occurred within both the primary cache and the overflow cache. At block 710, method 700 may evict data already stored in the primary cache when loading data read from the main memory. Afterwards, method 700 may proceed to block 712, and return the data to a core logic unit (e.g. a processor).
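A minimal sketch of this read path, assuming the same hypothetical dictionary-based cache model used in the earlier sketches, might look like the following.

```python
def read(primary_set, overflow_set, primary_ways, main_memory,
         tag, choose_primary_victim):
    if tag in primary_set:
        return primary_set[tag]["data"]                       # primary cache "hit"
    if tag in overflow_set:
        return overflow_set[tag]["data"]                      # overflow cache "hit"
    # "Misses" in both caches: fetch from main memory and fill the primary cache.
    if len(primary_set) >= primary_ways:
        primary_set.pop(choose_primary_victim(primary_set))   # block 706: replacement entry
    data = main_memory[tag]                                   # block 708: read main memory
    primary_set[tag] = {"data": data, "dirty": False}         # block 710: load into primary
    return data                                               # block 712: return to core logic unit
```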
Additionally, the capacity of the primary cache and the overflow cache may vary relative to each other. For example, in one embodiment, the capacity of the primary cache and the overflow cache may be a 1:1 ratio, such as a 32 KB capacity for both the primary cache and the overflow cache. In this instance, each primary cache block 1-4 910 and each overflow cache block 1-4 912 may have a capacity of 8 KB (32 KB/4 blocks). In another embodiment, the capacity of the primary cache and the overflow cache may be a 4:1 ratio, such as having a 32 KB capacity for the primary cache and an 8 KB capacity for the overflow cache. For this configuration, each primary cache block 1-4 910 may have a capacity of 8 KB (32 KB/4 blocks), and each overflow cache block 1-4 912 may have a capacity of 2 KB (8 KB/4 blocks).
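The per-block arithmetic above can be restated in a few lines of illustrative code; the four-block division and the capacities follow the examples in the text.

```python
NUM_BLOCKS = 4  # both caches are divided into four blocks in this example

def block_capacity_kb(total_capacity_kb):
    return total_capacity_kb // NUM_BLOCKS

# 1:1 configuration: 32 KB primary cache, 32 KB overflow cache
assert block_capacity_kb(32) == 8   # 8 KB per primary cache block 910
assert block_capacity_kb(32) == 8   # 8 KB per overflow cache block 912

# 4:1 configuration: 32 KB primary cache, 8 KB overflow cache
assert block_capacity_kb(32) == 8   # 8 KB per primary cache block 910
assert block_capacity_kb(8) == 2    # 2 KB per overflow cache block 912
```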
The MMU/translation table 904 may be configured to translate virtual addresses to physical addresses or vice versa. The MMU/translation table 904 may be configured to translate virtual addresses to physical addresses when the primary cache blocks 910 and the overflow cache blocks 912 are configured to store physical addresses. The MMU/translation table 904 may comprise an address translation table that includes entries that map the virtual addresses to physical addresses. The MMU/translation table 904 may be further configured to maintain page information, perform permission tracking, and implement memory protection. As shown in
The primary cache tag block 906 may reference the main memory address for the data stored within each of the primary cache blocks 910. As such, the primary cache tag block 906 may provide four different tag addresses, one for each of the primary cache blocks 910. Using
The primary cache tag block 906 and the overflow cache tag block 908 may provide the tag addresses selected using the memory access command 902 and feed the tag addresses into the tag compare components 916. The tag compare components 916 may be additional computational logic that compares the inputted tag addresses with the translated physical memory address to determine whether a match occurs and outputs a value to the “way” mux 914. For example, if at least one of the tag addresses matches the translated physical memory address, the tag compare component 916 may output a value that selects the corresponding primary cache block 910 and/or the overflow cache block 912. Otherwise, the tag compare component 916 may generate a “null” value (e.g. a value of “0”) so that the “way” mux 914 does not select any of the data provided by the primary cache blocks 910 and/or the overflow cache blocks 912.
The primary cache blocks 1-4 910 and the overflow cache blocks 1-4 912 may use the memory access command 902 to select the relevant cache entries, and output the data within the cache entries to the “way” mux 914. The “way” mux 914 may receive the input from the tag compare components 916 and determine whether to select any one of the data inputs from the primary cache blocks 1-4 910 or from the overflow cache blocks 1-4 912. One “way” mux 914 may determine whether the primary cache stores the data requested in the memory access command 902, while the second “way” mux 914 may determine whether the overflow cache stores the data requested in the memory access command 902. When one of the primary cache blocks 910 stores the data requested in the memory access command 902, the “way” mux 914 may generate a primary cache read data out 918 that corresponds to a “hit” in the primary cache. When one of the overflow cache blocks 912 stores the data requested in the memory access command 902, the other “way” mux 914 may generate an overflow cache read data out 920 that corresponds to a “hit” in the overflow cache. A “miss” occurs within the primary cache and/or the overflow cache when there is no primary cache read data out 918 and/or no overflow cache read data out 920.
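As an illustration only, the following sketch models the tag-compare and “way”-selection behavior described above in software; in hardware the comparisons would occur in parallel, and the structure and function names here are hypothetical.

```python
def way_mux(blocks, index, lookup_tag):
    # blocks: one dict per cache block/way, mapping a set index to a
    # (stored_tag, data) pair read out for the access.
    for block in blocks:
        entry = block.get(index)
        if entry is not None and entry[0] == lookup_tag:   # tag compare 916 matched
            return entry[1]                                # data selected by "way" mux 914
    return None                                            # "null": miss in this cache

# One mux instance serves the four primary cache blocks 910 and another serves
# the four overflow cache blocks 912; both are evaluated for the same access:
# primary_read_data_out = way_mux(primary_blocks, index, physical_tag)
# overflow_read_data_out = way_mux(overflow_blocks, index, physical_tag)
```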
The main memory address within the memory access command 902 may be split such that the overflow cache tag block 908 and the primary cache tag block 906 pertain to the most significant bits, while the primary cache blocks 910 and overflow cache blocks 912 pertain to the least significant bits. For example, if the main memory has a capacity of 4 gigabytes (GB), 32 bits may be used to represent the different main memory addresses (e.g. 2^32=4,294,967,296). If each of the primary cache blocks 910 has a capacity of 8 KB (e.g. the total capacity of the primary cache equals 32 KB), then the lower 13 bits may be used to reference the memory address spaces for the primary cache blocks 910 (e.g. 2^13=8192). For example, if the lower 13 bits of the main memory address are “0000000000000,” the “0000000000000” may reference the first address space within each of the primary cache blocks 910. The upper 19 bits may then be used to reference the memory address spaces for the primary cache tag block 906. In another embodiment, the primary cache and the overflow cache may split the main memory address such that the most significant bits (MSBs) are designated for the tag address, the middle bits are designated for the data blocks, and the least significant bits (LSBs) may be reserved for flag bits, such as designating whether a cache entry is “dirty.” Persons of ordinary skill in the art are aware that other cache entry structures may be used that split the main memory addresses differently than described above.
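The bit split in this example can be illustrated with a short sketch; the 13 index bits and 19 tag bits follow the 8 KB block and 4 GB main memory figures above, and the sample address is arbitrary.

```python
INDEX_BITS = 13                        # 2**13 = 8192 addresses per 8 KB block
ADDRESS_BITS = 32                      # 2**32 addresses = 4 GB main memory
TAG_BITS = ADDRESS_BITS - INDEX_BITS   # 19 tag bits

def split_address(address):
    index = address & ((1 << INDEX_BITS) - 1)  # lower 13 bits select the block entry
    tag = address >> INDEX_BITS                # upper 19 bits stored in the tag block
    return tag, index

tag, index = split_address(0x80002A40)         # arbitrary sample address
assert tag == 0x40001 and index == 0x0A40
```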
It is understood that by programming and/or loading executable instructions onto the general-purpose computer system 100, at least one of the core logic units 102, the memory subsystem 118, and the secondary storage 109 are changed, transforming the computer system 100 in part into a particular machine or apparatus, e.g., a network node, having the novel functionality taught by the present disclosure. It is fundamental to the electrical engineering and software engineering arts that functionality can be implemented by loading executable software into a computer, which can be converted to a hardware implementation by well-known design rules. Decisions between implementing a concept in software versus hardware typically hinge on considerations of stability of the design and numbers of units to be produced rather than any issues involved in translating from the software domain to the hardware domain. Generally, a design that is still subject to frequent change may be preferred to be implemented in software, because re-spinning a hardware implementation is more expensive than re-spinning a software design. Generally, a design that is stable and that will be produced in large volume may be preferred to be implemented in hardware, for example in an ASIC, because for large production runs the hardware implementation may be less expensive than the software implementation. Often a design may be developed and tested in a software form and later transformed, by well-known design rules, to an equivalent hardware implementation in an application specific integrated circuit that hardwires the instructions of the software. In the same manner as a machine controlled by a new ASIC is a particular machine or apparatus, likewise a computer that has been programmed and/or loaded with executable instructions may be viewed as a particular machine or apparatus.
At least one embodiment is disclosed and variations, combinations, and/or modifications of the embodiment(s) and/or features of the embodiment(s) made by a person having ordinary skill in the art are within the scope of the disclosure. Alternative embodiments that result from combining, integrating, and/or omitting features of the embodiment(s) are also within the scope of the disclosure. Where numerical ranges or limitations are expressly stated, such express ranges or limitations should be understood to include iterative ranges or limitations of like magnitude falling within the expressly stated ranges or limitations (e.g., from about 1 to about 10 includes 2, 3, 4, etc.; greater than 0.10 includes 0.11, 0.12, 0.13, etc.). For example, whenever a numerical range with a lower limit, Rl, and an upper limit, Ru, is disclosed, any number falling within the range is specifically disclosed. In particular, the following numbers within the range are specifically disclosed: R=Rl+k*(Ru−Rl), wherein k is a variable ranging from 1 percent to 100 percent with a 1 percent increment, i.e., k is 1 percent, 2 percent, 3 percent, 4 percent, 5 percent, . . . , 70 percent, 71 percent, 72 percent, . . . , 95 percent, 96 percent, 97 percent, 98 percent, 99 percent, or 100 percent. Moreover, any numerical range defined by two R numbers as defined in the above is also specifically disclosed. The use of the term about means ±10% of the subsequent number, unless otherwise stated. Use of the term “optionally” with respect to any element of a claim means that the element is required, or alternatively, the element is not required, both alternatives being within the scope of the claim. Use of broader terms such as comprises, includes, and having should be understood to provide support for narrower terms such as consisting of, consisting essentially of, and comprised substantially of. Accordingly, the scope of protection is not limited by the description set out above but is defined by the claims that follow, that scope including all equivalents of the subject matter of the claims. Each and every claim is incorporated as further disclosure into the specification and the claims are embodiment(s) of the present disclosure. The discussion of a reference in the disclosure is not an admission that it is prior art, especially any reference that has a publication date after the priority date of this application. The disclosure of all patents, patent applications, and publications cited in the disclosure are hereby incorporated by reference, to the extent that they provide exemplary, procedural, or other details supplementary to the disclosure.
While several embodiments have been provided in the present disclosure, it should be understood that the disclosed systems and methods might be embodied in many other specific forms without departing from the spirit or scope of the present disclosure. The present examples are to be considered as illustrative and not restrictive, and the intention is not to be limited to the details given herein. For example, the various elements or components may be combined or integrated in another system or certain features may be omitted, or not implemented.
In addition, techniques, systems, subsystems, and methods described and illustrated in the various embodiments as discrete or separate may be combined or integrated with other systems, modules, techniques, or methods without departing from the scope of the present disclosure. Other items shown or discussed as coupled or directly coupled or communicating with each other may be indirectly coupled or communicating through some interface, device, or intermediate component whether electrically, mechanically, or otherwise. Other examples of changes, substitutions, and alterations are ascertainable by one skilled in the art and could be made without departing from the spirit and scope disclosed herein.
The present application claims priority to U.S. Provisional Patent Application No. 61/616,742 filed Mar. 28, 2012 by Yolin Lih, et al. and entitled “Concurrently Accessed Set Associative Victim Cache,” which is incorporated herein by reference as if reproduced in its entirety.