This application is related to processor cache technology.
The processor cores 111 may be any one of a variety of processors such as a central processing unit (CPU) or a graphics processing unit (GPU). For instance, they may be x86 microprocessors that implement x86 64-bit instruction set architecture and are used in desktops, laptops, servers, and superscalar computers, or they may be Advanced RISC (Reduced Instruction Set Computer) Machines (ARM) processors that are used in mobile phones or digital media players. Other embodiments of the processors are contemplated, such as Digital Signal Processors (DSP) that are particularly useful in the processing and implementation of algorithms related to digital signals, such as voice data and communication signals, and microcontrollers that are useful in consumer applications, such as printers and copy machines.
The processor cores 111 are computational centers responsible for performing a multitude of computational tasks that enable the multi-processor system 100 to operate. The processor cores 111 may include execution units that perform additions, subtractions, and shifting and rotating of binary digits, among many other computations and may also include address generation and load and store units that perform address calculations for memory addresses and the loading and storing of data from memory. The collection of these operations performed by the processor cores 111 drives computer applications to run.
The processor cores 111 may each have local caches 112A-112D (hereinafter collectively referred to my numeral alone), which are small storage spaces where commonly used instructions or data are placed. Local caches 112 are advantageous because of their close proximity to a processor core 111; a small memory access latency is experienced by a processor core 111 in obtaining instructions or data from a local cache 112. In many implementations, a processor core 111 seeking data will look to find the data in its local cache 112 before looking elsewhere in the memory hierarchy. However, because local caches are typically expensive to implement, they are limited to a small size. Examples of local caches 112 are Level 1 (L1) instruction or data caches.
In addition to having local caches 112, the processor cores 111 of multi-processor system 100 may also have shared caches 113 (hereinafter collectively referred to by numeral alone). A shared cache 113 is shared by the two processor cores 111 of a processing node 110 and is typically larger in size than the local caches 112. A shared cache 113 may be the next level in the memory hierarchy of the multi-processor system 100, such that the processor cores 111 may look to find data in the shared cache 113 of their “home” processing node 110 when it has been determined that their own local caches 112 do not contain the data. A shared cache 113 may be inclusive, meaning that the contents of the local caches 112 of the processor cores 111 are replicated in the shared cache 113. Conversely, a shared cache 113 may be an exclusive cache, meaning that data contained in the local caches 112 of the processor cores 111 is not necessarily contained in the shared cache 113. An example of a shared cache 113 is a Level 2 (L2) cache that is shared amongst the processor cores 111 of the processing node 110.
Embodiments of a method and apparatus for repurposing a portion of a multi-processor system cache are provided. A first portion of a first cache is designated to hold pointer entries, where the pointer entries provide an indicator to a second cache that holds memory data requested from the first cache. Further, in the method and apparatus, a second portion of the first cache is designated to hold memory data entries that are accessed by a request to the first cache. In some embodiments, the first cache is a higher-level cache and the second cache is a lower-level cache.
In other embodiments, the pointer entries are held in a state field of the first cache and a data field associated with the state field is repurposed for storage, where the data field of the first cache that is repurposed for storage is used for probe filter storage. In yet other embodiments, the state field associated with the data field designated for holding memory data entries is used for a memory coherency protocol.
A more detailed understanding may be had from the following description, given by way of example in conjunction with the accompanying drawings wherein:
As seen in
Like a shared cache 113, the high level cache 120 may be exclusive, such that it does not contain the data stored in the shared caches 113. The high level cache 120 may, alternatively, be inclusive, such that it contains the data stored in the shared caches 113 of the processing nodes 110. However, inclusiveness may limit the effectiveness of the high level cache 120, where, for instance, half of an 8 Mega Byte (MB) high level cache 120 may be dedicated to replicating the data of the shared caches 113 of the four processing nodes 110, each of a 1 MB size. In that instance, only 4 MB would be left for caching purposes in the high-level cache 120. Therefore, more of the resources of a high-level cache 120 may be available when the cache is exclusive and it does not replicate the data stored in lower-level caches.
As mentioned previously, caching is useful for keeping a copy of system memory data (where system memory may include RAM or hard disk memory) close by to the processor cores 111 for ease of access. A processor core 111 can access data that is present in its caches (local cache 112, shared cache 113, or high level cache 120) faster than it can access data from its RAM. Therefore, caching may reduce memory access latency and result in faster execution of computational tasks by the multi-processor system 100, since the processor cores 111 may not have to wait as long for needed data to be brought to them.
Data in the system memory and local caches of a multi-processor system 100 is referenced by memory addresses, where a processor core 111 seeking memory data sends a request asking for memory data residing in a specific memory address. Memory addresses may have any number of bits, where for instance, in some implementation 48 bits are used to index a byte or an octet in memory. When a processor core 11 requires data, it sends out a request that includes the memory address for the needed data. The hierarchy of caches—local caches 112, shared cache 113, and the high level cache 120—are searched, and if they do not contain the data for the requested memory address, then system memory is searched. Because data is requested according to its memory address, caches are structured to utilize memory addressing in order for fast searching to be performed.
The cache 200 also has a state field 203 for every data field 201. The state field 203 may be any number of bits but usually comprises several bits that give indications about the state of the data within the corresponding data field 201. One of these state bits may be a valid bit, for instance, that indicates whether the data in the data field 201 is valid. If the valid bit of the state field 203 indicates that the data in a data field 201 is valid, then the data can be outputted and used. Alternatively, if the valid bit of the state field 203 indicates that the data in a data field 201 is invalid, the data may not be outputted when requested. Further, some bits of the state field 203 may be used to maintain cache coherence information regarding data. Cache coherence, which will be discussed in further detail herein, is important in multi-processor systems, such as multi-processor system 100, where copies of memory data are commonly held in different processor core caches (such as local caches 112) and are written to or modified in these caches by processor cores 111. Cache coherence is the process of updating other caches in a multi-processor system with the up-to-date information regarding data. For instance, when two processor cores 111 have copies of the same memory data in their respective local caches 112, and one processor core 111 operates on the data and changes it, through cache coherency the other processor core 111 whose local cache 112 holds the data is informed of the changes and may therefore label its own version of the data as no longer valid.
Cache 200 also has two memory address-related components: an index 204, comprised of index entries, and a tag field 202. The index 204 and tag field 202 are both drawn from the memory address of the data that is stored in the cache 200. The index 204 of cache 200 is 1024 in length, and it may be referenced using 10 bits. Each data field 201 has a corresponding entry in the index 204. For instance, if a data field 201 of cache 200 contained 8 bytes of data, then the cache 200 would be (1024)*8 bytes in size, or 8 kilo bytes (kB). The tag field 202 is also used for memory addressing purposes, as will be shown in more detail in
Because cache 300 has an index 304 of size 1024 (which may be fully referenced using 10 bits), the 10-bits in the memory address 310 making up the index 312 are used to point to the appropriate index entry of index 304 of the cache 300. After the index 312 of the memory address 310 is used to point to the appropriate index entry of index 304 in the cache 300, the tag 311 of the memory address 310 is compared against the corresponding tag field 302. If the tag 311 of the memory address 310 matches the tag field 302, this indicates that the requested data is contained in the corresponding data field 301. However, if the address tag 311 does not match the tag field 302, then this indicates that the data contained in the data field 301 is not that of the requested memory address 310.
Equator 320 is used to determine if the tag 311 of memory address 310 matches the tag field 302 corresponding to the index entry that matched the index 312. If there is a match, then line 321 is asserted (with an output of 1) and vise-versa. However, even if the data for the requested address is contained in a data field 301, it may or may not be valid. For instance, the data may not be current or may have been subsequently overwritten. To account for these possibilities, the state field 303 has a valid state 303a, where if the valid state is asserted (with an output of 1), then it is implied that the data is valid and vise-versa. Thereafter, the logical conjunction (using AND gate 322) of the valid bit 303a of the state field 303 and the output 321 of equator 320 is taken to indicate a cache “hit” when it is asserted (with an output of 1) and a cache “miss” when it is not asserted (with an output of 0). When there is a cache hit, the data in the indicated data field 301 of the cache 300 is outputted 330. As previously mentioned, the outputted data 330 is 8 bytes in size. The byte offset 313 indicates the byte position of the needed data in a data field 301. For instance, 000 may indicate that the requested memory address 310 is the first byte in a data field 301, whereas 111 may indicate the requested memory address 310 is the last byte in a data field 301. Thereby, the byte offset 313 of the memory address 310 is used to select the requested byte.
Those skilled in the art will recognize that cache 300 is a directly-mapped cache because any two memory addresses that share an address index, such as address index 312, will only be mapped to one location in the cache 300—the location pertaining to the matching index entry of index 304. A cache may, on the other hand, be set-associative, which implies that memory addresses that share an index may be mapped to more than one location in the cache. Caches may be associative in any number of ways. For instance, a cache may be 2, 4, 8, 16, or 32-way set-associative. The number of ways indicates the number of possible places in the cache that a certain memory address may belong.
The cache 400 shown in
The state field 403 may comprise any number of bits, where some of these bits may be used for the purposes of cache coherency. Some state fields may follow the MOESI (modified, owned, exclusive, shared, invalid) coherency protocol. Table 1 shows the meaning of the states of the MOESI protocol.
In some embodiments, three of the state field bits may be used to indicate an MOESI state, where 3′b0xx=Invalid, 3′b100=Exclusive, 3′101=Shared, 3′b110=Modified and 3′b111=Owned. Further, some of the state bits may indicate a core that originally placed the data in the cache.
When a data request for a memory address 410 is received, in order to determine whether the data pertaining to the memory address is present in cache 400, the index 412 of the memory address is used to point to the proper index entry in the index 404 of the cache 400 where the data may be held. Because cache 400 is n-way set associative, the index 412 points to the data being present in any of the data fields 401 of the n-ways 405 of the cache 400 having the same index entry in index 404 as the memory address index 412. To determine whether the data resides in the cache, all n tag fields 4020-402n are compared to the address tag 411 to determine whether there is a match. This comparison is done using equators 4200-420n. If any of the tag fields 402 match the address tag 412, then a hit is declared and the corresponding line 4210-421n is asserted (with an output of 1) and the corresponding data is outputted 422 from the data field 401. However, if the tag 411 of the address does not match the tag field 402 of any of the n-ways at the particular index entry in index 404, then it is determined that the cache 400 does not hold the needed data and a cache miss is declared.
The state field 403 pertaining to the outputted data 422 is used to identify whether the data is valid or current, where, for instance, if the invalid bit is asserted the data may be deemed to be not useful and may, therefore, not be subsequently used. If the exclusive bit of the state field 403 is asserted, then it is implied that the cache holds the most recent, correct copy of the data, where the copy in main memory is also the most recent, correct copy of the data and no other processor holds a copy of the data. Table 1 may be used to determine the meaning of the remaining states, if an MOESI protocol is used for the state field 403. Additionally, a cache may also use other coherency protocols for the state field 403 that are well known to those skilled in the art.
In some embodiments, the portions of a cache containing data fields are re-designated for purposes other than the storage of cache data.
An example of an application that may use re-designated or repurposed data is a probe filter, which is also known as a snoop filter or a memory coherency filter. A probe filter requires data storage space, such as the repurposed data 530 field, for minimizing memory coherency traffic in a multi-processor system. As mentioned previously, memory coherency is important in multi-processor systems, like multi-processor system 100 in
Rather than wasting the tag field 531 and the state field 532 associated with the repurposed data 530 of the cache 500, these fields may be utilized in a multi-processor system for a neighbor cache directory. A neighbor cache directory provides an indication to a processor core of whether requested data may be present in another processor core's cache. Therefore, rather than obtaining data from system memory, such as RAM, data may be obtained from a neighbor core's cache.
Returning to
In many instances, the data requested by a processor core 111A may be present in the local caches 112 or shared cache 113 of a processing node other than the home node (e.g. processing node 110D). For instance, the shared cache 113D of processing node 110D may hold a copy of the requested memory data. Therefore, a processor core 111A may rather obtain the requested data from the shared cache 113D of processor node 110D than obtain the data from system memory, as there may be a smaller amount of latency in obtaining the data from the neighbor cache (i.e., shared cache 113D) than from the system memory.
A neighbor cache directory utilizes the unused state and tag fields of repurposed data to hold information regarding whether data requested from the cache is present in other caches in a multi-processor system. The portion of a cache designated for state fields corresponding to repurposed data may itself be repurposed to include a pointer that indicates that another core's cache holds the requested data. The repurposing of the portion of the cache designated for state fields is possible because with the data field being repurposed, these state fields are otherwise unused.
For instance, the pointer field 604B may be a 3-bit field (000 to 111) that points to a neighbor cache having a copy of the requested memory address. In this example, the repurposed pointer field 604B can point to any one of eight neighbor caches. When this information is provided, the data may be fetched to the requesting core from a neighbor cache as opposed to being fetching from system memory.
The state field 603B of cache B may use a sub-set of the states used by cache A 650A. Further, these states may have a different meaning or the same meaning. For instance, cache B 650B may use the MOESI modified state to indicate that the data is in the MOESI modified in the neighbor core's cache.
The received memory address 710 comprises a tag 711, an index 712, and a byte offset 713. The cache 700 is n-way set associative, where the portion of the cache 700 designated for data fields 701n of way-n 705n has been repurposed to be a probe filter for the minimization of cache coherency traffic. Although the entirety of the portion of cache 700 designated for way-n 705n data fields 701n is taken up by the probe filter in this embodiment, in other embodiments the repurposed data may take up the data fields 701 associated with more than one of the ways 705 of the cache 700, or may take up portions of one or more ways 705. For instance, the repurposed data 530 in
Way-07050 through way-(n−1) 705n-1 of the cache 700 are used for conventional caching purposes and each have data fields 701, tag fields 702, state fields 703, and an index 704. The index 712 of the requested memory address 710 is used to point to a corresponding entry in the index 704. Thereafter, the tag fields 702 of all the n ways 7050-705n-1 corresponding to the matched index entry are compared using equator 720 with the tag 711 of the received memory address 710. If a match exists on any of the portions of the cache containing data (way-07050 through way-(n−1) 705n-1, in this embodiment) then a cache hit is declared and the corresponding cache hit/miss line 7210-721n-1 is asserted. Thereafter, the data corresponding to the matching tag field is outputted 722 and the processor core need for data is, therefore, satisfied. The state field 7030-703n-1 indicates the state of the matched data, and may be according to any one of the memory coherency protocols (MOESI, for instance). If a cache miss is declared then no data may be outputted and the processor core's need for data is not met. The core will likely need to request the data from system memory, or if this cache 700 is not the highest level cache in the multi-processor system's memory hierarchy, then the data may be requested from a higher level cache. However, because this cache also comprises a repurposed neighbor cache directory, it may yield a neighbor cache directory hit and the memory data may be requested from a neighbor core's cache that has the desired data.
The neighbor cache directory is searched in the same manner as that of a conventional cache search described herein. The index 712 of the memory address 710 points to an entry in the index 704 where tags are to be compared. Thereafter, the memory address 710 tag 711 is compared (using equator 720n) to the tag field 702n of the matching index entry. If there is a match corresponding to the neighbor cache directory, then the neighbor cache hit/miss line 721n is asserted (with an output of 1) to indicate a neighbor cache hit. Because the match was in a region of the cache whose data fields 701n have been repurposed, no data is outputted. Instead, since this region of the cache represents the neighbor cache directory, the pointer field 706n is outputted. The outputted neighbor cache pointer 723n indicates another cache location in the system where the requested data is present. The neighbor cache pointer 723n may be used to obtain the requested data from another cache in the system. For instance, a memory controller may use the pointer to provide the data to the requesting core from another core's cache. The state field 707n of the neighbor cache directory may be used to maintain information regarding the neighbor cache.
In some embodiments, a cache including a neighbor cache directory may be a higher level “victim” cache, where data evicted or removed (i.e., due to lack of capacity) from lower-level caches is stored. Further, as previously mentioned, a cache with a neighbor cache directory may be the highest level cache in a multi-processor system's memory access hierarchy.
The memory addresses that are held in a neighbor cache directory, i.e. neighbor cache directory entries, may be populated in a variety of ways that are intended to make the memory hierarchy of a multi-processor system more efficient. Memory access data that is most likely to be shared among processor cores may be important to include in a neighbor cache directory. Communication data between processor cores is one type of data that is commonly read and modified by these cores and copies of the data are maintained in the cores' lower-level caches. It would therefore be useful to include entries for this data in the neighbor cache directory so that a requesting core may take advantage of the neighbor cache and be able to access this data with lower latency.
In other embodiments, a higher-level cache with a neighbor cache directory may receive a request for data that is in the modified state (i.e., the cache holds the most recent, correct copy of the data, the copy in main memory is stale (incorrect), and no other processor holds a copy). If the data were to be given to the processor core and removed from the higher level cache, then it may be useful to include it in the neighbor cache directory and provide a pointer to the requesting core's cache.
Further, a neighbor cache entry may be updated because of a neighbor cache hit. For instance, when one core requests data from a high-level cache with a neighbor cache directory and the request results in a neighbor cache hit, then a pointer is provided to point to a core whose cache has the data. In this instance, the requested data will be delivered to the requesting core from the other core's cache and the requesting core may keep the data in its own cache. Therefore, it is useful for the neighbor cache entry to be updated in order to point to the cache of the requesting core since this core having requested the data may now operate on and change the data.
In some embodiments memory address data that is present in the cache may be removed from the neighbor cache directory, as it is redundant to maintain memory data in the cache and also maintain an entry for a pointer to a lower-level cache containing the same data. Furthermore, in some embodiments, if a lower-level cache evicts data that is pointed to by the neighbor cache directory, the neighbor cache entry may be removed and the data may be placed in the high-level cache instead.
In other embodiment, when a core requests data and this data is provided to the core from outside the core's own multi-processor system memory hierarchy (such as another multi-processor system's caches), it is useful for the neighbor cache directory to be updated to include a pointer to the requesting core's cache that now holds the data. This data may now also be requested by other cores in the multi-processor system and a neighbor cache directory entry to the requesting core which now holds the data is beneficial to the operation of the multi-processor system. It is worth noting that a neighbor cache directory may only point to cache's within its own hierarchy and may therefore not point to cache's of a multi-processor system outside its own hierarchy. However, in the event that a requesting core receives data from outside its own multi-processor hierarchy, then, as previously mentioned, the neighbor cache may be updated to include a pointer to this received data.
In some embodiments, a processor core may have data in its local caches that other processor cores also have in their local caches in the shared state. As described in Table 1, a cache line in the shared state holds the most recent, correct copy of the data. Other processors in the system may hold copies of the data in the shared state, as well. If no other processor holds it in the owned state, then the copy in main memory is also the most recent. A core may wish to modify its data and may issue a request to other processor cores to render their own copies invalid. In this instance, it is useful for a neighbor cache directory to have an entry to point to the processor core's cache now modifying the data. It is worth noting that a core's request to invalidate the data in other cores' caches may sometimes fail, i.e., when there are other processor cores also attempting the same. It is, therefore, useful to only install a neighbor cache directory entry when the request does not fail.
As previously discussed, because communication data is shared amongst processor cores, they are good candidates for installation in the neighbor cache directory. Communication data may be data that results in a hit in other cores' caches in the system. Non-communication data, on the other hand, may be provided to a core from system memory. However, a neighbor cache directory may have entries to either type of data or both types of data, where an insertion algorithm may be utilized to determine which type of data to install in the neighbor cache directory.
In other embodiments, entries in the neighbor cache directory are not installed when there are indications that data is not being shared by multiple cores. For instance, when one processor cores installs data in a high-level cache and thereafter requests this data, in some embodiments, this data may not be installed in the neighbor cache directory as it was not shared by another core and may, therefore, not be a good candidate for sharing among other cores.
Although features and elements are described above in particular combinations, each feature or element may be used alone without the other features and elements or in various combinations with or without other features and elements. The methods or flow charts provided herein may be implemented in a computer program, software, or firmware incorporated in a computer-readable storage medium for execution by a general purpose computer or a processor. Examples of computer-readable storage mediums include a read only memory (ROM), a random access memory (RAM), a register, cache memory, semiconductor memory devices, magnetic media such as internal hard disks and removable disks, magneto-optical media, and optical media such as CD-ROM disks, and digital versatile disks (DVDs).
Suitable processors include, by way of example, a general purpose processor, a special purpose processor, a conventional processor, a digital signal processor (DSP), a plurality of processors, one or more processors in association with a DSP core, a controller, a microcontroller, Application Specific Integrated Circuits (ASICs), Field Programmable Gate Arrays (FPGAs) circuits, any other type of integrated circuit (IC), and/or a state machine. Such processors may be manufactured by configuring a manufacturing process using the results of processed hardware description language (HDL) instructions (such instructions capable of being stored on a computer readable media). The results of such processing may be maskworks that are then used in a semiconductor manufacturing process to manufacture a processor which implements aspects of the present invention.