Data processing systems, such as a System-on-a-Chip (SoC), may contain multiple processor cores, multiple data caches and shared data resources. In a shared memory system, for example, each of the processor cores may read and write to a single shared address space. Cache coherency is an issue in any system that contains one or more caches and more than one device sharing data in a single cached area. There are two potential problems with systems that contain caches. Firstly, memory may be updated by another device after a cached device has taken a copy. At this point, the data within the cache is out-of-date or invalid and no longer contains the most up-to-date data. Secondly, systems that contain write-back caches must deal with the case where a device writes to the local cached copy, at which point the memory no longer contains the most up-to-date data. A second device reading memory will then see out-of-date (stale) data.
Snoop filters, which monitor data transactions, may be used to ensure cache coherency.
Cache line based snoop filters are, in general, ‘fine grain’ (maintaining one bit for each source in a presence vector) or ‘coarse grain’ (each bit tracks more than one source). Fine grain snoop filters require more storage and become expensive as a system grows, while coarse grain snoop filters can lead to an increased amount of snooping. Designs adopt either fine grain or coarse grain filtering based on system needs.
With a coarse grain snoop filter, a snoop is never directed to exactly one source, since a presence bit always indicates more than one source. This leads to over-snooping whenever there is a unique owner of a cache line.
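The trade-off above can be sketched as follows. This is an illustrative sketch only; the function names and bit layout are assumptions, not part of any disclosed embodiment:

```python
def fine_grain_targets(presence_bits, num_sources):
    """One presence bit per source: snoop exactly the sources whose bit is set."""
    return [s for s in range(num_sources) if presence_bits & (1 << s)]

def coarse_grain_targets(presence_bits, num_sources, sources_per_bit):
    """One presence bit per subset: a set bit snoops every source in that subset."""
    targets = []
    for b in range(num_sources // sources_per_bit):
        if presence_bits & (1 << b):
            targets.extend(range(b * sources_per_bit, (b + 1) * sources_per_bit))
    return targets
```

If only source 5 of 8 holds a line, the fine grain filter snoops just that source, while a coarse grain filter with four sources per bit must snoop sources 4 through 7, three of which are snooped unnecessarily.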
The accompanying drawings provide visual representations which will be used to more fully describe various representative embodiments and can be used by those skilled in the art to better understand the representative embodiments disclosed and their inherent advantages. In these drawings, like reference numerals identify corresponding elements.
While this invention is susceptible of embodiment in many different forms, there is shown in the drawings and will herein be described in detail specific embodiments, with the understanding that the present disclosure is to be considered as an example of the principles of the invention and not intended to limit the invention to the specific embodiments shown and described. In the description below, like reference numerals are used to describe the same, similar or corresponding parts in the several views of the drawings.
In this document, relational terms such as first and second, top and bottom, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. The terms “comprises,” “comprising,” or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. An element preceded by “comprises . . . a” does not, without more constraints, preclude the existence of additional identical elements in the process, method, article, or apparatus that comprises the element.
Reference throughout this document to “one embodiment”, “certain embodiments”, “an embodiment” or similar terms means that a particular feature, structure, or characteristic described in connection with the embodiment is included in at least one embodiment of the present invention. Thus, the appearances of such phrases or in various places throughout this specification are not necessarily all referring to the same embodiment. Furthermore, the particular features, structures, or characteristics may be combined in any suitable manner in one or more embodiments without limitation.
The term “or” as used herein is to be interpreted as an inclusive or meaning any one or any combination. Therefore, “A, B or C” means “any of the following: A; B; C; A and B; A and C; B and C; A, B and C”. An exception to this definition will occur only when a combination of elements, functions, steps or acts are in some way inherently mutually exclusive.
For simplicity and clarity of illustration, reference numerals may be repeated among the figures to indicate corresponding or analogous elements. Numerous details are set forth to provide an understanding of the embodiments described herein. The embodiments may be practiced without these details. In other instances, well-known methods, procedures, and components have not been described in detail to avoid obscuring the embodiments described. The description is not to be considered as limited to the scope of the embodiments described herein.
Data processing systems, such as a System-on-a-Chip (SoC), may contain multiple processing devices, multiple data caches and shared data resources.
Note that many elements of a SoC, such as clocks for example, have been omitted in
Cache coherency is an issue in any system that contains one or more caches and more than one device sharing data in a single cached area. There are two potential problems with systems that contain caches. Firstly, memory may be updated by another device after a cached device has taken a copy. At this point, the data within the cache is out-of-date or invalid and no longer contains the most up-to-date data. Secondly, systems that contain write-back caches must deal with the case where a device updates the local cached copy, at which point the memory no longer contains the most up-to-date data. A second device reading memory will then see out-of-date (stale) data. Cache coherency may be maintained through use of a snoop filter.
When multiple RNs share a data or memory resource, a coherence protocol may be used, and nodes may be referred to as fully coherent (e.g. RN-F and HN-F) or I/O coherent (e.g. RN-I). Other devices may provide connections to another integrated circuit (e.g. RN-C and HN-C). To maintain coherence, each RN includes a cache controller 114 that accepts load and store instructions from the processor cores. The cache controller 114 also issues and receives coherence requests and responses via the interconnect circuit 106 from other nodes.
Home nodes 108 include a system cache 116. Herein, the system cache 116 is referred to as an L3 cache; however, caches at other levels may be used. For example, in a system with multiple caches, the cache 116 may be a lowest or last level cache (LLC). To avoid excessive exchange of messages between the cache controllers 114 of the request nodes 102, a home node 108 also includes a snoop filter 300 that monitors data transactions, maintains the status of data stored in the system cache 116, and operates to maintain coherency of data in the various caches of the system. A home node generally provides an interface to a data resource such as a memory or I/O device. A home node acts as a point of coherence in that it issues coherence responses and receives coherence requests via the interconnect circuit 106 from other nodes. A home node is an intermediate node: it responds to data transaction requests from a request node, and can issue data transaction requests to other devices such as a memory controller. Thus, a home node may act as an intermediary node between a request node and a memory, and may include a cache for temporary storage of data. The snoop filter of a home node functions as a cache controller and a point of coherence. Since memory accesses, for a given set of memory addresses in a shared data resource, pass through the same home node, the home node can monitor or ‘snoop’ on transactions and determine if requested data should be retrieved from a main memory, from a cache in the home node, or from a local cache of one of the request nodes.
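The home node's routing decision described above can be sketched as a minimal sequence of checks. The function name, argument types and return strings below are invented for illustration only, not part of any disclosed embodiment:

```python
def route_read_request(addr, system_cache, snoop_filter):
    """Return where a home node should obtain requested data.

    system_cache: set of addresses held in the home node's own cache.
    snoop_filter: set of addresses cached by some request node.
    """
    if addr in system_cache:          # hit in the home node's own (L3) cache
        return "system_cache"
    if addr in snoop_filter:          # a request node holds a cached copy
        return "request_node_cache"
    return "main_memory"              # no cached copy anywhere
```

Because every access to a given address passes through the same home node, this single check point suffices to decide between the three sources.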
In alternative embodiments, one or more snoop filters may be utilized at other locations in a system. For example, a snoop filter may be located in interconnect 106.
Together, snoop filters 300 and cache controllers 114 monitor data transactions and exchange messages to ensure cache coherency. In order to maintain coherency of data in the various local caches, the coherency state of each cache line or block is tracked. For example, data in a local cache, such as cache 114, is said to be in a ‘dirty’ state if it is the most up-to-date copy but does not match the data in the memory or lowest level cache. Otherwise, the data is said to be ‘clean’. A cache coherence protocol may employ a MOESI cache coherence model, in which the cache data may be in one of a number of coherency states. The coherency states are: Modified (M), Owned (O), Exclusive (E), Shared (S) and Invalid (I).
Modified data, also called ‘UniqueDirty’ (UD) data, is not shared by other caches. Modified data in a local cache has been updated by a device, but has not been written back to memory, so it is ‘dirty’. Modified data is exclusive and owned. The local cache has the only valid copy of the data.
Owned data, also called ‘SharedDirty’ (SD) data, is shared by other caches. It has not been written back to memory so it is ‘dirty’.
Exclusive data, also called ‘UniqueClean’ (UC) data, is not shared and matches the corresponding data in the memory.
Shared data, also called ‘SharedClean’ (SC) data, is shared and matches the corresponding data in the memory. Shared data is not exclusive, not dirty, and not owned.
Invalid data is data that has been updated in the memory and/or in another cache, so is out-of-date. Valid data is the most up-to-date data. It may be read, but it may only be written if it is also exclusive.
Alternatively, a cache coherence protocol may employ a MESI cache coherence model. This is similar to the MOESI model except that data cannot be in the ‘Owned’ or ‘SharedDirty’ state.
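The state properties described in the preceding paragraphs can be summarized in a short sketch. The class and helper names below are invented for illustration; they are not part of any cited protocol:

```python
from enum import Enum

class CoherencyState(Enum):
    M = "UniqueDirty"    # Modified: only valid copy, not written back
    O = "SharedDirty"    # Owned: shared, not written back
    E = "UniqueClean"    # Exclusive: not shared, matches memory
    S = "SharedClean"    # Shared: shared, matches memory
    I = "Invalid"        # out-of-date

def is_dirty(state):
    """Dirty data does not match the memory or lowest level cache."""
    return state in (CoherencyState.M, CoherencyState.O)

def is_unique(state):
    """Unique data has exactly one valid cached copy."""
    return state in (CoherencyState.M, CoherencyState.E)

def may_write(state):
    """Valid data may only be written if it is also exclusive (unique)."""
    return is_unique(state)

# The MESI model simply omits the 'Owned'/'SharedDirty' state.
MESI_STATES = [s for s in CoherencyState if s is not CoherencyState.O]
```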
In accordance with one aspect of the present disclosure, it is recognized that when data is in a ‘Unique’ state, whether ‘Modified’ or ‘Exclusive’, only a single node can have a valid copy of the data. For a coarse grain snoop filter this results in sending unnecessary snoop messages to nodes in the same subset as the node having the valid copy of the data associated with the tag.
In accordance with various embodiments, the data in presence field 314 may be formatted in two or more ways, as indicated by format data.
In accordance with various embodiments, the format data comprises the cache coherence status stored in field 312, so that the data stored in presence field 314 is interpreted dependent upon the state of the cached data. This is illustrated in
If the address is found in the snoop filter (a snoop filter ‘hit’), as indicated by the positive branch from decision block 608, the data is stored in a RN-F cache. If the data is stored in a ‘Unique’ state, as depicted by the positive branch from decision block 618, the data in presence field 406 is interpreted as a unique identifier of a node, as depicted in
If the address is found in the cache (a cache ‘hit’), as indicated by the positive branch from decision block 606, the data is already stored in the system cache of the HN-F node. The snoop filter (SF) is updated at block 628 to indicate that the requesting RN-F will have a copy of the data, and the data is forwarded to the RN-F node at block 614.
When the snoop filter is updated, the data in presence field 406 is updated dependent upon the new coherency state of the data. In particular, the format and interpretation of the field is changed when the coherency state changes from ‘Shared’ to ‘Unique’ or from ‘Unique’ to ‘Shared’.
When the coherency state is ‘Invalid’, the presence field 406 is not used.
In one example, a snoop filter has a presence field 406 of length 64 bits. If the system has 256 nodes, the field is too small to store a fine grain presence vector, so each bit in the presence vector may be associated with four nodes. When data is ‘Unique’, a coarse grain filter would result in three unnecessary snoop messages. However, using a technique of the present disclosure, presence field 406 stores the unique identifier of the single node that has a copy of the data in its local cache. Thus, only a single snoop message is sent. In this example, the unique identifier may be an 8-bit number assigned to a node.
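A minimal numeric sketch of this example follows; the function names and encoding details are illustrative assumptions. With 256 nodes and a 64-bit field, each coarse grain bit covers four nodes, while a ‘Unique’ line stores the 8-bit node identifier directly:

```python
NUM_NODES = 256
FIELD_BITS = 64
NODES_PER_BIT = NUM_NODES // FIELD_BITS        # four nodes per presence bit

def encode_presence(sharers, unique):
    """Encode the presence field for a set of sharing nodes."""
    if unique:
        (node,) = sharers                      # exactly one sharer
        return node                            # 8-bit identifier fits easily
    field = 0
    for node in sharers:
        field |= 1 << (node // NODES_PER_BIT)  # coarse grain: bit per subset
    return field

def snoop_targets(field, unique):
    """List the nodes that would receive a snoop message."""
    if unique:
        return [field]                         # one directed snoop
    return [n for b in range(FIELD_BITS) if field & (1 << b)
            for n in range(b * NODES_PER_BIT, (b + 1) * NODES_PER_BIT)]
```

For a single sharer at node 5, the unique encoding yields one directed snoop, whereas the coarse grain encoding snoops nodes 4 through 7 — the three unnecessary messages noted above.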
In accordance with a further embodiment of the disclosure, it is recognized that the presence field 406 may be used to store more than one node identifier.
When the format flag 714 in format field 708 is set to a first value, zero say, the presence data 712 stored in presence field 706 is interpreted as a coarse grain presence vector, with each bit associated with a subset of nodes and indicating if any node in the subset has a copy of data associated with the tag 710 stored in field 702.
When the format flag in format field 708 is set to a second value, one say, the presence field 706 is configured to save a number of unique node identifiers together with a shortened presence vector as depicted in
If the address is found in the snoop filter (a snoop filter ‘hit’), as indicated by the positive branch from decision block 1108, the data is stored in a RN-F cache. If the snoop filter line is configured in a first format, as depicted by the ‘PV’ branch from decision block 1118, the presence field 314 is interpreted as a coarse grain presence vector, and broadcast snoop messages are directed to all subsets of nodes that share the data at block 1122. If the snoop filter line is configured in a second format, as depicted by the ‘ID’ branch from decision block 1118, the presence field 314 is interpreted as indicating one or more unique node identifiers, as discussed above. Unicast snoops are sent, at block 1120, to all of the identified RN-Fs. If the response to the snoop fails to return the requested data, as depicted by the negative branch from decision block 1124, flow continues to block 1110 to retrieve the data from the memory using the memory controller. If, in response to a snoop, an RN-F provides the data, as indicated by the positive branch from decision block 1124, the data is stored in the (L3) system cache at block 1126 and the coherency state of the cache data is marked in the snoop filter as ‘dirty’. By updating the cache at block 1126, the data in the local caches of the request nodes is guaranteed to be clean; thus, there is no requirement to identify the owner of shared dirty data. Flow then continues to block 1114. Any subsequent read request will result in a hit in the system cache and so will not generate any snoop. The data will be provided from the system cache.
If the address is found in the cache (a cache ‘hit’), as indicated by the positive branch from decision block 1106, the data is already stored in the system cache of the HN-F node. The snoop filter (SF) is updated at block 1128 to indicate that the requesting RN-F will have a copy of the data and the data is forwarded to the RN-F node at block 1114.
When the snoop filter is updated, the presence field 314 is updated dependent upon the new state of the system. In particular, the format and interpretation of the field is changed when the number of nodes sharing a copy of the data exceeds the number of identifiers that can be stored in the presence field 314.
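As a rough sketch of this format switch (the subset size, the identifier capacity and the dictionary layout are assumptions for illustration), a snoop filter entry might hold a short list of unique node identifiers and fall back to a coarse grain presence vector once the number of sharers exceeds what the field can hold:

```python
NODES_PER_BIT = 4   # assumed subset size for the coarse grain fallback
MAX_IDS = 4         # assumed number of identifiers the field can hold

def update_presence(entry, new_sharer):
    """Record a new sharer in a snoop filter entry.

    entry is a dict: {'format': 'ID' or 'PV', 'data': id list or bit vector}.
    The format flips from 'ID' to 'PV' when the number of sharers exceeds
    the identifiers that fit in the presence field.
    """
    if entry["format"] == "ID":
        ids = entry["data"]
        if new_sharer not in ids:
            ids.append(new_sharer)
        if len(ids) > MAX_IDS:          # overflow: switch to coarse vector
            vector = 0
            for node in ids:
                vector |= 1 << (node // NODES_PER_BIT)
            entry["format"], entry["data"] = "PV", vector
    else:                               # already coarse grain: set subset bit
        entry["data"] |= 1 << (new_sharer // NODES_PER_BIT)
    return entry
```

While the entry stays in the ‘ID’ format, every snoop is a unicast to a known node; only after overflow does the entry revert to subset-level broadcast snooping.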
Table 1 provides some examples of how presence data may be organized for different numbers of nodes and different presence field sizes. Other values may be used without departing from the present disclosure.
It will be apparent to those of ordinary skill in the art that the information in a line of a snoop filter may be organized in a variety of ways. For example, the order of the fields may be varied, and the order of bits within the fields may be varied, without departing from the present disclosure. Further, presence bits may be grouped with associated identifiers rather than grouped together.
As used herein, the term processor, controller or the like may encompass a processor, controller, microcontroller unit (MCU), microprocessor, and other suitable control elements. It will be appreciated that embodiments of the invention described herein may be comprised of one or more conventional processors and unique stored program instructions that control the one or more processors to implement, in conjunction with certain non-processor circuits, some, most, or all of the functions described herein. The non-processor circuits may include, but are not limited to, a receiver, a transmitter, a radio, signal drivers, clock circuits, power source circuits, and user input devices. As such, these functions may be interpreted as a method to perform functions in accordance with certain embodiments consistent with the present invention. Alternatively, some or all functions could be implemented by a state machine that has no stored program instructions, or in one or more application specific integrated circuits (ASICs), in which each function or some combinations of certain of the functions are implemented as custom logic. Of course, a combination of the two approaches could be used. Thus, methods and means for these functions have been described herein. Further, it is expected that one of ordinary skill, notwithstanding possibly significant effort and many design choices motivated by, for example, available time, current technology, and economic considerations, when guided by the concepts and principles disclosed herein will be readily capable of generating such software instructions and programs and ICs with minimal experimentation.
Those skilled in the art will recognize that the present invention has been described in terms of exemplary embodiments based upon use of hardware components such as special purpose hardware, custom logic and/or dedicated processors. However, the invention should not be so limited, since general purpose computers, microprocessor-based computers, micro-controllers, optical computers, analog computers, dedicated processors and/or dedicated hard wired logic may be used to construct alternative equivalent embodiments of the present invention.
Further, the hardware components described above may be defined by instructions of a Hardware Description Language (HDL). Such instructions may be stored on a non-transitory machine-readable storage medium or transmitted from one computer to another over a computer network. The HDL instructions may be utilized in the design and manufacture of the defined hardware components of systems containing the hardware components and additional components.
Moreover, those skilled in the art will appreciate that a program flow and associated data used to implement the embodiments described above can be implemented using various forms of storage such as Read Only Memory (ROM), Random Access Memory (RAM), Electrically Erasable Programmable Read Only Memory (EEPROM); non-volatile memory (NVM); mass storage such as storage class memory, a hard disc drive, floppy disc drive, optical disc drive; optical storage elements, magnetic storage elements, magneto-optical storage elements, flash memory, core memory and/or other equivalent storage technologies without departing from the present invention. Such alternative storage devices should be considered equivalents.
Those skilled in the art will appreciate that the processes described above can be implemented in any number of variations without departing from the present invention. For example, the order of certain operations carried out can often be varied, additional operations can be added or operations can be deleted without departing from the invention. Error trapping can be added and/or enhanced and variations can be made in user interface and information presentation without departing from the present invention. Such variations are contemplated and considered equivalent.
The various representative embodiments, which have been described in detail herein, have been presented by way of example and not by way of limitation. It will be understood by those skilled in the art that various changes may be made in the form and details of the described embodiments resulting in equivalent embodiments that remain within the scope of the appended claims.
Number | Date | Country | |
---|---|---|---|
20180004663 A1 | Jan 2018 | US |