The technology of the disclosure relates generally to memory coherency in processor-based systems, and, in particular, to memory coherency in processor systems having multiple processor sockets.
Many conventional processor-based systems provide multiple processors (single- or multi-core) located on physically separate processor dies interfaced with separate processor sockets that are linked by an interconnect bus. Such multi-socket systems may provide a feature known as “multi-socket coherency” to maintain memory coherency among the multiple processor sockets' local memory hierarchy regions. To provide multi-socket coherency, each memory access request from a given processor must be evaluated (i.e., “snooped”) to determine whether a remote processor has modified the memory element corresponding to the memory address of the memory access request. A snoop to a remote processor socket (i.e., a “remote snoop”) consumes bandwidth provided by the interconnect bus, thereby reducing the bandwidth available for other inter-socket communications. Consequently, the performance of all processors of the multiple processor sockets may be negatively impacted by each memory access request that has to wait for a remote processor socket to be snooped.
To address this issue, some conventional snoop filter mechanisms employ a “shadow directory,” which is used to track the contents of a local processor socket's system caches to filter cross-socket memory access requests. However, when the storage capacity of a shadow directory of a given processor socket is reached, the snoop filter mechanism must evict an entry from the shadow directory, and must also force all remote caches to evict any corresponding entries. As a result, while the use of a shadow directory may reduce the occurrence of cross-socket snooping, such mechanisms may not be scalable for larger-sized caches and/or larger numbers of processor sockets. Thus, a more effective and scalable mechanism for filtering cross-socket snooping is desirable.
Aspects disclosed in the detailed description include providing multi-socket memory coherency using cross-socket snoop filtering in processor-based systems. In this regard, in some aspects, a processor-based system provides multiple interconnected processor sockets that are each associated with a point of serialization (POS) circuit and a local memory hierarchy subdivided into a plurality of memory granules. In some aspects, the size of the memory granules corresponds to a size of a system cache line, such as 128 bytes. Stored in the local memory hierarchy for each processor socket is a coherency directory, comprising a plurality of coherency directory entries. Each of the coherency directory entries stores one or more status indicators corresponding to the memory granules of the local memory hierarchy. The status indicators each provide an indication as to whether or not the corresponding memory granule of the local memory hierarchy has been accessed by a remote processor socket, and, in some aspects, which remote processor socket or sockets have accessed the local memory hierarchy (and thus may be caching more recent data for the memory granule). Upon receiving a memory access request referencing a local memory address of a processor socket, the POS circuit of the processor socket retrieves a coherency directory entry corresponding to the local memory address. The POS circuit then determines, based on the status indicator for the local memory address provided by the coherency directory entry, whether a remote snoop is required to determine which processor socket has the most recent data for the local memory address. If so, a remote snoop is performed. If the POS circuit determines that a remote snoop is not required, data from the local memory hierarchy is read and returned in response to the memory access request.
In this manner, the coherency directory provides an efficient and scalable mechanism for reducing the occurrence of unnecessary cross-socket snoops, thus improving system performance.
Some aspects may further provide a coherency directory cache for caching coherency directory entries for faster lookup. Aspects may also provide a remote access indicator array, which provides access indicators corresponding to portions of memory larger than a single memory granule. The remote access indicator array may be consulted prior to accessing the coherency directory, and thus may be used to determine whether a coherency directory lookup is needed.
In another aspect, a processor-based system for providing multi-socket memory coherency using cross-socket snoop filtering is provided. The processor-based system includes a plurality of processor sockets, each of which provides a coherency directory stored in a local memory hierarchy comprising a plurality of memory granules. The coherency directory includes a plurality of coherency directory entries each storing one or more status indicators corresponding to the plurality of memory granules of the local memory hierarchy. The processor-based system further includes a POS circuit. The POS circuit is configured to receive a memory access request comprising a local memory address within the local memory hierarchy. The POS circuit is further configured to retrieve a coherency directory entry of the plurality of coherency directory entries of the coherency directory corresponding to the local memory address. The POS circuit is also configured to determine, based on a status indicator of the one or more status indicators of the coherency directory entry corresponding to a memory granule of the plurality of memory granules associated with the local memory address, whether a remote snoop is required for the memory access request. The POS circuit is additionally configured to, responsive to determining that a remote snoop is required for the memory access request, perform the remote snoop of one or more remote processor sockets of the plurality of processor sockets indicated by the status indicator. The POS circuit is further configured to, responsive to determining that a remote snoop is not required for the memory access request, return data from the local memory hierarchy for the memory access request.
In another aspect, a processor-based system for providing multi-socket memory coherency using cross-socket snoop filtering is provided. The processor-based system comprises a means for receiving a memory access request comprising a local memory address within a local memory hierarchy comprising a plurality of memory granules. The processor-based system further comprises a means for retrieving a coherency directory entry of a plurality of coherency directory entries of a coherency directory corresponding to the local memory address, wherein the coherency directory is stored in the local memory hierarchy, and the plurality of coherency directory entries each stores one or more status indicators corresponding to the plurality of memory granules of the local memory hierarchy. The processor-based system also comprises a means for determining, based on a status indicator of the one or more status indicators of the coherency directory entry corresponding to a memory granule of the plurality of memory granules associated with the local memory address, whether a remote snoop is required for the memory access request. The processor-based system additionally comprises a means for performing the remote snoop of one or more remote processor sockets of a plurality of processor sockets indicated by the status indicator, responsive to determining that a remote snoop is required for the memory access request. The processor-based system further comprises a means for returning data from the local memory hierarchy for the memory access request, responsive to determining that a remote snoop is not required for the memory access request.
In another aspect, a method for providing multi-socket memory coherency using cross-socket snoop filtering is provided. The method comprises receiving, by a POS circuit, a memory access request comprising a local memory address within a local memory hierarchy comprising a plurality of memory granules. The method further comprises retrieving a coherency directory entry of a plurality of coherency directory entries of a coherency directory corresponding to the local memory address, wherein the coherency directory is stored in the local memory hierarchy, and the plurality of coherency directory entries each stores one or more status indicators corresponding to the plurality of memory granules of the local memory hierarchy. The method also comprises determining, based on a status indicator of the one or more status indicators of the coherency directory entry corresponding to a memory granule of the plurality of memory granules associated with the local memory address, whether a remote snoop is required for the memory access request. The method additionally comprises, responsive to determining that a remote snoop is required for the memory access request, performing the remote snoop of one or more remote processor sockets of a plurality of processor sockets indicated by the status indicator. The method further comprises, responsive to determining that a remote snoop is not required for the memory access request, returning data from the local memory hierarchy for the memory access request.
In another aspect, a non-transitory computer-readable medium having stored thereon computer-executable instructions is provided. The computer-executable instructions, when executed by a processor, cause the processor to receive a memory access request comprising a local memory address within a local memory hierarchy comprising a plurality of memory granules. The computer-executable instructions further cause the processor to retrieve a coherency directory entry of a plurality of coherency directory entries of a coherency directory corresponding to the local memory address, wherein the coherency directory is stored in the local memory hierarchy, and the plurality of coherency directory entries each stores one or more status indicators corresponding to the plurality of memory granules of the local memory hierarchy. The computer-executable instructions also cause the processor to determine, based on a status indicator of the one or more status indicators of the coherency directory entry corresponding to a memory granule of the plurality of memory granules associated with the local memory address, whether a remote snoop is required for the memory access request. The computer-executable instructions additionally cause the processor to, responsive to determining that a remote snoop is required for the memory access request, perform the remote snoop of one or more remote processor sockets of a plurality of processor sockets indicated by the status indicator. The computer-executable instructions further cause the processor to, responsive to determining that a remote snoop is not required for the memory access request, return data from the local memory hierarchy for the memory access request.
With reference now to the drawing figures, several exemplary aspects of the present disclosure are described. The word “exemplary” is used herein to mean “serving as an example, instance, or illustration.” Any aspect described herein as “exemplary” is not necessarily to be construed as preferred or advantageous over other aspects.
Aspects disclosed in the detailed description include providing multi-socket memory coherency using cross-socket snoop filtering in processor-based systems. In this regard,
Each of the processor sockets 102(0)-102(P) is associated with a corresponding local memory hierarchy 106(0)-106(P). As used herein, the term “local memory hierarchy” generally refers to one or more local memory devices that are dedicated or directly connected to the corresponding processor sockets 102(0)-102(P), and are accessed in a hierarchical fashion according to response time or other performance characteristics. Accordingly, each local memory hierarchy 106(0)-106(P) in some aspects may comprise one or more of a Level 1 (L1) cache, a Level 2 (L2) cache, a Level 3 (L3) cache, and/or a system memory (e.g., double data rate (DDR) synchronous dynamic random access memory (SDRAM)), as non-limiting examples. The local memory hierarchies 106(0)-106(P) are subdivided into a plurality of memory granules 108(0)-108(X), 110(0)-110(X), 112(0)-112(X), 114(0)-114(X), respectively. In some aspects, the memory granules 108(0)-108(X), 110(0)-110(X), 112(0)-112(X), 114(0)-114(X) may have a size corresponding to a system cache line size (e.g., 128 bytes, as a non-limiting example).
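As a non-limiting illustration, the subdivision of a local memory hierarchy into memory granules may be sketched as a simple address calculation. The sketch below assumes the 128-byte granule size given above as an example; the function names and the `base_address` parameter are hypothetical and are not taken from the disclosure.

```python
GRANULE_SIZE = 128  # bytes; the non-limiting example size given in the text


def granule_index(local_address: int, base_address: int = 0) -> int:
    """Return the index of the memory granule that contains local_address.

    base_address is a hypothetical parameter marking where the local
    memory hierarchy begins in the address space.
    """
    if local_address < base_address:
        raise ValueError("address lies below the local memory hierarchy base")
    return (local_address - base_address) // GRANULE_SIZE


def granule_bounds(index: int, base_address: int = 0) -> tuple:
    """Return the first and last byte address covered by a granule."""
    start = base_address + index * GRANULE_SIZE
    return start, start + GRANULE_SIZE - 1
```

Under these assumptions, every address within the same 128-byte span maps to the same granule index, so a single status indicator may describe the whole span.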
The processor sockets 102(0)-102(P) are further associated with a corresponding point of serialization (POS) circuit 116(0)-116(P). Each of the POS circuits 116(0)-116(P) is configured to provide functionality for maintaining memory coherency for its local memory hierarchy 106(0)-106(P). As a non-limiting example, the functionality of the POS circuits 116(0)-116(P) may include issuing remote snoops to other processor sockets 102(0)-102(P), collecting snoop responses for given transactions, and initiating memory access operations to appropriate memory controllers (not shown). The POS circuits 116(0)-116(P) may also issue transaction results and handle transaction conflicts for a given memory address.
The processor-based system 100 of
To maintain perfect memory coherency among the processor sockets 102(0)-102(P), each of the POS circuits 116(0)-116(P) would have to perform a snoop of every remote processor socket 102(0)-102(P) for every memory access request to a cacheable local memory address. However, the resulting snoop requests and snoop responses would overwhelm the interconnect bus 104, resulting in decreased system performance for all of the processor sockets 102(0)-102(P). Accordingly, in this regard, each of the processor sockets 102(0)-102(P) is associated with a corresponding coherency directory 118(0)-118(P) stored within the local memory hierarchy 106(0)-106(P). In some aspects, each coherency directory 118(0)-118(P) is stored within a system memory of the local memory hierarchy 106(0)-106(P). Performance may be further enhanced through the use of coherency directory caches 120(0)-120(P), which may be used to cache recently accessed data from the respective coherency directories 118(0)-118(P), and further through the use of remote access indicator arrays 122(0)-122(P), which may be used to minimize the latency impact of accessing the respective local memory hierarchies 106(0)-106(P). The structure and functionality of the coherency directories 118(0)-118(P), the coherency directory caches 120(0)-120(P), and the remote access indicator arrays 122(0)-122(P) are discussed in greater detail below with respect to
To further illustrate the functionality provided by the coherency directories 118(0)-118(P) of
In exemplary operation, a POS circuit, such as the POS circuit 116(0), may receive a memory access request, and may consult the coherency directory 118(0) to determine, based on the status indicators 202(0)-202(S), 202′(0)-202′(S) of the memory granules 108(0)-108(X) being accessed, whether the memory granules 108(0)-108(X) have been previously accessed by one of the remote processor sockets 102(1)-102(P). If not, the POS circuit 116(0) may conclude that a remote snoop is not necessary, and may proceed to fulfill the memory access request using the local memory hierarchy 106(0) (e.g., by performing a memory access operation on a local cache or system memory). However, if the status indicators 202(0)-202(S), 202′(0)-202′(S) of the memory granules 108(0)-108(X) indicate that a remote access has taken place, the POS circuit 116(0) may conclude that a remote snoop of one or more of the remote processor sockets 102(1)-102(P) is necessary. In this manner, the occurrence of unnecessary remote snoops may be reduced, thus improving system performance.
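The directory consultation described above may be sketched in software as follows. This is a minimal model under stated assumptions, not a description of the hardware: each status indicator is modeled as a bitmask of remote socket identifiers, each coherency directory entry is assumed to hold eight status indicators, and all class, method, and constant names are hypothetical.

```python
from dataclasses import dataclass, field

GRANULE_SIZE = 128          # bytes; example granule size from the text
INDICATORS_PER_ENTRY = 8    # hypothetical number of status indicators per entry


@dataclass
class CoherencyDirectoryEntry:
    # One bitmask per granule; bit s set means remote socket s accessed it.
    status: list = field(default_factory=lambda: [0] * INDICATORS_PER_ENTRY)


class CoherencyDirectory:
    def __init__(self):
        # Sparse storage: entries that were never written are all-zero.
        self.entries = {}

    @staticmethod
    def _locate(local_address):
        """Map an address to (entry index, indicator slot within the entry)."""
        granule = local_address // GRANULE_SIZE
        return divmod(granule, INDICATORS_PER_ENTRY)

    def record_remote_access(self, local_address, socket_id):
        """Set the status indicator bit for a remote socket's access."""
        entry_index, slot = self._locate(local_address)
        entry = self.entries.setdefault(entry_index, CoherencyDirectoryEntry())
        entry.status[slot] |= 1 << socket_id

    def sockets_to_snoop(self, local_address):
        """Return remote socket ids to snoop; an empty list means no snoop."""
        entry_index, slot = self._locate(local_address)
        entry = self.entries.get(entry_index)
        mask = entry.status[slot] if entry is not None else 0
        return [s for s in range(mask.bit_length()) if (mask >> s) & 1]
```

In this sketch, an empty result corresponds to fulfilling the memory access request from the local memory hierarchy, while a non-empty result identifies the remote processor sockets that would be snooped.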
To supplement the coherency directories 118(0)-118(P) of
Some aspects may also further minimize the latency impact of accessing local memory addresses through the use of the remote access indicator arrays 122(0)-122(P) of
On subsequent memory access operations, the POS circuit 116(0) may access the remote access indicator array 122(0) before consulting the coherency directory 118(0) and the coherency directory cache 120(0) (if present). This allows the POS circuit 116(0) to bypass the coherency directory 118(0) and the coherency directory cache 120(0) if the remote access indicator array 122(0) indicates that a given local memory address has not been accessed by one of the remote processor sockets 102(1)-102(P). The POS circuit 116(0) may later clear the remote access indicators 400(0)-400(Y) whenever an access of the coherency directory 118(0) indicates that no memory granules 108(0)-108(X) within the corresponding pages are cached remotely.
In some aspects, the POS circuit 116(0) may update the contents of the remote access indicator array 122(0) to ensure that the remote access indicators 400(0)-400(Y) provide an accurate representation of the status of the corresponding page of memory granules 108(0)-108(X). In such aspects, the POS circuit 116(0) may process the coherency directory entries 200(0)-200(N) of the coherency directory 118(0) to determine whether the status indicators 202(0)-202(S), 202′(0)-202′(S) are set. If none of the status indicators 202(0)-202(S), 202′(0)-202′(S) for a page of memory granules 108(0)-108(X) that corresponds to a given remote access indicator 400(0)-400(Y) are set, the POS circuit 116(0) clears that remote access indicator 400(0)-400(Y) in the remote access indicator array 122(0). In this manner, the accuracy of the contents of the remote access indicator array 122(0) may be maintained over time as the memory granules 108(0)-108(X) are accessed by remote processor sockets.
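The maintenance of the remote access indicators described above may be sketched as follows. The sketch assumes a 4096-byte page and models the per-granule status indicators as a simple dictionary standing in for the coherency directory; the page size, the class name, and the `record_remote_release` helper (the text does not describe how a status indicator becomes unset) are hypothetical.

```python
GRANULE_SIZE = 128                            # bytes; example size from the text
PAGE_SIZE = 4096                              # hypothetical page size
GRANULES_PER_PAGE = PAGE_SIZE // GRANULE_SIZE


class RemoteAccessTracker:
    def __init__(self, num_pages):
        # One remote access indicator per page of memory granules.
        self.page_indicators = [False] * num_pages
        # Stand-in for the coherency directory: granule index -> indicator set.
        self.granule_status = {}

    def record_remote_access(self, local_address):
        """Set both the granule's status indicator and its page's indicator."""
        self.granule_status[local_address // GRANULE_SIZE] = True
        self.page_indicators[local_address // PAGE_SIZE] = True

    def record_remote_release(self, local_address):
        """Hypothetical: unset a granule's status indicator."""
        self.granule_status.pop(local_address // GRANULE_SIZE, None)

    def refresh_page(self, page):
        """Clear the page indicator if no status indicator in the page is set."""
        first = page * GRANULES_PER_PAGE
        any_set = any(
            self.granule_status.get(g, False)
            for g in range(first, first + GRANULES_PER_PAGE)
        )
        if not any_set:
            self.page_indicators[page] = False
        return self.page_indicators[page]
```

Note that the sketch only ever clears a page indicator after scanning every granule of the page, so a set indicator may be stale but a cleared indicator is never wrong under these assumptions.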
As indicated by arrow 502, the POS circuit 116(0) of the processor socket 102(0) receives a memory access request 504 (e.g., a memory read request or a memory write request) including a local memory address 506 (i.e., “local” with respect to the local memory hierarchy 106(0) of the processor socket 102(0)). In aspects providing a remote access indicator array 122(0), the POS control logic circuit 500 first accesses the remote access indicator array 122(0) to determine whether a remote access indicator (such as the remote access indicators 400(0)-400(Y) of
However, if the remote access indicator 400(0)-400(Y) corresponding to the page containing the local memory address 506 is set, the POS control logic circuit 500 may next consult the coherency directory cache 120(0), as indicated by arrow 512. The POS control logic circuit 500 of the POS circuit 116(0) determines whether a coherency directory cache entry, such as the coherency directory cache entries 306(0)-306(Z) of
If accessing the coherency directory cache 120(0) results in a miss, the POS control logic circuit 500 consults the coherency directory 118(0) to retrieve a coherency directory entry, such as the coherency directory entries 200(0)-200(N), corresponding to the local memory address 506 of the memory access request 504, as indicated by arrow 518. Based on the coherency directory 118(0), the POS control logic circuit 500 determines whether a remote snoop of the remote processor socket 102(P) is required, or if the memory access request 504 can be fulfilled by accessing the local memory hierarchy 106(0). If a remote snoop is required, the POS circuit 116(0) may perform a snoop of the remote processor socket 102(P), and if the remote processor socket 102(P) is caching the updated data value 514 for the local memory address 506, the POS circuit 116(0) returns the updated data value 514 in response to the memory access request 504, as indicated by arrow 516. If no remote snoop is required, the POS circuit 116(0) returns data 508 from the local memory hierarchy 106(0) in response to the memory access request 504, as indicated by arrow 510.
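The lookup order described above (remote access indicator array, then coherency directory cache, then coherency directory) may be sketched as a single function. This is an illustrative model only: the dictionaries, the `snoop_remote` callback, the 4096-byte page size, the filling of the cache on a miss, and the fallback to local data when a snoop returns nothing are all assumptions made for the sketch rather than details taken from the disclosure.

```python
GRANULE_SIZE = 128   # bytes; example granule size from the text
PAGE_SIZE = 4096     # hypothetical page size


def handle_memory_access(local_address, page_indicators, directory_cache,
                         directory, local_memory, snoop_remote):
    """Return (source, data), where source is "local" or "remote".

    page_indicators: page index -> bool (remote access indicator array)
    directory_cache: granule index -> set of remote socket ids (cached entries)
    directory:       granule index -> set of remote socket ids (full directory)
    local_memory:    granule index -> data held in the local memory hierarchy
    snoop_remote:    callable(sockets, granule) -> updated data or None
    """
    granule = local_address // GRANULE_SIZE
    page = local_address // PAGE_SIZE

    # Step 1: a clear page indicator bypasses the cache and the directory.
    if not page_indicators.get(page, False):
        return "local", local_memory[granule]

    # Step 2: try the directory cache; on a miss, consult the directory.
    if granule in directory_cache:
        sockets = directory_cache[granule]
    else:
        sockets = directory.get(granule, frozenset())
        directory_cache[granule] = sockets  # assumption: fill cache on miss

    # Step 3: snoop only the sockets named by the status indicator.
    if sockets:
        updated = snoop_remote(sockets, granule)
        if updated is not None:
            return "remote", updated

    # No snoop required, or the snoop found no newer copy (assumed fallback).
    return "local", local_memory[granule]
```

A caller would supply `snoop_remote` as whatever mechanism issues the remote snoop over the interconnect bus; in the sketch it is an ordinary Python callable so that the control flow can be exercised in isolation.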
To illustrate exemplary operations of the POS circuit 116(0) of
In aspects in which the POS circuit 116(0) provides the remote access indicator array 122(0), the POS circuit 116(0) may next determine whether a remote access indicator 400(0) of a plurality of remote access indicators 400(0)-400(Y) of a remote access indicator array 122(0) corresponding to the local memory address 506 is set (block 602). If not (indicating that the corresponding page containing the local memory address 506 has not been remotely accessed), processing resumes at block 604 of
Referring now to
With continuing reference to
Turning to
Referring now to
In
Multi-socket memory coherency using cross-socket snoop filtering in processor-based systems according to aspects disclosed herein may be provided in or integrated into any processor-based device. Examples, without limitation, include a set top box, an entertainment unit, a navigation device, a communications device, a fixed location data unit, a mobile location data unit, a global positioning system (GPS) device, a mobile phone, a cellular phone, a smart phone, a session initiation protocol (SIP) phone, a tablet, a phablet, a server, a computer, a portable computer, a mobile computing device, a wearable computing device (e.g., a smart watch, a health or fitness tracker, eyewear, etc.), a desktop computer, a personal digital assistant (PDA), a monitor, a computer monitor, a television, a tuner, a radio, a satellite radio, a music player, a digital music player, a portable music player, a digital video player, a video player, a digital video disc (DVD) player, a portable digital video player, an automobile, a vehicle component, avionics systems, a drone, and a multicopter.
In this regard,
Other master and slave devices can be connected to the system bus 708. As illustrated in
The CPU(s) 702 may also be configured to access the display controller(s) 720 over the system bus 708 to control information sent to one or more displays 726. The display controller(s) 720 sends information to the display(s) 726 to be displayed via one or more video processors 728, which process the information to be displayed into a format suitable for the display(s) 726. The display(s) 726 can include any type of display, including, but not limited to, a cathode ray tube (CRT), a liquid crystal display (LCD), a plasma display, etc.
Those of skill in the art will further appreciate that the various illustrative logical blocks, modules, circuits, and algorithms described in connection with the aspects disclosed herein may be implemented as electronic hardware, instructions stored in memory or in another computer readable medium and executed by a processor or other processing device, or combinations of both. The master devices and slave devices described herein may be employed in any circuit, hardware component, integrated circuit (IC), or IC chip, as examples. Memory disclosed herein may be any type and size of memory and may be configured to store any type of information desired. To clearly illustrate this interchangeability, various illustrative components, blocks, modules, circuits, and steps have been described above generally in terms of their functionality. How such functionality is implemented depends upon the particular application, design choices, and/or design constraints imposed on the overall system. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present disclosure.
The various illustrative logical blocks, modules, and circuits described in connection with the aspects disclosed herein may be implemented or performed with a processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA) or other programmable logic device, discrete gate or transistor logic, discrete hardware components, or any combination thereof designed to perform the functions described herein. A processor may be a microprocessor, but in the alternative, the processor may be any conventional processor, controller, microcontroller, or state machine. A processor may also be implemented as a combination of computing devices (e.g., a combination of a DSP and a microprocessor, a plurality of microprocessors, one or more microprocessors in conjunction with a DSP core, or any other such configuration).
The aspects disclosed herein may be provided in hardware and in instructions that are stored in hardware, and may reside, for example, in Random Access Memory (RAM), flash memory, Read Only Memory (ROM), Electrically Programmable ROM (EPROM), Electrically Erasable Programmable ROM (EEPROM), registers, a hard disk, a removable disk, a CD-ROM, or any other form of computer readable medium known in the art. An exemplary storage medium is coupled to the processor such that the processor can read information from, and write information to, the storage medium. In the alternative, the storage medium may be integral to the processor. The processor and the storage medium may reside in an ASIC. The ASIC may reside in a remote station. In the alternative, the processor and the storage medium may reside as discrete components in a remote station, base station, or server.
It is also noted that the operational steps described in any of the exemplary aspects herein are described to provide examples and discussion. The operations described may be performed in numerous different sequences other than the illustrated sequences. Furthermore, operations described in a single operational step may actually be performed in a number of different steps. Additionally, one or more operational steps discussed in the exemplary aspects may be combined. It is to be understood that the operational steps illustrated in the flowchart diagrams may be subject to numerous different modifications as will be readily apparent to one of skill in the art. Those of skill in the art will also understand that information and signals may be represented using any of a variety of different technologies and techniques. For example, data, instructions, commands, information, signals, bits, symbols, and chips that may be referenced throughout the above description may be represented by voltages, currents, electromagnetic waves, magnetic fields or particles, optical fields or particles, or any combination thereof.
The previous description of the disclosure is provided to enable any person skilled in the art to make or use the disclosure. Various modifications to the disclosure will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other variations without departing from the spirit or scope of the disclosure. Thus, the disclosure is not intended to be limited to the examples and designs described herein, but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.