In a typical computing system, a memory system is designed with a goal of low latency experienced by a processor when accessing arbitrary units of data. In general, the memory system design takes advantage of memory access properties known as temporal locality and spatial locality. Temporal locality refers to multiple accesses to specific memory locations within a relatively small time period. Spatial locality refers to accesses to memory locations relatively close in the address space within a relatively small time period.
Typically, temporal locality is evaluated in terms of a granularity smaller than that of a next level in a memory hierarchy. For example, a cache captures repeated accesses to blocks of fixed size (i.e., cache lines, e.g., blocks of 64 Bytes (B)), which are smaller than the storage granularity of main memory (e.g., 4 Kilobyte (KB) pages). A cache captures spatial locality by locally storing quantities of sequentially stored data slightly larger than the requested quantity, which reduces memory access latency in the event of sequential access. For example, a cache is designed to store 64 B blocks, although a processor requests one to eight Bytes at a time. Meanwhile, the cache requests blocks of 64 B at a time from a memory, which stores data in contiguous 4 KB pages.
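By way of illustration only, the granularities in the example above can be expressed as simple address arithmetic. The following sketch assumes the 64 B line and 4 KB page sizes given above; the function names (`line_address`, `page_address`, `lines_touched`) are hypothetical and are not part of any described embodiment.

```python
LINE_SIZE = 64    # bytes per cache line, per the example above
PAGE_SIZE = 4096  # bytes per page, per the example above
LINES_PER_PAGE = PAGE_SIZE // LINE_SIZE


def line_address(addr: int) -> int:
    """Return the base address of the cache line containing addr."""
    return addr & ~(LINE_SIZE - 1)


def page_address(addr: int) -> int:
    """Return the base address of the page containing addr."""
    return addr & ~(PAGE_SIZE - 1)


def lines_touched(start: int, count: int, stride: int) -> int:
    """Count the distinct cache lines touched by `count` accesses,
    beginning at `start` and advancing `stride` bytes each time."""
    return len({line_address(start + i * stride) for i in range(count)})
```

Under these assumptions, 64 sequential 8 B accesses touch only 8 distinct cache lines, whereas 64 accesses at a 64 B stride touch 64 distinct cache lines, which is the benefit of spatial locality that the larger block size is intended to capture.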
In a shared memory multiprocessor system, workloads may cause communications between caching agents (e.g., processors, graphics processing units, processor offload engines, or other processing units that each include a cache in a node, socket, or other multi-processor system). Communications between caching agents that do not share a last-level cache may incur cache-to-cache latency. In general, as cache sizes grow over time (e.g., using die-stacked SRAM), the proportion of communication misses will increase, which increases the effect of cache-to-cache latency on communication traffic between caching agents. Thus, improved techniques for communication between caching agents are desired.
In at least one embodiment of the invention, a method includes storing a communication attribute in a shadow tag entry associated with a cache line stored in a penultimate-level cache of a first caching agent having a first last-level cache. The method includes bypassing the first last-level cache in response to the cache line having a modified state, the cache line being evicted from the penultimate-level cache, and the communication attribute having a first state. The first state of the communication attribute indicates prior communication of the cache line with a second caching agent having a second last-level cache. The method may include storing the cache line in the first last-level cache in response to the cache line having the modified state, the cache line being evicted from the penultimate-level cache, and the communication attribute having a second state.
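By way of illustration only, the eviction decision of the method described above may be sketched as follows. The enumeration and function names are hypothetical. The handling of cache lines that are not in the modified state is not specified by the description above, so the sketch reports those cases as not covered rather than assuming a behavior.

```python
from enum import Enum


class LineState(Enum):
    """MOESI coherence states (a MOESI protocol is assumed here)."""
    MODIFIED = "M"
    OWNED = "O"
    EXCLUSIVE = "E"
    SHARED = "S"
    INVALID = "I"


class CommAttr(Enum):
    """Communication attribute held in the shadow tag entry."""
    COMMUNICATED = 1      # "first state": prior communication with another agent
    NOT_COMMUNICATED = 2  # "second state"


class EvictionAction(Enum):
    BYPASS_LLC = "bypass the last-level cache"
    INSTALL_IN_LLC = "store in the last-level cache"
    NOT_COVERED = "not specified by the description"


def on_penultimate_eviction(state: LineState, attr: CommAttr) -> EvictionAction:
    """Choose the action for a line evicted from the penultimate-level cache."""
    if state is not LineState.MODIFIED:
        # Assumption: only modified lines are addressed by the method above.
        return EvictionAction.NOT_COVERED
    if attr is CommAttr.COMMUNICATED:
        return EvictionAction.BYPASS_LLC
    return EvictionAction.INSTALL_IN_LLC
```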
The bypassing may include issuing a victim packet including a communication bypass attribute and the cache line to a directory controller and writing the cache line to a buffer by the directory controller. The bypassing may further include issuing a storing probe packet by the directory controller to the second caching agent in response to receiving the victim packet. The second caching agent may be identified as a previous owner of the cache line by communication history information for the cache line. The bypassing may further include prefetching the cache line into a second last-level cache of the second caching agent, setting to the first state an associated communication attribute in a second shadow tag entry of the second caching agent, and updating the communication history information for the cache line in response to receiving the storing probe packet by the second caching agent. The method may include setting to the first state the communication attribute in response to satisfying a memory request miss of the first last-level cache by a read response from the second caching agent. The method may include storing communication history information for the cache line in a probe filter.
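By way of illustration only, the message sequence described above (victim packet, storing probe, prefetch, and history update) may be modeled in software as follows. This is a simplified sketch, not a hardware description. The class names, the dictionary-based buffer and history structures, and the agent names are hypothetical. In particular, the sketch assumes that updating the communication history means recording the evicting agent as the new previous owner; the description above does not specify the form of that update.

```python
from dataclasses import dataclass, field


@dataclass
class CachingAgent:
    name: str
    llc: dict = field(default_factory=dict)        # line address -> data
    comm_attr: dict = field(default_factory=dict)  # shadow tag: address -> bool

    def on_storing_probe(self, directory: "DirectoryController", addr: int) -> None:
        """Prefetch the line into the last-level cache and mark it communicated."""
        self.llc[addr] = directory.prefetch(addr)
        self.comm_attr[addr] = True  # set the attribute to the "first state"


@dataclass
class DirectoryController:
    agents: dict                                  # agent name -> CachingAgent
    history: dict = field(default_factory=dict)   # address -> previous owner name
    buffer: dict = field(default_factory=dict)    # address -> data

    def prefetch(self, addr: int):
        """Answer a prefetch from the buffer that holds the victim line."""
        return self.buffer[addr]

    def on_victim(self, sender: str, addr: int, data, comm_bypass: bool) -> bool:
        """Handle a victim packet; return True if a storing probe was issued."""
        self.buffer[addr] = data
        previous_owner = self.history.get(addr)
        if not comm_bypass or previous_owner is None or previous_owner == sender:
            return False
        self.agents[previous_owner].on_storing_probe(self, addr)
        self.history[addr] = sender  # assumed form of the history update
        return True
```

A brief usage sketch under the same assumptions: if the history records a hypothetical `"consumer"` agent as previous owner of line `0x1000`, a call such as `directory.on_victim("producer", 0x1000, data, comm_bypass=True)` places the line in the consumer's last-level cache without installing it in the producer's last-level cache.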
In at least one embodiment of the invention, an apparatus includes a probe filter configured to store communication history information for a cache line stored in a first caching agent having a first last-level cache. The apparatus includes a controller configured to store the cache line in response to the cache line being evicted from a penultimate-level cache of the first caching agent and configured to provide the cache line to a second caching agent having a second last-level cache in response to the communication history information. The communication history information may be set in response to the cache line being provided by the first caching agent to a second caching agent in response to a directed probe. The communication history information may include a previous owner identifier and a communication state.
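By way of illustration only, the communication history information described above (a previous owner identifier and a communication state) may be sketched as a per-line record. The field names, the integer agent identifiers, and the `steer_target` helper are hypothetical; an actual probe filter entry would carry additional status information not modeled here.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class LineHistory:
    previous_owner: Optional[int] = None  # identifier of the agent that provided the line
    communicated: bool = False            # communication state


class ProbeFilter:
    def __init__(self) -> None:
        self.entries: Dict[int, LineHistory] = {}

    def record_directed_probe(self, addr: int, provider: int) -> LineHistory:
        """Set the history when `provider` supplies the line to another agent
        in response to a directed probe."""
        entry = self.entries.setdefault(addr, LineHistory())
        entry.previous_owner = provider
        entry.communicated = True
        return entry

    def steer_target(self, addr: int, evicting_agent: int) -> Optional[int]:
        """Return the agent that should receive an evicted line, or None
        when the history gives no reason to steer the line."""
        entry = self.entries.get(addr)
        if entry is None or not entry.communicated:
            return None
        if entry.previous_owner == evicting_agent:
            return None
        return entry.previous_owner
```

Keeping the previous owner and the communication state together in one record lets the controller decide, with a single lookup at eviction time, both whether to forward the cache line and to which caching agent.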
The first caching agent may include a shadow tag memory associated with the first last-level cache. The shadow tag memory may be configured to store a communication attribute for the cache line. The communication attribute may have a first state in response to a read response from the second caching agent initiated by a miss in the first last-level cache. The first caching agent may be configured to issue a victim packet including a communication bypass attribute to the controller in response to the cache line having a modified state being evicted from a penultimate-level cache of the first caching agent and in response to the communication attribute having the first state. The controller may be configured to issue a storing probe packet to the second caching agent in response to receiving a victim packet from the first caching agent. The probe filter may identify the second caching agent as a previous owner of the cache line in the communication history information. The second caching agent may be configured to prefetch the cache line into the second last-level cache and to set to the first state an associated communication attribute in a second shadow tag of the second caching agent.
In at least one embodiment of the invention, an apparatus includes a shadow tag memory configured to store a communication attribute associated with a cache line stored in a penultimate-level cache of a first caching agent having a first last-level cache. The apparatus includes control logic configured to bypass the first last-level cache in response to the cache line having a modified state, the cache line being evicted from the penultimate-level cache, and the communication attribute having a first state. The first state indicates prior communication of the cache line with a second caching agent having a second last-level cache. The control logic may be further configured to store the cache line in the first last-level cache in response to the cache line having the modified state, the cache line being evicted from the penultimate-level cache, and the communication attribute having a second state. The apparatus may include a probe filter configured to store status information and communication history information for the cache line. The apparatus may include a directory controller configured to store the cache line to main memory in response to the cache line being evicted from the penultimate-level cache and configured to provide the cache line to the second caching agent responsive to the communication history information.
The present invention may be better understood, and its numerous objects, features, and advantages made apparent to those skilled in the art by referencing the accompanying drawings.
The use of the same reference symbols in different drawings indicates similar or identical items.
A technique for accelerating cache-to-cache transfers between caching agents attempts to prefetch a cache line by a consumer caching agent. Since the consumer caching agent generally does not know when a cache line has been made available by a producer caching agent, the prefetching technique alone is as likely to degrade performance as it is to improve performance depending on the timing of the prefetch by the consumer caching agent.
Another technique for accelerating cache-to-cache transfers introduces a new attribute associated with each cache line stored in a caching agent. The attribute indicates the provenance of an associated cache line. When a cache fill request is serviced by another caching agent and the cache line is dirty, then the cache line is categorized as a communication cache line. When that cache line is eventually cast out of a private cache of the caching agent, the cache line is installed in the last-level cache only if the cache line is not a communication cache line. Otherwise, the cache line is written back to main memory. When the consumer caching agent requests that cache line, it is returned from main memory instead of the cache in the producer caching agent. This approach provides a modest benefit if the latency of accessing the data from main memory is less than a cache-to-cache transfer. However, under some circumstances this technique wastes power unnecessarily by writing a cache line that is involved in producer-consumer communication to main memory and unnecessarily reading that cache line back from main memory.
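By way of illustration only, the provenance-based technique described above reduces to two decisions: how a filled cache line is categorized, and where it is sent when cast out of the private cache. The function names and the string labels in the following sketch are hypothetical.

```python
def is_communication_line(serviced_by_other_agent: bool, dirty: bool) -> bool:
    """Categorize a filled line: it is a communication line only when the
    fill was serviced by another caching agent and the line was dirty."""
    return serviced_by_other_agent and dirty


def cast_out_destination(communication_line: bool) -> str:
    """Choose where a line goes when cast out of the private cache."""
    # Communication lines are written back to main memory;
    # all other lines are installed in the last-level cache.
    return "main_memory" if communication_line else "last_level_cache"
```

The sketch makes the cost noted above visible: every communication line takes the main memory path on cast-out, regardless of whether a consumer caching agent ever requests it.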
A communication bypass mechanism accelerates cache-to-cache data transfers for cache lines communicated between caching agents that have private last-level caches. The acceleration provided by the communication bypass mechanism increases with increases to the size of the last-level cache. In some applications, since a next access to a cache line is likely to be a read from a consumer caching agent, the communication bypass mechanism uses an eviction of the cache line from the penultimate-level cache (e.g., a level-two cache of a three level cache) of a producer caching agent to trigger a transfer of the cache line to the last-level cache (e.g., a level-three cache of the three level cache) of the consumer caching agent prior to a request for the cache line by the consumer caching agent.
The communication bypass mechanism uses a shadow tag, which is a cache-like hardware structure that is associated with a last-level cache. The shadow tag records any cache lines stored in a penultimate-level cache or higher-level cache of the cores. The shadow tag maintains communication information indicating whether the associated cache line is involved in communication between caching agents. The last-level cache sets the communication information when it receives or provides the associated cache line in response to a directed probe. The caching agent subsequently bypasses a last-level cache install for a communication line when that cache line is evicted from a penultimate-level cache. That victim cache line is associated with state information and a destination that a directory controller uses to inject the data into the last-level cache of a consumer caching agent before the consumer caching agent requests the data. The mechanism allows the consumer caching agent to directly access the data of the victim cache line from its cache and avoid the latency of a cache-to-cache transfer from the last-level cache of the producer caching agent. The directory controller uses the communication history information stored in a probe filter to steer cache lines evicted from a cache of a producer caching agent to a cache of a consumer caching agent.
Referring to
Probe filter 112 includes line probe filter 118 and buffer 114. Buffer 114 is used and reused as temporary storage for communications between caching agent 102, caching agent 104, and main memory 110. In at least one embodiment, probe filter 112 also includes page probe filter 116, which tracks pages (e.g., 4 KB pages) stored in the caches of coherence domain 122, while line probe filter 118 tracks the caching status of any cache lines shared across caching agents (e.g., written to by a core) in coherence domain 122 and any associated communication history.
Referring to
Referring to
In at least one embodiment, victim packet 162 carries a communication bypass attribute that causes directory controller 121 to send storing probe 164 to the previous owner indicated in the communication history field of associated line probe filter entry 300 in line probe filter 118. Storing probe 164 serves as a hint to caching agent 102 that the cache line is being written back to main memory 110 and that cache-to-cache data transfers may be accelerated. If the previous owner is caching agent 102, then in response to storing probe 164, caching agent 102 sends prefetch 166. In response to prefetch 166, directory controller 121 sends response 168, which causes cache control logic to install the cache line into the last-level cache of caching agent 102 (e.g., level-three cache 128) and to set an associated communication attribute for the cache line in shadow tag 126. Since at least some of the cache line resides in buffer 114, the installation of the cache line in caching agent 102 is further accelerated by the prefetch, as compared to a later fetch of the cache line from main memory 110. If the communication attribute associated with the victim cache line has a value indicating that the cache line was not communicated to caching agent 104 by another caching agent, then caching agent 104 installs the data associated with that cache line in its last-level cache, as is typical of an eviction from the penultimate-level cache. Note that the probe messages and responses of
Referring to
A system implementing the communication bypass mechanism of
Thus, a communication bypass mechanism that reduces the latency of cache-to-cache transfers for producer-consumer communication between caching agents in a shared memory multiprocessor system has been described. The reduced latency increases throughput and reduces response times for workloads that have a high level of producer-consumer communication (e.g., inter-task communications). Bus trace analysis for benchmark simulations indicates that several exemplary workloads involve only a small number of cache lines in communication and access those cache lines heavily. Thus, buffer 114 need not be large to capture the set of cache lines involved in producer-consumer communication. The performance of cache-to-cache transfers in a probe filter-based (i.e., cache directory-based) system approaches the performance of cache-to-cache transfers in systems that include a large, central last-level cache shared by all cores in a node.
While circuits and physical structures have been generally presumed in describing embodiments of the invention, it is well recognized that in modern semiconductor design and fabrication, physical structures and circuits may be embodied in computer-readable descriptive form suitable for use in subsequent design, simulation, test or fabrication stages. Structures and functionality presented as discrete components in the exemplary configurations may be implemented as a combined structure or component. Various embodiments of the invention are contemplated to include circuits, systems of circuits, related methods, and tangible computer-readable medium having encodings thereon (e.g., VHSIC Hardware Description Language (VHDL), Verilog, GDSII data, Electronic Design Interchange Format (EDIF), and/or Gerber file) of such circuits, systems, and methods, all as described herein, and as defined in the appended claims. In addition, the computer-readable media may store instructions as well as data that can be used to implement the invention. The instructions/data may be related to hardware, software, firmware or combinations thereof.
The description of the invention set forth herein is illustrative, and is not intended to limit the scope of the invention as set forth in the following claims. For example, while the invention has been described in an embodiment in which a MOESI cache coherence protocol is used and three levels of cache are used, one of skill in the art will appreciate that the teachings herein can be utilized with other cache coherence protocols and caches having other numbers of levels. In addition, while the invention has been described in embodiments in which the caching agents are multi-core processors, one of skill in the art will appreciate that the teachings herein can be utilized to accelerate producer-consumer communication between any pair of caching agents (e.g., processor core-to-processor core, processor core-to-GPU, processor core-to-offload engine, etc.). Variations and modifications of the embodiments disclosed herein may be made based on the description set forth herein, without departing from the scope of the invention as set forth in the following claims.