Aspects disclosed herein relate to processing systems which implement speculative memory operations. More specifically, aspects disclosed herein relate to expediting cache misses through cache hit prediction.
Modern computing systems may include multiple processors, where each processor has one or more compute cores. Such systems often include multiple classes of data storage, including private caches, shared caches, and main memory. Private caches are termed as such because each processor has its own private cache, which is not accessed by the other processors in the system. Shared caches conventionally are larger than private caches, but are shared by multiple (or all) of the processors in the system. Such a shared cache is conventionally divided into many portions that are distributed across the system interconnect. Main memory conventionally is the largest unit of storage, and may be accessed by all processors in the system.
Conventionally, when a processor requests data, the system attempts to service the request using the private cache first. If the request misses in the private cache (e.g., the data is not present in the private cache), the system then checks the shared cache. If the request misses in the shared cache, the request is forwarded to main memory, where the request is serviced and the requested data is sent to the processor. However, many data requests miss in all caches (private and shared), and get serviced by main memory. Such requests spend many tens of cycles traversing the caches before reaching main memory. As such, system performance slows while the processor waits for the data request to be serviced.
Aspects disclosed herein relate to expediting cache misses in a shared cache using cache hit prediction.
In one aspect, a method comprises determining that a request to access data at a first physical address misses in a private cache of a processor. The method further comprises determining that a confidence value, received for the first physical address based on a hash value of the first physical address, exceeds a threshold value. The method further comprises issuing a speculative read request specifying the first physical address to a memory controller of a main memory to expedite a miss for the data at the first physical address in a shared cache.
In one aspect, a non-transitory computer-readable medium stores instructions that, when executed by a processor, cause the processor to perform an operation comprising determining that a request to access data at a first physical address misses in a private cache of the processor. The operation further comprises determining that a confidence value, received for the first physical address based on a hash value of the first physical address, exceeds a threshold value. The operation further comprises issuing a speculative read request specifying the first physical address to a memory controller of a main memory to expedite a miss for the data at the first physical address in a shared cache.
In one aspect, an apparatus comprises a plurality of computer processors, each processor comprising a respective private cache. The apparatus further comprises a shared cache shared by at least two of the processors and a main memory. The apparatus further comprises logic configured to perform an operation comprising determining that a request to access data at a first physical address misses in a private cache of a first processor of the plurality of processors. The operation further comprises determining that a confidence value, received for the first physical address based on a hash value of the first physical address, exceeds a threshold value. The operation further comprises issuing a speculative read request specifying the first physical address to a memory controller of the main memory to expedite a miss for the data at the first physical address in the shared cache.
In one aspect, an apparatus comprises a processor comprising a private cache, a shared cache, and a main memory. The apparatus further comprises means for determining that a request to access data at a first physical address misses in the private cache of the processor. The apparatus further comprises means for determining that a confidence value, received for the first physical address based on a hash value of the first physical address, exceeds a threshold value. The apparatus further comprises means for issuing a speculative read request specifying the first physical address to a memory controller of the main memory to expedite a miss for the data at the first physical address in the shared cache.
So that the manner in which the above recited aspects are attained and can be understood in detail, a more particular description of aspects of the disclosure, briefly summarized above, may be had by reference to the appended drawings.
It is to be noted, however, that the appended drawings illustrate only aspects of this disclosure and are therefore not to be considered limiting of its scope, for the disclosure may admit to other aspects.
Aspects disclosed herein expedite cache misses in a shared cache by employing cache hit prediction. Generally, aspects disclosed herein predict whether a data request that has missed in a private cache will be served by the shared cache. The prediction is based on a confidence value associated with a hash value computed for the physical memory address specified by the data request. If the confidence value exceeds a threshold, aspects disclosed herein predict that the data request will not be served by the shared cache, and issue a speculative read request to a memory controller which controls a main memory. The speculative request may be issued in addition to a demand request for the data in the shared cache. If the demand request misses in the shared cache, the demand request is forwarded to the memory controller, where the demand request and the speculative request merge. The merged request is then serviced by main memory, such that the requested data is brought to the private cache. Doing so resolves the cache miss in the private cache in less time than would conventionally be required.
Furthermore, aspects disclosed herein provide a training mechanism to reflect the accuracy of previous predictions. Generally, if the main memory provides the requested data, the prediction that the request will miss in the shared cache is determined to be correct, and the confidence value for the physical memory address is incremented. However, if the demand request is not served by main memory (e.g., the demand request is served by the shared cache), the confidence value for the physical memory address is decremented. Doing so improves subsequent predictions for the respective physical memory address.
The memory controller 104 manages the flow of data to and from the memory 105. Memory 105 may comprise physical memory in a physical address space. A memory management unit (not pictured) may be used to obtain translations of virtual addresses (e.g., from processor 101) to physical addresses for accessing memory 105. Although the memory 105 may be shared amongst one or more other processors 101 or processing elements, these are not illustrated for the sake of simplicity. Each processor 101 is allocated a respective private (inner) cache 102 comprising an L1 cache 108 and an L2 cache 109. The one or more processors 101 each share at least a portion of the L3 cache 110.
During program execution, the processor 101 first looks for needed data in the caches 102, 103. More specifically, the processor 101 first looks for the data in the L1 cache 108, followed by the L2 cache 109, and then the L3 cache 110. A cache hit represents the case where the needed data resides in the respective cache. Conversely, a cache miss represents the case where the needed data does not reside in the respective cache. A cache miss in one of the caches 108-110 causes a cache controller (not pictured) to issue a demand request for the data to the next level of the cache hierarchy. If the needed data is not resident in the L3 cache 110, the request is served by the memory 105. When the needed data is brought to one or more of the caches 108-110, the cache miss is said to resolve.
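By way of illustration only, the following C++ sketch models this lookup order behaviorally. The type and function names (e.g., CacheLevel, serviceRequest) are assumptions introduced for the example and do not correspond to any particular hardware structure.

```cpp
#include <cstdint>
#include <unordered_map>

// Behavioral model of the lookup order described above:
// L1, then L2, then the shared L3, and finally main memory.
enum class ServicedBy { L1, L2, L3, MainMemory };

struct CacheLevel {
    std::unordered_map<uint64_t, uint64_t> lines;  // physical address -> data
    bool lookup(uint64_t pa, uint64_t& data) const {
        auto it = lines.find(pa);
        if (it == lines.end()) return false;       // cache miss
        data = it->second;                         // cache hit
        return true;
    }
};

ServicedBy serviceRequest(uint64_t pa, uint64_t& data,
                          const CacheLevel& l1, const CacheLevel& l2,
                          const CacheLevel& l3,
                          const std::unordered_map<uint64_t, uint64_t>& mainMemory) {
    if (l1.lookup(pa, data)) return ServicedBy::L1;
    if (l2.lookup(pa, data)) return ServicedBy::L2;   // miss: demand request to next level
    if (l3.lookup(pa, data)) return ServicedBy::L3;
    data = mainMemory.at(pa);                         // miss in all caches: served by memory
    return ServicedBy::MainMemory;
}
```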
The CHiP 106 is a hardware structure configured to expedite cache misses using cache hit prediction. As shown, the CHiP 106 includes a prediction table 107 and a hash function 111. The hash function 111 may be any type of hash function. Generally, a hash function is any function that can be used to map data of arbitrary size to data of fixed size (e.g., a physical memory address to a hash value). The prediction table 107 is an N-entry structure which is indexed by a hash value produced by applying the hash function 111 to a physical memory address. Although the prediction table 107 may be of any size, in one aspect, the prediction table 107 includes 256 entries. Although depicted as being a separate component of the processor 101 for the sake of clarity, in at least one aspect, the CHiP 106 is disposed on an integrated circuit including the processor 101 and the caches 108, 109.
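As a non-limiting illustration of indexing a 256-entry table, the following sketch folds a physical address into an 8-bit index. The particular fold, and the assumed 64-byte cache line size, are arbitrary choices for the example, since the hash function 111 may be any hash function.

```cpp
#include <cstdint>

// Illustrative only: one possible hash that folds a physical address into an
// index for a 256-entry prediction table.
constexpr unsigned kTableEntries = 256;

uint8_t hashPhysicalAddress(uint64_t pa) {
    uint64_t lineAddr = pa >> 6;   // assume 64-byte cache lines (illustrative)
    uint64_t h = lineAddr;
    h ^= h >> 8;                   // fold higher-order bits into the low byte
    h ^= h >> 16;
    h ^= h >> 32;
    return static_cast<uint8_t>(h % kTableEntries);
}
```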
In operation, the CHiP 106 determines that a request for data misses in the private caches 102 (e.g., a miss in the L1 cache 108, followed by a miss in the L2 cache 109). When the CHiP 106 determines that the request misses in the L2 cache 109, the CHiP 106 applies the hash function 111 to the physical memory address specified by the request. If there is a hit on a hash value 201 in the prediction table 107 (e.g., the hash value produced by the hash function 111 exists in the prediction table 107), the CHiP 106 determines whether the associated confidence value 202 exceeds a threshold value. If the confidence value 202 exceeds the threshold value, the CHiP 106 issues a speculative read request specifying the physical address to the memory controller 104. The threshold value may be any value in the range of values supported by the N-bit confidence values 202. However, in one aspect, the threshold value is zero.
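A minimal sketch of this prediction step is shown below, assuming a simple map from hash value to confidence value and hypothetical names (onL2Miss, MemoryControllerPort); the actual prediction table 107 is a hardware structure rather than a software map.

```cpp
#include <cstdint>
#include <unordered_map>

// Hypothetical model of the prediction step on a private-cache (L2) miss:
// hash the physical address, look up the confidence value in the prediction
// table, and issue a speculative read only when the confidence exceeds the
// threshold.
struct MemoryControllerPort {
    void issueSpeculativeRead(uint64_t pa) { /* enqueue a speculative request */ }
};

void onL2Miss(uint64_t pa,
              const std::unordered_map<uint8_t, uint8_t>& predictionTable,  // hash -> confidence
              uint8_t threshold,
              MemoryControllerPort& memCtrl) {
    // Any hash of the physical address may be used; this fold is illustrative.
    uint8_t hash = static_cast<uint8_t>((pa >> 6) ^ (pa >> 14));
    auto it = predictionTable.find(hash);
    if (it == predictionTable.end()) {
        return;  // no matching entry: no prediction is made for this address
    }
    if (it->second > threshold) {
        // Predicted to also miss in the shared L3 cache: expedite by reading memory now.
        memCtrl.issueSpeculativeRead(pa);
    }
}
```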
As previously indicated, when the request misses in the L2 cache 109, a demand request for the data is issued to the L3 cache. If the demand request misses in the L3 cache, the demand request is forwarded to the memory controller 104. However, because the speculative request is received by the memory controller 104 prior to the demand request, the demand request merges with the speculative request when received by the memory controller 104. In at least one aspect, merging the speculative and demand requests comprises changing a status of the speculative request to a demand request. The memory controller 104 then processes the merged request, and the data is brought to the L1 cache 108 and/or the L2 cache 109. Doing so allows the miss in the private cache 102 to resolve faster than waiting for the demand request to be served by the memory controller 104 after missing in the L3 cache 110.
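The following sketch illustrates one way such a merge could behave, assuming the memory controller keeps an ordered queue of outstanding reads. Upgrading the queued speculative entry in place simply changes its status, consistent with the aspect in which merging comprises changing the status of the speculative request to a demand request, and preserves the speculative request's earlier queue position.

```cpp
#include <cstdint>
#include <deque>

// Illustrative read-queue merge at the memory controller: if a speculative read
// for the same physical address is already queued, the arriving demand request
// upgrades that entry instead of enqueuing a second one.
enum class RequestType { Speculative, Demand };

struct ReadRequest {
    uint64_t physicalAddress;
    RequestType type;
};

void receiveDemandRequest(std::deque<ReadRequest>& readQueue, uint64_t pa) {
    for (auto& req : readQueue) {
        if (req.physicalAddress == pa && req.type == RequestType::Speculative) {
            req.type = RequestType::Demand;  // merge: change status of the speculative request
            return;                          // keeps the earlier (older) queue position
        }
    }
    readQueue.push_back({pa, RequestType::Demand});  // no speculative entry: enqueue normally
}
```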
The prediction table 107 is depicted as a tagged structure in
At event 334, the memory controller 104 services the merged request using the main memory 105. Doing so transfers the data stored at PA1 to the L2 cache 109 (and/or the L1 cache 108), and resolves the initial miss for PA1 incurred at event 310. At event 336, the CHiP 106 increments the confidence value 202 of the entry associated with the hash value generated for PA1 in the prediction table 107. Doing so reflects that the prediction that PA1 will miss in the L3 cache 110 was correct, and allows the CHiP 106 to make future predictions that PA1 will miss in the L3 cache 110 with greater confidence.
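For illustration, a correct prediction may be trained with a saturating increment of the confidence value 202, as sketched below; the 2-bit counter width is an assumption, since the disclosure specifies only an N-bit confidence value.

```cpp
#include <algorithm>
#include <cstdint>

// Training on a correct prediction: the request was served by main memory, so
// the confidence value is incremented, saturating at the counter's maximum.
constexpr uint8_t kMaxConfidence = 3;   // 2-bit saturating counter (assumption)

void onServedByMainMemory(uint8_t& confidence) {
    confidence = std::min<uint8_t>(confidence + 1, kMaxConfidence);
}
```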
At event 346, the CHiP 106 receives the cancel acknowledgement from the memory controller 104. At event 348, the CHiP 106 decrements the confidence value 202 for the entry associated with the hash of PA1 in the prediction table 107. The CHiP 106 decrements the confidence value 202 to reflect that the prediction that PA1 would miss in the L3 cache 110 was incorrect. In at least one aspect, however, the CHiP 106 resets the confidence value 202 to zero rather than decrementing the confidence value 202. Doing so may prevent the CHiP 106 from issuing excessive speculative reads to the memory controller 104. At event 350, the CHiP 106 and/or the cache controller releases an entry reflecting the miss in the L2 cache for PA1 (e.g., in a miss status holding register (MSHR), and/or an outgoing request buffer).
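The corresponding misprediction update, decrementing the confidence value 202 or resetting it to zero as described above, may be sketched as follows (names are illustrative):

```cpp
#include <cstdint>

// Training on a misprediction: the request was served by the L3 cache, so the
// entry's confidence is either decremented (saturating at zero) or, in the
// alternative aspect, reset directly to zero to suppress further speculative reads.
void onServedBySharedCache(uint8_t& confidence, bool resetInsteadOfDecrement) {
    if (resetInsteadOfDecrement) {
        confidence = 0;
    } else if (confidence > 0) {
        --confidence;
    }
}
```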
More generally, in some aspects, the CHiP 106 may optionally modify the threshold value applied to the confidence values 202. For example, if a computed average of the confidence values 202 in the prediction table 107 increases over time, the CHiP 106 may increase the threshold to reduce the number of speculative reads issued to the memory controller 104. Similarly, if the average of the confidence values 202 decreases over time, the CHiP 106 may decrease the threshold to expedite more misses in the L3 cache 110.
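One possible form of this optional adaptation is sketched below. Tracking the previous average and adjusting the threshold by one step at a time are assumptions made for the example, as the disclosure does not specify a particular policy or step size.

```cpp
#include <cstdint>
#include <numeric>
#include <vector>

// Illustrative threshold adaptation: periodically compare the average confidence
// across the prediction table to its previous value and nudge the threshold.
uint8_t adaptThreshold(const std::vector<uint8_t>& confidences,
                       double& previousAverage, uint8_t threshold, uint8_t maxThreshold) {
    if (confidences.empty()) return threshold;
    double avg = std::accumulate(confidences.begin(), confidences.end(), 0.0) /
                 confidences.size();
    if (avg > previousAverage && threshold < maxThreshold) {
        ++threshold;   // average rising: raise threshold to issue fewer speculative reads
    } else if (avg < previousAverage && threshold > 0) {
        --threshold;   // average falling: lower threshold to expedite more L3 misses
    }
    previousAverage = avg;
    return threshold;
}
```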
Means for storing data in the caches 108-110, memory 105, CHiP 106, prediction table 107, and queue 301 include one or more memory cells. Means for searching and modifying data stored in the caches 108-110, memory 105, CHiP 106, prediction table 107, and queue 301 include logic implemented as hardware and/or software. Similarly, the logic implemented as hardware and/or software may serve as means for reading and/or writing values, returning indications of hits and/or misses, evicting entries, and returning values from the caches 108-110, memory 105, CHiP 106, prediction table 107, and queue 301. Examples of such logic include memory controllers (e.g., the memory controller 104), cache controllers, and data controllers.
Returning to block 620, the L3 cache 110 may store the needed data. In such aspects, the method 600 proceeds to block 650, where the CHiP 106 determines that the L3 cache 110 serviced the cache miss (e.g., the data associated with the first physical address was present in the L3 cache 110 and was brought to the L2 cache 109). In response, the CHiP 106 issues a speculative read cancel instruction for the first physical address to the memory controller 104. At block 660, the memory controller 104 receives the speculative read cancel instruction from the CHiP 106. In response, the memory controller 104 terminates the speculative read for the first physical memory address (e.g., removes the corresponding entry from the queue 301), and transmits a cancel acknowledgement to the CHiP 106. At block 670, the CHiP 106 receives the cancel acknowledgement from the memory controller 104. At block 680 the CHiP 106 releases an indication of the initial miss in the L2 cache 109.
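A hedged sketch of the memory controller's handling of the speculative read cancel (blocks 660-670) follows; representing queue 301 as a std::deque and always returning an acknowledgement are assumptions made for illustration.

```cpp
#include <cstdint>
#include <deque>

// Illustrative cancel handling: drop the matching speculative entry from the
// read queue, then acknowledge the cancel back to the CHiP.
enum class ReqType { Speculative, Demand };
struct QueuedRead { uint64_t physicalAddress; ReqType type; };

bool handleSpeculativeReadCancel(std::deque<QueuedRead>& readQueue, uint64_t pa) {
    for (auto it = readQueue.begin(); it != readQueue.end(); ++it) {
        if (it->physicalAddress == pa && it->type == ReqType::Speculative) {
            readQueue.erase(it);   // terminate the speculative read
            return true;           // caller transmits the cancel acknowledgement
        }
    }
    return true;                   // nothing queued (e.g., already serviced); still acknowledge
}
```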
An example apparatus in which exemplary aspects of this disclosure may be utilized is discussed in relation to
Accordingly, in a particular aspect, input device 830 and power supply 844 are coupled to the system-on-chip device 822. Moreover, in a particular aspect, as illustrated in
Although
Advantageously, aspects disclosed herein provide techniques to expedite cache misses in a shared cache using cache hit prediction. Generally, aspects disclosed herein predict whether a request to access data at a first physical address will miss in the shared cache (e.g., the L3 cache 110) after a miss for the data has been incurred at a private cache (e.g., the L2 cache 109). If the prediction is for a miss in the L3 cache 110, aspects disclosed herein expedite the miss in the L3 cache 110 by sending a speculative read request to the memory controller 104. If there is a miss in the L3 cache 110, the memory controller 104 receives a demand request for the first physical address, which merges with the speculative request, giving the demand request a higher relative priority in a read queue.
A number of aspects have been described. However, various modifications to these aspects are possible, and the principles presented herein may be applied to other aspects as well. The various tasks of such methods may be implemented as sets of instructions executable by one or more arrays of logic elements, such as microprocessors, embedded controllers, or IP cores.
The various operations of methods described above may be performed by any suitable means capable of performing the operations, such as a processor, firmware, application specific integrated circuit (ASIC), gate logic/registers, memory controller, or a cache controller. Generally, any operations illustrated in the Figures may be performed by corresponding functional means capable of performing the operations.
The foregoing disclosed devices and functionalities may be designed and configured into computer files (e.g., RTL, GDSII, GERBER, etc.) stored on computer-readable media. Some or all such files may be provided to fabrication handlers who configure fabrication equipment using the design data to fabricate the devices described herein. Resulting products include semiconductor wafers that are then cut into semiconductor die (e.g., the processor 101) and packaged into semiconductor chips, which may be further integrated into products including, but not limited to, mobile phones, smart phones, laptops, netbooks, tablets, ultrabooks, desktop computers, digital video recorders, set-top boxes, and any other devices where integrated circuits are used.
In one aspect, the computer files form a design structure including the circuits described above and shown in the Figures in the form of physical design layouts, schematics, or a hardware-description language (e.g., Verilog, VHDL, etc.). For example, the design structure may be a text file or a graphical representation of a circuit as described above and shown in the Figures. A design process preferably synthesizes (or translates) the circuits described above into a netlist, where the netlist is, for example, a list of wires, transistors, logic gates, control circuits, I/O, models, etc. that describes the connections to other elements and circuits in an integrated circuit design, recorded on at least one machine-readable medium. For example, the medium may be a storage medium such as a CD, a compact flash, other flash memory, or a hard-disk drive. In another aspect, the hardware, circuitry, and methods described herein may be configured into computer files that simulate the function of the circuits described above and shown in the Figures when executed by a processor. These computer files may be used in circuitry simulation tools, schematic editors, or other software applications.
The implementations of aspects disclosed herein may also be tangibly embodied (for example, in tangible, computer-readable features of one or more computer-readable storage media as listed herein) as one or more sets of instructions executable by a machine including an array of logic elements (e.g., a processor, microprocessor, microcontroller, or other finite state machine). The term “computer-readable medium” may include any medium that can store or transfer information, including volatile, nonvolatile, removable, and non-removable storage media. Examples of a computer-readable medium include an electronic circuit, a semiconductor memory device, a ROM, a flash memory, an erasable ROM (EROM), a floppy diskette or other magnetic storage, a CD-ROM/DVD or other optical storage, a hard disk or any other medium which can be used to store the desired information, a fiber optic medium, a radio frequency (RF) link, or any other medium which can be used to carry the desired information and can be accessed. The computer data signal may include any signal that can propagate over a transmission medium such as electronic network channels, optical fibers, air, electromagnetic, RF links, etc. The code segments may be downloaded via computer networks such as the Internet or an intranet. In any case, the scope of the present disclosure should not be construed as limited by such aspects.
The previous description of the disclosed aspects is provided to enable a person skilled in the art to make or use the disclosed aspects. Various modifications to these aspects will be readily apparent to those skilled in the art, and the principles defined herein may be applied to other aspects without departing from the scope of the disclosure. Thus, the present disclosure is not intended to be limited to the aspects shown herein but is to be accorded the widest scope possible consistent with the principles and novel features as defined by the following claims.