The technology of the disclosure relates generally to cache memory provided by processor-based devices, and, in particular, to prefetching cache lines by hardware prefetcher engines.
In many conventional processor-based devices, overall system performance may be constrained by memory access latency, which refers to the time required to request and retrieve data from relatively slow system memory. The effects of memory access latency may be mitigated somewhat through the use of one or more caches by a processor-based device to store and provide speedier access to frequently-accessed data. For instance, when data requested by a memory access request is present in a cache (i.e., a cache “hit”), system performance may be improved by retrieving the data from the cache instead of the slower system memory. Conversely, if the requested data is not found in the cache (resulting in a cache “miss”), the requested data must then be read from the system memory. As a result, frequent occurrences of cache misses may result in system performance degradation that could negate the advantage of using the cache in the first place.
To reduce the likelihood of cache misses, the processor-based device may provide a hardware prefetch engine (also referred to as a “prefetch circuit” or simply a “prefetcher”). The hardware prefetch engine may improve system performance of the processor-based device by predicting a subsequent memory access and prefetching the corresponding data prior to an actual memory access request being made. For example, in systems that tend to exhibit spatial locality, the hardware prefetch engine may be configured to prefetch data from a next memory address after the memory address of a current memory access request. The prefetched data may then be inserted into one or more cache lines of a cache. If the hardware prefetch engine successfully predicted the subsequent memory access, the corresponding data can be immediately retrieved from the cache.
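The next-address heuristic described above can be sketched in a few lines. This is an illustrative sketch only (the line size and function name are assumptions, not elements of the disclosure): a next-line prefetcher exploiting spatial locality computes the prefetch address by aligning the current demand address down to its cache-line boundary and adding the line size.

```python
LINE_SIZE = 64  # assumed cache-line size in bytes

def next_line_prefetch_addr(demand_addr: int) -> int:
    """Return the address of the cache line immediately following the
    line that contains demand_addr."""
    line_base = demand_addr & ~(LINE_SIZE - 1)  # align down to line boundary
    return line_base + LINE_SIZE

print(hex(next_line_prefetch_addr(0x1004)))  # prefetch targets line 0x1040
```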
However, inaccurate prefetches generated by the hardware prefetch engine may negatively impact system performance in a number of ways. For example, prefetched data that is not actually useful (i.e., no subsequent memory access requests are directed to the prefetched data) may pollute the cache by causing the eviction of cache lines storing useful data. The prefetching operations performed by the hardware prefetch engine may also increase consumption of power and memory bandwidth, without the benefit of the prefetched data being useful. Thus, it is desirable to provide a mechanism to increase the likelihood that data prefetched by the hardware prefetch engine will prove useful.
Aspects disclosed in the detailed description include adaptively predicting usefulness of prefetches generated by hardware prefetch engines in processor-based devices. In this regard, in some aspects, a processor-based device provides a hardware prefetch engine that includes a sampler circuit and a predictor circuit. The sampler circuit is configured to store data related to demand requests and prefetch requests that are directed to a subset of sets of a cache of the processor-based device. The sampler circuit maintains a plurality of sampler set entries, each of which corresponds to a set of the cache and includes a plurality of sampler line entries corresponding to memory addresses of the set. Each sampler line entry comprises a prefetch indicator that indicates whether the corresponding memory line was added to the sampler circuit in response to a prefetch request or a demand request. The predictor circuit includes a plurality of confidence counters that correspond to the sampler line entries of the sampler circuit, and that indicate a level of confidence in the usefulness of the corresponding sampler line entry. The confidence counters provided by the predictor circuit are trained in response to demand request hits and misses (and, in some aspects, on prefetch misses) on the memory lines tracked by the sampler circuit. In particular, on a demand line hit corresponding to a sampler line entry, the predictor circuit increments the confidence counter corresponding to the sampler line entry if the prefetch indicator of the sampler line entry is set (thus indicating that the memory line was populated by a prefetch request). Conversely, on a demand line miss, the predictor circuit decrements the confidence counter associated with a sampler line entry corresponding to an evicted memory line if the prefetch indicator of the sampler line entry is set.
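The training policy just described can be summarized in a minimal Python sketch. The class name, counter width, and saturation bounds below are illustrative assumptions, not the disclosed hardware: confidence rises on a demand hit to a prefetch-inserted line (and the prefetch indicator is cleared), and falls when a prefetch-inserted line is evicted without ever being demanded.

```python
class SamplerLineEntry:
    def __init__(self, tag, prefetched):
        self.tag = tag
        self.prefetched = prefetched  # the prefetch indicator

COUNTER_MAX = 15  # assumed 4-bit saturating counter

def train_on_demand_hit(counters, entry_idx, entry):
    # Demand hit on a line the sampler inserted for a prefetch:
    # the prefetch proved useful, so raise confidence (saturating).
    if entry.prefetched:
        counters[entry_idx] = min(counters[entry_idx] + 1, COUNTER_MAX)
        entry.prefetched = False  # indicator is cleared on the hit

def train_on_demand_miss_eviction(counters, victim_idx, victim):
    # Demand miss evicts a prefetched line that was never demanded:
    # the prefetch was useless, so lower confidence (saturating at 0).
    if victim.prefetched:
        counters[victim_idx] = max(counters[victim_idx] - 1, 0)
```

Note that lines inserted by demand requests (prefetch indicator clear) leave the counters untouched in this sketch, consistent with training being gated on the indicator.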
The predictor circuit may then use the confidence counters to generate a usefulness prediction for a subsequent prefetch request corresponding to a sampler line entry of the sampler circuit. In some aspects, the hardware prefetch engine may further provide an adaptive threshold adjustment (ATA) circuit configured to adaptively modify a confidence threshold of the predictor circuit and/or a bandwidth ratio threshold of the ATA circuit to further fine-tune the accuracy of the usefulness predictions generated by the predictor circuit.
In another aspect, a hardware prefetch engine of a processor-based device is provided. The hardware prefetch engine comprises a sampler circuit that comprises a plurality of sampler set entries, each corresponding to a set of a plurality of sets of a cache. Each sampler set entry comprises a plurality of sampler line entries, each of which comprises a prefetch indicator and corresponds to a memory address indicated by one of a demand request and a prefetch request. The hardware prefetch engine further comprises a predictor circuit that comprises a plurality of confidence counters, each of which corresponds to a sampler line entry of the sampler circuit. The predictor circuit is configured to, responsive to a demand request hit on the sampler circuit, increment a confidence counter of the plurality of confidence counters corresponding to a sampler line entry of the sampler circuit corresponding to the demand request hit and having the prefetch indicator of the sampler line entry set. The predictor circuit is further configured to, responsive to the demand request hit on the sampler circuit, clear the prefetch indicator of the sampler line entry. The predictor circuit is also configured to, responsive to a demand request miss on the sampler circuit, decrement a confidence counter of the plurality of confidence counters corresponding to a sampler line entry of the sampler circuit evicted as a result of the demand request miss and having the prefetch indicator of the sampler line entry set. The predictor circuit is also configured to, responsive to a prefetch request, generate a usefulness prediction for the prefetch request based on comparing a value of a confidence threshold with a value of a confidence counter of the plurality of confidence counters corresponding to a sampler line entry of the sampler circuit identified by the prefetch request.
In another aspect, a hardware prefetch engine of a processor-based device is provided. The hardware prefetch engine comprises a means for providing a plurality of sampler set entries each corresponding to a set of a plurality of sets of a cache, and comprising a plurality of sampler line entries each comprising a prefetch indicator and corresponding to a memory address indicated by one of a demand request and a prefetch request. The hardware prefetch engine further comprises a means for incrementing a confidence counter of a plurality of confidence counters corresponding to a sampler line entry corresponding to a demand request hit and having the prefetch indicator of the sampler line entry set, responsive to the demand request hit. The hardware prefetch engine also comprises a means for clearing the prefetch indicator of the sampler line entry, responsive to the demand request hit. The hardware prefetch engine additionally comprises a means for decrementing a confidence counter of the plurality of confidence counters corresponding to a sampler line entry of the sampler circuit evicted as a result of a demand request miss and having the prefetch indicator of the sampler line entry set, responsive to the demand request miss. The hardware prefetch engine further comprises a means for generating a usefulness prediction for a prefetch request based on comparing a value of a confidence threshold with a value of a confidence counter of the plurality of confidence counters corresponding to a sampler line entry of the sampler circuit identified by the prefetch request, responsive to the prefetch request.
In another aspect, a method for predicting prefetch usefulness is provided. The method comprises, responsive to a demand request hit on a sampler circuit of a hardware prefetch engine of a processor-based device, the sampler circuit comprising a plurality of sampler set entries each corresponding to a set of a plurality of sets of a cache and each comprising a plurality of sampler line entries each comprising a prefetch indicator and corresponding to a memory address indicated by one of a demand request and a prefetch request, incrementing, by a predictor circuit of the hardware prefetch engine, a confidence counter of a plurality of confidence counters corresponding to a sampler line entry of the sampler circuit corresponding to the demand request hit and having the prefetch indicator of the sampler line entry set. The method further comprises, responsive to the demand request hit on the sampler circuit, clearing the prefetch indicator of the sampler line entry. The method also comprises, responsive to a demand request miss on the sampler circuit, decrementing, by the predictor circuit, a confidence counter of the plurality of confidence counters corresponding to a sampler line entry of the sampler circuit evicted as a result of the demand request miss and having the prefetch indicator of the sampler line entry set. The method additionally comprises, responsive to a prefetch request, generating, by the predictor circuit, a usefulness prediction for the prefetch request based on comparing a value of a confidence threshold with a value of a confidence counter of the plurality of confidence counters corresponding to a sampler line entry of the sampler circuit identified by the prefetch request.
With reference now to the drawing figures, several exemplary aspects of the present disclosure are described. The word “exemplary” is used herein to mean “serving as an example, instance, or illustration.” Any aspect described herein as “exemplary” is not necessarily to be construed as preferred or advantageous over other aspects.
Aspects disclosed in the detailed description include adaptively predicting usefulness of prefetches generated by hardware prefetch engines in processor-based devices. Accordingly, in this regard,
The processor-based device 100 further includes a cache 108 for caching frequently accessed data retrieved from the system memory 106 or from another, lower-level cache (i.e., a larger and slower cache, hierarchically positioned at a level between the cache 108 and the system memory 106). Thus, the cache 108 according to some aspects may comprise a Level 1 (L1) cache, a Level 2 (L2) cache, or another cache lower in a memory hierarchy. In the example of
It is to be understood that the processor-based device 100 and the illustrated elements thereof may encompass any one of known digital logic elements, semiconductor circuits, processing cores, and/or memory structures, among other elements, or combinations thereof. Aspects described herein are not restricted to any particular arrangement of elements, and the disclosed techniques may be easily extended to various structures and layouts on semiconductor dies or packages. It is to be further understood that aspects of the processor-based device 100 of
The cache 108 of the processor-based device 100 may be used to provide speedier access to frequently-accessed data retrieved from the system memory 106 and/or from a higher-level cache (as in aspects in which the cache 108 is an L2 cache storing frequently accessed data from an L1 cache, as a non-limiting example). To minimize the number of cache misses that may be incurred by the cache 108, the processor-based device 100 also includes the hardware prefetch engine 102. The hardware prefetch engine 102 comprises a prefetcher circuit 114 that is configured to predict memory accesses and generate prefetch requests for the corresponding prefetch data (e.g., from the system memory 106 and/or from a higher-level cache). In some aspects in which memory access requests tend to exhibit spatial locality, the prefetcher circuit 114 of the hardware prefetch engine 102 may be configured to prefetch data from a next memory address after the memory address of a current memory access request. Some aspects may provide that the prefetcher circuit 114 of the hardware prefetch engine 102 is configured to detect patterns of memory access requests, and predict future memory access requests based on the detected patterns.
However, as noted above, if the prefetcher circuit 114 generates inaccurate prefetch requests, the overall system performance of the processor-based device 100 may be negatively impacted. For example, the cache 108 may suffer from cache pollution if prefetched data that is not actually useful causes the eviction of one or more of the cache lines 112(0)-112(C), 112′(0)-112′(C) that are storing useful data. Inaccurate prefetch requests also may increase consumption of power and memory bandwidth, without the benefit of the prefetched data being useful.
In this regard, the hardware prefetch engine 102 of the processor-based device 100 of
To illustrate elements of the sampler circuit 116 of
To accurately mimic the activities of the cache 108, the sampler circuit 116 stores data related to the sets 110(0)-110(S) of the cache 108 that are targeted by either a demand request 206 or a prefetch request 208. Moreover, the sampler circuit 116 stores data related to both prefetch requests that are predicted useful (and thus result in prefetch data being retrieved and stored in the cache 108) as well as prefetch requests that are predicted useless (and thus are discarded without affecting the content of the cache 108). Accordingly, data may be inserted into the sampler circuit 116 in response to demand loads, prefetches predicted to be useful, and prefetches predicted to be useless.
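The set-sampling behavior described above might be modeled as follows. This is a hedged sketch under stated assumptions: the sampling rule (every Nth cache set), the sampler associativity, and the FIFO replacement stand-in are all illustrative choices, not the disclosed circuit. The essential point it shows is that only a subset of sets is tracked, and that insertions occur for demand loads and for prefetches regardless of whether they were predicted useful.

```python
SAMPLE_EVERY = 16  # assume one in every 16 cache sets is sampled
WAYS = 4           # assumed sampler associativity

def is_sampled(set_index: int) -> bool:
    return set_index % SAMPLE_EVERY == 0

def insert(sampler, set_index, tag, from_prefetch):
    """Insert a line entry into the sampler if its set is sampled.
    Returns the evicted entry, if any (FIFO stands in for the real
    replacement policy)."""
    if not is_sampled(set_index):
        return None
    ways = sampler.setdefault(set_index, [])
    victim = ways.pop(0) if len(ways) >= WAYS else None
    ways.append({"tag": tag, "prefetched": from_prefetch})
    return victim
```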
To further illustrate data that may be stored within each of the sampler line entries 204(0)-204(C), 204′(0)-204′(C),
The confidence counters 302(0)-302(Q) are incremented or decremented by the predictor circuit 118 in response to a demand request hit or a demand request miss (resulting in an eviction) on the sampler circuit 116, and, in some aspects, in response to a prefetch request miss on the sampler circuit 116. This process of incrementing and decrementing the confidence counters 302(0)-302(Q) is referred to as “training” the predictor circuit 118, and is discussed in greater detail below with respect to
However, if it is determined at decision block 402 of
If it is determined at decision block 402 of
Referring now to
To illustrate an exemplary process that may be performed by the predictor circuit 118 of
In some aspects, the operations of block 502 for generating the usefulness prediction 306 may include first determining whether a value of the confidence counter 302(Q) corresponding to the sampler line entry 204(C) of the sampler circuit 116 identified by the prefetch request 208 is greater than the value of the confidence threshold 304 (block 504). Accordingly, the predictor circuit 118 may be referred to herein as “a means for determining whether the value of the confidence counter corresponding to the sampler line entry of the sampler circuit identified by the prefetch request is greater than the value of the confidence threshold.” If the value of the confidence counter 302(Q) is determined at decision block 504 to be greater than the value of the confidence threshold 304, the predictor circuit 118 generates the usefulness prediction 306 indicating that the prefetch request 208 is useful (block 506). The predictor circuit 118 thus may be referred to herein as “a means for generating the usefulness prediction indicating that the prefetch request is useful, responsive to determining that the value of the confidence counter is greater than the value of the confidence threshold.” However, if the value of the confidence counter 302(Q) is not greater than the value of the confidence threshold 304, the predictor circuit 118 generates the usefulness prediction 306 indicating that the prefetch request 208 is not useful (block 508). In this regard, the predictor circuit 118 may be referred to herein as “a means for generating the usefulness prediction indicating that the prefetch request is not useful, responsive to determining that the value of the confidence counter is not greater than the value of the confidence threshold.”
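The comparison in blocks 504-508, together with the recording step described next, reduces to a short sketch. Function and field names are illustrative assumptions; the strictly-greater-than test matches the comparison in the text.

```python
def predict_and_record(entry, counter_value, confidence_threshold):
    """Generate the usefulness prediction for a prefetch request and
    record it on the sampler line entry so that mispredictions can be
    tallied later."""
    useful = counter_value > confidence_threshold  # strict comparison
    entry["predicted_useful"] = useful
    return useful
```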
In some aspects, the predictor circuit 118 may also update a predicted useful indicator 214(C) of the sampler line entry 204(C) of the sampler circuit 116 identified by the prefetch request 208 based on the usefulness prediction 306 (block 510). Accordingly, the predictor circuit 118 may be referred to herein as “a means for updating a predicted useful indicator of the sampler line entry identified by the prefetch request based on the usefulness prediction.” By updating the predicted useful indicator 214(C) based on the usefulness prediction 306, the predictor circuit 118 can track the disposition of the sampler line entries 204(0)-204(C), 204′(0)-204′(C) to determine misprediction rates. Processing in some aspects may continue at block 512 of
Turning now to
According to some aspects, the predictor circuit 118 may determine whether the prefetch request 208 results in a miss on the sampler circuit 116 (block 516). In such aspects, a miss on the sampler circuit 116 may cause the predictor circuit 118 to be trained in much the same way as if the demand request 206 results in a miss. Accordingly, the predictor circuit 118 decrements the confidence counter 302(Q) corresponding to the sampler line entry 204(C) of the sampler circuit 116 evicted as a result of the prefetch request 208 miss and having the prefetch indicator 216(C) of the sampler line entry 204(C) set (block 520). In this regard, the predictor circuit 118 may be referred to herein as “a means for decrementing a confidence counter corresponding to a sampler line entry of the sampler circuit evicted as a result of a prefetch request miss and having the prefetch indicator of the sampler line entry set, responsive to the prefetch request miss.” If the predictor circuit 118 determines at decision block 516 that the prefetch request 208 results in a hit on the sampler circuit 116, processing continues in conventional fashion (block 522).
To illustrate exemplary elements of the ATA circuit 120 of
In some aspects, operations of block 700 for calculating the misprediction rate 604 may take place during an interval defined by a specified number of elapsed processor cycles or a specified number of executed instructions. The misprediction rate 604 in such aspects may be calculated by tracking a total number of mispredictions during this interval. For example, if the predicted useful indicator 214(C) for a sampler line entry 204(C) indicates that the sampler line entry 204(C) was considered useful, but the prefetch indicator 216(C) for the sampler line entry 204(C) indicates that the sampler line entry 204(C) was never targeted by a demand request 206 before eviction, the sampler line entry 204(C) is categorized as a misprediction, and the total number of mispredictions is incremented. Conversely, if the predicted useful indicator 214(C) for the sampler line entry 204(C) indicates that the sampler line entry 204(C) was considered not useful, but the prefetch indicator 216(C) for the sampler line entry 204(C) indicates that the sampler line entry 204(C) was consumed by a demand request 206, the sampler line entry 204(C) is categorized as a misprediction, and the total number of mispredictions is incremented. At the end of the interval, the total number of mispredictions may then be compared to a total number of predictions made during the interval to determine the misprediction rate 604.
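The interval tally just described can be condensed into a small sketch. Field names are illustrative assumptions; here `prefetched` still being set at eviction means the line was never consumed by a demand request. An entry is mispredicted either when it was predicted useful but evicted untouched, or predicted not useful yet consumed, which collapses to a single equality test.

```python
def misprediction_rate(evicted_entries):
    """evicted_entries: dicts with 'predicted_useful' and 'prefetched'
    (True meaning the line was never hit by a demand request before
    eviction). Returns the fraction of entries mispredicted."""
    total = len(evicted_entries)
    if total == 0:
        return 0.0
    mispredicted = sum(
        1 for e in evicted_entries
        # useful-but-untouched, or not-useful-but-consumed
        if e["predicted_useful"] == e["prefetched"]
    )
    return mispredicted / total
```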
Returning to
To illustrate exemplary operations that may be performed by the ATA circuit 120 to adjust the prediction accuracy threshold 602 of
If it is determined at decision block 802 of
Adaptively predicting usefulness of prefetches generated by hardware prefetch engines in processor-based devices according to aspects disclosed herein may be provided in or integrated into any processor-based device. Examples, without limitation, include a set top box, an entertainment unit, a navigation device, a communications device, a fixed location data unit, a mobile location data unit, a global positioning system (GPS) device, a mobile phone, a cellular phone, a smart phone, a session initiation protocol (SIP) phone, a tablet, a phablet, a server, a computer, a portable computer, a mobile computing device, a wearable computing device (e.g., a smart watch, a health or fitness tracker, eyewear, etc.), a desktop computer, a personal digital assistant (PDA), a monitor, a computer monitor, a television, a tuner, a radio, a satellite radio, a music player, a digital music player, a portable music player, a digital video player, a video player, a digital video disc (DVD) player, a portable digital video player, an automobile, a vehicle component, avionics systems, a drone, and a multicopter.
In this regard,
Other master and slave devices can be connected to the system bus 908. As illustrated in
The CPU(s) 902 may also be configured to access the display controller(s) 920 over the system bus 908 to control information sent to one or more displays 926. The display controller(s) 920 sends information to the display(s) 926 to be displayed via one or more video processors 928, which process the information to be displayed into a format suitable for the display(s) 926. The display(s) 926 can include any type of display, including, but not limited to, a cathode ray tube (CRT), a liquid crystal display (LCD), a plasma display, etc.
Those of skill in the art will further appreciate that the various illustrative logical blocks, modules, circuits, and algorithms described in connection with the aspects disclosed herein may be implemented as electronic hardware, instructions stored in memory or in another computer readable medium and executed by a processor or other processing device, or combinations of both. The master and slave devices described herein may be employed in any circuit, hardware component, integrated circuit (IC), or IC chip, as examples. Memory disclosed herein may be any type and size of memory and may be configured to store any type of information desired. To clearly illustrate this interchangeability, various illustrative components, blocks, modules, circuits, and steps have been described above generally in terms of their functionality. How such functionality is implemented depends upon the particular application, design choices, and/or design constraints imposed on the overall system. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present disclosure.
The various illustrative logical blocks, modules, and circuits described in connection with the aspects disclosed herein may be implemented or performed with a processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA) or other programmable logic device, discrete gate or transistor logic, discrete hardware components, or any combination thereof designed to perform the functions described herein. A processor may be a microprocessor, but in the alternative, the processor may be any conventional processor, controller, microcontroller, or state machine. A processor may also be implemented as a combination of computing devices (e.g., a combination of a DSP and a microprocessor, a plurality of microprocessors, one or more microprocessors in conjunction with a DSP core, or any other such configuration).
The aspects disclosed herein may be embodied in hardware and in instructions that are stored in hardware, and may reside, for example, in Random Access Memory (RAM), flash memory, Read Only Memory (ROM), Electrically Programmable ROM (EPROM), Electrically Erasable Programmable ROM (EEPROM), registers, a hard disk, a removable disk, a CD-ROM, or any other form of computer readable medium known in the art. An exemplary storage medium is coupled to the processor such that the processor can read information from, and write information to, the storage medium. In the alternative, the storage medium may be integral to the processor. The processor and the storage medium may reside in an ASIC. The ASIC may reside in a remote station. In the alternative, the processor and the storage medium may reside as discrete components in a remote station, base station, or server.
It is also noted that the operational steps described in any of the exemplary aspects herein are described to provide examples and discussion. The operations described may be performed in numerous different sequences other than the illustrated sequences. Furthermore, operations described in a single operational step may actually be performed in a number of different steps. Additionally, one or more operational steps discussed in the exemplary aspects may be combined. It is to be understood that the operational steps illustrated in the flowchart diagrams may be subject to numerous different modifications as will be readily apparent to one of skill in the art. Those of skill in the art will also understand that information and signals may be represented using any of a variety of different technologies and techniques. For example, data, instructions, commands, information, signals, bits, symbols, and chips that may be referenced throughout the above description may be represented by voltages, currents, electromagnetic waves, magnetic fields or particles, optical fields or particles, or any combination thereof.
The previous description of the disclosure is provided to enable any person skilled in the art to make or use the disclosure. Various modifications to the disclosure will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other variations without departing from the spirit or scope of the disclosure. Thus, the disclosure is not intended to be limited to the examples and designs described herein, but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.