Embodiments of the present invention relate to operation of a processor, and more particularly to prefetching data for use in a processor.
Processors perform operations on data in response to program instructions. Today's processors operate at ever-increasing speeds, allowing operations to be performed rapidly. Data needed for operations must be present in the processor. If the data is not present in the processor when it is needed, a latency occurs, namely the time it takes to load the data into the processor. Such a latency may be low or high, depending on where within the various levels of a memory hierarchy the data is obtained. Accordingly, prefetching schemes are used to obtain data or instructions and provide them to a processor prior to their use in a processor's execution units. When this data is readily available to an execution unit, latencies are reduced and performance is increased.
Oftentimes a prefetching scheme will prefetch information and store it in a cache memory of the processor. However, such prefetching and storage in a cache memory can cause the eviction of other data from the cache memory. The evicted data, when needed, can only be obtained at the expense of a long latency. Such eviction and the resulting delays are commonly referred to as cache pollution. If the prefetched information is not used, the prefetch and eviction of data provide no benefit. In addition to potential performance slowdowns due to cache pollution, excessive prefetching can cause increased bus traffic, which leads to further bottlenecks, reducing performance.
While for many applications prefetching is a critical component of improved processing performance, unconstrained prefetching can actually harm performance in some applications. This is especially so as processors expand to include multiple cores, with multiple threads executing per core. Accordingly, unconstrained prefetching schemes that work well in a single-core and/or single-threaded environment can negatively impact performance in a multi-core and/or multi-threaded environment.
In various embodiments, mechanisms may be provided to enable throttling of prefetching. Such throttling may be performed on a per-thread basis to enable fine-grained control of prefetching activity. In this way, prefetching may be performed when it improves thread performance, while prefetching may be constrained in situations in which it would negatively impact performance. By performing an analysis of prefetching, a mechanism in accordance with an embodiment of the present invention may set a throttling policy, e.g., on a per-thread basis, to either allow prefetching in an unconstrained manner or to throttle such prefetching. In various embodiments, different manners of throttling prefetching may be realized, including disabling prefetching, reducing an amount of prefetching, or other such measures. In some implementations, a prefetching throttling policy may be used to initialize prefetch detectors, which are tables or the like allocated to particular memory regions. In this way, these prefetch detectors may have a throttling policy set on allocation that enables throttling to occur from allocation, even where the prefetch detector lacks information to make a throttling decision on its own. Accordingly, ill effects potentially associated with unconstrained prefetching may be limited where a prefetch detector is allocated with an initial throttling policy set to a throttled state.
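Purely by way of illustration, the following C fragment models one way such a detector might inherit a per-thread throttling policy at allocation time; the structure, field names and the MAX_THREADS constant are hypothetical and are not taken from any particular embodiment described herein:

    #include <stdbool.h>
    #include <stdint.h>

    #define MAX_THREADS 2  /* hypothetical: two hardware threads */

    /* per-thread throttle policy, maintained by the throttling analysis */
    static bool thread_throttle_policy[MAX_THREADS];

    /* illustrative model of a prefetch detector allocated to a memory region */
    struct prefetch_detector {
        uint64_t region_base;   /* memory region tracked by this detector */
        uint32_t access_count;  /* demand accesses seen in this lifetime */
        uint8_t  thread_id;     /* thread whose access triggered allocation */
        bool     throttled;     /* policy inherited at allocation */
    };

    static void detector_allocate(struct prefetch_detector *d,
                                  uint64_t region_base, uint8_t thread_id)
    {
        d->region_base  = region_base;
        d->access_count = 0;
        d->thread_id    = thread_id;
        /* inherit the thread's current policy so that throttling can take
         * effect immediately, before the detector has its own history */
        d->throttled    = thread_throttle_policy[thread_id];
    }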
Prefetch throttling analysis may be implemented in different embodiments using various combinations of hardware, software and/or firmware. Furthermore, implementations may exist in many different processor architecture types, and in connection with different prefetching schemes, including schemes that do not use detectors.
Referring now to
Still referring to
Still referring to
Demand accesses corresponding to processor requests may be provided to prefetcher 40. In one embodiment, all such demand accesses may be sent, while in other embodiments only demand accesses associated with cache misses are sent to prefetcher 40. As shown in
Accordingly, to prevent such ill effects, embodiments of the present invention may analyze demand accesses to determine whether prefetching should be throttled. Demand accesses are requests for data at particular memory locations, issued by processor components as a result of instruction stream execution. Various manners of determining whether to throttle prefetching can be implemented. Referring now to
As shown in
Still referring to
If at diamond 140 the sample count is determined not to exceed the predetermined value, control passes back to block 110, where further demand accesses are tracked in additional allocated detectors. If instead at diamond 140 it is determined that the desired sample size of lifetimes is present, control passes to block 150. There the average accesses per prefetch detector lifetime may be determined (block 150). As one example determination, the total number of accesses accumulated may be averaged by dividing it by the sample size. In embodiments in which the sample size is a power of 2, this operation may be effected by taking only the desired number of most significant bits of the accumulated value. For example, the accumulated value may be maintained in 11 bits. For a desired lifetime sample size of 32, only the 6 most significant bits are then needed to obtain the average. Also at block 150, the sample count (and the accumulation value) may be reset.
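As an illustrative sketch of this power-of-2 averaging (assuming an 11-bit accumulator and a sample size of 32, i.e., 2^5), the division reduces to keeping the top 6 bits:

    #include <stdint.h>

    #define SAMPLE_SIZE_LOG2 5  /* sample size of 32 lifetimes = 2^5 */

    /* Divide the accumulated access count by the sample size. Because the
     * sample size is a power of 2, this reduces to a right shift: for an
     * 11-bit accumulator and a sample size of 32, the result fits in the
     * 6 most significant bits. */
    static inline uint8_t average_accesses_per_lifetime(uint16_t accumulated)
    {
        return (uint8_t)(accumulated >> SAMPLE_SIZE_LOG2);
    }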
Still referring to
Then from either of blocks 170 and 180, control may pass back to block 110, discussed above. Thus method 100 may be performed continuously during operation, such that dynamic analysis of demand accesses routinely occurs and prefetching, or throttling of prefetching, may proceed based on the nature of demand accesses currently being performed in a system. Because demand accesses and the characteristics of corresponding detector behavior are temporal in nature, such dynamic analysis and control of throttling may improve performance. For example, an application may at times switch from a predominant behavior to a transient behavior with respect to memory accesses. Embodiments of the present invention may thus set an appropriate throttling policy based on the nature of demand accesses currently being made.
While certain applications may exhibit a given demand access pattern that in turn either enables prefetching or causes throttling of prefetching, transient behavior of the application may change demand access patterns, at least for a given portion of execution. Accordingly, prefetch detectors in accordance with an embodiment of the present invention may include override logic to override a throttling policy when prefetching would improve performance for the current demand access pattern.
Referring now to
Still referring to
If instead at diamond 230, it is determined that the tracked accesses do exceed the override threshold, control passes to block 240. There, prefetching may be allowed for prefetch addresses generated for the memory region and thread associated with the detector (block 240). Accordingly, such an override mechanism allows for prefetching of accesses associated with a given detector even where the thread associated with that detector has a throttling policy set. In this way, transient behavior of the thread that indicates, e.g., streaming accesses may support prefetching, improving performance by reducing latencies to obtain data from memory. Likewise, a throttling policy may be overridden when a thread performs multiple tasks having different access profiles. While described with this particular implementation in the embodiment of
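A minimal sketch of such an override check, continuing the hypothetical detector model above (the OVERRIDE_THRESHOLD value is illustrative only and not taken from this disclosure):

    #define OVERRIDE_THRESHOLD 24  /* hypothetical per-detector access count */

    /* Allow prefetching for this detector if either the thread's policy is
     * unthrottled or the detector's own demand-access count exceeds the
     * override threshold (diamond 230 / block 240 above). */
    static bool prefetch_allowed(const struct prefetch_detector *d)
    {
        return !d->throttled || d->access_count > OVERRIDE_THRESHOLD;
    }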
In different implementations, prefetch throttling determinations and potential overriding of such policies may be implemented using various hardware, software and/or firmware. Referring now to
As shown in
As shown in
As shown in
Accordingly, based on the thread with which the deallocated detector 305 is associated, the corresponding count from register 308 is provided through one of first and second multiplexers 315a and 315b to a corresponding thread averager 330a and 330b. For purposes of the discussion herein, the mechanism with respect to the first thread (i.e., T0) will be discussed. However, it is to be understood that an equivalent path and similar control may occur for other threads (e.g., T1). Thread averager 330a may take the accumulated count value and accumulate it with a current count value present in a register 332a associated with thread averager 330a. This accumulated value corresponds to a total number of accesses for a given number of detector lifetimes. Specifically, upon each deallocation and transmission of an access count, a sample counter 320a is incremented and the incremented value is stored in an associated register 322a. Upon this incrementing, the incremented value is provided to a first logic unit 325a, which may compare this incremented sample count to a preset threshold. This preset threshold may correspond to a desired number of sample lifetimes to be analyzed. As described above, in some implementations this sample lifetime value may be a power of two and may correspond to 16 or 32, in some embodiments. Accordingly, when the desired number of sample lifetimes has been obtained and its demand access counts accumulated in thread averager 330a, first logic 325a may send a control signal to enable the averaging of the total number of demand accesses. In one embodiment, such averaging may be implemented by dropping off the least significant bits (LSBs) of register 332a via presence of a second register 334a coupled thereto. In one embodiment, register 332a may be 11 bits wide, while register 334a may be six bits wide, although the scope of the present invention is not so limited.
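As a hypothetical software model of this datapath (reusing the averaging helper sketched earlier; the register names mirror the description above, but all widths and thresholds are illustrative):

    #define SAMPLE_LIFETIMES 32   /* preset threshold in first logic 325a */
    #define ACCUM_MASK 0x7FF      /* models an 11-bit accumulator (332a) */

    static uint16_t accum[MAX_THREADS];        /* models register 332a */
    static uint8_t  sample_count[MAX_THREADS]; /* models register 322a */
    static uint8_t  avg_accesses[MAX_THREADS]; /* models register 334a */

    /* invoked when a detector is deallocated and its access count is
     * steered, per thread, to the corresponding thread averager */
    static void thread_averager_update(uint8_t tid, uint32_t count)
    {
        accum[tid] = (uint16_t)((accum[tid] + count) & ACCUM_MASK);
        if (++sample_count[tid] == SAMPLE_LIFETIMES) {
            /* drop the 5 LSBs: 11-bit total -> 6-bit average (334a) */
            avg_accesses[tid] = average_accesses_per_lifetime(accum[tid]);
            accum[tid] = 0;                 /* reset accumulation value */
            sample_count[tid] = 0;          /* reset sample count */
        }
    }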
When the averaged value corresponding to average demand accesses per detector lifetime is obtained, the value may be provided to a second logic unit 335a. There, this average value may be compared to a threshold. This threshold may correspond to a level above which unconstrained prefetching may be allowed. In contrast, if the value is below the threshold, throttling of prefetching may be enabled. In various embodiments, the threshold may be empirically determined; for example, where detectors have a depth of 32 to 128 entries, this threshold may be between approximately 5 and 15, although the scope of the present invention is not so limited. Thus based on the average number of accesses, it may be determined whether detector-based prefetching will improve performance. If, for example, the average is sufficiently low, detector-based prefetching may not improve performance and thus may be throttled. Accordingly, a threshold value T between 1 and N may be set such that prefetching is throttled if the average is less than T, while prefetching may be enabled if the average is greater than T.
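Continuing the same illustrative model, the compare performed by second logic 335a may reduce to a single comparison; the threshold shown falls within the approximately 5 to 15 range noted above but is otherwise arbitrary:

    #define THROTTLE_THRESHOLD 8  /* illustrative; text suggests ~5 to 15 */

    /* models second logic 335a: set the thread's throttle policy from the
     * average demand accesses per detector lifetime */
    static void set_throttle_policy(uint8_t tid)
    {
        thread_throttle_policy[tid] = (avg_accesses[tid] < THROTTLE_THRESHOLD);
    }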
Accordingly, an output from second logic 335a may correspond to a prefetch throttling policy. Note that this throttle policy may be independently set and controlled for these different threads. If throttling is enabled (i.e., prefetching is throttled), the signal may be set or active, while if throttling is disabled, the signal may be disabled or logic low, in one implementation. As shown in
Because of transient or other behavior, a given allocated detector may see a relatively high level of demand accesses. Some applications exhibit behavior that causes a low overall number of average accesses punctuated by periods of relatively high demand accesses, and so an override mechanism may be present. If the number of demand accesses for an allocated detector 305 is greater than an override threshold, which may be stored in third logic 345, for example, a set throttle policy may be disabled and prefetching re-enabled for that detector, improving performance where prefetching may reduce latency. Thus, third logic unit 345 may enable prefetching decisions made in detector 305 to be output via prefetch output line 304. While described with this particular implementation in the embodiment of
Using embodiments of the present invention in a multi-threaded environment, prefetches may be throttled when they are less likely to be used. Specifically, threads in which a relatively high number of memory accesses per detector lifetime occur may perform prefetching, and such threads may benefit from it. However, in applications or threads in which a relatively low number of demand accesses per detector lifetime occur, prefetching may be throttled; in such threads or applications, prefetching may provide little benefit or may negatively impact performance. Furthermore, because demand accesses may be temporal in nature, override mechanisms may enable prefetching in a thread in which prefetching is throttled, to accommodate periods of relatively high demand accesses per detector lifetime.
Embodiments may implement thread prefetch throttling using a relatively small amount of hardware, which may be wholly contained within a prefetcher, reducing communication between different components. Furthermore, demand access detection and corresponding throttling may be performed on a thread-specific basis and may support heterogeneous workloads. Embodiments may be dynamically adaptive, quickly accommodating transient behavior and enabling prefetching when it can improve performance. Furthermore, by throttling prefetching in certain environments, power efficiency may be increased, as only a fraction of unconstrained prefetches may be issued. Such power reduction may improve performance in a portable or mobile system, which may often operate on battery power.
Embodiments may be implemented in many different system types. Referring now to
First processor 470 and second processor 480 may be coupled to a chipset 490 via P-P interconnects 452 and 454, respectively. As shown in
In turn, chipset 490 may be coupled to a first bus 416 via an interface 496. In one embodiment, first bus 416 may be a Peripheral Component Interconnect (PCI) bus, as defined by the PCI Local Bus Specification, Production Version, Revision 2.1, dated June 1995, or a bus such as the PCI Express bus or another third generation input/output (I/O) interconnect bus, although the scope of the present invention is not so limited.
As shown in
Embodiments may be implemented in code and may be stored on a storage medium having stored thereon instructions which can be used to program a system to perform the instructions. The storage medium may include, but is not limited to, any type of disk including floppy disks, optical disks, compact disk read-only memories (CD-ROMs), compact disk rewritables (CD-RWs), and magneto-optical disks, semiconductor devices such as read-only memories (ROMs), random access memories (RAMs) such as dynamic random access memories (DRAMs), static random access memories (SRAMs), erasable programmable read-only memories (EPROMs), flash memories, electrically erasable programmable read-only memories (EEPROMs), magnetic or optical cards, or any other type of media suitable for storing electronic instructions.
While the present invention has been described with respect to a limited number of embodiments, those skilled in the art will appreciate numerous modifications and variations therefrom. It is intended that the appended claims cover all such modifications and variations as fall within the true spirit and scope of this present invention.