1. Field
The disclosed embodiments generally relate to techniques for improving performance in computer systems. More specifically, the disclosed embodiments relate to the design of a processor that includes a mechanism to filter out redundant software prefetch instructions, which access cache lines that have already been fetched from memory.
2. Related Art
As the gap between processor speed and memory performance continues to grow, prefetching is becoming an increasingly important technique for improving computer system performance. Prefetching involves pulling cache lines from memory and placing them into a cache before the cache lines are actually accessed by an application. This prevents the application from having to wait for a cache line to be retrieved from memory and thereby improves computer system performance.
Computer systems generally make use of two types of prefetching, software-controlled prefetching (referred to as “software prefetching”) and hardware-controlled prefetching (referred to as “hardware prefetching”). To support software prefetching, a compiler analyzes the data access patterns of an application at compile time and inserts software prefetch instructions into the executable code to prefetch cache lines before they are needed. In contrast, a hardware prefetcher operates by analyzing the actual data access patterns of an application at run time to predict which cache lines will be accessed in the near future, and then causes the processor to prefetch these cache lines.
Many software prefetch instructions are redundant because a processor's hardware prefetchers are often able to eliminate the same cache misses. Note that redundant prefetches can reduce processor performance because they consume processor resources, such as execution pipeline stages and load-store unit bandwidth, without performing useful work. However, blindly filtering out all software prefetches and disabling all hardware prefetchers both degrade performance, because some cache misses can be eliminated only by software prefetches and others only by hardware prefetchers.
Hence, it is desirable to be able to selectively eliminate redundant software prefetches without eliminating valid software prefetches.
The disclosed embodiments relate to a system that selectively filters out redundant software prefetch instructions during execution of a program on a processor. During execution of the program, the system collects information associated with hit rates for individual software prefetch instructions as the individual software prefetch instructions are executed, wherein a software prefetch instruction is redundant if the software prefetch instruction accesses a cache line that has already been fetched from memory. As software prefetch instructions are encountered during execution of the program, the system selectively filters out individual software prefetch instructions that are likely to be redundant based on the collected information. In this way, software prefetch instructions that are likely to be redundant are not executed by the processor.
In some embodiments, while selectively filtering out individual software prefetch instructions, the system enables filtering operations when a utilization rate of a load-store unit in the processor exceeds a threshold.
In some embodiments, the system periodically determines the utilization rate for the load-store unit by determining how many loads, stores and prefetches are processed by the processor within a given time interval.
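The interval-based utilization measurement and the enable threshold described above can be sketched as follows. This is a minimal illustration only: the class name, the `ops_per_cycle` capacity parameter, and the numeric values are assumptions made for the example and do not appear in the disclosure.

```python
class LsuUtilizationMonitor:
    """Counts load-store unit operations per interval and gates filtering."""

    def __init__(self, interval_cycles, ops_per_cycle, enable_threshold):
        self.interval_cycles = interval_cycles    # sampling window length (assumed)
        self.ops_per_cycle = ops_per_cycle        # peak LSU throughput (assumed)
        self.enable_threshold = enable_threshold  # utilization above which filtering is on
        self.op_count = 0
        self.filtering_enabled = False

    def record_op(self):
        """Call once for each load, store, or prefetch that is processed."""
        self.op_count += 1

    def end_interval(self):
        """Close the window: compute utilization, update the enable bit, reset the count."""
        capacity = self.interval_cycles * self.ops_per_cycle
        utilization = self.op_count / capacity
        self.filtering_enabled = utilization > self.enable_threshold
        self.op_count = 0
        return utilization


# Hypothetical numbers: 1500 operations in a window that could hold 2000.
monitor = LsuUtilizationMonitor(interval_cycles=1000, ops_per_cycle=2,
                                enable_threshold=0.5)
for _ in range(1500):
    monitor.record_op()
busy = monitor.end_interval()
```

In this sketch the enable bit is recomputed once per interval, so filtering stays on or off for a whole window rather than toggling on every instruction.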
In some embodiments, while collecting the information associated with hit rates, the system uses one or more counters associated with each software prefetch instruction to keep track of cache hits and cache misses for the software prefetch instruction.
In some embodiments, upon decoding the software prefetch instruction at a decode unit in the processor, the system performs a lookup for the software prefetch instruction in a filter table, wherein the filter table includes entries for software prefetch instructions that are to be filtered out. If the lookup finds an entry for the software prefetch instruction, the system filters out the software prefetch instruction so that the software prefetch instruction is not executed. If the lookup does not find an entry for the software prefetch instruction, the system allows the software prefetch instruction to execute.
In some embodiments, upon encountering a software prefetch instruction at a load-store unit in the processor, the system performs a lookup for the software prefetch instruction in a learning table, wherein the learning table includes entries for software prefetch instructions that are executed by the program. If an entry does not exist for the software prefetch instruction in the learning table, the system allocates and initializes an entry for the software prefetch instruction in the learning table. The system also determines whether executing the software prefetch instruction causes a cache hit or a cache miss. Next, the system updates information in the entry for the software prefetch instruction based on the determination. If the updated information indicates that the software prefetch instruction is likely to be redundant, the system creates an entry in the filter table for the software prefetch instruction, if an entry does not already exist. On the other hand, if the updated information indicates that the software prefetch instruction is unlikely to be redundant, the system invalidates an entry in the filter table for the software prefetch instruction if such an entry exists.
In some embodiments, while selectively filtering out the individual software prefetch instructions, the system adjusts a hit-rate threshold for the filtering technique based on a utilization rate for the load-store unit, wherein the hit-rate threshold becomes higher as the utilization rate of the load-store unit increases, and becomes lower as the utilization rate of the load-store unit decreases.
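One possible way to realize a threshold that rises and falls with load-store unit utilization is a clamped linear mapping, sketched below. The linear shape, the function name, and the endpoint values `low` and `high` are assumptions chosen for illustration; the description only requires that the threshold move in the same direction as the utilization rate.

```python
def hit_rate_threshold(utilization, low=0.50, high=0.95):
    """Map LSU utilization in [0, 1] to a hit-rate threshold in [low, high]."""
    u = min(max(utilization, 0.0), 1.0)  # clamp out-of-range inputs
    return low + (high - low) * u        # higher utilization gives a higher threshold


idle_threshold = hit_rate_threshold(0.1)
busy_threshold = hit_rate_threshold(0.9)
```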
The following description is presented to enable any person skilled in the art to make and use the present embodiments, and is provided in the context of a particular application and its requirements. Various modifications to the disclosed embodiments will be readily apparent to those skilled in the art, and the general principles defined herein may be applied to other embodiments and applications without departing from the spirit and scope of the present embodiments. Thus, the present embodiments are not limited to the embodiments shown, but are to be accorded the widest scope consistent with the principles and features disclosed herein.
The data structures and code described in this detailed description are typically stored on a computer-readable storage medium, which may be any device or medium that can store code and/or data for use by a computer system. The computer-readable storage medium includes, but is not limited to, volatile memory, non-volatile memory, magnetic and optical storage devices such as disk drives, magnetic tape, CDs (compact discs), DVDs (digital versatile discs or digital video discs), or other media capable of storing computer-readable media now known or later developed.
The methods and processes described in the detailed description section can be embodied as code and/or data, which can be stored in a computer-readable storage medium as described above. When a computer system reads and executes the code and/or data stored on the computer-readable storage medium, the computer system performs the methods and processes embodied as data structures and code and stored within the computer-readable storage medium. Furthermore, the methods and processes described below can be included in hardware modules. For example, the hardware modules can include, but are not limited to, application-specific integrated circuit (ASIC) chips, field-programmable gate arrays (FPGAs), and other programmable-logic devices now known or later developed. When the hardware modules are activated, the hardware modules perform the methods and processes included within the hardware modules.
As mentioned above, the disclosed embodiments relate to a technique for selectively filtering out individual software prefetch instructions that are likely to be redundant based on the collected information, so that likely redundant software prefetch instructions are not executed by the processor.
Before we describe how this technique operates, we first describe the structure of a processor that implements this technique.
Processor 100 includes a number of components which are illustrated in
Processor 100 also includes two hardware structures that are used to facilitate selectively filtering software prefetch instructions, including learning table 130 and filter table 132.
Referring to
Referring to
When a software prefetch instruction is executed at load-store unit 120 in
After step 205 or step 206, the system performs a lookup for the prefetch instruction in data cache 122 (step 208). This lookup causes either a cache hit or a cache miss. If the lookup causes a cache hit (or hits in the load miss buffer 125), the system increments REDUNDANT_CT (step 210). The system then determines whether REDUNDANT_CT exceeds a maximum value RMAX (step 212). If not, the process is complete. Otherwise, if REDUNDANT_CT>RMAX, the system takes this as an indication that software prefetch instructions located at the same PC are likely to be redundant. In this case, the system performs a lookup for the software prefetch instruction in filter table 132 (step 214). If the lookup generates a filter table miss, the system allocates a filter table entry 153 for the software prefetch instruction (step 216). After a filter table hit at step 214, or after the allocation at step 216, the system sets the hit count HIT_CT 155 for the filter table entry 153 to an initial value H_INIT_VAL (which, for example, can be zero) (step 218). At this point, the process is complete.
If the lookup in step 208 causes a cache miss, the system decrements REDUNDANT_CT (step 220). The system then determines whether REDUNDANT_CT falls below a minimum value RMIN (step 222). If not, the process is complete. Otherwise, if REDUNDANT_CT<RMIN, the system takes this as an indication that prefetch instructions from the same PC are not likely to be redundant. In this case, the system performs a lookup for the software prefetch instruction in filter table 132 (step 224). If the lookup in filter table 132 causes a filter table miss, the process is complete. Otherwise, if the lookup in filter table 132 causes a filter table hit, the system invalidates the filter table entry (step 226). At this point, the process is complete.
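The learning flow of steps 208 through 226 can be modeled in software as shown below. This is a sketch under simplifying assumptions: the two tables are unbounded Python dictionaries keyed by PC rather than fixed-size hardware structures, the counter is not saturated, and the numeric values for R_INIT_VAL, RMAX, RMIN and H_INIT_VAL are illustrative only. The function name `learn` is hypothetical.

```python
R_INIT_VAL, RMAX, RMIN, H_INIT_VAL = 0, 3, -3, 0  # illustrative values only

learning_table = {}  # PC -> REDUNDANT_CT
filter_table = {}    # PC -> HIT_CT


def learn(pc, cache_hit):
    """Update both tables after the software prefetch at `pc` probes the data cache."""
    # Allocate and initialize a learning table entry on first sight.
    count = learning_table.setdefault(pc, R_INIT_VAL)
    if cache_hit:
        count += 1                         # step 210
        if count > RMAX:                   # step 212
            # Steps 214-218: allocate on a filter table miss, then (re)set HIT_CT.
            filter_table[pc] = H_INIT_VAL
    else:
        count -= 1                         # step 220
        if count < RMIN:                   # step 222
            # Steps 224-226: invalidate the filter table entry if one exists.
            filter_table.pop(pc, None)
    learning_table[pc] = count


# Four consecutive cache hits push REDUNDANT_CT past RMAX.
for _ in range(4):
    learn(0x400, cache_hit=True)
```

After the four hits in the usage lines, the PC 0x400 has an entry in `filter_table`; a sufficiently long run of misses would remove it again.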
On the other hand, if software prefetch instruction filtering is enabled at step 304, the system looks up the software prefetch instruction in filter table 132 (step 306). If this lookup generates a filter table miss, the software prefetch instruction is not subject to filtering and the process is complete. Otherwise, if the filter table lookup generates a hit, this indicates that the software prefetch instruction is subject to filtering. In this case, the system drops the software prefetch instruction at decode unit 106, increments the HIT_CT 155 in the corresponding entry in filter table 132 and updates LRU information 157 (step 308). Note that dropping the software prefetch instruction conserves processor resources, such as pick queue entries, reorder buffer entries, and load-store unit bandwidth.
Next, the system determines whether HIT_CT exceeds a maximum value HMAX (step 310). If not, the process is complete. Otherwise, if HIT_CT>HMAX, the system invalidates the corresponding filter table entry 153 (step 312). The system also performs a lookup for the software prefetch instruction in learning table 130 (step 314). If the learning table lookup causes a miss, the process is complete. Otherwise, if the learning table lookup causes a hit, the system reinitializes the REDUNDANT_CT in the learning table entry, which involves setting REDUNDANT_CT to R_INIT_VAL (step 316). By invalidating the filter table entry periodically in this manner, the system enables re-learning to take place. This prevents a software prefetch instruction from being continually filtered even though its most recent instances are actually not redundant.
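The decode-time flow of steps 304 through 316, including the periodic invalidation that allows re-learning, can be sketched as follows. The function name `decode_prefetch`, the dictionary-based tables, and the values of HMAX and R_INIT_VAL are assumptions for illustration; the LRU update mentioned in step 308 is omitted for brevity.

```python
R_INIT_VAL, HMAX = 0, 2  # illustrative values only


def decode_prefetch(pc, filter_table, learning_table, filtering_enabled=True):
    """Return True if the prefetch at `pc` is dropped at decode, False if it executes."""
    if not filtering_enabled or pc not in filter_table:  # steps 304-306
        return False                                     # not subject to filtering
    filter_table[pc] += 1                                # step 308: drop, bump HIT_CT
    if filter_table[pc] > HMAX:                          # step 310
        del filter_table[pc]                             # step 312: invalidate entry
        if pc in learning_table:                         # step 314
            learning_table[pc] = R_INIT_VAL              # step 316: restart learning
    return True


# Hypothetical state: PC 0x400 is currently filtered with HIT_CT = 0.
filters = {0x400: 0}
learned = {0x400: 7}
results = [decode_prefetch(0x400, filters, learned) for _ in range(4)]
```

With HMAX set to 2, the first three instances are dropped, the third drop invalidates the filter table entry and resets REDUNDANT_CT, and the fourth instance is allowed to execute so that the learning table can observe it again.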
Note that the values of R_INIT_VAL, RMAX, RMIN, H_INIT_VAL and HMAX may either be hardwired constants or be programmed by firmware.
The foregoing descriptions of embodiments have been presented for purposes of illustration and description only. They are not intended to be exhaustive or to limit the present description to the forms disclosed. Accordingly, many modifications and variations will be apparent to practitioners skilled in the art. Additionally, the above disclosure is not intended to limit the present description. The scope of the present description is defined by the appended claims.