Embodiments of the invention relate to microprocessors and microprocessor systems. More particularly, embodiments of the invention pertain to a technique to regulate prefetches of data from memory by a microprocessor.
In modern computing systems, data may be retrieved from memory and stored in a cache within or outside of a microprocessor ahead of when a microprocessor may execute an instruction that uses the data. This technique, known as “prefetching”, allows a processor to avoid latency associated with retrieving (“fetching”) data from a memory source, such as DRAM, by using a history (e.g., heuristic) of fetches of data from memory into respective cache lines to predict future ones.
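Purely for illustration, a history-based predictor of the kind described above can be sketched in C as a simple stride predictor; the names, types, and structure below are hypothetical and are not drawn from any particular implementation.

    #include <stdbool.h>
    #include <stdint.h>

    /* Hypothetical stride predictor: if the two most recent fetches in a
     * stream were separated by a constant stride, predict that the next
     * fetch continues the pattern and return a candidate prefetch address. */
    typedef struct {
        uint64_t last_addr; /* most recent fetch address                   */
        int64_t  stride;    /* delta between the two most recent fetches   */
        int      seen;      /* number of fetches observed (saturates at 2) */
    } stride_entry_t;

    /* Record a new fetch; returns true and sets *prefetch_addr when the
     * previously observed stride repeats. */
    bool predict_next(stride_entry_t *e, uint64_t addr, uint64_t *prefetch_addr)
    {
        int64_t delta  = (int64_t)(addr - e->last_addr);
        bool    repeat = (e->seen >= 2) && delta == e->stride;
        if (repeat)
            *prefetch_addr = addr + (uint64_t)e->stride;
        e->stride    = delta;
        e->last_addr = addr;
        if (e->seen < 2)
            e->seen++;
        return repeat;
    }

An entry would be zero-initialized and, in a hardware table, would typically be indexed by the address of the fetching instruction; the predictor warms up after its first two observed fetches.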
Excessive prefetching can result if prefetched data is never used by instructions executed by the processor for which the data is prefetched. This may arise, for example, from inaccurately predicted or ill-timed prefetches. An inaccurately predicted or ill-timed prefetch is a prefetch that brings in a line that is not used before the line is evicted from the cache by the normal allocation policies. Furthermore, in a multiple-processor system or multi-core processor, excessive prefetching can result in fetching data to one processor that is still being actively used by another processor or processor core. This can hinder the performance of the processor deprived of the data. Furthermore, the prefetching processor may receive no benefit from the data if the processor deprived of the data prefetches or uses the data again. Additionally, excessive prefetching can both cause and result from prefetched data being replaced by subsequent prefetches before the earlier prefetched data is used by an instruction.
Excessive prefetching can degrade system performance in several ways. For example, prefetching consumes bus resources and bandwidth between the processor and memory. Excessive prefetching can therefore increase bus traffic, and thereby the delay experienced by other instructions, with little or no benefit to data-fetching efficiency. Furthermore, because prefetched data may replace data already in a corresponding cache line, excessive prefetching can cause useful data to be replaced in a cache by data that may be used less or, in some cases, not at all. Finally, excessive prefetching can cause a premature transfer of ownership of prefetched cache lines among a number of processors or processor cores that may share the cache lines, by forcing a processor or processor core to give up its exclusive ownership of cache lines before it has performed data updates to them.
Embodiments of the invention are illustrated by way of example and not limitation in the figures of the accompanying drawings, in which like references indicate similar elements.
Embodiments of the invention relate to microprocessors and microprocessor systems. More particularly, embodiments of the invention relate to using memory attribute bits to modify the amount of prefetching performed by a processor.
In one embodiment of the invention, cache lines filled with prefetched data may be marked as having been filled by a prefetch. In one embodiment, such cache lines have this attribute cleared when the line is accessed by a normal memory operation. This enables the system to be aware of which cache lines have been prefetched but not yet used by an instruction. In one embodiment, memory attributes associated with a particular segment, or “block”, of memory may be used to indicate various properties of the memory block, including whether data stored in the memory block has been prefetched and not yet used, has been prefetched and subsequently used by an instruction, or was not brought in by a prefetch at all.
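The lifecycle of such a prefetch attribute can be illustrated with a minimal C sketch, reusing the headers from the sketch above; the structure and function names are hypothetical.

    /* Hypothetical per-line metadata: the attribute is set when the line is
     * filled by a prefetch and cleared when a normal memory operation
     * accesses the line. */
    typedef struct {
        uint64_t tag;               /* identifies the cached block            */
        bool     prefetched_unused; /* prefetched, not yet used by an insn    */
    } line_meta_t;

    void on_prefetch_fill(line_meta_t *line, uint64_t tag)
    {
        line->tag = tag;
        line->prefetched_unused = true;  /* mark: filled by a prefetch */
    }

    void on_demand_access(line_meta_t *line)
    {
        line->prefetched_unused = false; /* a normal load/store used the line */
    }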
If a prefetched cache line is evicted or invalidated without being used by an instruction, then, in one embodiment, a fault-like yield may result in one or more architecturally-programmed scenarios being performed. Fault-like yields can be used to invoke software routines within the program being performed to adjust the policies for prefetching the data that caused the fault-like yield. In another embodiment, the prefetching hardware may track the number of prefetched lines that are evicted or invalidated before being used, in order to dynamically adjust the prefetching policies without the program's intervention. By monitoring the prefetching of unused data and adapting to excessive prefetching, at least one embodiment allows prefetching to be dynamically adjusted to improve efficiency, reduce useless bus traffic, and help prevent premature eviction or invalidation of cache line data.
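As a sketch only, and building on the line_meta_t type above, these two embodiments might be modeled as follows; deliver_fault_like_yield is a hypothetical stand-in for whatever yield mechanism an implementation provides.

    /* Hypothetical software entry point for a fault-like yield. */
    void deliver_fault_like_yield(const line_meta_t *line)
    {
        (void)line; /* stub: a real system would enter a registered handler */
    }

    static unsigned unused_prefetch_evictions; /* hardware-style counter */

    /* Called when a line leaves the cache by eviction or invalidation. */
    void on_evict_or_invalidate(const line_meta_t *line, bool software_yield)
    {
        if (!line->prefetched_unused)
            return;                          /* the line was used; nothing to do   */
        if (software_yield)
            deliver_fault_like_yield(line);  /* let software adjust the policies   */
        else
            unused_prefetch_evictions++;     /* hardware tracks and adjusts itself */
    }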
In one embodiment, each block of memory may correspond to a particular line of cache, such as a line of cache within a level one (L1) or level two (L2) cache memory, and prefetch attributes may be represented with bit storage locations located within or otherwise associated with a line of cache memory. In other embodiments, a block of memory for which prefetch attributes may be associated may include more than one cache memory line or may be associated with another type of memory, such as DRAM.
In the embodiment illustrated in
In addition to the attribute bits, each line of cache may also have associated therewith a state value stored in state storage location 120. For example, in one embodiment the state storage location 120 contains a state bit vector, or a state field, 125 associated with cache line 105 which designates whether the cache line is in a modified state (M), exclusively owned state (E), shared state (S), or invalid state (I). The MESI states can control whether various software threads, cores, or processors can use and/or modify information stored in the particular cache line. In some embodiments the MESI state attribute is included in the attribute bits 115 for cache line 105.
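By way of illustration, the four MESI states and the ownership check they imply might be written as follows, with hypothetical names.

    /* Illustrative MESI state, stored in state field 125 or, in some
     * embodiments, among the attribute bits 115. */
    typedef enum { MESI_M, MESI_E, MESI_S, MESI_I } mesi_t;

    /* A core may write a line it holds in M or E; an S line must first be
     * upgraded to exclusive ownership, and an I line must be re-fetched. */
    bool may_modify_locally(mesi_t s)
    {
        return s == MESI_M || s == MESI_E;
    }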
Prefetches may be initiated by hardware mechanisms that predict which lines to prefetch (and that may be guided in their prediction by software), by software directives in the form of prefetch instructions, or by arbitrary combinations of hardware mechanisms and software directives. Prefetching can be controlled by changing the hardware mechanisms that predict which lines to prefetch. Prefetching can also be controlled by adding a heuristic for which lines not to prefetch when either a hardware prefetch predictor or a software prefetch directive indicates that a prefetch could potentially be done. Policies on prefetching and the filtering of prefetches can be handled either for all prefetches or separately for each prefetch, based on the address range within which the prefetched addresses fall or on the part of the program an application is executing. The controls for prefetching will be specific to a given implementation and can optionally be made architecturally visible as a set of machine registers.
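One hypothetical shape for such architecturally visible controls, sketched as a C struct rather than actual machine registers:

    /* Hypothetical prefetch controls: a global policy plus a small set of
     * per-address-range overrides, as the paragraph above contemplates. */
    typedef struct {
        uint64_t base, limit;    /* half-open address range [base, limit) */
        uint8_t  aggressiveness; /* 0 = never prefetch ... 255 = maximum  */
    } range_policy_t;

    typedef struct {
        uint8_t        global_aggressiveness;
        unsigned       num_ranges;
        range_policy_t range[4]; /* illustrative fixed capacity */
    } prefetch_ctrl_t;

    /* Policy lookup for a candidate prefetch address. */
    uint8_t policy_for(const prefetch_ctrl_t *c, uint64_t addr)
    {
        for (unsigned i = 0; i < c->num_ranges; i++)
            if (addr >= c->range[i].base && addr < c->range[i].limit)
                return c->range[i].aggressiveness;
        return c->global_aggressiveness;
    }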
For example, in one embodiment of this invention, the eviction or invalidation of a prefetched cache line that has not yet been used may result in a change of the policies for which future lines should be prefetched. In other embodiments, a number (“n”) of unused prefetches (indicated by evictions of prefetched cache lines, for example) and/or a number (“m”) of invalidations or evictions of prefetched cache lines may cause the prefetching algorithm to be modified to reduce the number of cache lines prefetched until the attribute bits and the cache line states indicate that the prefetched cache lines are being used by instructions more frequently.
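Using the prefetch_ctrl_t sketch above, the n/m thresholds might drive an adjustment like the following; the thresholds and the single-step decrement are arbitrary choices made for illustration.

    /* Lower aggressiveness once n unused-prefetch evictions or m
     * invalidations of prefetched lines have been observed. */
    void maybe_throttle(prefetch_ctrl_t *c,
                        unsigned unused_evictions, unsigned n,
                        unsigned invalidations,    unsigned m)
    {
        if ((unused_evictions >= n || invalidations >= m) &&
            c->global_aggressiveness > 0)
            c->global_aggressiveness--; /* prefetch fewer lines ahead */
    }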
In one embodiment of the invention, attributes associated with a block of memory may be accessed, modified, and otherwise controlled by specific operations, such as an instruction or micro-operation decoded from an instruction. For example, in one embodiment an instruction that both loads information from a cache line and sets the corresponding attribute bits (e.g., a “load_set” instruction) may be used. In other embodiments, an instruction that loads information from a cache line and checks the corresponding attribute bits (e.g., a “load_check” instruction) may be used in addition to, or instead of, a load_set instruction.
In one embodiment, an instruction may be used that specifically prefetches data from memory to a cache line and sets a corresponding attribute bit to indicate that the data has yet to be used by an instruction. In other embodiments, it may be implicit that all prefetches performed by software have attribute bits set for prefetched cache lines. In still other embodiments, prefetches performed by hardware prefetch mechanisms may have attributes set for prefetched cache lines.
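For illustration only, the semantics of these hypothetical operations can be modeled as C functions over the line_meta_t sketch above; in a real implementation each would be an instruction or micro-operation, not a function call, and the attribute shown is the prefetch bit, though load_set and load_check could govern other attribute bits.

    /* load_set: load a value and set the attribute in one operation. */
    uint64_t load_set(line_meta_t *line, const uint64_t *data)
    {
        line->prefetched_unused = true;  /* set while loading */
        return *data;
    }

    /* load_check: load a value and report the attribute alongside it. */
    uint64_t load_check(line_meta_t *line, const uint64_t *data, bool *attr)
    {
        *attr = line->prefetched_unused; /* check while loading */
        return *data;
    }

    /* prefetch_set: prefetch a line and mark it as not yet used. */
    void prefetch_set(line_meta_t *line, uint64_t tag)
    {
        on_prefetch_fill(line, tag);
    }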
If the attribute bits or the cache line state are checked via, for example, a load_check instruction, one or more architectural scenarios within one or more processing cores may be defined to perform certain events based on the attributes that are checked. For example, in one embodiment, an architectural scenario may be defined to compare the attribute bits to a particular set of data and invoke a light-weight yield event based on the outcome of the compare. The light-weight yield may, among other things, call a service routine which performs various operations in response to the scenario outcome before returning control to a thread or other process running in the system. In another embodiment, a flag or register may be set to indicate the result. In still another embodiment, a register may be written with a particular value. Other events may be performed in response to the attribute check as appropriate.
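A minimal sketch of such a scenario and its light-weight yield, with hypothetical names:

    typedef void (*yield_handler_t)(void);

    /* Hypothetical programmed scenario: compare the checked attribute
     * against a trigger condition and invoke the handler on a match. */
    typedef struct {
        bool            trigger_on_unused; /* fire when prefetched-and-unused */
        yield_handler_t handler;           /* service routine for the yield   */
    } scenario_t;

    void check_scenario(const scenario_t *s, const line_meta_t *line)
    {
        if (s->handler && line->prefetched_unused == s->trigger_on_unused)
            s->handler(); /* light-weight yield: run the routine, then resume */
    }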
For example, one scenario that may be defined is one that invokes a light-weight yield and corresponding handler upon detecting n evictions of prefetched-and-unused cache lines and/or m invalidations of prefetched-and-unused cache lines (indicated by the MESI states, in one embodiment), where m and n may be the same or different values. Such an architecturally defined scenario may be useful for adjusting the prefetching algorithm to more closely correspond to the usage of specific prefetched data from memory.
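Continuing the sketches above, such a scenario might count events up to its thresholds before yielding; the counters would be incremented by eviction and invalidation paths like on_evict_or_invalidate above, and n and m are arbitrary here.

    /* Yield once either threshold is reached, then restart the counts. */
    void scenario_event(const scenario_t *s,
                        unsigned *evictions,     unsigned n,
                        unsigned *invalidations, unsigned m)
    {
        if ((*evictions >= n || *invalidations >= m) && s->handler) {
            s->handler();       /* light-weight yield into the handler */
            *evictions = 0;
            *invalidations = 0; /* begin a fresh observation window    */
        }
    }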
a illustrates the use of attribute bits and cache line states to cause a fault-like yield, which can adjust the prefetching of data, according to one embodiment. In
In one embodiment, the MLI scenario may invoke a handler that may cause a software routine to be called to adjust prefetching algorithms for all prefetches or only for a subset of prefetches associated with a specific range of data or a specific region of a program. Various algorithms in various embodiments may be used to adjust prefetching. In one embodiment hardware logic may be used to implement the prefetch adjustment algorithm, whereas in other embodiments some combination of software and logic may be used. The particular algorithm used to adjust the prefetching of data in response to the attribute bits and state variables may be chosen arbitrarily in embodiments of the invention.
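One hypothetical handler body, again using the prefetch_ctrl_t sketch from above, that narrows the adjustment to a single address range:

    /* Reduce prefetch aggressiveness only for [base, limit), adding an
     * override one step below the global policy if none exists yet. */
    void adjust_range(prefetch_ctrl_t *c, uint64_t base, uint64_t limit)
    {
        for (unsigned i = 0; i < c->num_ranges; i++) {
            if (c->range[i].base == base && c->range[i].limit == limit) {
                if (c->range[i].aggressiveness > 0)
                    c->range[i].aggressiveness--;
                return;
            }
        }
        if (c->num_ranges < 4) { /* capacity from the earlier sketch */
            c->range[c->num_ranges].base  = base;
            c->range[c->num_ranges].limit = limit;
            c->range[c->num_ranges].aggressiveness =
                c->global_aggressiveness > 0 ? c->global_aggressiveness - 1 : 0;
            c->num_ranges++;
        }
    }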
b is a flow diagram illustrating the operation of at least one embodiment of the invention in which a prefetch_set instruction and a cache line state variable are used to set prefetch attribute bits associated with a particular cache line in order to dynamically adjust the prefetching of the data to correspond to its usefulness. In other embodiments, other instructions may be used to perform the operations illustrated in
Prefetching may be performed in a variety of ways. For example, in one embodiment, prefetching is performed by executing an instruction (e.g., a “prefetch_set” instruction), as described above (“software” prefetching or “explicit” prefetching). In other embodiments, prefetching may be performed by hardware logic (“hardware” prefetching or “implicit” prefetching). In one embodiment, hardware prefetching may be performed by configuring prefetch logic (via a software utility program, for example) to set an attribute bit for each prefetched cache line to indicate that the prefetched data within the cache line has not been used. In some embodiments, control information associated with the prefetch logic may be configured to determine which attribute bit(s) is/are to be used for the purpose of indicating whether prefetched data has been used.
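A sketch of that configuration step, with the per-line attributes modeled as a small bit vector and all names hypothetical:

    /* Per-line attribute bits (cf. attribute bits 115 above), modeled as a
     * byte-wide vector for illustration. */
    typedef struct {
        uint8_t attr_bits;
    } line_attrs_t;

    /* Control information for the hardware prefetcher: which attribute
     * bit(s) to set on each prefetch fill. A software utility might
     * program this once at startup. */
    typedef struct {
        uint8_t set_on_fill_mask;
    } hw_prefetch_cfg_t;

    void hw_prefetch_fill(const hw_prefetch_cfg_t *cfg, line_attrs_t *attrs)
    {
        attrs->attr_bits |= cfg->set_on_fill_mask; /* implicit marking by hardware */
    }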
Illustrated within the processor of
The main memory may be implemented in various memory sources, such as dynamic random-access memory (DRAM), a hard disk drive (HDD) 420, or a memory source located remotely from the computer system via network interface 430 and containing various storage devices and technologies. The cache memory may be located either within the processor or in close proximity to the processor, such as on the processor's local bus 407.
Furthermore, the cache memory may contain relatively fast memory cells, such as a six-transistor (6T) cell, or other memory cell of approximately equal or faster access speed. The computer system of
The system of
Other embodiments of the invention, however, may exist in other circuits, logic units, or devices within the system of
Embodiments of the invention described herein may be implemented with circuits using complementary metal-oxide-semiconductor devices, or “hardware”, or using a set of instructions stored in a medium that when executed by a machine, such as a processor, perform operations associated with embodiments of the invention, or “software”. Alternatively, embodiments of the invention may be implemented using a combination of hardware and software.
While the invention has been described with reference to illustrative embodiments, this description is not intended to be construed in a limiting sense. Various modifications of the illustrative embodiments, as well as other embodiments, which are apparent to persons skilled in the art to which the invention pertains are deemed to lie within the spirit and scope of the invention.