1. Technical Field
This disclosure relates to processors and, more particularly, to prefetch mechanisms within the processors.
2. Description of the Related Art
Computer system processor performance is closely related to cache memory system performance in many systems. As processor technology has advanced and the demand for performance has increased, the number and capacity of cache memories have grown accordingly. Some processors may have a single cache or a single level of cache memory, while others may have multiple levels of caches. Cache memories may be defined by levels, based on their proximity to execution units of a processor core. For example, a level one (L1) cache may be the closest cache to the execution unit(s), a level two (L2) cache may be the second closest to the execution unit(s), and a level three (L3) cache may be the third closest to the execution unit(s).
Data may typically be loaded into a cache memory responsive to a cache miss. A cache miss occurs when requested data is not found in the cache. Cache misses are undesirable, as the performance penalty associated with a cache miss can be significant. Accordingly, some processors employ one or more prefetch units. A prefetch unit may analyze data access patterns in order to predict where in memory future accesses will occur. Based on these predictions, the prefetch unit may then retrieve data from the memory and store it into the cache before it is requested. Thus, prefetch units may prefetch a predefined number of cache lines ahead of the cache line currently being referenced. When tracking a data stream, prefetch units typically use the physical address for memory accesses to avoid accessing the translation look-aside buffer and to bypass the cache-access logic when prefetch requests are made. Conventional prefetch units, however, are typically limited to generating prefetch requests within the smallest page size supported by the system. Accordingly, in such conventional systems, data streams may be lost at the boundary of the smallest supported page size, even when larger page sizes are enabled.
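For illustration only, the following C sketch captures the conventional limitation described above; the 4 KB value and the function name are assumptions made for the sketch rather than details of any particular prefetcher.

```c
#include <stdbool.h>
#include <stdint.h>

/* Smallest page size supported by the system, e.g., 4 KB (assumed for the sketch). */
#define MIN_PAGE_SIZE 4096ull

/* A conventional prefetcher allows the next prefetch only while it stays within
 * the minimum-size page containing the current address, even when the stream is
 * actually mapped by a much larger page, so the stream is dropped at every
 * minimum-size page boundary. */
static bool conventional_prefetch_allowed(uint64_t cur_addr, uint64_t next_addr)
{
    return (cur_addr / MIN_PAGE_SIZE) == (next_addr / MIN_PAGE_SIZE);
}
```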
Various embodiments of a processor including a page size aware prefetch unit are disclosed. In one embodiment, the prefetch unit includes a storage. The storage includes a number of entries, and each entry corresponds to a different prefetch data stream. Each entry may be configured to store information corresponding to a page size of the prefetch data stream, along with, for example, an address corresponding to the prefetch data stream. For each entry, the prefetch unit may be configured to determine, based upon the page size information, whether a prefetch of data in the data stream will cross a page boundary associated with the data stream.
In one specific implementation, in response to determining that the prefetch of data will cross the page boundary, the prefetch unit may be configured to inhibit prefetching the data.
In another specific implementation, the prefetch unit may be configured to receive the page size information with each address, and the address may be associated with a cache miss. In addition, for each entry having an active data stream, the prefetch unit may be configured to generate a prefetch address based upon the received address. Further, the prefetch unit may include compare logic configured to compare the prefetch address for a given data stream to an address corresponding to the page boundary defined by the page size information for the given data stream.
Specific embodiments are shown by way of example in the drawings and will herein be described in detail. It should be understood, however, that the drawings and detailed description are not intended to limit the claims to the particular embodiments disclosed, even where only a single embodiment is described with respect to a particular feature. On the contrary, the intention is to cover all modifications, equivalents and alternatives that would be apparent to a person skilled in the art having the benefit of this disclosure. Examples of features provided in the disclosure are intended to be illustrative rather than restrictive unless stated otherwise.
As used throughout this application, the word “may” is used in a permissive sense (i.e., meaning having the potential to), rather than the mandatory sense (i.e., meaning must). Similarly, the words “include,” “including,” and “includes” mean including, but not limited to.
Various units, circuits, or other components may be described as “configured to” perform a task or tasks. In such contexts, “configured to” is a broad recitation of structure generally meaning “having circuitry that” performs the task or tasks during operation. As such, the unit/circuit/component can be configured to perform the task even when the unit/circuit/component is not currently on. In general, the circuitry that forms the structure corresponding to “configured to” may include hardware circuits. Similarly, various units/circuits/components may be described as performing a task or tasks, for convenience in the description. Such descriptions should be interpreted as including the phrase “configured to.” Reciting a unit/circuit/component that is configured to perform one or more tasks is expressly intended not to invoke 35 U.S.C. §112, paragraph six, interpretation for that unit/circuit/component.
Turning now to
As described further below in conjunction with the description of
As shown in
In one embodiment, memory controller 18 may receive memory requests conveyed from north bridge 12. Data accessed from memory 6 responsive to a read request (including prefetches) may be conveyed by memory controller 18 to the requesting agent via north bridge 12. Responsive to a write request, memory controller 18 may receive both the request and the data to be written from the requesting agent via north bridge 12. If multiple memory access requests are pending at a given time, memory controller 18 may arbitrate between these requests. It is noted that in some embodiments, memory controller 18 may be part of the north bridge 12.
In various embodiments, the memory 6 may be implemented as a plurality of memory modules. As such, each of the memory modules may include one or more memory devices (e.g., memory chips) mounted thereon. In another embodiment, the memory 6 may include one or more memory devices mounted on a motherboard or other carrier upon which processing node 2 may also be mounted. In yet another embodiment, at least a portion of memory 6 may be implemented on the die of processing node 2 itself. Embodiments having a combination of the various implementations described above are also possible and contemplated. The devices may be implemented using any of a variety of random access memories (RAM). Thus, memory 6 may include memory devices in the static RAM (SRAM) family or in the dynamic RAM (DRAM) family. For example, memory 6 may be implemented using (but is not limited to) double data rate (DDR) DRAM, DDR2 DRAM, DDR3 DRAM, and so forth.
Referring to
In one embodiment, the dispatch unit 104 may be configured to receive instructions from the instruction cache 106 and to dispatch operations to the scheduler(s) 118. One or more of the schedulers 118 may be coupled to receive dispatched operations from the dispatch unit 104 and to issue operations to the one or more execution unit(s) 124. The execution unit(s) 124 may include one or more integer units, and one or more floating point units (both not shown). At least one load-store unit 126 is also included among the execution units 124 in the embodiment shown. Results generated by the execution unit(s) 124 may be output to the result bus 130 (a single result bus is shown here for clarity, although multiple result buses are possible and contemplated). These results may be used as operand values for subsequently issued instructions and/or stored to the register file 116. In one embodiment, the processor core may support out of order execution. The retire queue 102 may be configured to determine when each issued operation may be retired. The execution units 124 are configured to execute instructions stored in a system memory (e.g., memory 6 of
It is noted that the processor core 11 may also include many other components that have been omitted here for simplicity. For example, the processor core 11 may include a branch prediction unit (not shown) that may predict branches in executing instruction threads and a translation lookaside buffer (TLB) that may translate virtual addresses to physical addresses used for accessing memory 6. In some embodiments (e.g., if implemented as a stand-alone processor), processor core 11 may also include a memory controller configured to control reads and writes with respect to memory 6.
The L1 instruction cache 106 may store instructions for fetch by the dispatch unit 104. Instruction code may be provided to the instruction cache 106 for storage by prefetching code from the memory 6 through the prefetch unit 108. Instruction cache 106 may be implemented in various configurations (e.g., set-associative, fully-associative, or direct-mapped).
The prefetch unit 108 may prefetch instructions from the memory 6 for storage within the instruction cache 106. In one embodiment, the prefetch unit 108 may be configured to prefetch instructions from different sized memory pages. More particularly, as described further below, prefetch unit 108 may maintain page size information for each data stream for which it is prefetching. An exemplary embodiment of a prefetch unit 108 will now be discussed in further detail below.
Turning to
The address storage 301 may store and maintain information from the miss addresses which have been observed by the prefetch unit 108. The address storage 301 comprises at least one entry, and may include any number of entries. In one embodiment, each entry may represent a pattern of miss addresses, where consecutive addresses within the pattern are separated by a fixed stride amount. The most recent address of a given pattern may be recorded in the corresponding entry of the address storage 301, along with other information (not shown) that may be used to indicate the number of addresses detected in that pattern. The more addresses that have matched the pattern, the more likely the pattern is to repeat in the future. The prefetch control unit 305 may receive information (e.g., a miss signal), for example, from the data cache 128 (which may indicate, when asserted, that the address presented to the data cache 128 by the load-store unit 126 is a miss in the data cache 128), and may update the address storage 301 when a miss address is received. While a miss signal is used in the present embodiment, other embodiments may use a hit signal or any other indication of the hit/miss status of an address presented to data cache 128.
In the embodiment of
The Address field 321 stores the most recent address which was detected by prefetch control unit 305 to be part of the access pattern represented by entry 309. In various embodiments, the whole physical address or only a portion of the physical address may be stored. Particularly, in one implementation, the bits of the physical address which are not part of the cache line offset may be stored. The cache line offset portion (in this case, six bits since cache lines are 64 bytes, although other embodiments may employ different cache line sizes) is not stored since cache lines are prefetched in response to prefetch addresses generated by prefetch unit 108 and thus strides of less than a cache line are not of interest to prefetch unit 108. Viewed in another way, the granularity of addresses in prefetch unit 108 is a cache line granularity. Any granularity may be used in other embodiments, including larger and smaller granularities. Generally, addresses are said to “match” if the bits which are significant to the granularity in use are equal. For example, if a cache line granularity is used, the bits which are significant are the bits excluding the cache line offset bits. Accordingly, addresses match if bits 35:6, for example, of the two addresses are equal.
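For illustration, the cache-line-granularity comparison described above could be expressed as in the following C sketch; the 64-byte line size follows the example above, and the function name is hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

/* 64-byte cache lines give a 6-bit line offset, per the example above. */
#define CACHE_LINE_OFFSET_BITS 6u

/* Two physical addresses "match" at cache-line granularity when all bits
 * above the line offset (e.g., bits 35:6) are equal. */
static bool addrs_match(uint64_t a, uint64_t b)
{
    return (a >> CACHE_LINE_OFFSET_BITS) == (b >> CACHE_LINE_OFFSET_BITS);
}
```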
The page size field 323 stores a value indicative of the memory page size to which the address in the entry corresponds. The page size may be included with the address information received by the prefetch unit 108. The page size may be used by the prefetch control unit 305 to ensure that a page boundary is not crossed when prefetching addresses.
In one embodiment, when a miss address is received by prefetch control unit 305, the miss address is compared to the addresses recorded in the address storage 301 to determine if the miss address matches any of the recorded patterns. If the miss address does not match one of the recorded patterns, prefetch control unit 305 may allocate an entry in the address storage 301 to the address. In this manner, new patterns may be detected. However, if prefetch control unit 305 detects that the miss address matches one of the recorded patterns, prefetch control unit 305 may change information (e.g., increment a confidence counter) in the corresponding entry and may store the miss address in the corresponding entry.
The LRU field 325 stores a least recently used (LRU) value ranking the recentness of entry 309 among the entries in the address storage 301. The least recently used entry may be replaced when an address not fitting any of the patterns in the address storage 301 is detected, and the prefetch unit 108 attempts to track a new pattern beginning with that address. It is noted that while an LRU ranking is used in the present embodiment, any replacement strategy may be used (e.g., modified LRU, random, etc.).
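As a purely illustrative data-structure sketch in C, an entry such as entry 309 might be modeled as follows. The field widths, the valid bit, and the stride and confidence fields are assumptions made for the sketch; the text above only requires an address, a page size indication, and an LRU ranking.

```c
#include <stdbool.h>
#include <stdint.h>

/* Illustrative model of one entry (entry 309) in the address storage 301. */
struct stream_entry {
    bool     valid;       /* entry currently tracks an active data stream       */
    uint64_t address;     /* most recent address in the pattern (Address 321)   */
    uint64_t page_size;   /* page size of the stream, in bytes (page size 323)  */
    int64_t  stride;      /* fixed stride between consecutive addresses         */
    uint8_t  confidence;  /* count of addresses that have matched the pattern   */
    uint8_t  lru_rank;    /* recentness ranking among entries (LRU 325)         */
};
```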
In one embodiment, the prefetch unit 108 may operate on physical addresses (i.e. addresses which have been translated through the virtual to physical address translation mechanism of processor core 11). Accordingly, the addresses stored in the address storage 301 are physical addresses. In this manner, translation of prefetch addresses may be avoided. Additionally, in such embodiments, since the prefetch unit 108 may not generate prefetch addresses which cross a page boundary (since virtual pages may be arbitrarily mapped to physical pages, a prefetch in the next physical page may not be part of the same stride pattern of virtual addresses), the prefetch unit 108 keeps track of the page size of each stream for which it is generating prefetch addresses.
More particularly, as described above, the prefetch unit 108 receives a page size indication with each address. If a new entry is allocated for the address because the address does not match any of the entries in the address storage 301, the prefetch control unit 305 stores the page size value within the page size field in the entry. In one embodiment, the page boundary comparator 307 may compare prefetch addresses generated by the stream predictor 303 to ensure that a prefetch address does not cross a page boundary. Accordingly, for each prefetch address generated, the page boundary comparator 307 may use the page size information in the entry corresponding to the current prefetch address to determine whether the current prefetch address is within the page boundary for the current stream. More particularly, the page size value in the page size field 323 may be used to set the limits within the page boundary comparator 307 for each stream independently. This is in contrast to conventional prefetchers in which all streams are subject to the page boundary for one page size, usually the minimum page size supported by the system (e.g., 4 KB). In such a conventional system, if the page size being accessed is, for example, 2 GB, the prefetcher may stop prefetching each time an address is going to cross a 4 KB page boundary. Thus, the conventional prefetcher would need to invalidate that entry, and start and train a new entry to continue prefetching the same stream.
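A minimal sketch of such a per-stream check, reusing the stream_entry structure sketched above and assuming the page size is a power of two, might look as follows; the function name is hypothetical.

```c
/* Hypothetical per-stream boundary check for the page boundary comparator 307.
 * The prefetch is allowed only when the prefetch address lies in the same page
 * as the entry's current address, where the page size is taken from the entry
 * itself rather than from a single system-wide minimum page size. */
static bool within_page(const struct stream_entry *e, uint64_t prefetch_addr)
{
    uint64_t page_mask = ~(e->page_size - 1u);   /* page_size assumed power of two */
    return (prefetch_addr & page_mask) == (e->address & page_mask);
}
```

The key difference from a conventional prefetcher is that the page mask is derived per entry, so a stream mapped by a large page is not cut off at every minimum-size page boundary.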
In
The prefetch control unit 305 may check the address storage 301 to see if there is an entry that matches the address (block 403). For example, prefetch control unit 305 may compare the received address with the address stored in the address field 321 of each entry. If the received address matches one of the entries, the stream predictor 303 generates a prefetch address (block 405).
The page boundary comparator 307 checks the prefetch address to ensure that it will not cross the page boundary of the current page of memory that the prefetch address will access (block 407). More particularly, the prefetch control unit 305 may retrieve the page size value from the page size field 323 of the current entry to establish the compare values in the page boundary comparator 307. If the page boundary will be crossed, the prefetch control unit 305 may inhibit the prefetch, and in one embodiment, invalidate the entry in the address storage 301, making that entry available for re-allocation (block 413). In one implementation, the prefetch control unit 305 may change the LRU field 325 of all the entries such that the entry that is being invalidated will have a value that indicates it is the least recently used entry and is therefore subject to reallocation.
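As a purely illustrative sketch of that LRU update, continuing the stream_entry model above (the table size and names are hypothetical assumptions for the sketch):

```c
/* Hypothetical table of entries; the size is an assumption for the sketch. */
#define NUM_ENTRIES 8
extern struct stream_entry address_storage[NUM_ENTRIES];

/* Invalidate an entry and re-rank all entries so the victim becomes the least
 * recently used entry and is therefore the next candidate for re-allocation
 * (block 413). Rank 0 is most recent; rank NUM_ENTRIES-1 is least recent. */
static void invalidate_and_make_lru(struct stream_entry *victim)
{
    for (int i = 0; i < NUM_ENTRIES; i++) {
        if (address_storage[i].lru_rank > victim->lru_rank)
            address_storage[i].lru_rank--;   /* entries behind the victim move up */
    }
    victim->lru_rank = NUM_ENTRIES - 1;      /* victim becomes least recently used */
    victim->valid    = false;
}
```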
Referring back to block 407, if the page boundary will not be crossed, prefetch control unit 305 prefetches the data at the prefetch address (block 411). In one embodiment, the prefetch control unit 305 forwards the prefetch address to the north bridge 12 and/or to the memory controller 18. Operation proceeds as described above in conjunction with the description of block 401.
Referring back to block 403, if the received miss address does not match any address within the address storage 301, the prefetch control unit 305 may allocate an entry and store the address and page size information to start a new data stream (block 415). In one embodiment, prefetch control unit 305 may select an entry that has not yet been allocated, or if there are no unallocated entries, the prefetch control unit 305 may allocate the entry having an LRU field 325 that contains a value that is indicative that the entry is the least recently used entry. Operation proceeds as described above in conjunction with the description of block 405 in which a prefetch address is generated.
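Putting the pieces together, the following C sketch walks through the flow of blocks 401 through 415 using the stream_entry and within_page sketches above. The helper functions are hypothetical stand-ins for the hardware described in the text, not part of the disclosure itself.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins for the hardware described above; declarations only. */
struct stream_entry *find_matching_entry(uint64_t miss_addr);
struct stream_entry *allocate_entry(void);       /* unallocated or LRU entry */
void issue_prefetch(uint64_t prefetch_addr);
void invalidate_entry(struct stream_entry *e);

/* Sketch of the handling of one miss address (blocks 401-415). */
void on_cache_miss(uint64_t miss_addr, uint64_t page_size)
{
    struct stream_entry *e = find_matching_entry(miss_addr);      /* block 403 */
    if (e == NULL) {
        /* No recorded pattern matched: allocate an entry and record the address
         * and page size to start a new data stream (block 415). The sketch
         * assumes allocate_entry() initializes a default stride. */
        e = allocate_entry();
        e->address   = miss_addr;
        e->page_size = page_size;
    }

    uint64_t prefetch_addr = e->address + (uint64_t)e->stride;    /* block 405 */

    if (within_page(e, prefetch_addr)) {                          /* block 407 */
        issue_prefetch(prefetch_addr);                            /* block 411 */
    } else {
        /* The prefetch would cross the stream's page boundary: inhibit it and
         * invalidate the entry for re-allocation (block 413). */
        invalidate_entry(e);
    }
}
```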
Turning to
Generally, the database 505 of the processing node 2 carried on the computer accessible storage medium 500 may be a database or other data structure which can be read by a program and used, directly or indirectly, to fabricate the hardware comprising the processing node 2. For example, the database 505 may be a behavioral-level description or register-transfer level (RTL) description of the hardware functionality in a high level design language (HDL) such as Verilog or VHDL. The description may be read by a synthesis tool which may synthesize the description to produce a netlist comprising a list of gates from a synthesis library. The netlist comprises a set of gates which also represent the functionality of the hardware comprising the processing node 2. The netlist may then be placed and routed to produce a data set describing geometric shapes to be applied to masks. The masks may then be used in various semiconductor fabrication steps to produce a semiconductor circuit or circuits corresponding to the processing node 2. Alternatively, the database 505 on the computer accessible storage medium 500 may be the netlist (with or without the synthesis library) or the data set, as desired.
While the computer accessible storage medium 500 carries a representation of the processing node 2, other embodiments may carry a representation of any portion of the processing node 2, such as one of the processor cores 11, as desired.
Thus, the above embodiments may provide a prefetch mechanism that enables many types of prefetchers to continue prefetching addresses up to the boundary of the page within which the address actually falls, rather than having to stop at a page boundary of the minimum page size supported by the system. Accordingly, prefetch unit 108, which is page size aware, may be more efficient during prefetch operations.
Although the embodiments above have been described in considerable detail, numerous variations and modifications will become apparent to those skilled in the art once the above disclosure is fully appreciated. It is intended that the following claims be interpreted to embrace all such variations and modifications.