1. Field
The present disclosure pertains to the field of data processing apparatuses and, more specifically, to the field of prefetching data in data processing apparatuses.
2. Description of Related Art
In typical data processing apparatuses, data needed to process an instruction may be stored in a memory. The latency of fetching the data from the memory may add to the time required to process the instruction, thereby decreasing performance. To improve performance, techniques for speculatively fetching data before it may be needed have been developed. Such prefetching techniques involve moving the data closer to the processor in the memory hierarchy, for example, moving data from main system memory to a cache, so that if it is needed to process an instruction, it will take less time to fetch it.
However, the prefetching of data that is not needed to process an instruction is a waste of time and resources. Therefore, important considerations in the implementation of prefetching include a determination of what data to prefetch and when to prefetch it. For example, one approach is to use prefetch circuitry to identify and store the typical distance (the “stride”) between the addresses of data needed for successive iterations of a particular instruction. Then, the decoding of that instruction is used as a trigger to prefetch data from the memory location that is a stride-length away from the address from which data is presently needed.
In a software-based approach to prefetching, a main instruction stream is processed prior to run-time to identify instructions likely to cause a cache miss, to select a subset of the main instruction stream for computing the address of the data needed to prevent the cache miss, and to embed a trigger point in the main instruction stream for triggering the execution of the subset of the instruction stream in a separate thread from the main instruction stream. In this way, at run-time, the separate thread (a “helper thread”) is executed to prefetch the data and the cache miss is prevented.
The present invention is illustrated by way of example and not limitation in the accompanying figures.
The following description describes embodiments of techniques for prefetching based on register tracking. In the following description, numerous specific details such as processor and system configurations are set forth in order to provide a more thorough understanding of the present invention. It will be appreciated, however, by one skilled in the art that the invention may be practiced without such specific details. Additionally, some well known structures, circuits, and the like have not been shown in detail, to avoid unnecessarily obscuring the present invention.
Embodiments of the present invention provide techniques for prefetching data, where data may be any type of information, including instructions, represented in any form recognizable to the data processing apparatus in which the techniques are used. The data may be prefetched from any level in a memory hierarchy to any other level, for example, from a main system memory to a level one (“L1”) cache, and may be used in data processing apparatuses with any other levels of memory hierarchy, between, above, or below the levels from and to which the prefetching is performed. For example, in a data processing system with a main memory, a level two (“L2”) cache, and an L1 cache, the prefetching techniques may be used to prefetch data to the L1 cache from either the L2 cache or main memory, depending on where the data may be found at the time of the prefetch, and may be used in conjunction with any other hardware or software based techniques for prefetching to either the L1 or the L2 cache, or both.
Processor 100 is shown in
In processor 100, program flow is determined by instruction pointer 101. For example, instruction pointer 101 may be incremented to process instructions sequentially. Program flow may be redirected by executing a branch instruction to change instruction pointer 101. References to instructions or operations in this description may be to any instructions, micro-instructions, pseudo-instructions, operations, micro-operations, pseudo-operations, or information in any other form directly or indirectly executable or interpretable by processor 100, or any subset of such instructions or information.
To process instructions, instruction pointer 101 is used to access instruction cache 102. Instructions from instruction cache 102 are decoded by instruction decode unit 104 into an opcode (“OP”), a destination register designator (“DST”), one or more source register designators (“SRC”s) and an optional immediate (“Immed”) operand. The source register designators are used to read source register operands out of architected register file 106. Source register and immediate operands are sent, along with the opcode, to instruction execution unit 108.
Results from execution unit 108 may be written into architected register file 106, to the register designated by DST, or into L1 cache 120. To execute a load instruction, processor 100 calculates the load address, reads from that load address, and writes the data into architected register file 106, to the register designated by DST. Load and store requests from execution unit 108, along with prefetch requests from IP-based stride prefetcher 110 and p-engine 116, access the memory hierarchy via L1 request queue 118. Load, store, and prefetch requests that miss L1 cache 120 are forwarded to L2 request queue 122. These miss requests access data in L2 cache 124 or system memory 126, returning data to L1 cache 120 as needed. Load requests may also be used by IP-based stride prefetcher 110, according to any known approach, and ART 112, according to an embodiment of the present invention.
ART 112 uses load requests to monitor changes to registers that may be used to contain information for calculating an address of data in system 150. In this embodiment, ART 112 monitors changes to registers in architected register file 106 that may subsequently be used as base or index for memory accesses, such as, for example, the EBX and ESI registers, respectively, in the architecture of the Pentium® Processor Family. Other embodiments may include a temporary register, such as the EAX register in the architecture of the Pentium® Processor Family. In an embodiment where architectural registers are pushed onto and subsequently popped from the stack of an instruction stream or thread, any of a variety of known stack renaming mechanisms may be used to track changes to these registers.
ART 112 also generates pre-computation slices (“p-slices,” where a p-slice is a simple sequence of instructions to pre-compute a result that would subsequently be computed by a main sequence of instructions, where the simple sequence of instructions may or may not be a subset of the main sequence of instructions) based on changes to the contents of the base and index registers, or, in other embodiments, other registers. The p-slices may be used to calculate a memory address based on the contents of the register and to access that memory address, so that if that memory address is not presently accessible by accessing L1 cache 120, a prefetch of the data at or from that address to L1 cache 120 will occur. The address may be an address according to any approach for organizing memory in a data processing apparatus, for example, a physical address or a virtual address. The instruction in the main sequence of instructions that would otherwise cause an L1 cache miss is referred to as a “delinquent load” instruction.
These p-slices are stored in p-cache 114, along with associated trigger and target instruction pointers. In this embodiment, a p-slice may be associated with one or two trigger instruction pointers and one target instruction pointer. A trigger instruction pointer is the IP of a load instruction that loads the base or index register, and the target instruction pointer is the IP of a load instruction that first references the newly loaded base or index register.
The IP associated with each load request is also used to access p-cache 114. Any p-slices in p-cache 114 associated with this IP may be executed by p-engine 116 to prefetch the data. The target instruction pointers associated with these prefetch requests are then used, recursively, to access both p-cache 114 and IP-based stride prefetcher 110. Target instruction pointers of delinquent loads are the most valuable, as far as prefetching is concerned, but there are often several linked accesses between L1 cache misses. A single load instruction will typically trigger a sequence of p-engine prefetch requests and, possibly, one or more IP-based stride prefetch requests.
P-cache 114 may also be used to store p-slices or other instructions or operations generated by any other known approach. For example, p-cache 114 may store helper threads for prefetching according to any known technique, such as software-based prefetching, such that the helper threads may be executed by p-engine 116. P-cache 114 is not used to store p-slices for strided accesses in this embodiment, as prefetch requests from IP-based stride prefetcher 110 are sent directly to L1 request queue 118. However, an embodiment where p-cache 114 is used for strided accesses is possible within the scope of the present invention.
ART 200 is coupled to an instruction decoder, as ART 112 is coupled to instruction decode unit 104 in the embodiment of
ART array 202 includes one entry per architected register. Each entry includes p-slice valid indicator field 216, trigger-IP field 216, and p-slice field 218 to hold one or more p-slice operations. Any decoded instruction that updates an architected register is also used to update the ART entry associated with that register. A load instruction clears p-slice field 218, sets p-slice valid indicator field 216, and enters the load instruction's IP in trigger-IP field 216 for the entry associated with the destination register. A move instruction copies the ART entry associated with the move instruction's SRC field to the ART entry associated with the move instruction's DST field.
For the remaining decoded instructions, the current p-slice 210 generated by recode logic 204 is appended to the existing ART entries 206 and 208, if any, associated with one or both of the decoded instruction's SRC fields (“SRC0” and “SRC1”) to generate merged ART entry 212. The IP from trigger-IP field 216 for ART entry 206 associated with SRC0 is used for the trigger-IP field 216 for merged ART entry 212, unless SRC0 is null (i.e., SRC0 is not a valid architected register) or the value in the decoded instruction's DST field equals the value in the decoded instruction's SRC1 field. P-slice valid indicator field 216 for merged ART entry 212 is the logical AND of the p-slice valid indicators for the current decoded instruction, ART entry 206 associated with SRC0 (unless SRC0 is null), and ART entry 208 associated with SRC1 (unless SRC1 is null), unless length checker 214 determines that there are more than a certain number (e.g., six) p-slices in merged ART entry 212, in which case p-slice valid indicator field 216 is reset. Merged ART entry 212 then replaces the ART entry associated with the current decoded instruction's DST field.
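The ART update rules above may be easier to follow as a small behavioral model. The following Python sketch is illustrative only and is not the disclosed circuit: the names `ArtEntry`, `update_art`, `MAX_OPS`, and `recodable` are hypothetical, p-slice operations are modeled as plain strings, and forwarding of the merged entry to a p-cache is omitted. Where the description does not state which trigger IP is used when SRC0 is null or DST equals SRC1, the sketch assumes the trigger IP of the SRC1 entry.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional

MAX_OPS = 6  # example length-checker limit ("e.g., six" in the text)


@dataclass
class ArtEntry:
    valid: bool = False                # p-slice valid indicator
    trigger_ip: Optional[int] = None   # IP of the load that last produced the value
    ops: List[str] = field(default_factory=list)  # p-slice operations


def update_art(art: Dict[str, ArtEntry], ip: int, op: str, dst: str,
               src0: Optional[str] = None, src1: Optional[str] = None,
               recodable: bool = True) -> ArtEntry:
    """Apply one decoded instruction to the ART (register name -> entry)."""
    if op == "load":
        # A load clears the p-slice, sets the valid indicator, records its own IP.
        art[dst] = ArtEntry(valid=True, trigger_ip=ip)
        return art[dst]

    # Untracked source registers are modeled as default (invalid) entries;
    # a null source (None) contributes nothing.
    e0 = art.setdefault(src0, ArtEntry()) if src0 is not None else None
    e1 = art.setdefault(src1, ArtEntry()) if src1 is not None else None

    if op == "move":
        # A move copies the SRC entry to the DST entry.
        src = e0 if e0 is not None else ArtEntry()
        art[dst] = ArtEntry(src.valid, src.trigger_ip, list(src.ops))
        return art[dst]

    # Remaining instructions: append the current operation to the source entries.
    ops: List[str] = []
    valid = recodable  # validity of the current decoded instruction itself
    for entry in (e0, e1):
        if entry is not None:
            ops.extend(entry.ops)
            valid = valid and entry.valid
    ops.append(f"{op} {dst}")

    # Trigger IP comes from SRC0 unless SRC0 is null or DST equals SRC1
    # (assumption: the SRC1 entry supplies it in those cases).
    chosen = e1 if (e0 is None or dst == src1) else e0
    trigger_ip = chosen.trigger_ip if chosen is not None else None

    if len(ops) > MAX_OPS:  # length checker resets the valid indicator
        valid = False

    art[dst] = ArtEntry(valid, trigger_ip, ops)
    return art[dst]
```

In this model, a load into EBX followed by an add that reads EBX and writes ECX leaves ECX with a valid one-operation p-slice whose trigger IP is the IP of the load.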
Recode logic 204 also detects whether certain types of compare-branch instruction pairs are used as either loop terminators or array index limit checks, and recodes these instructions with PrefetchCEnd operations and sets the p-slice valid indicator. In this special case, since compare or branch instructions do not normally update destination registers, whichever of the two source registers for the compare instruction was most recently used as an index register is used as the destination register to update ART 202.
If the current decoded instruction is a load instruction, merged ART entry 212 is also forwarded to a p-cache, such as p-cache 114 in the embodiment of
In the embodiment of
P-engine 400 is coupled to a p-cache, as p-engine 116 is coupled to p-cache 114 in the embodiment of
As shown in the embodiment of
P-engine 400 stalls if a p-slice accesses base register 414 and base-valid bit 424 is not set, or if a p-slice accesses index register 416 and index-valid bit 426 is not set. This stall mechanism is used to handle the case of an L1 cache miss on the load instruction that initializes base register 414 or index register 416, and the case of base register 414 or index register 416 being required but not yet initialized.
Upon completion of the execution of all p-slices for a p-cache entry 402, busy bit 422, base-valid bit 424, and index-valid bit 426 are reset. If the last p-slice for the p-cache entry 402 is a PrefetchCEnd operation, p-engine 400 tests the loop ending or array index limit condition and loops back to the first p-slice for the p-cache entry 402 if the condition is not met.
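The stall, reset, and loop-back behavior described above can be modeled in software as follows. This Python sketch rests on several assumptions that are not in the description: the class name `PEngine`, the tuple encoding of p-slice operations, the `base + index * scale + disp` address form, the `index < limit` test standing in for the PrefetchCEnd condition, and the `max_iters` safety bound are all hypothetical. A stall is modeled by returning `None` with the busy bit left set.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class PEngine:
    base: int = 0
    index: int = 0
    busy: bool = False
    base_valid: bool = False
    index_valid: bool = False

    def ready(self, uses_base: bool, uses_index: bool) -> bool:
        """Return False when the engine must stall on an operation."""
        return ((self.base_valid or not uses_base)
                and (self.index_valid or not uses_index))

    def run(self, ops: List[Tuple], max_iters: int = 64) -> Optional[List[int]]:
        """Run one p-cache entry; return prefetch addresses, or None on a stall."""
        self.busy = True
        issued: List[int] = []
        for _ in range(max_iters):            # hypothetical bound, not in the text
            loop_again = False
            for op in ops:
                kind = op[0]
                if kind == "prefetch":        # ("prefetch", scale, disp)
                    if not self.ready(True, True):
                        return None           # stall: a needed register is not valid
                    issued.append(self.base + self.index * op[1] + op[2])
                elif kind == "add_index":     # ("add_index", step)
                    if not self.ready(False, True):
                        return None
                    self.index += op[1]
                elif kind == "PrefetchCEnd":  # ("PrefetchCEnd", limit)
                    # Loop back to the first operation while the end
                    # condition is not met (assumed form: index < limit).
                    loop_again = self.index < op[1]
            if not loop_again:
                break
        # All operations completed: reset the busy and valid bits.
        self.busy = self.base_valid = self.index_valid = False
        return issued
```

With `base=0x1000`, both valid bits set, and the operations `[("prefetch", 8, 0), ("add_index", 1), ("PrefetchCEnd", 3)]`, the model issues prefetches for three consecutive 8-byte elements and then clears its state bits.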
P-engine prefetch requests 412 are only issued when p-engine 400 is able to run ahead of the processor core. If the load instructions are hitting the L1 cache, p-engine prefetch requests 412 are blocked. If the processor stalls, due to either an L1 or L2 cache miss, p-engine 400 begins to issue prefetch requests 412 for recently issued load instructions that hit in the p-cache. Similarly, if a subsequently completed load instruction has an IP 501 matching the target-IP associated with instruction register 406, p-engine 400 is reset.
P-engine prefetch requests 412 may be chained. Each p-engine prefetch request 412 has a target-IP associated with it from its p-cache entry 402. When the p-engine prefetch request 412 completes (i.e., data is returned), its target-IP is used to access the p-cache. Any p-cache entries whose base-trigger-IP or index-trigger-IP match the target-IP of prefetch request 412 will be sent to p-engine 400 (or any available p-engine in an embodiment having multiple p-engines) for execution.
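Chaining can be pictured as a worklist that repeatedly looks up the p-cache by IP. The Python sketch below is a simplified model under stated assumptions: `PCacheEntry`, `chain`, and the `name` field are hypothetical, every prefetch is assumed to complete immediately, and the `max_depth` bound is added here only to keep the model finite (the description does not specify such a limit).

```python
from collections import deque
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class PCacheEntry:
    name: str                              # label used only for illustration
    target_ip: int                         # IP of the load the p-slice prefetches for
    base_trigger_ip: Optional[int] = None
    index_trigger_ip: Optional[int] = None


def chain(pcache: List[PCacheEntry], load_ip: int, max_depth: int = 8) -> List[str]:
    """Return the names of p-cache entries executed, in order, for one load IP."""
    executed: List[str] = []
    pending = deque([(load_ip, 0)])        # (IP used for lookup, chain depth)
    while pending:
        ip, depth = pending.popleft()
        if depth >= max_depth:             # hypothetical bound on chain length
            continue
        for entry in pcache:
            if ip in (entry.base_trigger_ip, entry.index_trigger_ip):
                executed.append(entry.name)
                # When the prefetch completes, its target-IP is used to
                # access the p-cache again.
                pending.append((entry.target_ip, depth + 1))
    return executed
```

For example, if entry A is triggered by IP 0x10 with target-IP 0x20 and entry B is triggered by IP 0x20, a load at IP 0x10 leads to A and then B being executed.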
Since, as described above, p-engine prefetch requests 412 are associated with a target-IP, p-engine prefetch requests 412 may be used to access a stride-filtering mechanism, a p-cache, and a p-engine in the same manner as a load instruction.
In block 602, an IP-history associative array is checked to determine if an entry exists for the current IP value. If an entry does not exist, flow proceeds to block 604. If the load request encounters a cache miss in block 604, then in block 618, a new entry is created and initialized in the IP-history array. If the load request does not encounter a cache miss in block 604, no new entry is created.
However, if in block 602, an entry already exists for the current IP value in an IP-history associative array, flow proceeds to block 603. From block 603, if the IP-history array match is based on a target-IP of a p-engine prefetch request, then, in block 605, a stride-based prefetch request is triggered. The triggering of a prefetch request in block 605 may be qualified based on the confidence field in the IP-history array entry. The IP-history array is not updated based on p-engine prefetch requests.
From block 603, if the IP-history array match is not based on a target-IP of a p-engine prefetch request, then, in block 606, the stride is calculated based on the current and previous address values in the entry for the current IP value. Then, in block 608, the calculated stride is compared to the current stride value in the entry for the current IP value. If it matches, then in block 612, the confidence field in the entry for the current IP value is updated and a stridematch indicator is set. The stridematch indicator is sent to an L1 request queue to generate one or more prefetch requests, and is also sent to a p-engine to remove strided accesses from a p-cache. Strided access patterns are not included in the p-cache and load instructions with a known, constant stride do not access the p-cache or the p-engine, which may reduce p-cache size and increase overall prefetch effectiveness. From block 612, in block 620, the address and previous stride values in the entry for the current IP value are updated.
If, however, in block 608, the calculated stride does not match the current stride value in the entry for the current IP value, then, in block 610, the calculated stride is compared to the previous stride value in the entry for the current IP value. If it matches, then, in block 616, the stride for the current IP value is updated and the confidence field is cleared. Then, in block 620, the address and previous stride values in the entry for the current IP value are updated.
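The flow of blocks 602 through 620 can be summarized with a short behavioral model. The Python sketch below is illustrative and makes several assumptions not stated in the description: the names `IpHistoryEntry`, `observe`, and `MIN_CONFIDENCE` are hypothetical, the confidence update in block 612 is modeled as an increment, a stride match yields a single prefetch address one stride ahead, and block 620 is assumed to run even when neither comparison matches.

```python
from dataclasses import dataclass
from typing import Dict, Optional

MIN_CONFIDENCE = 1  # hypothetical threshold qualifying block 605


@dataclass
class IpHistoryEntry:
    address: int                       # most recent address seen for this IP
    stride: Optional[int] = None       # current (confirmed) stride
    prev_stride: Optional[int] = None  # stride calculated on the previous access
    confidence: int = 0


def observe(history: Dict[int, IpHistoryEntry], ip: int, address: int,
            cache_miss: bool = False, from_p_engine: bool = False) -> Optional[int]:
    """Process one request; return a stride prefetch address or None."""
    entry = history.get(ip)
    if entry is None:                                   # block 602: no entry
        if cache_miss and not from_p_engine:            # block 604
            history[ip] = IpHistoryEntry(address)       # block 618
        return None
    if from_p_engine:                                   # block 603 -> block 605
        # The array is not updated for p-engine prefetch requests.
        if entry.stride is not None and entry.confidence >= MIN_CONFIDENCE:
            return address + entry.stride
        return None
    stride = address - entry.address                    # block 606
    prefetch = None
    if stride == entry.stride:                          # block 608
        entry.confidence += 1                           # block 612 (assumed increment)
        prefetch = address + stride                     # stridematch
    elif stride == entry.prev_stride:                   # block 610
        entry.stride = stride                           # block 616
        entry.confidence = 0
    entry.address = address                             # block 620
    entry.prev_stride = stride
    return prefetch
```

In this model a new stride is adopted only after it has been calculated on two successive accesses, and a prefetch is generated only once the adopted stride is observed again.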
Within the scope of the present invention, methods 500 and 600 may be performed in a different order, with illustrated blocks omitted, with additional blocks added, or with a combination of reordered, combined, omitted, or additional blocks. For example, in method 500, block 550 may be omitted in an embodiment where software generation of p-slices is not used. Furthermore, embodiments of the present invention may be applied to add prefetch chaining to any type of IP-based prefetching and their application is not limited to IP-based prefetching according to method 600.
Processor 100, or any other processor or component designed according to an embodiment of the present invention, may be designed in various stages, from creation to simulation to fabrication. Data representing a design may represent the design in a number of manners. First, as is useful in simulations, the hardware may be represented using a hardware description language or another functional description language. Additionally or alternatively, a circuit level model with logic and/or transistor gates may be produced at some stages of the design process. Furthermore, most designs, at some stage, reach a level where they may be modeled with data representing the physical placement of various devices. In the case where conventional semiconductor fabrication techniques are used, the data representing the device placement model may be the data specifying the presence or absence of various features on different mask layers for masks used to produce an integrated circuit.
In any representation of the design, the data may be stored in any form of a machine-readable medium. An optical or electrical wave modulated or otherwise generated to transmit such information, a memory, or a magnetic or optical storage medium, such as a disc, may be the machine-readable medium. Any of these mediums may “carry” or “indicate” the design, or other information used in an embodiment of the present invention, such as the instructions in an error recovery routine. When an electrical carrier wave indicating or carrying the information is transmitted, to the extent that copying, buffering, or re-transmission of the electrical signal is performed, a new copy is made. Thus, the actions of a communication provider or a network provider may be making copies of an article, e.g., a carrier wave, embodying techniques of the present invention.
Thus, techniques for prefetching based on register tracking are disclosed. While certain embodiments have been described, and shown in the accompanying drawings, it is to be understood that such embodiments are merely illustrative of and not restrictive on the broad invention, and that this invention not be limited to the specific constructions and arrangements shown and described, since various other modifications may occur to those ordinarily skilled in the art upon studying this disclosure. In an area of technology such as this, where growth is fast and further advancements are not easily foreseen, the disclosed embodiments may be readily modifiable in arrangement and detail as facilitated by enabling technological advancements without departing from the principles of the present disclosure or the scope of the accompanying claims.