Traditional processor designs make use of various cache structures to store local copies of instructions and data in order to avoid lengthy access times of typical DRAM memory. In a typical cache hierarchy, caches closer to the processor (level one or L1) tend to be smaller and very fast, while caches closer to the DRAM (L2 or L3) tend to be significantly larger but also slower (longer access time). The larger caches tend to handle both instructions and data, while quite often a processor system will include separate data cache and instruction cache at the L1 level (i.e. closest to the processor core). All of these caches typically have similar organization, with the main difference being in specific dimensions (e.g. cache line size, number of ways per congruence class, number of congruence classes).
In the case of an L1 Instruction cache, the cache is accessed either when code execution reaches the end of the previously fetched cache line or when a taken (or at least predicted taken) branch is encountered within the previously fetched cache line. In either case, a next instruction address is presented to the cache. In typical operation, a congruence class is selected via an abbreviated address (ignoring high-order bits), and a specific way within the congruence class is selected by matching the address to the contents of an address field within the tag of each way within the congruence class. Addresses used for indexing and for matching tags can use either effective or real addresses depending on system issues beyond the scope of this discussion. Typically, low order address bits (e.g. selecting specific byte or word within a cache line) are ignored for both indexing into the tag array and for comparing tag contents. This is because for conventional caches, all such bytes/words will be stored in the same cache line.
Recently, Instruction Caches that store traces of instruction execution have been used, most notably with the Intel Pentium 4. These “Trace Caches” typically combine blocks of instructions from different address regions (i.e. that would have required multiple conventional cache lines). The objective of a trace cache is to handle branching more efficiently, at least when the branching is well predicted. The instruction at a branch target address is simply the next instruction in the trace line, allowing the processor to execute code with high branch density just as efficiently as it executes long blocks of code without branches. Just as parts of several conventional cache lines may make up a single trace line, several trace lines may contain parts of the same conventional cache line. Because of this, the tags must be handled differently in a trace cache.
In a conventional cache, low-order address lines are ignored, but for a trace line, the full address must be used in the tag. A related difference is in handling the index into the cache line. For conventional cache lines, the least significant bits are ignored in selecting a cache line (both index & tag compare), but in the case of a branch into a new cache line, those least significant bits are used to determine an offset from the beginning of the cache line for fetching the first instruction at the branch target. In contrast, the address of the branch target will be the first instruction in a trace line. Thus no offset is needed. Flow-through from the end of the previous cache line via sequential instruction execution simply uses an offset of zero since it will execute the first instruction in the next cache line (independent of whether it is a trace line or not). The full tag compare will select the appropriate line from the congruence class. In the case where the desired branch target address is within a trace line but not the first instruction in the trace line, the trace cache will declare a miss, and potentially construct a new trace line starting at that branch target.
For a trace cache design to function correctly and with a high level of performance, the trace formation methodology is critical to the design. Trace formation involves fetching instructions from a higher level memory, identifying and predicting all branches in the stream, creating a “basic block” of instructions from this and appending it to the current instruction trace. A basic block is defined as all instructions up to and including the first branch in an instruction stream.
This invention contemplates that branches are predicted taken or not taken using a highly accurate branch history table (BHT). Branches that are predicted not taken are appended to a trace buffer and the next basic block is constructed from the remaining instructions in the fetch buffer. Branches that are predicted taken flush the remaining fetch buffer and the next address is determined using a Branch Target Address Register (BTAC). This address is used to fetch the next instruction stream that will be used to build the next basic block. Multiple basic blocks are typically added to the same trace line, within the constraints of trace termination rules to be described below.
Some of the purposes of the invention having been stated, others will appear as the description proceeds, when taken in connection with the accompanying drawings, in which:
While the present invention will be described more fully hereinafter with reference to the accompanying drawings, in which a preferred embodiment of the present invention is shown, it is to be understood at the outset of the description which follows that persons of skill in the appropriate arts may modify the invention here described while still achieving the favorable results of the invention. Accordingly, the description which follows is to be understood as being a broad, teaching disclosure directed to persons of skill in the appropriate arts, and not as limiting upon the present invention.
The term “programmed method”, as used herein, is defined to mean one or more process steps that are presently performed; or, alternatively, one or more process steps that are enabled to be performed at a future point in time. The term programmed method contemplates three alternative forms. First, a programmed method comprises presently performed process steps. Second, a programmed method comprises a computer-readable medium embodying computer instructions which, when executed by a computer system, perform one or more process steps. Third, a programmed method comprises a computer system that has been programmed by software, hardware, firmware, or any combination thereof to perform one or more process steps. It is to be understood that the term programmed method is not to be construed as simultaneously having more than one alternative form, but rather is to be construed in the truest sense of an alternative form wherein, at any given point in time, only one of the plurality of alternative forms is present.
Instruction traces are created by appending basic blocks into the trace formation register. Various rules (stated below) have been defined for forming and ending traces. The purpose of the rules is to form traces that maximize performance while maintaining functionality. Once a trace has been formed, it is written into the trace cache where it can then be accessed for execution.
The present invention contemplates a method in which a cache runs in normal cache mode and then receives traces generated once branch prediction has “warmed up”. The address of the next trace line is stored at the end of the trace. Branch prediction is not required at the output of the cache, which saves logic/cycles by not having to re-predict the address. Only the address of the first basic block in a trace line is needed to access all basic blocks in the trace. Translation information is implicit within a traceline, Termination of a trace line occurs when the next basic block is taken from a page with different memory attributes than other basic blocks in the trace entry.
Termination of a trace line currently under construction occurs in a number of defined circumstances when: (1) a data dependent branch is encountered; (2) a bdnz instruction is encountered; (3) a branch with negative displacement is encountered; (4) a weakly predicted branch is encountered; (5) too many basic blocks are encountered; and (6) a basic block ends close to the end of a trace line.
New trace generation is initiated when a Trace Cache Miss occurs or when a conventional cache line is found in the cache and there is reason to believe that branch prediction is better now than when the line was placed in the cache. The address of the miss (or hit on conventional line) is used to fetch the next group of instructions from higher level memory (second level cache). This address is also used to access the “branch target address cache” (BTAC) which provides the next expected address that needs to be fetched. This next address will be the target of a branch from the first group of instructions or the next sequential address. Eitherway, this address is first used to access the trace cache and if another miss occurs then it is also sent to the second level cache and is considered a prefetch (i.e. predicted address).
Once instructions are returned from the second level cache they are placed in the instruction fetch register (
A “basic block” of instructions is next formed starting with the 8 instructions from instruction fetch and may continue with additional sequential instruction fetches of 8 instruction blocks until the end of that basic block is detected. The basic block includes the first and subsequent instructions up to the first branch instruction. If there are no branches then the basic block contains all 8 instructions and the next address would be the sequential address (next address after last instruction). The basic block is added to the trace formation buffer by appending to the end of an existing trace or is used to begin a new trace.
Once the basic block is moved to the trace buffer, the next set of instructions (fetch or prefetch) are handled in the same way by predicting branches, decoding and using the BTAC to request the next set of instructions.
Once the trace buffer has been filled with basic blocks (see rules below for determining when full) then the trace line is written into the cache.
The address of the next instruction (after the last basic block) is also stored in the cache along with the trace line. This address is determined in the normal way of branch prediction/BTAC look-up while determining basic blocks. When the trace line is accessed from the cache, the next trace is known without going through the branch prediction logic. Address flow is represented in
This trace cache is capable of storing trace lines or normal cache lines (instructions in sequential order). Also, for performance reasons, all instructions arriving from the second level cache can be bypassed around the trace cache and dispatched as normal cache lines. Therefore, while building trace lines the instructions are sent onto the dispatch/execute engines to maintain forward progress while generating traces. Trace generation can be terminated whenever it has been determined that the line being built is no longer good for function or performance. A series of rules have been developed forforming traces.
The set of basic rules governing the building of trace lines (trace generation terminates and a trace is placed in the cache) is listed hereinafter. A system in accordance with this invention may implement one, all or a subset of these rules.
Trace generation is highly dependent upon the branch prediction success rate. In order to make sure that traces are built using “good” branch prediction, it is necessary to wait for the BHT (containing the branch prediction bits) and the BTAC to “warm up”. This process involves running the code in normal cache mode until it has been determined that the branch prediction has warmed up.
Determination of when the BTAC and BHT are “warmed up” is described in a related patent application filed Oct. 5, 2006 under Ser. No. 11/538,831 entitled “Apparatus and Method for Using Branch Prediction Heuristics for Determination of Trace Formation Readiness”. If the BTAC and BHT are not warmed up, trace formation will not even be attempted. Even after warm up is complete, there are several constraints that branch prediction places on trace formation:
Traces must be made from basic blocks (code segments) containing the same protection attributes as each other. This is required since the address of code segments is not maintained in the trace cache (only the starting address and the next address at the end). Therefore, the translation process occurs on all code segments when the trace line is built but only on the starting address of the trace line when the trace is accessed from the cache.
If the cache access is a Miss (meaning data is NOT resident in the cache) then a request is immediately sent to the second level cache for AddrA. AddrA is also used to access the BTAC to obtain the next address to fetch (AddrB). If the BTAC has a valid match for AddrA then AddrB is used to access the trace cache and then sent to the second level cache (if a trace cache miss). If there is not a valid BTAC match for AddrA then AddrB is not known and therefore must wait for AddrA data to compute AddrB.
Once data arrives from the second level cache for AddrA then the BHT is accessed for branch prediction and the instructions are aligned for adding to the current trace. All branches are then predicted taken/not taken and the next address is determined from the first predicted taken branch. This address is compared against the previous address that was read from the BTAC. If they match then the BTAC is accessed again for the next fetch address. If the addresses do not match then the BTAC entry needs to be corrected and any outstanding second level requests must be canceled.
Instructions from the second level cache are then bypassed around the trace cache and are also appended to the trace buffer to continue forming the current trace. Once the trace buffer is full (or achieves one of the trace termination criteria) it is written into the trace cache.
In the drawings and specifications there has been set forth a preferred embodiment of the invention and, although specific terms are used, the description thus given uses terminology in a generic and descriptive sense only and not for purposes of limitation.