The present invention relates generally to techniques for processing instructions in a processor pipeline and more specifically to techniques for generating an early indication of a target address for an indirect branch instruction.
Many portable products, such as cell phones, laptop computers, personal data assistants (PDAs), and the like, use a processing system having at least one processor, a source of instructions, a source of input operands, and storage space for storing results of execution. For example, the instructions and input operands may be stored in a hierarchical memory configuration consisting of general purpose registers and multiple levels of caches, including, for example, an instruction cache, a data cache, and system memory.
In order to provide high performance in the execution of programs, the processor may use speculative execution to fetch and execute instructions beginning at a predicted branch target address. If the branch target address is mispredicted, the speculatively executed instructions must be flushed from the pipeline and the pipeline restarted at a different address. In many processor instruction sets, there is an instruction that branches to a program destination address derived from the contents of a register. Such an instruction is generally named an indirect branch instruction. Due to the indirect branch's dependence on the contents of a register, it is usually difficult to predict the branch target address, since the register may have a different value each time the indirect branch instruction is executed. Since correcting a mispredicted indirect branch generally requires backtracking to the indirect branch instruction in order to fetch and execute the instructions on the correct branching path, the performance of the processor can thereby be reduced. Also, a misprediction indicates that the processor incorrectly speculatively fetched and began processing instructions on the wrong branching path, increasing power consumption both in processing instructions which are not used and in flushing them from the pipeline.
Among its several aspects, the present invention recognizes that performance can be improved by minimizing mispredictions of indirect branch instructions. A first embodiment of the invention recognizes that a need exists for a method which predicts a storage address based on contents of a first program accessible register (PAR) specified in a first instruction, wherein the first PAR correlates with a target address specified by a second PAR in a second instruction. Information is speculatively fetched at the predicted storage address prior to execution of the second instruction.
Another embodiment addresses a method which predicts an evaluation result to branch to a target address for a branch instruction, wherein the prediction is based on a program accessible register (PAR) specified in a first instruction and the specified PAR correlates with a taken evaluation of the branch instruction. Instructions are speculatively fetched at the target address prior to execution of the branch instruction.
Another embodiment addresses an apparatus for speculatively fetching instructions. A first program accessible register (PAR) is configured to store a value that correlates to a target address specified in a branch instruction and a second PAR is configured to store the target address for the branch instruction. A decode circuit is configured to identify the first PAR specified in an advance correlating notice (ADVCN) instruction and to identify the second PAR specified in a branch instruction. A prediction circuit is configured to predict a storage address based on the value in response to the ADVCN instruction, wherein the value stored in the first PAR correlates with the target address identified by the second PAR. A fetch circuit is configured to speculatively fetch instructions beginning at the predicted storage address prior to execution of the branch instruction.
Another embodiment addresses a computer readable non-transitory medium encoded with computer readable program data and code for operating a system. A storage address is predicted based on contents of a first program accessible register (PAR) specified in a first instruction, wherein the first PAR correlates with a target address specified by a second PAR in a second instruction. Information at the predicted storage address is speculatively fetched prior to execution of the second instruction.
A further embodiment addresses an apparatus for speculatively fetching instructions. Means is employed for storing, in a first program accessible register (PAR), a value that correlates to a target address specified in a branch instruction, and for storing the target address for the branch instruction in a second PAR. Means for identifying the first PAR specified in an advance correlating notice (ADVCN) instruction and for identifying the second PAR specified in a branch instruction is also employed. Further, means is employed for predicting a storage address based on the value in response to the ADVCN instruction, wherein the value stored in the first PAR correlates with the target address identified by the second PAR. Means is also employed for speculatively fetching instructions beginning at the predicted storage address prior to execution of the branch instruction.
A more complete understanding of the present invention, as well as further features and advantages of the invention, will be apparent from the following Detailed Description and the accompanying drawings.
The present invention will now be described more fully with reference to the accompanying drawings, in which several embodiments of the invention are shown. This invention may, however, be embodied in various forms and should not be construed as limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the invention to those skilled in the art.
Computer program code or “program code” for being operated upon or for carrying out operations according to the teachings of the invention may be written in a high level programming language such as C, C++, JAVA®, Smalltalk, JavaScript®, Visual Basic®, TSQL, Perl, or in various other programming languages. Programs for the target processor architecture may also be written directly in the native assembler language. A native assembler program uses instruction mnemonic representations of machine level binary instructions. Program code, or a computer readable non-transitory medium encoded with program code, as used herein refers to machine language code, such as object code, whose format is understandable by a processor.
In
The processor pipeline 202 includes six major stages: an instruction fetch stage 214, a decode and advance correlating notice (ADVCN) stage 216, a dispatch stage 218, a read register stage 220, an execute stage 222, and a write back stage 224. Though a single processor pipeline 202 is shown, the processing of instructions with indirect branch target address advance notification of the present invention is applicable to superscalar designs and other architectures implementing parallel pipelines. For example, a superscalar processor designed for high clock rates may have two or more parallel pipelines, and each pipeline may divide the instruction fetch stage 214, the decode and ADVCN stage 216 having an ADVCN logic circuit 217, the dispatch stage 218, the read register stage 220, the execute stage 222, and the write back stage 224 into two or more pipelined stages, increasing the overall processor pipeline depth in order to support a high clock rate.
Beginning with the first stage of the processor pipeline 202, the instruction fetch stage 214, associated with a program counter (PC) 215, fetches instructions from the L1 instruction cache 208 for processing by later stages. If an instruction fetch misses in the L1 instruction cache 208, meaning that the instruction to be fetched is not in the L1 instruction cache 208, the instruction is fetched from the memory hierarchy 212, which may include multiple levels of cache, such as a level 2 (L2) cache, and main memory. Instructions may be loaded to the memory hierarchy 212 from other sources, such as a boot read only memory (ROM), a hard drive, an optical disk, or an external interface, such as the Internet. A fetched instruction is then decoded in the decode and ADVCN stage 216, with the ADVCN logic circuit 217 providing additional capabilities for advance notification of a register that correlates to an indirect branch target address value, as described in more detail below. Associated with ADVCN logic circuit 217 is a branch target address register (BTAR) 219 and the Ptag circuit 221 which may be located in the control circuit 206 as shown in
The dispatch stage 218 takes one or more decoded instructions and dispatches them to one or more instruction pipelines, such as utilized, for example, in a superscalar or a multi-threaded processor. The read register stage 220 fetches data operands from the GPRF 204 or receives data operands from a forwarding network 226. The forwarding network 226 provides a fast path around the GPRF 204 to supply result operands as soon as they are available from the execution stages. Even with a forwarding network, result operands from a deep execution pipeline may take three or more execution cycles. During these cycles, an instruction in the read register stage 220 that requires result operand data from the execution pipeline must wait until the result operand is available. The execute stage 222 executes the dispatched instruction and the write back stage 224 writes the result to the GPRF 204 and may also send the results back to the read register stage 220 through the forwarding network 226 if the result is to be used in a following instruction. Since results may be received in the write back stage 224 out of order compared to the program order, the write back stage 224 uses processor facilities to preserve the program order when writing results to the GPRF 204. A more detailed description of the processor pipeline 202 for providing advance notice of a register that correlates to the target address of an indirect branch instruction is provided below with detailed code examples.
The processor complex 200 may be configured to execute instructions under control of a program stored on a computer readable storage medium. For example, a computer readable storage medium may be either directly associated locally with the processor complex 200, such as may be available from the L1 instruction cache 208, for operation on data obtained from the L1 data cache 210 and the memory hierarchy 212, or accessed through, for example, an input/output interface (not shown). The processor complex 200 also accesses data from the L1 data cache 210 and the memory hierarchy 212 in the execution of a program. The computer readable storage medium may include random access memory (RAM), dynamic random access memory (DRAM), synchronous dynamic random access memory (SDRAM), flash memory, read only memory (ROM), programmable read only memory (PROM), erasable programmable read only memory (EPROM), electrically erasable programmable read only memory (EEPROM), compact disk (CD), digital video disk (DVD), other types of removable disks, or any other suitable storage medium.
The teachings of the invention are applicable to a variety of instruction formats and architectural specifications. For example,
General forms of indirect branch type instructions may be advantageously employed and executed in the processor pipeline 202, for example, branch on register Rx (BX), add PC, move Rx PC, and the like. For purposes of describing the present invention, the BX Rx form of an indirect branch instruction is used in code sequence examples, as described further below.
It is noted that other forms of branch instructions are generally provided in an ISA, such as a branch instruction having a BTA calculated as a sum of an instruction specified offset address and a base address register, and the like. In support of such branch instructions, the processor pipeline 202 may utilize branch history prediction techniques that are based on tracking, for example, conditional execution status of prior branch instruction executions and storing such execution status for use in predicting future execution of these instructions. The processor pipeline 202 may support such branch history prediction techniques and additionally support the use of the ADVCN instruction to provide advance notification of a register that correlates to an indirect branch target address. For example, the processor pipeline 202 may use the branch history prediction techniques until an ADVCN instruction is encountered, which then overrides the branch history prediction techniques using the ADVCN facilities as described herein.
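By way of illustration only, and not as part of the disclosed apparatus, such a branch history prediction technique may be modeled as a two-bit saturating counter maintained per branch instruction address, a common scheme; the class and method names in the following sketch are hypothetical:

```python
class TwoBitPredictor:
    """Illustrative two-bit saturating-counter branch predictor.

    Counter states 0 and 1 predict not taken; states 2 and 3 predict
    taken. Each branch instruction address indexes its own counter.
    """

    def __init__(self):
        self.counters = {}  # instruction address -> counter state (0..3)

    def predict(self, pc):
        # An unseen branch starts in state 1 (weakly not taken).
        return self.counters.get(pc, 1) >= 2

    def update(self, pc, taken):
        # Saturate the counter toward taken (3) or not taken (0).
        c = self.counters.get(pc, 1)
        self.counters[pc] = min(c + 1, 3) if taken else max(c - 1, 0)
```

Two consecutive taken outcomes move an unseen branch from its initial weakly-not-taken state to a taken prediction, and a single subsequent not-taken outcome does not immediately reverse that prediction.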
In other embodiments of the present invention, the processor pipeline 202 may also be set up to monitor the accuracy of using the ADVCN instruction and, when the ADVCN correlated target address was incorrectly predicted one or more times, to ignore the ADVCN instruction for subsequent encounters of the same indirect branch. It is also noted that, for a particular implementation of a processor supporting an ISA having an ADVCN instruction, the processor may treat an encountered ADVCN instruction as a no operation (NOP) instruction or flag the detected ADVCN instruction as undefined. Further, an ADVCN instruction may be treated as a NOP in a processor pipeline having a dynamic branch history prediction circuit with sufficient hardware resources to track branches encountered during execution of a section of code, and the ADVCN instruction may be enabled, as described below, for sections of code which exceed the hardware resources available to the dynamic branch history prediction circuit. Also, the ADVCN instruction may be used in conjunction with a dynamic branch history prediction circuit, providing advance notice of a register that correlates to an indirect branch target address where the dynamic branch history prediction circuit has poor results for predicting indirect branch target addresses. For example, a predicted branch target address generated from a dynamic branch history prediction circuit may be overridden by a target address provided through the use of an ADVCN instruction. In addition, advantageous automatic indirect-target inference methods are presented for providing advance notification of the indirect branch target address, as described below.
When a processor encounters an indirect branch instruction, the processor determines whether to branch or not and also determines a target address of the branch based on the dynamic state of the processor. An indirect branch instruction is generally encoded with a program accessible register (PAR), such as a register from a general purpose register (GPR) file or other program accessible storage location, which contains a branch target address. Thus, a first PAR, such as a register from a GPR file or other program accessible storage location, is specified in a first instruction to predict a target address based on a second PAR specified by a second instruction. The first PAR correlates with the target address specified by the second PAR. Also, a PAR may be specified in a first instruction to predict an evaluation result to branch to a target address specified in a branch instruction, where the specified PAR correlates with a taken evaluation of the branch instruction. The processor also branches based on a condition being met, such as whether a register value is equal to, not equal to, greater than, or less than another register value. Since the indirect branch instruction may change the flow of sequential addressing in a program, a pipelined processor generally stalls fetching instructions until it can be determined whether the branch will be taken or not taken and, if taken, to what target address. If a branch is determined to be not taken, the branch “falls through” and an instruction at the next sequential address following the branch is fetched. Accurately predicting whether to branch and predicting the branch target address are difficult problems.
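As one illustration of how such register-dependent targets commonly arise, consider a dispatch through a table of handlers: the table lookup places a code address in a register, and the call through that register becomes an indirect branch whose target may differ on every execution. The following sketch, with hypothetical handler names and offered only as an illustration, models this in a high level language:

```python
def handle_add(a, b):
    return a + b

def handle_sub(a, b):
    return a - b

# The table maps an operation to a handler; at machine level the
# looked-up code address is placed in a register and reached via an
# indirect branch (e.g., BX Rx), so the branch target depends on the
# register contents at each execution.
dispatch = {"add": handle_add, "sub": handle_sub}

def execute(op, a, b):
    handler = dispatch[op]   # register now holds the branch target address
    return handler(a, b)     # indirect branch: target resolved at run time
```

Because the handler address is known only when the lookup completes, a pipelined processor cannot determine the branch target from the branch instruction encoding alone.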
The processor pipeline 202 continues to fetch instructions until it can be determined in the execute stage whether the predicted address X was correctly predicted. A disadvantage with a history based approach is a general inaccuracy of the prediction for different types of code, as observed in practice using the combination of branch execution history and current instruction address. This inaccuracy of prediction is due to an inherent unpredictability of certain branch target addresses based on past observations. Mispredictions are costly, since detecting a misprediction requires waiting multiple cycles until the branch executes, and during those cycles the processor pipeline is essentially stalled or doing work which would be flushed.
To address such difficulties, an evaluation of whether to branch or not to branch may be dynamically determined by specifying a register that correlates with such an evaluation result. Also, the branch target address may be dynamically determined by specifying a register that correlates with the target address, rather than waiting for the target address encoded within the branch instruction to be resolved in the processor pipeline. While standard branch prediction techniques, such as described with regard to
As the new instruction sequence 441-447 of
It is noted that for the processor pipeline 202, the load R1 [R2] instruction 442 and the ADVCN R1 instruction 443 have been placed after instruction A 441 without causing any further delay for the case where there is a hit in the L1 data cache 210. However, if there were a miss in the L1 data cache, a stall situation would be initiated. For this case of a miss in the L1 data cache 210, the load R1 [R2] and ADVCN R1 instructions would need to have been placed, if possible, an appropriate number of miss delay cycles before the BX R0 instruction, based on the pipeline depth, to avoid causing any further delays. It is also noted that instructions C 444 and D 445 do not affect the value stored in register R1.
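The placement described above may be sketched in assembler-style pseudocode drawn from the mnemonics used in this description; the labels A, C, D, and X stand for arbitrary instructions and the predicted target address, and element numbers beyond 445 are not repeated here:

```
A                 ; instruction 441
LOAD  R1, [R2]    ; instruction 442: load the correlating value into R1
ADVCN R1          ; instruction 443: advance notice that R1 correlates
                  ; with the target address of the BX instruction below
C                 ; instruction 444: does not affect R1
D                 ; instruction 445: does not affect R1
BX    R0          ; indirect branch to the target address held in R0
X:    ...         ; instructions speculatively fetched at predicted address X
```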
Generally, placement of the ADVCN instruction in a code sequence is preferred to be N instructions before the BX instruction. In the context of a processor pipeline, N represents the number of stages between a stage that receives the indirect branch instruction and a stage that recognizes the contents of the ADVCN specified register that correlates to the branch target address, such as the instruction fetch stage 214 and the execute stage 222. In the exemplary processor pipeline 202, N is two with use of the forwarding network 226 and three without use of the forwarding network 226. For processor pipelines using a forwarding network, for example, if the ADVCN instruction is placed N equal to two instructions before the BX instruction, then the ADVCN register Rm value is determined at the end of the read register stage 220 due to the forwarding network 226. In an alternate embodiment, for a processor pipeline not using a forwarding network 226 for ADVCN instruction use, if the ADVCN instruction is placed N equal to three instructions before the BX instruction, then the ADVCN target address register Rm value is determined at the end of the execute stage 222 as the BX instruction enters the decode and ADVCN stage 216. The number of instructions N may also depend on additional factors, including stalls in the upper pipeline due to delays in the instruction fetch stage 214, the instruction issue width, which may vary up to K instructions issued in a superscalar processor, and interrupts that come between the ADVCN and the BX instructions, for example.
In order to more efficiently use the ADVCN instruction, an instruction set architecture (ISA) may recommend that the ADVCN instruction be scheduled as early as possible to minimize the effects of pipeline factors. The ISA may also recommend not placing other branches that can mispredict between the ADVCN instruction and the indirect branch being optimized. The ISA may note that any changes to the value in R1, such as could occur with the intermediate instructions in
Profiling and code analysis are tools which may be used to analyze which register to pick for Rm in an ADVCN Rm instruction. In profiling, benchmarks can be profiled so that a programmer can see which register value an indirect branch's target address correlates with, and choose that register as the operand for the ADVCN instruction. Generally, correlation means a particular register value is unique for a given target address of the indirect branch. In code analysis, a programmer can also use additional tools, such as dataflow and control flow graphs, to determine which register values are unique with respect to the target address of an indirect branch, and select at least one of those registers as an operand for a particular ADVCN instruction.
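The correlation check described above can be performed mechanically over a profile trace. The following sketch, offered only as an illustration with a hypothetical function name and trace format, reports which registers' values uniquely determine the observed target of a given indirect branch:

```python
def correlating_registers(trace):
    """Return the register names whose value uniquely determines the
    observed branch target across a profile trace.

    trace: list of (regs, target) samples for one indirect branch, where
    regs maps register names to the values held when the branch executed.
    """
    candidates = set(trace[0][0]) if trace else set()
    for name in list(candidates):
        value_to_target = {}
        for regs, target in trace:
            value = regs[name]
            # A register is rejected if the same value is ever observed
            # with two different branch targets.
            if value_to_target.setdefault(value, target) != target:
                candidates.discard(name)
                break
    return candidates
```

Any register name this analysis returns is a candidate operand for the ADVCN instruction, since each of its observed values maps to exactly one branch target in the trace.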
While
The Ptag 525 can either be a hash of the PC and branch history, or a hash of the PC and the advance notice register value. For example, where PC is the current instruction address, a first hash function (hash1) is XOR(PC, ADVCN Rm value) and a second hash function (hash2) is XOR(PC, inverse(ADVCN Rm value)), where inverse is a binary function that reverses the order of a binary input, such as inverse(10011)=11001. Additional examples of hash functions include a third hash function (hash3) that is XOR(PC, History), a fourth hash function (hash4) that is XOR(PC, inverse(History)), and a fifth hash function (hash5) that is XOR(inverse(PC), ADVCN Rm value). Another example is a sixth hash function (hash6) that is XOR(PC, ADVCN Rm(H1) ∥ inverse(ADVCN Rm(H0))), where ∥ is a concatenation of the preceding and following binary digits. Other such variations and the like are possible. Generally, a hash function may be defined that extracts uniqueness from one or more input values. It is also noted that a hash function of a history value may be different than a hash function of an ADVCN Rm value. If the ADVCN register value is not available, the Ptag 525 would be generated by use of the branch history, as described with regard to
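A few of the hash variants above can be sketched directly. The following illustrative model, in which the bit width and function names are assumptions made only for this sketch, implements the inverse bit-reversal function together with hash1 and hash2 as defined above:

```python
def inverse(value, width=8):
    """Reverse the order of the binary digits of value over the given
    bit width, e.g. inverse(0b10011, width=5) == 0b11001."""
    bits = format(value, "0{}b".format(width))
    return int(bits[::-1], 2)

def hash1(pc, rm_value):
    """hash1: XOR of the instruction address with the ADVCN Rm value."""
    return pc ^ rm_value

def hash2(pc, rm_value, width=8):
    """hash2: XOR of the instruction address with the bit-reversed
    ADVCN Rm value."""
    return pc ^ inverse(rm_value, width)
```

The remaining variants follow the same pattern, substituting the branch history for the ADVCN Rm value (hash3, hash4) or reversing the PC rather than the register value (hash5).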
The methods described in connection with the embodiments disclosed herein may be embodied in hardware and used by software from a memory module that stores non-transitory instructions executed by a processor. The software may support execution of the hardware as described herein or may be used to emulate the methods and apparatus to extend branch target hints as described herein. The software module may reside in random access memory (RAM), flash memory, read only memory (ROM), electrically programmable read only memory (EPROM), hard disk, a removable disk, tape, compact disk read only memory (CD-ROM), or any other form of storage medium known in the art. A storage medium may be coupled to the processor such that the processor can read information from, and in some cases write information to, the storage medium. The storage medium coupling to the processor may be a direct coupling integral to a circuit implementation or may utilize one or more interfaces, supporting direct accesses or data streaming using downloading techniques.
While the present invention has been disclosed in a presently preferred context, it will be recognized that the present teachings may be adapted to a variety of contexts consistent with this disclosure and the claims that follow.