The present invention relates generally to techniques for processing instructions in a processor pipeline and, more specifically, to techniques for generating an early indication of a target address for an indirect branch instruction.
Many portable products, such as cell phones, laptop computers, personal digital assistants (PDAs) or the like, require the use of a processor executing a program supporting communication and multimedia applications. The processing system for such products includes a processor, a source of instructions, a source of input operands, and storage space for storing results of execution. For example, the instructions and input operands may be stored in a hierarchical memory configuration comprising general purpose registers and multiple levels of caches, including, for example, an instruction cache, a data cache, and system memory.
In order to provide high performance in the execution of programs, a processor typically executes instructions in a pipeline optimized for the application and the process technology used to manufacture the processor. Processors also may use speculative execution to fetch and execute instructions beginning at a predicted branch target address. If the branch is mispredicted, the speculatively executed instructions must be flushed from the pipeline and the pipeline restarted at the correct path address. In many processor instruction sets, there is often an instruction that branches to a program destination address that is derived from the contents of a register. Such an instruction is generally named an indirect branch instruction. Due to the indirect branch's dependence on the contents of a register, it is usually difficult to predict the branch target address, since the register could have a different value each time the indirect branch instruction is executed. Since correcting a mispredicted indirect branch generally requires backtracking to the indirect branch instruction in order to fetch and execute the instruction on the correct branching path, the performance of the processor can thereby be reduced. Also, a misprediction indicates that the processor incorrectly speculatively fetched and began processing instructions on the wrong branching path, increasing power consumption both by processing instructions which are not used and by flushing them from the pipeline.
Among its several aspects, the present invention recognizes that it is advantageous to minimize the number of mispredictions that may occur when executing instructions to improve performance and reduce power requirements in a processor system. To such ends, an embodiment of the invention applies to a method for changing a sequential flow of a program. The method saves a target address identified by a first instruction and changes the speculative flow of execution to the target address after a second instruction is encountered, wherein the second instruction is an indirect branch instruction.
Another embodiment of the invention addresses a method for predicting an indirect branch address. A sequence of instructions is analyzed to identify a target address generated by an instruction of the sequence of instructions. A predicted next program address is prepared based on the target address before an indirect branch instruction utilizing the target address is speculatively executed.
Another aspect of the invention addresses an apparatus for indirect branch prediction. The apparatus employs a register for holding an instruction memory address that is specified by a program as a predicted indirect address of an indirect branch instruction. The apparatus also employs a next program address selector that selects the predicted indirect address from the register as the next program address for use in speculatively executing the indirect branch instruction.
A more complete understanding of the present invention, as well as further features and advantages of the invention, will be apparent from the following Detailed Description and the accompanying drawings.
The present invention will now be described more fully with reference to the accompanying drawings, in which several embodiments of the invention are shown. This invention may, however, be embodied in various forms and should not be construed as limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the invention to those skilled in the art.
Computer program code or “program code” for being operated upon or for carrying out operations according to the teachings of the invention may be initially written in a high level programming language such as C, C++, JAVA®, Smalltalk, JavaScript®, Visual Basic®, TSQL, Perl, or in various other programming languages. A program written in one of these languages is compiled to a target processor architecture by converting the high level program code into a native assembler program. Programs for the target processor architecture may also be written directly in the native assembler language. A native assembler program uses instruction mnemonic representations of machine level binary instructions. Program code or computer readable medium as used herein refers to machine language code such as object code whose format is understandable by a processor.
The processor pipeline 202 includes six major stages, an instruction fetch stage 214, a decode and predict stage 216, a dispatch stage 218, a read register stage 220, an execute stage 222, and a write back stage 224. Though a single processor pipeline 202 is shown, the processing of instructions with indirect branch target address prediction of the present invention is applicable to super scalar designs and other architectures implementing parallel pipelines. For example, a super scalar processor designed for high clock rates may have two or more parallel pipelines and each pipeline may divide the instruction fetch stage 214, the decode and predict stage 216 having predict logic circuit 217, the dispatch stage 218, the read register stage 220, the execute stage 222, and the write back stage 224 into two or more pipelined stages increasing the overall processor pipeline depth in order to support a high clock rate.
Beginning with the first stage of the processor pipeline 202, the instruction fetch stage 214, associated with a program counter (PC) 215, fetches instructions from the L1 instruction cache 208 for processing by later stages. If an instruction fetch misses in the L1 instruction cache 208, in other words, the instruction to be fetched is not in the L1 instruction cache 208, the instruction is fetched from the memory hierarchy 212 which may include multiple levels of cache, such as a level 2 (L2) cache, and main memory. Instructions may be loaded to the memory hierarchy 212 from other sources, such as a boot read only memory (ROM), a hard drive, an optical disk, or from an external interface, such as the Internet. A fetched instruction then is decoded in the decode and predict stage 216 with the predict logic circuit 217 providing additional capabilities for predicting an indirect branch target address value as described in more detail below. Associated with the predict logic circuit 217 is a branch target address register (BTAR) 219 which may be located in the control circuit 206.
The dispatch stage 218 takes one or more decoded instructions and dispatches them to one or more instruction pipelines, such as utilized, for example, in a superscalar or a multi-threaded processor. The read register stage 220 fetches data operands from the GPRF 204 or receives data operands from a forwarding network 226. The forwarding network 226 provides a fast path around the GPRF 204 to supply result operands as soon as they are available from the execution stages. Even with a forwarding network, result operands from a deep execution pipeline may take three or more execution cycles. During these cycles, an instruction in the read register stage 220 that requires result operand data from the execution pipeline, must wait until the result operand is available. The execute stage 222 executes the dispatched instruction and the write-back stage 224 writes the result to the GPRF 204 and may also send the results back to read register stage 220 through the forwarding network 226 if the result is to be used in a following instruction. Since results may be received in the write back stage 224 out of order compared to the program order, the write back stage 224 uses processor facilities to preserve the program order when writing results to the GPRF 204. A more detailed description of the processor pipeline 202 for predicting the target address of an indirect branch instruction is provided below with detailed code examples.
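As a rough illustration of the six-stage flow described above, the following Python sketch steps instructions through the pipeline one stage per cycle. This is a simplified model for illustration only; it omits stalls, forwarding, and prediction, and the stage names merely mirror the description.

```python
# Simplified model of the six-stage processor pipeline described above.
# Stage names follow the text; the stepping logic is illustrative only
# and omits stalls, forwarding, and branch prediction.
STAGES = ["fetch", "decode_predict", "dispatch",
          "read_register", "execute", "write_back"]

def run_pipeline(instructions):
    """Advance instructions through the stages, one stage per cycle."""
    pipeline = [None] * len(STAGES)   # one slot per stage
    pending = list(instructions)
    completed = []
    cycle = 0
    while pending or any(slot is not None for slot in pipeline):
        # Shift every in-flight instruction one stage down the pipeline
        # and fetch the next pending instruction into the first stage.
        pipeline = [pending.pop(0) if pending else None] + pipeline[:-1]
        cycle += 1
        # An instruction reaching write-back retires this cycle.
        if pipeline[-1] is not None:
            completed.append(pipeline[-1])
            pipeline[-1] = None
    return cycle, completed

cycles, retired = run_pipeline(["ld r0", "bx r0", "add r1"])
# Three instructions drain a six-stage pipeline in 3 + 6 - 1 = 8 cycles.
```

The model shows why pipeline depth matters for misprediction cost: every additional stage adds a cycle between fetching a speculative instruction and discovering whether it should have been fetched.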
The processor complex 200 may be configured to execute instructions under control of a program stored on a computer readable storage medium. For example, a computer readable storage medium may be either directly associated locally with the processor complex 200, such as may be available from the L1 instruction cache 208 and the memory hierarchy 212, or accessible through, for example, an input/output interface (not shown). The processor complex 200 also accesses data from the L1 data cache 210 and the memory hierarchy 212 in the execution of a program. The computer readable storage medium may include random access memory (RAM), dynamic random access memory (DRAM), synchronous dynamic random access memory (SDRAM), flash memory, read only memory (ROM), programmable read only memory (PROM), erasable programmable read only memory (EPROM), electrically erasable programmable read only memory (EEPROM), compact disk (CD), digital video disk (DVD), other types of removable disks, or any other suitable storage medium.
The teachings of the invention are applicable to a variety of instruction formats and architectural specifications.
General forms of indirect branch type instructions may be advantageously employed and executed in the processor pipeline 202, for example, branch on register Rx (BX), add PC, move Rx PC, and the like. For purposes of describing the present invention, the BX Rx form of an indirect branch instruction is used in the code sequence examples described further below.
It is noted that other forms of branch instructions are generally provided in an ISA, such as a branch instruction having an instruction specified branch target address (BTA), a branch instruction having a BTA calculated as a sum of an instruction specified offset address and a base address register, and the like. In support of such branch instructions, the processor pipeline 202 may utilize branch history prediction techniques that are based on tracking, for example, conditional execution status of prior branch instruction executions and storing such execution status for use in predicting future execution of these instructions. The processor pipeline 202 may support such branch history prediction techniques and additionally support the use of the BHINT instruction as an aid in predicting indirect branches. For example, the processor pipeline 202 may use the branch history prediction techniques until a BHINT instruction is encountered, which then overrides the branch history prediction techniques using the BHINT facilities as described herein.
In other embodiments of the present invention, the processor pipeline 202 may also be set up to monitor the accuracy of using the BHINT instruction and, when the BHINT-identified target address was incorrect one or more times, to ignore the BHINT instruction for subsequent encounters of the same indirect branch. It is also noted that for a particular implementation of a processor supporting an ISA having a BHINT instruction, the processor may treat an encountered BHINT instruction as a no operation (NOP) instruction or flag the detected BHINT instruction as undefined. Further, a BHINT instruction may be treated as a NOP in a processor pipeline having a branch history prediction circuit with sufficient hardware resources to track branches encountered during execution of a section of code, and enable the BHINT instruction as described below for sections of code which exceed the hardware resources available to the branch history prediction circuit. In addition, advantageous automatic indirect-target inference methods are presented for predicting the indirect branch target address as described below.
While processing instructions, stall situations may be encountered, such as that which could occur with the execution of the load R0 instruction 405. The execution of the load R0 instruction 405 may return the value from the L1 data cache 210 without delay if there is a hit in the L1 data cache. However, the execution of a load R0 instruction 405 may take a significant number of cycles if there is a miss in the L1 data cache 210. A load instruction may use a register from the GPRF 204 to supply a base address and then add an immediate value to the base address in the execute stage 222 to generate an effective address. The effective address is sent over data path 232 to the L1 data cache 210. With a miss in the L1 data cache 210, the data must be fetched from the memory hierarchy 212 which may include, for example, an L2 cache and main memory. Further, the data may miss in the L2 cache, leading to a fetch of the data from the main memory. For example, a miss in the L1 data cache 210, a miss in an L2 cache in the memory hierarchy 212, and an access to main memory may require hundreds of CPU cycles to fetch the data. During the cycles it takes to fetch the data after an L1 data cache miss, the BX R0 instruction 406 is stalled in the processor pipeline 202 until the in-flight operand is available. The stall may be considered to occur in the read register stage 220 or the beginning of the execute stage 222.
It is noted that in processors having multiple instruction pipelines, the stall of the load R0 instruction 405 may not stall the speculative operations occurring in any other pipelines. Due to the length of a stall on a miss in the L1 data cache 210, a significant number of instructions may be speculatively fetched, which, if there was an incorrect prediction of the indirect branch target address, may significantly affect performance and power use. A stall may be created in a processor pipeline by use of a hold circuit which is part of the control circuit 206.
Upon resolution of the miss, the load data is sent over path 240 to a write back operation as part of the write back stage 224. The operand is then written to the GPRF 204 and may also be sent to the forwarding network 226 described above. The value for R0 may now be compared to the predicted address X to determine whether the speculatively fetched instructions need to be flushed or not. Since the register used to store the branch target address could have a different value each time the indirect branch instruction is executed, there is a high probability that the speculatively fetched instructions would be flushed using current prediction approaches.
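The resolution step just described, comparing the architectural register value against the predicted address, can be sketched as follows. The function and value names are hypothetical illustrations, not part of the described hardware.

```python
# Illustrative sketch of resolving a predicted indirect branch: once the
# load miss resolves and R0 is written back, the actual value is compared
# with the predicted target address. Names are hypothetical.

def resolve_indirect_branch(predicted_bta, resolved_r0):
    """Return the pipeline action after the branch register resolves."""
    if resolved_r0 == predicted_bta:
        return "continue"   # prediction correct: keep speculative work
    return "flush"          # misprediction: flush and restart at resolved_r0
```

In the misprediction case, every speculatively fetched instruction represents wasted fetch and execution energy in addition to the lost cycles.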
As the new instruction sequence 421-427 of
Generally, the BHINT instruction is placed N cycles before the BX instruction is decoded, where N is the number of stages between an instruction fetch stage and an execute stage, such as the instruction fetch stage 214 and the execute stage 222. In the exemplary processor pipeline 202, N is two with use of the forwarding network 226 and three without use of the forwarding network 226. For processor pipelines using a forwarding network, for example, if the BHINT instruction precedes the BX instruction by N equal to two instructions, then the BHINT target address register Rm value is determined at the end of the read register stage 220 due to the forwarding network 226. In an alternate embodiment for a processor pipeline not using a forwarding network 226, for example, if the BHINT instruction precedes the BX instruction by N equal to three instructions, then the BHINT target address register Rm value is determined at the end of the execute stage 222 as the BX instruction enters the decode and predict stage 216. The number of instructions N may also depend on additional factors, including stalls in the upper pipeline, such as delays in the instruction fetch stage 214, instruction issue width, which may vary up to K instructions issued in a super scalar processor, and interrupts that come between the BHINT and the BX instructions, for example. In general, an ISA may recommend that the BHINT instruction be scheduled as early as possible to minimize the effect of such factors.
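The scheduling rule above can be summarized in a small sketch; the function name and parameters are illustrative assumptions, not part of any ISA.

```python
# Sketch of the BHINT scheduling rule described above: the hint should be
# placed at least N instructions before its BX, where N depends on how
# many stages separate instruction fetch from the stage at which the
# hinted register value becomes available. Names are illustrative only.

def min_bhint_distance(stages_between_fetch_and_execute, has_forwarding):
    """Minimum instruction distance between a BHINT and its BX."""
    if has_forwarding:
        # With a forwarding network the Rm value is available at the end
        # of the read register stage, one stage before execute.
        return stages_between_fetch_and_execute - 1
    return stages_between_fetch_and_execute

# For the exemplary pipeline, three stages (decode/predict, dispatch,
# read register) separate fetch from execute, so N is 2 with forwarding
# and 3 without, matching the description above.
```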
By tracking the execution pattern of the code sequence 600, an automatic indirect-target inference method circuit may predict with reasonable accuracy whether the latest value of R0 at the time the BX R0 instruction 607 enters the decode and predict stage 216 should be used as the predicted BTA. In one embodiment, the last value written to R0 would be used as the value for the BX R0 instruction when it enters the decode and predict stage 216. This embodiment is based on an assessment that for the code sequence associated with this BX R0 instruction, the last value written to R0 could be predicted to be the correct value a high percentage of the time.
The first IBP process 700 begins with a fetched instruction stream 702. At decision block 704, a determination is made whether an instruction is received that writes any register Rm that may be a target register of an indirect branch instruction. For example, in a processor having a 14 entry register file with registers R0-R13, instructions that write to any of the registers R0-R13 would be tracked as possible target registers of an indirect branch instruction. For techniques that monitor multiple passes of sections of code having an indirect branch instruction, a specific Rm may be determined by identifying the indirect branch instruction on the first pass. If the instruction received does not affect an Rm, the first IBP process 700 proceeds to decision block 706. At decision block 706, a determination is made whether the instruction received is an indirect branch instruction, such as a BX Rm instruction. If the instruction received is not an indirect branch instruction, the first IBP process 700 proceeds to decision block 704 to evaluate the next received instruction.
Returning to decision block 704, if the instruction received does affect an Rm, the first IBP process 700 proceeds to block 708. At block 708, the address of the instruction that affects the Rm is loaded at the Rm address of the lastwriter table. At block 710, the BTARU is checked for a valid bit at the instruction address. At decision block 712, a determination is made whether a valid bit was found at an instruction address entry in the BTARU. If a valid bit was not found, such as may occur on a first pass through process blocks 704, 708, and 710, the first IBP process returns to decision block 704 to evaluate the next received instruction.
Returning to decision block 706, if an indirect branch instruction, such as a BX Rm instruction, is received the first IBP process 700 proceeds to block 714. At block 714, the lastwriter table is checked for a valid instruction address at address Rm. At decision block 716, a determination is made whether a valid instruction address is found at the Rm address. If a valid instruction address is not found, the first IBP process 700 proceeds to block 718. At block 718, the BTARU bit entry at the instruction address is set to invalid and the first IBP process 700 returns to decision block 704 to evaluate the next received instruction.
Returning to decision block 716, if a valid instruction address is found, the first IBP process 700 proceeds to block 720. If there is a pending update, the first IBP process 700 may stall until the pending update is resolved. At block 720, the BTARU bit entry at the instruction address is set to valid and the first IBP process 700 proceeds to decision block 722. At decision block 722, a determination is made whether the branch target address register (BTAR) has a valid address. If the BTAR has a valid address the first IBP process 700 proceeds to block 724. At block 724, indirect branch instruction Rm is predicted using the stored BTAR value and the first IBP process 700 returns to decision block 704 to evaluate the next received instruction. Returning to decision block 722, if the BTAR is determined to not have a valid address, the first IBP process 700 returns to decision block 704 to evaluate the next received instruction.
Returning to decision block 704, if the instruction received does affect the Rm of an indirect branch instruction, such as may occur on a second pass through the first IBP process 700, the first IBP process 700 proceeds to block 708. At block 708, the address of the instruction that affects the Rm is loaded at the Rm address of the lastwriter table. At block 710, the BTARU is checked for a valid bit at the instruction address. At decision block 712, a determination is made whether a valid bit was found at an instruction address entry in the BTARU. If a valid bit was found, such as may occur on the second pass through process blocks 704, 708, and 710, the first IBP process 700 proceeds to block 726. At block 726, the branch target address register (BTAR), such as the BTAR 219, is loaded with the target address value written to the register Rm, and the first IBP process 700 returns to decision block 704 to evaluate the next received instruction.
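The first IBP process walked through above can be sketched in software as a minimal model. The dictionary-based lastwriter and BTARU structures, class name, and method names are illustrative assumptions; hardware concerns such as table sizing and eviction are omitted.

```python
# Minimal sketch of the first IBP process: a lastwriter table maps each
# register Rm to the address of the last instruction that wrote it, and
# the BTARU maps instruction addresses to a valid bit. When a writer that
# previously fed an indirect branch executes again, its result is loaded
# into the BTAR as the predicted target. All names are illustrative.

class FirstIBP:
    def __init__(self):
        self.lastwriter = {}   # Rm -> address of last instruction writing Rm
        self.btaru = {}        # instruction address -> valid bit
        self.btar = None       # branch target address register (BTAR)

    def on_register_write(self, rm, instr_addr, value):
        """Blocks 708-712: record the writer; if the BTARU marks this
        writer address valid, capture the written value in the BTAR."""
        self.lastwriter[rm] = instr_addr
        if self.btaru.get(instr_addr):     # valid bit found (block 726)
            self.btar = value

    def on_indirect_branch(self, rm):
        """Blocks 714-724: on BX Rm, tag the last writer in the BTARU and
        predict from the BTAR when it holds a valid address."""
        writer = self.lastwriter.get(rm)
        if writer is None:                 # block 718: no valid writer
            return None
        self.btaru[writer] = True          # block 720: set writer valid
        return self.btar                   # block 724: predict (may be None)

ibp = FirstIBP()
ibp.on_register_write("R0", 0x100, 0x4000)  # first pass: no BTARU entry yet
ibp.on_indirect_branch("R0")                # tags writer 0x100 in the BTARU
ibp.on_register_write("R0", 0x100, 0x4800)  # second pass: BTAR is loaded
```

The two-pass behavior is visible in the usage lines: no prediction is available on the first encounter, while on the second pass the writer's result is captured for prediction.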
Returning to decision block 856, if the instruction received does affect an Rm register, the second IBP process 850 proceeds to block 858. At block 858, the TTT 800 is checked for valid entries to see if the received instruction will actually change a register that a BX instruction will need. At decision block 860, a determination is made whether any matching Rm's have been found in the TTT 800. If at least one matching Rm has not been found in the TTT 800, the second IBP process 850 returns to decision block 854 to evaluate the next received instruction. However, if at least one matching Rm was found in the TTT 800, the second IBP process 850 proceeds to block 862. At block 862, the up/down counter associated with the entry is incremented. The up/down counter indicates how many instructions are in flight that will change that particular Rm. It is noted that when an Rm changing instruction executes, the entry's up/down counter value 808 is decremented, the data valid bit 807 is set, and the Rm data result of execution is written to the Rm data field 809. If register changing instructions complete out of order, then the latest register changing instruction cancels an older instruction's write to the Rm data field, thereby avoiding a write after write hazard. For processor instruction set architectures (ISAs) that have non-branch conditional instructions, a non-branch conditional instruction may have a condition that evaluates to a no-execute state. Thus, for the purposes of evaluating an entry's up/down counter value 808, the target register Rm of a non-branch conditional instruction that evaluates to no-execute may be read as a source operand. The Rm value that is read has the latest target register Rm value. That way, even if the non-branch conditional instruction having an Rm with a matched valid tag is not executed, the Rm data field 809 may be updated with the latest value and the up/down counter value 808 is accordingly decremented.
The second IBP process 850 then returns to decision block 854 to evaluate the next received instruction.
Returning to decision block 854, if the received instruction is a BX Rm instruction, the second IBP process 850 proceeds to block 866. At block 866, the TTT 800 is checked for valid entries. At decision block 868, a determination is made whether a matching tag has been found in the TTT 800. If a matching tag was not found the second IBP process 850 proceeds to block 870. At block 870, a new entry is established in the TTT 800, which includes setting the new entry valid bit 804 to a valid indicating value, placing the BX's Rm in the Rm field 806, clearing the data valid bit 807, and clearing the up/down counter associated with the new entry. The second IBP process 850 then returns to decision block 854 to evaluate the next received instruction.
Returning to decision block 868, if a matching tag is found, the second IBP process 850 proceeds to decision block 872. At decision block 872, a determination is made whether the entry's up/down counter is zero. If the entry's up/down counter is not zero, there are Rm changing instructions still in flight and the second IBP process 850 proceeds to block 874. At block 874, the BX instruction is stalled in the processor pipeline until the entry's up/down counter has been decremented to zero. At block 876, the TTT entry's Rm data, which is the last change to the Rm data, is used as the target for the indirect branch BX instruction. The second IBP process 850 then returns to decision block 854 to evaluate the next received instruction.
Returning to decision block 872, if the entry's up/down counter is equal to zero, the second IBP process 850 proceeds to decision block 878. At decision block 878, a determination is made whether the entry's data valid bit is equal to a one. If the entry's data valid bit is equal to a one, the second IBP process 850 proceeds to block 876. At block 876, the TTT entry's Rm data is used as the target for the indirect branch BX instruction. The second IBP process 850 then returns to decision block 854 to evaluate the next received instruction.
Returning to decision block 878, if the entry's data valid bit is not equal to a one, the second IBP process 850 returns to decision block 854 to evaluate the next received instruction. In a first alternative, the TTT entry's Rm data may be used as the target for the indirect branch BX instruction, since the BX Rm tag matches a valid entry and the up/down counter value is zero. In a second alternative, the processor pipeline 202 is directed to fetch instructions according to a not taken path to avoid fetching down an incorrect path. Since the data in the Rm data field is not valid, there is no guarantee the Rm data even points to executable memory or memory that has been authorized for access. Fetching down the sequential path, the not taken path, is most likely to access memory permitted to be accessed. In an advantageous third alternative, the processor pipeline 202 is directed to stop fetching after the BX instruction in order to save power and to wait for a BX correction sequence to reestablish the fetch operations.
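The TTT bookkeeping walked through above can be sketched as a simplified software model. The field comments mirror the reference numerals in the text, but the class and method names are assumptions for illustration.

```python
# Simplified sketch of the second IBP process: each TTT entry tracks a
# register Rm with a valid bit, a data valid bit, an up/down counter of
# in-flight writers, and the latest Rm data. Class and method names are
# illustrative assumptions.

class TTTEntry:
    def __init__(self, rm):
        self.valid = True        # entry valid bit 804
        self.rm = rm             # Rm field 806
        self.data_valid = False  # data valid bit 807
        self.updown = 0          # up/down counter value 808
        self.data = None         # Rm data field 809

class SecondIBP:
    def __init__(self):
        self.ttt = {}            # Rm -> TTTEntry

    def on_writer_issued(self, rm):
        """Block 862: an in-flight matching writer increments the counter."""
        entry = self.ttt.get(rm)
        if entry:
            entry.updown += 1

    def on_writer_executed(self, rm, value):
        """A writer completes: decrement the counter, latch latest data."""
        entry = self.ttt.get(rm)
        if entry:
            entry.updown -= 1
            entry.data = value
            entry.data_valid = True

    def on_bx(self, rm):
        """Blocks 866-878: allocate on a miss; stall while writers are in
        flight; otherwise predict from the latched Rm data if valid."""
        entry = self.ttt.get(rm)
        if entry is None:
            self.ttt[rm] = TTTEntry(rm)  # block 870: new entry
            return None
        if entry.updown != 0:            # blocks 872-874: writers in flight
            return "stall"
        if entry.data_valid:             # blocks 876-878
            return entry.data
        return None
```

A typical sequence is: the first BX Rm allocates an entry with no prediction, an in-flight writer raises the counter and stalls a subsequent BX, and once the writer executes the latched Rm data is used as the predicted target.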
The decode circuit 902 decodes incoming instructions from the instruction fetch stage 214.
In the prediction circuit 906, the predict BTA circuit 914 uses a TTT entry, such as the TTT entry 802, to predict the branch target address.
In the correction circuit 908, the track 2 circuit 920 monitors the execute stage 222 of the processor pipeline 202 for execution status of the BX R0 instruction 607. If the BTA was correctly predicted, the speculatively fetched instructions are allowed to continue in the processor pipeline. If the BTA was not predicted correctly, the speculatively fetched instructions are flushed from the processor pipeline and the pipeline is redirected back to a correct instruction sequence. The detection circuit 904 is also informed of the incorrect prediction status and in response to this status may be programmed to stop identifying this particular indirect branch instruction for prediction. In addition, the prediction circuit 906 is informed of the incorrect prediction status and in response to this status may be programmed to only allow prediction for particular entries of the TTT 800.
In the code example 1000, the conditional move R0, targetB instruction 1006 may affect the BTA register R0 depending on whether it executes or not. Two possible situations are considered as shown in the following table:

Line  Conditional move R0, targetB 1006  Resulting BTA in R0
1     does not execute                   targetA
2     executes                           targetB
In the code sequence 1000, the last instruction that is able to affect the indirect BTA is the conditional move R0, targetB instruction 1006, and if it executes (line 2 in the above table), it does not matter whether the move R0, targetA instruction 1002 executes or not. A software code profiling tool, such as a profiling compiler, may insert a BHINT R0 instruction 1052 directly after the move R0, targetA instruction 1002, as shown in the code sequence 1050.
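The two situations above reduce to a simple data flow, sketched here with hypothetical target address values standing in for targetA and targetB.

```python
# Sketch of code sequence 1000's effect on the branch target register:
# the conditional move overwrites R0 only when it executes, so the final
# BTA depends on that outcome. Target address values are hypothetical.

def resolve_bta(cond_move_executes, target_a=0x1000, target_b=0x2000):
    r0 = target_a            # move R0, targetA
    if cond_move_executes:
        r0 = target_b        # conditional move R0, targetB
    return r0                # value consumed by BX R0
```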
While the invention is disclosed in the context of illustrative embodiments for use in processor systems, it will be recognized that a wide variety of implementations may be employed by persons of ordinary skill in the art consistent with the above discussion and the claims which follow below. For example, both a BHINT instruction approach and an automatic indirect-target inference method, such as the second indirect BTA prediction circuit 900, for predicting an indirect branch target address may be used together. The BHINT instruction may be inserted in a code sequence, by a programmer or a software tool, such as a profiling compiler, where high confidence of indirect branch target address prediction may be obtained using this software approach. The automatic indirect-target inference method circuit is overridden upon detection of a BHINT instruction for the code sequence having the BHINT instruction.