Disclosed embodiments relate to optimizing short forward branches. More particularly, exemplary embodiments are directed to optimizing hard-to-predict short forward branches.
High-performance microprocessors may be deeply pipelined, and execute several instructions speculatively by predicting the resolution of branch instructions. However, if the branch predictions are incorrect, cycles are lost in flushing speculative instructions, and fetching and executing correct instructions. This lowers performance and hence, mitigating the branch misprediction penalty is of great importance in high-performance microprocessors. For example, if the pipeline throughput is one instruction per cycle, and there is a ten-cycle branch misprediction penalty, then one misprediction per 1000 instructions is roughly a 1% loss in performance.
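The arithmetic behind that estimate can be checked with a minimal sketch. The function name and parameters below are illustrative and do not come from the disclosure; the model assumes an otherwise ideal pipeline in which the only lost cycles come from mispredictions.

```python
def misprediction_overhead(penalty_cycles: int,
                           instructions_per_misprediction: int,
                           base_cycles_per_instruction: float = 1.0) -> float:
    """Extra cycles lost to mispredictions, as a fraction of ideal cycles.

    Assumes an otherwise ideal pipeline: every instruction costs
    base_cycles_per_instruction, and each misprediction adds penalty_cycles.
    """
    ideal_cycles = instructions_per_misprediction * base_cycles_per_instruction
    return penalty_cycles / ideal_cycles


# Ten-cycle penalty, one misprediction per 1000 instructions, one instruction per cycle.
print(f"{misprediction_overhead(10, 1000):.1%}")  # 1.0%
```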
One approach to minimizing branch misprediction penalties attempts simply to reduce the number of branch instructions. Since branch misprediction can only occur on a branch instruction, a code sequence with no branch instructions can never be mispredicted.
A current method for reducing the number of branch instructions in a code sequence includes the use of predicated instructions. A predicated instruction is an instruction that performs a function if a condition that is specified in the predicated instruction is satisfied. If the condition is not satisfied, the instruction is treated as a NOP.
Predicated instructions can beneficially replace a code sequence that includes a condition setting instruction followed by a conditional branch instruction and a short code sequence that is executed depending upon the status of the condition. In such a sequence, the conditional branch is used to branch around the relatively short code sequence depending upon the state of the condition. In the predicated instruction implementation of such a code sequence, the conditional branch statement is eliminated and each of the instructions in the short code sequence is replaced with a predicated instruction.
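The equivalence between the two forms can be illustrated with a small software model. This is only a sketch: the register names, the two-instruction body, and the helper functions are hypothetical and are not taken from any particular instruction set.

```python
from typing import Dict

Regs = Dict[str, int]


def add_one_to_r1(regs: Regs) -> None:
    regs["r1"] += 1


def copy_r1_to_r2(regs: Regs) -> None:
    regs["r2"] = regs["r1"]


def run_with_branch(initial: Regs) -> Regs:
    """Condition setter, then a conditional branch around a short block."""
    regs = dict(initial)
    zero = regs["r0"] == 0      # condition-setting instruction: compare r0 with 0
    if not zero:                # the branch skips the block when r0 == 0
        add_one_to_r1(regs)
        copy_r1_to_r2(regs)
    return regs


def run_predicated(initial: Regs) -> Regs:
    """Same block with the branch removed and each instruction predicated on NE."""
    regs = dict(initial)
    zero = regs["r0"] == 0      # the condition setter is kept
    body = [("NE", add_one_to_r1), ("NE", copy_r1_to_r2)]
    for cond, action in body:
        condition_met = (not zero) if cond == "NE" else zero
        if condition_met:
            action(regs)
        # otherwise the instruction is treated as a NOP
    return regs


for r0 in (0, 7):
    start = {"r0": r0, "r1": 1, "r2": 0}
    print(r0, run_with_branch(start) == run_predicated(start))
```

Both versions leave the registers in the same state for either outcome of the condition, but the predicated version contains no branch that could be mispredicted.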
There are current hardware solutions which try to mitigate the negative effects of branch mispredictions. Some solutions have looked at identifying hard-to-predict branches via confidence-based mechanisms and stalling the pipeline fetch on encountering such branches to save power. Sophisticated branch predictors have been designed to lower mispredictions, but they are complex to implement. Moreover, some types of branches are hard to predict, and therefore, branch prediction does not work well.
Exemplary embodiments of the invention are directed to systems and methods for optimizing hard-to-predict short forward branches.
For example, an exemplary embodiment is directed to a method of optimizing a forward conditional branch, the method comprising: detecting a forward conditional branch with at least one instruction between the forward conditional branch and forward conditional branch target; and determining whether an instruction of the at least one instruction includes at least one of a conditional branch or a condition-code setter; if the instruction does not include the at least one of a conditional branch or a condition-code setter, dynamically assigning an inverted condition to the at least one instruction to optimize a code path, and determining whether there is a next instruction between the forward conditional branch and forward conditional branch target; if there is a next instruction, moving to the next instruction for analysis; if there is not a next instruction, executing the optimized code path; and if the instruction includes either a conditional branch or a condition-code setter, discarding dynamically assigned inverted conditions on previously optimized instructions and executing the detected forward conditional branch.
Another exemplary embodiment is directed to an apparatus comprising: a branch detection circuit configured to detect a forward conditional branch with at least one instruction between the forward conditional branch and forward conditional branch target; an optimization determination circuit configured to determine if a first of the at least one instruction includes at least one of a conditional branch or a condition-code setter: a state machine configured to dynamically assign an inverted condition to the at least one instruction to optimize a code path if the instruction does not include the at least one of a conditional branch or a condition-code setter, and an instruction detector circuit configured to determine whether there is a next instruction between the forward conditional branch and forward conditional branch target; an instruction retrieval circuit configured to move to the next instruction for analysis if there is a next instruction, an execution circuit configured to execute the optimized code path if there is not a next instruction, an optimization discard circuit configured to discard dynamically assigned inverted conditions on previously optimized instructions and execute the detected forward conditional branch if the instruction includes the at least one of a conditional branch or a condition-code setter.
Yet another exemplary embodiment is directed to a processing system comprising: means for detecting a forward conditional branch with at least one instruction between the forward conditional branch and forward conditional branch target; means for determining whether a first of the at least one instruction includes at least one of a conditional branch or a condition-code setter; means for dynamically assigning an inverted condition to the at least one instruction to optimize a code path if the instruction does not include the at least one of a conditional branch or a condition-code setter; means for determining whether there is a next instruction between the forward conditional branch and forward conditional branch target; means for moving to the next instruction for analysis if there is a next instruction; means for executing the optimized code path if there is not a next instruction; and means for discarding dynamically assigned inverted conditions on previously optimized instructions and executing the detected forward conditional branch if the instruction includes the at least one of a conditional branch or a condition-code setter.
Still another exemplary embodiment is directed to a non-transitory computer-readable storage medium comprising code, which, when executed by a processor, causes the processor to perform operations for switching between execution modes of the processor, the non-transitory computer-readable storage medium comprising: code for detecting a forward conditional branch with at least one instruction between the forward conditional branch and forward conditional branch target; code for determining whether a first of the at least one instruction includes at least one of a conditional branch or a condition-code setter: code for dynamically assigning an inverted condition to the at least one instruction to optimize a code path if the instruction does not include the at least one of a conditional branch or a condition-code setter, and code for determining whether there is a next instruction between the forward conditional branch and forward conditional branch target; code for moving to the next instruction for analysis if there is a next instruction, code for executing the optimized code path if there is not a next instruction, code for discarding dynamically assigned inverted conditions on previously optimized instructions and executing the detected forward conditional branch if the instruction includes the at least one of a conditional branch or a condition-code setter.
Another exemplary embodiment is directed to a method comprising: detecting a forward conditional branch with at least one instruction between the forward conditional branch and forward conditional branch target; retrieving an instruction; determining eligibility of the instruction for transformation or elimination; if the instruction is eligible for transformation or elimination, dynamically assigning an inverted condition to the instruction and transmitting the modified instruction to an execution core; if the instruction is not eligible for transformation or elimination, determining whether there is a next instruction between the forward conditional branch and forward conditional branch target; and if there is a next instruction, retrieving the next instruction with predecode logic.
An additional exemplary embodiment is directed to an apparatus comprising: a branch detection circuit configured to detect a forward conditional branch with at least one instruction between the forward conditional branch and forward conditional branch target; an instruction retrieval circuit configured to retrieve an instruction; a predecode logic circuit configured to determine eligibility of the instruction for transformation or elimination; a state machine configured to dynamically assign an inverted condition to the instruction if the instruction is eligible for transformation or elimination; a transmitter configured to transmit the modified instruction to an execution core; and an instruction detector circuit configured to determine whether there is a next instruction between the forward conditional branch and forward conditional branch target if the instruction is not eligible for transformation or elimination; the instruction retrieval circuit being further configured to retrieve the next instruction with predecode logic if there is a next instruction.
Advantages of the present invention may include an elimination of a need for predicting hard-to-predict forward conditional branches with short offsets by leveraging predication facilities available in an ISA (e.g., condition codes in ARM). In some embodiments, the dynamic predication can reduce the effect of the forward conditional branch and remove any potential pipeline flushes from branch misprediction. In some embodiments, the method can leverage the already available hardware mechanisms that implement predication in an ISA.
The accompanying drawings are presented to aid in the description of embodiments of the invention and are provided solely for illustration of the embodiments and not limitation thereof.
Aspects of the invention are disclosed in the following description and related drawings directed to specific embodiments of the invention. Alternate embodiments may be devised without departing from the scope of the invention. Additionally, well-known elements of the invention will not be described in detail or will be omitted so as not to obscure the relevant details of the invention.
The word “exemplary” is used herein to mean “serving as an example, instance, or illustration.” Any embodiment described herein as “exemplary” is not necessarily to be construed as preferred or advantageous over other embodiments. Likewise, the term “embodiments of the invention” does not require that all embodiments of the invention include the discussed feature, advantage or mode of operation.
The terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of embodiments of the invention. As used herein, the singular forms “a”, “an” and “the” are intended to include the plural forms as well, unless the context clearly indicates otherwise. It will be further understood that the terms “comprises”, “comprising”, “includes” and/or “including”, when used herein, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof.
Further, many embodiments are described in terms of sequences of actions to be performed by, for example, elements of a computing device. It will be recognized that various actions described herein can be performed by specific circuits (e.g., application specific integrated circuits (ASICs)), by program instructions being executed by one or more processors, or by a combination of both. Additionally, these sequences of actions described herein can be considered to be embodied entirely within any form of computer readable storage medium having stored therein a corresponding set of computer instructions that upon execution would cause an associated processor to perform the functionality described herein. Thus, the various aspects of the invention may be embodied in a number of different forms, all of which have been contemplated to be within the scope of the claimed subject matter. In addition, for each of the embodiments described herein, the corresponding form of any such embodiments may be described herein as, for example, “logic configured to” perform the described action.
With reference now to
In a non-limiting exemplary embodiment, instructions in memory 104A can allow the processor 102A to detect forward conditional branches (e.g., with a condition EQ) with short forward targets, wherein a forward target is defined as target address > instruction address. In some embodiments, a configuration register can be used to configure the short forward targets. A state machine 110A can then dynamically assign an inverted condition (e.g., using predecode logic to assign NE, or not equal, where the branch condition is EQ, or equal) to each of the at least one instruction fetched following the branch until reaching the branch target address. This dynamic predication can eliminate the effect of the forward conditional branch and remove at least some of the potential pipeline flushes arising out of branch misprediction. If one of the at least one instruction in the hard-to-predict short forward branch is itself a conditional branch or a condition-code setter, the processor 102A may not attempt to optimize the hard-to-predict short forward branch.
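The detection test and the condition inversion described above can be sketched in software as follows. The fixed 4-byte instruction size, the `max_skipped` parameter (standing in for the configuration register value), and all function names are assumptions made for illustration; the inverse pairs shown are a subset of ARM-style condition codes.

```python
# Assumption: fixed-size 4-byte instructions.
INSN_BYTES = 4

# Inverse pairs for a subset of ARM-style condition codes.
INVERSE_CONDITION = {
    "EQ": "NE", "NE": "EQ",
    "GE": "LT", "LT": "GE",
    "GT": "LE", "LE": "GT",
}


def skipped_instruction_count(branch_addr: int, target_addr: int) -> int:
    """Number of instructions between the branch and its target."""
    return (target_addr - branch_addr) // INSN_BYTES - 1


def is_short_forward_branch(branch_addr: int, target_addr: int,
                            max_skipped: int) -> bool:
    """Forward: target address > instruction address.

    Short: the branch skips between 1 and max_skipped instructions, where
    max_skipped is a hypothetical stand-in for the configuration register.
    """
    if target_addr <= branch_addr:
        return False
    return 1 <= skipped_instruction_count(branch_addr, target_addr) <= max_skipped


# A BEQ at 0x1000 targeting 0x100C skips the instructions at 0x1004 and 0x1008.
print(is_short_forward_branch(0x1000, 0x100C, max_skipped=4))  # True
print(INVERSE_CONDITION["EQ"])  # NE
```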
More specifically, a branch detection circuit 106A can detect a forward conditional branch with at least one instruction between the forward conditional branch and forward conditional branch target. An optimization determination circuit 108A can determine if a first of the at least one instruction includes at least one of a conditional branch or a condition-code setter.
If the instruction does not include the at least one of a conditional branch or a condition-code setter, a state machine 110A can dynamically assign an inverted condition to the at least one instruction to optimize a code path. An instruction detector circuit 112A can determine whether there is a next instruction between the forward conditional branch and forward conditional branch target. If there is a next instruction, an instruction retrieval circuit 114A can move to the next instruction for analysis. If there is not a next instruction, an execution circuit 116A can execute the optimized code path (e.g., the optimized branch).
If the instruction includes the at least one of a conditional branch or a condition-code setter, an optimization discard circuit 118A can discard dynamically assigned inverted conditions on previously optimized instructions and execute the detected forward conditional branch.
With reference now to
If the instruction is eligible for transformation or elimination, a state machine 110B can dynamically assign an inverted condition to the instruction. A transmitter 120B can transmit the modified instruction to an execution core.
If the instruction is not eligible for transformation or elimination, an instruction detector circuit 112B can determine whether there is a next instruction between the forward conditional branch and forward conditional branch target. If there is a next instruction, the instruction retrieval circuit 114B can retrieve the next instruction with predecode logic.
With reference to
Similar to
It will be appreciated that embodiments include various methods for performing the processes, functions and/or algorithms disclosed herein. For example, as illustrated in
if the instruction being analyzed does not include the at least one of a conditional branch or a condition-code setter, dynamically assigning an inverted condition to the instruction being analyzed (e.g., dynamically assigning one of the at least one instruction into a NOP; for BNE, applying EQ to following instructions)—Block 306. If there is a next instruction between the forward conditional branch and forward conditional branch target (e.g. a second of at least two sequential instructions), moving to the next instruction for optimization until the last instruction has been analyzed—Block 308. If there is no next instruction, executing the optimized code path—Block 310.
Returning to block 304, if the instruction being analyzed is either a conditional branch or a condition-code setting instruction, the method proceeds to Block 312. The method further comprises discarding dynamically assigned inverted conditions on previously analyzed instructions—Block 312; and executing the detected forward conditional branch—Block 314.
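The flow through Blocks 304 to 314 can be sketched as a short software model. This is an illustration only: the `Insn` record, its field names, and the return labels are hypothetical, and the embodiments describe hardware circuits rather than this code.

```python
from dataclasses import dataclass, replace
from typing import List, Optional, Tuple

INVERSE_CONDITION = {"EQ": "NE", "NE": "EQ"}


@dataclass(frozen=True)
class Insn:
    text: str
    is_conditional_branch: bool = False
    sets_condition_codes: bool = False
    cond: Optional[str] = None   # dynamically assigned condition, if any


def optimize_forward_branch(branch_cond: str,
                            body: List[Insn]) -> Tuple[str, List[Insn]]:
    """Analyze the instructions between a forward branch and its target."""
    inverted = INVERSE_CONDITION[branch_cond]
    rewritten: List[Insn] = []
    for insn in body:                                    # Blocks 304 and 308
        if insn.is_conditional_branch or insn.sets_condition_codes:
            # Blocks 312 and 314: drop the conditions assigned so far and
            # fall back to executing the original branch.
            return "execute_branch", list(body)
        rewritten.append(replace(insn, cond=inverted))   # Block 306
    return "execute_optimized_path", rewritten           # Block 310


body = [Insn("ADD r1, r1, #1"), Insn("MOV r2, r1")]
print(optimize_forward_branch("EQ", body)[0])  # execute_optimized_path

blocked = [Insn("ADD r1, r1, #1"), Insn("CMP r3, #0", sets_condition_codes=True)]
print(optimize_forward_branch("EQ", blocked)[0])  # execute_branch
```

Building the rewritten list separately from the original body makes the discard step trivial: on a disqualifying instruction the model simply returns the untouched body.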
In some embodiments, the at least one instruction can include a forward conditional branch that is a last branch in a branched-over block, and wherein the branch does not disqualify the invention from optimizing the block.
In
If the instruction is not eligible for transformation or elimination, determining whether there is a next instruction—Block 412; if there is a next instruction, retrieving the next instruction—Block 404. If the instruction is eligible for transformation or elimination, dynamically assigning an inverted condition to the instruction (e.g., dynamically assigning the instruction into a NOP; for BNE, applying EQ to following instructions)—Block 408; and transmitting the modified instruction to the execution core—Block 410.
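A per-instruction version of this flow might be modeled as below. The sketch makes several assumptions that the text does not settle: the eligibility test is a caller-supplied hypothetical predicate, analysis continues with the next instruction after a modified instruction is transmitted, and ineligible instructions are simply returned unmodified in a separate list.

```python
from typing import Callable, List, Tuple

INVERSE_CONDITION = {"EQ": "NE", "NE": "EQ"}


def predecode_block(branch_cond: str,
                    body: List[str],
                    is_eligible: Callable[[str], bool]
                    ) -> Tuple[List[Tuple[str, str]], List[str]]:
    """Return (transmitted, unmodified) for the instructions in body.

    transmitted holds (instruction, assigned_condition) pairs sent to the
    execution core; unmodified holds instructions that were not eligible.
    """
    inverted = INVERSE_CONDITION[branch_cond]
    transmitted: List[Tuple[str, str]] = []
    unmodified: List[str] = []
    for insn in body:                            # Blocks 404 and 412
        if is_eligible(insn):
            transmitted.append((insn, inverted))  # Blocks 408 and 410
        else:
            unmodified.append(insn)               # assumption: left as-is
    return transmitted, unmodified


# Hypothetical eligibility rule for the example: anything but a compare.
def not_a_compare(insn: str) -> bool:
    return not insn.startswith("CMP")


sent, kept = predecode_block("NE", ["ADD r1, r1, #1", "CMP r3, #0"], not_a_compare)
print(sent)  # [('ADD r1, r1, #1', 'EQ')]
print(kept)  # ['CMP r3, #0']
```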
Similar to the sequence of instructions in
In some embodiments, the efficacy of the forward conditional branch prior to optimization may be evaluated after execution so as to compare it to the efficacy of the branch after optimization. In some embodiments, the forward conditional branch can be further optimized using software methods of optimization. For example,
In some embodiments, the forward conditional branch can be optimized prior to analysis. For example, the at least one instruction can have a condition that disagrees with the condition of the branch, and the at least one instruction can be dynamically assigned into a NOP. In some embodiments, forward conditional branch optimization is qualified by a branch-predictor state. Some examples of software forward conditional branch optimization include: increasing the biasing of a combination of AND and OR statements in software; removing the branches in a loop when the conditional does not change for the duration of the loop; and using a branch target buffer (BTB) to predict from a history log of previously encountered branches. In some embodiments, the forward conditional branch can be optimized only if a branch predictor has a weak state.
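One way to picture qualification by a weak predictor state is a conventional two-bit saturating counter. The counter is an assumption chosen for illustration; the text does not specify which predictor is used, and the class and method names are hypothetical.

```python
class TwoBitCounter:
    """Two-bit saturating counter.

    States: 0 strongly not-taken, 1 weakly not-taken,
            2 weakly taken,       3 strongly taken.
    """

    def __init__(self, state: int = 1) -> None:
        self.state = state

    def update(self, taken: bool) -> None:
        if taken:
            self.state = min(3, self.state + 1)
        else:
            self.state = max(0, self.state - 1)

    def is_weak(self) -> bool:
        return self.state in (1, 2)


def should_optimize(counter: TwoBitCounter) -> bool:
    """Qualify the optimization only when the predictor is in a weak state."""
    return counter.is_weak()


# A branch that alternates taken / not-taken keeps the counter in a weak state.
hard = TwoBitCounter()
for outcome in (True, False, True, False):
    hard.update(outcome)
print(should_optimize(hard))  # True

# A branch that is always taken saturates to a strong state.
easy = TwoBitCounter()
for _ in range(4):
    easy.update(True)
print(should_optimize(easy))  # False
```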
Those of skill in the art will appreciate that information and signals may be represented using any of a variety of different technologies and techniques. For example, data, instructions, commands, information, signals, bits, symbols, and chips that may be referenced throughout the above description may be represented by voltages, currents, electromagnetic waves, magnetic fields or particles, optical fields or particles, or any combination thereof.
Further, those of skill in the art will appreciate that the various illustrative logical blocks, modules, circuits, and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, computer software, or combinations of both. To clearly illustrate this interchangeability of hardware and software, various illustrative components, blocks, modules, circuits, and steps have been described above generally in terms of their functionality. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the overall system. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present invention.
The methods, sequences and/or algorithms described in connection with the embodiments disclosed herein may be embodied directly in hardware, in a software module executed by a processor, or in a combination of the two. A software module may reside in RAM memory, flash memory, ROM memory, EPROM memory, EEPROM memory, registers, hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the art. An exemplary storage medium is coupled to the processor such that the processor can read information from, and write information to, the storage medium. In the alternative, the storage medium may be integral to the processor.
Referring to
in a particular embodiment, input device 630 and power supply 644 are coupled to the system-on-chip device 622. Moreover, in a particular embodiment, as illustrated in
It should be noted that although
Accordingly, an embodiment of the invention can include a computer-readable medium embodying a method for optimizing hard-to-predict short forward branches. Accordingly, the invention is not limited to the illustrated examples, and any means for performing the functionality described herein are included in embodiments of the invention.
While the foregoing disclosure shows illustrative embodiments of the invention, it should be noted that various changes and modifications could be made herein without departing from the scope of the invention as defined by the appended claims. The functions, steps and/or actions of the method claims in accordance with the embodiments of the invention described herein need not be performed in any particular order. Furthermore, although elements of the invention may be described or claimed in the singular, the plural is contemplated unless limitation to the singular is explicitly stated.