The present invention relates generally to processor design, and particularly to methods and systems for run-time code parallelization.
Various techniques have been proposed for dynamically parallelizing software code at run-time. For example, Akkary and Driscoll describe a processor architecture that enables dynamic multithreading execution of a single program, in “A Dynamic Multithreading Processor,” Proceedings of the 31st Annual International Symposium on Microarchitectures, December, 1998, which is incorporated herein by reference.
Marcuello et al. describe a processor microarchitecture that simultaneously executes multiple threads of control obtained from a single program by means of control speculation techniques that do not require compiler or user support, in “Speculative Multithreaded Processors,” Proceedings of the 12th International Conference on Supercomputing, 1998, which is incorporated herein by reference.
Marcuello and Gonzales present a microarchitecture that spawns speculative threads from a single-thread application at run-time, in “Clustered Speculative Multithreaded Processors,” Proceedings of the 13th International Conference on Supercomputing, 1999, which is incorporated herein by reference.
In “A Quantitative Assessment of Thread-Level Speculation Techniques,” Proceedings of the 14th International Parallel and Distributed Processing Symposium, 2000, which is incorporated herein by reference, Marcuello and Gonzales analyze the benefits of different thread speculation techniques and the impact of value prediction, branch prediction, thread initialization overhead and connectivity among thread units.
Ortiz-Arroyo and Lee describe a multithreading architecture called Dynamic Simultaneous Multithreading (DSMT) that executes multiple threads from a single program on a simultaneous multithreading processor core, in “Dynamic Simultaneous Multithreaded Architecture,” Proceedings of the 16th International Conference on Parallel and Distributed Computing Systems (PDCS'03), 2003, which is incorporated herein by reference.
An embodiment of the present invention that is described herein provides a processor including an execution pipeline and monitoring circuitry. The execution pipeline is configured to execute instructions of program code. The monitoring circuitry is configured to monitor the instructions in a segment of a repetitive sequence of the instructions so as to construct a specification of register access by the monitored instructions, to parallelize execution of the repetitive sequence based on the specification, and to terminate monitoring of the instructions and discard the specification in response to detecting a branch mis-prediction in the monitored instructions.
In an embodiment, the monitoring circuitry is further configured to generate a flow-control trace traversed by the monitored instructions, and to correct the flow-control trace so as to compensate for the branch mis-prediction. In another embodiment, the monitoring circuitry is configured to continue monitoring the instructions during parallelized execution. In yet another embodiment, the monitoring circuitry is configured to continue to monitor the instructions and construct the specification after discarding the specification.
In an example embodiment, the monitoring circuitry is configured to generate a flow-control trace of the monitored instructions based on an output of a fetch unit in the execution pipeline. In another embodiment, the monitoring circuitry is configured to generate a flow-control trace of the monitored instructions based on an output of a decoding unit in the execution pipeline. In yet another embodiment, the monitoring circuitry is configured to generate a flow-control trace of the monitored instructions based on outputs of both a fetch unit and a decoding unit in the execution pipeline.
In some embodiments, the monitoring circuitry is configured to record in the specification a location in the sequence of a last write operation to a register, based on an output of a fetch unit in the execution pipeline. In other embodiments, the monitoring circuitry is configured to record in the specification a location in the sequence of a last write operation to a register, based on the instructions being executed in the execution pipeline. In still other embodiments, the monitoring circuitry is configured to record in the specification a location in the sequence of a last write operation to a register, based on the instructions that are committed and are not flushed due to the branch mis-prediction.
In some embodiments, the monitoring circuitry is configured to collect the register access only after evaluating respective branch conditions of conditional branch instructions of the sequence. In some embodiments, the monitoring circuitry is configured to generate a flow-control trace for the monitored instructions, including for a branch instruction that is not known to a branch prediction unit of the processor.
There is additionally provided, in accordance with an embodiment of the present invention, a processor including an execution pipeline and monitoring circuitry. The execution pipeline is configured to execute instructions of program code. The monitoring circuitry is configured to monitor the instructions in a segment of a repetitive sequence of the instructions so as to construct a specification of register access by the monitored instructions, to parallelize execution of the repetitive sequence based on the specification, and to retain the specification in the processor only provided that no branch mis-prediction is detected in the monitored instructions.
There is also provided, in accordance with an embodiment of the present invention, a method including, in a processor that executes instructions of program code, monitoring the instructions in a segment of a repetitive sequence of the instructions so as to construct a specification of register access by the monitored instructions. Execution of the repetitive sequence is parallelized based on the specification. In response to detecting a branch mis-prediction in the monitored instructions, monitoring of the instructions is terminated and the specification is discarded.
There is further provided, in accordance with an embodiment of the present invention, a method including, in a processor that executes instructions of program code, monitoring the instructions in a segment of a repetitive sequence of the instructions so as to construct a specification of register access by the monitored instructions. Execution of the repetitive sequence is parallelized based on the specification. The specification is retained in the processor only provided that no branch mis-prediction is detected in the monitored instructions.
The present invention will be more fully understood from the following detailed description of the embodiments thereof, taken together with the drawings in which:
Embodiments of the present invention that are described herein provide improved methods and devices for run-time parallelization of code in a processor. In the disclosed embodiments, the processor identifies a repetitive sequence of instructions, and creates and executes multiple parallel code sequences referred to as segments, which carry out different occurrences of the sequence. The segments are scheduled for parallel execution by multiple hardware threads.
For example, the repetitive sequence may comprise a loop, in which case the segments comprise multiple loop iterations, parts of an iteration or the continuation of a loop. As another example, the repetitive sequence may comprise a function, in which case the segments comprise multiple function calls, parts of a function or function continuation. The parallelization is carried out at run-time, on pre-compiled code. The term “repetitive sequence” generally refers to any instruction sequence that is revisited and executed multiple times.
In some embodiments, upon identifying a repetitive sequence, the processor monitors the instructions in the sequence and constructs a “scoreboard,” i.e., a specification of access to registers by the monitored instructions. The scoreboard is associated with the specific flow-control trace traversed by the monitored sequence. The processor decides how and when to create and execute the multiple segments based on the information collected in the scoreboard and the trace.
Further aspects of instruction monitoring are addressed in a U.S. Patent Application entitled “Run-time code parallelization with continuous monitoring of repetitive instruction sequences,” Attorney docket no. 1279-1004, and a U.S. Patent Application entitled “Register classification for run-time code parallelization,” Attorney docket no. 1279-1004.1, which are assigned to the assignee of the present patent application and whose disclosures are incorporated herein by reference.
In some embodiments, the processor fetches and processes instructions in its execution pipeline. Branch mis-prediction may occur when a conditional branch instruction is predicted to take a branch but during actual execution the branch is not taken, or vice versa. Upon detecting branch mis-prediction, the processor typically flushes the subsequent instructions and respective results.
When branch mis-prediction occurs in a segment whose instructions are being monitored, the register-access information in the scoreboard will typically be incorrect or at least incomplete. Some embodiments described herein provide techniques for correcting the register-access information collected in the scoreboard after detecting a branch mis-prediction event.
In an example embodiment, the processor stops monitoring of the segment in question and discards the register-access information collected in it. In other words, in some embodiments the processor retains the scoreboard only provided that no branch mis-prediction is detected in the monitored instructions. In other embodiments, the processor rolls back the scoreboard to the state prior to the mis-prediction, and continues to monitor the segment following the correct branch decision.
The processor may roll back the scoreboard in various ways, such as by saving in advance the states of the scoreboard prior to conditional branch instructions, and reverting to a previously-saved state when needed. Alternatively, the processor may roll back the scoreboard by tracing back the instructions that follow the mis-prediction and decrementing the register-access counters back to their values prior to the mis-prediction. Rolling back may be carried out for all conditional branch instructions, or only for a selected subset of the conditional branch instructions. Example criteria for selecting the subset are also described.
In some embodiments, as part of the monitoring process, the processor generates the flow-control trace to be associated with the scoreboard. Upon detecting mis-prediction, the processor typically corrects the generated flow-control trace, as well, using any of the methods described above.
In other disclosed embodiments, the processor reduces the impact of mis-prediction by proper choice of the execution-pipeline stage at which the flow-control trace is generated, and the execution-pipeline stage at which the register-access information is collected.
In various embodiments, the processor may generate the trace from the instructions immediately after fetching, immediately after decoding, or a combination of the two.
The register-access information may be collected, for example, immediately after decoding, after execution (including execution of mis-predicted instructions that will be flushed), or after committing (including only instructions that will not be flushed).
In the present example, processor 20 comprises an execution pipeline that comprises one or more fetching units 24, one or more decoding units 28, an Out-of-Order (OOO) buffer 32, and execution units 36. Fetching units 24 fetch program instructions from a multi-level instruction cache memory, which in the present example comprises a Level-1 (L1) instruction cache 40 and a Level-2 (L2) instruction cache 44.
A branch prediction unit 48 predicts the flow-control traces (referred to herein as “traces” for brevity) that are expected to be traversed by the program during execution. The predictions are typically based on the addresses or Program-Counter (PC) values of previous instructions fetched by fetching units 24. Based on the predictions, branch prediction unit 48 instructs fetching units 24 which new instructions are to be fetched. The flow-control predictions of unit 48 also affect the parallelization of code execution, as will be explained below.
Instructions decoded by decoding units 28 are stored in OOO buffer 32, for out-of-order execution by execution units 36, i.e., not in the order in which they have been compiled and stored in memory. Alternatively, the buffered instructions may be executed in-order. The buffered instructions are then issued for execution by the various execution units 36. In the present example, execution units 36 comprise one or more Multiply-Accumulate (MAC) units, one or more Arithmetic Logic Units (ALU), one or more Load/Store units, and a branch execution unit (BRA). Additionally or alternatively, execution units 36 may comprise other suitable types of execution units, for example Floating-Point Units (FPU).
The results produced by execution units 36 are stored in a register file and/or a multi-level data cache memory, which in the present example comprises a Level-1 (L1) data cache 52 and a Level-2 (L2) data cache 56. In some embodiments, L2 data cache memory 56 and L2 instruction cache memory 44 are implemented as separate memory areas in the same physical memory, or simply share the same memory without fixed pre-allocation.
In some embodiments, processor 20 further comprises a thread monitoring and execution unit 60 that is responsible for run-time code parallelization. The functions of unit 60 are explained in detail below.
The configuration of processor 20 shown in
As yet another example, the processor may be implemented without cache or with a different cache structure, without branch prediction or with a separate branch prediction per thread. The processor may comprise additional elements such as a reorder buffer (ROB) and register renaming, to name just a few. Further alternatively, the disclosed techniques can be carried out with processors having any other suitable micro-architecture.
Processor 20 can be implemented using any suitable hardware, such as using one or more Application-Specific Integrated Circuits (ASICs), Field-Programmable Gate Arrays (FPGAs) or other device types. Additionally or alternatively, certain elements of processor 20 can be implemented using software, or using a combination of hardware and software elements. The instruction and data cache memories can be implemented using any suitable type of memory, such as Random Access Memory (RAM).
Processor 20 may be programmed in software to carry out the functions described herein. The software may be downloaded to the processor in electronic form, over a network, for example, or it may, alternatively or additionally, be provided and/or stored on non-transitory tangible media, such as magnetic, optical, or electronic memory.
In some embodiments, unit 60 in processor 20 identifies repetitive instruction sequences and parallelizes their execution. Repetitive instruction sequences may comprise, for example, respective iterations of a program loop, respective occurrences of a function or procedure, or any other suitable sequence of instructions that is revisited and executed multiple times. In the present context, the term “repetitive instruction sequence” refers to an instruction sequence whose flow-control trace (e.g., sequence of PC values) has been executed in the past at least once. Data values (e.g., register values) may differ from one execution to another.
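The definition of a repetitive instruction sequence given above can be illustrated with a minimal Python sketch. It assumes, purely for illustration, that a flow-control trace is represented as a tuple of PC values and that previously executed traces are kept in a set. The class name `TraceHistory` and its method are hypothetical and are not taken from the disclosed embodiments.

```python
class TraceHistory:
    """Remembers flow-control traces that have already been executed."""

    def __init__(self):
        self._seen = set()

    def observe(self, pc_values):
        """Record a trace and report whether it was executed before."""
        trace = tuple(pc_values)
        repeated = trace in self._seen
        self._seen.add(trace)
        return repeated


history = TraceHistory()
loop_body = [0x100, 0x104, 0x108]          # hypothetical PC values
first_visit = history.observe(loop_body)   # trace not seen yet
second_visit = history.observe(loop_body)  # same trace revisited
print(first_visit, second_visit)
```

In this sketch only the PC values form the key, so two executions with different register values still count as the same trace, in line with the note that data values may differ from one execution to another.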
In the disclosed embodiments, processor 20 parallelizes a repetitive instruction sequence by invoking and executing multiple code segments in parallel or semi-parallel using multiple hardware threads. Each thread executes a respective code segment, e.g., a respective iteration of a loop, multiple (not necessarily successive) loop iterations, part of a loop iteration, continuation of a loop, a function or part or continuation thereof, or any other suitable type of segment.
Parallelization of segments in processor 20 is performed using multiple hardware threads. In the example of
In practice, data dependencies exist between segments. For example, a calculation performed in a certain loop iteration may depend on the result of a calculation performed in a previous iteration. The ability to parallelize segments depends to a large extent on such data dependencies.
The bottom of the figure shows how unit 60 parallelizes this loop using four threads TH1 . . . TH4, in accordance with an embodiment of the present invention. The table spans a total of eleven cycles, and lists which instructions of which threads are executed during each cycle. Each instruction is represented by its iteration number and the instruction number within the iteration. For example, “14” stands for the 4th instruction of the 1st loop iteration. In this example instructions 5 and 7 are neglected and perfect branch prediction is assumed.
The staggering in execution of the threads is due to data dependencies. For example, thread TH2 cannot execute instructions 21 and 22 (the first two instructions in the second loop iteration) until cycle 1, because instruction 21 (the first instruction in the second iteration) depends on instruction 13 (the third instruction of the first iteration). Similar dependencies exist across the table. Overall, this parallelization scheme is able to execute two loop iterations in six cycles, or one iteration every three cycles.
It is important to note that the parallelization shown in
In some embodiments, unit 60 decides how to parallelize the code by monitoring the instructions in the processor pipeline. In response to identifying a repetitive instruction sequence, unit 60 starts monitoring the sequence as it is fetched, decoded and executed by the processor.
In some implementations, the functionality of unit 60 may be distributed among the multiple hardware threads, such that a given thread can be viewed as monitoring its instructions during execution. Nevertheless, for the sake of clarity, the description that follows assumes that monitoring functions are carried out by unit 60.
As part of the monitoring process, unit 60 generates the flow-control trace traversed by the monitored instructions, and a monitoring table that is referred to herein as a scoreboard. The scoreboard of a segment typically comprises some classification of the registers. In addition, for at least some of the registers, the scoreboard indicates the location in the monitored sequence of the last write operation to the register.
Any suitable indication may be used to indicate the location of the last write operation, such as a count of the number of writes to the register or the address of the last write operation. The last-write indication enables unit 60 to determine, for example, when it is permitted to execute an instruction in a subsequent segment that depends on the value of the register. Additional aspects of scoreboard generation can be found in U.S. Patent Applications Attorney docket no. 1279-1004 and 1279-1004.1, cited above.
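By way of illustration only, the following Python sketch shows one possible use of a write count as the last-write indication. It assumes that each monitored instruction is reduced to a (destination, sources) pair, and that a later segment may read a register once the earlier segment has performed as many writes to it as the scoreboard recorded. The function names `count_writes` and `may_read` are hypothetical.

```python
def count_writes(instructions):
    """Count writes per register over (destination, sources) pairs."""
    counts = {}
    for destination, _sources in instructions:
        if destination is not None:
            counts[destination] = counts.get(destination, 0) + 1
    return counts


def may_read(register, writes_done, scoreboard_counts):
    """True once the earlier segment has performed its last write."""
    return writes_done.get(register, 0) >= scoreboard_counts.get(register, 0)


# Hypothetical monitored segment: r1 is written twice, r2 once.
monitored = [("r1", ("r1",)), ("r2", ("r1", "r3")), ("r1", ("r2",))]
scoreboard_counts = count_writes(monitored)

print(may_read("r1", {"r1": 1}, scoreboard_counts))  # one write still pending
print(may_read("r1", {"r1": 2}, scoreboard_counts))  # last write has occurred
```

A register that is never written in the monitored segment, such as r3 in this example, is readable immediately under this assumption.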
In some embodiments, processor 20 fetches and processes instructions speculatively, based on a prediction of the branch decisions that will be taken at future branch instructions. Branch prediction is carried out by branch prediction unit 48, and affects the instructions that are fetched for execution by fetching units 24.
Depending on the actual code and on the performance of unit 48, branch prediction may be erroneous. An event in which a conditional branch was predicted to take a branch but in fact the branch was not taken, or vice versa, is referred to herein as branch mis-prediction, or simply mis-prediction for brevity. In an embodiment of
As noted above, in some embodiments unit 60 monitors the flow-control trace and the register access during execution. In other embodiments unit 60 may monitor the flow-control trace and the register access in various segments simultaneously during parallel execution. When mis-prediction occurs in a segment being monitored, the resulting trace and scoreboard will typically be incorrect. For example, the scoreboard may comprise register-access information that was collected over instructions that follow the mis-predicted branch and will later be flushed.
In some embodiments, unit 60 takes various measures for correcting the scoreboard in the event of mis-prediction. The correction methods described below refer mainly to correction of the register-access information. In some embodiments, unit 60 uses these methods to correct the generated flow-control trace as well.
In some embodiments, in response to a detected mis-prediction event, unit 60 stops monitoring of the segment and discards the register-access information collected so far in the segment. Monitoring will typically be re-attempted in another segment. In these embodiments, unit 60 retains the scoreboard for the segment in question only provided that no branch mis-prediction is detected.
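The discard policy described above may be sketched in Python as follows. This is an illustrative model only: it assumes the register-access information is a dictionary of write counts, and the class `DiscardingMonitor` and its methods are hypothetical names.

```python
class DiscardingMonitor:
    """Keeps a scoreboard only if no mis-prediction is seen in the segment."""

    def __init__(self):
        self.active = True
        self.write_counts = {}

    def on_write(self, register):
        # Writes are recorded only while monitoring is still in progress.
        if self.active:
            self.write_counts[register] = self.write_counts.get(register, 0) + 1

    def on_mispredict(self):
        # Stop monitoring and drop everything collected so far.
        self.active = False
        self.write_counts = {}

    def finish(self):
        """Return the scoreboard, or None if it was discarded."""
        return dict(self.write_counts) if self.active else None


clean = DiscardingMonitor()
clean.on_write("r1")
clean.on_write("r1")

spoiled = DiscardingMonitor()
spoiled.on_write("r1")
spoiled.on_mispredict()
spoiled.on_write("r2")  # ignored, monitoring has stopped

print(clean.finish(), spoiled.finish())
```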
In other embodiments, unit 60 does not discard the register-access information, but rather rolls back the register-access information to its state prior to the mis-prediction. After rolling back, unit 60 may resume the monitoring process along the correct trace.
Unit 60 may roll back the scoreboard information in various ways. In some embodiments, unit 60 traces back over the instructions that follow the mis-prediction, and corrects the register-access information to remove the contribution of these instructions. For example, if the register-access information comprises counts of write operations to registers, unit 60 may decrement the counts to remove the contribution of write operations that follow the mis-prediction. If the register-access information comprises some other indications of the locations of the last write operations to registers, unit 60 may correct these indications, as well.
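The trace-back correction of write counts can be sketched as follows. The sketch assumes that the instructions following the mis-predicted branch are available as a list of their destination registers (with `None` for instructions that write no register), and that each of them was previously counted. The function name `roll_back_counts` is hypothetical.

```python
def roll_back_counts(write_counts, flushed_destinations):
    """Remove the contribution of instructions that follow the mis-prediction."""
    corrected = dict(write_counts)
    for destination in flushed_destinations:
        if destination is None:
            continue  # instruction did not write a register
        corrected[destination] -= 1
        if corrected[destination] == 0:
            # The register was written only on the mis-predicted path.
            del corrected[destination]
    return corrected


# Hypothetical counts collected along a mis-predicted path.
collected = {"r1": 3, "r2": 1, "r4": 1}
flushed = ["r1", None, "r4"]  # destinations of the flushed instructions

print(roll_back_counts(collected, flushed))
```

The input dictionary is left unmodified in this sketch, so the uncorrected counts remain available if the caller still needs them.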
In alternative embodiments, unit 60 prepares in advance for a possible roll-back of the scoreboard to a conditional branch instruction, by saving the state that the scoreboard had prior to that instruction. If mis-prediction occurs in this instruction, unit 60 may revert to the saved state of the scoreboard and resume monitoring from that point. The saved state of the scoreboard typically comprises the register-access information and the register classification prior to the branch instruction. The state may correspond to the exact conditional branch instruction, to the preceding instruction, or to another suitable instruction that is prior to the conditional branch instruction.
In some embodiments, unit 60 saves the scoreboard state prior to every conditional branch instruction, enabling roll-back following any mis-prediction. In alternative embodiments, unit 60 saves the scoreboard state for only a selected subset of the conditional branch instructions in the segment. This technique reduces the memory space needed for saved states, but on the other hand enables roll-back for only some of the possible mis-predictions. If mis-prediction occurs in an instruction for which no prior scoreboard state has been saved, unit 60 typically has to abort monitoring the segment and re-attempt monitoring in another segment.
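The saved-state approach, including the case in which states are saved for only a subset of the branches, may be sketched as below. The sketch assumes that the scoreboard state is a dictionary of write counts and that conditional branches are identified by their PC values. The class `CheckpointingScoreboard` and its methods are hypothetical names.

```python
class CheckpointingScoreboard:
    """Saves the scoreboard state before selected conditional branches."""

    def __init__(self, checkpointed_branches=None):
        # None means "save the state before every conditional branch".
        self.checkpointed_branches = checkpointed_branches
        self.write_counts = {}
        self._saved = {}

    def on_write(self, register):
        self.write_counts[register] = self.write_counts.get(register, 0) + 1

    def on_conditional_branch(self, pc):
        selected = (self.checkpointed_branches is None
                    or pc in self.checkpointed_branches)
        if selected:
            self._saved[pc] = dict(self.write_counts)

    def on_mispredict(self, pc):
        """Revert to the saved state; return False if monitoring must abort."""
        saved = self._saved.get(pc)
        if saved is None:
            return False
        self.write_counts = dict(saved)
        return True


scoreboard = CheckpointingScoreboard(checkpointed_branches={0x40})
scoreboard.on_write("r1")
scoreboard.on_conditional_branch(0x40)  # state saved here
scoreboard.on_write("r1")               # wrong-path write
scoreboard.on_write("r2")               # wrong-path write
rolled_back = scoreboard.on_mispredict(0x40)
state_after_roll_back = dict(scoreboard.write_counts)

scoreboard.on_conditional_branch(0x80)  # not in the subset, nothing saved
must_abort = not scoreboard.on_mispredict(0x80)
print(rolled_back, state_after_roll_back, must_abort)
```

Saving a full copy per branch is the simplest choice for a sketch. It makes the memory cost of each saved state explicit, which is the cost that saving only a subset is meant to reduce.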
Unit 60 may select the subset of conditional branch instructions (for which the prior state of the scoreboard is saved) using any suitable criterion. Typically, the criterion aims to select conditional branch instructions that are likely to be mis-predicted, and exclude conditional branch instructions that are likely to be predicted correctly. In one embodiment, the subset to be selected is specified in the code or by a compiler that compiles the code. In another embodiment, the subset is chosen by unit 60 at runtime. For example, unit 60 may accumulate mis-prediction statistics and select conditional branch instructions in which branch prediction accuracy is below a certain level.
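A run-time selection criterion of the kind described above might look as follows in Python. The sketch assumes that mis-prediction statistics are kept per branch PC as (correct predictions, total predictions) pairs, and that branches with no history are not selected. Both the data layout and the function name `select_branches` are illustrative assumptions.

```python
def select_branches(stats, accuracy_threshold):
    """Pick branches whose prediction accuracy is below the threshold."""
    selected = set()
    for pc, (correct, total) in stats.items():
        if total > 0 and correct / total < accuracy_threshold:
            selected.add(pc)
    return selected


# Hypothetical statistics: (correct predictions, total predictions).
stats = {
    0x40: (50, 100),  # poorly predicted branch
    0x80: (99, 100),  # well predicted branch
    0xC0: (0, 0),     # no history yet
}
chosen = select_branches(stats, accuracy_threshold=0.9)
print(sorted(chosen))
```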
The embodiments described above refer mainly to correction of the last-write indications in the scoreboard following mis-prediction. Additionally or alternatively, unit 60 may correct any other suitable register access information in the scoreboard that may be affected by mis-prediction. For example, the scoreboard typically comprises a classification of the registers accessed by the monitored instructions based on the order in which each register is used as an operand or as a destination in the monitored instructions. The classification may distinguish, for example, between local (L) registers whose first occurrence is as a destination, global (G) registers that are used only as operands, and global-local (GL) registers whose first occurrence is as operands and that are subsequently used as destinations.
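The L, G and GL classification can be illustrated with a short Python sketch. It assumes that each instruction is reduced to a (destination, sources) pair and that, within a single instruction, the operands are read before the destination is written. The function name `classify_registers` is hypothetical.

```python
def classify_registers(instructions):
    """Classify registers as 'L', 'G' or 'GL' by order of first use."""
    classes = {}
    for destination, sources in instructions:
        # Operands first: a register first seen as an operand starts as G.
        for register in sources:
            classes.setdefault(register, "G")
        if destination is not None:
            if destination not in classes:
                classes[destination] = "L"   # first occurrence is a write
            elif classes[destination] == "G":
                classes[destination] = "GL"  # read first, written later
    return classes


# Hypothetical monitored instructions as (destination, sources) pairs.
monitored = [
    ("r1", ("r2",)),  # r2 read, r1 written
    ("r3", ("r1",)),  # r1 read after its write, r3 written
    ("r2", ("r3",)),  # r2 written after having been read
    (None, ("r4",)),  # r4 only read
]
classes = classify_registers(monitored)
print(classes)
```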
In some embodiments, unit 60 may re-classify one or more of the registers so as to reflect their correct classification prior to the mis-prediction. Any of the correction methods described above (e.g., reverting to previously-saved states or tracing back the instruction sequence) can be used for this purpose.
The embodiments described above are depicted purely by way of example. In alternative embodiments, unit 60 may correct the scoreboard in response to branch mis-prediction in any other suitable way.
For example, in some embodiments unit 60 performs only an approximate correction of the specification that only approximately compensates for the effect of the mis-prediction. In these embodiments, unit 60 may roll back the specification to a state that approximates the state prior to the mis-prediction, rather than to the exact prior state. The approximation may comprise, for example, an approximation of the last-write indications of certain registers. In the present context, both exact and approximate corrections are considered types of specification corrections, and both exact and approximate compensation for the mis-prediction are considered types of compensation.
At an invocation step 74, unit 60 invokes multiple hardware threads to execute respective segments of the repetitive instruction sequence. For at least some of the segments, unit 60 continues to monitor the instructions during execution in the threads.
At a mis-prediction detection step 78, processor 20 checks whether branch mis-prediction has occurred in a given segment being executed. If no mis-prediction is encountered, the method loops back to step 74 above.
In case of branch mis-prediction, unit 60 corrects the scoreboard to compensate for the effect of the instructions following the mis-prediction, at a correction step 82. Unit 60 may use any of the techniques described above, or any other suitable technique, for this purpose. In some embodiments, the correction involves correction of the register-access information as well as correction of the generated flow-control trace.
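As a rough illustration of correcting the generated flow-control trace, the Python sketch below assumes that a trace is a list of (branch PC, taken) decisions, and that correction consists of dropping the entries that follow the mis-predicted branch and inverting the decision recorded for that branch. This representation and the function name `correct_trace` are assumptions made for the example.

```python
def correct_trace(trace, mispredicted_index):
    """Truncate the trace at the mis-predicted branch and flip its decision."""
    pc, predicted_taken = trace[mispredicted_index]
    return trace[:mispredicted_index] + [(pc, not predicted_taken)]


# Hypothetical trace: the branch at 0x20 was predicted not taken, wrongly.
trace = [(0x10, True), (0x20, False), (0x30, True)]
corrected = correct_trace(trace, mispredicted_index=1)
print(corrected)
```

Monitoring would then continue along the correct path, appending further branch decisions to the corrected trace.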
In some embodiments, unit 60 reduces the impact of branch mis-prediction by properly choosing the stage in the execution pipeline at which the trace is generated and the stage in the execution pipeline at which the register-access information is collected. Generally, trace generation and collection of register-access information need not be performed at the same pipeline stage.
In some embodiments, unit 60 generates the trace from the branch instructions being fetched, i.e., based on the branch instructions at the output of fetching units 24. In alternative embodiments, unit 60 generates the trace from the branch instructions being decoded, i.e., based on the branch instructions at the output of decoding units 28.
In yet another embodiment, unit 60 generates the trace based on a combination of branch instructions at the output of decoding units 28, and branch instructions at the output of fetching units 24.
In some embodiments, unit 60 collects the register-access information (e.g., classification of registers and locations of last write operations to registers) at the output of decoding units 28, i.e., from the instructions being decoded.
In other embodiments, unit 60 collects the register-access information based on the instructions being executed in execution units 36, but before the instructions and results are finally committed. In these embodiments, the register-access information includes the contribution of instructions that follow mis-prediction and will later be flushed (as in the case of collecting the register-access information after the decoding unit). In an alternative embodiment, unit 60 collects the register-access information based only on the instructions that are committed, i.e., without considering instructions that are flushed due to mis-prediction.
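The difference between collecting at the execution stage and collecting at commit can be sketched as follows. The example assumes that each instruction is described by its destination register and a flag telling whether it will later be flushed. In real hardware that flag is not known at execution time, which is precisely why the execution-stage counts include wrong-path writes. The function name `collect_write_counts` is hypothetical.

```python
def collect_write_counts(instructions, committed_only):
    """Count register writes, optionally skipping flushed instructions."""
    counts = {}
    for destination, will_be_flushed in instructions:
        if committed_only and will_be_flushed:
            continue  # never reaches commit, so it is not counted
        counts[destination] = counts.get(destination, 0) + 1
    return counts


# Hypothetical stream: the last two instructions follow a mis-prediction.
stream = [("r1", False), ("r2", False), ("r1", True), ("r3", True)]

at_execute = collect_write_counts(stream, committed_only=False)
at_commit = collect_write_counts(stream, committed_only=True)
print(at_execute, at_commit)
```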
In yet another embodiment, unit 60 collects the register-access information and/or generates the trace after evaluating the conditions of conditional branch instructions by the branch execution unit, i.e., at a stage where the branch instructions are no longer conditional.
Further additionally or alternatively, unit 60 may generate the flow-control trace and/or collect the register-access information based on any other suitable pipeline stages.
Generally speaking, monitoring instructions early in the pipeline helps to invoke parallel execution more quickly and efficiently, but on the other hand is more affected by mis-prediction. Monitoring instructions later in the pipeline causes slower parallelization, but is on the other hand less sensitive to mis-prediction.
In some embodiments, unit 60 is able to generate a trace even when monitoring a conditional branch instruction that is not yet known to branch prediction unit 48. This scenario may occur, for example, when a repetitive instruction sequence is first encountered and not yet identified as repetitive. Nevertheless, the trace is still recorded by the decoding unit (or by a register-renaming unit), and unit 60 may still be able to generate a trace. Typically, the trace will be generated with a branch not taken for this instruction.
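The default described above, in which a branch unknown to the branch prediction unit is recorded as not taken, may be sketched as follows. The sketch assumes that the known predictions are available as a mapping from branch PC to a taken/not-taken flag. The function name `generate_trace` is hypothetical.

```python
def generate_trace(branch_pcs, predictions):
    """Build a trace, recording unknown branches as not taken."""
    return [(pc, predictions.get(pc, False)) for pc in branch_pcs]


# Hypothetical case: only the branch at 0x10 is known to the predictor.
known_predictions = {0x10: True}
trace = generate_trace([0x10, 0x20], known_predictions)
print(trace)
```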
It will be appreciated that the embodiments described above are cited by way of example, and that the present invention is not limited to what has been particularly shown and described hereinabove. Rather, the scope of the present invention includes both combinations and sub-combinations of the various features described herein, as well as variations and modifications thereof which would occur to persons skilled in the art upon reading the foregoing description and which are not disclosed in the prior art. Documents incorporated by reference in the present patent application are to be considered an integral part of the application except that to the extent any terms are defined in these incorporated documents in a manner that conflicts with the definitions made explicitly or implicitly in the present specification, only the definitions in the present specification should be considered.
This application is a continuation-in-part of PCT Application PCT/IB2015/059467, filed Dec. 9, 2015, which claims priority from U.S. patent application Ser. No. 14/583,119, filed Dec. 25, 2014. The disclosures of these related applications are incorporated herein by reference.
Relation | Number | Date | Country |
---|---|---|---|
Parent | 14583119 | Dec 2014 | US |
Child | PCT/IB2015/059467 | | US |

Relation | Number | Date | Country |
---|---|---|---|
Parent | PCT/IB2015/059467 | Dec 2015 | US |
Child | 15620837 | | US |