1. Technical Field
The present disclosure relates generally to information processing systems and, more specifically, to embodiments of a method and apparatus for analyzing spawning pairs for speculative multithreading.
2. Background Art
In order to increase performance of information processing systems, such as those that include microprocessors, both hardware and software techniques have been employed. One approach that has been employed to improve processor performance is known as “multithreading.” In multithreading, an instruction stream is split into multiple instruction streams that can be executed concurrently. In software-only multithreading approaches, such as time-multiplex multithreading or switch-on-event multithreading, the multiple instruction streams are alternately executed on the same shared processor.
Increasingly, multithreading is supported in hardware. For instance, in one approach, referred to as simultaneous multithreading (“SMT”), a single physical processor is made to appear as multiple logical processors to operating systems and user programs. Each logical processor maintains a complete set of the architecture state, but nearly all other resources of the physical processor, such as caches, execution units, branch predictors, control logic, and buses are shared. In another approach, processors in a multi-processor system, such as a chip multiprocessor (“CMP”) system, may each act on one of the multiple threads concurrently. In the SMT and CMP multithreading approaches, threads execute concurrently and make better use of shared resources than time-multiplex multithreading or switch-on-event multithreading.
For those systems, such as CMP and SMT multithreading systems, that provide hardware support for multiple threads, several independent threads may be executed concurrently. In addition, however, such systems may also be utilized to increase the throughput for single-threaded applications. That is, one or more thread contexts may be idle during execution of a single-threaded application. Utilizing otherwise idle thread contexts to speculatively parallelize the single-threaded application can increase speed of execution and throughput for the single-threaded application.
The present invention may be understood with reference to the following drawings in which like elements are indicated by like numbers. These drawings are not intended to be limiting but are instead provided to illustrate selected embodiments of a method and apparatus for analyzing spawning pairs for a speculative multithreading processor.
Described herein are selected embodiments of a method, apparatus and system for analyzing spawning pairs for speculative multithreading. In the following description, numerous specific details such as thread unit architectures (SMT and CMP), number of thread units, variable names, data organization schemes, stages for speculative thread execution, and the like have been set forth to provide a more thorough understanding of the present invention. It will be appreciated, however, by one skilled in the art that the embodiments may be practiced without such specific details. Additionally, some well-known structures, circuits, and the like have not been shown in detail to avoid unnecessarily obscuring the embodiments discussed herein.
As used herein, the term “thread” is intended to refer to a sequence of one or more instructions. The instructions of a thread are executed in a thread context of a processor, such as processor 300 or processor 800 illustrated in
The method embodiments for analyzing spawning pairs, discussed herein, may thus be utilized in a processor that supports speculative multithreading. For at least one speculative multithreading approach, the execution time for a single-threaded application is reduced through the execution of one or more concurrent speculative threads. One approach for speculatively spawning additional threads to improve throughput for single-threaded code is discussed in commonly-assigned U.S. patent application Ser. No. 10/356,435, “Control-Quasi-Independent-Points Guided Speculative Multithreading”. Under that approach, single-threaded code is partitioned into threads that may be executed concurrently.
For at least one embodiment, a portion of an application's code may be parallelized through the use of the concurrent speculative threads. A speculative thread, referred to as the spawnee thread, is spawned at a spawn point. The spawned thread executes instructions that are ahead, in sequential program order, of the code being executed by the thread that performed the spawn. The thread that performed the spawn is referred to as the spawner thread. For at least one embodiment, a CMP core separate from the core executing the spawner thread executes the spawnee thread. For at least one other embodiment, the spawnee thread is executed in a single-core simultaneous multithreading system that supports speculative multithreading. For such embodiment, the spawnee thread is executed by a second SMT logical processor on the same physical processor as the spawner thread. One skilled in the art will recognize that the method embodiments discussed herein may be utilized in any multithreading approach, including SMT, CMP multithreading or other multiprocessor multithreading, or any other known multithreading approach that may encounter idle thread contexts.
A spawnee thread is thus associated with a spawn point as well as a point at which the spawnee thread should begin execution. The latter is referred to as a target point. These two points together are referred to as a “spawning pair.” A potential speculative thread is thus defined by a spawning pair, which includes a spawn point in the static program where a new thread is to be spawned and a target point further along in the program where the speculative thread will begin execution when it is spawned.
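By way of illustration only, a spawning pair may be represented as a simple record; the names `SpawningPair`, `spawn_point`, and `target_point` below are hypothetical and are not drawn from any actual implementation:

```python
from dataclasses import dataclass

# Hypothetical sketch of a spawning pair: a spawn point in the static
# program where a new thread is spawned, and a target point further
# along where the speculative thread begins execution.
@dataclass(frozen=True)
class SpawningPair:
    spawn_point: str   # static basic block that fires the spawn
    target_point: str  # static basic block where the spawnee begins

pair = SpawningPair(spawn_point="B", target_point="I")
```

A pairset, as discussed later, would then simply be a collection of such records.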
Well-chosen spawning pairs can generate speculative threads that provide significant performance enhancement for otherwise single-threaded code.
That is not to say that the spawned speculative thread 142 necessarily begins execution at the target point 106 immediately after the speculative thread has been spawned. Indeed, for at least one embodiment, certain initialization and data dependence processing may occur before the spawned speculative thread begins execution at the target point 106. Such processing is represented in
After such initialization stage 204, a slice stage 206 may occur. During the slice stage 206, live-in input values, upon which the speculative thread is anticipated to depend, may be calculated. For at least one embodiment, such live-in values are computed via execution of a “precomputation slice.” For the embodiments discussed herein, live-in values for a speculative thread are pre-computed using speculative precomputation based on backward dependency analysis. For at least one embodiment, the precomputation slice is executed, in order to pre-compute the live-in values for the speculative thread, before the main body of the speculative thread's instructions is executed. The precomputation slice may be a subset of instructions from one or more previous threads. A “previous thread” may include the main non-speculative thread, as well as any other “earlier” (according to sequential program order) speculative thread.
Such live-in calculations may be particularly useful if the target processor for the speculative thread does not support synchronization among threads in order to correctly handle data dependencies. Details for at least one embodiment of a target processor are discussed in further detail below in connection with
Brief reference is made to
A speculative thread 1112 may include two portions. Specifically, the speculative thread 1112 may include a precomputation slice 1114 and a thread body 1116. During execution of the precomputation slice 1114, the speculative thread 1112 determines one or more live-in values in the infix region 1110 before starting to execute the thread body 1116 in the postfix region 1102. The instructions executed by the speculative thread 1112 during execution of the precomputation slice 1114 correspond to a subset (referred to as a “backward slice”) of instructions from the main thread in the infix region 1110 that fall between the spawn point 1108 and the target point 1104. This subset may include instructions to calculate data values upon which instructions in the postfix region 1102 depend. For at least one embodiment of the methods described herein, the time that it takes to execute a slice is referred to as slice time 205.
During execution of the thread body 1116, the speculative thread 1112 executes code from the postfix region 1102, which may be an intact portion of the main thread's original code.
Returning to
After the speculative thread has completed execution of its thread body 1116 during the body stage 208, the thread enters a wait stage 210. The time at which the thread has completed execution of the instructions of its thread body 1116 (
The wait stage 210 represents the time that the speculative thread must wait until it becomes the least speculative thread. The wait stage reflects the assumption of an execution model in which speculative threads commit their results according to sequential program order. At this point, a discussion of an example embodiment of a target SpMT processor may be helpful in understanding the processing of the wait stage 210.
Reference is now made to
For embodiments of the analysis method discussed herein (such as, for example, method 400 illustrated in
For at least one embodiment, such as that illustrated in
While the CMP embodiments of processor 300 discussed herein refer to only a single thread per processor core 304, it should not be assumed that the disclosures herein are limited to single-threaded processors. The techniques discussed herein may be employed in any CMP system, including those that include multiple multi-threaded processor cores in a single chip package 303.
The thread units 304a-304n may communicate with each other via an interconnection network such as on-chip interconnect 310. Such interconnect 310 may allow register communication among the threads. In addition,
The topology of the interconnect 310 may be a multi-drop bus, a point-to-point network that directly connects each thread unit 304 to each other, or the like. In other words, any interconnection approach may be utilized. For instance, one of skill in the art will recognize that, for at least one alternative embodiment, the interconnect 310 may be based on a ring topology.
According to an execution model that is assumed for at least one embodiment of method 400 (
For at least one embodiment of the execution model assumed for an SpMT processor, the requirements to spawn a thread are: 1) there is a free thread unit 304 available, OR 2) there is at least one running thread that is more speculative than the thread to be spawned. That is, for the second condition, there is an active thread whose starting point is further along, in sequential program order, than the target point for the speculative thread that is to be spawned. In this second case, the method 400 assumes an execution model in which the most speculative thread is squashed, and its freed thread unit is assigned to the new thread that is to be spawned.
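As a hedged sketch, the two spawn conditions above may be expressed as a single predicate; the function name and argument shapes below are illustrative assumptions, not an actual implementation:

```python
# Illustrative sketch of the two spawn conditions: a spawn is permitted
# if a thread unit is free, or if some running thread is more
# speculative (its target lies later in sequential order) than the
# thread to be spawned, so that it can be squashed.
def can_spawn(new_target_time, units_busy, num_units, running_target_times):
    if units_busy < num_units:
        return True  # condition 1: a free thread unit exists
    # condition 2: a running thread whose target point is further along
    return any(t > new_target_time for t in running_target_times)

# Two units, both busy; a running thread targeting cumulative time 75 is
# more speculative than a new thread targeting time 55, so it may be
# squashed to make room.
print(can_spawn(55, 2, 2, [0, 75]))
```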
Among the running threads, at least one embodiment of the assumed execution model only allows one thread (referred to as the “main” thread) to be non-speculative. When all previously-spawned threads have either completed execution or been squashed, then the next speculative thread becomes the non-speculative main thread. Accordingly, over time the current non-speculative “main” thread may alternatively execute on different thread units.
Each thread becomes non-speculative and commits in a sequential order. A speculative thread must wait (see wait stage 210,
As is stated above, speculative threads can speed the execution of otherwise sequential software code. As each thread is executed on a thread unit 304, the thread unit 304 updates and/or reads the values of architectural registers. The thread unit's register values are not committed to the architectural state of the processor 300 until the thread being executed by the thread unit 304 becomes the non-speculative thread. Accordingly, each thread unit 304 may include a local register file 306. In addition, processor 300 may include a global register file 308, which can store the committed architectural value for each of R architectural registers. Additional details regarding at least one embodiment of a processor that provides local register files 306 for each thread unit 304 may be found in copending U.S. patent application Ser. No. 10/896,585, filed Jul. 21, 2004, and entitled “Multi-Version Register File For Multithreading Processors With Live-In Precomputation”.
Returning to
The speculative thread may then enter the commit stage 212 and the local register values for the thread unit 304 (
The commit time 218 illustrated in
The effectiveness of a spawning pair may depend on the control flow between the spawn point and the start of the speculative thread, as well as on the control flow after the start of the speculative thread, the aggressiveness of the compiler in generating the p-slice that precomputes the speculative thread's input values (discussed in further detail below), and the number of hardware contexts available to execute speculative threads. Additionally, for at least some embodiments, multiple instances of a particular speculative thread can be active at a given point in time. Determination of the true execution speedup due to speculative multithreading must take the interaction between various instances of the thread into account. Thus, the determination of how effective a potential speculative thread will be can be quite complex.
For at least one embodiment, the method 400 may be performed by a compiler to analyze, at compile time, the expected benefits of a set of spawning pairs for a given sequence of program instructions. To perform such analysis, the method 400 models execution of the program instructions as they would be performed on the target SpMT processor, taking into account the behavior induced by the specified set of spawning pairs, and tracks certain information during such modeling.
Thus, during its execution, the method 400 keeps track of certain information as it models expected execution behavior for the sequence of program instructions, given the specified set of spawning pairs. Accordingly, the method 400 may receive as inputs a set of spawning pairs (referred to herein as a pairset) and a representation of a sequence of program instructions.
For at least one embodiment, the pairset includes one or more spawning pairs, with each spawning pair representing at least one potential speculative thread. (Of course, a given spawning pair may represent several speculative threads if, for instance, it is enclosed in a loop). A given spawning pair in the pairset may include the following information: SP (spawn point) and TGT (target point). The SP indicates, for the speculative thread that is indicated by the spawning pair, the static basic block of the main thread program that fires the spawning of a speculative thread when executed. The TGT indicates, for the speculative thread indicated by the spawning pair, the static basic block that represents the starting point, in the main thread's sequential binary code, of the speculative thread associated with the SP.
In addition, each spawning pair in the pairset may also include precomputation slice information for the indicated speculative thread. The precomputation slice information provided for a spawning pair may include the following information. First, an estimated probability that the speculative thread, when executing the precomputation slice, will reach the TGT point (referred to as a start slice condition), and the average length of the p-slice in such cases. Second, an estimated probability that the speculative thread, when executing the p-slice, does not reach the TGT point (referred to as a cancel slice condition), and the average length of the p-slice in such cases.
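Under the assumption that the start slice and cancel slice conditions are exhaustive (their probabilities sum to one), the two probabilities and average lengths above may be combined into an expected p-slice length. This small sketch is illustrative only; the function and parameter names are hypothetical:

```python
# Expected p-slice length, combining the start-slice and cancel-slice
# cases described above (assumed exhaustive: p_start + p_cancel == 1).
def expected_slice_length(p_start, len_start, p_cancel, len_cancel):
    return p_start * len_start + p_cancel * len_cancel

# Example: the slice reaches TGT 80% of the time with average length 10,
# and is canceled 20% of the time with average length 4.
print(expected_slice_length(0.8, 10, 0.2, 4))  # 8.8
```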
The sequence of program instructions provided as an input to the method 400 may be a subset of the instructions for a program, such as a section of code (a loop, for example) or a routine. Alternatively, the sequence of instructions may be a full program. For at least one embodiment, rather than receiving the actual sequence of program instructions as an input, the method 400 may receive instead a program trace that corresponds to the sequence of program instructions.
A program trace is a sequence of basic blocks that represents the dynamic execution of the given section of code. For at least one embodiment, the program trace that is provided as an input to the method 400 may be the full execution trace for the selected sequence of program instructions. For other embodiments, the program trace that is provided as an input to the method 400 may be a subset of the full program trace for the target instructions. For example, via sampling techniques a subset of the full program trace may be chosen as an input, with the subset being representative of the whole program trace.
In addition to the pairset and the trace (or other representation of program instructions), the method 400 may also receive as an input the number of thread units that are available on the target SpMT processor. As is stated above, at least one embodiment of the method 400 assumes that the number of available thread units is a fixed number. For purposes of simplicity, the examples that are presented below assume only two thread units, TU0 and TU1. However, the embodiments described herein certainly contemplate more than two thread units.
Generally,
From the structure of the trace 900, we can see that the first basic block of the trace is basic block A, beginning at time 0, and the last basic block of the trace 900 is N, which begins (and ends) at time 120. In other words, we may assume that, when the basic blocks were selected for the trace 900, both the first (A) and last (N) basic blocks associated with the full sequence of program instructions were selected to be the first and last, respectively, basic blocks of the trace 900.
In
For each thread, its state is maintained in order to emulate its evolution over its lifetime. The main attribute of this maintained state is the activity currently being performed. The activity may be reflected, for example, by tracking whether the thread is in its slice stage (see 206,
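As an illustrative sketch, the per-thread state and its current activity might be tracked with a record such as the following; all names are hypothetical stand-ins for the model's bookkeeping:

```python
from dataclasses import dataclass
from enum import Enum

# Hypothetical per-thread activity, mirroring the stages discussed
# above: init (204), slice (206), body (208), wait (210), commit (212).
class Stage(Enum):
    INIT = "init"
    SLICE = "slice"
    BODY = "body"
    WAIT = "wait"
    COMMIT = "commit"

@dataclass
class ThreadState:
    name: str
    stage: Stage = Stage.INIT

t = ThreadState("Thr1")
t.stage = Stage.SLICE  # the thread is now modeled as executing its p-slice
print(t.stage.value)   # slice
```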
Hereinafter,
As the method 400 traverses the basic blocks in the input trace, two global variables, “current time” and “current thread” (discussed below) are updated. For at least one embodiment, not all basic blocks of the trace are analyzed. Instead, only “key” basic blocks are analyzed. “Key” basic blocks may be defined as the first and last basic blocks of the trace, as well as any basic block that includes the spawn point or target point for any spawning pair in the pairset.
The first global variable, referred to herein as “current time”, reflects the time at which the current basic block instance is being executed. As is stated above, it is assumed that the number of instructions in each basic block is known. For at least one embodiment of the method, the time that it takes a basic block to execute may be computed by multiplying the number of instructions in the basic block by the execution time needed for each instruction. For the sake of simplicity in discussing selected embodiments of the method 400, it is assumed that the execution of any instruction in the trace takes a single unit of time, and that each instruction takes that same amount of time to execute. However, in other embodiments different execution times may be used for each instruction. Such execution times may be determined, for instance, via profiling.
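The timing rule described above may be sketched as follows; the instruction counts used here are invented purely for illustration:

```python
# A block's execution time is its instruction count multiplied by a
# per-instruction cost (assumed to be one time unit here, as in the
# simplified discussion above); "current time" advances by that amount
# as each basic block is traversed.
def block_time(num_instructions, time_per_instruction=1):
    return num_instructions * time_per_instruction

current_time = 0
for n in [5, 15, 35]:  # hypothetical instruction counts of three blocks
    current_time += block_time(n)
print(current_time)  # 55
```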
The other global variable that is updated during traversal is “current thread.” The current thread variable indicates the thread that executes the current basic block instance that is under analysis.
The current time and current thread variables may be maintained in any known manner, including in variables, records, tables, arrays, objects, etc. For ease of illustration for specific examples, the variable values are illustrated in table format in Tables 2, 3, 5, 7a, 7b, 8, 9, 11a, 11b and 12, below.
As an output, the method 400 may generate an SpMT execution time. The execution time reflects the estimated time required to execute the selected program instructions (as reflected, for instance, in the input program trace), given the speculative threads indicated in the pairset, on a target SpMT machine.
During traversal of the program trace, one or more of the following types of information may be maintained for each thread:
Otherwise, processing proceeds to block 410, where it is determined whether the current basic block is associated with a target point, as defined in the pairset. If so, processing proceeds to block 412. Otherwise, processing proceeds to block 414.
At block 414 it is determined whether the current basic block is associated with a spawn point, as defined in the pairset. If so, then processing proceeds to block 416. Otherwise, processing proceeds to block 418.
At block 418, the method 400 determines whether the current basic block is the last basic block of the trace. If so, processing proceeds to block 420. Otherwise, processing proceeds to block 422. At block 422, the method 400 traverses to the next key basic block in the trace and updates current time. Processing then loops back to block 406, in order to traverse the remaining blocks in the trace.
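The traversal and event dispatch described in blocks 406 through 422 may be sketched, under simplifying assumptions, as a plain loop over the key basic blocks; the handler placeholders below are hypothetical (INIT processing at block 408 is omitted for brevity):

```python
# Skeleton of the trace traversal: visit only "key" basic blocks and
# record which event processing each one would trigger. A block may be
# associated with more than one event (e.g., both a spawn point and a
# target point).
def traverse(key_blocks, target_blocks, spawn_blocks):
    events = []
    for i, bb in enumerate(key_blocks):
        if bb in target_blocks:
            events.append(("TGT", bb))  # block 412 processing
        if bb in spawn_blocks:
            events.append(("SP", bb))   # block 416 processing
        if i == len(key_blocks) - 1:
            events.append(("END", bb))  # block 420 processing
    return events

# Hypothetical key blocks from the running example: B and D are spawn
# points, G and I are target points, N is the last block of the trace.
print(traverse(["A", "B", "D", "G", "I", "N"], {"G", "I"}, {"B", "D"}))
```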
One of skill in the art will realize that a basic block may be associated with more than one event. For instance, in a trace having a single basic block, the single block will be associated with both an INIT and END event. Similarly, a basic block may be both a spawn point (for one spawning pair) and a target point (for another spawning pair). Also, the first basic block may be associated with a spawn point. Accordingly,
In order to further illustrate operation of the method,
Table 1 illustrates the new thread (Thr=Thr0) that is modeled at block 504. Table 1 indicates that the model has spawned a single thread, Thr0, that begins at basic block A and sequentially executes all basic blocks of the trace, through basic block N.
From block 504, processing proceeds to block 506. At block 506, the global current thread value is set to reflect the thread, Thr0, that has been “spawned” at block 504. (One will note, of course, that when the term “spawned” is used in relation to
From block 506, processing proceeds to block 508. At block 508, the current time is set to time 0, to reflect that execution of the first instruction of the first basic block of the input trace is being modeled. Processing then ends at block 510, and processing proceeds to block 410 of
Table 2 illustrates the global values for current time and current thread, as well as the current basic block and event type, at the end of block 408 processing:
Returning to
Accordingly, at block 422, the current time is updated to a value of ‘5’ to reflect that basic block A has been traversed. Now, the current basic block being traversed is basic block B. Accordingly, after execution of the first pass of block 422, the value of the global current time and current block values are as set forth in Table 3:
From block 422, processing proceeds to block 406. Because basic block B is associated only with a spawn (SP) event, the determinations at blocks 406 and 410 evaluate to “false”, and processing proceeds to block 414. The determination at block 414 evaluates to “true”, and processing then proceeds to block 416. A more detailed illustration of at least one embodiment of block 416 processing is set forth at
Turning to
If a target point associated with an SP basic block is not found in the trace, then processing proceeds to block 606. At block 606, it is determined whether a thread unit is available. If not, then processing for block 416 ends at block 616 and processing returns to block 418 of
If, however, the target point is found, processing proceeds to block 610. In such case, spawning of an additional (speculative) thread should be modeled. At block 610, it is thus determined whether a thread unit is free in order to model spawning of the new thread on the free unit. If not, processing proceeds to block 614. At block 614, it is determined whether a currently-allocated thread unit should be freed up for the current speculative thread under consideration. Such processing 614, 618, 620 is discussed in further detail below in connection with sample basic block D.
To determine whether a thread unit is free for the new thread at block 610, the current time is considered. That is, the method 400 searches its modeling information at block 610 to determine whether any thread unit is free at current time 5. For our example, the current modeling information (see Table 1) indicates that thread unit 0 is busy with Thr0 from time 0 through time 120. Accordingly, it is not free at time 5. However, because we have assumed an SpMT processor that has two thread units, the second thread unit is free. Accordingly, processing proceeds to block 612.
At block 612, an entry for the new speculative thread, Thr1, is modeled. The new thread, Thr1, is spawned at block B, at time 5, and is to begin execution at the beginning of basic block I, and is to execute the remainder of the trace (through basic block N). The trace 900b in
Table 4 also reflects that the starting basic block (BBS) for Thr1 is basic block I and that Thr1 is spawned at time 5 (TimeSP). Because execution of Thr1 is modeled as concurrent with execution of Thr0, the cumulative time of 75, as reflected for basic block I in the annotated trace 900b, is not an accurate reflection of the actual time at which Thr1 will begin its modeled execution. Instead, Thr1 will begin execution shortly after it is spawned at time 5. For simplicity, we assume for this example that all init overhead 213 times are zero and that all slice times 205 are zero. With such assumption, start time (TimeST) 214 = spawn time (TimeSP) = 5.
The end time (TimeE) for Thr1 depends on how long it takes to execute the thread. The cumulative time values illustrated in the annotated trace 900b indicate that the execution of basic block I through basic block N takes from sequential cumulative time 75 through time 120. The time to execute Thr1 is therefore 120−75=45. If Thr1 begins its modeled execution at time 5 and takes 45 time units to execute, its end time is thus 45+5=50. Table 4 reflects an end time (TimeE) of 50 for Thr1.
One will note that the commit time, TimeC, for Thr1 is later than its end time. This is due to the assumed constraint, discussed above, that threads commit their results in sequential program order. Thr1, which begins at time 5, occurs later, in sequential program order, than Thr0, which begins at time 0. Accordingly, the later thread, Thr1, may not commit its results until its previous thread, Thr0, has committed its results. Table 4 indicates a commit time of 75 for Thr1's previous thread, Thr0. Accordingly, Table 4 also reflects a commit time of 75 for Thr1 as well.
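The end-time and commit-time arithmetic above may be sketched with the example numbers for Thr1; the helper function names are illustrative only:

```python
# End time: the thread starts at its spawn/start time and runs for the
# span between its target point's cumulative time and the trace's end.
def end_time(start_time, target_cum_time, trace_end_cum_time):
    return start_time + (trace_end_cum_time - target_cum_time)

# Commit time: threads commit in sequential program order, so a thread
# may not commit before its previous thread has committed.
def commit_time(own_end_time, prev_thread_commit_time):
    return max(own_end_time, prev_thread_commit_time)

te = end_time(5, 75, 120)  # 5 + (120 - 75) = 50 for Thr1
tc = commit_time(te, 75)   # Thr1 must wait for Thr0's commit at time 75
print(te, tc)  # 50 75
```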
Table 4 also reflects changes in the modeling information for Thr0. The new thread, Thr1, will begin execution at basic block I. The first thread, Thr0, need no longer execute the entire trace, but may complete its execution when it reaches basic block I. Accordingly, the model may be updated to reflect that Thr0 is now modeled to be busy only through time 75. At time 75, Thr0 may commit its results. Table 4 reflects this modification. Processing for block 416 then ends at block 616, and returns to block 418 of
Returning to
At this second pass of block 422 for our example, the method 400 traverses to the next key basic block and the current time is updated accordingly. For the sample input trace 900b illustrated in
Processing then loops back to block 406, falls through the checks at blocks 406 and 410, and proceeds to block 414. At block 414, it is determined that the current block (basic block D) is associated with a spawn event. Processing thus proceeds to block 416, an embodiment of which is, again, illustrated in further detail in
Turning to
At block 610, it is determined whether a thread unit is available to begin execution at the current time. The modeling information illustrated in Table 4, above, indicates that both thread unit 0 (TU0) and thread unit 1 (TU1), are busy at time 20. That is, TU0 is busy from time 0 to time 75, and TU1 is busy from time 5 to time 75. Accordingly, the evaluation at block 610 evaluates to “false” and processing thus proceeds to block 614.
At block 614 it is determined whether the most speculative thread that is currently modeled as busy is more speculative than the speculative thread under consideration. The most speculative thread may be identified as the thread denoted as “normal” type and having a null value for its “next thread” value.
For our example, Table 4 indicates that the most speculative thread is the thread modeled for TU1, because it has a null value in its next thread field. Table 4 indicates that the speculative thread modeled for TU1 has a target point associated with basic block I, which begins at sequential cumulative time 75.
The speculative thread under consideration is the speculative thread indicated by the second spawning pair—the indicated target point is associated with the beginning of basic block G, which begins at sequential cumulative time 55.
The thread currently modeled for TU1 is thus more speculative than the thread under consideration, because it is designated to begin execution at a point farther from the beginning of the trace (according to sequential program order). Accordingly, there is a more speculative thread that can be squashed in order to allow modeled spawning of a speculative thread for the second spawning pair in the pairset 910. The evaluation at block 614 thus evaluates to “true,” and processing proceeds to block 618.
At block 618, the thread currently modeled for the thread unit to be freed is canceled. This is accomplished, in part, by marking the thread as “cancel” type. For a canceled thread, commit time=end time=time that the thread is canceled. Table 5, above, indicates that the current time, at which the thread is being canceled, is time 20. Accordingly, commit time for the canceled thread is time 20. In addition, the previous thread and next thread for a canceled thread are null. Accordingly, Table 6 reflects that the commit time, end time, next thread and previous thread for Thr1 are updated accordingly at block 618. Processing then proceeds to block 620.
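The cancellation bookkeeping described above may be sketched as follows; the dictionary-based thread record is a hypothetical stand-in for the model's thread state:

```python
# Cancel the thread occupying the thread unit to be freed: mark it as
# "cancel" type, set commit time = end time = cancellation time, and
# null out its previous-thread and next-thread links.
def cancel_thread(thread, current_time):
    thread["type"] = "cancel"
    thread["time_end"] = current_time
    thread["time_commit"] = current_time
    thread["prev_thread"] = None
    thread["next_thread"] = None
    return thread

# Hypothetical record for Thr1 before cancellation at current time 20.
thr1 = {"name": "Thr1", "type": "normal", "prev_thread": "Thr0",
        "next_thread": None, "time_end": 50, "time_commit": 75}
print(cancel_thread(thr1, 20)["time_commit"])  # 20
```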
At block 620, an entry for the new speculative thread, Thr2, is modeled. Table 6, below, indicates that the model reflects, as a result of block 620 processing, that thread Thr2 is modeled to execute on newly freed thread unit (“TU”) 1. The new thread, Thr2, is spawned at block D, at time 20, and is to begin execution at the beginning of basic block G, and is to execute the remainder of the trace (through basic block N). The trace 900b in
Table 6 also reflects that the starting basic block (BBS) for Thr2 is basic block G and that Thr2 is spawned at time 20 (TimeSP). For Thr2, spawn time (TimeSP) = start time (TimeST) = 20.
Again, the end time (TimeE) for Thr2 depends on how long it takes to execute the thread. The cumulative time values illustrated in the annotated trace 900b indicate that the execution of basic block G through basic block N takes from sequential cumulative time 55 through time 120. The time to execute Thr2 is therefore 120−55=65. If Thr2 begins its modeled execution at time 20 and takes 65 time units to execute, its end time is thus 20+65=85. Table 6 reflects an end time (TimeE) of 85 for Thr2.
Because the end time (TimeE) for Thr2 occurs after the commit time indicated for Thr0, Thr2 need not wait to commit its results. Accordingly, TimeE=85=TimeC for Thr2.
Table 6 also reflects changes in the modeling information for Thr0. The new thread, Thr2, will begin execution at basic block G. The first thread, Thr0, need no longer execute the trace up to basic block I, but may complete its execution when it reaches basic block G. Accordingly, the model may be updated to reflect that Thr0 is now modeled to be busy only until time 55. At time 55, Thr0 may commit its results. Table 6 reflects this modification. Processing for block 416 then ends at block 616, and returns to block 418 of
Returning to
At this third pass of block 422 for our example, the method 400 traverses to the next key basic block and the current time is updated accordingly. For the sample input trace 900b illustrated in
From block 422, processing loops back to block 406, falls through the check at block 406, and proceeds to block 410. At block 410, it is determined that the current block (basic block G) is associated with a target event. Processing thus proceeds to block 412, an embodiment of which is illustrated in further detail in
Turning to
If, however, the determination at block 704 evaluates to “true,” then a thread, other than the current thread, has been modeled to begin execution at the current basic block. For the example trace 900b illustrated in
At block 710, an internal variable, Thr, is set to the thread so identified. For our example, Thr=Thr2 at block 710. Processing then proceeds to block 712.
At block 712, completion of the current thread (i.e., Thr0) is modeled. One will note that, as is reflected above in Table 6, thread Thr0 may commit its results at time 55. Accordingly, at block 712 the method 400 models commitment of Thr0 values. Other thread completion tasks may also be modeled at block 712. Processing then proceeds to block 714.
At block 714, the global current thread value is updated. The current thread variable indicates the thread that executes the current basic block instance that is under analysis. As is reflected in Table 6, above, the current basic block instance under analysis is the instance of basic block G that is to begin execution at current time 20. Such instance is performed by Thr2, not Thr0. Because Thr0 has completed execution, the current thread is now updated, for our example, to reflect Thr2. Processing then proceeds to block 716.
At block 716, the global current time value is updated. That is, Table 6 reflects that Thr2 is modeled to begin its execution at time 20. Thus, the current time is 20. The modifications that occur at blocks 714 and 716 are reflected in Table 7b.
From block 716, processing ends at block 718. Processing then proceeds back to block 414 of
During the fourth iteration of block 422, the method 400 traverses to the next key basic block in the trace, which is block I. The current time is updated accordingly. Because block I is performed by a separate thread (Thr2) that is modeled to execute concurrently with the first thread (Thr0) discussed above, the sequential cumulative time value (75) for basic block I that is reflected in the sample trace 900b does not reflect the actual current time at which basic block I is modeled to execute. Table 6 indicates that Thr2 begins execution at basic block G at a current time of 20. The sample trace 900b indicates that block G is associated with sequential cumulative time 55 and block I is associated with sequential cumulative time 75. Thus, the time from the beginning of Thr2 execution until execution of basic block I is 75−55=20. Because Thr2 is modeled to begin execution at a current time of 20, current time for execution of basic block I is 20+20=40. Accordingly, the current time is updated at the fourth iteration of block 422 as indicated in Table 8:
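The current-time update for a basic block executed inside a speculative thread can be sketched as follows (the function name is an illustrative assumption): the block's offset within the thread, measured on sequential cumulative time, is added to the thread's modeled start time.

```python
def current_time_for_block(thread_start_time, thread_start_cum, block_cum):
    # Offset of the block within the thread, on sequential cumulative time,
    # added to the thread's modeled start time.
    return thread_start_time + (block_cum - thread_start_cum)

# Basic block I within Thr2: Thr2 starts at current time 20 at basic block G
# (cumulative time 55); block I carries cumulative time 75 in trace 900b.
print(current_time_for_block(20, 55, 75))  # 40
```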
From block 422, processing loops back to block 406, falls through the check at block 406, and proceeds to block 410. At block 410, it is determined that the current block (basic block I) is associated with a target event. Processing thus proceeds to block 412, an embodiment of which is, again, illustrated in further detail in
Turning to
Returning to
For the fifth iteration of block 422 for our example, the method 400 traverses to the next key basic block in the sample trace 900b. The method 400 thus traverses to basic block K, and the current time is updated accordingly. Regarding current time, one can see that basic block K is associated, in the annotated sample trace 900b, with cumulative sequential time 95. Because the first basic block of Thr2 (basic block G) is associated with cumulative sequential time 55, the time it takes to execute to the beginning of basic block K may be modeled as 95−55=40. Because execution of thread Thr2 is modeled to begin at current time 20, current time for execution of basic block K in Thr2 is 20+40=60. Table 9 reflects these modifications that occur at the fifth iteration of block 422:
From block 422, processing loops back to block 406, falls through the checks at blocks 406 and 410, and proceeds to block 414. The determination at block 414 evaluates to “true” because, as is illustrated in sample trace 900b, basic block K is associated, for our example, with a spawn event. Processing thus proceeds to block 416. A more detailed illustration of at least one embodiment of block 416 processing is, again, set forth at
Turning to
At block 610, it is determined that thread unit TU0 is free. Table 9 reflects that the current time is 60, and Table 6 reflects that Th0, which was modeled to execute on TU0, will have completed execution by current time 55. Accordingly, the determination at block 610 evaluates to “true,” and processing proceeds to block 612.
At block 612, a new thread is modeled to begin execution on TU0, much in the manner described above in connection with block 612 and Thr1. A new thread (Thr3) is modeled to spawn at basic block K, to begin execution of basic block M at current time 60. Accordingly, Table 10, below, indicates that the model reflects, as a result of current block 612 processing, that thread Thr3 is modeled to execute on thread unit (“TU”) 0.
Table 10 also reflects that the starting basic block (BBS) for Thr3 is basic block M and that Thr3 is spawned at current time 60 (TimeSP), with a start time (TimeST) of 60.
The end time (TimeE) for Thr3 is reflected in Table 10 as 75. The cumulative time values illustrated in the annotated trace 900b indicate that the execution of basic block M through basic block N takes from sequential cumulative time 105 through time 120. The time to execute Thr3 is therefore 120−105=15. If Thr3 begins its modeled execution at time 60 and takes 15 time units to execute, its end time is thus 60+15=75. Table 10 thus reflects an end time (TimeE) of 75 for Thr3.
Table 10 also reflects changes in the modeling information for Thr2. The new thread, Thr3, will begin execution at basic block M. The previous thread, Thr2, need no longer execute the entire trace, but may complete its execution when it reaches basic block M. Accordingly, the model may be updated to reflect that Thr2 is now modeled to be busy only through time 70. The value of 70 is calculated as follows. Table 10 reflects that Thr2 begins execution at current time 20. Modeled execution time for basic block G through basic block L=105−55=50. The duration value of 50, added to the start time value of 20, yields 70.
Accordingly, Thr2 may commit its results at time 70. Thus, at time 75, Thr3 need not wait for its prior thread to commit, and may immediately commit its own results. Table 10 thus reflects that commit time for Thr3 is 75.
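The commit-time rule applied here can be sketched as follows (the function name is an illustrative assumption): a thread may commit at its own end time, unless its predecessor in sequential program order has not yet committed, in which case it waits.

```python
def commit_time(end_time, prev_commit_time):
    # A thread may commit once it has finished executing AND its
    # predecessor (in sequential program order) has committed.
    return max(end_time, prev_commit_time)

# Thr3 ends at time 75; its predecessor Thr2 commits at time 70,
# so Thr3 need not wait and commits at its own end time.
print(commit_time(75, 70))  # 75

# Had Thr3 instead finished at time 60, it would wait for Thr2's
# commit at time 70 before committing.
print(commit_time(60, 70))  # 70
```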
From block 612, processing for block 416 then ends at block 616. Processing then returns to block 418 of
At the sixth iteration of block 422, the method 400 traverses to the next key block in the trace, which is block M. Block M is associated, for our example, with a target event. Specifically, block M is designated in the sample pairset 910 as the target point for the spawn point at basic block K. Accordingly, processing for basic block M is performed along the same lines as is discussed above in connection with block 412, basic block G and
The modifications made as a result of the sixth iteration of block 422 are reflected in Table 11a. The modifications made as a result of blocks 714 and 716 (
After block 412 processing is performed for basic block M, processing proceeds to block 418, which evaluates to “false” because block M is not the last block of the trace. Processing then proceeds to block 422.
For the seventh iteration of block 422, for our example, the method 400 traverses to block N, the last block of the sample trace 900b, and the current time is updated accordingly. Table 12 reflects such processing:
Processing then loops back to block 406, falls through the checks at blocks 406, 410, and 414, and processing proceeds to block 418. Because block N is the last basic block of the sample trace 900b, the determination at block 418 evaluates to “true.” Processing thus proceeds to block 420. For at least one embodiment, additional details for block 420 processing are set forth in
Turning to
From block 1106, processing for block 420 ends at block 1108. Returning to
In sum, embodiments of the methods discussed herein provide for determining the effect of a set of spawning pairs on the execution time for a sequence of program instructions for a particular multithreading processor. The spawning pairs indicate concurrent speculative threads that may be spawned during execution of the sequence of program instructions and may thus reduce total execution time. The total execution time is determined by modeling the effects of the spawning pairs on execution of the sequence of program instructions.
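The overall result of the modeling may be sketched as follows (the final commit times are taken from the worked example above; treating the total modeled execution time as the latest commit time among the modeled threads is an illustrative assumption consistent with that example):

```python
# Final modeled commit times from the worked example: Thr0 commits at 55,
# Thr2 at 70, and Thr3 at 75 (Thr1 was canceled and is excluded).
commit_times = {"Thr0": 55, "Thr2": 70, "Thr3": 75}

# The modeled multithreaded execution time for the trace is the commit
# time of the last thread to commit.
total_time = max(commit_times.values())
print(total_time)  # 75, versus a sequential cumulative time of 120
```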
Embodiments of the method may be implemented in hardware, software, firmware, or a combination of such implementation approaches. Embodiments of the invention may be implemented as computer programs executing on programmable systems comprising at least one processor, a data storage system (including volatile and non-volatile memory and/or storage elements), at least one input device, and at least one output device. Program code may be applied to input data to perform the functions described herein and generate output information. The output information may be applied to one or more output devices, in known fashion. For purposes of this application, a processing system includes any system that has a processor, such as, for example, a digital signal processor (DSP), a microcontroller, an application specific integrated circuit (ASIC), or a microprocessor.
The programs may be implemented in a high-level procedural or object-oriented programming language to communicate with a processing system. The programs may also be implemented in assembly or machine language, if desired. In fact, the method described herein is not limited in scope to any particular programming language. In any case, the language may be a compiled or interpreted language.
The programs may be stored on a storage media or device (e.g., hard disk drive, floppy disk drive, read only memory (ROM), CD-ROM device, flash memory device, digital versatile disk (DVD), or other storage device) readable by a general or special purpose programmable processing system. The instructions, accessible to a processor in a processing system, provide for configuring and operating the processing system when the storage media or device is read by the processing system to perform the procedures described herein. Embodiments of the invention may also be considered to be implemented as a machine-readable storage medium, configured for use with a processing system, where the storage medium so configured causes the processing system to operate in a specific and predefined manner to perform the functions described herein.
An example of one such type of processing system is shown in
Processor 804 includes N thread units 104a-104n, where each thread unit 104 may be (but is not required to be) associated with a separate core. For purposes of this disclosure, N may be any integer >1, including 2, 4 and 8. For at least one embodiment, the processor cores 104a-104n may share the memory system 850. The memory system 850 may include an off-chip memory 802 as well as a memory controller function provided by an off-chip interconnect 825. In addition, the memory system may include one or more on-chip caches (not shown).
Memory 802 may store instructions 840 and data 841 for controlling the operation of the processor 804. For example, instructions 840 may include a compiler program 808 that, when executed, causes the processor 804 to compile a program (not shown) that resides in the memory system 850. Memory 802 holds the program to be compiled, intermediate forms of the program, and a resulting compiled program. For at least one embodiment, the compiler program 808 includes instructions to model execution of a sequence of program instructions, given a set of spawning pairs, for a particular multithreaded processor.
Memory 802 is intended as a generalized representation of memory and may include a variety of forms of memory, such as a hard drive, CD-ROM, random access memory (RAM), dynamic random access memory (DRAM), static random access memory (SRAM) and related circuitry. Memory 802 may store instructions 840 and/or data 841 represented by data signals that may be executed by processor 804. The instructions 840 and/or data 841 may include code for performing any or all of the techniques discussed herein. For example, at least one embodiment of a method for determining an execution time is related to the use of the compiler 808 in system 800 to cause the processor 804 to model execution time, given one or more spawning pairs, as described above. The compiler may thus, given the spawn instructions indicated by the spawning pairs, model a multithreaded execution time for the given sequence of program instructions.
Turning to
The instructions 1200 may also receive as an input a pairset that identifies spawn instructions for helper threads. For at least one embodiment, each spawn instruction is represented as a spawning pair that includes a spawn point identifier and a target point identifier. As is mentioned above, the target point identifier may be a control-quasi-independent point.
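A pairset of the kind described above may be represented as follows (the type and field names are illustrative assumptions). The two pairs shown are those from the worked example: a spawn at basic block D targeting basic block G, and a spawn at basic block K targeting basic block M.

```python
# Hypothetical representation of the pairset input: each spawning pair
# names a spawn point and a target (control-quasi-independent) point.
from typing import NamedTuple

class SpawningPair(NamedTuple):
    spawn_point: str   # basic block containing the spawn instruction
    target_point: str  # basic block where the speculative thread begins

pairset = [SpawningPair("D", "G"), SpawningPair("K", "M")]
print(pairset[1].target_point)  # M
```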
As is indicated in the discussion of
Specifically,
The compiler 808 may also include a spawn block modeler 1222 that, when executed by the processor 804 (
The compiler 808 may also include a target block modeler 1224 that, when executed by the processor 804 (
Also, the compiler 808 may include a last block modeler 1226 that, when executed by the processor 804 (
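The division of labor among the spawn block modeler, target block modeler, and last block modeler described above can be sketched as a dispatch loop over the key basic blocks of the trace. All names and the dispatch structure here are illustrative assumptions:

```python
# Hypothetical dispatch sketch: route each key basic block of the trace
# to the target, spawn, or last-block modeler described above.
def analyze_trace(trace, spawn_points, target_points,
                  target_block_modeler, spawn_block_modeler,
                  last_block_modeler):
    for i, block in enumerate(trace):
        if block in target_points:
            target_block_modeler(block)   # cf. block 412 processing
        if block in spawn_points:
            spawn_block_modeler(block)    # cf. block 416 processing
        if i == len(trace) - 1:
            last_block_modeler(block)     # cf. block 420 processing

# Key blocks and events from the worked example on trace 900b.
events = []
analyze_trace(["D", "G", "I", "K", "M", "N"],
              spawn_points={"D", "K"},
              target_points={"G", "I", "M"},
              target_block_modeler=lambda b: events.append(("target", b)),
              spawn_block_modeler=lambda b: events.append(("spawn", b)),
              last_block_modeler=lambda b: events.append(("last", b)))
print(events)
```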
While particular embodiments of the present invention have been shown and described, it will be obvious to those skilled in the art that changes and modifications can be made without departing from the present invention in its broader aspects. The appended claims are to encompass within their scope all such changes and modifications that fall within the true scope of the present invention.