The present invention relates to a technique for processing a sequential processing program with a parallel processor system in parallel, and more particularly, to a method and a device that generate a parallelized program from a sequential processing program.
As a method of processing a single sequential processing program in parallel in a parallel processor system, there has been known a multi-threading method (see, for example, patent documents 1 to 5 and non-patent documents 1 and 2). In the multi-threading method, a sequential processing program is divided into instruction streams called threads, which are executed in parallel by a plurality of processors. A parallel processor that executes multi-threading is called a multi-threading parallel processor. In the following, a description will be given first of conventional multi-threading methods and then of a related program parallelizing method.
Generally, in a multi-threading method in a multi-threading parallel processor, to create a new thread on another processor is called “forking”. A thread which executes a fork is referred to as “parent thread”, while a newly generated thread is referred to as “child thread”. The program location where a thread is forked is referred to as “fork source address” or “fork source point”. The program location at the beginning of a child thread is referred to as “fork destination address”, “fork destination point”, or “child thread start point”.
In the aforementioned patent documents 1 to 4 and the non-patent documents 1 and 2, a fork command is inserted at the fork source point to instruct the forking of a thread. The fork destination address is specified in the fork command. When the fork command is executed, a child thread that starts at the fork destination address is created on another processor, and the child thread is then executed. A program location where the processing of a thread is to be ended is called a terminal (term) point, at which each processor finishes processing the thread.
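The fork model described above can be sketched with ordinary software threads. In the following minimal Python sketch, the `fork` helper, the `child_body` function, and the label `L_child_start` are illustrative assumptions of this description, not part of any of the cited documents; the point is only that a fork command names a fork destination address at which the child thread begins while the parent thread continues toward its terminal point.

```python
import threading

def child_body(start_label):
    # Hypothetical child-thread body beginning at the fork destination address.
    return f"child started at {start_label}"

results = []

def fork(fork_dest):
    # A fork command: create a child thread (here, on "another processor")
    # that begins at the fork destination address, and start it running.
    t = threading.Thread(target=lambda: results.append(child_body(fork_dest)))
    t.start()
    return t

# The parent thread reaches the fork source point and executes the fork command.
child = fork("L_child_start")
# The parent continues its own instructions up to its terminal (term) point.
child.join()       # the child finishes at its own terminal point
print(results[0])
```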
In contrast, according to a multi-threading method in a multi-threading parallel processor, as shown in
As shown in
There is another multi-threading method, as shown in
There is a commonly known method that can be used in the case where no processor on which to create a child thread is available when a processor is to execute a fork command. That is, the processor waits to execute the fork command until a processor on which a child thread can be created becomes available. In addition, the patent document 4 describes another method in which the processor invalidates or nullifies the fork command, continues to execute the instructions subsequent to the fork command, and then executes the instructions of the child thread.
To implement the multi-threading of the fork-one model, in which a thread creates a valid child thread at most once in its lifetime, the technique disclosed in the non-patent document 1, for example, places restrictions on the compilation for creating a parallelized program from a sequential processing program so that every thread becomes command code that performs a valid fork only once. In other words, the fork-once limit is statically guaranteed on the parallelized program. On the other hand, according to the patent document 3, one fork command to create a valid child thread is selected, from among a plurality of fork commands in a parent thread, during the execution of the parent thread, to thereby guarantee the fork-once limit at the time of program execution.
For a parent thread to create a child thread such that the child thread performs predetermined processing, the parent thread is required to pass to the child thread at least the register values, in the register file at the fork point of the parent thread, that are necessary for the child thread. To reduce the cost of data transfer between the threads, in the patent document 2 and the non-patent document 1, a register value inheritance mechanism used at thread creation is provided through hardware. With this mechanism, the contents of the register file of a parent thread are entirely copied into a child thread at thread creation. After the child thread is created, the register values of the parent and child threads are changed or modified independently of each other, and no data is transferred therebetween through registers.
As another conventional technique concerning data passing between threads, there has been proposed a method as disclosed in the non-patent document 2. In this method, the register value inheritance mechanism is provided through hardware, and a required register value is transferred between threads when a child thread is generated and after the child thread is generated. Further alternatively, there has also been proposed a parallel processor system provided with a mechanism to individually transfer a register value of each register by a command.
In the multi-threading method, basically, preceding threads whose execution has been determined are executed in parallel. However, in actual programs, it is often the case that not enough threads whose execution has been determined can be obtained. Additionally, the parallelization ratio may be low due to dynamically determined dependencies, limitations of the analytical capabilities of the compiler, and the like, so that desired performance cannot be achieved. Accordingly, in the patent document 1, control speculation is adopted to support the speculative execution of threads through hardware. In the control speculation, threads with a high possibility of execution are speculatively executed before their execution is determined. A thread in the speculative state is temporarily executed to the extent that its execution can be cancelled via hardware. The state in which a child thread performs temporary execution is referred to as the temporary execution state. When a child thread is in the temporary execution state, its parent thread is said to be in the temporary thread creation state. In a child thread in the temporary execution state, writing to the shared memory and the cache memory is restrained, and data is written to a temporary buffer additionally provided.
When it is confirmed that the speculation is correct, the parent thread sends a speculation success notification to the child thread. The child thread reflects the contents of the temporary buffer in the shared memory and the cache memory, and then returns to the ordinary state in which the temporary buffer is not used. The parent thread changes from the temporary thread creation state to the thread creation state.
On the other hand, when failure of the speculation is confirmed, the parent thread executes a thread abort command “abort” to cancel the execution of the child thread and subsequent threads. The parent thread changes from the temporary thread creation state to the non-thread creation state. Thereby, the parent thread can create a child thread again. That is, in the fork-one model, although thread creation can be carried out only once, if control speculation is performed and the speculation fails, a fork can be performed again. Also in this case, only one valid child thread is produced.
A description will now be given of the technique to generate a parallel program for a parallel processor to implement the multi-threading.
Then, in a fork point determining step, a combination of fork points indicating optimal parallel execution performance is determined with an iterative improvement method with respect to the selected sequential processing program (see, for example, paragraph 0154 of the patent document 6). At this time, the above-described inter-instruction dependency is maintained while changing only the combination of the fork points, without exchanging the instruction sequences. In other words, this is a technique in which the dependency is maintained in units of a plurality of instructions. Such a unit of a plurality of instructions corresponds to an element into which the sequential execution trace, obtained when the sequential processing program is sequentially executed with the input data, is divided with all the terminal point candidates as division points. Lastly, in a fork inserting step, a fork command for parallelization is inserted to generate a parallelized program 25 divided into a plurality of threads.
However, according to the related program parallelizing apparatus, the parallel execution time may not be shortened as expected, and the time required to determine the parallelized program also becomes longer. This point will be described hereinafter in detail.
(1) According to the program parallelizing apparatus shown in
In
After execution of the basic block B1, the control moves to the basic block B2, where the function calling instruction L3 is executed, and thereafter the control moves to the basic block B3. This control flow is shown by solid arrows. In this program, there is a dependency by the data flow in which the data (r3) defined by the instruction L1 is referred to by the instruction L2. Further, there is a dependency by the data flow in which the data (memory data stored at an address r2) defined by the instruction L2 is referred to by the instruction L5. When there is a dependency by the data flow from one instruction X to one instruction Y, it is assumed that the instruction Y must be executed no earlier than the time obtained by adding an execution delay time to the execution time of the instruction X, and that the execution delay time of every instruction is one cycle.
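The one-cycle delay rule can be expressed as a simple earliest-start computation. The sketch below merely mirrors the example in the text (L1 defines r3, L2 refers to it, L5 refers to the memory data defined by L2); the helper `earliest_start` and the start cycle of L1 are illustrative assumptions.

```python
# Earliest-start rule under a data-flow dependency X -> Y:
# start(Y) >= start(X) + delay(X), with a delay of 1 cycle for every instruction.
DELAY = 1  # execution delay time of all instructions (one cycle)

def earliest_start(start_of_x, delay=DELAY):
    # Instruction Y may execute no earlier than X's start plus X's delay.
    return start_of_x + delay

start_L1 = 0                         # assume L1 starts at cycle 0
start_L2 = earliest_start(start_L1)  # L2 refers to r3 defined by L1
start_L5 = earliest_start(start_L2)  # L5 refers to memory defined by L2
print(start_L2, start_L5)
```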
The same thing can be said about a program parallelizing apparatus shown in
In summary, according to the related program parallelizing apparatus, since only a partial analysis is performed for an instruction in one function and an instruction of a function group of a descendant of the function in a function calling graph, a schedule in which the parallel execution time becomes undesirably long may be determined.
(2) The second problem of the related program parallelizing apparatus is that the determination process takes a longer time when it is attempted to obtain a parallelized program with a shorter parallel execution time. For example, there are two reasons for this in the program parallelizing apparatus shown in
The present invention has been made in view of such a circumstance, and an exemplary object of the present invention is to provide a program parallelizing method and a program parallelizing device that enable efficient generation of a parallelized program with shorter parallel execution time.
According to the present invention, parallelization of a program is performed by scheduling instructions by referring to inter-instruction dependency. In summary, the inter-instruction dependency between a first instruction group including at least one instruction and a second instruction group including at least one instruction is analyzed, and instruction scheduling of the first instruction group and the second instruction group is executed by referring to the inter-instruction dependency. A schedule whose execution time is shorter can thereby be obtained.
According to one exemplary embodiment, when the first instruction group is correlated with a lower level of the second instruction group, the instruction scheduling of the first instruction group is executed, and thereafter the instruction scheduling of the second instruction group is executed by referring to the inter-instruction dependency. For example, this case includes when the second instruction group includes a calling instruction that calls the first instruction group.
When the instruction scheduling of the second instruction group is executed after executing the instruction scheduling of the first instruction group, information of the inter-instruction dependency is preferably added to the calling instruction included in the second instruction group, and thereafter the instruction scheduling of the second instruction group is executed. This is because it is possible to refer to the inter-instruction dependency added to the calling instruction in scheduling the second instruction group.
According to another aspect of the present invention, each of the first instruction group and the second instruction group forms a strongly connected component including at least one function that includes at least one instruction. It is especially preferable to repeat the analysis of the instruction dependency and the scheduling a plurality of times for a strongly connected component of a form in which functions depend on each other. In summary, a) the instruction scheduling is executed for each function included in one strongly connected component, b) the instruction dependency with other functions is analyzed for each function, and c) a) and b) are repeated, with respect to each strongly connected component, a specified number of times set in accordance with the form of the strongly connected component.
According to one exemplary embodiment of the present invention, the execution cycle and the execution processor of the instruction are analyzed for dependency between an instruction in one function and an instruction of a function group of a descendant of the function in a function calling graph, and parallelization is performed with the analysis result. Accordingly, it is possible to realize parallel processing while keeping the dependency between an instruction in one function and an instruction of a function group of a descendant of the function, whereby the parallelized program with shorter parallel execution time can be generated.
According to the present invention, the inter-instruction dependency is referred to in scheduling the instructions, whereby a schedule whose execution time is shorter can be obtained. For example, the dependency between an instruction in one function and an instruction of a function group of a descendant of the function in a function calling graph is analyzed, and parallelization is executed with the analysis result, whereby an instruction in one function and an instruction of a function group of a descendant of the function can be executed in parallel.
Further, according to the present invention, a search for a combination of fork points is not performed in parallelization. The extremely large number of available candidates for the combination of the fork points makes it difficult to perform high-speed program parallelization, as stated above. However, as the search for the combination of the fork points is not performed in the present invention, it is possible to generate the parallelized program with shorter parallel execution time at high speed.
Hereinafter, a program parallelizing method according to the first exemplary embodiment of the present invention will be described with reference to
According to the present invention, parallelization of a program is executed with reference to inter-instruction dependency. Especially, according to the first exemplary embodiment of the present invention, an execution cycle and an execution processor of instructions are determined based on dependency between an instruction in one function and an instruction of a function group of a descendant of the function in a function calling graph, so as to produce a parallelized program.
However, in this description, the following is assumed for the sake of clarity. A function f0 is a function that is not called by other functions, and functions fp and fq lie at the two ends of the function group of its descendants. In this example, an instruction Lp_k of the function fp is a calling instruction of the function fq. Further, as one example, it is assumed that there is a dependency of data flow in which a result of an instruction L0_r of the function f0 is referred to by an instruction Lq_i of the function fq, and a result of an instruction Lq_j of the function fq is referred to by an instruction Lp_l of the function fp. In summary, a dashed arrow whose source (instruction of the start point) is the instruction Lq_j of the function fq and whose destination (instruction of the end point) is the instruction Lp_l of the function fp indicates the inter-instruction dependency between the instruction Lq_j and the instruction Lp_l, and a dashed arrow whose source is the instruction L0_r of the function f0 and whose destination is the instruction Lq_i of the function fq indicates the inter-instruction dependency between the instruction L0_r and the instruction Lq_i. Note that these inter-instruction dependencies are merely examples for description, and inter-instruction dependency may exist between any other functions. Further, the inter-instruction dependency includes not only the dependency by data reference but also the dependency by a branch instruction or the like.
As shown in
Now, scheduling of an instruction means to decide a processor and a cycle (execution time) where the instruction is executed. In other words, it means to decide in which position of the schedule space, designated by a cycle number and a processor number, the instruction should be allocated. Here, “schedule space” means a space indicated by a coordinate axis of the cycle number indicating the execution time and a plurality of processor numbers. As the number of processors is limited, however, it is necessary either to limit the processor numbers of the schedule space, or, without limiting them, to use as the processor number for execution the residue obtained by dividing the schedule-space processor number by the actual number of processors.
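The second option just described (leaving the schedule-space processor number unlimited and taking a residue at execution time) can be sketched as follows. The processor count of four and the instruction placements are illustrative assumptions of this sketch.

```python
# A schedule-space position is a (cycle, processor) pair. When the schedule
# space is not limited to the actual processor count, the processor used at
# execution time is the residue of the schedule-space processor number
# divided by the actual number of processors.
NUM_PROCESSORS = 4  # assumed actual processor count (illustrative)

def physical_processor(schedule_space_proc, num_procs=NUM_PROCESSORS):
    return schedule_space_proc % num_procs

# Hypothetical placements: instruction -> (cycle number, schedule-space processor).
schedule = {"L1": (0, 0), "L2": (1, 0), "L3": (0, 5)}
mapped = {i: (c, physical_processor(p)) for i, (c, p) in schedule.items()}
print(mapped["L3"])  # schedule-space processor 5 runs on physical processor 1
```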
Further, “relative schedule” here means a schedule indicating an increase from a basis, namely the processor number and the execution cycle at which the function (the function fq, in this embodiment) starts execution. Although the schedule of the instructions of the function fq in step S2 is determined by referring to the existing inter-instruction dependency, only the relative positional relation in the schedule space is determined for these instructions Lq. This is because, as the function fq is called by the function calling instruction Lp_k of the function fp, the schedule of the instructions of the function fq cannot be determined unless the schedule of the instruction Lp_k is determined. Thus, in this example, unless the schedule of the final function f0 is determined, the schedules of the instructions of the function group of its descendants are not determined.
Then, the inter-instruction dependency between the instruction Lq_j and the instruction Lp_l is referred to, and the relative schedule of the instructions of the function fp is determined so as to meet the scheduling condition, that is, to realize the shortest instruction execution time as a whole while keeping the inter-instruction dependency (step S3). At this time, the inter-instruction dependency between the instruction L0_r and the instruction Lq_i is carried over to the function calling instruction Lp_k of the function fp, and is referred to, as in step S3, when scheduling the functions of the ancestors of the function fp. As such, steps S2 and S3 are recursively executed up to the function f0. Finally, the schedule of the instructions of the function f0 is determined, and the schedules of the instructions of all the functions are determined.
The schedules thus determined satisfy the scheduling condition to realize the shortest instruction execution time and to keep the inter-instruction dependency. If this scheduling condition is generalized, (a) the dependency between the instruction in the function f and the instruction of the function group of the descendant of the function f in the function calling graph is satisfied, and (b) the whole execution time of the instructions in the function f and in the function group of its descendant becomes the shortest.
Note that the program parallelizing method according to the first exemplary embodiment may be implemented by executing the program parallelizing program on the program control processor, or may be implemented by hardware.
Although the functions fp and fq are shown as the function groups of the descendants of the function f0 in
Next, a case will be described in which the first exemplary embodiment is applied to the input program of
The control in this case is such that the basic block B1 is executed, thereafter the operation moves to the basic block B2, where the function calling instruction L3 is executed, and thereafter the operation moves to the basic block B3. This control flow is shown by solid arrows. Further, as there is an inter-instruction dependency by a data flow in which the data defined by the instruction L1 is referred to by the instruction L2, and an inter-instruction dependency by a data flow in which the data defined by the instruction L2 is referred to by the instruction L5, each of these inter-instruction dependencies is shown by a dashed arrow. When there is a dependency by the data flow from one instruction X to one instruction Y, the instruction Y must be executed no earlier than the time obtained by adding an execution delay time to the execution time of the instruction X, and the execution delay time of every instruction is one cycle.
As described above, the relative schedule has been completed in the function f2, and as a result, the instruction L4, the instruction L5, and the instruction L6 are arranged in one processor in this order (the cycle number and the processor number have not been determined).
According to the first exemplary embodiment, the information regarding the execution processor and the execution cycle of an instruction can be analyzed for the dependency between the instruction in one function and the instruction of the function group of the descendants of that function in the function calling graph. By this analysis, it can be seen that 1) there is a dependency from the instruction L2 to the instruction L5; 2) as the instruction L5 is executed through the function calling instruction L3, it suffices for the relation of the execution times between the instruction L2 and the instruction L3 to satisfy the dependency from the instruction L2 to the instruction L5; and 3) the function f2 starts execution one cycle later than the execution of the instruction L3, and the instruction L5 is executed on the same processor as the start point, one cycle later than the start.
Further, the function calling instruction L3 is determined to be arranged at the position (0, 1) from the shortest-execution-time condition (b) of the above scheduling constraint. As such, according to the first exemplary embodiment, the instruction L3 can be arranged in a cycle prior to the instruction L2. In execution, the processing is performed as shown in
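The placement described above can be checked numerically. In the sketch below, the cycles assigned to L1 and L2 are assumptions consistent with the one-cycle delay rule; the rest follows the description: L3 placed at position (cycle 0, processor 1), the function f2 starting one cycle after L3 on the same processor, and L4, L5, L6 arranged on one processor in consecutive cycles.

```python
# Positions are (cycle number, processor number) pairs.
placement = {"L1": (0, 0), "L2": (1, 0), "L3": (0, 1)}

call_cycle, call_proc = placement["L3"]
f2_start = (call_cycle + 1, call_proc)   # f2 starts one cycle after L3,
                                         # on the same processor
rel_f2 = {"L4": 0, "L5": 1, "L6": 2}     # relative schedule: consecutive cycles
for instr, delta in rel_f2.items():
    placement[instr] = (f2_start[0] + delta, f2_start[1])

# The dependency L2 -> L5 requires start(L5) >= start(L2) + 1 (one-cycle delay).
assert placement["L5"][0] >= placement["L2"][0] + 1
print(placement["L5"])
```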
As stated above, according to the present invention, the scheduling is executed in consideration of the dependency between the instruction in one function f and the instruction of the function group of the descendant of this function f in the function calling graph, whereby the instruction can be arranged in the appropriate time (cycle) and the processor to obtain the parallelized program with shorter parallel execution time.
As described above, in performing analysis of the dependency of a function, information of a function called by the function is needed, and therefore, the analysis is performed from deeper functions. However, the order of the analysis cannot be determined for the function group having interdependency by the mutual recursive call. Accordingly, the function group having such an interdependency is collectively analyzed as “strongly connected component” of the function calling graph.
According to the second exemplary embodiment of the present invention, in the strongly connected component that is formed of a function group having interdependency, a method is employed for determining the instruction schedule by performing analysis of the inter-instruction dependency in each function for a predetermined number of times. The “strongly connected component” in the second exemplary embodiment will be described first.
(Strongly Connected Component)
An algorithm for obtaining the strongly connected components is already known. For example, the vertices of the graph (corresponding to functions in this example) are first numbered in post-order, and thereafter a graph obtained by reversing all the directed sides of the original graph is created. Then, a depth-first search is started on the reversed graph at the vertex whose number is maximum, so as to create a tree from the traversed vertices. Then, the depth-first search is started at the vertex whose number is maximum among the vertices that have not yet been searched, so as to create a tree from the traversed vertices. This process is repeated, and each tree that is produced is a strongly connected component. Other algorithms include a method disclosed in pp. 195 to 198 of “Data Structures and Algorithms” (A. V. Aho et al., translated by Yoshio Ohno, Baifukan Co., Ltd., 1987). Next, specific examples of the function calling graph and the strongly connected component will be described.
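The numbering-and-reversal procedure just described is essentially Kosaraju's algorithm, and can be sketched as follows. The function names in the example call graph are illustrative assumptions; a production implementation would use an iterative depth-first search to avoid recursion limits.

```python
from collections import defaultdict

def strongly_connected_components(vertices, edges):
    # Kosaraju-style procedure as described above: number the vertices in
    # post-order on the original graph, reverse all directed edges, then
    # repeatedly run a depth-first search from the unvisited vertex with the
    # largest post-order number; each search tree is one strongly connected
    # component.
    graph, rgraph = defaultdict(list), defaultdict(list)
    for u, v in edges:
        graph[u].append(v)
        rgraph[v].append(u)

    order, seen = [], set()
    def post_order(u):
        seen.add(u)
        for v in graph[u]:
            if v not in seen:
                post_order(v)
        order.append(u)  # vertices appended in post-order
    for u in vertices:
        if u not in seen:
            post_order(u)

    comps, seen = [], set()
    def collect(u, comp):
        seen.add(u)
        comp.append(u)
        for v in rgraph[u]:  # search on the reversed graph
            if v not in seen:
                collect(v, comp)
    for u in reversed(order):  # largest post-order number first
        if u not in seen:
            comp = []
            collect(u, comp)
            comps.append(sorted(comp))
    return comps

# Mutually recursive functions f3 and f4 form one strongly connected
# component; f0, which calls f3, is a component by itself (names illustrative).
print(strongly_connected_components(
    ["f0", "f3", "f4"], [("f0", "f3"), ("f3", "f4"), ("f4", "f3")]))
```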
The control moves to the basic block B22 after executing the basic block B21, and moves to the basic block B23 after executing the function calling instruction in the basic block B22. Further, the instruction L24 of the basic block B23 is a conditional branch instruction, and the control moves to a basic block B25 or a basic block B26 in accordance with the condition. Further, the control moves to the basic block B26 after executing the function calling instruction in the basic block B24, and moves to a basic block B27 after executing the basic block B26. Further, the control moves to the basic block B23 after executing the function calling instruction in the basic block B27, and moves to the basic block B25 after executing the basic block B24. Each control flow is shown by a solid arrow.
Such a function calling relation is shown in
The sequential processing intermediate program 302 is created by a program analyzing apparatus which is not shown, and is represented as a graph. For example, the sequential processing intermediate program 302 is a program in which the functions, the basic blocks, and the dependencies thereof shown in
The inter-instruction dependency information 304 is information of inter-instruction dependency and information related to it. The inter-instruction dependency information 304 is, for example, information regarding inter-instruction dependency shown by dotted arrows in
The dependency analyzing/scheduling unit 102 includes a function internal/external dependency analyzing unit 103 and an instruction scheduling unit 104. The function internal/external dependency analyzing unit 103 analyzes the inter-instruction dependency by referring to the inter-instruction dependency information 304. In short, the dependency between an instruction in one function f and an instruction of the function group of the descendants of the function f in the function calling graph is analyzed. According to the analyzed dependency, the instruction scheduling unit 104 determines the execution time and the execution processor of the instructions, determines the execution order of the instructions so as to realize the determined execution time and execution processor, and inserts the fork command. The parallelization intermediate program 306 is thus registered in the storage device 305.
Note that the processing apparatus 101 is an information processing apparatus such as a central processing unit (CPU), and the storage devices 301, 303, and 305 are storage devices such as magnetic disk units. The program parallelizing apparatus 100 may be realized by a program and a computer such as a personal computer or a workstation. The program is recorded in a computer-readable recording medium such as a magnetic disk, is read out by the computer when it is activated, and controls the operation of the computer so as to realize function means such as the dependency analyzing/scheduling unit 102 on the computer. For example, the processing apparatus may be configured as shown in
The strongly connected component extracting unit 203 extracts the strongly connected components from the input sequential processing intermediate program 302, and numbers the functions in such a way that smaller numbers are assigned to deeper functions. For example, in the function calling graph shown in
Although described later in detail, the scheduling/dependency analysis count managing unit 204 manages the number of times of execution of the dependency analysis and the scheduling of the strongly connected component in accordance with the dependency form of the function that forms the strongly connected component.
The source/destination function internal/external dependency analyzing unit 205 refers to the inter-instruction dependency information 304, as described above, and analyzes the dependency between the instruction in one function f and the instruction of the function group of the descendants of the function f in the function calling graph. According to the analyzed dependency, the instruction scheduling unit 206 determines the execution time and the execution processor of the instructions, determines the execution order of the instructions so as to realize the determined execution time and execution processor, and inserts the fork command.
Note that a device that generates the inter-instruction dependency information 304 may also be provided. In the following, the inter-instruction dependency information generating circuit will be described in brief.
The schedule region forming unit 101.2 refers to the control flow analysis result and the profile data of the sequential processing program, so as to determine the schedule region which will be a unit of the instruction schedule.
The register data flow analyzing unit 101.3 refers to the control flow analysis result and the schedule region determined by the schedule region forming unit 101.2 to analyze the data flow in accordance with the reading or writing of the register.
The inter-instruction memory data flow analyzing unit 101.4 refers to the control flow analysis result and the profile data of the sequential processing program to analyze the data flow in accordance with the reading or writing of a memory address.
The analysis result of the data flow in accordance with the reading or writing of the register and the memory obtained by the register data flow analyzing unit 101.3 and the inter-instruction memory data flow analyzing unit 101.4 is output to the dependency analyzing/scheduling unit 102 as the inter-instruction dependency information 304, and the control flow analysis result and the schedule region are output as the sequential processing intermediate program 302 to the dependency analyzing/scheduling unit 102.
First, the strongly connected component extracting unit 203 refers to the sequential processing intermediate program 302 to obtain the strongly connected components of the function calling graph. Next, the strongly connected components of the function calling graph are processed in a specific order. For example, in order to prevent a component that has already been processed from being processed again, all the strongly connected components are first marked as unselected, and each processed one is then marked as selected. As such, in a specific order, an unselected one among the strongly connected components of the function calling graph is set as a strongly connected component s (step S101). The order for selecting the strongly connected components is determined in such a way that one function forming each strongly connected component is selected, and the component whose function has the smaller post-order index value is processed first.
Next, an unselected one among the functions that form the strongly connected component s is set as a function f in a specific order (step S102). As the order of the functions that form the strongly connected component s, for example, the function having the smaller index value applied in the pre-order of the function calling graph may be processed first.
Then, the instruction scheduling unit 206 performs instruction scheduling for each function. More specifically, the execution time and the execution processor of each instruction are determined for each schedule region in the function, and the execution order of the instructions is determined so as to realize the execution time and the execution processor thus determined. A fork command is then inserted, and the result is stored in a memory which is not shown (step S103).
Next, the controller 201 judges whether all the functions of the strongly connected component s have been scheduled (step S104), and when there is a function that has not been scheduled (No in step S104), the control returns to step S102.
If the scheduling of all the functions included in the selected strongly connected component s is completed (Yes in step S104), the controller 201 instructs the source/destination function internal/external dependency analyzing unit 205 to execute the function internal/external dependency analysis regarding the source (step S105) and the function internal/external dependency analysis regarding the destination (step S106) of the directed sides that show the dependencies of the strongly connected component s. The function internal/external dependency analysis regarding the source will be described in detail with reference to
Then, the scheduling/dependency analysis count managing unit 204 judges whether the repeat count of the loop from step S102 to step S106 has reached a value specified for the strongly connected component s (step S107). If the repeat count has not reached the specified value (No in step S107), the scheduling/dependency analysis count managing unit 204 marks all the functions that form the strongly connected component s as unselected (step S108), and the control returns to step S102. The analysis from step S102 to step S106 is performed repeatedly because, when there is interdependency by recursive call or mutual recursive call among the functions that form the strongly connected component s, the results of the dependency analysis and the scheduling in one function need to be employed in the dependency analysis and the scheduling in the other functions. The repeat count can be set to once or a plurality of times according to the form of the strongly connected component s in the function calling graph. For example, when there is a directed side between the functions that form the strongly connected component s in the function calling graph, the repeat count may be set to a plurality of times (four times, for example). The repeat count may likewise be set to a plurality of times (four times, for example) when only one function forms the strongly connected component s and this function performs a self recursive call. The repeat count may be set to once in other cases. Alternatively, the repeat count may be set to four times when the strongly connected component s represents a loop, for example, and to once in other cases. By thus repeating the analysis and the scheduling, it is possible to respond to changes in the position of the dependency destination instruction caused by the scheduling, and to obtain a better schedule with respect to a strongly connected component representing a loop.
When the repeat count reaches the specified value (Yes in step S107), it is judged whether all the strongly connected components have been searched (step S109). If there is a strongly connected component that has not been searched (No in step S109), the control returns to step S101. When all the strongly connected components have been searched (Yes in step S109), the dependency analysis and the schedule processing are terminated.
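The overall driver loop of steps S101 to S109, together with the repeat-count policy just described, can be outlined as below. The callback names (`schedule_fn`, `src_dep_fn`, `dst_dep_fn`) are hypothetical stand-ins for the instruction scheduling unit 206 and the dependency analyzing unit 205, and the logging demonstration is illustrative only.

```python
def repeat_count(scc, graph):
    """Repeat-count policy from the text: iterate several times (four,
    for illustration) when the component contains a cycle, i.e. a
    directed side between its member functions or a self recursive
    call; iterate once otherwise."""
    for f in scc:
        if any(g in scc for g in graph.get(f, [])):
            return 4
    return 1

def analyze_and_schedule(sccs, graph, schedule_fn, src_dep_fn, dst_dep_fn):
    """Driver corresponding to steps S101-S109 (hypothetical callbacks)."""
    for scc in sccs:                                 # S101: next component
        for _ in range(repeat_count(scc, graph)):    # S107/S108 repeat loop
            for f in scc:                            # S102: next function
                schedule_fn(f)                       # S103: instruction scheduling
            src_dep_fn(scc)                          # S105: source-side analysis
            dst_dep_fn(scc)                          # S106: destination-side analysis

# Hypothetical demonstration: callee f12 is processed before caller f11.
log = []
calls = {"f11": ["f12"], "f12": []}
analyze_and_schedule([{"f12"}, {"f11"}], calls,
                     log.append,
                     lambda s: log.append(("src", frozenset(s))),
                     lambda s: log.append(("dst", frozenset(s))))
print(log)
```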
Next, the function internal/external dependency analyzing processing regarding the source executed by the source/destination function internal/external dependency analyzing unit 205 (step S105) will be described in detail.
In
Next, the source/destination function internal/external dependency analyzing unit 205 performs function internal/external dependency analysis regarding the source for each function (step S202). The detail will be described with reference to
The controller 201 judges whether all the functions that form the strongly connected component which is the processing target have been searched (step S203), and when there is a function that has not been searched (No in step S203), the control returns to step S201. When all the functions have been searched (Yes in step S203), it is judged whether the repeat count of the processing loop from step S201 to step S203 has reached a specified value (step S204). If the repeat count has not reached the specified value (No in step S204), all the functions that form the strongly connected component s are marked as unselected (step S205), and the control returns to step S201.
The analyzing processing from step S201 to step S203 is performed repeatedly because, as described above, there is interdependency by recursive call or mutual recursive call between the functions that form the strongly connected component s. The repeat count may be set to once or a plurality of times in accordance with the form of the strongly connected component s in the function calling graph. For example, when there is a directed side between the functions that form the strongly connected component s in the function calling graph, the repeat count may be set to a plurality of times (four times, for example). The repeat count may likewise be set to a plurality of times (four times, for example) when only one function forms the strongly connected component s and this function performs a self recursive call. The repeat count may be set to once in other cases. Alternatively, when the strongly connected component represents a loop and the repeat count of this loop is known, the repeat count may be set to the repeat count of that loop.
When the repeat count has reached the specified value (Yes in step S204), the function internal/external dependency analyzing processing regarding the source for each strongly connected component is completed.
Next, with reference to
First, it is judged whether there is an unselected one among the instructions of the function that is the processing target (step S301), and when there is none (No in step S301), the control moves to step S307 described below. When there is an unselected one (Yes in step S301), in a specified order, an unselected one among the instructions of the function that is the processing target is set as an instruction i (step S302). The order of the addresses of the instructions may be used, for example, as the order of selection of the instructions.
Then, it is judged whether there is an unselected one among the directed sides of the dependencies where the instruction i is the source (step S303), and when there is none (No in step S303), the control returns to step S301. For example, when the function fq is the strongly connected component s in
When there is an unselected one (Yes in step S303), in a specified order, an unselected one among the directed sides of the dependencies where the instruction i is the source is set as a directed side e (step S304). Any order may be employed as the order of selection of the directed sides.
Next, the directed side e is duplicated, and the source of the duplicated directed side is replaced with the node representing the function of the processing target (step S305). Then, the relative values of the execution processor number and the execution time of the instruction i, with the start time of the function of the processing target as a basis, are added to the relative values of the execution processor number and the execution time regarding the source added to the directed side (step S306). The specific operation of the processing of step S306 will be made clear in the description with reference to
Note that the directed sides of the dependencies regarding the data flow where the source is the node representing the function may be represented as a table for each function, since the number of registers is known in advance. This table uses a register number as an index, and holds, as its content, the delay time of the source instruction and the relative values of the execution processor number and the execution time regarding the source added to the directed side. By using a table representation, the memory capacity that is used can be made smaller than in a case in which a list representation is employed.
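The register-indexed table described in the preceding paragraph might look like the following sketch; the record fields and the register count of 32 are illustrative assumptions, not values given in the document.

```python
from dataclasses import dataclass
from typing import List, Optional

NUM_REGISTERS = 32  # known in advance, as the text notes

@dataclass
class SourceEntry:
    """Per-register record for a dependency side whose source has been
    lifted to the function node (field names are illustrative)."""
    delay: int     # delay time of the defining (source) instruction
    rel_time: int  # relative execution time regarding the source
    rel_proc: int  # relative processor number regarding the source

def make_source_table() -> List[Optional[SourceEntry]]:
    # One fixed-size slot per register number: O(1) lookup and a
    # smaller, predictable footprint than a linked-list representation.
    return [None] * NUM_REGISTERS

table = make_source_table()
table[3] = SourceEntry(delay=1, rel_time=0, rel_proc=0)  # e.g. r3 defined in the callee
print(table[3])
```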
Next, it is judged whether there is an unselected one among the function calling instructions that call the function of the processing target (step S307), and when there is none (No in step S307), the function internal/external dependency analyzing processing regarding the source for each function is completed. When there is an unselected one (Yes in step S307), in a specified order, an unselected one among the function calling instructions that call the function of the processing target is set as a function calling instruction c (step S308).
Next, it is judged whether there is an unselected one among the duplicated directed sides (step S309), and when there is none (No in step S309), the control returns to step S307. When there is an unselected one (Yes in step S309), in a specified order, an unselected one among the directed sides is set as the directed side e (step S310).
Next, the directed side e is duplicated to create a directed side whose source is set to the instruction c (step S311), and the relative values of the execution processor number and the start time of the function of the processing target, with the execution time of the instruction c as a basis, are added to the relative values of the execution processor number and the execution time regarding the source added to the directed side (step S312). The specific operation of the processing of step S312 will be made clear in the description with reference to
Then, the control returns to step S309, and steps S310 to S312 are repeated until there is no unselected one among the duplicated directed sides.
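A possible sketch of the source-side processing of steps S301 to S312 is shown below. The edge representation, the scheduling map, and all names (`lift_sources`, `project_to_call_sites`, `g`, `i1`, `call_g`) are hypothetical; the sketch only shows the two relative-value additions the text describes.

```python
from dataclasses import dataclass, replace

@dataclass(frozen=True)
class Edge:
    src: str            # source node: an instruction or a function name
    dst: str            # destination node
    src_rel_time: int   # relative execution time regarding the source
    src_rel_proc: int   # relative processor number regarding the source
    delay: int          # delay time of the source instruction

def lift_sources(f, instr_sched, edges):
    """Steps S301-S306: duplicate each side whose source is an
    instruction of f, replace the source with the function node, and
    fold the instruction's (time, processor) offsets, relative to the
    start of f, into the side's source-side relative values."""
    lifted = []
    for e in edges:
        if e.src in instr_sched:              # instruction belongs to f
            t, p = instr_sched[e.src]         # relative to the start of f
            lifted.append(replace(e, src=f,
                                  src_rel_time=e.src_rel_time + t,
                                  src_rel_proc=e.src_rel_proc + p))
    return lifted

def project_to_call_sites(f, lifted, call_sites):
    """Steps S307-S312: for every call instruction c that calls f,
    duplicate each lifted side with c as its source, folding in the
    offset of f's start relative to c's execution."""
    out = []
    for c, (dt, dp) in call_sites.items():    # offsets of f's start vs. c
        for e in lifted:
            out.append(replace(e, src=c,
                               src_rel_time=e.src_rel_time + dt,
                               src_rel_proc=e.src_rel_proc + dp))
    return out

# Hypothetical function "g": instruction "i1" defines a value consumed
# outside g; g is called by instruction "call_g" and starts 1 cycle later.
edges = [Edge("i1", "consumer", 0, 0, delay=1)]
lifted = lift_sources("g", {"i1": (2, 1)}, edges)
final = project_to_call_sites("g", lifted, {"call_g": (1, 0)})
print(final)
```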
Next, the function internal/external dependency analyzing processing regarding the destination executed by the source/destination function internal/external dependency analyzing unit 205 (step S106) will be described in detail.
In
In the following, the function internal/external dependency analysis regarding the destination for each function is performed (step S402). The details thereof will be described in
The controller 201 judges whether all the functions that form the strongly connected component which is the processing target have been searched (step S403). When there is a function which has not been searched (No in step S403), the control returns to step S401. When all the functions that form the strongly connected component which is the processing target have been searched (Yes in step S403), it is judged whether the repeat count of the loop processing from step S401 to step S403 has reached a specified value (step S404). When the repeat count has not reached the specified value (No in step S404), all the functions that form the strongly connected component s are marked as unselected (step S405), and the control returns to step S401. The repeat count may be set to once or a plurality of times according to the form of the strongly connected component s in the function calling graph. For example, when there is a directed side between the functions that form the strongly connected component s in the function calling graph, the repeat count may be set to a plurality of times (four times, for example). The repeat count may likewise be set to a plurality of times (four times, for example) when only one function forms the strongly connected component s and this function performs a self recursive call. The repeat count may be set to once in other cases. Alternatively, when the strongly connected component represents a loop and the repeat count of this loop is known, the repeat count may be set to the repeat count of that loop.
When the repeat count of the loop has reached the specified value (Yes in step S404), the function internal/external dependency analyzing processing regarding the destination for each strongly connected component is completed.
Referring now to
First, it is judged whether there is an unselected one among the instructions of the function of the processing target (step S501), and if there is none (No in step S501), the control moves to step S507. If there is an unselected one (Yes in step S501), in a specified order, an unselected one among the instructions of the function of the processing target is set as an instruction i (step S502). The order of the addresses of the instructions may be used, for example, as the order of selection of the instructions.
Then, it is judged whether there is an unselected one among the directed sides of the dependencies where the instruction i is the destination (step S503), and when there is none (No in step S503), the control returns to step S501. When there is an unselected one (Yes in step S503), in a specified order, an unselected one among the directed sides of the dependencies where the instruction i is the destination is set as a directed side e (step S504). Any order may be employed as the order of selection of the directed sides.
Next, the directed side e is duplicated, and the destination of the duplicated directed side is replaced with the node representing the function of the processing target (step S505). The relative values of the execution processor number and the execution time of the instruction i, with the start time of the function of the processing target as a basis, are added to the relative values of the execution processor number and the execution time regarding the destination added to the directed side (step S506). This step S506 corresponds to operation op1 in
Note that, since the number of registers is known in advance, the directed sides of the dependencies regarding the data flow where the destination is the node representing the function may be represented as a table for each function. This table uses a register number as an index, and holds, as its content, the relative values of the execution processor number and the execution time regarding the destination added to the directed side. By using a table representation, the memory capacity that is used can be made smaller than in a case in which a list representation is employed.
Next, it is judged whether there is an unselected one among the function calling instructions that call the function of the processing target (step S507). When there is none (No in step S507), the function internal/external dependency analyzing processing regarding the destination for each function is terminated. When there is an unselected one (Yes in step S507), in a specified order, an unselected one among the function calling instructions that call the function of the processing target is set as a function calling instruction c (step S508).
Next, it is judged whether there is an unselected one among the duplicated directed sides (step S509), and when there is none (No in step S509), the control returns to step S507. When there is an unselected one (Yes in step S509), in a specified order, an unselected one among the directed sides is set as the directed side e (step S510).
Then, the directed side e is duplicated to create a directed side whose destination is set to the instruction c (step S511), and the relative values of the execution processor number and the start time of the function of the processing target, with the execution time of the instruction c as a basis, are added to the relative values of the execution processor number and the execution time regarding the destination added to the directed side (step S512). This step S512 corresponds to operation op2 in
Then, the control returns to step S509, and steps S510 to S512 are repeated until there is no unselected one among the duplicated directed sides.
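The destination-side processing mirrors the source-side processing, differing only in which endpoint of the duplicated directed side is replaced. That common core of steps S505-S506 (and, on the source side, S305-S306) can be expressed as one generic helper; the dict-based edge representation and the helper name are illustrative assumptions.

```python
def lift_edge(edge, func, offset, end):
    """Generic form of steps S305-S306 / S505-S506: replace one endpoint
    ('src' or 'dst') of a duplicated dependency side with the function
    node and fold the instruction's (time, processor) offset into that
    endpoint's relative values. Edges are plain dicts here for brevity."""
    e = dict(edge)           # duplicate the directed side
    t, p = offset
    e[end] = func            # replace the endpoint with the function node
    e[end + "_rel_time"] += t
    e[end + "_rel_proc"] += p
    return e

# Destination side, using the numbers of the worked example in the text:
# instruction L16 runs 1 cycle after, and 1 processor over from, the
# start of function f12.
e = {"src": "L12", "dst": "L16", "dst_rel_time": 0, "dst_rel_proc": 0}
lifted = lift_edge(e, "f12", offset=(1, 1), end="dst")
print(lifted["dst"], lifted["dst_rel_time"], lifted["dst_rel_proc"])
```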
The specific example of the schedule processing and the dependency analysis shown in
The control moves to the basic block B12 after the basic block B11 is executed. After the function calling instruction L13 in the basic block B12 is executed, the control moves to the basic block B13. This control flow is shown by solid arrows. Further, in this example, since the instruction L16 needs to be executed after the instruction L12 is executed, the dependency by this data flow is shown by a dashed arrow.
By analyzing the register data flow and the memory data flow, a directed side that shows the dependency of the data flow from the instruction L12 to the instruction L16 is created. It is assumed that the relative value of the execution time regarding the source added to the directed side of the dependency is zero, the relative value of the execution processor is zero, and the delay time is one, which is the delay time of the instruction L12. The relative value of the execution time regarding the destination is assumed to be zero, and the relative value of the execution processor is assumed to be zero.
As shown in
Next, the schedule processing and the dependency analysis with respect to the specific example shown in
First, in step S101 of
In step S103, the relative instruction schedule of the function f12 is executed. The term "relative schedule" means a schedule that indicates the amounts of increase from a basis, namely the processor number and the execution cycle at which the function (the function f12 in this example) started execution.
Since all the functions that form the strongly connected component have been scheduled in this example (Yes in step S104), the operation moves to step S105 to perform the function internal/external dependency analysis regarding the source for each strongly connected component. In this example, no directed side of dependency is added in step S105, and thus the explanation will be omitted.
Next, in step S106, the function internal/external dependency analysis regarding the destination for each strongly connected component is performed. This point will be described with reference to
First, as the strongly connected component that is selected is formed only of the function f12, the function f12 is selected in step S401 of
As all the instructions of the function f12 are unselected in step S501 of
As there is a directed side of dependency whose destination is the instruction L16, the directed side e of the dependency from the instruction L12 to the instruction L16 is selected in steps S503 and S504. Then, in step S505, the directed side e is duplicated to create a directed side of dependency from the instruction L12 to the function f12.
Next, in step S506, the relative value of the execution processor number and the relative value of the execution time of the instruction L16, with the start time of the function f12 as a basis, are added to the relative values regarding the destination added to the directed side. The relative values regarding the destination added to the directed side are zero for both the execution time and the processor number, as shown in
Next, in step S503, it is judged whether there is an unselected one among the directed sides of the dependencies where the instruction L16 is the destination. As there is none, the control returns to step S501. Then, the instruction L17 is selected in steps S501 and S502. As there is no directed side of dependency where the instruction L17 is the destination in step S503, the control returns to step S501. It is judged in step S501 whether there is an unselected instruction, and as there is none, the control moves to step S507. In steps S507 and S508, the function calling instruction L13 that calls the function f12 is selected.
Then, in steps S509 and S510, the directed side of the dependency from the instruction L12 to the function f12 is selected, and the directed side is duplicated to create the directed side of the dependency from the instruction L12 to the instruction L13 in step S511.
Next, in step S512, the relative value of the execution processor number and the relative value of the start time of the function f12, with the execution time of the instruction L13 as a basis, are added to the relative values regarding the destination added to the directed side. In this example, it is assumed that the function f12 starts execution on the same processor one cycle after the execution of the instruction L13; thus, the execution processor 0 and the execution time 1 are added to the relative values (execution time 1, processor 1) regarding the destination added to the directed side. As a result, the operation op2 in
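The relative-value arithmetic of operations op1 and op2 in this example can be checked step by step (the variable names are illustrative; the numbers are those given in the text):

```python
# Relative values regarding the destination carried by the directed side
# from instruction L12 to instruction L16, as initially created.
rel_time, rel_proc = 0, 0

# op1 (step S506): L16 runs 1 cycle after, and 1 processor over from,
# the start of function f12.
rel_time, rel_proc = rel_time + 1, rel_proc + 1
assert (rel_time, rel_proc) == (1, 1)   # side now ends at the f12 node

# op2 (step S512): f12 starts 1 cycle after the call instruction L13,
# on the same processor (offset 0).
rel_time, rel_proc = rel_time + 1, rel_proc + 0
print(rel_time, rel_proc)  # relative values with L13 as the basis
```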
Next, in step S509, as there is no unselected one among the directed sides that are duplicated, the control is moved to step S507. As there is no unselected one among the function calling instructions that call for the function f12 in step S507, the function internal/external dependency analyzing processing regarding the destination for each function is completed.
Next, as all the functions of the strongly connected component that is formed of the function f12 have been searched in step S403 of
Next, it is judged in step S107 of
By executing the operations op1 and op2 shown in
Since the strongly connected component formed of the function f12 has already been selected, the strongly connected component formed of the remaining function f11 is selected in step S101. As the selected strongly connected component is formed only of the function f11, the function f11 is selected in step S102.
In step S103, the instruction schedule of the function f11 is executed. In the instruction schedule, as shown in
In determining the time and the processor at which the instruction L13 is arranged, the directed side of the dependency from the instruction L12 to the instruction L13 and the relative values (execution time 2, execution processor 1) added to the directed side are referred to. The relative values regarding the source added to the directed side mean the following. That is, the data defined by the instruction L12 becomes available at the time obtained by adding the delay time and the relative time regarding the source to the execution time of the instruction L12, and on the processor obtained by adding the relative processor number regarding the source to the execution processor of the instruction L12.
Further, the relative values regarding the destination added to the directed side mean the following. That is, the instruction L16 that refers to the data is executed at the time obtained by adding the relative time regarding the destination to the execution time of the instruction L13, and on the processor obtained by adding the relative processor number regarding the destination to the execution processor of the instruction L13.
Accordingly, the data defined by the instruction L12 becomes available in cycle 2, obtained by adding the delay time 1 and the relative time 0 regarding the source to cycle 1 in which the instruction L12 is executed, and on processor 0, obtained by adding the relative processor number 0 regarding the source to processor 0 on which the instruction L12 is executed.
Further, the instruction L16 is executed at the time obtained by adding the relative time 2 regarding the destination to the execution time of the instruction L13, and on the processor obtained by adding the relative processor number 1 regarding the destination to the execution processor of the instruction L13. It is only required that the execution time and the execution processor of the instruction L16 be a time and a processor at which the data defined by the instruction L12 can be obtained. In other words, it is only required that the time obtained by adding two to the execution time of the instruction L13 and the processor obtained by adding one to the execution processor of the instruction L13 be equal to or larger than cycle 2 and processor number 0, respectively. Under this condition, the instruction L13 is arranged at the smallest possible execution time.
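The placement condition derived above can be recomputed directly (variable names are illustrative; the numbers are those of the example):

```python
# Availability of the value defined by instruction L12:
l12_time, l12_proc, delay = 1, 0, 1       # L12 runs in cycle 1 on processor 0
src_rel_time, src_rel_proc = 0, 0         # source-side relative values
avail_time = l12_time + delay + src_rel_time   # cycle in which the data exists
avail_proc = l12_proc + src_rel_proc           # processor on which it exists

# Destination-side relative values on the side now ending at L13:
dst_rel_time = 2                          # L16 runs at L13.time + 2

# Earliest cycle for the call instruction L13 such that L16,
# executed at L13.time + dst_rel_time, sees the data:
l13_time = max(0, avail_time - dst_rel_time)
print(avail_time, avail_proc, l13_time)
```

With `l13_time` equal to 0, placing L13 in cycle 0 satisfies the dependency, which matches the schedule the text arrives at.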
By arranging the instruction L13 in the cycle 0 and the processor 1, execution of all the instructions is completed in four cycles.
On the other hand, according to the first exemplary example, since the dependency between the instruction L12 in the function f11 and the instruction L16 in the function f12 called by the function f11 is analyzed, the execution time of the parallelization schedule according to the present invention can be made shorter. More specifically, the processor and the time at which the data defined by the instruction L12 can be obtained, and the relative values that indicate how far the execution of the instruction L16 deviates from the execution time and the execution processor of the instruction L13 that calls the function f12, are analyzed; thereafter, the execution time and the execution processor of the instruction L13 that calls the function f12 are arranged using this analysis result. Accordingly, the execution time of the instruction L13 can be made earlier, and thus the start time of the function f12 can be made earlier.
Further, according to the first exemplary example, a search for combinations of fork points is not performed in parallelization. Although it is difficult to speed up program parallelization when the number of possible candidates for combinations of fork points is extremely large, no such search is performed in this exemplary example, and thus a parallelized program with a shorter parallel execution time can be generated at high speed.
Further, in the second exemplary example, the control flow analyzing unit 101.1, the schedule region forming unit 101.2, the register data flow analyzing unit 101.3, and the inter-instruction memory data flow analyzing unit 101.4 described in
In the storage device 401, the sequential processing program 402 in a machine instruction form generated by a sequential compiler which is not shown is stored. In the storage device 403, profile data 404 used in the process of converting the sequential processing program 402 into the parallelized program is stored. Further, the parallelized program 406 generated by the processing apparatus 101A is stored in the storage device 405. The storage devices 401, 403, and 405 are recording media such as magnetic disks.
The program parallelizing apparatus 100A according to the second exemplary example receives the sequential processing program 402 and the profile data 404 to generate the parallelized program 406 for a multi-threading parallel processor. Such a program parallelizing apparatus 100A can be implemented by a program and a computer such as a personal computer or a workstation. The program is recorded in a computer-readable recording medium such as a magnetic disk, and is read out by the computer when the computer is activated. By controlling the operation of the computer, the program realizes on the computer functional means such as the control flow analyzing unit 101.1, the schedule region forming unit 101.2, the register data flow analyzing unit 101.3, the inter-instruction memory data flow analyzing unit 101.4, the dependency analyzing/scheduling unit 102, the register allocating unit 101.5, and the program outputting unit 101.6.
The control flow analyzing unit 101.1 receives the sequential processing program 402 and analyzes the control flow. A loop may be converted into a recursive function by referring to this analysis result. By this conversion, each iteration of the loop can be parallelized.
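A loop-to-recursive-function conversion of the kind mentioned here can be illustrated as follows. The summation example is hypothetical and only shows the shape of the transformation, not the actual conversion performed by the control flow analyzing unit.

```python
# Original loop form (illustrative):
def sum_loop(xs):
    total = 0
    for x in xs:
        total += x
    return total

# The same computation rewritten as a recursive function, the form the
# text says a loop may be converted into so that each iteration becomes
# a call that could be forked as a thread:
def sum_rec(xs, total=0):
    if not xs:
        return total
    return sum_rec(xs[1:], total + xs[0])

print(sum_loop([1, 2, 3]), sum_rec([1, 2, 3]))
```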
The schedule region forming unit 101.2 refers to the analysis result of the control flow by the control flow analyzing unit 101.1 and the profile data 404 to determine the schedule region which will be the target of the instruction schedule that determines the execution time and the execution processor of the instruction.
The register data flow analyzing unit 101.3 refers to the analysis result of the control flow and the determination of the schedule region by the schedule region forming unit 101.2 to analyze the data flow in accordance with the reading or writing of the register.
The inter-instruction memory data flow analyzing unit 101.4 refers to the analysis result of the control flow and the profile data 404 to analyze the data flow in accordance with the reading or writing of one memory address.
The dependency analyzing/scheduling unit 102 refers, as described in the first exemplary example, to the analysis result of the register data flow by the register data flow analyzing unit 101.3 and the analysis result of the inter-instruction memory data flow by the inter-instruction memory data flow analyzing unit 101.4, so as to analyze the dependencies between instructions. In particular, the dependency analyzing/scheduling unit 102 analyzes the dependency between an instruction in one function and an instruction in the group of functions that are descendants of that function in the function calling graph. Then, as already stated, the dependency analyzing/scheduling unit 102 determines the execution time and the execution processor of each instruction according to the dependencies, determines the execution order of the instructions so as to realize the execution time and the execution processor thus determined, and inserts the fork command.
The register allocating unit 101.5 refers to the fork command and the execution order of the instructions determined by the instruction scheduling unit 104 to allocate the registers. The program outputting unit 101.6 refers to the result of the register allocating unit 101.5 to generate the executable parallelized program 406.
Next, the operation of the program parallelizing apparatus 100A according to the second exemplary example will be described. As the operation of the dependency analyzing/scheduling unit 102 has been described with reference to
First, the control flow analyzing unit 101.1 receives the sequential processing program 402 and analyzes the control flow. In the program parallelizing apparatus 100A, the sequential processing program 402 is represented in the form of a graph, as in the first exemplary example.
The schedule region forming unit 101.2 refers to the analysis result of the control flow by the control flow analyzing unit 101.1 and the profile data 404, and determines the schedule region which is the target of the instruction schedule that determines the execution time and the execution processor of instructions. The schedule region may be a basic block or may be a plurality of basic blocks, for example.
The register data flow analyzing unit 101.3 refers to the analysis result of the control flow and the determination of the schedule regions by the schedule region forming unit 101.2 to analyze the data flow in accordance with the reading or writing of registers. The analysis of the data flow may be performed only within a function, or may be performed across functions. The data flow is represented, as the inter-instruction dependency, by a directed side that connects the nodes representing the instructions. As already described, the relative value of the execution time regarding the source, the relative value of the execution processor number, and the delay time of the source instruction are added to the directed side. At this point, the relative value of the execution time is set to zero, the relative value of the processor number is set to zero, and the delay time is set to the delay time of the source instruction. The relative value of the execution time regarding the destination and the relative value of the execution processor number are also added to the directed side; at this point, both the relative value of the execution time and the relative value of the processor number are set to zero.
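The initialization of the relative values on a newly created directed side, as just described, might be sketched as follows (the class and field names are illustrative, not taken from the document):

```python
from dataclasses import dataclass

@dataclass
class DependencyEdge:
    """Directed side created when a register data flow is detected.
    As the text states, all four relative values start at zero and the
    delay is that of the defining (source) instruction."""
    src: str
    dst: str
    delay: int
    src_rel_time: int = 0
    src_rel_proc: int = 0
    dst_rel_time: int = 0
    dst_rel_proc: int = 0

def make_edge(def_instr, use_instr, delays):
    """Create a fresh side from a defining instruction to a using one."""
    return DependencyEdge(def_instr, use_instr, delay=delays[def_instr])

e = make_edge("L12", "L16", {"L12": 1})
print(e)
```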
The inter-instruction memory data flow analyzing unit 101.4 refers to the analysis result of the control flow and the profile data 404 to analyze the data flow in accordance with the reading or writing with respect to one memory address. The data flow is represented, as described above, by a directed side that connects the nodes representing the instructions, as the inter-instruction dependency.
The register allocating unit 101.5 allocates the registers with reference to the fork command and the execution order of the instructions determined by the instruction scheduling unit 104. The program outputting unit 101.6 refers to the result of the register allocating unit 101.5 to generate the executable parallelized program 406.
As such, the inter-instruction dependency information may be generated on the processing apparatus 101A, such as a program control processor, and the registers may be allocated to the parallelization intermediate program to output the executable parallelized program 406. Since the dependency analyzing/scheduling unit 102 is included as in the first exemplary example, a parallelized program with a shorter parallel execution time can be generated at high speed.
Note that the present invention is not limited to the above-described exemplary examples, and various additions or modifications can be made without changing the characteristics of the present invention. For example, the profile data 404 may be omitted in the second exemplary example.
The program parallelizing method and the program parallelizing apparatus according to the present invention are applied to a method and an apparatus that generate parallel programs having high execution efficiency, for example.
Number | Date | Country | Kind
---|---|---|---
2007-014525 | Jan 2007 | JP | national

Filing Document | Filing Date | Country | Kind | 371c Date
---|---|---|---|---
PCT/JP2007/072185 | 11/15/2007 | WO | 00 | 7/24/2009