This application is based upon and claims the benefit of priority from Japanese Patent Application No. 2008-170975, filed Jun. 30, 2008, the entire contents of which are incorporated herein by reference.
1. Field
One embodiment of the invention relates to an information processing apparatus, a program execution method, and a storage medium storing a computer program for parallel processing.
2. Description of the Related Art
In order to increase the processing speed of a computer, multithreading is used to execute a plurality of processes in parallel. In a conventional multithreaded program for parallel execution, a plurality of threads are created, and the programming must take into account that the threads execute while synchronizing with one another. For example, to maintain the correct execution order, code for guaranteeing synchronization must be inserted at various points in the program, which makes the program difficult to debug and increases the maintenance cost.
As an example of such a program for parallel execution, there is the multithread execution method described in Jpn. Pat. Appln. KOKAI Publication No. 2005-258920. This document discloses a method for realizing parallel execution, when a plurality of interdependent threads are created (e.g., thread 1 can be executed only after completion of thread 2), on the basis of the execution results of the threads and the interdependence between them.
In this method, the interdependence between the threads must be hard-coded in the program. Hence there have been problems that the program lacks flexibility in allowing changes to be made, that describing the synchronization management between the threads is difficult, and that scalability in the number of processors is difficult to obtain.
A general architecture that implements the various features of the invention will now be described with reference to the drawings. The drawings and the associated descriptions are provided to illustrate embodiments of the invention and not to limit the scope of the invention.
Various embodiments according to the invention will be described hereinafter with reference to the accompanying drawings. In general, according to one embodiment of the invention, an information processing apparatus comprises a storage storing program modules and a parallel execution control description describing relationships of the program modules; a conversion module extracting a part relating to a program module from the parallel execution control description and creating graph data structure creation information including preceding and succeeding information of the program module; an adding module extracting the graph data structure creation information to which input data is given, creating a node, and adding the created node to a formerly created graph data structure; and an execution module subjecting the graph data structure to at least one of depth-first search and breadth-first search with a restricted breadth, selecting one node from nodes stored in a node memory, and executing the program module corresponding to the selected node.
Processors 100i (i=1, 2, . . . ) for realizing parallel processing, a main memory 101, and a hard disk drive (HDD) 102 are connected to an internal bus 103. Each of the processors 100i interprets program code stored in storage devices such as the main memory 101 and the HDD 102, and executes processing described in advance as a program. It is assumed here that three processors 100i of equal throughput are provided. However, the processors need not be identical; processors that differ from each other in throughput, or that process different types of code, may be included.
The main memory 101 is a storage device formed of a semiconductor memory such as a DRAM. Programs to be executed by the processors 100i are read, prior to the processing, into the main memory 101, which is accessible at a relatively high speed, and are accessed from the processors 100i as program execution proceeds.
Although the HDD 102 can store a larger amount of data than the main memory 101, it is slower to access in many cases. Program code to be executed by the processors 100i is stored in advance in the HDD 102, and only the parts to be executed are read into the main memory 101.
The internal bus 103 is a common bus interconnecting the processors 100i, the main memory 101, and the HDD 102 so that they can exchange data.
Further, although not shown, an image display device for displaying processing results, or an input/output device such as a keyboard for inputting processing data, may be provided.
Next, an outline of a program for parallel execution according to this embodiment will be described below.
The programs are not executed independently of each other: when the processing result of another program is used, or in order to secure consistency of data, a program must in some cases wait for completion of a specific part of processing. When programs having such a feature are executed in parallel, contrivances for acquiring the execution states of the other programs must be embedded at various parts of the programs. By embedding such contrivances (also called synchronous processing), data security and exclusive control are realized between the programs, and cooperative operation is obtained.
For example, when a predetermined event occurs during the processing of program 300, program 301 is requested to perform predetermined processing (event 303). Upon receipt of event 303, program 301 executes the predetermined processing and, when a predetermined condition is established, further issues an event 304 to program 302. Program 301 returns the result of the processing that program 300 requested by means of event 303 to program 300 as an event 305.
However, when a description for synchronizing the parallel processes is included in the program itself, considerations not connected with the original program logic become necessary, making the program complicated. Also, resources are wasted while waiting for the processing of other programs to complete. In addition, subsequent program modification often becomes difficult, for example when the processing efficiency varies greatly because of a slight deviation in timing.
In contrast, in this embodiment, a program is divided into basic modules (also called serial execution modules) and a parallel execution control description. A basic module is executable on the sole condition that its input data has been given, irrespective of the execution states of the other programs, and is executed serially, without synchronous processing. The parallel execution control description describes the parallel-processing relationships among a plurality of basic modules by means of graph data structure creation information, with each basic module being a node. By describing the parts that require synchronization or delivery of data in the parallel execution control description, the basic modules can be turned into reusable components, and the parallel execution control description can be managed compactly, as sketched below.
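The following minimal sketch, written in Python purely for illustration (the patent specifies neither a language nor a concrete description format, so all names here are hypothetical), shows the idea: the basic modules are plain serial functions with no synchronization of their own, while a separate control description records which module consumes which data.

```python
# Hypothetical illustration of the division into basic modules and a
# parallel execution control description; the format is invented.

# Basic modules: serial functions that run once their input data exists.
def decode(frame_bytes):
    return frame_bytes.decode("utf-8")

def transform(text):
    return text.upper()

def encode(text):
    return text.encode("utf-8")

# Parallel execution control description: synchronization and data
# delivery live here, outside the modules, as producer/consumer relations.
CONTROL_DESCRIPTION = [
    {"module": "decode",    "inputs": ["frame_bytes"], "output": "text"},
    {"module": "transform", "inputs": ["text"],        "output": "upper_text"},
    {"module": "encode",    "inputs": ["upper_text"],  "output": "out_bytes"},
]
```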
Program 400 executes thread 402, and program 401 executes thread 407. When thread 402 has been executed up to a point 406, program 400 must deliver its processing result to program 401. Thus, upon completion of the execution of thread 402, program 400 notifies program 401 of the processing result as an event 404. Only after both event 404 and the processing result of thread 407 have been obtained can program 401 execute the next thread 405. Meanwhile, upon completion of the execution of thread 402, program 400 executes the part of the program subsequent to point 406 as thread 403.
As described above, programs 400 and 401 contain parts in which processing can be advanced unconditionally, such as threads 402 and 407; points at which a processing result to be notified to the other thread is obtained during execution, such as point 406; and points at which processing can be started only on condition that a processing result from the other thread has been obtained.
Thus, as shown in the drawings, the program is divided at such points, and the interdependence among the divided parts is expressed by the parallel execution control description.
Basic modules 200j (j=1, 2, . . . ) constitute a program to be executed by the system according to this embodiment. Each of the basic modules 200j can receive one or more parameters 198 and can adjust its execution load on the basis of the parameter values, for example by changing the algorithm to be applied or by changing a threshold or coefficient used in the algorithm, as sketched below.
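As an illustration, a basic module might expose such a load-tuning parameter as follows; the function name and the meaning of the parameter are assumptions of this sketch, not taken from the patent.

```python
# Hypothetical basic module whose execution load follows a parameter
# (parameter 198 in the text): the parameter selects the algorithm.
def smooth(samples, quality=1):
    if quality >= 2:
        # heavier algorithm: exact median of the samples
        return sorted(samples)[len(samples) // 2]
    # cheaper algorithm: arithmetic mean as an approximation
    return sum(samples) / len(samples)
```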
The parallel execution control description 201 includes data to be referred to at the time of execution. It indicates the interdependence among the basic modules 200j.
As for the translator 202, in addition to performing the conversion in advance, before the execution of the basic modules 200, a method is conceivable in which translation is executed successively by a run-time task or the like while the basic modules are being executed.
The software on the information processing apparatus 203 at the time of execution is constituted of the basic modules 200j, the run-time library 206 (which stores the graph data structure creation information 204), a multithread library 208, and an operating system 210.
The run-time library 206 includes an application program interface (API) and the like used when the basic modules 200j are executed on the information processing apparatus 203, and also provides the exclusive access control needed when the basic modules 200j undergo parallel execution. Alternatively, the configuration may be such that the function of the translator 202 is called from the run-time library 206, and the parallel execution control description 201 of the part to be executed next is converted each time the function is called during execution of a basic module 200. With this configuration, a resident task for translation becomes unnecessary, and the parallel processing can be made more compact.
The operating system 210 manages the whole system, such as the hardware of the information processing apparatus 203 and task scheduling. By introducing the operating system 210, the programmer is liberated from various kinds of system management, can concentrate on programming, and can also develop software that runs on general types of apparatus.
In the information processing apparatus according to this embodiment, the program is divided at a part requiring synchronous processing or data delivery, and the matters associated with the division are defined as the parallel execution control description, whereby it is possible to promote the conversion of the basic module into components, and compactly manage the parallel processing definition. The execution load of each basic module converted into a component can be dynamically adjusted.
As shown in the drawings, each basic module 200 is expressed as a node 600 of the graph data structure, and the output of a node 600 is delivered to its succeeding nodes through a connector 602.
The connector 602 carries identification information indicating what data is output from the node 600 after processing. A succeeding node can determine, on the basis of the identification information of the connector 602 and the parallel execution control description 201, whether or not the conditions enabling the node itself to be executed have been fulfilled.
When the run-time library 206 regards the conditions for enabling a node 600 to be executed as fulfilled, the ID (or basic module ID) of the node 600 is stored in an executable pool 603 in units of nodes.
The information on the links to the preceding nodes defines the conditions for a node to be precedent to the node concerned. For example, definitions such as a node outputting data of a predetermined type, or a node having a specific ID, are conceivable.
The graph data structure creation information 204 serves as information for expressing the corresponding basic module 200 as a node, and as information for adding that node to the existing graph data structure. A possible layout is sketched below.
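One possible in-memory layout of this information is sketched below in Python; the field names are illustrative assumptions, since the text only states that the information includes output buffer types and links to preceding nodes.

```python
from dataclasses import dataclass, field

# Hypothetical layout of one graph-data-structure-creation-information
# record and of a node built from it.
@dataclass
class CreationInfo:
    module_id: str           # basic module 200 that the node will execute
    input_types: list        # data types accepted from preceding nodes
    output_buffer_type: str  # type of output buffer to secure before execution

@dataclass
class Node:
    info: CreationInfo
    preceding: list = field(default_factory=list)   # links to preceding nodes
    succeeding: list = field(default_factory=list)  # links to succeeding nodes
    done: bool = False                               # execution completion flag
```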
When this flow has been executed, a node whose preceding nodes have all completed execution, i.e., a node executable at that time, is created on the basis of the graph data structure creation information 204 and stored in the executable pool 603.
The run-time library 206 managing the multithreading accepts input data which becomes the object to be executed (block B01).
The run-time library 206 sets up the operating environment in such a manner that the library 206 is called from each core to execute multithreading. This makes it possible to treat the parallel program not as a model in which the run-time processing operates independently, but as a model in which each core operates independently, and to keep the amount of synchronization waiting in the parallel processing small by keeping the overhead of the run-time processing small. If, instead, the operating environment were configured so that the basic modules are called from a single run-time task, complicated switching between the task executing a basic module and the run-time task would occur, and hence the overhead would increase.
The run-time library 206 determines whether or not input data is present (block B02). When input data is not present (No), the series of the processing flow is terminated.
When input data is present (Yes) in block B02, the graph data structure creation information 204 to which the input data is input is extracted, thereby acquiring the information 204 (block B03).
The output data of a basic module 200 is classified in advance into a plurality of types, which are described in the output buffer types of the graph data structure creation information 204. In extracting the graph data structure creation information 204 to which the input data is input, it suffices to extract the information whose expected input data type, described in the information on the link to the preceding node of the graph data structure creation information 204, coincides with the type of the given input data.
Then, a node corresponding to the graph data structure creation information 204 acquired in block B03 is created (block B04).
Here, when a plurality of graph data structure creation information items 204 are extracted, a node corresponding to each of the information items 204 is created.
The created node is then added to the existing graph data structure (block B05). The existing graph data structure mentioned here is a structure obtained by structuring the interdependence precedent to and subsequent to each created node, as illustrated in the drawings.
Then, it is determined whether or not the processing of all the nodes corresponding to the preceding nodes of the added node has been completed (block B06).
When the processing is completed (Yes) with respect to all the preceding nodes of a certain node, the conditions for starting to execute the node 600 are regarded as being fulfilled, and the node is stored in the executable pool 603 (block B07).
On the other hand, when there is a preceding node for which the processing is not completed yet (No), the processing of the node itself cannot be started, and the flow is terminated.
As described above, even when a node is created, the basic module corresponding to the node is not immediately executed; the processing is reserved until the interdependence with the other nodes of the graph data structure is satisfied. A sketch of this node-adding flow follows.
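The sketch below walks through blocks B01 to B07, reusing the hypothetical CreationInfo/Node classes from the earlier sketch; the matching rule (comparing an output buffer type against input types) is a simplification of the type matching described above.

```python
# Sketch of blocks B01-B07, assuming the CreationInfo/Node sketch above.
def accept_input(data_type, creation_infos, graph, executable_pool):
    # B02/B03: extract creation information matching the input data type.
    matches = [ci for ci in creation_infos if data_type in ci.input_types]
    for info in matches:
        node = Node(info)                              # B04: create a node
        for other in graph:                            # B05: link into the graph
            if other.info.output_buffer_type in info.input_types:
                node.preceding.append(other)
                other.succeeding.append(node)
        graph.append(node)
        # B06/B07: store in the executable pool only when every
        # preceding node has completed its processing.
        if all(p.done for p in node.preceding):
            executable_pool.append(node)
```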
A node to be executed next is selected, on the basis of a predetermined condition, from the nodes stored in the executable pool 603 that have already become executable (block B11).
The predetermined condition can be based on a point of reference such as the oldest stored node, a node having many succeeding nodes, or a node with high cost.
The cost of each node may be obtained by the following calculation.
Cost of an added node = (α × past average execution time) + (β × amount of usage of output buffer) + (γ × number of succeeding nodes) + (δ × execution frequency at nonscheduled time)
In general, it is conceivable that starting processing from nodes of higher cost increases the throughput of the parallel processing. Here, the execution frequency at nonscheduled time is the frequency at which a state appears, during the execution of a basic module, in which no node is stored in the executable pool 603. This state means that an underflow of the executable pool 603 has occurred, which degrades the degree of parallelism and hence is undesirable. The cost of a basic module 200 executing at such a time is calculated higher, so that the basic module is executed earlier; an effect of avoiding the bottleneck can thereby be expected.
For each of the coefficients α to δ of the linear cost-calculating formula, a predetermined value may be used, or the coefficients may be changed dynamically while the state of the processing is observed, as in the sketch below.
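The cost formula translates directly into code. In this sketch the coefficient values are arbitrary placeholders; as noted above, they may be fixed in advance or adapted at run time.

```python
# Sketch of the cost formula; the coefficient values are arbitrary.
ALPHA, BETA, GAMMA, DELTA = 1.0, 0.5, 2.0, 4.0

def node_cost(avg_exec_time, output_buffer_usage,
              num_succeeding, underflow_frequency):
    return (ALPHA * avg_exec_time           # past average execution time
            + BETA * output_buffer_usage    # amount of usage of output buffer
            + GAMMA * num_succeeding        # number of succeeding nodes
            + DELTA * underflow_frequency)  # execution frequency at nonscheduled time
```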
An example of acquisition of a node will be described later.
When a node to be executed next is acquired, an output buffer in which the processing result of the node is to be stored is secured before the execution (block B12).
The output buffer is secured on the basis of the definition of the output buffer type defined by the graph data structure creation information 204.
When the output buffer has been secured, one or more parameter values receivable by the basic module are set on the basis of the performance information obtained and preserved at the time of the last execution of the basic module corresponding to this node (block B13). Execution of the basic module 200 corresponding to this node is then started (block B14).
Further, when the processing of the basic module 200 is completed, the performance information is acquired and preserved (block B15), and the execution completion flag of the node concerned in the graph data structure is set to processing-completed (block B16).
In block B15, a set of the parameter of the basic module 200 for which the processing has been completed, and the execution time is recorded as performance information.
Then, it is determined whether or not all the succeeding nodes included in the graph data structure of the node concerned are processing-completed (block B17). When all the succeeding nodes are processing-completed (Yes), the node can be deleted from the graph data structure (block B18). At this time, the output data of the node is no longer used, and hence the output buffer secured in block B12 is released. Conversely, when any succeeding node is still processing-uncompleted, the output data of the node may still be used by the basic module of that succeeding node, and hence the node must not be deleted from the graph data structure.
Then, it is determined, with respect to each of the nodes included in the graph data structure, whether or not all the preceding nodes of the node are processing-completed (block B19). When there is a node whose preceding nodes are all processing-completed (Yes), the node is regarded as having fulfilled the execution start conditions and is stored in the executable pool 603 (block B20).
When even one of the preceding nodes is processing-uncompleted (No), the determination is performed again when the processing of that preceding node is completed. The whole execution flow is sketched below.
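The following sketch strings blocks B11 to B20 together for one run-time iteration. It reuses the hypothetical Node sketch above, further assuming that each node carries a cost value and a perf_history list and that its creation information holds a runnable module; the output buffer is modeled as a plain dict purely for illustration.

```python
import time

# Sketch of blocks B11-B20 for one iteration of the run-time processing.
def execute_one(executable_pool, graph):
    node = max(executable_pool, key=lambda n: n.cost)     # B11: select by cost
    executable_pool.remove(node)
    node.buffer = {"type": node.info.output_buffer_type}  # B12: secure buffer
    # B13: set parameters from preserved performance information
    # (here simplified to reusing the most recent parameter set).
    params = node.perf_history[-1][0] if node.perf_history else {}
    t0 = time.perf_counter()
    node.info.module(**params)                            # B14: execute module
    node.perf_history.append((params, time.perf_counter() - t0))  # B15
    node.done = True                                      # B16: completion flag
    if all(s.done for s in node.succeeding):              # B17/B18: delete node,
        node.buffer = None                                # releasing its buffer
        graph.remove(node)
    for n in graph:                                       # B19/B20: promote nodes
        if (not n.done and n not in executable_pool
                and all(p.done for p in n.preceding)):
            executable_pool.append(n)
```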
As described above, when the run-time processing accepts an input, a list of “sets of a node and a connection destination” is created, and the graph data structure is updated accordingly.
In the basic module selection processing and the update processing of the graph data structure, exclusive control becomes necessary. However, this is performed by the run-time processing, and hence the parallel program designer need not be conscious of the exclusive control.
The basic module does not include synchronous processing, and hence is executed serially to the last; when the execution is completed, the flow returns to the run-time processing.
Next, a method of selecting a basic module to be executed in block B11 will be described below.
In this embodiment, the parallel processing is constituted of basic modules to be executed serially, and run-time processing for assigning the basic modules to a plurality of processors in order. A reduction in the processing time of the run-time processing is desired, and this processing time depends on the occurrence of cache errors. Accordingly, by observing the occurrence of cache errors and determining, on the basis of the observation result, to which processor the node to be executed next should be assigned, the run-time processing time can be shortened.
Although this embodiment does not limit the memory hierarchy of the system, it is assumed, for convenience of explanation, that the system includes a cache memory hierarchy of three stages: an L1 cache for each CPU, an L2 cache shared among the CPUs, and the main memory.
When a certain CPU has completed processing of a certain node, there are two methods of searching for a node to be executed next, i.e., depth-first search and breadth-first search.
The breadth-first search proceeds level by level: starting from the highest-level node, nodes as close as possible to the highest-level node are searched first, and the search descends level by level until an unexecuted node of the lowest level is reached. The depth-first search, on the other hand, descends along one branch of the tree structure: at each node it is determined whether an unexecuted child node is present, and when one is, the search turns back at that child node, thereby eventually reaching an unexecuted node. Both orders are illustrated generically below.
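The two orders can be illustrated with textbook traversals over a tree given as a mapping from each node to its children; this is generic code, not the patent's exact search procedures.

```python
from collections import deque

def breadth_first(tree, root):
    order, queue = [], deque([root])
    while queue:
        node = queue.popleft()
        order.append(node)                # visit nodes level by level
        queue.extend(tree.get(node, []))
    return order

def depth_first(tree, root):
    order = [root]                        # descend one branch fully,
    for child in tree.get(root, []):      # then turn back and try the next
        order.extend(depth_first(tree, child))
    return order

tree = {"A": ["B", "C"], "B": ["D", "E"], "C": ["F"]}
print(breadth_first(tree, "A"))  # ['A', 'B', 'C', 'D', 'E', 'F']
print(depth_first(tree, "A"))    # ['A', 'B', 'D', 'E', 'C', 'F']
```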
In order to explain the synchronous processing of the cache concretely, it is assumed that depth-first search is performed in a graph data structure with interdependence such as that shown in the drawings.
Assuming that nodes B and C have already been assigned to CPUs #1 and #2, respectively, the next node F is assigned to CPU #1, which has completed the processing of node B. However, in nodes C and F, which have close interdependence, it is common that the data areas to be referred to overlap each other, and hence the data areas required by CPUs #1 and #2 overlap. In this case, the synchronous processing performed between the L1 caches and the L2 cache becomes frequent, and the processing efficiency is lowered.
Next, it is assumed that breadth-first search is performed in the graph data structure indicating the same interdependence.
Assuming that nodes B and C have already been assigned to CPUs #1 and #2, respectively, the next node J is assigned to CPU #1, which has completed the processing of node B. There is hardly any interdependence between nodes C and J, and hence the data areas required by CPUs #1 and #2 may not be able to be contained together in the L2 cache. In this case, the synchronous processing between the L2 cache and the main memory is frequently performed, and the processing efficiency is lowered.
Thus, in this embodiment, breadth-first search is performed with the breadth restricted by setting an upper limit on the return position, so that the node to be executed next is neither too close to nor too far from the node just executed. A sketch of such a search follows.
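One way to realize a breadth-restricted search is sketched below: from the node just finished, climb at most a limited number of levels toward the root, then search breadth-first under that ancestor. The parent attribute and the overall helper structure are assumptions of this sketch, not the patent's procedure.

```python
from collections import deque

# Sketch of breadth-first search with a restricted return position,
# assuming the Node sketch above extended with a `parent` link.
def next_node(finished, limit):
    anchor = finished
    for _ in range(limit):               # restricted return position:
        if anchor.parent is None:        # climb at most `limit` levels
            break
        anchor = anchor.parent
    queue = deque([anchor])
    while queue:                         # ordinary breadth-first below anchor
        n = queue.popleft()
        if not n.done and all(p.done for p in n.preceding):
            return n                     # nearby node: data likely still in L2
        queue.extend(n.succeeding)
    return None                          # nothing runnable within the limit
```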
If the occurrence frequency of the synchronous processing between the L2 cache and the main memory could be detected, the upper limit of the return position in the breadth-first search could be determined on the basis of the detection result. However, at present, it is difficult to detect this occurrence frequency directly. Thus, by profiling the processing performance while adaptively changing the upper limit of the return position, the occurrence frequency of the synchronous processing can in effect be detected.
In block B34, processing of one processing unit is started; when the data of the object to be processed is image data, the processing unit is, for example, image data of one frame. In block B36, the CPU clock counter is started. In block B38, the node to be executed next is determined by the breadth-first search in which the upper limit of the return position is set to the current value of variable i.
When the processing of the one frame of image data is completed in block B40, the CPU clock counter is stopped in block B42, and the counted value T(i) is recorded in block B44. It is determined in block B46 whether or not variable i has reached its maximum value. If not, variable i is incremented in block B48, and the flow returns to block B34. When the variable has reached the maximum value, the minimum value of the counted values T(i) is detected in block B50, and the corresponding value of i is adopted as the upper limit layer. This is because the upper limit yielding the shortest processing time in actual processing can be judged to minimize the frequency of the synchronous processing performed between the L2 cache and the main memory. At this time, the frequency of the synchronous processing performed between the L1 cache and the L2 cache can also be judged to be minimal. This profiling flow is sketched below.
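The profiling loop of blocks B34 to B50 reduces to trying each candidate upper limit on one processing unit and keeping the fastest. In this sketch, process_one_unit is a hypothetical callback that processes one unit (e.g., one frame) with the given upper limit.

```python
import time

# Sketch of the profiling flow (blocks B34-B50).
def choose_upper_limit(process_one_unit, max_i):
    t = {}
    for i in range(1, max_i + 1):             # B34-B48: one unit per candidate
        start = time.perf_counter()           # B36: start the clock counter
        process_one_unit(upper_limit=i)       # B38/B40: run with this limit
        t[i] = time.perf_counter() - start    # B42/B44: record T(i)
    return min(t, key=t.get)                  # B50: minimum T(i) -> upper limit
```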
As described above, according to this embodiment, when execution of a certain basic module has been completed and the node to be executed next is searched for, performing breadth-first search whose breadth is restricted by limiting the return position prevents the processing efficiency from being lowered by the synchronous processing performed between the L1 cache and the L2 cache, while also suppressing the occurrence of cache errors in the L2 cache. The processing time can therefore be minimized on the basis of the correlation between the processing and the data areas to be accessed, which enhances the performance of the whole processing.
Although the above description has been given of node search based on the breadth-first search, situations also exist in which the depth-first search can be carried out without any problem. When all the processing has been completed with respect to a certain node and the nodes on which it depends, selecting the node to be executed next by depth-first search prevents the processing efficiency from being lowered by the synchronous processing performed between the L1 cache and the L2 cache.
A data area to which a brother node refers is, in many cases, close to the data area to which the original node has referred.
Although the above-mentioned search method is based on one of the breadth-first search and the depth-first search, the optimum method may be determined by trial and error by combining both of them. More specifically, the processing load (for example, the processing time) may be measured while the two search methods are switched, and the method giving the smaller load may be adopted.
Next, although not a node search method, a method of improving the overall performance by updating the graph data structure is also conceivable.
As described above, according to this embodiment, the parallel processing can be divided into serial execution parts (basic modules) including neither synchronous processing nor exclusive processing, and a parallel designation part describing the parallel operations. It is therefore possible to improve the descriptiveness of the parallel program, to change the program easily at the time of performance tuning, and to reduce the maintenance cost. Further, by means of the run-time processing that efficiently operates the parallel program prepared in this way, parallel execution performance scalable with the number of processors can be obtained. The run-time task independently selects executable basic modules 200 and successively updates the graph data structure, whereby the parallel processing is performed; the series of processing therefore need not be considered by the application program. Further, a basic module 200 does not include a part from which another task branches off, and hence no arbitration with other executing tasks need be considered. Moreover, a contrivance capable of dynamically adjusting the execution load of each program in accordance with the situation at each time is also realized.
Accordingly, it is possible to provide a programming environment which allows programs to be created without taking parallel processing into account, and which enables flexible execution of parallel processing by multithreading.
As has been described above, according to the present invention, it is not necessary to hard-code the interdependence between threads; hence the invention is excellent in the flexibility of program change, and the description of the synchronous processing between the threads is facilitated. Further, scalability with the number of processors is easily obtained.
While certain embodiments of the inventions have been described, these embodiments have been presented by way of example only, and are not intended to limit the scope of the inventions. Indeed, the novel methods and systems described herein may be embodied in a variety of other forms; furthermore, various omissions, substitutions and changes in the form of the methods and systems described herein may be made without departing from the spirit of the inventions. The various modules of the systems described herein can be implemented as software applications, hardware and/or software modules, or components on one or more computers, such as servers. While the various modules are illustrated separately, they may share some or all of the same underlying logic or code. The accompanying claims and their equivalents are intended to cover such forms or modifications as would fall within the scope and spirit of the inventions.