This application is based upon and claims the benefit of priority of the prior Japanese Patent Application No. 2012-258186, filed on Nov. 27, 2012, the entire contents of which are incorporated herein by reference.
This invention relates to a technique for controlling a parallel computer.
Barrier synchronization is known as a method for synchronizing the processing executed by plural computation nodes in a parallel computer. Here, a computation node is a portion in a parallel computer, which executes a computational processing, and includes a CPU (Central Processing Unit) as a processor, or processor cores as processing units. The barrier synchronization is made possible by calling, by each computation node, a barrier function at a predetermined position within a program for a job. For example, in case of using a Message Passing Interface (MPI) library, it is possible to achieve the barrier synchronization by calling the MPI_Barrier function in a program for a job. Each of the computation nodes is unable to advance execution of the program for the job until all of the computation nodes in the parallel computer confirm that the barrier synchronization is complete.
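The blocking behavior described above can be illustrated with a minimal sketch. It is not an MPI program: as an assumption made only for illustration, threads stand in for the computation nodes and Python's standard `threading.Barrier` stands in for the MPI_Barrier function. The function name `run_with_barrier` and the log format are hypothetical.

```python
import threading


def run_with_barrier(num_nodes=4):
    """Run num_nodes threads that each log, wait at a barrier, then log again."""
    barrier = threading.Barrier(num_nodes)  # stand-in for MPI_Barrier
    log = []
    lock = threading.Lock()

    def node(rank):
        with lock:
            log.append(("before", rank))
        # No thread returns from wait() until all num_nodes threads have called it.
        barrier.wait()
        with lock:
            log.append(("after", rank))

    threads = [threading.Thread(target=node, args=(r,)) for r in range(num_nodes)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return log
```

Because every thread records its "before" entry prior to calling `wait()`, all "before" entries appear in the log ahead of any "after" entry, which mirrors the rule that no computation node advances until the barrier synchronization is complete.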
A following technique is known for execution of the program for the job in a parallel computer. More specifically, in a parallel computer, synchronization for re-execution of the program is performed based on the access history for a shared memory. After that, the program is executed again from a checkpoint with the shared memory and processor state information, which were reproduced based on recorded information.
However, no technique has been established for temporarily stopping a job and restarting the job later in a parallel computer in which the barrier synchronization is being performed. When the job is stopped during execution of the barrier synchronization, there is a possibility that the barrier synchronization will not be suitably performed after the job is restarted, and accordingly, advancement of the job will stop. Therefore, when there is an instruction from a user to stop the job during execution of the barrier synchronization, there is a problem in that the job cannot be stopped immediately, and stopping the job is put on hold until the barrier synchronization is completed.
A control method relating to this invention is executed by a first node among plural nodes included in a parallel computer. Then, the control method includes: (A) upon detecting that execution of a program for a job is stopped in each of the plural nodes, collecting information concerning a state of progress of barrier synchronization from each of the plural nodes; and (B) first determining a restart position of the program for the job in the first node, based on a stop position of the program for the job in the first node and the information collected from each of the plural nodes.
The object and advantages of the embodiment will be realized and attained by means of the elements and combinations particularly pointed out in the claims.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory and are not restrictive of the embodiment, as claimed.
In communication in a parallel computer, there is point-to-point data communication (here, this includes collective communication), and one-to-many communication that is used when performing the barrier synchronization. In the point-to-point data communication, the communication library, such as an MPI library, can confirm whether or not the other party received the transmitted data. Therefore, by stopping a job after confirming that the other party received the data and setting that stop position as the restart position, communication is performed without any problem after the job is restarted. On the other hand, when executing the barrier synchronization in one-to-many communication, the communication library is able to confirm the starting and ending of the barrier synchronization. However, the communication library is not able to confirm how far the barrier synchronization has progressed (in other words, the intermediate state of the barrier synchronization in progress).
In order to explain this in more detail,
Here, the case in which there is an instruction from the user to stop a job will be considered. When there was an instruction to stop a job and the computation nodes 1 to 6 have completed transmission of synchronization data from gate 1, the barrier synchronization at the computation nodes 1 to 6 is complete. However, when there is a computation node that has not finished transmitting synchronization data, one of the relay gates waits for the reception of the synchronization data from the computation node that has not finished transmitting the synchronization data, and synchronization data does not reach the ending gate. At this time, the communication library is not able to confirm the states of the barrier synchronization at the relay gates of the respective computation nodes. Therefore, it is not possible to know at which relay gate the barrier synchronization is stopped. Therefore, the communication library does not know from which position to restart the program of the job in order for the barrier synchronization to be suitably performed.
Therefore, in the following, a method for determining a restart position of the program for the job so that the barrier synchronization will be suitably performed after restarting the job will be explained for the case of temporarily stopping the job during execution of the barrier synchronization and restarting the job later.
Each of the computation nodes 1 to N has a motherboard N0 on which a CPU N1 and memory N2 are mounted. A network interface (abbreviated as NI in
Each of the network interfaces N4 is connected to a network switch 200, which is, for example, a layer-2 switch. Each computation node performs point-to-point communication with other computation nodes by receiving and transmitting data by way of the network interface N4. The network switch 200 relays communication data between the network interfaces of the computation nodes.
Each of the barrier interfaces N3 is connected with other barrier interfaces by way of the barrier network 100. The computation nodes 1 to N belong to the same barrier group, and as illustrated in
A program that is executed at the computation nodes 1 to N will be explained using
The collection unit 102 collects, from other computation nodes, information about the state of progress of the barrier synchronization, and stores the collected information in the data storage unit 104. The determination unit 103 uses the data that is stored in the data storage unit 104 to execute a processing for determining a restart position of the program for the job. The job manager 105 receives a stop instruction from the user to stop execution of the program for the job, and outputs a swap-out request to the resource manager 106. The job manager 105 also controls the job execution unit 107. When the resource manager 106 receives the swap-out request, the resource manager 106 activates the synchronization manager 101. The resource manager 106 executes a processing for releasing resources in the barrier interface 13, the network interface 14 and the like. The communication processing unit 108 is a communication library such as an MPI library, and executes a processing relating to communication.
Next, an operation of the system illustrated in
After receiving a stop instruction to stop execution of the program for the first job, the job manager 105 requests the job execution unit 107 to stop the first job. In response to this request, the job execution unit 107 stops the processing. The job manager 105 then stores information representing a stop position where the program for the first job was stopped, in the memory 12.
The job manager 105 outputs a swap-out request of the first job to the resource manager 106. The resource manager 106 transmits a swap-out request to the computation nodes 2 to N, which are the remaining computation nodes. The resource manager 106 in each of the computation nodes 1 to N invokes the synchronization manager 101 as a thread for executing the processing in this embodiment.
When the computation node 1 is performing communication with another computation node, the synchronization manager 101 causes the communication processing unit 108 to stop the communication. For example, the communication processing unit 108 stops the transmission of communication data by the network interface 14, and stops the transmission of synchronization data for performing the barrier synchronization. In addition, the synchronization manager 101 saves (in other words, swaps out) the information to be saved, which is stored in the memory of the network interface 14, in a storage device such as a hard disk. After performing the aforementioned processing, the synchronization manager 101 outputs a command to the communication processing unit 108 to enable communication between computation nodes. As a result, the resources of the network interface 14 can be used for communication.
Resources of the barrier interface 13 and the resources of the network interface 14 are independent, so as will be explained below, it is possible to use the resources of the network interface 14 in order to confirm the state of progress of the barrier synchronization.
The synchronization manager 101 then executes a processing to set the restart position in the program for the first job. This processing will be explained using
First, the collection unit 102 in the synchronization manager 101 collects, from the computation nodes 2 to N, information concerning the state of progress of the barrier synchronization by performing communication by way of the network interface 14 (
The information concerning the state of progress of the barrier synchronization includes information that represents whether or not synchronization data for performing the barrier synchronization has been transmitted, and a sequence number that represents the state of completion of the barrier synchronization. Information that represents whether or not synchronization data has been transmitted is either “R(1)” or “B(0)”. R(1) represents that the synchronization data has not yet been transmitted, and B(0) represents that synchronization data has been transmitted and it is in a waiting state for the completion of the barrier synchronization. The sequence number that represents the state of completion of the barrier synchronization is incremented by “1” each time the barrier synchronization is completed. The initial value is “0”. These kinds of information are updated by a firmware or the like in the barrier interface.
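The two items of progress information described above can be sketched as a small record. This is only an illustrative model, not the firmware of the barrier interface: the names `SyncDataState`, `BarrierProgress`, `transmit` and `complete` are hypothetical, and two points are assumptions, namely that a node starts in the R(1) state and that the state returns to R(1) when the barrier synchronization completes.

```python
from dataclasses import dataclass
from enum import Enum


class SyncDataState(Enum):
    R = 1  # synchronization data has not yet been transmitted
    B = 0  # synchronization data transmitted; waiting for completion


@dataclass
class BarrierProgress:
    node_id: int
    state: SyncDataState = SyncDataState.R  # assumed initial state
    sequence: int = 0  # initial value "0", as stated in the text

    def transmit(self):
        """Record that this node transmitted its synchronization data."""
        self.state = SyncDataState.B

    def complete(self):
        """Record completion: increment the sequence number by 1.

        Returning to R(1) here is an assumption of this sketch.
        """
        self.sequence += 1
        self.state = SyncDataState.R
```

For example, a node that transmits its synchronization data moves from R(1) to B(0) with the sequence number unchanged, and the sequence number becomes 1 only when the barrier synchronization completes.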
The collection unit 102 stores the information concerning the state of progress of the barrier synchronization, which was collected at the step S1, in the data storage unit 104.
For example, when data such as illustrated in
After the processing at the step S1 is first executed, the collection unit 102 collects information concerning the state of progress of the barrier synchronization, periodically, for example, and updates the data storage unit 104, until the barrier synchronization converges.
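The periodic collection described above can be sketched as a polling loop. This is a minimal sketch under stated assumptions: `collect` and `converged` are hypothetical callables supplied by the caller (the first returns the latest progress information, the second applies the convergence test), and the `interval` and `max_polls` parameters are additions of this sketch that do not appear in the embodiment.

```python
import time


def wait_until_converged(collect, converged, interval=0.0, max_polls=None):
    """Poll collect() until converged(snapshot) is true; return that snapshot.

    max_polls is a safety limit assumed for this sketch only.
    """
    polls = 0
    while True:
        snapshot = collect()  # corresponds to updating the data storage unit
        polls += 1
        if converged(snapshot):
            return snapshot
        if max_polls is not None and polls >= max_polls:
            raise TimeoutError("barrier synchronization did not converge")
        time.sleep(interval)
```

Separating the collection from the convergence test keeps the loop independent of how the progress information is represented.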
Returning to the explanation in
Functions that are called from inside the MPI_Barrier function and that are included in a communication library in a lower level than MPI will be explained using
In this embodiment, the barrier synchronization is considered to be executing while the MPI_Barrier function is called. When the stop position of the program for the job is inside the MPI_Barrier function, the position where the program for the job was stopped is specifically identified at step S3 (for example, a1, (1), a2, (2), a3 or the like).
Returning to the explanation of
When the transmission of the synchronization data is not finished (step S5: NO route), the determination unit 103 references a field of the convergence test in the data storage unit 104 to determine whether the barrier synchronization has converged (step S7). At the step S7, when a circle mark is set in the field of the convergence test, the determination unit 103 determines that the barrier synchronization has converged, and when “X” is set in the field of the convergence test, the determination unit 103 determines that the barrier synchronization has not converged.
When the barrier synchronization has not converged (step S7: NO route), the determination unit 103 repeats the determination until the data storage unit 104, as updated by the collection unit 102, indicates that the barrier synchronization has converged.
When the barrier synchronization has converged (step S7: YES route), in order to send synchronization data, the determination unit 103 sets the position before synchronization data transmission as the restart position (step S9). The determination unit 103 stores information representing the restart position in the memory 12. The processing then moves to
On the other hand, when the transmission of the synchronization data has finished (step S5: YES route), the determination unit 103 determines whether the stop position of the program for the first job represents that the computation node 1 has already finished confirming that the barrier synchronization is complete (step S11). In the case of the example in
When it is confirmed that the barrier synchronization is not complete (step S11: NO route), the determination unit 103 references the field of the convergence test in the data storage unit 104, and determines whether or not the barrier synchronization has converged (step S13). At the step S13, when a circle mark is set in the field of the convergence test, the determination unit 103 determines that the barrier synchronization has converged, and when “X” is set in the field of the convergence test, the determination unit 103 determines that the barrier synchronization has not converged.
When the barrier synchronization has not converged (step S13: NO route), the determination unit 103 repeats the determination until the data storage unit 104, as updated by the collection unit 102, indicates that the barrier synchronization has converged.
When the barrier synchronization has converged (step S13: YES route), the determination unit 103 determines whether the barrier synchronization is complete (step S15). At the step S15, the determination unit 103 performs determination for its own computation node (here, the computation node 1) by referencing information that represents whether or not synchronization data has been transmitted, and is stored in the data storage unit 104. It is determined at the step S5 that the synchronization data has been transmitted, so when the data is R(1), it can be considered that the barrier synchronization is complete, and when the data is B(0), it can be considered that it is in the waiting state for the completion of the barrier synchronization (in other words, the barrier synchronization is not complete).
When the barrier synchronization is complete (step S15: YES route), in order to confirm the completion of the barrier synchronization, the determination unit 103 sets a position before the confirmation of the barrier synchronization completion as the restart position (step S17). The determination unit 103 stores information that represents the restart position in the memory 12. Processing then moves to
On the other hand, when the barrier synchronization is not complete (step S15: NO route), in order to resend the synchronization data, the determination unit 103 sets a position before the transmission of the synchronization data as the restart position (step S19). The determination unit 103 stores information representing the restart position in the memory 12. Processing then moves to
Returning to the explanation of
Moving to an explanation of
When the barrier synchronization has not converged (step S21: NO route), the determination unit 103 repeats the determination until the data storage unit 104, as updated by the collection unit 102, indicates that the barrier synchronization has converged.
When the barrier synchronization has converged (step S21: YES route), the determination unit 103 determines whether the stop position of the program for the first job represents that the computation node 1 has already finished transmitting synchronization data for the next barrier synchronization (step S23). In the example in
When synchronization data for the next barrier synchronization has been transmitted (step S23: YES route), the determination unit 103 sets a position before the transmission of synchronization data for the next barrier synchronization as the restart position (step S27). The determination unit 103 stores information that represents the restart position in the memory 12. Processing then ends. The position before transmission of the synchronization data for the next barrier synchronization is (3) in the example in
On the other hand, when synchronization data for the next barrier synchronization has not been transmitted (step S23: NO route), the determination unit 103 sets a position after where confirmation of barrier synchronization completion was finished as the restart position (step S25). The determination unit 103 stores information that represents the restart position in the memory 12. Processing then ends. The position after where the confirmation of the barrier synchronization completion was finished is (3) in the example in
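The branching of steps S5 to S27 can be summarized in a short sketch. It covers only the case in which the program was stopped inside the barrier function, and it assumes that the convergence tests of steps S7, S13 and S21 have already succeeded. The names `Restart` and `determine_restart` and the boolean parameters are hypothetical; they stand for what the determination unit 103 reads from the stop position and from the data storage unit 104.

```python
from enum import Enum


class Restart(Enum):
    BEFORE_SEND = "before transmission of the synchronization data"      # S9, S19
    BEFORE_CONFIRM = "before confirmation of barrier completion"         # S17
    AFTER_CONFIRM = "after confirmation of barrier completion"           # S25
    BEFORE_NEXT_SEND = "before transmission for the next barrier"        # S27


def determine_restart(sent, confirmed, sent_next, own_state):
    """Return the restart position, assuming convergence was already reached.

    sent:       stop position shows the synchronization data was transmitted (S5)
    confirmed:  stop position shows completion was already confirmed (S11)
    sent_next:  stop position shows data for the next barrier was sent (S23)
    own_state:  "R" or "B" for this node, from the collected information (S15)
    """
    if not sent:                        # S5: NO route
        return Restart.BEFORE_SEND      # S9
    if not confirmed:                   # S11: NO route
        if own_state == "R":            # S15: barrier synchronization complete
            return Restart.BEFORE_CONFIRM   # S17
        return Restart.BEFORE_SEND      # S19: resend the synchronization data
    if sent_next:                       # S23: YES route
        return Restart.BEFORE_NEXT_SEND     # S27
    return Restart.AFTER_CONFIRM        # S25
```

For instance, `determine_restart(True, False, False, "B")` yields `Restart.BEFORE_SEND`, which corresponds to step S19: the node transmitted its synchronization data but the barrier synchronization did not complete, so the data is transmitted again after the restart.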
The relationship between the stop position and restart position of the program for the job will be explained using
In the example in
In the example in
In the example in
In the example in
In the example in
In the example in
By executing the processing such as described above, it is possible to restart execution of the program for the job so that the barrier synchronization is suitably performed even though a job is stopped during the execution of the barrier synchronization.
After the processing described above has been executed, the synchronization manager 101 saves information, that is used when executing the program for the first job again and that is stored in a storage device such as memory in the barrier interface 13, in a storage device such as a hard disk. Here, the resource manager 106 releases resources such as memory and other hardware in the barrier interface 13 and network interface 14.
Then, by activating the program for the second job by the job manager 105, the second job is executed in the computation node 1. When the execution of the second job is completed, the synchronization manager 101 returns the information that was saved in order to re-execute the first job to the original state (in other words, swaps in the information). The job manager 105 then identifies, from the memory 12, information that represents the restart position. The job manager 105 also activates the program for the first job, and re-executes the first job in the computation node 1 from the determined restart position. Also in the computation nodes 2 to N, as in the case of the computation node 1, the second job is executed after the first job is stopped, and then after the execution of the second job is completed, the execution of the first job is restarted.
Although the embodiment of this invention is explained, this invention is not limited to the embodiment. For example, the functional block diagram of the aforementioned computation nodes 1 to N may not always correspond to a program module configuration.
Moreover, the aforementioned structure of each table is a mere example, and may be changed. Furthermore, as for the processing flow, as long as the processing result does not change, the order of the steps may be changed. Furthermore, plural steps may be executed in parallel.
Moreover, in the aforementioned example, an example was explained in which the barrier interface is used for the mechanism for executing the barrier synchronization. However, without the barrier interface, a similar function may be provided in the firmware in the network switch.
The aforementioned embodiment of this invention is outlined as follows:
A control method relating to the embodiment is executed by a first node among plural nodes (CPUs or CPU cores) included in a parallel computer. Then, the control method includes: (A) upon detecting that execution of a program for a job is stopped in each of the plural nodes, collecting information concerning a state of progress of barrier synchronization from each of the plural nodes; and (B) first determining a restart position of the program for the job in the first node, based on a stop position of the program for the job in the first node and the information collected from each of the plural nodes.
With this configuration, even when the job is stopped during execution of the barrier synchronization, it is possible to restart the execution of the program for the job so as to appropriately perform the barrier synchronization.
Moreover, the aforementioned first determining may include: (b1) second determining, based on the information collected from each of the plural nodes, whether or not a state of the parallel computer reached a state that the barrier synchronization does not advance any more; and (b2) upon determining that the state of the parallel computer reached the state that the barrier synchronization does not advance any more, third determining the restart position of the program for the job.
Immediately after the stop of the program for the job, the barrier synchronization may advance by the synchronization data transmitted before the stop of the program for the job. Therefore, by performing the aforementioned processing, it is possible to avoid determining an inappropriate restart position.
Moreover, the second determining may include: (b1-1) determining whether or not at least one of the plural nodes has not finished transmitting synchronization data for the barrier synchronization and whether the barrier synchronization has been completed in each of the plural nodes or has not been completed in any one of the plural nodes. When it is determined that the aforementioned conditions are satisfied, it is determined that the state of the parallel computer reached the state that the barrier synchronization advances no further, in other words, it is considered that the state of progress of the barrier synchronization does not change any more.
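The condition in (b1-1) can be sketched as a predicate over the collected information. This is one reading of the condition, stated as an assumption: "completed in each of the plural nodes or not completed in any one of them" is interpreted as all nodes holding the same sequence number, and "has not finished transmitting" is interpreted as the R state. The function name `is_converged` and the tuple representation are hypothetical.

```python
def is_converged(progress):
    """Return True when the barrier synchronization can advance no further.

    progress: non-empty list of (state, sequence) tuples, one per node,
              where state is "R" (not yet transmitted) or "B" (transmitted).
    """
    if not progress:
        raise ValueError("progress information is required for every node")
    # Same sequence number everywhere: completed in all nodes or in none.
    same_round = len({sequence for _, sequence in progress}) == 1
    # At least one node has not transmitted, so the barrier cannot complete.
    someone_not_sent = any(state == "R" for state, _ in progress)
    return same_round and someone_not_sent
```

Under this reading, `[("R", 0), ("B", 0)]` is converged because one node has not transmitted, whereas `[("B", 0), ("B", 0)]` is not, since every node has transmitted and the barrier synchronization may still complete. `[("R", 1), ("B", 0)]` is also not converged, because completion has reached only some of the nodes.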
Moreover, the aforementioned first determining may include: upon detecting that the first node has finished transmitting synchronization data for the barrier synchronization and the first node waits for receipt of the synchronization data from nodes other than the first node, setting, as a restart position, a position before a position at which the synchronization data was transmitted. In such a case, because the nodes other than the first node do not transmit the synchronization data, the barrier synchronization is not completed. Then, by restarting the execution of the program for the job from the position before the synchronization data was transmitted, the barrier synchronization can be performed appropriately.
In addition, the first determining may include: upon detecting that the barrier synchronization has been completed in the first node and the first node has finished transmitting synchronization data for next barrier synchronization, setting, as a restart position, a position before a position at which the synchronization data for the next barrier synchronization was transmitted. In such a case, there may be a node that has not transmitted the synchronization data for the next barrier synchronization. Therefore, the next barrier synchronization may not be completed. Then, by restarting the execution of the program for the job at the position before the synchronization data for the next barrier synchronization was transmitted, the next barrier synchronization is appropriately performed.
Incidentally, it is possible to create a program causing a computer to execute the aforementioned processing, and such a program is stored in a computer readable storage medium or storage device such as a flexible disk, CD-ROM, DVD-ROM, magneto-optic disk, a semiconductor memory, and hard disk. In addition, the intermediate processing result is temporarily stored in a storage device such as a main memory or the like.
All examples and conditional language recited herein are intended for pedagogical purposes to aid the reader in understanding the invention and the concepts contributed by the inventor to furthering the art, and are to be construed as being without limitation to such specifically recited examples and conditions, nor does the organization of such examples in the specification relate to a showing of the superiority and inferiority of the invention. Although the embodiments of the present inventions have been described in detail, it should be understood that the various changes, substitutions, and alterations could be made hereto without departing from the spirit and scope of the invention.
Number | Date | Country | Kind |
---|---|---|---|
2012-258186 | Nov 2012 | JP | national |