1. Field of the Invention
The present invention relates to a technique for acquiring checkpoints in making iteration-method computer calculations in parallel to effectively utilize the acquired data for recovery.
2. Description of Related Art
As the scale of supercomputers increases, the increase in time required for checkpoints is becoming problematic. The acquisition of a checkpoint takes a lot of time. Since a checkpoint of memory is acquired at a particular point of time while rewriting continues, overhead for securing consistency, such as suspension of calculation during the acquisition of the checkpoint, is required.
A first example of a technique currently used is copy-on-write and incremental checkpointing. After write-protecting memory by using copy-on-write in this scheme, a checkpoint is acquired in advance without stopping (interrupting) calculation. The calculation is stopped after acquiring the checkpoint in advance, and an updated part copied by the copy-on-write mechanism during acquisition of the checkpoint is reflected on the checkpoint acquired in advance.
A disadvantage of this scheme is that it is effective only when a small portion of the memory is updated. When it is applied to LU decomposition, a method of solving Poisson's equation, or the like, a large portion of the memory is updated during acquisition of a checkpoint. Calculation must therefore still be stopped to reflect the changes on the checkpoint acquired in advance, and the stop time cannot be saved.
A second example of a technique currently used is the use of a nonvolatile medium other than a disk, such as a flash memory, an MRAM or the like. In this scheme, time is reduced by temporarily copying data to a high-speed nonvolatile medium before writing the data to a low-speed medium such as an HDD.
A disadvantage of this scheme is the high additional cost for the nonvolatile memory.
In addition, as for element techniques related to the acquisition of a checkpoint, there are techniques as disclosed in Japanese Patent Laid-Open No. 7-271624 and Japanese Patent Laid-Open No. 9-204318. However, none of these relate to iteration-method calculation.
The object of the present invention is to acquire checkpoints in making iteration-method computer calculations in parallel and to effectively utilize the acquired data for recovery.
In order to overcome these deficiencies, the present invention provides a method implemented in a system including a certain node and at least one other node, the method including: starting, by the certain node, computer calculations based on a data group for calculation belonging to a certain discrete time and executing an iteration-method calculation until a result of the calculations is converged within a predetermined range; acquiring, by the certain node, an intermediate calculation group as a checkpoint at a predetermined timing, in parallel with the execution of the iteration-method calculation, without stopping the started computer calculations; storing, by the certain node, the acquired intermediate calculation group as a checkpoint into an external memory; waiting, by the certain node, until it is confirmed that all the above-stated processes are performed in parallel in the other node and have been completed before evolving the certain discrete time to a next discrete time; and referring, by the certain node, in response to the completion being confirmed, to a converged calculation result and starting next computer calculations based on a next data group for calculations belonging to the next discrete time.
According to another aspect, the present invention provides a system including a certain node and at least one other node, wherein: the certain node starts computer calculations based on a data group for calculation belonging to a certain discrete time and executes an iteration-method calculation until a result of the calculation is converged within a predetermined range; the certain node acquires an intermediate calculation group as a checkpoint at a predetermined timing, in parallel with the execution of the iteration-method calculation, without stopping the started computer calculations; the certain node stores the acquired intermediate calculation group as a checkpoint into an external memory; the certain node waits until it is confirmed that all the above-stated processes are performed in parallel in the other node and have been completed before evolving the certain discrete time to a next discrete time; and in response to the completion being confirmed, the certain node refers to a converged calculation result and starts next computer calculations based on a next data group for calculation belonging to the next discrete time.
According to yet another aspect, the present invention provides a node capable of independently making computer calculations, including a CPU, a checkpoint system and a memory, the node being linked with at least one other node so as to be communicable with each other, the computer calculations being made in parallel between these multiple nodes while a data group for calculation belonging to some discrete time is evolved from a certain discrete time to a next discrete time, wherein the node: starts computer calculations based on the data group for calculation belonging to the certain discrete time and executes an iteration-method calculation until a result of the calculation is converged within a predetermined range; acquires an intermediate calculation group as a checkpoint at a predetermined timing in parallel with the execution of the iteration-method calculation without stopping the started computer calculation; stores the acquired intermediate calculation group as a checkpoint into an external memory; waits until it is confirmed that all the above-stated processes are performed in parallel in the other node and have been completed before evolving the certain discrete time to the next discrete time; and in response to the completion being confirmed, refers to a converged calculation result and starts next computer calculations based on a next data group for calculation belonging to the next discrete time.
Each node includes a CPU (calculation body), a checkpoint system and a memory and can independently make computer calculations.
Regarding the data group for calculation, for example, a differential equation expressed by a Poisson's equation is discretized in a form like meshes in a two-dimensional space expressed by x and y as shown in the figure, and a physical variable is given at each of the mesh intersections (x1, y1), (x2, y1), (x3, y1), . . . . In a computer calculation, the amount of memory occupied is reduced by overwriting the value of each mesh intersection with the newly calculated value in the process of time evolution. In common programming, an array in a computer program is used as a framework for storing (the number of mesh intersections) × (the number of kinds of physical variables) values until the next discrete time.
At the certain discrete time (t=k−1), (convergence) calculation is started. The calculation is not advanced to the next discrete time (t=k) until the calculation result is converged within a predetermined range. The name “iteration method” is derived from the fact that the calculation is iteratively repeated until the calculation result is converged. As for the “predetermined range” for use in determining whether the calculation result has been converged or not, one skilled in the art could introduce various kinds of threshold decisions or appropriately change the range according to the condition of convergence. It is known that the condition of convergence also influences the degree of discretization of time t [here, the interval between (k−1) and k].
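The overwriting array and the convergence test described above can be illustrated with a minimal sketch. It assumes a Gauss-Seidel sweep with the five-point stencil on a uniform mesh and uses the largest change in one sweep as the threshold decision; the names `gauss_seidel_sweep`, `converge`, `tolerance` and `max_iterations` are hypothetical and chosen only for this illustration.

```python
def gauss_seidel_sweep(u, f, h):
    """Overwrite u in place with one Gauss-Seidel sweep; return the largest change."""
    n = len(u)
    largest_change = 0.0
    for i in range(1, n - 1):
        for j in range(1, n - 1):
            # Five-point stencil for -(u_xx + u_yy) = f on a uniform mesh of spacing h.
            new_value = 0.25 * (u[i - 1][j] + u[i + 1][j]
                                + u[i][j - 1] + u[i][j + 1]
                                + h * h * f[i][j])
            largest_change = max(largest_change, abs(new_value - u[i][j]))
            u[i][j] = new_value  # the old value is overwritten, not kept
    return largest_change


def converge(u, f, h, tolerance=1e-10, max_iterations=100_000):
    """Repeat sweeps until the largest change falls within the tolerance."""
    for iteration in range(1, max_iterations + 1):
        if gauss_seidel_sweep(u, f, h) <= tolerance:
            return iteration  # only now may the discrete time be advanced
    raise RuntimeError("no convergence within max_iterations")
```

The boundary rows and columns of `u` are never written and act as fixed boundary values. Other threshold decisions could be substituted for the largest-change test without changing the structure of the loop.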
In the present embodiment, an intermediate calculation data group is acquired as a checkpoint at a predetermined timing (point of time) in the course of execution of the iteration-method calculation. This acquisition is performed by an asynchronous I/O (input/output) operation without stopping/suspending the started computer calculation.
The self-node therefore stores the acquired intermediate calculation data group as a checkpoint in the external memory. This is important because, in the case of recovery, the computer calculation is restarted from the checkpoint.
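Acquisition without stopping the calculation can be sketched as follows. This is only an illustration under stated assumptions: a Python thread stands in for the asynchronous I/O operation, a JSON file stands in for the external memory, and a one-dimensional relaxation stands in for the convergence calculation. The names `relax`, `write_checkpoint` and `run_with_background_checkpoint` are hypothetical.

```python
import json
import threading


def relax(u):
    """One in-place sweep of a 1-D averaging iteration with fixed end values."""
    for i in range(1, len(u) - 1):
        u[i] = 0.5 * (u[i - 1] + u[i + 1])


def write_checkpoint(u, path, chunk=8):
    """Copy u to the external memory chunk by chunk, without locking u."""
    snapshot = []
    for start in range(0, len(u), chunk):
        # The calculation keeps running, so chunks may come from different iterations.
        snapshot.extend(u[start:start + chunk])
    with open(path, "w") as handle:
        json.dump(snapshot, handle)


def run_with_background_checkpoint(u, path, iterations):
    writer = threading.Thread(target=write_checkpoint, args=(u, path))
    writer.start()               # checkpoint acquisition begins ...
    for _ in range(iterations):  # ... while the calculation continues
        relax(u)
    writer.join()                # completion is awaited only at the end
```

No lock is taken on `u`, so the stored data may mix values from different iterations, which is the situation the present embodiment accepts.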
The calculation body starts convergence calculation at 10. At 20, a checkpoint acquisition instruction is transmitted to the checkpoint system of the self-node (coordination with the checkpoint system). At 30, the convergence calculation is resumed and executed to the end thereof. At 40, an end notification is received from the checkpoint system (coordination with the checkpoint system). At 50, the procedure returns to 10 for convergence calculation for the next discrete time.
At 60, the checkpoint system receives a checkpoint acquisition start instruction from the calculation body. At 70, the contents of the memory are stored in the external memory. At 80, the checkpoint system waits until it is confirmed that all the above-stated steps performed in parallel in all the relevant nodes have been completed, by barrier synchronization between the at least one other node (non-self-node) and the checkpoint system before time-evolving discrete time to the next discrete time.
At 90, the checkpoint system transmits a checkpoint acquisition end notification to the calculation body of the self-node in response to the completion being confirmed, and the notification is received by the calculation body at 40 (coordination with the calculation body). Thereby, at 50, the calculation body of the self-node refers to the converged calculation result and starts a computer calculation based on a data group for calculation belonging to the next discrete time. At 100, the procedure returns to 60 for convergence calculation for the next discrete time. Before time evolution to the next discrete time, it is possible to continuously acquire (or prepare to acquire) a checkpoint at a different timing (point of time).
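The coordination in steps 10 to 100 can be sketched with threads standing in for nodes. This is an assumption-laden illustration, not the implementation of the embodiment: `queue.Queue` objects carry the instruction and the end notification, `threading.Barrier` plays the role of barrier synchronization, a dictionary plays the role of the external memory, and all function names are hypothetical.

```python
import queue
import threading


def relax(memory):
    """Toy convergence step: every value moves halfway toward 1.0."""
    memory[:] = [0.5 * (value + 1.0) for value in memory]


def checkpoint_system(node_id, requests, notifications, barrier, external_memory, steps):
    for _ in range(steps):
        memory = requests.get()                  # 60: acquisition instruction received
        external_memory[node_id] = list(memory)  # 70: memory contents stored externally
        barrier.wait()                           # 80: barrier with the other nodes
        notifications.put("checkpoint done")     # 90: end notification sent


def calculation_body(node_id, requests, notifications, log, steps):
    memory = [0.0] * 4
    for step in range(steps):
        relax(memory)                            # 10: convergence calculation starts
        requests.put(memory)                     # 20: acquisition instruction sent
        for _ in range(59):                      # 30: calculation runs to its end
            relax(memory)
        notifications.get()                      # 40: end notification received
        log.append((step, node_id))              # 50: advance to the next discrete time


def run_nodes(node_count, steps):
    barrier = threading.Barrier(node_count)
    external_memory, log, threads = {}, [], []
    for node_id in range(node_count):
        requests, notifications = queue.Queue(), queue.Queue()
        threads.append(threading.Thread(
            target=checkpoint_system,
            args=(node_id, requests, notifications, barrier, external_memory, steps)))
        threads.append(threading.Thread(
            target=calculation_body,
            args=(node_id, requests, notifications, log, steps)))
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return log, external_memory
```

Because every end notification is sent only after the barrier, no node in this sketch advances to the next discrete time until every node has stored its checkpoint for the current one.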
At 110, the calculation body transmits a checkpoint recovery start instruction to the checkpoint system of the self-node (coordination with the checkpoint system). At 120, a checkpoint recovery end notification is received from the checkpoint system (coordination with the checkpoint system). At 130, execution of the convergence calculation being executed at the time of acquiring the checkpoint is resumed from the start thereof.
At 140, the checkpoint system receives a checkpoint recovery start instruction from the calculation body of the self-node (coordination with the calculation body). At 150, the contents of the memory are recovered from the external memory. At 160, the checkpoint system waits until it is confirmed that all the above-stated steps performed in parallel in all the relevant nodes have been completed, by barrier synchronization between the at least one other node (non-self-node) and the checkpoint system. At 170, a checkpoint recovery end notification is transmitted to the calculation body of the self-node, and the notification is received by the calculation body at 120. Thereby, at 130, the calculation body of the self-node resumes execution from the start of the convergence calculation being executed at the time of acquiring the checkpoint.
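The recovery steps can be sketched in the same style. The sketch assumes a dictionary as the external memory, threads as nodes, and a one-dimensional relaxation as the convergence calculation; for brevity the nodes do not exchange boundary data, which a real parallel calculation would. The names `converge`, `recover_node` and `recover_all` are hypothetical.

```python
import threading


def converge(u, tolerance=1e-10, max_iterations=100_000):
    """In-place 1-D relaxation with fixed end values; returns the iterations used."""
    for iteration in range(1, max_iterations + 1):
        largest_change = 0.0
        for i in range(1, len(u) - 1):
            new_value = 0.5 * (u[i - 1] + u[i + 1])
            largest_change = max(largest_change, abs(new_value - u[i]))
            u[i] = new_value
        if largest_change <= tolerance:
            return iteration
    raise RuntimeError("no convergence within max_iterations")


def recover_node(node_id, external_memory, memories, barrier, iterations_used):
    memories[node_id][:] = external_memory[node_id]         # 150: memory contents restored
    barrier.wait()                                          # 160: wait for every node
    iterations_used[node_id] = converge(memories[node_id])  # 130: calculation redone from its start


def recover_all(external_memory, memories):
    barrier = threading.Barrier(len(memories))
    iterations_used = {}
    threads = [threading.Thread(
        target=recover_node,
        args=(node_id, external_memory, memories, barrier, iterations_used))
        for node_id in memories]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return iterations_used
```

The restored values serve only as the initial value of the repeated convergence calculation, so they need not belong to a single point of time.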
In the present embodiment, since calculation is not stopped at the time of acquiring a checkpoint, the data in which the contents of the memory acquired at different timings (points of time) are mixed are used for a process of recovery from the checkpoint. The reason why use of such data is permitted is that its use is limited to iteration-method convergence calculation. In general, in an iteration method, an approximate value calculated in another method, a fixed value (for example, all zeros), a random number or the like is used as an initial value of a solution. In the calculation, approximation is performed on the basis of a given initial value so that difference from a correct solution (residual) becomes smaller every iteration, and the iteration is repeated until the residual is equal to or smaller than a value specified in advance.
In the present approach, the acquired checkpoint data is data in which values at different points of time are mixed. However, the present embodiment assumes a problem whose convergence destination does not depend on the initial value, so convergence to the same value is guaranteed regardless of the initial value. That is, even if checkpoint data in which values at different points of time are mixed is used, the termination of the calculation after recovery and the validity of the calculation result are guaranteed.
Next, the number of iterations for convergence in the case of being recovered from the data in which values at different points of time are mixed, among checkpoint data, will be described. In an iteration method, the current solution is made closer to a correct solution every iteration. Therefore, in general, by using an initial value closer to the correct solution, convergence to the correct solution becomes possible by a smaller number of iterations. Thus, an initial value closer to a correct solution can be obtained by using a value after more iterations have been performed even if acquisition points of time are mixed, like the checkpoint acquisition method of the present invention, and thereby, the number of iterations performed until convergence at the time of recovery can be reduced.
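For a stationary linear iteration such as Gauss-Seidel, both statements can be written compactly. The symbols G (iteration matrix), c (constant vector), x* (correct solution), e (error) and ρ (spectral radius) are notation introduced here for illustration and do not appear in the embodiment.

```latex
\[
  x^{(m+1)} = G\,x^{(m)} + c, \qquad x^{*} = G\,x^{*} + c
  \;\Longrightarrow\;
  e^{(m)} := x^{(m)} - x^{*} = G^{m} e^{(0)} .
\]
\[
  \rho(G) < 1 \;\Longrightarrow\; \lim_{m \to \infty} e^{(m)} = 0
  \quad \text{for every initial value } x^{(0)} .
\]
\[
  \lVert e^{(m)} \rVert \;\le\; \lVert G^{m} \rVert \, \lVert e^{(0)} \rVert .
\]
```

The second line expresses that the convergence destination does not depend on the initial value, so a checkpoint with mixed acquisition times is an admissible initial value. The third line bounds the error after m iterations by the initial error, which suggests why a checkpoint whose entries lie closer to the correct solution would be expected to need fewer iterations after recovery.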
The approach of the present embodiment can be embodied as a node, a method implemented in the node, or a method or system for making computer calculations in parallel among multiple nodes. The present approach can also be embodied as a computer program product including a computer readable storage medium having computer readable non-transient program code embodied therein, the program code causing a CPU (calculation body), a checkpoint system, or an integration thereof included in a certain node (self-node) to execute each step of the method.
However, the calculation is performed on the condition that the calculation time is not increased by the background checkpoint acquisition overhead. (It is assumed that the calculation uses no, or almost no, resources other than the CPU. If the calculation uses I/O resources, the effect of the invention may be reduced according to the rate of that use.)
The “proposed (estimation)” data in the graph indicates theoretical overhead values when the present invention is applied. The other data indicate the overhead when the checkpoint acquisition interval is set to 1 hour, 2 hours, 6 hours and 1 day, respectively. With a checkpoint interval of 1 day and an MTBF of 10 days, the present embodiment reduced the overhead from 11.1% to about 0.4%.
Calculation conditions are enumerated below:
Equation: Poisson's equation
Calculation algorithm: Gauss-Seidel
The number of input data (=two-dimensional data arrays): 16384 (=128×128)
Checkpoint acquisition speed: 32 points/iteration (=checkpoint acquisition interval of 512 iterations)
The number of iterations which have been performed when checkpoint acquisition ends: 500, 1000, 1500
In the present embodiment example, the same scheme as shown in the above configuration and procedures is used. However, the checkpoint system and the calculation body in the above configuration are integrated and realized as the same program. Shown below are the residuals in the case of acquiring checkpoints at the 500th, 1000th and 1500th iterations after the start of calculation and recovering from the acquired checkpoints. In order to show how the number of iterations before acquisition influences the number of iterations after acquisition, the graph shows the residuals after recovery on the basis of the number of iterations before checkpoint acquisition.
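A scaled-down version of such an experiment can be sketched as a single integrated program. The sketch rests on several assumptions that are not the conditions listed above: a 16×16 array, 8 points copied per iteration (so one checkpoint spans 32 iterations), a zero source term with boundary value 1 so that the correct solution is known to be 1 everywhere, and a residual measured as the largest difference from that solution. All function names are hypothetical.

```python
def new_grid(n, boundary=1.0):
    """n x n array with fixed boundary values and a zero interior."""
    u = [[0.0] * n for _ in range(n)]
    for k in range(n):
        u[0][k] = u[n - 1][k] = u[k][0] = u[k][n - 1] = boundary
    return u


def sweep(u):
    """One in-place Gauss-Seidel sweep for Poisson's equation with zero source."""
    n = len(u)
    for i in range(1, n - 1):
        for j in range(1, n - 1):
            u[i][j] = 0.25 * (u[i - 1][j] + u[i + 1][j] + u[i][j - 1] + u[i][j + 1])


def residual(u, exact=1.0):
    """Largest difference from the known solution of this toy problem."""
    return max(abs(exact - value) for row in u for value in row)


def iterations_to_converge(u, tolerance=1e-8, max_iterations=100_000):
    """Number of sweeps needed until the residual is within the tolerance."""
    for iteration in range(max_iterations + 1):
        if residual(u) <= tolerance:
            return iteration
        sweep(u)
    raise RuntimeError("no convergence within max_iterations")


def mixed_checkpoint(n, end_iteration, points_per_iteration):
    """Run the solver and copy a few points after every sweep, so that the
    copy completes at end_iteration and mixes values of different iterations."""
    total = n * n
    span = -(-total // points_per_iteration)  # sweeps needed for one full copy
    if end_iteration < span:
        raise ValueError("end_iteration is too small for a complete checkpoint")
    start = end_iteration - span
    u = new_grid(n)
    checkpoint = [[0.0] * n for _ in range(n)]
    copied = 0
    for iteration in range(1, end_iteration + 1):
        sweep(u)  # the calculation is never paused
        if iteration > start:
            stop = min(copied + points_per_iteration, total)
            for k in range(copied, stop):
                checkpoint[k // n][k % n] = u[k // n][k % n]
            copied = stop
    return checkpoint
```

Calling `iterations_to_converge(mixed_checkpoint(16, 150, 8))` and comparing the result with `iterations_to_converge(new_grid(16))` gives the number of iterations after recovery and after a cold start, respectively. In this toy problem every value increases monotonically toward the solution, so a checkpoint completed later is never farther from the solution than one completed earlier.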
Furthermore, embodiment examples to which the present invention can be applied include (1) to (4) below:
(1) Applicable to calculation based on convergence calculation by an iteration method in which a convergence value is decided irrespective of an initial solution. A BiCG method is an example;
(2) Applicable to calculation using the Poisson equation, because it is guaranteed that in the Poisson equation a convergence value is decided regardless of an initial value. The Poisson equation is used in a variety of fields such as CFD, electrostatics, mechanical engineering, theoretical physics and first principles calculation;
(3) Applicable to calculation in which a convergence value differs depending on an initial solution. However, it is also conceivable that, by applying the present invention, convergence to a value other than an original convergence value occurs or convergence does not occur after recovery from a checkpoint. In the problem of including such calculation that a convergence value differs depending on an initial solution, there is a possibility that an execution result may change due to application of the present invention. If a user accepts this condition, the present invention can be applied to the calculation in which a convergence value differs depending on an initial solution; and
(4) At the time of acquiring a checkpoint, asynchronous communication using RDMA (Remote Direct Memory Access) or the like can be used instead of the asynchronous I/O. In this case, the checkpoint system operates on a node other than the self-node, but the procedure itself is the same. By using RDMA, checkpoint acquisition can be performed without using CPU resources of a target node. Thereby, an increase in convergence calculation time (30 in
Number | Date | Country | Kind |
---|---|---|---|
2011-040262 | Feb 2011 | JP | national |
This application is a continuation of and claims priority from U.S. application Ser. No. 13/396820 filed on Feb. 15, 2012, which in turn claims priority under 35 U.S.C. 119 from Japanese Application 2011-040262, filed Feb. 25, 2011, the entire contents of both of which are incorporated herein by reference.
Relation | Number | Date | Country |
---|---|---|---|
Parent | 13396820 | Feb 2012 | US |
Child | 13572844 | | US |