The present invention relates to the field of parallel processes. More particularly, the present invention relates to the field of parallel processes where a checkpointing technique records an intermediate state of parallel processes.
A computer in operation includes hardware, software, and data. The hardware typically includes a processor, memory, storage, and I/O (input/output) devices coupled together by a bus. The software typically includes an operating system and applications. The applications perform useful work on the data for a user or users. The operating system provides an interface between the applications and the hardware. The operating system performs two primary functions. First, it allocates resources to the applications. The resources include hardware resources—such as processor time, memory space, and I/O devices—and software resources including some software resources that enable the hardware resources to perform tasks. Second, it controls execution of the applications to ensure proper operation of the computer.
Often, the software is conceptually divided into a user level, where the applications reside and which the users access, and a kernel level, where the operating system resides and which is accessed by system calls. Within an operating computer, a unit of work is referred to as a process. A process is computer code and data in execution. The process may be actually executing or it may be ready to execute or it may be waiting for an event to occur. System calls provide an interface between the processes and the operating system.
Checkpointing is a technique employed on some computers where an intermediate computational state of processes is captured. When processes take significant time to execute, they are susceptible to intervening system failures. By occasionally performing a checkpoint of processes and resources assigned to processes, the processes can be restarted at an intermediate computational state in an event of a system failure. Migration is a technique in which running processes are checkpointed and then restarted on another computer. Migration allows some processes on a heavily used computer to be moved to a lightly used computer. Checkpointing, restart, and migration have been implemented in a number of ways.
Operating system checkpoint, restart, and migration have been implemented as an integral part of several research operating systems. However, such research operating systems are undesirable because they lack an installed base and, consequently, few applications exist for them. Application level checkpoint, restart, and migration in conjunction with standard operating systems have also been implemented. But these techniques require that processes not use some common operating system services because the checkpointing only takes place at the application level.
Object based checkpoint, restart, and migration have also been implemented. Such object based approaches use particular programming languages or middleware toolkits. The object based approaches require that the applications be written in one of the particular programming languages or that the applications make explicit use of one of the middleware toolkits. A virtual machine monitor approach can be used to implement checkpoint, restart, and migration. But such an approach requires checkpointing and restarting all processes within the virtual machine monitor. This approach also exhibits poor performance due to isolation of the virtual machine monitor from an underlying operating system.
In The Design and Implementation of Zap: A System for Migrating Computing Environments, Proc. OSDI 2002, Osman et al. teach a technique of adding a loadable kernel module to a standard operating system to provide checkpoint, restart, and migration of processes for existing applications. The loadable kernel module divides the user level into process domains and provides virtualization of resources within each process domain. Such virtualization of resources includes virtual process identifiers and virtualized network addresses. Processes within one process domain are prevented from interacting with processes in another process domain using inter-process communication techniques. Instead, processes within different process domains interact using network communications and shared files set up for communication between different computers.
Checkpointing in the technique taught by Osman et al. records the processes in a process domain as well as the state of the resources used by the processes. Because resources in the process domain are virtualized, restart or migration of a process domain includes restoring resource identifications to a virtualized identity that the resources had at the most recent checkpoint.
While the checkpoint, restart, and migration techniques taught by Osman et al. show promise, several areas could be improved. In particular, these techniques are only capable of working with processes within a single process domain. There is no technique for checkpointing parallel processes executing within a plurality of process domains.
What is needed is a method of checkpointing parallel processes operating within a plurality of process domains.
The present invention comprises a method of checkpointing parallel processes in execution within a plurality of process domains. According to an embodiment, the method begins with a step of setting communication rules to stop communication between the process domains. Each process domain comprises an execution environment at a user level for at least one of the parallel processes. The method continues with a step of checkpointing each process domain and any in-transit messages. The method concludes with a step of resetting the communication rules to allow the communication between the process domains.
These and other aspects of the present invention are described in more detail herein.
The present invention is described with respect to particular exemplary embodiments thereof and reference is accordingly made to the drawings in which:
The present invention comprises a method of checkpointing parallel processes in execution within a plurality of process domains. According to an embodiment, the process domains are distributed over a plurality of nodes which are coupled together by a network. Each node comprises a computer system. According to an embodiment, the method checkpoints the parallel processes and then continues execution. According to another embodiment, the method checkpoints the parallel processes and then kills the parallel processes. According to this embodiment, the parallel processes are later restored. According to an embodiment, the parallel processes are restored on the nodes which originally hosted the parallel processes. According to another embodiment, restoration of the parallel processes includes migrating at least one of the parallel processes from a first node on which the parallel process was checkpointed to a second node.
An embodiment of a computer system which implements at least a portion of the method of the present invention is illustrated schematically in
An embodiment of a cluster of computers which implements the method of the present invention is illustrated schematically in
According to an embodiment of the method of checkpointing the parallel processes of the present invention, the first through third parallel processes are periodically checkpointed so that, in an event of failure of one or more of the first through third parallel processes, the first through third parallel processes can be restarted at an intermediate point rather than returning to a starting point.
According to another embodiment, the first through third parallel processes are checkpointed as part of a suspend operation. The suspend operation checkpoints the first through third parallel processes and then kills them. A need for the suspend operation may arise, for example, when a higher priority need for one or more of the first through third nodes, 202 . . . 206, arises. Later, the first through third parallel processes can be restored when the higher priority need no longer exists.
According to another embodiment of the method of checkpointing the parallel processes of the present invention, the first through third parallel processes are checkpointed so that one or more of the first through third parallel processes can be migrated to another node. Such a need may arise, for example, when one of the first through third nodes, 202 . . . 206, requires maintenance.
An embodiment of a computer network which implements a method of checkpointing parallel processes of the present invention is illustrated schematically in
According to an embodiment, the computer network 300 comprises a cluster of computers in which the network 310 comprises a LAN (local area network). According to another embodiment, the computer network 300 comprises a more dispersed group of computers in which the network 310 comprises a WAN (wide area network) such as the Internet.
An embodiment of a method of checkpointing the first through nth parallel processes, 312 . . . 316, of the present invention is illustrated as a flow chart in
According to an embodiment which employs the Linux operating system, the second step 404 makes use of netfilter( ) and iptables to configure communication rules. The iptables comprise the communication rules which the netfilter( ) uses to stop the communication between the process domains 110.
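By way of illustration only, the following Python sketch shows how such communication rules might be expressed as iptables command lines. The peer address, the function names block_rules and allow_rules, and the choice of the DROP target are assumptions made for this example and are not specified by the embodiment. The sketch only builds the argument lists; it does not execute them.

```python
from typing import List


def block_rules(peer_addresses: List[str]) -> List[List[str]]:
    """Build iptables commands that drop traffic to and from each peer.

    The addresses are hypothetical stand-ins for the network addresses
    of the other process domains.
    """
    commands = []
    for addr in peer_addresses:
        # Append a rule that drops packets arriving from the peer.
        commands.append(["iptables", "-A", "INPUT", "-s", addr, "-j", "DROP"])
        # Append a rule that drops packets leaving for the peer.
        commands.append(["iptables", "-A", "OUTPUT", "-d", addr, "-j", "DROP"])
    return commands


def allow_rules(peer_addresses: List[str]) -> List[List[str]]:
    """Build the matching delete (-D) commands that remove the DROP rules."""
    return [[cmd[0], "-D"] + cmd[2:] for cmd in block_rules(peer_addresses)]
```

Under these assumptions, block_rules(["10.0.0.2"]) yields one INPUT rule and one OUTPUT rule for the peer, and allow_rules with the same argument yields the commands which later remove them. A silent drop is used in the sketch so that a blocked sender is not notified of an error; other rule sets could serve the same purpose.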
According to an embodiment, the communication rules for each of the first through nth nodes, 302 . . . 306, prevent the process domain 110 on the node from sending messages and from receiving messages. Each instantiation of configuring the communication rules on the first through nth nodes, 302 . . . 306, need not be synchronized with other instantiations on others of the first through nth nodes, 302 . . . 306. This means that the first node 302 could send a message to the second node 304 before the communication rules on the first node 302 have been configured but after the communication rules on the second node 304 have been configured. The communication rule on the second node 304 will prevent such a message from being received by the second node 304. In another case, a communication rule could be configured on the first node 302 to stop communication before a message was sent. Such messages comprise in-transit messages. In yet another case, a message could be sent by the first node 302 and received by the second node 304 before the communication rules have been configured for the first or second nodes, 302 or 304. In such a case, the message might not be retrieved by the process running on the second node 304 before the process running on the second node 304 is stopped in a later step. Such messages also comprise in-transit messages. According to an embodiment, each of the first through nth nodes, 302 . . . 306, buffers any in-transit messages on the node or nodes which sent the in-transit messages.
According to an embodiment, the first through nth nodes, 302 . . . 306, employ TCP (Transmission Control Protocol) to send and receive messages. TCP retains a kernel level copy of each message sent by a sending node until the sending node receives a confirmation from a receiving node. TCP also retains a kernel level copy of each message received by a receiving node until the process running on the receiving node retrieves the message. According to this embodiment, the kernel level copies comprise the buffered in-transit messages.
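The retention behavior described above can be pictured with a toy model. The following Python sketch is not TCP and not kernel code; the class KernelSocketModel, its method names, and the sequence-number bookkeeping are hypothetical simplifications which only mirror the two kinds of kernel level copies described in this embodiment.

```python
from collections import deque


class KernelSocketModel:
    """Toy model of the kernel level copies kept for one endpoint."""

    def __init__(self):
        self.unacked = {}      # seq -> payload sent but not yet confirmed
        self.unread = deque()  # payloads received but not yet retrieved
        self.next_seq = 0

    def send(self, payload):
        """Record an outgoing message; the copy is kept until confirmed."""
        seq = self.next_seq
        self.unacked[seq] = payload
        self.next_seq += 1
        return seq

    def ack(self, seq):
        """A confirmation from the receiving node releases the sent copy."""
        self.unacked.pop(seq, None)

    def deliver(self, payload):
        """An incoming message is kept until the process retrieves it."""
        self.unread.append(payload)

    def read(self):
        """The process retrieves the oldest message, releasing its copy."""
        return self.unread.popleft() if self.unread else None

    def in_transit(self):
        """Return the copies that would count as buffered in-transit messages."""
        return {"unacked": dict(self.unacked), "unread": list(self.unread)}


sock = KernelSocketModel()
first = sock.send(b"m1")
sock.send(b"m2")
sock.ack(first)            # m1 confirmed, m2 still outstanding
sock.deliver(b"r1")        # r1 received, not yet read by the process
snapshot = sock.in_transit()
```

In this usage, the snapshot holds the unconfirmed message m2 and the unretrieved message r1, which correspond to the kernel level copies that a checkpoint would need to capture.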
It will be readily apparent to one skilled in the art that the buffering of in-transit messages on each of the first through nth nodes, 302 . . . 306, is different from network switch buffering. The network switch buffering places packets in one or more buffers (or queues) of network switches as messages move from one of the first through nth nodes, 302 . . . 306, through the network 310 to another of the first through nth nodes, 302 . . . 306. In contrast, the buffering of in-transit messages comprises a kernel level operation on each of the first through nth nodes, 302 . . . 306, which retains the messages in memory.
In a third step 406, the coordinator 308 sends a checkpoint command to each of the first through nth nodes, 302 . . . 306. In response, each of the first through nth nodes, 302 . . . 306, begins a checkpoint process. Each of the checkpoint processes on the first through nth nodes, 302 . . . 306, stops the first through nth parallel processes, 312 . . . 316, respectively, in a fourth step 408. In a fifth step 410, each of the checkpoint processes on the first through nth nodes, 302 . . . 306, saves a state of resources for the process domains 110 on the first through nth nodes, 302 . . . 306, respectively. The resources for which the state is saved include kernel level resources and user level resources for the process domain 110 as well as any in-transit messages which were buffered. The buffered in-transit messages on each node comprise messages that have not yet been successfully sent from the node, messages that have been sent from the node but for which no acknowledgement of successful receipt has been received from a receiving node, and messages that have been received successfully at the node but which have not yet been retrieved by the processes in the node. Preferably, each of the first through nth nodes, 302 . . . 306, saves the state of the resources in a stable repository such as in a file on a disk storage. Alternatively, one or more of the first through nth nodes, 302 . . . 306, save the state of the resources in a memory.
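As a minimal sketch of saving state to a stable repository, the following Python example writes a state record to a file in a way that avoids leaving a partially written file if the node fails during the write. The JSON layout, the key names (processes, unsent, unacked, unread), and the function names are assumptions for illustration; an actual checkpoint of kernel level and user level resources would contain far more than this record.

```python
import json
import os
import tempfile


def save_checkpoint(path, state):
    """Write state to path so that a crash never leaves a partial file."""
    directory = os.path.dirname(os.path.abspath(path))
    # Write to a temporary file in the same directory first.
    fd, tmp = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "w") as f:
            json.dump(state, f)
            f.flush()
            os.fsync(f.fileno())   # force the data to the storage device
        os.replace(tmp, path)      # atomically move it into place
    except BaseException:
        os.unlink(tmp)             # discard the incomplete temporary file
        raise


def load_checkpoint(path):
    """Read back a state record written by save_checkpoint."""
    with open(path) as f:
        return json.load(f)
```

A record passed to save_checkpoint in this sketch might hold one list per category of buffered in-transit message named above (not yet sent, sent but unacknowledged, and received but not yet retrieved) alongside the saved process state.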
In a sixth step 412, the coordinator 308 waits for each of the first through nth nodes, 302 . . . 306, to acknowledge completion of the fifth step 410. In a seventh step 414, the coordinator 308 determines whether the checkpointing forms part of a checkpoint operation or whether the checkpointing forms part of a suspend operation. The checkpoint operation is discussed immediately below. The suspend operation is discussed following the discussion of the checkpoint operation.
If the checkpoint operation is being performed, in an eighth step 416, the coordinator 308 sends a resume communication command to each of the first through nth nodes, 302 . . . 306. In response, each of the first through nth nodes, 302 . . . 306, reconfigures the communication rules to allow communication between the process domains 110 in a ninth step 418. In a tenth step 420, the coordinator 308 sends a continue command to each of the first through nth nodes, 302 . . . 306. In an eleventh step 422, each of the first through nth nodes, 302 . . . 306, resumes execution of the first through nth parallel processes, 312 . . . 316. According to an embodiment, in a twelfth step 424, the coordinator 308 waits for an acknowledgment from the first through nth nodes, 302 . . . 306, indicating that the first through nth parallel processes, 312 . . . 316, have continued execution. According to another embodiment, the twelfth step 424 is not performed and instead the coordinator 308 verifies that the first through nth parallel processes, 312 . . . 316, have continued execution using another technique (e.g., listing state of processes on the nodes).
If the suspend operation is being performed, in a thirteenth step 426, the coordinator 308 sends a kill command to each of the first through nth nodes, 302 . . . 306. In a fourteenth step 428, each of the first through nth nodes, 302 . . . 306, kills the first through nth parallel processes, 312 . . . 316, respectively. In a fifteenth step 430, the coordinator 308 sends an allow communication command to each of the first through nth nodes, 302 . . . 306. In a sixteenth step 432, each of the first through nth nodes, 302 . . . 306, reconfigures the communication rules to allow communication between the process domains 110. In a seventeenth step 434, each of the first through nth nodes, 302 . . . 306, removes its respective process domain 110. According to an embodiment, in an eighteenth step 436, the coordinator 308 waits for an acknowledgment from the first through nth nodes, 302 . . . 306, indicating that the first through nth parallel processes, 312 . . . 316, have been killed. According to another embodiment, the eighteenth step 436 is not performed and instead the coordinator 308 verifies that the first through nth parallel processes, 312 . . . 316, have been killed using another technique. At some later time, a method of restarting the first through nth parallel processes, 312 . . . 316, is employed to resume execution of the first through nth parallel processes, 312 . . . 316.
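The ordering of commands in the checkpoint and suspend operations described above can be sketched as a small in-memory simulation. In the following Python example, the class Node, the command strings, and the functions broadcast and coordinate are hypothetical; real nodes would be reached over the network, and the stop_communication command name in particular is an assumption standing in for whatever mechanism causes the communication rules to be configured.

```python
class Node:
    """In-memory stand-in for one node and its process domain."""

    def __init__(self, name):
        self.name = name
        self.comm_allowed = True
        self.running = True
        self.alive = True
        self.has_domain = True
        self.saved_state = None
        self.log = []

    def handle(self, command):
        self.log.append(command)
        if command == "stop_communication":
            self.comm_allowed = False
        elif command == "checkpoint":
            self.running = False                    # stop the processes
            self.saved_state = {"node": self.name}  # save the resource state
        elif command == "resume_communication":
            self.comm_allowed = True
        elif command == "continue":
            self.running = True
        elif command == "kill":
            self.alive = False
        elif command == "allow_communication":
            self.comm_allowed = True
            self.has_domain = False  # modeled here: domain removed afterwards
        return "ack"


def broadcast(nodes, command):
    """Send a command to every node and require every acknowledgment."""
    acks = [node.handle(command) for node in nodes]
    if any(a != "ack" for a in acks):
        raise RuntimeError("node failed to acknowledge " + command)


def coordinate(nodes, suspend=False):
    """Run the checkpoint operation, or the suspend operation if requested."""
    broadcast(nodes, "stop_communication")
    broadcast(nodes, "checkpoint")     # returns only after all nodes saved
    if suspend:
        broadcast(nodes, "kill")
        broadcast(nodes, "allow_communication")
    else:
        broadcast(nodes, "resume_communication")
        broadcast(nodes, "continue")
```

Calling coordinate on a list of Node objects leaves every node running with communication allowed, while coordinate with suspend set to True leaves every node with its processes killed and its process domain removed. The point of the sketch is the ordering: no node resumes communication until all nodes have acknowledged the checkpoint.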
An embodiment of a method of restarting the first through nth parallel processes, 312 . . . 316, of the present invention is illustrated as a flow chart in
In a third step 506, the coordinator 308 sends a restore process domain command to each of the first through nth nodes, 302 . . . 306. In a fourth step 508, each of the first through nth nodes, 302 . . . 306, restores the process domain 110 for its own node. In a fifth step 510, each of the first through nth nodes, 302 . . . 306, restores the first through nth processes, 312 . . . 316, in a stopped state, including any buffered in-transit messages which were previously saved. In a sixth step 512, the coordinator 308 waits for each of the first through nth nodes, 302 . . . 306, to acknowledge completion of restoration of the first through nth parallel processes, respectively. In a seventh step 514, the coordinator 308 sends an allow communication command to each of the first through nth nodes, 302 . . . 306. In response, each of the first through nth nodes, 302 . . . 306, reconfigures the communication rules for itself to allow communication between the process domains 110 in an eighth step 516.
The method 500 continues in a ninth step 518 in which the coordinator 308 sends a continue command to each of the first through nth nodes, 302 . . . 306. In a tenth step 520, each of the first through nth nodes, 302 . . . 306, resumes execution of the first through nth parallel processes, 312 . . . 316. According to an embodiment, in an eleventh step 522, the coordinator 308 waits for an acknowledgment from the first through nth nodes, 302 . . . 306, indicating that the first through nth parallel processes, 312 . . . 316, have resumed execution. According to another embodiment, the eleventh step 522 is not performed and instead the coordinator 308 verifies that the first through nth parallel processes, 312 . . . 316, have resumed execution using another technique.
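To illustrate why restoration on every node completes before communication is allowed, the following Python sketch restores two simulated nodes together with their buffered in-transit messages and only then lets an unconfirmed message reach its destination. The class RestoredNode, the checkpoint layout, and the retransmit step are assumptions for this example; the retransmit step loosely models the resending of unconfirmed data and omits duplicate suppression.

```python
class RestoredNode:
    """In-memory stand-in for a node being restarted from a checkpoint."""

    def __init__(self, name, checkpoint):
        self.name = name
        self.checkpoint = checkpoint  # hypothetical saved-state layout
        self.restored = False
        self.comm_allowed = False
        self.running = False
        self.unacked = []  # (destination name, payload) awaiting confirmation
        self.unread = []   # payloads received but not yet retrieved

    def restore(self):
        # Processes come back stopped, with their buffered messages.
        self.unacked = [tuple(m) for m in self.checkpoint["unacked"]]
        self.unread = list(self.checkpoint["unread"])
        self.restored = True

    def retransmit(self, peers):
        # Resend unconfirmed messages once both ends allow communication.
        for dest, payload in list(self.unacked):
            peer = peers[dest]
            if self.comm_allowed and peer.comm_allowed:
                peer.unread.append(payload)
                self.unacked.remove((dest, payload))


def restart(nodes):
    """Restore all nodes, then allow communication, then continue."""
    peers = {n.name: n for n in nodes}
    for n in nodes:
        n.restore()
    if not all(n.restored for n in nodes):  # coordinator waits for all acks
        raise RuntimeError("restoration incomplete")
    for n in nodes:
        n.comm_allowed = True
    for n in nodes:
        n.running = True
    for n in nodes:
        n.retransmit(peers)


a = RestoredNode("A", {"unacked": [("B", "m1")], "unread": []})
b = RestoredNode("B", {"unacked": [], "unread": ["r0"]})
restart([a, b])
```

After the call, node B holds its previously buffered message r0 followed by m1, and node A no longer holds an unconfirmed copy. Had communication been allowed before node B was restored, the resent message could have arrived at a node with no process domain to receive it.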
Each of the method 400 (
An alternative embodiment of a method of checkpointing the first through nth parallel processes, 312 . . . 316, of the present invention is illustrated as a flow chart in
In a second step 604, each of the first through nth nodes, 302 . . . 306, configures communication rules for itself to stop communication between the process domains 110. In a third step 606, each of the first through nth nodes, 302 . . . 306, stops the first through nth processes, 312 . . . 316, respectively. In a fourth step 608, each of the first through nth nodes, 302 . . . 306, saves a state of resources for the process domains 110 on the first through nth nodes, 302 . . . 306, respectively. The resources for which the state is saved include kernel level and user level resources for the process domain 110 as well as any in-transit messages. Preferably, each of the first through nth nodes, 302 . . . 306, saves the state of the resources in a stable repository such as a file on a disk storage. Alternatively, one or more of the first through nth nodes, 302 . . . 306, save the state of the resources in a memory.
The method 600 continues in a fifth step 610 in which each of the first through nth nodes, 302 . . . 306, employs a barrier mechanism to ensure that all of the first through nth nodes, 302 . . . 306, complete the fourth step 608 before any of the first through nth nodes, 302 . . . 306, proceed. The barrier mechanism comprises a built-in system operation for synchronizing processes. In a sixth step 612, a particular node determines whether the checkpointing forms part of a checkpoint operation or whether the checkpointing forms part of a suspend operation. The checkpoint operation is discussed immediately below. The suspend operation is discussed following the discussion of the checkpoint operation.
If the checkpoint operation is being performed, in a seventh step 614, each of the first through nth nodes, 302 . . . 306, reconfigures the communication rules for itself to allow communication between the process domains 110. In an eighth step 616, each of the first through nth nodes resumes execution of the first through nth parallel processes, 312 . . . 316, respectively. According to an embodiment, in a ninth step 618, the particular node waits for an acknowledgment from the first through nth nodes, 302 . . . 306, indicating that the first through nth parallel processes, 312 . . . 316, have continued execution. According to another embodiment, the ninth step 618 is not performed and instead the particular node verifies that the first through nth parallel processes, 312 . . . 316, have continued execution using another technique.
If the suspend operation is being performed, in a tenth step 620, each of the first through nth nodes, 302 . . . 306, kills the first through nth parallel processes, 312 . . . 316. In an eleventh step 622, each of the first through nth nodes reconfigures the communication rules to allow the communication between the process domains 110. In a twelfth step 624, each of the first through nth nodes, 302 . . . 306, removes its respective process domain 110. According to an embodiment, in a thirteenth step 626, the particular node waits for an acknowledgment from the first through nth nodes, 302 . . . 306, indicating that the first through nth parallel processes, 312 . . . 316, have been killed. According to another embodiment, the thirteenth step 626 is not performed and instead the particular node verifies that the first through nth parallel processes, 312 . . . 316, have been killed using another technique.
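The role of the barrier mechanism in the method 600 can be sketched with threads standing in for nodes. In the following Python example, threading.Barrier is used only as a single-machine analogue of a barrier across nodes; the function run_checkpoint and the step names recorded in the event list are hypothetical.

```python
import threading


def run_checkpoint(node_names, suspend=False):
    """Simulate the barrier-based checkpoint with one thread per node."""
    barrier = threading.Barrier(len(node_names))
    lock = threading.Lock()
    events = []

    def record(name, step):
        with lock:
            events.append((name, step))

    def node(name):
        record(name, "stop_communication")
        record(name, "stop_processes")
        record(name, "save_state")
        barrier.wait()  # no node proceeds until all have saved state
        if suspend:
            record(name, "kill_processes")
            record(name, "allow_communication")
            record(name, "remove_domain")
        else:
            record(name, "allow_communication")
            record(name, "resume_processes")

    threads = [threading.Thread(target=node, args=(n,)) for n in node_names]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return events
```

In the returned event list, every save_state entry precedes every allow_communication entry regardless of how the threads are scheduled, which is the property the barrier provides without a coordinator issuing commands.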
An alternative embodiment of a method of restarting the first through nth parallel processes, 312 . . . 316, of the present invention is illustrated as a flow chart in
In a second step 704, each of the first through nth nodes, 302 . . . 306, configures communication rules for itself to stop communication between the process domains 110. In a third step 706, each of the first through nth nodes, 302 . . . 306, restores the process domain 110 for its own node. In a fourth step 708, each of the first through nth nodes, 302 . . . 306, restores the first through nth processes, 312 . . . 316, in a stopped state, including any buffered in-transit messages which were previously saved. In a fifth step 710, each of the first through nth nodes, 302 . . . 306, employs a barrier mechanism to ensure that all of the first through nth nodes, 302 . . . 306, complete the fourth step 708 before any of the first through nth nodes, 302 . . . 306, proceed.
In a sixth step 712, each of the first through nth nodes reconfigures the communication rules for itself to allow communication between the process domains 110. In a seventh step 714, each of the first through nth nodes resumes execution of the first through nth parallel processes, 312 . . . 316, respectively. According to an embodiment, in an eighth step 716, the particular node waits for an acknowledgment from the first through nth nodes, 302 . . . 306, indicating that the first through nth parallel processes, 312 . . . 316, have resumed execution. According to another embodiment, the eighth step 716 is not performed and instead the particular node verifies that the first through nth parallel processes, 312 . . . 316, have resumed execution using another technique.
It will be readily apparent to one skilled in the art that, while the methods discussed herein refer to a parallel process within a process domain, any process domain could have multiple parallel processes. In such a situation, the steps of the methods which discuss operations on a parallel process within a process domain will be understood by one skilled in the art to refer to operations on the multiple parallel processes.
It will be readily apparent to one skilled in the art that various modifications can be made to the methods of the present invention. Such modifications include separating steps into multiple steps, combining steps into a single command, or rearranging an order of steps. Examples include the following: A checkpoint command of the present invention can be divided into a stop command and a save state command. A suspend operation of the present invention can be implemented using a single suspend command which combines checkpoint and kill commands. A restart method of the present invention can implement steps of sending a resume communication command and sending a continue command in either order. Or, a process domain can include an independent process other than one of the parallel processes in which case the independent process can be allowed to proceed without interruption by the methods of the present invention.
The foregoing detailed description of the present invention is provided for the purposes of illustration and is not intended to be exhaustive or to limit the invention to the embodiments disclosed. Accordingly, the scope of the present invention is defined by the appended claims.