This application claims priority to The People's Republic of China Patent Application No. 201010536241.X, filed Oct. 28, 2010, entitled TASK CANCELLATION GRACE PERIODS.
Large computations or calculations are often executed on clusters of computers. A computer cluster is a group of computing machines that work together or cooperate to perform tasks. A cluster of computers often has a head node and one or more compute nodes. The head node is responsible for allocating compute node resources to jobs, and compute nodes are responsible for performing tasks from the jobs to which their resources are allocated. A job is a request for cluster resources (such as compute node resources) that includes one or more tasks. A task is a piece of computational work that can be performed, such as in one or more compute nodes of a cluster, or in some other environment. A job is started or scheduled by starting one or more tasks in the job.
Sometimes jobs and tasks running on a cluster are cancelled, i.e., terminated before they naturally reach completion. Cancelling a job includes cancelling the tasks in the job that are currently running. A task can be cancelled by terminating processes that are currently performing the computation of the task. Such cancellation may be initiated in various ways and for various reasons, such as in response to user input from an end user or cluster administrator, or as a result of a scheduling policy of the cluster. When a task running on a compute node of the cluster is cancelled, the processes corresponding to the task on the compute node are immediately terminated. Task cancellations may also happen in situations other than in computer clusters, such as in suspend and resume scenarios where tasks may be cancelled, but may resume at a later time.
Whatever the advantages of previous task cancellation tools and techniques, they have neither recognized the task cancellation grace period tools and techniques described and claimed herein, nor the advantages produced by such tools and techniques.
In one embodiment, the tools and techniques can include receiving a command to perform a task, and starting the task. Additionally, a command to cancel the task can be received. The task can be sent a warning signal and provided with a predetermined grace period of time before cancelling the task. If the task has not shut down within the grace period, then the task can be cancelled after the grace period expires.
In another embodiment of the tools and techniques, a command to cancel a running task can be received. It can be determined whether to provide the task with a grace period of time before cancelling the task. If the task is not to be provided with the grace period, then the task can be cancelled without waiting for the grace period to expire. If the task is to be provided with the grace period, then the task can be sent a warning signal and provided with the grace period. If the task has not shut down within the grace period, the task can be cancelled after the grace period expires.
In yet another embodiment of the tools and techniques, at a head node of a cluster, it can be determined that a running task is to be cancelled. A command can be sent from the head node to a compute node that is running the task. The command can instruct the compute node to cancel the task. A warning signal can be sent to the task, and if the task has not shut down when a predetermined grace period of time expires, then the task can be cancelled after the grace period expires.
This Summary is provided to introduce a selection of concepts in a simplified form. The concepts are further described below in the Detailed Description. This Summary is not intended to identify key features or essential features of the claimed subject matter, nor is it intended to be used to limit the scope of the claimed subject matter. Similarly, the invention is not limited to implementations that address the particular techniques, tools, environments, disadvantages, or advantages discussed in the Background, the Detailed Description, or the attached drawings.
Embodiments described herein are directed to techniques and tools for improved cancellation of tasks. Such improvements may result from the use of various techniques and tools separately or in combination.
As noted above, when a task running on a compute node of the cluster is cancelled, the processes corresponding to the task on the compute node are typically terminated immediately. Such sudden termination may not allow tasks a chance to save the computational work they have already done before being terminated, resulting in a loss of the computational time already consumed. The lost computation will be redone the next time the task is run. Moreover, many sophisticated applications will encounter problems in subsequent execution if the applications are not shut down cleanly. For example, unless some applications are shut down correctly, the applications will run recovery code the next time the applications are invoked, or such applications may leave the compute node in a state that makes it difficult for another user to use the same application on that compute node. The tools and techniques described herein can include providing a grace period for job and task cancellation that informs a task that it is about to be terminated and then allows it a grace period to prepare for cancellation, such as by saving its state and/or shutting down cleanly as it chooses. This may be done in a cluster, and it may also be done in other environments.
Such techniques and tools may include sending a warning signal (e.g., a CTRL_BREAK signal) informing a task that it is about to be cancelled. For example, the task may be a task running in a compute node of a cluster. The task can be allowed a grace period to prepare for cancellation. For example, the task may save its state and/or exit cleanly. If the task is still running after the grace period, the task can be cancelled, such as by forcefully terminating the task's processes. A proxy may be provided to receive a signal warning of cancellation and forward a warning signal to the task's process. For example, where the task is running in a console, the proxy may also be running in the console. The proxy can receive a warning signal and forward a corresponding warning signal to the task within the console. The grace period may be bypassed, such as by an administrator, to speed up cancellation of jobs.
The subject matter defined in the appended claims is not necessarily limited to the benefits described herein. A particular implementation of the invention may provide all, some, or none of the benefits described herein. Although operations for the various techniques are described herein in a particular, sequential order for the sake of presentation, it should be understood that this manner of description encompasses rearrangements in the order of operations, unless a particular ordering is required. For example, operations described sequentially may in some cases be rearranged or performed concurrently. Techniques described herein with reference to flowcharts may be used with one or more of the systems described herein and/or with one or more other systems. For example, the various procedures described herein may be implemented with hardware or software, or a combination of both. Moreover, for the sake of simplicity, flowcharts may not show the various ways in which particular techniques can be used in conjunction with other techniques.
FIG. 1 illustrates a generalized example of a suitable computing environment (100) in which one or more of the described embodiments may be implemented. The computing environment (100) is not intended to suggest any limitation as to scope of use or functionality of the invention, as the present invention may be implemented in diverse general-purpose or special-purpose computing environments.
With reference to FIG. 1, the computing environment (100) can include at least one processing unit and memory (120). The processing unit executes computer-executable instructions and may be a real or a virtual processor. In a multi-processing system, multiple processing units can execute computer-executable instructions to increase processing power. The memory (120) may be volatile memory (e.g., registers, cache, RAM), non-volatile memory (e.g., ROM, EEPROM, flash memory), or some combination of the two. The memory (120) can store software (180) implementing task cancellation grace periods.
Although the various blocks of FIG. 1 are shown with lines for the sake of clarity, in reality the boundaries between components are not so clear and, metaphorically, the lines would more accurately be grey and blurred.
A computing environment (100) may have additional features. In FIG. 1, the computing environment (100) includes storage (140), one or more input devices (150), one or more output devices (160), and one or more communication connections (170).
The storage (140) may be removable or non-removable, and may include non-transitory computer-readable storage media such as magnetic disks, magnetic tapes or cassettes, CD-ROMs, CD-RWs, DVDs, or any other medium which can be used to store information and which can be accessed within the computing environment (100). The storage (140) stores instructions for the software (180).
The input device(s) (150) may be a touch input device such as a keyboard, mouse, pen, or trackball; a voice input device; a scanning device; a network adapter; a CD/DVD reader; or another device that provides input to the computing environment (100). The output device(s) (160) may be a display, printer, speaker, CD/DVD-writer, network adapter, or another device that provides output from the computing environment (100).
The communication connection(s) (170) enable communication over a communication medium to another computing entity. Thus, the computing environment (100) may operate in a networked environment using logical connections to one or more remote computing devices, such as a personal computer, a server, a router, a network PC, a peer device or another common network node. The communication medium conveys information such as data or computer-executable instructions or requests in a modulated data signal. A modulated data signal is a signal that has one or more of its characteristics set or changed in such a manner as to encode information in the signal. By way of example, and not limitation, communication media include wired or wireless techniques implemented with an electrical, optical, RF, infrared, acoustic, or other carrier.
The tools and techniques can be described in the general context of computer-readable media. Computer-readable media are any available media that can be accessed within a computing environment. By way of example, and not limitation, with the computing environment (100), computer-readable media include memory (120), storage (140), and combinations of the above.
The tools and techniques can be described in the general context of computer-executable instructions, such as those included in program modules, being executed in a computing environment on a target real or virtual processor. Generally, program modules include routines, programs, libraries, objects, classes, components, data structures, etc. that perform particular tasks or implement particular abstract data types. The functionality of the program modules may be combined or split between program modules as desired in various embodiments. Computer-executable instructions for program modules may be executed within a local or distributed computing environment. In a distributed computing environment, program modules may be located in both local and remote computer storage media.
For the sake of presentation, the detailed description uses terms like “determine,” “choose,” “adjust,” and “operate” to describe computer operations in a computing environment. These and other similar terms are high-level abstractions for operations performed by a computer, and should not be confused with acts performed by a human being, unless performance of an act by a human being (such as a “user”) is explicitly noted. The actual computer operations corresponding to these terms vary depending on the implementation.
II. Task Execution System and Environment with Cancellation Grace Periods
The task execution system (200) can be implemented with a client (210) and a cluster (212) that can process jobs for the client (210). The task execution system (200) may also include additional clients and/or additional computer clusters. The client (210) can communicate with the cluster (212), which can include a head node (220) running a scheduler service (222). The scheduler service (222) can communicate with the client (210), such as over standard network connections. The cluster (212) can also include a compute node (230), and it may also include additional compute nodes that work together to perform jobs. Communications between nodes may use standard network messaging formats and techniques. The scheduler service (222) can schedule jobs (such as jobs submitted by clients such as client (210)) and the tasks of those jobs on compute nodes in the cluster (212), such as the compute node (230).
The compute node (230) can run a node manager service (232). For example, the node manager service (232) and the scheduler service (222) may be modules that are components of Microsoft® Windows® HPC Server software. The node manager service (232) can be used by the scheduler service (222) to perform task startup and cancellations on the compute node (230).
As will be discussed more below, the compute node (230) can also run other modules under the direction of the node manager service (232). These other modules may include a task event (234), a task object (240) hosting a proxy (242) and a task process (244). A compute node (230) may also run additional task events, task objects, proxies, and/or task processes.
Techniques for starting and cancelling a task within the task execution system (200) will now be described with reference to the flowcharts of FIGS. 3 and 4.
Referring now to FIG. 3, a technique for starting a task in the task execution system (200) will be described. When a task is to be started on the compute node (230), the scheduler service (222) can send a start task message for the task to the node manager service (232) on the compute node (230).
When the node manager service (232) receives the start task message for a task, it can create (330) a task object (240), such as a Windows® job object, for the task. The task object (240) can encapsulate the processes corresponding to that task on the compute node (230). The task object (240) can be started such that any child processes created by the task will not be able to break away from the task object (240). The node manager service (232) can set up the environment for the task's process, such as environment variables, standard out, and standard error. This can also include creating (340) a task event (234), such as a Windows® event, for the task. Instead of creating the process for the task, the node manager service (232) can create (350) a node manager proxy process, or proxy (242), within the task object (240) for the task. The proxy (242) can be passed the identity of the task event (234) created by the node manager service (232), as well as the actual command line for the task. Using this information, the proxy (242) can verify that the identity of the Windows® event passed to it is valid and can start (360) process(es) (244) for the task in the task object (240) with the command line supplied to it by the node manager service (232). The proxy (242) can then wait (370) for either the task event (234) to be signaled or the task process (244) to exit.
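By way of illustration only, the following minimal sketch (in C, using the Win32 API) shows one way a node manager might create a job object and a warning event for a task and then launch a proxy process inside the job object. The function name StartTaskSandbox, the parameter names, and the exact creation flags are assumptions made for this sketch rather than details taken from any particular implementation.

```c
#include <windows.h>

/* Illustrative sketch only: a node manager creates a job object and a named
 * warning event for a task, then launches a proxy process inside the job
 * object.  The proxy (not the node manager) will start the actual task. */
static BOOL StartTaskSandbox(const wchar_t *eventName,  /* per-task event name */
                             wchar_t *proxyCmdLine,     /* proxy command line  */
                             HANDLE *jobOut, HANDLE *eventOut,
                             PROCESS_INFORMATION *proxyOut)
{
    /* Job object that will encapsulate the proxy and the task's processes.
     * Child processes stay inside the job unless a breakaway limit is set,
     * so leaving JOB_OBJECT_LIMIT_BREAKAWAY_OK unset keeps them inside. */
    HANDLE job = CreateJobObjectW(NULL, NULL);
    if (job == NULL) return FALSE;

    /* Manual-reset event the node manager signals to warn of cancellation. */
    HANDLE cancelEvent = CreateEventW(NULL, TRUE, FALSE, eventName);
    if (cancelEvent == NULL) { CloseHandle(job); return FALSE; }

    STARTUPINFOW si = { sizeof(si) };
    PROCESS_INFORMATION pi = { 0 };

    /* Start the proxy suspended so it can be placed in the job object before
     * it runs; give it its own console so console signals stay isolated from
     * other tasks on the compute node. */
    if (!CreateProcessW(NULL, proxyCmdLine, NULL, NULL, FALSE,
                        CREATE_SUSPENDED | CREATE_NEW_CONSOLE,
                        NULL, NULL, &si, &pi)) {
        CloseHandle(cancelEvent); CloseHandle(job);
        return FALSE;
    }
    if (!AssignProcessToJobObject(job, pi.hProcess)) {
        TerminateProcess(pi.hProcess, 1);
        CloseHandle(pi.hThread); CloseHandle(pi.hProcess);
        CloseHandle(cancelEvent); CloseHandle(job);
        return FALSE;
    }
    ResumeThread(pi.hThread);

    *jobOut = job; *eventOut = cancelEvent; *proxyOut = pi;
    return TRUE;
}
```

In this sketch, creating the proxy suspended and resuming it only after it has been assigned to the job object helps ensure that any processes the proxy later creates are also accounted to the job.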
The proxy (242) for a task can be created with a console process creation flag set. Accordingly, each task's processes can be run within the task's own console (260) (which can contain the same processes as are running in the task object (240)), allowing the processes in the console (260) to receive console signals such as CTRL_BREAK from other processes in the console (260), while still maintaining console isolation from other tasks on the compute node (230).
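Continuing the illustration, the proxy side of this arrangement, namely starting the task in the console the proxy already owns and then waiting for either the warning event or the task's exit, might be sketched as follows (again in C with the Win32 API). It is assumed here, for illustration only, that the node manager hands the event name and the task's command line to the proxy, and the helper name RunTaskUnderProxy is likewise illustrative.

```c
#include <windows.h>

/* Illustrative proxy sketch: start the task's process in the console the
 * proxy already owns, then wait for either the cancellation-warning event
 * or the task process itself.  Returns WAIT_OBJECT_0 if the event fired
 * first, WAIT_OBJECT_0 + 1 if the task exited first, WAIT_FAILED on error. */
static DWORD RunTaskUnderProxy(const wchar_t *eventName,  /* from node manager */
                               wchar_t *taskCmdLine,      /* from node manager */
                               PROCESS_INFORMATION *taskOut)
{
    /* Open the named event the node manager created for this task. */
    HANDLE cancelEvent = OpenEventW(SYNCHRONIZE, FALSE, eventName);
    if (cancelEvent == NULL) return WAIT_FAILED;

    STARTUPINFOW si = { sizeof(si) };
    PROCESS_INFORMATION pi = { 0 };

    /* No CREATE_NEW_CONSOLE here: the task inherits the proxy's console, so
     * console signals raised from within that console can reach it.
     * CREATE_NEW_PROCESS_GROUP lets CTRL_BREAK later be aimed at just this
     * task's process group. */
    if (!CreateProcessW(NULL, taskCmdLine, NULL, NULL, FALSE,
                        CREATE_NEW_PROCESS_GROUP, NULL, NULL, &si, &pi)) {
        CloseHandle(cancelEvent);
        return WAIT_FAILED;
    }

    HANDLE waitOn[2] = { cancelEvent, pi.hProcess };
    DWORD which = WaitForMultipleObjects(2, waitOn, FALSE, INFINITE);

    CloseHandle(cancelEvent);
    *taskOut = pi;
    return which;  /* the caller decides whether a warning must be forwarded */
}
```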
Referring now to FIG. 4, a technique for cancelling a task in the task execution system (200) will be described. When a running task is to be cancelled, the scheduler service (222) can send an end task command for the task to the node manager service (232) on the compute node (230) that is running the task, and the end task command can specify a grace period for the task.
When the node manager service (232) receives an end task command, it can check whether the grace period supplied by the end task command is more than zero. If the grace period is more than zero, the node manager service (232) can provide that grace period of time to the task's computational processes before cancelling those processes. Specifically, the node manager service (232) can signal (430) the task event (234) created for that particular task and start (440) a timer (250) set to go off at the end of the grace period.
When the proxy (242) corresponding to that task receives (445) the cancellation signal by noticing that the task event (234) has been signaled from the node manager service (232), the proxy (242) can generate a console CTRL_BREAK event and send (450) the event to the user's computational task process (244) it had started earlier. The proxy (242) can then wait (455) for the task process (244) to exit.
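A minimal sketch of this forwarding step might look like the following (C, Win32), assuming the task process was started with the CREATE_NEW_PROCESS_GROUP flag as in the sketch above, so that the CTRL_BREAK event can be directed at the task's process group; the function name is illustrative.

```c
#include <windows.h>

/* Illustrative forwarding step: warn the task with CTRL_BREAK, then wait for
 * it to exit on its own.  pi is the PROCESS_INFORMATION returned when the
 * task process was started with CREATE_NEW_PROCESS_GROUP. */
static void WarnTaskAndWait(const PROCESS_INFORMATION *pi)
{
    /* CTRL_BREAK is delivered to the task's process group; the proxy is in a
     * different group in the same console, so it does not signal itself. */
    GenerateConsoleCtrlEvent(CTRL_BREAK_EVENT, pi->dwProcessId);

    /* Give the task time to save state and/or shut down cleanly; the grace
     * period deadline is enforced by the node manager's timer, not here. */
    WaitForSingleObject(pi->hProcess, INFINITE);
}
```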
After the task process (244) (including all processes for the task in the task object (240)) exits, the proxy (242) itself can exit, and the node manager service (232) can be notified that the processes within the task object (240) have exited. A task process (244) can register a handler for the CTRL_BREAK signal to be able to process that signal.
In response to receiving the CTRL_BREAK signal, which warns the task that the task will be cancelled, the task can respond by preparing for the cancellation. For example, the task may start a clean exit. As another example, a task may initiate a checkpoint and save its state, but not bother to exit. For MPI (message passing interface) tasks, the CTRL_BREAK signal can be passed through smpd to all the processes for that MPI task on all compute nodes. This can be used by the MPI task to do a synchronous checkpoint on all its processes on all its nodes. For service oriented architecture (SOA) applications, receiving the CTRL_BREAK signal could be interpreted as a command to complete the current request and then exit, rather than abandoning the work that has already been performed.
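By way of illustration, a task written in C might register such a CTRL_BREAK handler as sketched below. The SaveCheckpoint and BeginCleanShutdown helpers are hypothetical placeholders for whatever state-saving or shutdown work a particular application chooses to perform.

```c
#include <windows.h>
#include <stdio.h>

/* Hypothetical application hooks; a real task supplies its own versions. */
static void SaveCheckpoint(void)     { /* persist intermediate results */ }
static void BeginCleanShutdown(void) { /* stop work, flush output, exit */ }

/* Console control handler: treat CTRL_BREAK as "cancellation is coming". */
static BOOL WINAPI OnConsoleCtrl(DWORD ctrlType)
{
    if (ctrlType == CTRL_BREAK_EVENT) {
        SaveCheckpoint();        /* the task may checkpoint and keep running, */
        BeginCleanShutdown();    /* or go on to exit cleanly, as it chooses   */
        return TRUE;             /* handled; skip the default handler */
    }
    return FALSE;                /* let other signals take their default path */
}

int main(void)
{
    /* Register the handler before starting the long-running computation. */
    if (!SetConsoleCtrlHandler(OnConsoleCtrl, TRUE)) {
        fprintf(stderr, "could not register CTRL_BREAK handler\n");
        return 1;
    }
    /* ... long-running computational work of the task goes here ... */
    return 0;
}
```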
When the grace period ends, the timer (250) can go off (460), and it can be determined (465) whether the task is still running at the end of the grace period. Of course, the timer (250) itself may be terminated before it goes off if the task has already exited. If the task process (244), and then the proxy (242), exits before the timer (250) on the node manager service (232) goes off at the end of the grace period, the node manager service (232) can be informed (480) that the task process (244) has exited and can report (490) to the scheduler service (222) that the end task operation has completed. If the timer goes off first, then the node manager service (232) can terminate (470) the task object (240) encapsulating the task's proxy (242) as well as the computational task process (244), and then report (490) to the scheduler service (222) that the end task operation has completed.
A job or task may need to be cancelled immediately without allowing it the grace period. For such situations, a force option to the cancel command can be provided for a job or a task. This force option may be specified, for example, in response to user input from a system administrator. When the force option is specified and the scheduler service (222) sends out the end task command to the node manager service (232) on the compute node (230), the scheduler service (222) can provide a grace period of zero. When the node manager service (232) receives an end task command with a grace period of zero, the node manager service (232) can decide to terminate the task object (240) corresponding to that task immediately, without providing a grace period for the task to prepare for cancellation.
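Pulling the node-manager side of cancellation together, one illustrative sketch (C, Win32) is shown below. The function and parameter names are assumptions for this sketch, and the grace-period timer is modeled simply as a timed wait on the proxy process rather than as a separate timer object.

```c
#include <windows.h>

/* Illustrative end-task handling on a compute node.  job encapsulates the
 * proxy and the task's processes, taskEvent is the per-task warning event,
 * and proxyProcess is the proxy created when the task was started. */
static void HandleEndTask(HANDLE job, HANDLE taskEvent, HANDLE proxyProcess,
                          DWORD gracePeriodMs)
{
    if (gracePeriodMs == 0) {
        /* Force option: no warning and no timer; terminate everything in
         * the task's job object immediately. */
        TerminateJobObject(job, 1);
        return;
    }

    /* Warn the task: the proxy is waiting on this event and will forward a
     * CTRL_BREAK to the task's process group. */
    SetEvent(taskEvent);

    /* Grace-period timer, modeled here as a timed wait on the proxy process,
     * which exits only after the task's processes have exited. */
    if (WaitForSingleObject(proxyProcess, gracePeriodMs) == WAIT_TIMEOUT) {
        /* The grace period expired with the task still running. */
        TerminateJobObject(job, 1);
    }

    /* Either way, the node manager would now report to the scheduler
     * service that the end task operation has completed. */
}
```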
While particular techniques with a particular task execution system (200) have been described, many different variations could be used. For example, the grace period tools and techniques described herein may also be used in environments other than computer clusters. For example, in suspend and resume scenarios that do not involve clusters, a task may be running in an application. The application may be cancelled (suspended), and it may resume at a later time, possibly in another location. When such a task is to be cancelled, the task can be warned and provided with a grace period before cancellation, so the task can prepare for cancellation by saving its state. That saved state can be re-loaded when the task resumes at a later time.
Several task cancellation grace period techniques will now be discussed. Each of these techniques can be performed in a computing environment, such as the task execution system (200) of FIG. 2, the computing environment (100) of FIG. 1, and/or some other computing environment.
Referring to FIG. 5, a task cancellation grace period technique will be described. The technique can include receiving a command to perform a task and starting the task. A command to cancel the task can be received, and the technique can include sending (550) a warning signal to the task and providing the task with a predetermined grace period of time before cancelling the task. If the task has not shut down within the grace period, then the task can be cancelled after the grace period expires.
The technique can be performed in a system that includes a cluster. For example, the technique can be performed by a node of a cluster. The technique may be performed by a compute node, and the command to cancel the task may be received from a head node of the cluster.
The task can be running within a console when the command to cancel the task is received. Additionally, sending (550) the warning signal to the task can include sending a first signal to a proxy running within the console (e.g., by having the proxy listen for signals to an event associated with an object for the console), and sending a second signal from the proxy to the task.
Referring to FIG. 6, another task cancellation grace period technique will be described. The technique can include receiving a command to cancel a running task and determining (620) whether to provide the task with a grace period of time before cancelling the task. If the task is not to be provided with the grace period, then the task can be cancelled without waiting for the grace period to expire. If the task is to be provided with the grace period, then the task can be sent a warning signal and provided with the grace period, and if the task has not shut down within the grace period, the task can be cancelled after the grace period expires.
Determining (620) whether to provide the task with the grace period can include examining the command to cancel the task to determine whether the command indicates a grace period greater than zero, and/or determining whether a grace period field (e.g., a grace period field in the command to cancel the task) is set to a zero value.
The technique of FIG. 6 may also include one or more of the features discussed above with reference to the technique of FIG. 5.
Referring to FIG. 7, yet another task cancellation grace period technique will be described. According to the technique, it can be determined at a head node of a cluster that a running task is to be cancelled. A command can be sent from the head node to a compute node that is running the task, instructing the compute node to cancel the task. A warning signal can be sent to the task, and if the task has not shut down when a predetermined grace period of time expires, then the task can be cancelled after the grace period expires.
The compute node may be a first compute node, which can be a compute node that coordinates between different portions of a task running in multiple compute nodes. Accordingly, the task may also be running in one or more other compute nodes that are receiving instructions from a portion of the task running in the first compute node. In this situation, cancelling the task can include cancelling the portion of the task running in the first compute node and the portion(s) running in the other compute node(s).
Although the subject matter has been described in language specific to structural features and/or methodological acts, it is to be understood that the subject matter defined in the appended claims is not necessarily limited to the specific features or acts described above. Rather, the specific features and acts described above are disclosed as example forms of implementing the claims.