Fault Tolerant System for Execution of Parallel Jobs

Information

  • Patent Application
  • Publication Number
    20080077925
  • Date Filed
    September 26, 2006
  • Date Published
    March 27, 2008
Abstract
The present invention provides a fault tolerant system and method for parallel job execution. In the proposed solution the job state and the state transition control are decoupled. The job execution infrastructure maintains the state information for all the executing jobs, and the job control units, one per job, control the state transitions of their jobs. Due to the stateless nature of the control units, the system and method allow jobs to continue uninterrupted execution even when the corresponding control units fail.
Description

BRIEF DESCRIPTION OF THE FIGURES

The above and other items, features and advantages of the invention will be better understood by reading the following more particular description of the invention in conjunction with the accompanying drawings wherein:



FIG. 1 shows the various states of a process during a normal execution cycle.



FIG. 2 shows an example of a parallel job execution architecture.



FIG. 3 shows a job execution architecture according to an embodiment of the invention.



FIG. 4 shows a job execution architecture according to another embodiment of the invention.





DETAILED DESCRIPTION OF THE INVENTION

The invention will now be described with reference to the accompanying figures. The figures and the description are meant for explanatory purposes only and are in no way intended to limit the inventive features of the invention.


The present invention describes a system and a method for running one or more jobs in parallel on a common job execution infrastructure. The state of the jobs is maintained by the job execution infrastructure, thereby decoupling the control units from the actual execution of their jobs. Should the connection to a control unit be lost for any reason, the execution infrastructure allows the jobs to continue uninterrupted execution, since the state of the jobs resides with the infrastructure and not with the control units.


The control units play a role similar to that of the Mpirun programs, in that they are the programs used to launch jobs on the execution infrastructure. They differ from Mpirun in that they are stateless with respect to the jobs they control. This allows jobs to continue uninterrupted execution even if their corresponding control units are killed.


The actual state of all jobs is maintained by the job execution infrastructure, which exports an interface for job state query and control. The control units query the state of their jobs and communicate state transitions to the execution infrastructure through that interface.


In parallel computing a typical job is divided by the programmer into two or more units, each of which has hardware resources allocated to it. One or more job units may share hardware resources, or hardware resources may be dedicated to each job unit. The jobs are executed using these hardware resources. Each complete job that is run in parallel is referred to as a ‘job’, and each division as a ‘job unit’. Each job unit is equivalent to a process.


During a typical execution cycle a job may pass through several job states. FIG. 1 shows the states a job may assume during its lifecycle. When a new job is created, it is in the ‘initiated’ state 101. To start execution, the job is then loaded 102. Once loaded, the job is given the run command, which moves it into the running state 103. While a job is running, its process may be stopped 104 and later resumed, restoring it to the running state 103. A job may also be put into a debugging mode 105 and then resumed to the running state 103. A running job may terminate naturally or because of an error 106. All the job units of a job, i.e., its sub-processes, have the same state at any given point in time.
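

The lifecycle of FIG. 1 can be sketched as a small state table. The following Python sketch is illustrative only; the state names mirror the figure, but the transition table and the helper function are assumptions, not part of the disclosure.

    # Illustrative sketch of the job lifecycle of FIG. 1 (not from the disclosure).
    from enum import Enum

    class JobState(Enum):
        INITIATED = "initiated"      # 101: job has just been created
        LOADED = "loaded"            # 102: job loaded onto its hardware resources
        RUNNING = "running"          # 103: job executing
        STOPPED = "stopped"          # 104: job paused, may be resumed
        DEBUGGED = "debugged"        # 105: job placed in debugging mode
        TERMINATED = "terminated"    # 106: natural or error termination

    # Assumed transition table; all job units of a job share the same state.
    ALLOWED = {
        JobState.INITIATED: {JobState.LOADED},
        JobState.LOADED: {JobState.RUNNING},
        JobState.RUNNING: {JobState.STOPPED, JobState.DEBUGGED, JobState.TERMINATED},
        JobState.STOPPED: {JobState.RUNNING},
        JobState.DEBUGGED: {JobState.RUNNING},
        JobState.TERMINATED: set(),
    }

    def can_transition(current: JobState, target: JobState) -> bool:
        """Return True if the lifecycle of FIG. 1 permits this transition."""
        return target in ALLOWED[current]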


In a typical parallel computing environment, Mpirun is the program a user runs to launch a job. Mpirun controls the state transitions of the job until the job completes. A system that implements such a parallel computing environment is shown in FIG. 2.


As shown in FIG. 2, jobs A and B may be decomposed into two or more job units 201 and 202. The job units are allocated hardware resources on the job execution infrastructure 203 and are executed in parallel. The state of each job is maintained by its corresponding Mpirun program 204 or 205, which also controls the job's state transitions. These Mpirun programs communicate the jobs' state transitions to the execution infrastructure 203. The job execution infrastructure 203 is stateless with regard to the jobs, whereas the Mpirun programs are stateful.


In the aforementioned setup, if the connection to Mpirun is lost for some reason, the job execution infrastructure immediately terminates the corresponding job. This wastes all the resources the job consumed up to its early termination. Furthermore, in such a setup Mpirun is closely coupled with the job execution infrastructure. Since Mpirun is responsible for actually performing the state transitions, it is difficult to port Mpirun to different platforms.


The present invention proposes to overcome the aforementioned drawbacks by decoupling the control of jobs from the maintenance of their state. FIG. 3 shows an embodiment of the present invention. Once a job that has been divided into two or more job units is received, a job allocation unit (not shown) allocates hardware resources on the job execution infrastructure 303 to the job units 301 and 302. Apart from executing the jobs, the job execution infrastructure also maintains information on the present state of the different jobs 304 and 305. For example, job 304 may be in the "running" state while job 305 may be in the "stopped" state. In this setup, the stateless Mpirun programs 306 and 307 (also called ‘control units’) are responsible only for controlling the state transitions of their jobs. The actual low-level state transition is carried out by the job execution infrastructure 303, which exports an interface 308 for state transition control and monitoring. Any control-unit implementation that conforms to this interface, even one on a different platform, may control parallel jobs on that execution infrastructure.
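

A minimal sketch of such an exported interface is given below, in Python; the class and method names (query_state, request_transition) are assumptions made for illustration and are not taken from the disclosure.

    # Illustrative sketch of the interface 308 exported by the job execution
    # infrastructure; all names here are assumptions.
    from abc import ABC, abstractmethod

    class JobControlInterface(ABC):
        """Contract through which any conforming control unit, on any platform,
        may monitor and control jobs running on the execution infrastructure."""

        @abstractmethod
        def query_state(self, job_id: str) -> str:
            """Return the job's current state, e.g. "running" or "stopped"."""

        @abstractmethod
        def request_transition(self, job_id: str, target_state: str) -> None:
            """Ask the infrastructure to carry out the low-level state transition."""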


In this setup, even if the connection to the control unit is lost for some reason, the job execution infrastructure maintains the job in its current state instead of terminating it. If the job is in the "running" state and then terminates, the infrastructure moves it to the "terminated" state. At a later time, a subsequent control unit can reconnect to the job by polling its current state from the infrastructure, and can then continue to control the job's state transitions. In this manner the work performed by long-running jobs is not lost and resources are not wasted.
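

For example, a replacement control unit might re-attach as sketched below; the function, and the infrastructure object with query_state/request_transition methods, are illustrative assumptions in the spirit of the interface sketched above.

    # Illustrative sketch (not from the disclosure): a subsequent, stateless
    # control unit re-attaching to a job whose original control unit has failed.
    def reattach(infrastructure, job_id: str) -> str:
        """Poll the job's current state from the infrastructure and resume control."""
        state = infrastructure.query_state(job_id)
        if state == "terminated":
            print(f"Job {job_id} has already finished; nothing left to control.")
        elif state == "stopped":
            # The new control unit may immediately drive the next transition.
            infrastructure.request_transition(job_id, "running")
        # From this point on, this control unit communicates all further transitions.
        return state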


In yet another embodiment of the aforementioned concept, in particular for the Blue Gene/L™ supercomputer system as shown in FIG. 4, the job state information is maintained in a DB2 database 406, and the execution infrastructure 403 uses a DB2 client to connect to the database, update job states, and answer job state queries from the job control units. The job execution infrastructure exports a set of Application Programming Interfaces (APIs) for adding and removing jobs, querying job states, and performing state transitions.
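

As an illustration of what the persisted state might look like, the sketch below defines a hypothetical job-state table and the DDL to create it; the actual DB2 schema used by the infrastructure is not specified in the disclosure.

    # Illustrative sketch of a job-state table kept in the DB2 database 406.
    # Column names and types are assumptions; only the idea of persisting job
    # state outside the control units comes from the text.
    JOB_STATE_DDL = """
    CREATE TABLE JOB_STATE (
        JOB_ID     VARCHAR(64)  NOT NULL PRIMARY KEY,
        STATE      VARCHAR(16)  NOT NULL,  -- initiated/loaded/running/stopped/debugged/terminated
        NUM_UNITS  INTEGER      NOT NULL,  -- number of job units in the job
        UPDATED_AT TIMESTAMP    NOT NULL
    )
    """

    def create_job_state_table(connection) -> None:
        """Create the table through any DB-API style DB2 connection (placeholder)."""
        cursor = connection.cursor()
        cursor.execute(JOB_STATE_DDL)
        connection.commit()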


The Mpirun programs 404 and 405 interact with the job execution infrastructure 403 using these APIs. To perform a state transition, the Mpirun calls the corresponding API, and the Blue Gene/L execution infrastructure handles the rest. To check whether the state transition completed successfully, the control units use the ‘query API’; the execution infrastructure answers the query from the database 406.
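

The interaction might look roughly as follows; the function, the API names and the polling loop are illustrative assumptions rather than the actual Blue Gene/L APIs.

    # Illustrative sketch: a control unit requests a transition through one API
    # and verifies completion through the query API; names are assumptions.
    import time

    def run_and_confirm(infrastructure, job_id: str, timeout_s: float = 60.0) -> bool:
        """Request the 'running' state, then poll until the transition is visible."""
        infrastructure.request_transition(job_id, "running")
        deadline = time.monotonic() + timeout_s
        while time.monotonic() < deadline:
            # The infrastructure answers this query from the database 406.
            if infrastructure.query_state(job_id) == "running":
                return True
            time.sleep(1.0)
        return False   # transition not observed within the timeout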


There is no need to maintain a permanent connection between the control units and the infrastructure. This provides flexibility to the extent that the control units can be killed, restarted, suspended, and resumed, and the job will still remain in its current state. The job continues uninterrupted execution or, upon its termination, is moved to the ‘terminated’ state by the infrastructure.


The main reason for maintaining a connection between the infrastructure and the control units is to stream input and output. However, by buffering input and output and later streaming it to a subsequent control unit, the infrastructure can maintain continuity in the execution of jobs even if the connection is lost.
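

A toy sketch of such buffering is shown below; the class and its methods are illustrative assumptions rather than the infrastructure's actual mechanism.

    # Illustrative sketch: the infrastructure buffers a job's output while no
    # control unit is connected and replays it when a subsequent one attaches.
    from collections import deque

    class OutputRelay:
        def __init__(self):
            self._pending = deque()   # lines produced while no control unit is attached
            self._deliver = None      # callable supplied by the attached control unit

        def write(self, line: str) -> None:
            """Called by the infrastructure as the job produces output."""
            if self._deliver is None:
                self._pending.append(line)   # connection lost: buffer the output
            else:
                self._deliver(line)          # control unit attached: stream it directly

        def attach(self, deliver) -> None:
            """A (subsequent) control unit attaches; replay buffered output first."""
            self._deliver = deliver
            while self._pending:
                self._deliver(self._pending.popleft())

        def detach(self) -> None:
            """The control unit was killed or disconnected; fall back to buffering."""
            self._deliver = None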


In the aforesaid description, specific embodiments of the present invention have been described by way of example with reference to the accompanying figures and drawings. One of ordinary skill in the art will appreciate that various modifications and changes can be made to the embodiments without departing from the scope of the present invention as set forth in the claims below. Accordingly, the specification and figures are to be regarded in an illustrative rather than a restrictive sense, and all such modifications are intended to be included within the scope of the present invention.

Claims
  • 1. A system for executing one or more jobs in parallel in a job execution infrastructure, wherein each job comprises one or more job units, the system comprising: a job distribution unit for allocating at least a portion of the job execution infrastructure to each job; and job control units, one per job, for controlling the execution state transitions for the jobs.
  • 2. The system of claim 1, wherein the execution states of the jobs include initial, loaded, running, stopped, debugged and terminated.
  • 3. The system of claim 1, wherein the job execution infrastructure comprises a job control interface used for issuing queries and communicating the execution state transitions for the jobs being executed.
  • 4. The system of claim 3, wherein a job control unit that conforms to the job control interface of said execution infrastructure, may issue queries and communicate execution state transitions for the job it controls, through that interface.
  • 5. The system of claim 1, wherein the job execution infrastructure maintains the state of all executing jobs in a database coupled to the job execution infrastructure.
  • 6. The system of claim 1, wherein the job control units remain stateless with regard to the jobs they control.
  • 7. The system of claim 1, wherein jobs may continue uninterrupted execution even if their corresponding job control units fail.
  • 8. The system of claim 7, wherein subsequent control units may take control over the executing jobs by querying the current state of the job from the execution infrastructure, and continue to communicate further state transitions.
  • 9. A method for executing one or more jobs in parallel in a job execution infrastructure, wherein each job comprises one or more job units, the method comprising: allocating at least a portion of the job execution infrastructure to each job; and controlling the execution state of the jobs.
  • 10. The method of claim 9, wherein the step of controlling the execution state of the jobs comprises interfacing between the execution infrastructure and the job control units using job control APIs exported through the job control interface of the execution infrastructure.
  • 11. The method of claim 10, wherein the job control APIs are used to query the state of the jobs in order to regain control of future state transitions by subsequent job control units when the original job control units fail.