The present disclosure relates generally to computer systems and, in particular, to systems and methods for scheduling on parallel machines including multi-level preemption.
High performance computing platforms such as mainframe or cluster computers are used for computationally intensive operations. Nearly all mainframes have the ability to run (or host) multiple operating systems and thereby operate not as a single computer but as a number of virtual machines. In this role, a single mainframe can replace dozens or even hundreds of smaller servers, reducing management and administrative costs while providing greatly improved scalability and reliability.
Mainframes or clusters may include many (hundreds or even thousands of) central processing units (CPUs). Each CPU contained in a mainframe or cluster computer shall be referred to herein as a “node.” A node, as the term is used herein, is not limited to a CPU and may be any microprocessor and could be contained, for example, in a personal computer.
Mainframes or clusters are designed to handle very high volume input and output (I/O) and emphasize throughput computing. Since the mid-1960s, mainframe designs have included several subsidiary computers (called channels or peripheral processors) which manage the I/O devices, leaving the CPU free to deal only with high-speed memory. In addition, clustered machines (that is, a group of computers clustered together) may also allow high volume input and output.
One embodiment of the present invention is directed to a computing system configured to handle preemption events in an environment having jobs with high and low priorities. The system of this embodiment includes a job queue configured to receive job requests from users, the job queue storing the jobs in an order based on the priority of the jobs, and indicating whether a job is a high priority job or a low priority job. The system of this embodiment also includes a plurality of node clusters, each node cluster including a plurality of nodes and a scheduler coupled to the job queue and to the plurality of node clusters and configured to assign jobs from the job queue to the plurality of node clusters. The scheduler is configured to preempt a first low priority job running in a first node cluster with a high priority job that appears in the job queue after the low priority job has started and, in the event that a second low priority job from the job queue may run on a portion of the plurality of nodes in the first node cluster during a remaining processing time for the high priority job, backfill the second low priority job into the portion of the plurality of nodes and, in the event a second high priority job is received in the job queue and may run on the portion of the plurality of nodes, return the second low priority job to the job queue.
Another embodiment of the present invention is directed to a method for managing preemption events in a backfill enabled computing system. The method of this embodiment includes suspending a first low priority job running on one or more nodes of a node cluster upon receipt of a first high priority job; running the first high priority job on one or more nodes of the node cluster; selecting a second low priority job from a job queue, the second low priority job having a position in the job queue; running the second low priority job on available nodes of the node cluster while the high priority job is running; receiving a request for a second high priority job; and returning, after receiving the request for the second high priority job, the second low priority job to a job queue in the position in the job queue.
Another embodiment of the present invention is directed to a method of managing the operation of a computing system including a plurality of node clusters, each node cluster including a plurality of nodes. The method of this embodiment includes allocating a first low priority job to run on a first set of the nodes in a first node cluster; running the first low priority job on the first set of nodes; receiving, at a job queue, a first high priority job; suspending the first low priority job; running the first high priority job on a second set of nodes in the first node cluster for a predetermined amount of time; selecting a second low priority job from the job queue; running the second low priority job on a third set of nodes in the first node cluster; receiving a second high priority job on the job queue; returning the second low priority job to the job queue; and running the first low priority job after the first and second high priority jobs are complete.
Other systems, methods, and/or computer program products according to embodiments will be or become apparent to one with skill in the art upon review of the following drawings and detailed description. It is intended that all such additional systems, methods, and/or computer program products be included within this description, be within the scope of the present invention, and be protected by the accompanying claims.
The subject matter which is regarded as the invention is particularly pointed out and distinctly claimed in the claims at the conclusion of the specification. The foregoing and other objects, features, and advantages of the invention are apparent from the following detailed description taken in conjunction with the accompanying drawings in which:
f show node usage in a node cluster as various jobs are received and run;
The detailed description explains the preferred embodiments of the invention, together with advantages and features, by way of example with reference to the drawings.
Aspects of the present invention are directed to systems and methods which take into account multiple preemption events. In some embodiments, the invention adjusts to changing backfill window conditions caused by multiple preemption events and cancels low priority jobs which may have advantageously backfilled into the system due to past high priority work. That is, a backfilled low priority job that needs to be preempted by a high priority job will be returned to the job queue in its original location. Operating in this manner preserves the intent of classic preemption rules: to allow higher priority jobs to run immediately even if low priority work is presently using the nodes they need.
In addition, aspects of the present invention prevent previously preempted low priority work from suffering excessive restart delays due to low priority backfill in the event of high priority preemption. As a further advantage, high system utilization is preserved because low priority work may still be advantageously backfilled in the event of a high priority preemption. Systems and methods according to the present invention subject low priority backfill to rules that prevent the problems experienced in the current backfill and preemption implementations.
Each of the node clusters 102 may be coupled to a scheduler 104. The scheduler 104 may, in some embodiments, be a parallel job scheduler such as a Tivoli Workload Scheduler LoadLeveler from IBM. In general, the scheduler 104 determines how to distribute the jobs contained in a job queue 106 amongst the node clusters 102 in order to efficiently utilize the resources of the computing system 100. The computing system 100 receives jobs to be performed from a system user 108 and these jobs are placed in the job queue 106 in the order they are received. Of course, more than one system user 108 could be connected to the computing system 100. The job queue 106 and the operation of the scheduler 104 are described in greater detail below.
According to classic preemption rules, a HiP job will always preempt a LowP job. That is, even if a HiP job is received from a user after a LowP job, the computing system will halt (suspend) the LowP job and run the HiP job if there are not enough nodes to process both jobs simultaneously. In such a case, the LowP job is suspended and put on hold until the nodes it was running on become available (i.e., after the HiP job finishes). Such a preemption rule may be established in the configuration files of the scheduler. As one of skill in the art will realize, different configurations could exist, but the following discussion assumes that the above preemption rule is applied.
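The classic preemption rule described above can be illustrated with a minimal sketch. The `Job` record, the `admit_hip_job` function, and the choice to suspend the most recently started LowP jobs first are assumptions made for illustration only; they are not taken from the scheduler or its configuration files.

```python
from dataclasses import dataclass


@dataclass
class Job:
    name: str
    priority: str        # "HiP" or "LowP"
    nodes_required: int


def admit_hip_job(running, hip_job, total_nodes):
    """Apply the classic rule for one arriving HiP job.

    Returns (running_jobs, suspended_jobs, started). LowP jobs are suspended
    only when the free nodes alone cannot hold the HiP job.
    """
    free = total_nodes - sum(job.nodes_required for job in running)
    still_running = list(running)
    suspended = []
    # Assumption: suspend the most recently started LowP jobs first.
    for job in reversed(running):
        if free >= hip_job.nodes_required:
            break
        if job.priority == "LowP":
            still_running.remove(job)
            suspended.append(job)
            free += job.nodes_required
    if free < hip_job.nodes_required:
        # Even suspending every LowP job would not free enough nodes.
        return list(running), [], False
    still_running.append(hip_job)
    return still_running, suspended, True
```

For example, on a 16-node cluster with an 8-node LowP job running, a 10-node HiP job causes the LowP job to be suspended, while an 8-node HiP job runs alongside it.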
The job queue 106 may also include a state column 212 indicating whether a job is running (R) or idle (I). The job queue 106 may also include a nodes required column 214. The values in the nodes required column 214 represent the number of nodes (or processors) in a node cluster that will be needed for a particular process to run. The job queue 106 may also include a time required column 216 indicating how long the job will take to run and a time in column 218 that represents a time that a particular job was submitted to the queue. As shown, the time in column 218 is just an ordered list for ease of explanation but any type of representation of the time a job came in may be utilized. Alternatively, the time in column 218 could represent an order that jobs are to be completed or any other type of ordering for jobs to be completed.
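One possible in-memory representation of a job queue row is sketched below. The `QueueEntry` type, its field names, and the `order_queue` helper are hypothetical; the ordering shown (HiP before LowP, then by time in) is an assumption about how priority and submission order might be combined.

```python
from dataclasses import dataclass


@dataclass
class QueueEntry:
    job_id: str
    priority: str         # "HiP" or "LowP"
    state: str            # "R" (running) or "I" (idle), as in state column 212
    nodes_required: int   # nodes required column 214
    time_required: int    # time required column 216, in arbitrary time units
    time_in: int          # time in column 218, here a simple submission order


def order_queue(entries):
    """Order entries with HiP jobs first, then by submission order (assumed policy)."""
    return sorted(entries, key=lambda e: (e.priority != "HiP", e.time_in))
```

Because the sort key uses only priority and the time in value, a job that is later returned to the queue with its original time in value falls back into its original relative position.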
It should be understood that the job queue 106 shown in
f show the usage of particular nodes in a single node cluster. As depicted, the node cluster includes 16 nodes. Of course, the node cluster could include any number of nodes. The description of
The scheduler attempts to achieve maximal efficiency and, therefore, maximal usage of the nodes. As shown in
As shown in
As can be seen from this brief example, in the case where multiple preemptions occur in systems where backfilling is allowed, situations could exist where Job 1 never gets completed. Aspects of the present invention may ensure that this will not happen.
At a block 406 a higher priority job is received and causes, at a block 408, the backfill job to be returned to the queue in the location it originally had. Returning the backfill job to the job queue at block 408 accomplishes the goal of ensuring that low priority backfill jobs do not end up taking on a higher priority than previously started low priority jobs. In this manner the present invention helps alleviate the problems associated with backfilling that may exist when multilevel preemption events occur.
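Returning a backfill job to the location it originally had can be sketched as follows, assuming the queue is a list of dictionaries kept in order of a `time_in` value. The `return_to_queue` function and the dictionary keys are illustrative names, not part of the described system.

```python
def return_to_queue(queue, job):
    """Reinsert a preempted backfill job at its original position.

    The queue is assumed to be sorted by each job's original "time_in" value,
    so the job does not gain or lose position by having been backfilled.
    Returns the index at which the job was inserted.
    """
    index = 0
    while index < len(queue) and queue[index]["time_in"] < job["time_in"]:
        index += 1
    queue.insert(index, job)
    job["state"] = "I"  # the job is idle again until rescheduled
    return index
```

In this sketch a backfill job submitted third lands behind any earlier job still waiting in the queue and ahead of jobs submitted after it.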
The process begins at a block 510 where the status of the backfill job is checked. It is then determined, at a block 510, whether a particular processing threshold has been exceeded. This threshold is based on how much of the backfill job has been completed. In some instances it may be advantageous to store all of the processing that has been performed on the backfill job in a storage location (available disk space or memory) so that when the job is restarted this information may be retrieved. Of course, in certain instances the job may not have progressed far enough for the time taken to store the information, and then recall it when the backfill job restarts, to result in any advantage. That is, it may take longer to “checkpoint” the job than to just restart it when the job reappears at the top of the job queue. The threshold that is checked may be a time running or amount of processing completed and is configurable by a user of the system.
If the threshold has not been exceeded, processing progresses to a block 514 where the backfill job is returned to the job queue. If the threshold has been exceeded, processing progresses to a block 516 where the processing that has been performed on the job is stored. After the processing of the backfill job has been stored, at a block 518, the job is returned to the job queue and, in some instances, may include an indication of the memory locations where the checkpoint information has been stored.
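The threshold decision described above might be sketched as below. The function name `requeue_backfill_job`, the dictionary keys, and the use of a plain dictionary as the checkpoint store are assumptions for illustration; queue positioning is omitted.

```python
def requeue_backfill_job(job, threshold, checkpoint_store):
    """Build the queue entry for a preempted backfill job.

    If the job has run longer than the configurable threshold, its progress
    is saved and the entry records where (blocks 516 and 518). Otherwise the
    job is simply requeued and will restart from the beginning (block 514).
    """
    entry = {"job_id": job["job_id"], "state": "I", "checkpoint_key": None}
    if job["time_run"] > threshold:
        # Store the work done so far so it can be retrieved on restart.
        checkpoint_store[job["job_id"]] = job["progress"]
        entry["checkpoint_key"] = job["job_id"]
    return entry
```

The threshold here is a running time, but an amount of processing completed could be compared in the same way.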
In some embodiments, certain jobs may not be checkpointable. For instance, in the example above, assume Job 3 is not checkpointable and it has run for a certain portion of its run time. In some embodiments, jobs that are not able to be checkpointed are tagged as such and the system would not save any job state if the job is placed back onto the job queue. In addition, another class of jobs may also impact storage in such a way that, if they are cancelled after a partial run, there would be undesired side effects. Jobs in this category are tagged as not checkpointable, and not able to be backfilled in the case where a preempted backfill window becomes available. These special situations can be handled by flags within the jobs to indicate which different categories the job belongs to.
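The per-job flags mentioned above could be represented as a small set of bit flags. The `JobFlags` names and the two helper functions are hypothetical and are shown only to illustrate how the two categories might be tested.

```python
from enum import Flag, auto


class JobFlags(Flag):
    NONE = 0
    NOT_CHECKPOINTABLE = auto()     # no job state is saved when requeued
    NO_PREEMPTED_BACKFILL = auto()  # must not backfill a preempted window


def can_checkpoint(flags):
    """True if job state may be saved when the job returns to the queue."""
    return not (flags & JobFlags.NOT_CHECKPOINTABLE)


def can_backfill_preempted_window(flags):
    """True if the job may be backfilled into a preempted backfill window."""
    return not (flags & JobFlags.NO_PREEMPTED_BACKFILL)
```

A job with undesired side effects on partial runs would carry both flags, while an ordinary non-checkpointable job would carry only the first.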
Referring again to
At a block 802 a request for a HiP job is received. At a block 804 it is determined if any of the node clusters, considered either alone or in combination, include enough open nodes to run the HiP job. If so, at a block 818 the HiP job has nodes in one of the node clusters allocated for it and it is run. The allocation of nodes may include updating the status of particular nodes within a node cluster in the machine list as described above. If there are no nodes available, or not enough nodes available to run the HiP job, then the node clusters are examined to see if any have LowP jobs running on them that may be preempted at a block 806. The determination made in block 806 may include scanning the machine list which contains the status of the nodes in node clusters of the computing system. It will be understood that a HiP job may run on a combination of free nodes and nodes that previously had a LowP job running on them. In other words, to the extent additional nodes beyond available free nodes are needed, those nodes may be made available by preempting LowP jobs.
In the event that no LowP jobs are running that may be preempted, i.e., all of the nodes are running high priority jobs, the current scheduler pass moves to the next job and processing returns to block 804. That is, if a job cannot be scheduled, it is skipped and the scheduler considers other jobs in the list. At the next pass, all idle jobs are again checked to see if they can now run (due to changes in node availability). In the event that there is a LowP job running on a particular node in certain clusters that may be preempted (i.e., the LowP job is utilizing the same or more nodes than are needed by the HiP job), at a block 810 it is determined whether this LowP job is a backfill job. In some embodiments, the probability that a node having a backfill job running on it will be selected may be reduced by placing nodes having backfill jobs running on them at the bottom of the machine list and scanning the machine list from top to bottom when searching for nodes having LowP jobs in block 808.
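The machine list ordering described above can be illustrated with a short sketch. The list-of-dictionaries layout, the key names, and both function names are assumptions; the actual machine list format is not specified here.

```python
def order_machine_list(machine_list):
    """Move nodes running backfill jobs to the bottom of the machine list.

    sorted() is stable, so the relative order of the other nodes is kept.
    """
    return sorted(machine_list, key=lambda node: node["is_backfill"])


def pick_preemptible_nodes(machine_list, needed):
    """Scan from top to bottom and collect nodes running LowP jobs."""
    picked = []
    for node in order_machine_list(machine_list):
        if len(picked) == needed:
            break
        if node["priority"] == "LowP":
            picked.append(node["node_id"])
    return picked
```

With this ordering, nodes running backfill jobs are selected only after the nodes running other LowP jobs have been considered.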
If the LowP job is not a backfill job, at a block 820 the job is preempted. In some instances, preempting the job may cause the node or node cluster the job was running on to be placed at the bottom of the machine list. The HiP job is then allocated and run at a block 818. Of course, as one of skill in the art will realize, multiple LowP jobs may need to be preempted in order to free up the required resources. As such, the process performed in block 820 may involve preempting multiple LowP jobs. If the LowP job is a backfill job, a checkpointing routine may be run. That is, if the LowP job is a backfill job as determined at block 810, at a block 812 it is determined if the backfill job should be checkpointed. If so, at a block 814 the processing that has been performed on the backfill job is saved. Regardless of whether checkpointing is required, at a block 816 the backfill job is returned to the job queue. As discussed above, returning the backfill job to the job queue ensures that a LowP backfill job will not, by virtue of being a backfill job, achieve a status that is higher than a LowP job that was previously started but is suspended because it was preempted.
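The overall flow of blocks 802 through 820 can be summarized in a simplified sketch for a single node cluster. The `RunningJob` type, the `schedule_hip_job` function, the action strings, and the use of a run-time threshold for the checkpoint decision are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class RunningJob:
    job_id: str
    priority: str            # "HiP" or "LowP"
    nodes: int
    is_backfill: bool = False
    time_run: int = 0


def schedule_hip_job(hip_nodes, free_nodes, running, checkpoint_threshold):
    """Return the list of actions taken for one arriving HiP job."""
    # Block 804: enough open nodes already?
    if free_nodes >= hip_nodes:
        return ["run HiP"]                                 # block 818
    # Block 806: look for LowP jobs, considering backfill jobs last.
    candidates = sorted(
        (job for job in running if job.priority == "LowP"),
        key=lambda job: job.is_backfill,
    )
    available = free_nodes
    victims = []
    for job in candidates:
        if available >= hip_nodes:
            break
        victims.append(job)
        available += job.nodes
    if available < hip_nodes:
        return ["skip HiP this pass"]                      # retry on the next pass
    actions = []
    for job in victims:
        if job.is_backfill:                                # block 810
            if job.time_run > checkpoint_threshold:        # block 812
                actions.append(f"checkpoint {job.job_id}") # block 814
            actions.append(f"requeue {job.job_id}")        # block 816
        else:
            actions.append(f"suspend {job.job_id}")        # block 820
    actions.append("run HiP")                              # block 818
    return actions
```

Note the asymmetry in the sketch: a preempted non-backfill LowP job is suspended in place, whereas a preempted backfill job is returned to the job queue.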
As described above, embodiments can be embodied in the form of computer-implemented processes and apparatuses for practicing those processes. In exemplary embodiments, the invention is embodied in computer program code executed by one or more network elements. Embodiments include computer program code containing instructions embodied in tangible media, such as floppy diskettes, CD-ROMs, hard drives, universal serial bus (USB) flash drives, or any other computer-readable storage medium, wherein, when the computer program code is loaded into and executed by a computer, the computer becomes an apparatus for practicing the invention. Embodiments include computer program code, for example, whether stored in a storage medium, loaded into and/or executed by a computer, or transmitted over some transmission medium, such as over electrical wiring or cabling, through fiber optics, or via electromagnetic radiation, wherein, when the computer program code is loaded into and executed by a computer, the computer becomes an apparatus for practicing the invention. When implemented on a general-purpose microprocessor, the computer program code segments configure the microprocessor to create specific logic circuits.
While the invention has been described with reference to exemplary embodiments, it will be understood by those skilled in the art that various changes may be made and equivalents may be substituted for elements thereof without departing from the scope of the invention. In addition, many modifications may be made to adapt a particular situation or material to the teachings of the invention without departing from the essential scope thereof. Therefore, it is intended that the invention not be limited to the particular embodiment disclosed as the best mode contemplated for carrying out this invention, but that the invention will include all embodiments falling within the scope of the appended claims. Moreover, the use of the terms first, second, etc. do not denote any order or importance, but rather the terms first, second, etc. are used to distinguish one element from another. Furthermore, the use of the terms a, an, etc. do not denote a limitation of quantity, but rather denote the presence of at least one of the referenced item.
Number | Date | Country | |
---|---|---|---|
20090276781 A1 | Nov 2009 | US |