Recently developed chip sets for the ubiquitous Intel processors expose more power reporting and control interfaces with almost every release. The Basic Input Output System (BIOS) and operating systems that are required for making use of these interfaces are trying to keep pace. In today's systems, power consumption data over a given period is more commonly collected by node than by processor or even by socket. With the advent of nodes with up to hundreds of processors, it has become common practice to share nodes between scheduled jobs.
Ongoing development of BIOS and power reporting features provides for the collection of power consumption data such that the power consumed by specific jobs can be reported to end users, site managers, or programmers for further evaluation. Because of limited APIs (Application Program Interfaces) and node sharing, this data may not be precise, but it is still useful.
There are alternatives in the prior art for managing power consumption, whether for a single processor, for a single socket, for a plurality of boards or modules, or within a computer "node". For example, some current processors, comprising one or more cores, may be configured to manage their own temperature profiles and power consumption. This is done by having the processor chip or module manage its own power by manipulating its own frequency and/or voltage levels. However, this approach may not be desirable for use in large systems such as High Performance Computing (HPC) cluster systems that execute applications distributed across thousands of nodes, because the nodes may then potentially run at different speeds, which may hinder communication or greatly slow overall system performance.
As known in the art, the term "computer cluster", referred to as "cluster" for short, is a type of computer system which completes computing jobs by means of multiple collaborative computers (also known as computing resources such as software and/or hardware resources) which are connected together. These computing resources in the same management domain have a unified management policy and provide services to users as a whole. A single computer in a cluster system is usually called a node or a computing node. The term "node" is meant to be interpreted in a general way. That is, a node can be one processor in a single cabinet, one board of processors in a single cabinet, or several boards in a single cabinet. A node can also be a virtualized collection of computing resources where physical boundaries of the hardware resources are less important.
Also, as known in the art, the scheduling of a plurality of jobs for running by a computer system, or cluster system, is typically done by an operating system program module called a scheduler. Schedulers typically accept jobs from users (or from the system itself) and then schedule those jobs so as to complete them efficiently using the available or assigned system resources. Two exemplary methods of scheduling, or scheduling policies, are "FIFO", meaning First In First Out, and "Priority", meaning jobs are prioritized in some manner and that more important jobs are scheduled so they are more likely to run sooner than jobs having lower priority. Scheduling algorithms or policies may be quite inexact in that, for example, a lower priority job may run before a higher priority job, depending on the specific methods of scheduling or the policy enforced or incorporated by a specific scheduler, or also depending on specific resource requirements.
"Backfill" scheduling is a method of scheduling, or policy, that typically fits on top of or follows a FIFO or Priority scheduler, for example, and attempts to fill in voids or holes in a proposed schedule so as to use resources more efficiently. This may result in the time required to execute a plurality of jobs being reduced from that which would be achieved, for example, by a FIFO or Priority scheduler alone. The term "backfill" as used herein is a form of scheduling optimization which allows a scheduler to make better use of available resources by running jobs out of order. An advantage of such scheduling is that total system utilization is increased, since more jobs may be run in a given interval of time.
In general, a backfill scheduler schedules jobs in an order that is out of order with the arrival of those jobs. That is, for example, if jobs arrive in order 1, 2, 3 a backfill scheduler is allowed to schedule those jobs for starting in an order that is different than 1, 2, 3. A strict FIFO scheduler does not typically schedule jobs out of order. In a similar manner, if jobs are ordered in priority 1, 2, 3 a backfill scheduler may move the starting times of selected jobs so that some of the jobs are started in an order that is out of the order of priority. A strict Priority based scheduler does not typically schedule jobs out of order of priority.
Various optimized scheduling policies for handling various types of jobs are found in the prior art. For example, there are scheduling policies for real time jobs, parallel jobs, serial jobs, and transaction type jobs. Typically, a first in first out (FIFO) scheduling policy is based on looking at priorities of jobs in queues and is beneficial for scheduling serial jobs. A backfill scheduling policy is typically used and is beneficial for handling large-scale parallel jobs as might typically be processed by a Cluster system with dozens, hundreds or even thousands of processing nodes.
Brief descriptions of two typical backfill schedulers from the prior art are provided in Appendices A, and B and such descriptions clearly establish the basis of the methods implemented within a computer system scheduler that are being proposed for modification according to certain illustrated embodiments of the present invention as described hereafter.
In an illustrated embodiment of the present invention, the method or steps performed by a backfill scheduler are modified so as to improve the management of power, and more specifically “peak” power during the running of a plurality of jobs.
It is very beneficial in management of peak power of HPC systems to provide power control features that allow either site management or user software to explicitly set CPU speed parameters, instead of the CPU making those decisions itself. This may potentially help in guaranteeing uniformity over time in either average or maximum peak power usage between the nodes and processors running specific jobs.
Some operating systems and/or BIOS control programs may also allow either the scheduler or the running jobs themselves to manipulate either the system frequency or the frequency of a node. It may also be allowed or provided for applications themselves to explicitly set, change or modify the processor frequency multiple times during the course of a job or series of jobs.
In an illustrated embodiment of the present invention, a "backfill scheduler" is modified to consider or examine, in its processing, the voids left by a normal First In/First Out (FIFO), Priority, or other scheduler which are not normally filled by lower priority jobs. The scheduler determines if these voids can be filled by reducing the CPU frequency of the preceding jobs by an amount that will still allow them to complete before the next scheduling event or job which follows the void. This scheduler mechanism allows these preceding jobs to employ lower peak power usage during their time of execution. This approach or methodology may potentially provide for reducing power during certain periods of time for thousands of nodes, and thus has potential for significantly lowering the peak power requirements of an entire site or an entire cluster. Also, this approach can be optionally implemented by a method that does not delay the start time of any scheduled job, which can be viewed as an advantage.
For purposes of illustration, before discussing specifics, it is important for understanding the examples presented herein to have a description of the starting point for a typical program (job) scheduler, which is described as First In First Out (FIFO). In the examples, the FIFO is viewed as being an algorithm for scheduling a job that "looks at" (examines the predicted attributes of) only one job at a time. That is, as jobs arrive at the input of the FIFO scheduler, the FIFO scheduler looks at the present schedule and, without changing the assignment of previous jobs, makes an assignment of the job presented for scheduling. In the following examples, the FIFO operates such that the job currently being scheduled will not be scheduled any earlier (prior in time) than any other previously scheduled job. The FIFO scheduler IS allowed to start the job currently being scheduled at the SAME time as prior jobs.
A "backfill" scheduler is an enhancement to the FIFO example just described which provides for the scheduler to look "back" and place jobs into the schedule at an assigned time that might be before a job already scheduled. For example, Job 6 might be scheduled to start before Job 5 even though Job 5 arrived for scheduling first. This approach potentially allows for more dense scheduling of resources (more highly utilized) than the pure FIFO approach, but is more complicated, and also may seem unfair to users because a later scheduled job might get done before an earlier scheduled job.
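The FIFO and backfill behaviors described above can be sketched in code. The following is a minimal, illustrative model only (the function and field names are invented for this sketch, not taken from the source): a FIFO placement never starts a job earlier than any previously scheduled job, while backfill drops that constraint and may slot a later-arriving job into an earlier gap. Placed jobs are never moved, so backfill placements cannot delay already-scheduled jobs in this model.

```python
from dataclasses import dataclass

@dataclass
class Job:
    name: str
    nodes: int    # nodes required
    runtime: int  # estimated runtime (whole time units)

def free_nodes(schedule, t, total_nodes):
    """Nodes not in use at time t under the given schedule."""
    return total_nodes - sum(n for (s, e, n) in schedule.values() if s <= t < e)

def earliest_start(schedule, job, total_nodes, not_before=0):
    """Earliest time >= not_before at which the job fits for its whole duration.
    Availability only increases at completion events, so those are the only
    candidate start times that need checking."""
    candidates = sorted({not_before} |
                        {e for (_, e, _) in schedule.values() if e > not_before})
    for t in candidates:
        if all(free_nodes(schedule, u, total_nodes) >= job.nodes
               for u in range(t, t + job.runtime)):
            return t
    return None  # unreachable while job.nodes <= total_nodes

def schedule_jobs(jobs, total_nodes, backfill):
    """FIFO: each job starts no earlier than any prior job's start time.
    Backfill: a job may start earlier, filling voids left by prior jobs."""
    schedule, last_start = {}, 0
    for job in jobs:
        t = earliest_start(schedule, job, total_nodes,
                           0 if backfill else last_start)
        schedule[job.name] = (t, t + job.runtime, job.nodes)
        last_start = max(last_start, t)
    return schedule
```

For example, on a 4-node cluster with arrivals J1 (2 nodes, 10 units), J2 (4 nodes, 2 units), J3 (2 nodes, 4 units), the FIFO placement starts J3 only at time 12, after J2, whereas backfill starts J3 at time 0 alongside J1 without delaying J2.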
In another illustrated embodiment of the present invention, the goal is not to delay the finish time of any scheduled job.
In still another illustrated embodiment of the present invention, another goal is to complete all jobs within the same time as would have been achieved with a normal FIFO scheduling algorithm. In another illustrated embodiment of the present invention, a further goal is to complete all jobs before some specific time, over some specific period of time, or within various constraints in time that might be described by one well versed in the art of computer scheduling, and computer scheduler design.
As known in the art, large High Performance Computing clusters may consist of thousands of compute nodes, each with up to several hundred CPU cores (a core is a single processing unit). Clusters often consume large amounts of power. Feeding power to the cluster and removing the heat produced are major issues for its operation and for controlling the expenses related to running a cluster or an entire computer site. There may be tens or even thousands of jobs executing and being scheduled on the cluster of nodes. Typically, the jobs are comprised of programs which run on each of the nodes allocated for the job run.
Some priority scheme is typically employed in order to choose an order or schedule to be used to run the jobs. The software which chooses the job is referred to as the cluster job scheduler. It is typical that not all of the nodes or other resources required for a particular job are available at the time of job submission. Thus, jobs must wait until their node and other resource requirements are met. While they are waiting, it is possible that many of the nodes that they will eventually use become free. The nodes which became free may sit idle unless the cluster job scheduler can find something for them to do.
A normal backfill scheduler will search for lower priority jobs which could successfully use these smaller number of nodes and still complete in time for the scheduled start of the aforementioned job which is waiting for all the nodes to become free. This requires that the expected runtime of these “backfill” jobs be known to the scheduler. It is typical for the runtime of submitted jobs to be included as an estimate in the job submission in order to facilitate this type of scheduling.
The method of the present invention takes advantage of the fact that there may not be any suitable jobs which can be employed to fill a void. The scheduler is capable of seeing or detecting these unusable voids as soon as the next job to run is in the queue. At that time, according to an illustrated embodiment of the present invention, the scheduler operates to attempt to reduce the size of voids by decreasing the CPU frequency of the nodes running specific jobs which will become idle, or which are scheduled to complete, prior to the next job start. This decrease in the frequency elongates (extends) the execution time of these jobs and reduces the peak power required during this period.
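The frequency computation implied above can be sketched as follows. This is an illustrative calculation, not the source's implementation: it assumes a CPU-bound job whose runtime scales inversely with frequency, so elongated runtime at frequency f is remaining_runtime × nominal/f, and solves for the lowest f at which the job still finishes before the next scheduled job start (the function name and units are invented for this sketch).

```python
def slowed_frequency(nominal_freq_mhz, remaining_runtime, void_length):
    """Lowest frequency at which a CPU-bound job still completes by the
    start of the next scheduled job, assuming runtime scales as 1/frequency.

    remaining_runtime: time left at nominal frequency
    void_length: idle gap between this job's completion and the next job's start
    """
    if void_length <= 0:
        return nominal_freq_mhz  # no void to absorb; keep nominal speed
    # Require: remaining_runtime * nominal/f <= remaining_runtime + void_length
    return nominal_freq_mhz * remaining_runtime / (remaining_runtime + void_length)
```

For example, a node at a nominal 2400 MHz with 60 minutes of work remaining, followed by a 20-minute void, could be slowed to 2400 × 60/80 = 1800 MHz and still complete exactly as the next job starts; in practice a scheduler would likely leave some safety margin.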
A person knowledgeable in the art will be able to employ the techniques of the present invention for either a CPU whose frequency may be set in discrete steps or a CPU whose frequency is continuously variable.
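For a CPU with discrete frequency steps, the continuously computed target would be snapped to an available step. A conservative choice, sketched below with hypothetical step values, is the lowest discrete step that is still at or above the computed target, so the elongated job cannot overrun the void:

```python
def snap_to_step(target_freq, available_steps):
    """Lowest discrete frequency step >= the computed target, so the slowed
    job still meets its deadline. Falls back to the highest step if even
    that cannot meet the target (the job then runs at full available speed)."""
    eligible = [f for f in available_steps if f >= target_freq]
    return min(eligible) if eligible else max(available_steps)
```

With steps of 1200, 1600, 2000, and 2400 MHz (illustrative values), a computed target of 1800 MHz would snap up to 2000 MHz.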
It will be noted that power consumption on current processor models is roughly proportional to frequency, in an approximately linear relationship. Job runtime for a CPU bound job will typically be roughly inversely proportional to frequency. For an I/O bound job, reductions in CPU frequency may have less impact on overall runtime.
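A short worked example of these rough models, assuming power scales linearly with frequency and CPU-bound runtime scales inversely with frequency (the function name and the wattage/timing figures are illustrative, not from the source):

```python
def scaled_power_and_runtime(power_w, runtime_s, freq_ratio):
    """Rough linear models: power scales with frequency; CPU-bound runtime
    scales inversely with frequency. freq_ratio = new_freq / nominal_freq."""
    return power_w * freq_ratio, runtime_s / freq_ratio
```

For example, dropping a node from 2.4 GHz to 1.8 GHz (ratio 0.75) would take an assumed 300 W peak draw to roughly 225 W, while stretching a 60-second CPU-bound job to roughly 80 seconds; the energy consumed by the CPU over the job is about the same, but the peak draw is lower.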
The main power consumption within a typical processing node is primarily by the CPU, but other hardware in the node such as fans and power supplies and other support chips and modules can also require a significant amount of power. In cases where the node is not powered down or put into a low power sleep state, then the use of the teachings of the present invention would typically cause the total power consumption to remain roughly the same or slightly increase while retaining the benefit of the peak power consumption being lower during the periods of reduced frequency. In cases where the node could have been powered down or put into a low power sleep state, then total power may actually increase, but the peak power would still typically be reduced by the use of the teachings of the present invention.
It will be further noted that the peak operating temperatures for each node whose frequency is regulated will also be smoothed. The resulting smoothing of the power consumption and accompanying smoothing of temperature variations could prove beneficial for increasing the lifetime of the components.
A further illustrated embodiment of this present invention utilizes the observation that when a backfill scheduler selects a job or jobs to run in the scheduling voids as previously described, the void will typically not be completely filled. This refinement according to the teachings of the present invention enables the scheduler to determine that the attempt at backfill has left a portion of the void unfilled. This portion can then be eliminated or nearly eliminated by the scheduler reducing the CPU frequency of the node or nodes where the backfill job or jobs are being run. This frequency reduction results in elongating the runtime of the job or jobs, but the adjustment in frequency will be made such that the jobs will still be able to complete in time so as to not impact the overall system schedule. This reduction in frequency results in a reduction in peak power requirements for the site during this period of operation.
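The residual-void refinement above can be sketched as a small calculation. This is illustrative only (names invented for the sketch): given a backfill job placed at the start of a void, it computes the reduced frequency that stretches the job over the whole void under the same 1/frequency runtime model, so the unfilled portion is eliminated without impacting the next scheduled job.

```python
def fill_void_with_backfill(nominal_freq, void_start, void_end, backfill_runtime):
    """Stretch a backfill job (backfill_runtime at nominal frequency, placed
    at void_start) across the entire void by lowering frequency, so it
    completes at void_end under the CPU-bound 1/f runtime model.
    A real scheduler would likely target slightly before void_end as margin."""
    void_len = void_end - void_start
    assert backfill_runtime <= void_len, "job would overrun the void"
    new_freq = nominal_freq * backfill_runtime / void_len
    return new_freq, void_len  # reduced frequency and elongated runtime
```

For example, a 4-unit backfill job placed in a void spanning times 10 to 18 could run at half its nominal frequency and occupy the full 8-unit void.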
Another useful refinement according to the teachings of this invention as shown in a further illustrated embodiment of the present invention is based upon recognizing that after a job or jobs has been deployed at a reduced frequency, a job of higher priority may arrive whose start time may be expedited by having the scheduler reduce elongation of the already launched jobs by increasing their CPU frequency. This increasing of the frequencies of jobs already in execution in turn may produce a void suitable for the running of the newly arrived job or jobs.
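This speed-up refinement can also be sketched. The following is an illustrative selection routine (data layout and names invented for the sketch): for each currently slowed job, it computes when the job would finish if restored to nominal frequency, and restores enough such jobs to free the nodes a newly arrived high-priority job needs by its required start time.

```python
def speed_up_for_arrival(running_jobs, needed_nodes, now, needed_by):
    """Restore slowed jobs to nominal frequency if doing so frees enough
    nodes by needed_by for a newly arrived high-priority job.
    Each entry: (name, nodes, current_freq, nominal_freq, remaining_at_current),
    under the CPU-bound 1/f runtime model.
    Returns the names of jobs to restore, or None if the void cannot be opened."""
    freed, restored = 0, []
    for name, nodes, cur_f, nom_f, remaining in running_jobs:
        if cur_f >= nom_f:
            continue  # job is not slowed; restoring gains nothing
        # Remaining work shrinks by the factor cur_f/nom_f when restored.
        finish_if_restored = now + remaining * cur_f / nom_f
        if finish_if_restored <= needed_by:
            restored.append(name)
            freed += nodes
            if freed >= needed_nodes:
                return restored
    return None
```

For example, a job slowed from 2400 MHz to 1200 MHz with 10 units of remaining runtime at the reduced speed would, if restored, finish in 5 units, possibly soon enough to open a void for the arriving job.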
It will be noted that the prediction of the runtime of a job may not be precise, whether its time is estimated by a person running the job, based upon history, or calculated using some algorithm. Therefore, another refinement according to the teachings of this invention, as shown in a further illustrated embodiment, is based upon recognizing when a job with previously reduced frequency is running longer than its expected or forecasted time and is now delaying the start of another job; in that case, the frequency reduction can be canceled and the frequency increased by the scheduler.
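This overrun check reduces to a simple guard the scheduler could evaluate periodically. The sketch below is illustrative (names invented): if a slowed job's current remaining estimate now threatens the next scheduled job's start, the frequency reduction is canceled and the nominal frequency restored.

```python
def adjust_if_overrunning(now, current_freq, nominal_freq,
                          estimated_remaining, next_job_start):
    """Cancel a frequency reduction when the slowed job's (possibly revised)
    remaining-runtime estimate would delay the next scheduled job.
    Returns the frequency the node should run at."""
    overrunning = now + estimated_remaining > next_job_start
    if current_freq < nominal_freq and overrunning:
        return nominal_freq  # cancel the reduction
    return current_freq      # still on track; keep the reduced frequency
```
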
The invention is better understood by reading the detailed description of the invention in conjunction with the accompanying drawings in which:
(The figure-by-figure descriptions of the illustrated embodiments are truncated in this copy; they refer to the accompanying drawings, including a timeline 330.)
Decisions to take longer to run particular jobs at a reduced speed (e.g., J5), as opposed to running them in their fastest time, can be based on other scheduler criteria such as user attributes, user specified job submission parameters, time of day, temperature of the rack, current site peak power usage, and other factors which will be obvious to those knowledgeable in the art. The method and enhanced scheduler operation according to the present invention utilize the techniques of the backfill scheduler to locate candidate jobs that provide for reduced power consumption.
In other illustrated embodiments, the method of the present invention can be employed to utilize the site scheduler so as to control the processor speed for each job step in order to get more predictable results.
The illustrated method and enhanced scheduler of the present invention may be employed in conjunction with the operations described above.
It may provide further benefits in the operation of a High Performance Computing cluster, or any computer system, by providing a method which allows computer system users or administrators to provide input to the scheduling of programs, jobs, or job sets, and which enables the method employed or performed by the computer system scheduler to utilize that input in making scheduling decisions. For example, a user could choose to maximize performance during the running of a particular job by electing not to allow power management by the scheduler. (It will be noted that power management by the hardware, BIOS, or operating system may still occur to avoid damaging equipment or for other reasons.) In this manner users can "help" or assist the method employed or performed by the scheduler to make better decisions by providing at least some indication and information to the scheduler as to, for example, which jobs are most important, which jobs cannot employ processor frequency management, and which jobs must be completed within specific time periods.
In another illustrated embodiment that incorporates the teachings of the present invention, a selection by a user or administrator might provide a user with the capability of allowing power management with potential for increasing run time in return for reduced billing or charges incurred for running a particular program or job. Connecting or associating billing, rates of resource usage accounting, or other “cost” or “charges” for running particular programs or jobs with user input describing desired or allowed power management by the scheduler would provide incentive for users to allow or select to have power management applied to their job(s). In a further enhancement, specific user jobs could be implicitly run and billed with permission given to apply power management based upon the job being run “from” a specific computer system user id (userid) source.
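The billing incentive described above could be sketched as a simple accounting rule. The following is purely illustrative: the function name, rate structure, and the 15% discount value are invented for this sketch and are not specified in the source.

```python
def job_charge(node_hours, base_rate, allows_power_mgmt, discount=0.15):
    """Hypothetical accounting rule: jobs whose submission permits
    scheduler-driven frequency reduction are billed at a discounted rate,
    giving users an incentive to opt in to power management."""
    rate = base_rate * (1 - discount) if allows_power_mgmt else base_rate
    return node_hours * rate
```

Under this sketch, a 100 node-hour job at a base rate of 1.0 would be billed 85.0 if it opts in to power management and 100.0 if it does not.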
This application claims priority to U.S. Provisional Patent Application 61/560,652, filed Nov. 16, 2011, titled "A MODIFIED BACKFILL SCHEDULER AND A METHOD EMPLOYING FREQUENCY CONTROL TO REDUCE PEAK CLUSTER POWER REQUIREMENTS," with first named inventor David A. Egolf, Glendale, Ariz. (US), which is expressly incorporated herein as though set forth in full.