The embodiments of the invention will be better understood from the following detailed description with reference to the drawings, in which:
The embodiments of the invention and the various features and advantageous details thereof are explained more fully with reference to the non-limiting embodiments that are illustrated in the accompanying drawings and detailed in the following description. It should be noted that the features illustrated in the drawings are not necessarily drawn to scale. Descriptions of well-known components and processing techniques are omitted so as to not unnecessarily obscure the embodiments of the invention. The examples used herein are intended merely to facilitate an understanding of ways in which the embodiments of the invention may be practiced and to further enable those of skill in the art to practice the embodiments of the invention. Accordingly, the examples should not be construed as limiting the scope of the embodiments of the invention.
The embodiments of the invention include the following. First, the problem of co-scheduling job dispatching and data replication assignments, scheduling both simultaneously to achieve good makespans, is identified. Second, it is shown that deploying a genetic search method to solve the optimal allocation problem has the potential to achieve significant speed-ups over traditional allocation mechanisms. Embodiments herein provide three variables within a job scheduling system, namely the order of jobs in the scheduler queue, the assignment of jobs to compute nodes, and the assignment of data replicas to local data stores. There exists an optimal solution that provides the best schedule with the minimal makespan, but the solution space is prohibitively large for exhaustive searches. To find the optimal (or near-optimal) combination of these three variables in the solution space, an optimization heuristic is provided using a genetic method. By representing the three variables in a “chromosome” and allowing them to compete and evolve, the method converges towards an optimal (or near-optimal) solution.
A job and data co-scheduling model 100 is illustrated in
The compute nodes 130 are supported by local data stores 140 capable of caching read-only replicas of data downloaded from remote data stores 150. The local data stores 140, depending on the context of the applications, can range from web proxy caches to data warehouses. It is assumed that the compute nodes 130 and the local data stores 140 are connected on a high-speed LAN (e.g., Ethernet or Myrinet) and that data can be transferred across the stores. The model 100 can be extended to multiple LANs containing clusters of compute nodes and data stores, but for simplicity a single LAN is assumed. Data downloaded from the remote stores 150 crosses a wide-area network 160 such as the Internet. The terms “data object”, “object”, “data object 170”, or “object 170” [Stockinger+01] are used to encompass a variety of potential data manifestations, including Java objects and aggregated SQL tuples, although its meaning can be construed to be a file on a file system.
The model 100 includes the following assumptions. First, the jobs 110 follow the bag-of-tasks programming model with no inter-job communication. Second, data retrieved from the remote data stores 150 is read-only. Output being written back to the remote data stores 150 is not considered because computed output is typically directed to the local file system at the compute nodes 130, and such output is commonly much smaller than, and negligible compared to, the input data. Further, the computation time required by a job 110 is known to the scheduler 120. In practical terms, when jobs 110 are submitted to the scheduler 120, the submitting user typically assigns an expected duration of usage to each job 110 [Mu'alem+01]. Moreover, the data objects 170 required to be downloaded for a job 110 are known to the scheduler 120 and can be specified at the time of job submission. Additionally, the communication cost for acquiring the data objects 170 can be calculated for each job 110. The only communication cost considered is transmission delay, which can be computed by dividing a data object 170's size by the bottleneck bandwidth between a sender and receiver. As such, neither queueing delay nor propagation delay is considered. Furthermore, if the data object 170 is a file, its size is typically known to the job 110's user and specified at submission time. On the other hand, if the data object 170 is produced dynamically by a remote server, it is assumed that there exists a remote API that can provide the approximate size of the data object 170. For example, for data downloads from a web server, an HTTP HEAD request can be used to get the size of the requested URI (uniform resource identifier) prior to actually downloading it. Moreover, the bottleneck bandwidth between two network points can be ascertained using known techniques [Hu+04] [Ribeiro+04] that typically trade off accuracy with convergence speed.
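The transmission-delay assumption above can be sketched as a short calculation; the function name and units are illustrative, not from the model itself:

```python
# Sketch of the only communication cost the model considers: transmission
# delay, i.e., a data object's size divided by the bottleneck bandwidth
# between sender and receiver. Queueing and propagation delays are ignored,
# per the model's assumptions. Names and units are illustrative.

def estimate_delay(object_size_bytes: float, bottleneck_bw_bytes_per_s: float) -> float:
    """Transmission delay (seconds) for one data object over the bottleneck link."""
    if bottleneck_bw_bytes_per_s <= 0:
        raise ValueError("bandwidth must be positive")
    return object_size_bytes / bottleneck_bw_bytes_per_s

# Example: a 100 MB object over a 10 MB/s bottleneck takes 10 seconds.
delay = estimate_delay(100e6, 10e6)
```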
It is assumed that such information can be periodically updated by a background process and made available to the scheduler 120. In addition, arbitrarily detailed delays and costs are not included in the model 100 (e.g., database access time, data marshalling, or disk rotational latency), as these are dominated by transmission delay and computation time.
Given such assumptions, the lifecycle of a submitted job 110 proceeds as follows. When a job 110 is submitted to the queue, the scheduler 120 assigns it to a compute node 130 (using a traditional load-balancing method or the method discussed herein). Each compute node 130 maintains its own queue from which jobs 110 run in FIFO order. Each job 110 requires data objects 170 from remote data stores 150; these data objects 170 can be downloaded and replicated to one of the local data stores 140 (again, using a traditional method or the method discussed herein), thereby obviating the need for subsequent jobs 110 to download the same data objects 170 from the remote data store 150. All required data objects 170 are downloaded before a job 110 can begin, and data objects 170 are downloaded on-demand in parallel at the time that a job 110 is run. Although parallel downloads will almost certainly reduce the last hop's bandwidth, for simplicity it is assumed that the bottleneck bandwidth is a more significant concern. A requested data object 170 will be downloaded from a local data store 140, if it exists there, rather than from the remote store 150. If a job 110 requires a data object 170 that is currently being downloaded by another job 110 executing at a different compute node 130, the job 110 either waits for that download to complete or instantiates its own, whichever is faster based on expected download time maintained by the scheduler 120.
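The wait-versus-download decision at the end of the lifecycle above can be sketched as follows; the function and parameter names are illustrative assumptions:

```python
from typing import Optional

# Sketch of the decision described above: if a required object is already
# being downloaded by a job on another compute node, the job either waits for
# that download or starts its own, whichever is faster according to the
# expected download times maintained by the scheduler. Names are illustrative.

def choose_download(remaining_other_s: Optional[float], own_download_s: float) -> str:
    """Return 'wait' to reuse the in-flight download, or 'own' to start a new one."""
    if remaining_other_s is not None and remaining_other_s < own_download_s:
        return "wait"
    return "own"
```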
Thus, it can be seen that if jobs 110 are assigned to compute nodes 130 first, the latency required to access data objects 170 may vary drastically because the objects 170 may or may not have been already cached at a close local data store 140. On the other hand, if data objects 170 are replicated to local data stores 140 first, then the subsequent job executions will be delayed due to these same variations in access costs. Furthermore, the ordering of the jobs 110 in the queue can affect the performance. For example, if job 110A is waiting for job 110B (on a different compute node 130) to finish downloading an object 170, job 110A blocks any other jobs 110 from executing on its compute node 130. Instead, if the job queue is rearranged such that other shorter jobs 110 run before job 110A, then these shorter jobs 110 can start and finish by the time job 110A is ready to run. This approach is similar to backfilling methods [Lifka95] that schedule parallel jobs requiring multiple processors. The resulting tradeoffs affect the makespan.
With this scenario as it is illustrated in
Existing work in job scheduling can be analyzed in the context presented above. Prior work in schedulers that dispatch jobs in FIFO order eliminate all but one of the J! job orderings possible. Schedulers that also assume the data objects have been preemptively assigned to local data stores eliminate all but one of the S^D ways to replicate. Essentially all prior efforts have made assumptions that allow the scheduler to make decisions from a drastically reduced solution space that may or may not include the optimal schedule.
The relationship between these three variables is intertwined. Although they can be changed independently of one another, adjusting one variable will have an adverse or beneficial effect on the schedule's makespan that can be counter-balanced by adjusting another variable.
With a solution space size of J!·C^J·S^D, a goal is to find the schedule in this space that produces the shortest makespan. To achieve this goal, a genetic method [Baeck+00] is used as a search heuristic. While other approaches exist, each has its limitations. For example, an exhaustive search, as mentioned, would be pointless given the potentially huge size of the solution space. An iterated hill-climbing search samples local regions but may get stuck at a local optimum. Simulated annealing can break out of local optima, but the mapping of this approach's parameters, such as the temperature, to a given problem domain is not always clear.
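The size of this solution space can be computed directly; even a tiny instance illustrates why exhaustive search is impractical (the function name is illustrative):

```python
import math

# Size of the schedule solution space discussed above: J! orderings of the J
# jobs, C^J assignments of jobs to C compute nodes, and S^D assignments of D
# data objects to S local data stores.

def solution_space_size(J: int, C: int, S: int, D: int) -> int:
    return math.factorial(J) * C**J * S**D

# Even a small instance (10 jobs, 4 nodes, 3 stores, 8 objects) yields
# 10! * 4^10 * 3^8 candidate schedules -- far too many to enumerate.
size = solution_space_size(10, 4, 3, 8)
```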
A genetic method (GM) simulates the behavior of Darwinian natural selection and converges upon an optimal (or near-optimal) solution through successive generations of recombination, mutation, and selection, as shown in the pseudocode 200 of
Initially, a random set of chromosomes is instantiated as the population. The chromosomes in the population are evaluated (hashed) to some metric, and the best ones are chosen to be parents. In this context, the evaluation produces the makespan that results from executing the schedule of a particular chromosome. The parents recombine to produce children (simulating sexual crossover), and occasionally a mutation may arise which produces new characteristics that were not present in either parent; for simplification, embodiments herein did not implement the optional mutation. The best subset of the children is chosen (based on an evaluation function) to be the parents of the next generation. Elitism is further implemented, where the best chromosome is guaranteed to be included in each generation in order to accelerate the convergence to the global optimum (if it is found). The generational loop ends when some criterion is met (e.g., termination after 100 generations). At the end, a global optimum or near-optimum can be found. Note that finding the global optimum is not guaranteed because the recombination has probabilistic characteristics.
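The generational loop above can be sketched in miniature. For brevity, this sketch evolves only the job-to-node assignment string with a toy evaluation function; the toy costs and all names are illustrative assumptions, not the full chromosome of the embodiments:

```python
import random

# Minimal sketch of the generational loop described above: evaluate the
# population, select the best half as parents, recombine to produce children,
# and carry the best chromosome forward unchanged (elitism). Mutation is
# omitted, as in the text. Toy data and names are illustrative assumptions.

JOB_COST = [4, 2, 7, 1, 5, 3]   # hypothetical per-job compute times
NODES = 3                       # hypothetical number of compute nodes

def evaluate(assign):
    # Toy "makespan": the total compute time on the most loaded node.
    load = [0] * NODES
    for job, node in enumerate(assign):
        load[node] += JOB_COST[job]
    return max(load)

def crossover(p1, p2):
    # Single-point crossover on the assignment string.
    cut = random.randrange(1, len(p1))
    return p1[:cut] + p2[cut:]

def genetic_search(pop_size=20, generations=50):
    pop = [[random.randrange(NODES) for _ in JOB_COST] for _ in range(pop_size)]
    best = min(pop, key=evaluate)
    for _ in range(generations):
        parents = sorted(pop, key=evaluate)[:pop_size // 2]
        children = [crossover(*random.sample(parents, 2))
                    for _ in range(pop_size - 1)]
        pop = children + [best]            # elitism: keep the best chromosome
        best = min(pop, key=evaluate)
    return best, evaluate(best)
```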
A genetic method is well suited to this context. The job queue, job assignments, and object assignments can be represented as character strings, which allows prior genetic-method research on effectively recombining string representations of chromosomes to be leveraged [Davis85]. Additionally, a genetic method's running time can be traded off for increased accuracy: increasing the running time of the method increases the chance that the optimal solution is found. While this is true of all search heuristics, a genetic method has the potential to converge to an optimum very quickly.
An objective of the genetic method is to find a combination of the three variables that minimizes the makespan for the jobs. The resulting schedule that corresponds to the minimum makespan will be carried out, with jobs being executed on compute nodes and data objects being replicated to data stores in order to be accessed by the executing jobs. At a high level, the workflow proceeds as follows. First, jobs are queued. Job requests enter the system and are queued by the job scheduler.
Next, the scheduler takes a snapshot of the jobs in the queue. In order to achieve the tightest packing of jobs into a schedule, the scheduling method should look at a large window of jobs at once. FIFO schedulers consider only the front job in the queue. The optimizing scheduler in [Shmueli+03] uses dynamic programming and considers a large group of jobs, which they call “lookahead,” on the order of 10-50 jobs. Embodiments herein call the collection of jobs a snapshot window. The scheduler takes this snapshot of queued jobs and feeds it into the scheduling method. Taking the snapshot can vary in two ways, namely by the frequency of taking the snapshot (e.g., at periodic wallclock intervals or when a particular queue size is reached) or by the size of the snapshot window (e.g., the entire queue or a portion of the queue starting from the front). Embodiments herein can consider the entire queue at once.
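The two ways of varying the snapshot above (trigger frequency and window size) can be sketched as follows; the parameter values and names are illustrative assumptions:

```python
# Sketch of the two snapshot policies described above: take a snapshot either
# at periodic wall-clock intervals or when the queue reaches a threshold
# size, and size the window as either the entire queue or its front portion.
# Parameter values and names are illustrative assumptions.

def should_snapshot(queue_len, last_snapshot, now, interval_s=30.0, max_queue=50):
    """True when either trigger fires: queue size reached or interval elapsed."""
    return queue_len >= max_queue or (now - last_snapshot) >= interval_s

def take_snapshot(queue, window=None):
    """window=None considers the entire queue (as embodiments herein can);
    an integer takes only the front `window` jobs."""
    return list(queue) if window is None else list(queue[:window])
```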
Following this, the GM converges to the optimal schedule. Given a snapshot, the genetic method executes. An objective of the method is to find the minimal makespan. The evaluation function, as more fully described below, takes the current instance of the three variables as input and returns the resulting makespan. As the genetic method executes, it can converge to an optimal (or near-optimal) schedule with the minimum makespan.
Next, the schedule is executed. Given the genetic method's output of an optimal schedule consisting of the job order, job assignments, and object assignments, the schedule is executed. Jobs are executed on the compute nodes, and the data objects are replicated on-demand to the data stores so they can be accessed by the jobs.
Each chromosome consists of three strings, corresponding to the job ordering, the assignment of jobs to compute nodes, and the assignment of data objects to local data stores. As illustrated in
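The three-string chromosome above might be represented as in the following sketch; the field names are illustrative assumptions, not the patent's notation:

```python
from dataclasses import dataclass
from typing import List

# Sketch of the three-string chromosome described above: job ordering,
# job-to-compute-node assignments, and object-to-local-data-store
# assignments. Field names are illustrative assumptions.

@dataclass
class Chromosome:
    job_order: List[int]        # permutation of job indices (queue ordering)
    job_to_node: List[int]      # job_to_node[j] = compute node assigned job j
    object_to_store: List[int]  # object_to_store[d] = local store caching object d

# Example: 3 jobs, 2 compute nodes, 2 data objects, 2 local stores.
chrom = Chromosome(job_order=[2, 0, 1], job_to_node=[0, 1, 0], object_to_store=[1, 0])
```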
Recombination is applied only to strings of the same type to produce a new child chromosome. In a two-parent recombination scheme for arrays of unique elements, a 2-point crossover scheme can be used where a contiguous subsection of the first parent is copied to the child, and then all remaining items in the second parent (that have not already been taken from the first parent's subsection) are then copied to the child in order [Davis85].
In a uni-parent mutation scheme, two items can be chosen at random from an array and the elements can be reversed between them, inclusive. Mutation can be used to increase the probability of finding global optima. Other recombination and mutation schemes are also possible, as well as different chromosome representations.
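The two operators above can be sketched for arrays of unique elements; function names are illustrative:

```python
import random

# Sketch of the permutation operators described above: the 2-point crossover
# copies a contiguous slice of the first parent and fills in the remaining
# elements in the order they appear in the second parent [Davis85]; the
# uni-parent mutation reverses the span between two random positions,
# inclusive. Function names are illustrative assumptions.

def two_point_crossover(p1, p2):
    a, b = sorted(random.sample(range(len(p1) + 1), 2))
    slice_ = p1[a:b]                         # contiguous subsection of parent 1
    rest = [g for g in p2 if g not in slice_]  # remaining items, parent-2 order
    # Re-insert the copied slice at the positions it held in parent 1.
    return rest[:a] + slice_ + rest[a:]

def reversal_mutation(chrom):
    a, b = sorted(random.sample(range(len(chrom)), 2))
    return chrom[:a] + chrom[a:b + 1][::-1] + chrom[b + 1:]
```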
A component of the genetic method is the evaluation function. Given a particular job ordering, set of job assignments to compute nodes, and set of object assignments to local data stores, the evaluation function returns the makespan. The makespan is calculated deterministically from the method described below. The rules use the lookup table 400 in
At any given iteration of the genetic method, the evaluation function executes to find the makespan of the jobs in the current queue snapshot. The pseudocode 500 of the evaluation function is shown in
In the loop spanning lines 11 to 29, the function looks at all objects required by the currently considered job and finds the maximum transmission delay incurred by the objects. Data objects required by the job are downloaded to the compute node prior to the job's execution, either from the data object's source data store or from a local data store. Since the assignment of data objects to local data stores is known during a given iteration of the genetic method, the transmission delay of moving the object from the source data store to the assigned local data store can be calculated (line 17), and the NAOT (next available object time) table entry corresponding to this data object can then be updated (lines 18-22). The NAOT is the earliest time at which the object is available for a final-hop transfer to the compute node, regardless of the local data store. The object may have already been transferred to a different data store, but if the current job can transfer it faster to its assigned data store, then it will do so (lines 18-22). Also, if the object is assigned to a local data store on the compute node's LAN, the object must still be transferred across one more hop to the compute node (see lines 23 and 26).
Lines 31 and 32 compute the start and end computation times for the job at the compute node. Line 36 keeps track of the largest completion time seen so far across all the jobs. Line 38 returns the resulting makespan, i.e., the longest completion time for the current set of jobs.
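A heavily simplified sketch of such an evaluation function follows. Downloads are assumed to begin at time zero, and all names, parameters, and bandwidth figures are illustrative assumptions; the figure's pseudocode carries the full detail (per-line NAOT updates, parallel downloads, wait-versus-own decisions):

```python
# Heavily simplified sketch of the evaluation function described above: for
# each job in chromosome order, find the latest arrival among its required
# objects (a NAOT table lets later jobs reuse an object already replicated
# locally), then run the job FIFO on its assigned compute node and track the
# makespan. Illustrative only; not the figure's full pseudocode.

def evaluate_makespan(job_order, job_to_node, obj_to_store,
                      job_cost, job_objs, obj_size, wan_bw, lan_bw):
    node_free = {}    # per-node time at which its FIFO queue drains
    naot = {}         # obj -> earliest time it is available in a local store
    makespan = 0.0
    for j in job_order:
        data_ready = 0.0
        for obj in job_objs[j]:
            wan_delay = obj_size[obj] / wan_bw               # remote -> local store
            arrival = min(naot.get(obj, wan_delay), wan_delay)
            naot[obj] = arrival                              # keep the faster copy
            final_hop = obj_size[obj] / lan_bw[obj_to_store[obj]]
            data_ready = max(data_ready, arrival + final_hop)
        node = job_to_node[j]
        start = max(node_free.get(node, 0.0), data_ready)    # FIFO on the node
        node_free[node] = start + job_cost[j]
        makespan = max(makespan, node_free[node])
    return makespan
```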
Accordingly, the embodiments of the invention provide a method, service, computer program product, etc. of co-scheduling job assignments and data replication in wide-area systems using a genetic method. A method begins by co-scheduling assignment of jobs and replication of data objects based on job ordering within a scheduler queue, job-to-compute node assignments, and object-to-local data store assignments. As discussed above,
More specifically, the job ordering is determined according to an order in which the jobs are assigned from the scheduler to the compute nodes; and, the job-to-compute node assignments are determined according to which of the jobs are assigned to which of the compute nodes. As discussed above, when a job is submitted to the queue, the scheduler assigns it to a compute node. Each compute node maintains its own queue from which jobs run in first-in-first-out order.
The object-to-local data store assignments are determined according to which of the data objects are replicated to which of the local data stores. As discussed above, each job requires data objects from remote data stores; these objects can be downloaded and replicated to one of the local data stores (again, using a traditional method or the method discussed herein), thereby obviating the need for subsequent jobs to download the same objects from the remote data store. All required data must be downloaded before a job can begin, and objects are downloaded on-demand in parallel at the time that a job is run.
Furthermore, the co-scheduling includes creating chromosomes having first strings, second strings, and third strings, such that the first strings include possible arrays of the job ordering. Moreover, the second strings include possible arrays of the job-to-compute node assignments; and, the third strings include possible arrays of the object-to-local data store assignments. As discussed above, a random set of chromosomes is initially instantiated as the population. The chromosomes in the population are evaluated (hashed) to some metric, and the best ones are chosen to be parents. The evaluation produces the makespan that results from executing the schedule of a particular chromosome.
Next, the first strings, the second strings, and the third strings can be recombined and/or mutated to create new arrays of job ordering, job-to-compute node assignments, and object-to-local data store assignments. As more fully described above, by representing the job ordering, the job-to-compute node assignments, and the object-to-local data store assignments in a “chromosome” and allowing them to compete and evolve, the method naturally converges towards an optimal (or near-optimal) solution.
Additionally, the co-scheduling includes determining an execution time of one or more of the new arrays. As discussed above, given a particular job ordering, set of job assignments to compute nodes, and set of object assignments to local data stores, the evaluation function returns the makespan. Following this, the jobs are assigned to the compute nodes based on results of the co-scheduling; and, the data objects are simultaneously replicated to the local data stores based on the results of the co-scheduling.
More specifically, the job ordering is determined according to an order in which the jobs are assigned from the scheduler to the compute nodes (item 602); and, the job-to-compute node assignments are determined according to which of the jobs are assigned to which of the compute nodes (item 604). As discussed above, when a job is submitted to the queue, the scheduler assigns it to a compute node. Each compute node maintains its own queue from which jobs run in first-in-first-out order.
The object-to-local data store assignments are determined according to which of the data objects are replicated to which of the local data stores (item 606). As discussed above, each job requires data objects from remote data stores; these objects can be downloaded and replicated to one of the local data stores (again, using a traditional method or the method discussed herein), thereby obviating the need for subsequent jobs to download the same objects from the remote data store. A requested object will be downloaded from a local data store, if it exists there, rather than from the remote store. If a job requires an object that is currently being downloaded by another job executing at a different compute node, the job either waits for that download to complete or instantiates its own, whichever is faster based on expected download time maintained by the scheduler.
Furthermore, in item 608, the co-scheduling includes creating chromosomes having first strings, second strings, and third strings, such that the first strings include possible arrays of the job ordering. Moreover, the second strings include possible arrays of the job-to-compute node assignments; and, the third strings include possible arrays of the object-to-local data store assignments. As discussed above, a random set of chromosomes is initially instantiated as the population. The chromosomes in the population are evaluated (hashed) to some metric, and the best ones are chosen to be parents. The evaluation produces the makespan that results from executing the schedule of a particular chromosome.
Next, in item 610, the first strings, the second strings, and the third strings can be recombined and/or mutated to create new arrays of job ordering, job-to-compute node assignments, and object-to-local data store assignments. As more fully described above, the genetic method simulates the behavior of Darwinian natural selection and converges upon an optimal (or near-optimal) solution through successive generations of recombination, mutation, and selection, as shown in the pseudocode of
Additionally, in item 612, the co-scheduling includes determining an execution time of one or more of the new arrays (i.e., the best final execution time). As discussed above, given a particular job ordering, set of job assignments to compute nodes, and set of object assignments to local data stores, the evaluation function returns the makespan. Following this, in item 620, the jobs are assigned to the compute nodes based on results of the co-scheduling; and, the data objects are simultaneously replicated to the local data stores based on the results of the co-scheduling.
The embodiments of the invention can take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment including both hardware and software elements. In a preferred embodiment, the invention is implemented in software, which includes but is not limited to firmware, resident software, microcode, etc.
Furthermore, the embodiments of the invention can take the form of a computer program product accessible from a computer-usable or computer-readable medium providing program code for use by or in connection with a computer or any instruction execution system. For the purposes of this description, a computer-usable or computer readable medium can be any apparatus that can comprise, store, communicate, propagate, or transport the program for use by or in connection with the instruction execution system, apparatus, or device.
The medium can be an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system (or apparatus or device) or a propagation medium. Examples of a computer-readable medium include a semiconductor or solid state memory, magnetic tape, a removable computer diskette, a random access memory (RAM), a read-only memory (ROM), a rigid magnetic disk and an optical disk. Current examples of optical disks include compact disk-read only memory (CD-ROM), compact disk-read/write (CD-R/W) and DVD.
A data processing system suitable for storing and/or executing program code will include at least one processor coupled directly or indirectly to memory elements through a system bus. The memory elements can include local memory employed during actual execution of the program code, bulk storage, and cache memories which provide temporary storage of at least some program code in order to reduce the number of times code is retrieved from bulk storage during execution.
Input/output (I/O) devices (including but not limited to keyboards, displays, pointing devices, etc.) can be coupled to the system either directly or through intervening I/O controllers. Network adapters may also be coupled to the system to enable the data processing system to become coupled to other data processing systems or remote printers or storage devices through intervening private or public networks. Modems, cable modem and Ethernet cards are just a few of the currently available types of network adapters.
A representative hardware environment for practicing the embodiments of the invention is depicted in
The embodiments of the invention include the following. First, the problem of co-scheduling job dispatching and data replication assignments, scheduling both simultaneously to achieve good makespans, is identified in the domain of wide-area distributed systems. Second, it is shown that deploying a genetic search method to solve the optimal allocation problem has the potential to achieve significantly better results versus traditional allocation mechanisms. Embodiments herein provide three variables within a job scheduling system, namely the order of jobs in the scheduler queue, the assignment of jobs to compute nodes, and the assignment of data replicas to local data stores. There exists an optimal solution that provides the best schedule with the minimal makespan, but the solution space is prohibitively large for exhaustive searches. To find the optimal (or near-optimal) combination of these three variables in the solution space, an optimization heuristic can be provided to generate the solution in an efficient manner using a genetic method. By representing the three variables in a “chromosome” and allowing them to compete and evolve, the method converges towards an optimal (or near-optimal) solution.
The foregoing description of the specific embodiments will so fully reveal the general nature of the invention that others can, by applying current knowledge, readily modify and/or adapt for various applications such specific embodiments without departing from the generic concept, and, therefore, such adaptations and modifications should and are intended to be comprehended within the meaning and range of equivalents of the disclosed embodiments. It is to be understood that the phraseology or terminology employed herein is for the purpose of description and not of limitation. Therefore, while the embodiments of the invention have been described in terms of preferred embodiments, those skilled in the art will recognize that the embodiments of the invention can be practiced with modification within the spirit and scope of the appended claims.