A cloud infrastructure can include various resources, including computing resources, storage resources, and/or communication resources, that can be rented by customers (also referred to as tenants) of the provider of the cloud infrastructure. By using the resources of the cloud infrastructure, a tenant does not have to deploy the tenant's own resources for implementing a particular platform for performing target operations. Instead, the tenant can pay the provider of the cloud infrastructure for resources that are used by the tenant. The “pay-as-you-go” arrangement of using resources of the cloud infrastructure provides an attractive and cost-efficient option for tenants that do not desire to make substantial up-front investments in infrastructure.
Some example implementations are described with respect to the following figures.
A cloud infrastructure can include various different types of resources that can be employed by a tenant for deploying a system for performing a workload of the tenant. A tenant can refer to an individual or an enterprise (e.g. a business concern, an educational organization, or a government agency). The resources of the cloud infrastructure are available over a network, such as the Internet, a local area network (LAN), a wide area network (WAN), a virtual private network (VPN), and so forth.
A selected set of resources of the cloud infrastructure form a specific platform configuration that is usable for performing the workload of the tenant. Selections of different combinations of resources form different platform configurations. A platform configuration can refer to an arrangement of resources that together can perform the workload.
In the ensuing discussion, reference is made to computing resources that are used for performing computing tasks. However, it is noted that techniques or mechanisms according to some implementations can be applied to other types of resources that can be available in a cloud infrastructure, including storage resources and/or communication resources. Storage resources can be used for storing data, while communication resources can be used for communicating data between network elements.
Computing resources can include computing nodes, where a “computing node” can refer to a computer, a collection of computers, a processor, or a collection of processors. A tenant can select a cluster of computing nodes to use for performing a workload. Depending on the workload to be performed, the tenant can select clusters of different sizes. A larger cluster size includes a larger number of computing nodes.
In addition, in some implementations, computing resources can also be categorized into computing resources of different processing capacity (resources of different sizes). As examples, the computing resources can include virtual machines (formed of machine-readable instructions) that emulate a physical machine. A virtual machine can execute an operating system and applications like a physical machine. Multiple virtual machines can be included in a physical machine, and these multiple virtual machines can share the physical resources of the physical machine. Virtual machines can be categorized into different sizes, such as small, medium, and large. A small virtual machine has a processing capacity that is less than the processing capacity of a medium virtual machine, which in turn has less processing capacity than a large virtual machine. As examples, a large virtual machine can have twice the processing capacity of a medium virtual machine, and a medium virtual machine can have twice the processing capacity of a small virtual machine. A processing capacity of a virtual machine can refer to a central processing unit (CPU) and memory capacity, for example.
A provider of a cloud infrastructure can charge different prices for use of different resources. For example, the provider can charge a higher price for a large virtual machine, a medium price for a medium virtual machine, and a lower price for a small virtual machine. In a more specific example, the provider can charge a price for the large virtual machine that is twice the price of the medium virtual machine. Similarly, the price of the medium virtual machine can be twice the price of a small virtual machine. Note also that the price charged for a platform configuration can also depend on the amount of time that resources of the platform configuration are used by a tenant.
Although specific relative prices and processing capacities of virtual machines of different sizes are noted above, different relative prices and different relative processing capacities can be employed in other examples.
Instead of providing virtual machines of different processing capacities that are selectable by a tenant, a cloud infrastructure can alternatively or additionally include physical machines of different processing capacities that are selectable by a tenant. As an example, a tenant can select from among a large physical machine, a medium physical machine, and a small physical machine.
Also, the price charged by a provider to a tenant can vary based on the cluster size selected by the tenant. If the tenant selects a larger number of computing nodes to include in the cluster, then the provider would charge a higher price to the tenant.
A tenant is thus faced with a variety of choices with respect to resources available in the cloud infrastructure, where the different choices are associated with different prices. Intuitively, according to examples discussed above, it may seem that a large virtual machine can execute a workload twice as fast as a medium virtual machine, which in turn can execute a workload twice as fast as a small virtual machine. Similarly, it may seem that a 40-node cluster can execute a workload four times as fast as a 10-node cluster.
As an example, the provider may charge the same price to a tenant for the following two platform configurations: (1) a 40-node cluster that uses 40 small virtual machines; or (2) a 10-node cluster using 10 large virtual machines. Although it may seem that either platform configuration (1) or (2) may execute a workload of a tenant with the same performance, in actuality, the performance of the workload may differ on platform configurations (1) and (2). The difference in performance of a workload by the different platform configurations may be due to constraints associated with network bandwidth and persistent storage capacity in each platform configuration. A network bandwidth can refer to the available communication bandwidth for performing communications among computing nodes. A persistent storage capacity can refer to the storage capacity available in a persistent storage subsystem.
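As a small illustration of why equal hourly prices need not lead to equal bills, the sketch below prices the two configurations. The price values and the function name `cluster_cost` are hypothetical; only the 1:2:4 price ratio between small, medium, and large virtual machines is taken from the examples above.

```python
# Hypothetical hourly prices per virtual machine. Only the 1:2:4 ratio
# comes from the text; the absolute values are placeholders.
PRICE_PER_HOUR = {"small": 1.0, "medium": 2.0, "large": 4.0}


def cluster_cost(instance_type: str, nodes: int, hours: float) -> float:
    """Cost of renting `nodes` instances of one type for `hours` hours."""
    return PRICE_PER_HOUR[instance_type] * nodes * hours


# Configurations (1) and (2) cost the same per hour of use ...
cost_small = cluster_cost("small", nodes=40, hours=1.0)
cost_large = cluster_cost("large", nodes=10, hours=1.0)

# ... but if the workload takes longer on one of them (an assumed 1.5 hours
# here), the total charged to the tenant is higher on that configuration.
cost_large_slow = cluster_cost("large", nodes=10, hours=1.5)
```

Because the bill is the product of the hourly price and the time in use, the configuration that completes the workload sooner is the cheaper one whenever the hourly prices match.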
Increasing the number of computing nodes and the number of virtual machines may not lead to a corresponding increase in persistent storage capacity and network bandwidth. Accordingly, a workload that involves a larger amount of network communications would have a poorer performance in a platform configuration with a larger number of computing nodes and virtual machines, for example. Since the price charged to a tenant depends on the amount of time that resources of a platform configuration are used by the tenant, it would be beneficial to select a platform configuration that reduces the amount of time that resources of the cloud infrastructure are used.
The choice of platform configuration in a cloud infrastructure can become even more challenging when a performance objective is to be achieved. For example, one performance objective may be to reduce (or minimize) the overall completion time (referred to as a “makespan”) of the workload.
In accordance with some implementations, techniques or mechanisms are provided to allow for selection of a platform configuration, from among multiple platform configurations, that is able to satisfy an objective of a tenant of a cloud infrastructure. A workload of a tenant can include a number of jobs. Different ordering of the jobs can affect the performance of the workload. Stated differently, a first ordering of the jobs of the workload may complete faster than a second ordering of the jobs of the workload. A specific ordering of jobs of the workload is also referred to as a schedule of the jobs in the workload.
The process further simulates (at 104) performance of the workload of jobs on the different platform configurations according to the respective schedules. The simulation can be performed by a simulator.
In addition, the process selects (at 106), for the workload jobs, a platform configuration from the different platform configurations, based on results of the simulation. The simulation results can include completion times for the different platform configurations. The platform configuration selected from among the different platform configurations can depend on the problem to be solved. In some implementations, the platform configuration selection solves the following problem: given a target makespan (target completion time) specified by a tenant, select the platform configuration that minimizes the cost (note that each of the different platform configurations is associated with a respective cost). In alternative implementations, the platform configuration selection solves the following problem: given a target cost specified by a tenant, select the platform configuration that minimizes the makespan.
The ability to determine different schedules for the jobs of a workload, and the ability to simulate the workload of jobs on different platform configurations according to the respective schedules, allow for a determination of which platform configuration can be a better platform configuration for the workload of jobs (depending on the problem to be solved). In addition to returning a selected platform configuration that achieves better performance or reduced cost, a proposed schedule of jobs can also be returned by the platform configuration process, in some implementations. This schedule of jobs of the workload can be considered an optimized schedule of jobs of the workload, in some examples.
In some implementations, the jobs of the workload can be MapReduce jobs. MapReduce jobs operate according to a MapReduce framework that provides for parallel processing of large amounts of data. A MapReduce framework includes a distributed arrangement of machines to process requests with respect to data.
A MapReduce job is divided into multiple map tasks and multiple reduce tasks, which can be executed in parallel by computing nodes. The map tasks are defined by a map function, while the reduce tasks are defined by a reduce function. Each of the map and reduce functions can be a user-defined function that is programmable to perform target functionalities. A MapReduce job has a map stage (that includes map tasks) and a reduce stage (that includes reduce tasks).
The computing nodes on which map and reduce tasks are performed can be referred to as worker nodes (also referred to as slave nodes). A MapReduce system also includes a master node. MapReduce jobs can be submitted to the master node by various requesters, and the master node can deploy the MapReduce jobs on the worker nodes.
More generally, “map tasks” are used to process input data to output intermediate results, based on a specified map function that defines the processing to be performed by the map tasks. “Reduce tasks” take as input partitions of the intermediate results to produce outputs, based on a specified reduce function that defines the processing to be performed by the reduce tasks. The map tasks are considered to be part of a map stage, whereas the reduce tasks are considered to be part of a reduce stage.
Map tasks are run in map slots of worker nodes, while reduce tasks are run in reduce slots of worker nodes. The map slots and reduce slots are considered the resources used for performing map and reduce tasks. A “slot” can refer to a time slot or alternatively, to some other share of a resource that can be used for performing the respective map or reduce task.
More specifically, in some examples, the map tasks process input key-value pairs to generate a set of intermediate key-value pairs. The reduce tasks produce an output from the intermediate results. For example, the reduce tasks can merge the intermediate values associated with the same intermediate key.
The map function takes input key-value pairs (k1, v1) and produces a list of intermediate key-value pairs (k2, v2). The intermediate values associated with the same key k2 are grouped together and then passed to the reduce function. The reduce function takes an intermediate key k2 with a list of values and processes them to form a new list of values (v3), as expressed below.
map(k1, v1) → list(k2, v2)
reduce(k2, list(v2)) → list(v3)
The reduce function merges or aggregates the values associated with the same key k2. The multiple map tasks and multiple reduce tasks are designed to be executed in parallel across resources of a distributed computing platform that makes up a MapReduce system.
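The map and reduce signatures above can be illustrated with a single-process word count. This is a minimal sketch of the programming model only, not a distributed implementation; the names `run_mapreduce`, `word_count_map`, and `word_count_reduce` and the sample records are hypothetical.

```python
from collections import defaultdict
from typing import Callable, Iterable


def run_mapreduce(
    records: Iterable[tuple],
    map_fn: Callable[[object, object], Iterable[tuple]],
    reduce_fn: Callable[[object, list], list],
) -> dict:
    """Apply map_fn to every (k1, v1), group by k2, then apply reduce_fn."""
    groups = defaultdict(list)
    # Map stage: each input pair yields a list of intermediate (k2, v2) pairs.
    for k1, v1 in records:
        for k2, v2 in map_fn(k1, v1):
            groups[k2].append(v2)
    # Reduce stage: values sharing the same k2 are passed to reduce_fn together.
    return {k2: reduce_fn(k2, values) for k2, values in groups.items()}


def word_count_map(doc_id: str, text: str) -> list:
    """map(k1, v1) -> list(k2, v2): emit (word, 1) for each word."""
    return [(word, 1) for word in text.split()]


def word_count_reduce(word: str, counts: list) -> list:
    """reduce(k2, list(v2)) -> list(v3): sum the counts for one word."""
    return [sum(counts)]


result = run_mapreduce(
    [("d1", "a b a"), ("d2", "b c")], word_count_map, word_count_reduce
)
```

In a MapReduce system the two loops would be spread across map slots and reduce slots of worker nodes; the grouping step corresponds to passing values with the same key k2 to the reduce function.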
Although reference is made to MapReduce jobs in the foregoing, it is noted that techniques or mechanisms according to some implementations can be applied to select platform configurations for workloads that include other types of jobs.
Tenant systems 206 are coupled to the cloud infrastructure 200. A tenant system 206 can refer to a computer or collection of computers associated with a tenant. Through the tenant system 206, a tenant can submit a request to the cloud infrastructure 200 to rent the resources of the cloud infrastructure 200, including the computing nodes 202 and corresponding virtual machines. A request for resources of the cloud infrastructure 200 can be submitted by a tenant system 206 to a control system 208 of the cloud infrastructure 200. The request can identify a workload of jobs to be performed, and can also specify a target makespan or cost of the tenant.
The control system 208 includes a platform configuration selector 210 that is able to select a platform configuration, from among multiple platform configurations, in accordance with some implementations, such as according to the process of
Once the platform configuration is selected by the platform configuration selector 210, the selected resources that are part of the selected platform configuration (including a cluster of computing nodes 202 of a given cluster size, and virtual machines of a given size) are made accessible to the tenant system 206 to perform a workload of the tenant system 206.
Although the scheduler 212, the simulator 214, and the job trace summary module 216 are depicted as being part of the platform configuration selector 210 in some implementations, it is noted that in other examples, the scheduler 212 and/or the simulator 214 and/or job trace summary module 216 can be separate from the platform configuration selector 210.
The platform configuration selector 210, scheduler 212, simulator 214, and job trace summary module 216 can be implemented as machine-readable instructions executable on one or multiple processors 302 in the control system 208. The control system 208 can be implemented as a computer or a number of computers. The machine-readable instructions forming the platform configuration selector 210, scheduler 212, simulator 214, and job trace summary module 216 can be stored in a non-transitory machine-readable or computer-readable storage medium (or storage media) 304.
The processor(s) 302 is (are) coupled to a network interface 306, to allow the control system 208 to communicate over a network, such as a network between the tenant systems 206 and the cloud infrastructure 200.
The following describes further details regarding platform configuration selection according to some implementations.
For a given set of jobs J (the workload), the platform configuration selector 210 can solve either of the following two problems:
As noted above, the platform configuration selector 210 includes or uses the scheduler 212, the simulator 214, and the job trace summary module 216. The job trace summary module 216 produces a job trace summary that includes a summary of a processing trace of each job J, where the processing trace includes N_M^J map task durations and N_R^J reduce task durations, where N_M^J and N_R^J represent the number of map and reduce tasks, respectively, within each job J. Note that a reduce task can include the following phases:
The job processing trace can be obtained in multiple ways, such as from a past run of a job on the corresponding platform configuration (the job execution can be recorded on an arbitrary cluster size), or extracted from a sample execution of this job on a smaller data set, or interpolated by using a benchmarking approach. The benchmarking approach creates a benchmark, which can include a set of parameters and values assigned to the respective parameters. The parameters of the benchmark can characterize a size of input data, and various characteristics associated with map and reduce tasks.
More generally, a job trace summary represents a set of measured durations of map and reduce tasks of a given job on a given platform configuration. The information of the job trace summary can be created for each of multiple platform configurations, which can differ in instance types (e.g. different sizes of virtual machines or physical machines) and different cluster sizes, for example. Using the job trace summary, a job profile can be computed that reflects the average and maximum durations of map and reduce tasks, respectively, of each job.
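As a minimal sketch of how a job profile could be derived from a job trace summary, the code below reduces a list of measured task durations to a task count, an average, and a maximum per stage. The type and function names (`StageProfile`, `stage_profile`, `job_profile`) and the sample durations are hypothetical.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class StageProfile:
    num_tasks: int       # number of tasks in the stage
    avg_duration: float  # average task duration
    max_duration: float  # maximum task duration


def stage_profile(durations: list[float]) -> StageProfile:
    """Summarize the measured task durations of one stage."""
    return StageProfile(len(durations), mean(durations), max(durations))


def job_profile(map_durations: list[float], reduce_durations: list[float]) -> dict:
    """Build a compact profile for one job on one platform configuration."""
    return {
        "map": stage_profile(map_durations),
        "reduce": stage_profile(reduce_durations),
    }


# Hypothetical trace: three map tasks and two reduce tasks, in seconds.
profile = job_profile([10.0, 20.0, 30.0], [5.0, 15.0])
```

One such profile would be computed per job and per platform configuration, since task durations can differ across instance types and cluster sizes.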
The distributions of durations of map and reduce tasks can be used for extracting distribution parameters, and where appropriate, generating scaled traces. A scaled trace refers to a trace for execution on a larger data set, based on a trace obtained from a job execution on a smaller data set. The job traces can be replayed using the simulator 214. Also, the job traces can be used for creating a compact job profile for analytic models, where the compact job profile can include the average and maximum durations of map and reduce tasks, respectively.
For predicting a completion time of a job, the compact job profile that characterizes job execution during a map phase, shuffle/sort phase, and reduce phase with average and maximum task durations can be used. A model for predicting completion time can evaluate lower bounds T_J^low and upper bounds T_J^up on the job completion time. The model can be based on a Makespan Theorem for computing performance bounds on the completion time of a given set of n tasks that are processed by k servers (e.g. n map tasks are processed by k map slots in a MapReduce environment). The completion time of the n tasks can be shown to be at least:
and at most
where avg and max represent the average and maximum durations, respectively, of the n tasks (map tasks or reduce tasks).
The difference between the lower bound T^low and upper bound T^up represents the range of possible completion times due to task scheduling non-determinism. The average of the lower and upper bounds (T_J^avg) can be a good approximation of the job completion time. Using the foregoing, the duration of map and reduce stages of a given job can be estimated as a function of allocated resources of a platform configuration.
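As a hedged illustration, the sketch below uses the bounds commonly stated for greedy assignment of n tasks to k slots: a lower bound of n·avg/k and an upper bound of (n−1)·avg/k + max. These two expressions are an assumption about the form of the bounds referred to above, and the function name `makespan_bounds` is hypothetical.

```python
def makespan_bounds(n: int, k: int, avg: float, max_dur: float) -> tuple[float, float, float]:
    """Return (lower bound, upper bound, their average) for one stage.

    n       -- number of tasks in the stage
    k       -- number of slots (servers) processing the tasks
    avg     -- average task duration
    max_dur -- maximum task duration

    Assumed bounds for greedy task-to-slot assignment:
      lower = n * avg / k
      upper = (n - 1) * avg / k + max_dur
    """
    low = n * avg / k
    up = (n - 1) * avg / k + max_dur
    return low, up, (low + up) / 2


# Hypothetical stage: 100 map tasks on 10 map slots, 6 s average, 10 s maximum.
low, up, estimate = makespan_bounds(n=100, k=10, avg=6.0, max_dur=10.0)
```

Applying the function once to the map stage and once to the reduce stage of a job gives stage duration estimates that vary with the number of slots, and therefore with the cluster size of the platform configuration.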
In some implementations, the scheduler 212 produces a schedule (that includes a specific order of execution of jobs) that reduces (or minimizes) an overall completion time of a given set of jobs. In some examples, a Johnson scheduling technique for identifying an optimal or improved schedule of concurrent jobs can be used. An example of the Johnson scheduling technique is described in S. Johnson, “Optimal Two- and Three-stage Production Schedules with Setup Times Included,” dated May 1953. The Johnson scheduling technique provides a decision rule to determine an optimal scheduling of tasks that are processed in two stages.
In other implementations, other techniques for determining an optimal or improved schedule of jobs can be employed. For example, the determination of the optimal or improved schedule can be accomplished using a brute-force technique, where multiple orders of jobs are considered and the order with the best or better execution time (smallest or smaller execution time) can be selected as the optimal or improved schedule.
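A brute-force search of this kind can be sketched as follows. The sketch assumes a two-stage pipeline abstraction in which each job is a (map duration, reduce duration) pair, map stages run back to back, and the reduce stage of a job starts once its own map stage and the previous job's reduce stage have both finished. The function names and the sample durations are hypothetical.

```python
from itertools import permutations
from typing import Iterable


def two_stage_makespan(order: Iterable[tuple[float, float]]) -> float:
    """Completion time of jobs run in `order` under the two-stage abstraction."""
    map_end = 0.0     # time at which the map stage of the current job ends
    reduce_end = 0.0  # time at which the reduce stage of the current job ends
    for map_dur, reduce_dur in order:
        map_end += map_dur
        # Reduce waits for its own map stage and for the previous reduce stage.
        reduce_end = max(reduce_end, map_end) + reduce_dur
    return reduce_end


def brute_force_schedule(jobs: list[tuple[float, float]]) -> tuple:
    """Try every ordering and keep the one with the smallest makespan."""
    return min(permutations(jobs), key=two_stage_makespan)


# Hypothetical workload of two jobs given as (map, reduce) durations.
jobs = [(20.0, 2.0), (2.0, 20.0)]
best = brute_force_schedule(jobs)
```

The search examines n! orderings, so it is practical only for small workloads; a decision rule such as the Johnson scheduling technique avoids enumerating the orderings.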
The simulator 214 performs a trace replay of jobs in a workload in an order prescribed by a corresponding schedule, as determined by the scheduler 212. The replay of the jobs on a given platform configuration produces results from which the completion time of the jobs and the corresponding cost can be estimated. By varying the platform configuration, the simulator 214 generates a set of performance/cost estimates across different platform configurations. In other words, for each platform configuration of multiple platform configurations, the simulator 214 can produce a correlation representation that correlates platform configuration parameters (e.g. cluster size and instance type) with the achieved makespan. An example of the simulator 214 that can be used includes a simulator as described in A. Verma et al., “Play It Again, SimMR!” in Proc. of Intl. IEEE Cluster 2011.
As an example, if the platform configurations of interest employ small, medium, and large VMs, then three respective correlation representations can be produced.
Next, if a stopping condition is not satisfied (as determined at 412), the size (N) of the cluster can be incrementally increased (at 414) (e.g. by adding a computing node to the cluster), and the iterative process of
The iterative process of
From the performance of iterative processes of
As noted above, platform configuration selection can be based on solving one of two problems: (1) given a target makespan T specified by a tenant, select the platform configuration that minimizes the cost; or (2) given a target cost C specified by a tenant, select the platform configuration that minimizes the makespan.
To solve problem (1), the following procedure can be performed.
To solve problem (2), the following procedure can be performed.
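A minimal sketch of both selections is given below, assuming the simulator has already produced one makespan and cost estimate per candidate platform configuration. It is one straightforward reading of problems (1) and (2), not necessarily the procedure used in a given implementation; the `Estimate` type, the function names, and the sample numbers are hypothetical.

```python
from typing import NamedTuple, Optional


class Estimate(NamedTuple):
    instance_type: str  # e.g. "small", "medium", "large"
    cluster_size: int   # number of computing nodes
    makespan: float     # simulated completion time of the workload
    cost: float         # cost of using the configuration for that time


def min_cost_within_makespan(
    estimates: list[Estimate], target_makespan: float
) -> Optional[Estimate]:
    """Problem (1): cheapest configuration that meets the target makespan T."""
    feasible = [e for e in estimates if e.makespan <= target_makespan]
    return min(feasible, key=lambda e: e.cost, default=None)


def min_makespan_within_cost(
    estimates: list[Estimate], target_cost: float
) -> Optional[Estimate]:
    """Problem (2): fastest configuration that stays within the target cost C."""
    feasible = [e for e in estimates if e.cost <= target_cost]
    return min(feasible, key=lambda e: e.makespan, default=None)


# Hypothetical simulation results for three candidate configurations.
estimates = [
    Estimate("small", 40, makespan=100.0, cost=50.0),
    Estimate("medium", 20, makespan=80.0, cost=60.0),
    Estimate("large", 10, makespan=120.0, cost=55.0),
]
```

Both functions return `None` when no candidate satisfies the target, which a caller could report back to the tenant as an infeasible request.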
The following further describes determining a schedule of jobs of a workload, according to some implementations. For a set of MapReduce jobs (with no data dependencies between them), the order in which the jobs are executed may impact the overall processing time, and thus, utilization and the cost of the rented platform configuration (note that the price charged to a tenant can also depend on a length of time that rented resources are used; thus, increasing the processing time can lead to increased cost).
The following considers an example execution of two (independent) MapReduce jobs J1 and J2 in a cluster, in which no data dependencies exist between the jobs. As shown in
A first execution order of the jobs may lead to a less efficient resource usage and an increased processing time as compared to a second execution order of the jobs. To illustrate this, consider an example workload that includes the following two jobs:
There are two possible execution orders for jobs J1 and J2 shown in
More generally, there can be a substantial difference in the job completion time depending on the execution order of the jobs of a workload. A workload J = {J1, J2, . . . , Jn} includes a set of n MapReduce jobs with no data dependencies between them. The scheduler 212 generates an order (a schedule) of execution of jobs Ji ∈ J such that the makespan of the workload is minimized. For minimizing the makespan of the workload of jobs J = {J1, J2, . . . , Jn}, the Johnson scheduling technique discussed above can be used.
Each job Ji in the workload of n jobs can be represented by the pair (mi, ri) of map and reduce stage durations, respectively. The values of mi and ri can be estimated using lower and upper bounds, as discussed above, in some examples. Each job Ji=(mi,ri) can be augmented with an attribute Di that is defined as follows: Di = (mi, m) if min(mi, ri) = mi, and Di = (ri, r) otherwise.
The first argument in Di is referred to as the stage duration and denoted as Di1. The second argument in Di is referred to as the stage type (map or reduce) and denoted as Di2. In (mi, m), mi represents the duration of the map stage, and m denotes that the type of the stage is a map stage. Similarly, in (ri, r), ri represents the duration of the reduce stage, and r denotes that the type of the stage is a reduce stage.
An example pseudocode of the Johnson scheduling technique is provided below.
The Johnson scheduling technique (as performed by the scheduler 212) depicted above is discussed in connection with
Line 1 of the pseudocode sorts the n jobs of the set J into the ordered list L in such a way that job Ji precedes job Ji+1 in the ordered list L if and only if min(mi, ri) ≤ min(mi+1, ri+1). In other words, the jobs are sorted using the stage duration attribute Di1 in Di (stage duration attribute Di1 represents the smaller duration of the two stages).
The pseudocode takes jobs from the ordered list L and places them into the schedule σ (represented by the scheduling queue 702) from the two ends (head and tail), and then proceeds to place further jobs from the ordered list L in the intermediate positions of the scheduling queue 702. As specified at lines 4-6 of the pseudocode, if the stage type Di2 in Di is m, i.e. Di2 represents the map stage type, then job Ji is placed at the current available head of the scheduling queue 702 (as represented by head, which is initialized to the value 1). Once job Ji is placed in the scheduling queue 702, the value of head is incremented by 1 (so that a next job would be placed at the next head position of the scheduling queue 702).
As specified at lines 7-9 of the pseudocode, if the stage type Di2 in Di is not m, then job Ji is placed at the current available tail of the scheduling queue 702 (as represented by tail, which is initialized to the value n). Once job Ji is placed in the scheduling queue 702, the value of tail is decremented by 1 (so that a next job would be placed at the next tail position of the scheduling queue 702, one position closer to the head).
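The steps described above can be sketched in Python as follows. This is an illustrative rendering rather than the pseudocode itself: it uses 0-based positions (head starts at 0 and tail at n−1, where the description uses 1 and n), and the function name `johnson_schedule` and the sample durations are hypothetical. Each job is given as a pair (mi, ri), and Di is taken to be (mi, m) when mi is the smaller of the two stage durations and (ri, r) otherwise.

```python
def johnson_schedule(jobs: list[tuple[float, float]]) -> list[int]:
    """Return job indices in execution order for jobs given as (m_i, r_i)."""
    n = len(jobs)

    # Augment each job with D_i = (stage duration, stage type).
    annotated = []
    for i, (m, r) in enumerate(jobs):
        d = (m, "m") if m <= r else (r, "r")
        annotated.append((d, i))

    # Sort by the stage duration D_i1, i.e. by min(m_i, r_i).
    annotated.sort(key=lambda item: item[0][0])

    schedule = [-1] * n
    head, tail = 0, n - 1
    for (_, stage_type), i in annotated:
        if stage_type == "m":
            # Map-dominated minimum: place at the current head, then advance.
            schedule[head] = i
            head += 1
        else:
            # Reduce-dominated minimum: place at the current tail, then move
            # the tail one position toward the head.
            schedule[tail] = i
            tail -= 1
    return schedule


# Hypothetical workload of five jobs given as (map, reduce) durations.
order = johnson_schedule([(4, 5), (1, 4), (30, 4), (6, 30), (2, 3)])
```

Jobs with a short map stage are pulled toward the front of the schedule and jobs with a short reduce stage toward the back, which keeps the reduce stage busy early and leaves only a short reduce stage to finish at the end.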
Techniques or mechanisms according to some implementations allow platform configurations and schedules to be selected for respective workloads that improve performance of the workloads and reduce costs.
Machine-readable instructions of various modules described above (including the platform configuration selector 210, scheduler 212, simulator 214, and job trace summary module 216 of
Data and instructions are stored in respective storage devices, which are implemented as respective non-transitory computer-readable or machine-readable storage media. The storage media can include different forms of memory including semiconductor memory devices such as dynamic or static random access memories (DRAMs or SRAMs), erasable and programmable read-only memories (EPROMs), electrically erasable and programmable read-only memories (EEPROMs) and flash memories; magnetic disks such as fixed, floppy and removable disks; other magnetic media including tape; optical media such as compact disks (CDs) or digital video disks (DVDs); or other types of storage devices. Note that the instructions discussed above can be provided on one computer-readable or machine-readable storage medium, or alternatively, can be provided on multiple computer-readable or machine-readable storage media distributed in a large system having possibly plural nodes. Such computer-readable or machine-readable storage medium or media is (are) considered to be part of an article (or article of manufacture). An article or article of manufacture can refer to any manufactured single component or multiple components. The storage medium or media can be located either in the machine running the machine-readable instructions, or located at a remote site from which machine-readable instructions can be downloaded over a network for execution.
In the foregoing description, numerous details are set forth to provide an understanding of the subject disclosed herein. However, implementations may be practiced without some of these details. Other implementations may include modifications and variations from the details discussed above. It is intended that the appended claims cover such modifications and variations.
Filing Document | Filing Date | Country | Kind |
---|---|---|---|
PCT/US2014/035101 | 4/23/2014 | WO | 00 |