Computing services can be provided by a network of resources, which can include processing resources and storage resources. The network of resources can be accessed by various requestors. In an environment that can have a relatively large number of requestors, there can be competition for the resources.
Some embodiments are described with respect to the following figures:
To process data sets in a network environment that includes computing and storage resources, a MapReduce framework can be used, where the MapReduce framework provides a distributed arrangement of machines to process requests performed with respect to the data sets. A MapReduce framework is able to process unstructured data, which refers to data that is not formatted according to the format of a relational database management system. An example open-source implementation of the MapReduce framework is Hadoop.
Generally, a MapReduce framework includes a master node and multiple slave nodes (also referred to as worker nodes). A MapReduce job submitted to the master node is divided into multiple map tasks and multiple reduce tasks, which can be executed in parallel by the slave nodes. The map tasks are defined by a map function, while the reduce tasks are defined by a reduce function. Each of the map and reduce functions can be user-defined functions that are programmable to perform target functionalities. A MapReduce job thus has a map stage (that includes map tasks) and a reduce stage (that includes reduce tasks).
MapReduce jobs can be submitted to the master node by various requestors. In a relatively large network environment, there can be a relatively large number of requestors that are contending for resources of the network environment. Examples of network environments include cloud environments, enterprise environments, and so forth. A cloud environment provides resources that are accessible by requestors over a cloud (a collection of one or multiple networks, such as public networks). An enterprise environment provides resources that are accessible by requestors within an enterprise, such as a business concern, an educational organization, a government agency, and so forth.
Although reference is made to a MapReduce framework or system in some examples, it is noted that techniques or mechanisms according to some implementations can be applied in other distributed processing frameworks that employ map tasks and reduce tasks. More generally, “map tasks” are used to process input data to output intermediate results, based on a predefined map function that defines the processing to be performed by the map tasks. “Reduce tasks” take as input partitions of the intermediate results to produce outputs, based on a predefined reduce function that defines the processing to be performed by the reduce tasks. The map tasks are considered to be part of a map stage, whereas the reduce tasks are considered to be part of a reduce stage. In addition, although reference is made to unstructured data in some examples, techniques or mechanisms according to some implementations can also be applied to structured data formatted for relational database management systems.
Map tasks are run in map slots of slave nodes, while reduce tasks are run in reduce slots of slave nodes. The map slots and reduce slots are considered the resources used for performing map and reduce tasks. A “slot” can refer to a time slot or alternatively, to some other share of a processing resource or storage resource that can be used for performing the respective map or reduce task.
More specifically, in some examples, the map tasks process input key-value pairs to generate a set of intermediate key-value pairs. The reduce tasks (based on the reduce function) produce an output from the intermediate results. For example, the reduce tasks merge the intermediate values associated with the same intermediate key.
The map function takes input key-value pairs (k1, v1) and produces a list of intermediate key-value pairs (k2, v2). The intermediate values associated with the same key k2 are grouped together and then passed to the reduce function. The reduce function takes an intermediate key k2 with a list of values and processes them to form a new list of values (v3), as expressed below.
map(k1,v1)→list(k2,v2)
reduce(k2,list(v2))→list(v3).
The reduce function merges or aggregates the values associated with the same key k2. The multiple map tasks and multiple reduce tasks (of multiple jobs) are designed to be executed in parallel across resources of a distributed computing platform.
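The map and reduce function signatures above, map(k1,v1)→list(k2,v2) and reduce(k2,list(v2))→list(v3), can be illustrated with a small, self-contained word-count sketch. The function names (map_fn, reduce_fn, run) and the in-memory shuffle below are illustrative stand-ins, not part of any MapReduce framework API; a real framework would run the map and reduce calls in parallel on distributed nodes.

```python
# Illustrative sketch of the map/reduce signatures using word counting.
# The names map_fn, reduce_fn, and run are invented for this example.

from itertools import groupby
from operator import itemgetter

def map_fn(k1, v1):
    # k1: document name (unused here), v1: document text.
    # Emit an intermediate (word, 1) pair per occurrence: (k1,v1) -> list(k2,v2).
    return [(word, 1) for word in v1.split()]

def reduce_fn(k2, values):
    # k2: a word, values: all counts emitted for that word.
    # Merge the intermediate values associated with the same key k2.
    return [sum(values)]

def run(documents):
    # Map stage: apply map_fn to every input key-value pair.
    intermediate = []
    for k1, v1 in documents.items():
        intermediate.extend(map_fn(k1, v1))
    # Shuffle/sort: group intermediate values by key k2.
    intermediate.sort(key=itemgetter(0))
    # Reduce stage: apply reduce_fn once per intermediate key.
    return {k2: reduce_fn(k2, [v for _, v in group])
            for k2, group in groupby(intermediate, key=itemgetter(0))}

print(run({"d1": "a b a", "d2": "b c"}))  # {'a': [2], 'b': [2], 'c': [1]}
```

The in-memory sort-then-group step plays the role of the shuffle/sort phases described later: it guarantees that all values for one key reach a single reduce call.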
In a relatively complex or large system, it can be relatively difficult to efficiently allocate resources to jobs and to schedule the tasks of the jobs for execution using the allocated resources.
In a network environment that provides services accessible by requestors, it may be desirable to support a performance-driven resource allocation of network resources shared across multiple requestors running data-intensive programs. A program to be run in a MapReduce system may have a performance goal, such as a completion time goal, cost goal, or other goal, by which results of the program are to be provided to satisfy a service level objective (SLO) of the program.
In some examples, the programs to be executed in a MapReduce system can include Pig programs. Pig provides a high-level platform for creating MapReduce programs. In some examples, the language for the Pig platform is referred to as Pig Latin, where Pig Latin provides a declarative language to allow for a programmer to write programs using a high-level programming language. Pig Latin combines the high-level declarative style of SQL (Structured Query Language) and the low-level procedural programming of MapReduce. The declarative language can be used for defining data analysis tasks. By allowing programmers to use a declarative programming language to define data analysis tasks, the programmer does not have to be concerned with defining map functions and reduce functions to perform the data analysis tasks, which can be relatively complex and time-consuming.
Although reference is made to Pig programs, it is noted that in other examples, programs according to other declarative languages can be used to define data analysis tasks to be performed in a MapReduce system.
In accordance with some implementations, mechanisms or techniques are provided to specify efficient allocations of resources in a MapReduce system to jobs of a program, such as a Pig program or other program written in a declarative language. In the ensuing discussion, reference is made to Pig programs—however, techniques or mechanisms according to some implementations can be applied to programs according to other declarative languages.
Given a Pig program with a given performance goal, such as a completion time goal, cost goal, or other goal, techniques or mechanisms according to some implementations are able to estimate an amount of resources (a number of map slots and a number of reduce slots) to assign for completing the Pig program according to the given performance goal. The allocated number of map slots and number of reduce slots can then be used by the jobs of the Pig program for the duration of the execution of the Pig program.
To perform the resource allocation, a performance model can be developed to allow for the estimation of a performance parameter, such as a completion time or other parameter, of a Pig program as a function of allocated resources (allocated number of map slots and allocated number of reduce slots).
At least a subset of the jobs of the Pig program can execute concurrently. The performance model that can be developed according to some implementations takes into account overlap of the concurrent jobs. For example, given a pair of concurrent jobs, the reduce stage of a first concurrent job can overlap with the map stage of a second concurrent job—in other words, at least a portion of the reduce stage of the first concurrent job can run at the same time as at least a portion of the map stage of a second concurrent job. By taking into account overlap in execution of concurrent jobs, the performance model can provide a more accurate estimate of the performance parameter noted above, such as completion time or other parameter.
By considering overlap of execution of concurrent jobs, the performance parameter that is estimated can allow for more optimal resource allocation. For example, where the performance parameter is a completion time of a Pig program, the consideration of overlap of concurrent jobs in the performance model can allow for a smaller completion time to be estimated, as compared to an example where the jobs of the Pig program are assumed to be sequential jobs, where one job executes only after completion of another job (which can lead to a worst-case estimate of the completion time).
To further enhance resource allocation, a more optimal schedule of concurrent jobs of the Pig program can be developed. This more optimal schedule of concurrent jobs of the Pig program attempts to specify an order of the concurrent jobs that results in a reduction of the overall completion time of the concurrent jobs.
More generally, techniques or mechanisms according to some implementations are able to perform the following:
The storage modules 102 can be implemented with storage devices such as disk-based storage devices or integrated circuit or semiconductor storage devices. In some examples, the storage modules 102 correspond to respective different physical storage devices. In other examples, plural ones of the storage modules 102 can be implemented on one physical storage device, where the plural storage modules correspond to different logical partitions of the storage device.
The system of
A “node” refers generally to processing infrastructure to perform computing operations. A node can refer to a computer, or a system having multiple computers. Alternatively, a node can refer to a CPU within a computer. As yet another example, a node can refer to a processing core within a CPU that has multiple processing cores. More generally, the system can be considered to have multiple processors, where each processor can be a computer, a system having multiple computers, a CPU, a core of a CPU, or some other physical processing partition.
In accordance with some implementations, a scheduler 108 in the master node 110 is configured to perform scheduling of jobs on the slave nodes 112. The slave nodes 112 are considered the working nodes within the cluster that makes up the distributed processing environment.
Each slave node 112 has a corresponding number of map slots and reduce slots, where map tasks are run in respective map slots, and reduce tasks are run in respective reduce slots. The number of map slots and reduce slots within each slave node 112 can be preconfigured, such as by an administrator or by some other mechanism. The available map slots and reduce slots can be allocated to the jobs.
The slave nodes 112 can periodically (or repeatedly) send messages to the master node 110 to report the number of free slots and the progress of the tasks that are currently running in the corresponding slave nodes.
Each map task processes a logical segment of the input data that generally resides on a distributed file system, such as the distributed file system 104 shown in
The reduce stage (that includes the reduce tasks) has three phases: shuffle phase, sort phase, and reduce phase. In the shuffle phase, the reduce tasks fetch the intermediate data from the map tasks. In the sort phase, the intermediate data from the map tasks are sorted. An external merge sort is used in case the intermediate data does not fit in memory. Finally, in the reduce phase, the sorted intermediate data (in the form of a key and all its corresponding values, for example) is passed to the reduce function. The output from the reduce function is usually written back to the distributed file system 104.
As further shown in
The master node 110 of
The master node 110 also includes a resource allocator 116 that is able to allocate resources, such as numbers of map slots and reduce slots, to jobs of the Pig program 132, given a performance goal (e.g. target completion time) associated with the Pig program 132. The resource allocator 116 receives as input job profiles of the jobs in the collection 134. The resource allocator 116 also uses a performance model 140 that calculates a performance parameter (e.g. time duration of a job) based on the characteristics of a job profile, a number of map tasks of the job, a number of reduce tasks of the job, and an allocation of resources (e.g. number of map slots and number of reduce slots).
Using the performance parameter calculated by the performance model 140, the resource allocator 116 is able to determine feasible allocations of resources to assign to the jobs of the Pig program 132 to meet the performance goal associated with the Pig program 132. As noted above, in some implementations, the performance goal is expressed as a target completion time, which can be a target deadline or a target time duration, by or within which the job is to be completed. In such implementations, the performance parameter that is calculated by the performance model 140 is a time duration value corresponding to the amount of time the jobs would take assuming a given allocation of resources. The resource allocator 116 is able to determine whether any particular allocation of resources can meet the performance goal associated with the Pig program 132 by comparing a value of the performance parameter calculated by the performance model to the performance goal.
The numbers of map slots and numbers of reduce slots allocated to respective jobs can be provided by the resource allocator 116 to the scheduler 108. The scheduler 108 is able to listen for events such as job submissions and heartbeats from the slave nodes 112 (indicating availability of map and/or reduce slots, and/or other events). The scheduling functionality of the scheduler 108 can be performed in response to detected events.
In some implementations, the collection 134 of jobs produced by the compiler 130 from the Pig program 132 can be a directed acyclic graph (DAG) of jobs. A DAG is a directed graph that is formed by a collection of vertices and directed edges, where each edge connects one vertex to another vertex. The DAG of jobs specifies an ordered arrangement, in which some jobs are to be performed earlier than other jobs, while certain jobs can be performed in parallel with certain other jobs.
To execute the plan represented by the DAG 200 of
For example, the DAG 200 shown in
first job stage: {J1,J2};
second job stage: {J3,J4};
third job stage: {J5};
fourth job stage: {J6}.
In a given job stage that has multiple jobs, those multiple jobs can be considered concurrent jobs since they can be executed concurrently within the given job stage (before processing proceeds to the next job stage).
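The partitioning of a DAG into job stages described above can be sketched as follows: each stage collects the jobs whose predecessors have all completed in earlier stages, so jobs within one stage are concurrent. The edge set below is an assumed example wired to reproduce the stages {J1,J2}, {J3,J4}, {J5}, {J6}; the actual dependencies of a compiled Pig program would come from its compiler.

```python
# Sketch: group a DAG of jobs into stages of concurrent jobs.
# The jobs and edges below are illustrative, not from a real Pig plan.

def job_stages(jobs, edges):
    # edges: list of (u, v) pairs meaning job u must finish before job v.
    preds = {j: set() for j in jobs}
    for u, v in edges:
        preds[v].add(u)
    stages, done = [], set()
    while len(done) < len(jobs):
        # All not-yet-scheduled jobs whose predecessors are all done.
        ready = [j for j in jobs if j not in done and preds[j] <= done]
        if not ready:
            raise ValueError("graph has a cycle; a DAG is required")
        stages.append(ready)
        done.update(ready)
    return stages

dag = ["J1", "J2", "J3", "J4", "J5", "J6"]
edges = [("J1", "J3"), ("J2", "J4"), ("J3", "J5"), ("J4", "J5"), ("J5", "J6")]
print(job_stages(dag, edges))
# [['J1', 'J2'], ['J3', 'J4'], ['J5'], ['J6']]
```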
In other examples, instead of representing a collection of jobs as a DAG, the collection of jobs can be represented using another type of data structure that provides a representation of an ordered arrangement of jobs that make up a program.
The process calculates (at 304) a performance parameter using a performance model (e.g. 140 in
The process then determines (at 306), based on the value of the performance parameter calculated by the performance model, a particular allocation of resources to assign to the jobs of the program to meet a performance goal of the program. Task 306 can be performed by the resource allocator 116.
Given the allocation of resources to assign to the jobs of the program, the scheduler 108 of
Further details of the performance model (e.g. 140 of
In the performance model, a set of n tasks processed by k slots completes in at least n·avg/k and at most ((n−1)·avg)/k+max, where avg and max denote the average and maximum task durations.
The difference between the lower and upper bounds represents the range of possible completion times due to task scheduling non-determinism (based on whether the maximum duration task is scheduled to run last). Note that these lower and upper bounds on the completion time can be computed if the average and maximum durations of the set of tasks and the number of allocated slots are known.
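The bounds just described can be sketched directly: n tasks with average duration avg and maximum duration mx on k slots take at least n·avg/k (perfect packing) and at most (n−1)·avg/k+mx (the longest task scheduled last). The numeric profile in the usage line is invented for illustration.

```python
# Sketch of the lower/upper completion-time bounds for a set of tasks.

def completion_bounds(n, avg, mx, k):
    # n tasks, average duration avg, maximum duration mx, k slots.
    t_low = n * avg / k              # best case: perfectly balanced load
    t_up = (n - 1) * avg / k + mx    # worst case: longest task runs last
    return t_low, t_up

# Example (invented numbers): 20 tasks, avg 10s, max 30s, 5 slots.
low, up = completion_bounds(20, 10.0, 30.0, 5)
print(low, up)  # 40.0 68.0
```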
To approximate the overall completion time of a job J, the average and maximum task durations during different execution phases of the job are estimated. The phases include the map, shuffle/sort, and reduce phases. Measurements such as M_avg^J and M_max^J (respectively, R_avg^J and R_max^J) of the average and maximum map (respectively, reduce) task durations for a job J can be obtained from execution logs (logs containing execution times of previously executed jobs). By applying the outlined bounds model, the completion times of the different processing phases (map, shuffle/sort, and reduce) of the job are estimated.
For example, let job J be partitioned into N_M^J map tasks. Then the lower and upper bounds on the duration of the map stage in a future execution with S_M^J map slots (denoted T_M^low and T_M^up, respectively) are estimated as follows:

T_M^low = N_M^J·M_avg^J/S_M^J,

T_M^up = (N_M^J−1)·M_avg^J/S_M^J + M_max^J.
Similarly, bounds on the execution time of the other processing phases (shuffle/sort and reduce phases) of the job can be computed. As a result, the estimates for the entire job completion time (lower bound T_J^low and upper bound T_J^up) can be expressed as a function of the allocated map and reduce slots (S_M^J, S_R^J) using the following equation:
The equation for T_J^up can be written in a similar form. The average, T_J^avg, of the lower and upper bounds T_J^low and T_J^up can provide an approximation of the job completion time.
Once a technique for predicting the job completion time is provided (using the performance model discussed above to compute an upper bound, a lower bound, or an intermediate value of the completion time), it can also be used for solving the inverse problem: finding the appropriate number of map and reduce slots that can support a given job deadline D. For example, by setting the left side of Eq. 3 to the deadline D, Eq. 4 is obtained with two variables, S_M^J and S_R^J:
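The inverse problem can be sketched with a brute-force sweep rather than the closed-form equation: for each candidate map-slot count, find the smallest reduce-slot count under which the estimated completion time still fits the deadline. The job model here is deliberately simplified (completion time approximated as the map-stage upper bound plus the reduce-stage upper bound, with shuffle/sort folded into the reduce figures), and the profile numbers are invented for illustration.

```python
# Sketch: enumerate feasible (map slots, reduce slots) pairs for a
# deadline D, using a simplified upper-bound completion-time model.

def stage_upper(n, avg, mx, k):
    # Upper bound on a stage of n tasks with k slots.
    return (n - 1) * avg / k + mx

def feasible_allocations(profile, deadline, max_slots=50):
    nm, m_avg, m_max = profile["map"]
    nr, r_avg, r_max = profile["reduce"]
    pairs = []
    for sm in range(1, max_slots + 1):
        # Smallest reduce-slot count meeting the deadline at this sm, if any.
        for sr in range(1, max_slots + 1):
            t = stage_upper(nm, m_avg, m_max, sm) + stage_upper(nr, r_avg, r_max, sr)
            if t <= deadline:
                pairs.append((sm, sr))
                break
    return pairs

# Invented profile: 60 map tasks (avg 10s, max 20s),
# 12 reduce tasks (avg 30s, max 50s); deadline 200s.
profile = {"map": (60, 10.0, 20.0), "reduce": (12, 30.0, 50.0)}
allocs = feasible_allocations(profile, deadline=200.0)
print(allocs[0])  # (5, 28): smallest map-slot allocation meeting D
```

Each pair along this frontier trades map slots against reduce slots while still meeting the deadline, which is exactly the shape of the trade-off curve discussed later.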
The foregoing describes a performance model for a single job. Note that a Pig program can have multiple jobs, some of which can execute concurrently. A job can be represented as a composition of non-overlapping map stage and reduce stage. There is effectively a barrier between a map stage and reduce stage of a job, in that any reduce task (corresponding to the reduce function) can start its execution only after all map tasks of the map stage have completed.
The following illustrates the difference between a performance model that assumes sequential execution of jobs as compared to an execution of jobs where overlap is allowed.
If jobs J1 and J2 are assumed to be concurrent jobs, then there would be some overlap of jobs J1 and J2, as depicted in
As can be seen from
Given a subset of concurrent jobs of a Pig program, some techniques or mechanisms can select a random order of the concurrent jobs of the subset. This random order refers to an order of the jobs in the subset where one of the jobs is randomly selected to begin first, followed by another randomly selected job, followed by another randomly selected job, and so forth. In some cases, random ordering of concurrent jobs may lead to inefficient resource usage and increased execution time. An example of such a scenario is shown in
In the example of
In
In accordance with some implementations, instead of using random ordering of concurrent jobs of a subset, an optimal schedule of concurrent jobs of the subset can be derived, and this optimal schedule of concurrent jobs is used by the performance model. In alternative implementations, rather than deriving an optimal schedule of concurrent jobs, an “improved” schedule of concurrent jobs can be derived, where an improved schedule of concurrent jobs refers to an order of concurrent jobs that has a smaller execution time (or improved performance parameter value) as compared to another order of concurrent jobs. A performance model based on an optimal or improved schedule of concurrent jobs can lead to computation of a smaller completion time, and thus more efficient allocation of resources.
In some implementations, the determination of the optimal or improved schedule can be accomplished using a brute-force technique, where multiple orders of jobs are considered and the order with the best or better execution time (smallest or smaller execution time) can be selected as the optimal or improved schedule.
In other implementations, another technique for identifying an optimal or improved schedule of concurrent jobs is to use the Johnson algorithm, such as described in S. Johnson, “Optimal Two- and Three-stage Production Schedules with Setup Times Included,” dated May 1953. The Johnson algorithm provides a decision rule to determine an optimal scheduling of tasks that are processed in two stages.
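Johnson's rule for a two-stage flow shop can be sketched as follows, with concurrent jobs viewed as (map-stage duration, reduce-stage duration) pairs: jobs whose map stage is shorter than their reduce stage run first, in increasing map time; the remaining jobs run last, in decreasing reduce time. The job names and durations below are invented, and the makespan function uses the simple two-stage pipeline model (a job's reduce stage waits for its own map stage and for the previous job's reduce stage).

```python
# Sketch of Johnson's two-stage scheduling rule applied to concurrent
# MapReduce jobs. Job names and durations are illustrative.

def johnson_order(jobs):
    # jobs: dict name -> (map_time, reduce_time)
    head = sorted((n for n, (m, r) in jobs.items() if m < r),
                  key=lambda n: jobs[n][0])           # ascending map time
    tail = sorted((n for n, (m, r) in jobs.items() if m >= r),
                  key=lambda n: jobs[n][1], reverse=True)  # descending reduce time
    return head + tail

def makespan(order, jobs):
    # Two-stage pipeline: a job's reduce stage starts after both its own
    # map stage and the previous job's reduce stage complete.
    end_m = end_r = 0.0
    for n in order:
        m, r = jobs[n]
        end_m += m
        end_r = max(end_r, end_m) + r
    return end_r

jobs = {"J1": (4.0, 5.0), "J2": (4.0, 1.0), "J3": (30.0, 4.0),
        "J4": (6.0, 30.0), "J5": (2.0, 3.0)}
order = johnson_order(jobs)
print(order, makespan(order, jobs))  # ['J5', 'J1', 'J4', 'J3', 'J2'] 47.0
```

For a two-stage pipeline, Johnson's rule yields a schedule with minimal makespan, so no brute-force enumeration of orders is needed.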
In other implementations, other techniques for determining an optimal or improved schedule of concurrent jobs can be employed.
Using the performance model of a single job as a building block, as described above, a performance model for the jobs of a Pig program P (which can be compiled into a collection of |P| jobs, P = {J1, J2, …, J|P|}) can be derived, as discussed below.
For each job J_i (1≤i≤|P|) that constitutes a program P, in addition to the number of map tasks (N_M^{J_i}) and reduce tasks (N_R^{J_i}), the job profile includes the average and maximum map task durations (M_avg^{J_i}, M_max^{J_i}) and the average and maximum reduce task durations (R_avg^{J_i}, R_max^{J_i}).
The foregoing characteristics can be considered to be part of profiles for corresponding jobs. The profiles of jobs of a Pig program can be extracted (such as by the job profiler 120 of
As noted above, the jobs of a Pig program can be compiled into a DAG of jobs that includes S job stages (such as according to an example shown in
Eq. 5 specifies that the overall execution time of the Pig program P is equal to the sum of the execution times of the individual job stages Si, for i=1 to S. For a job stage Si that has a single job J, the stage completion time is defined by the job J's completion time.
For a job stage S_i that has concurrent jobs, the stage completion time, T_{S_i}, depends on the order (schedule) in which the concurrent jobs of the stage are executed.
For each job stage S_i with concurrent jobs, the optimal job schedule that minimizes the completion time of the stage is determined, such as by use of Johnson's algorithm or another technique. Next, a performance model for predicting the Pig program P's completion time T_P as a function of allocated resources (S_M^P, S_R^P) can be derived, as discussed in further detail below. The following notations can be used:
timeStart_{J_i}^M and timeEnd_{J_i}^M denote the start and end times of job J_i's map stage;

timeStart_{J_i}^R and timeEnd_{J_i}^R denote the start and end times of job J_i's reduce stage.
Then the stage completion time of a particular stage S_i can be estimated as the end time of the reduce stage of the last job in the stage's schedule, i.e. T_{S_i} = timeEnd_{J_n}^R, where J_n is the last of the stage's jobs to execute.
The following explains how to estimate the start time and end time of each job's map stage and reduce stage.
Let T_{J_i}^M and T_{J_i}^R denote the durations of the map stage and reduce stage of job J_i, respectively. Then:

timeEnd_{J_i}^M = timeStart_{J_i}^M + T_{J_i}^M,

timeEnd_{J_i}^R = timeStart_{J_i}^R + T_{J_i}^R.

Note that the map stage of the next job J_{i+1} can start as soon as the map stage of job J_i completes, since the two map stages share the same map slots:

timeStart_{J_{i+1}}^M = timeEnd_{J_i}^M.

The start time timeStart_{J_{i+1}}^R of the reduce stage of the next job J_{i+1} is defined by the later of the following two events:

1. timeStart_{J_{i+1}}^R ≥ timeEnd_{J_{i+1}}^M (the reduce stage of J_{i+1} can start only after its own map stage completes); and

2. timeStart_{J_{i+1}}^R ≥ timeEnd_{J_i}^R (the reduce stage of J_{i+1} can start only after the reduce stage of the previous job J_i completes and frees the reduce slots).

Therefore, the following equation is derived:

timeStart_{J_{i+1}}^R = max(timeEnd_{J_{i+1}}^M, timeEnd_{J_i}^R).
Finally, the completion time of the entire Pig program P is defined as the sum of the completion times of the job stages making up the program, according to Eq. 5.
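The stage-level recurrences and the program-level summation described above can be sketched together: within one stage, map stages of the ordered concurrent jobs run back-to-back, each reduce stage waits for both its own map stage and the previous job's reduce stage, and the program completion time is the sum over stages. The durations below are invented for illustration.

```python
# Sketch of the overlap-aware stage completion recurrence and the
# program completion time as a sum over job stages. Durations invented.

def stage_completion(ordered_jobs):
    # ordered_jobs: list of (map_duration, reduce_duration) pairs,
    # in the stage's execution order.
    end_m = end_r = 0.0
    for m, r in ordered_jobs:
        start_m = end_m                 # next map starts when previous map ends
        end_m = start_m + m
        start_r = max(end_m, end_r)     # wait for own map AND previous reduce
        end_r = start_r + r
    return end_r                        # end of the last job's reduce stage

def program_completion(stages):
    # Program completion time: sum of the stage completion times.
    return sum(stage_completion(s) for s in stages)

# Two concurrent jobs J1=(10,20) and J2=(15,5) in one stage:
print(stage_completion([(10.0, 20.0), (15.0, 5.0)]))  # 35.0
# Sequential (non-overlapped) execution would take (10+20)+(15+5) = 50.
print(program_completion([[(10.0, 20.0), (15.0, 5.0)], [(8.0, 2.0)]]))  # 45.0
```

The gap between 35.0 and the sequential 50.0 is exactly the saving the overlap-aware model captures, which is why it can justify a smaller resource allocation for the same deadline.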
Given the performance model for the jobs of a Pig program P, as discussed above, the challenge is then to compute an allocation of resources (e.g. map slots and reduce slots), given that the Pig program P has a deadline D. The optimized execution of concurrent jobs in P may improve the program completion time. Therefore, P can be assigned a smaller amount of resources for meeting the deadline D compared to its non-optimized execution (where jobs are assumed to execute sequentially).
The following describes how to approximate the resource allocation of a non-optimized execution of a Pig program (which assumes sequential execution of the jobs in the various job stages of the program). The completion time of a non-optimized execution of the program P can be represented as a sum of the completion times of the jobs that make up the DAG of the program. Thus, for a Pig program P that contains |P| jobs, its completion time can be estimated as a function of the assigned map and reduce slots (S_M^P, S_R^P) as follows:
Using the performance model based on Eq. 11, the completion time D of the Pig program P can be expressed using Eq. 12 below, which is similar to Eq. 3:
Eq. 12 can be used for solving the inverse problem of finding resource allocations (SMP, SRP) such that the program P completes within time D. As can be seen in
These different feasible resource allocations (represented by points along the curve 702) correspond to different amounts of resources that allow the deadline D to be satisfied. Finding an optimal allocation of resources along the curve 702 can be accomplished by using a Lagrange multiplier technique, as described further in U.S. patent application Ser. No. 13/442,358, entitled "DETERMINING AN ALLOCATION OF RESOURCES TO ASSIGN TO JOBS OF A PROGRAM," filed Apr. 9, 2012. The Lagrange multiplier technique can identify the point, A(M,R), on the curve 702, where A(M,R) represents the point with a minimal number of map and reduce slots (i.e. the pair (M,R) that results in the minimal sum of map and reduce slots).
However, the performance model based on Eq. 10 (discussed above) that can be used for more accurate completion time estimates for optimized Pig program execution (where overlap of concurrent jobs is allowed) is more complex. As seen in Eq. 10, a max (maximum) function is computed for job stages with concurrent jobs. However, in accordance with some implementations, determining an optimal allocation of resources given a performance model based on Eq. 10 can use the “over-provisioned” resource allocation defined by Eq. 12 as an initial point for determining the solution for an optimized execution of the Pig program P.
Techniques or mechanisms according to some implementations can use the curve 702 of
In some examples, the following pseudocode can be used to solve for (Mmin,Rmin):
The following discusses the tasks performed by the pseudocode set forth above. First, the pseudocode finds the minimal number of map slots M′ (i.e. the pair (M′, R) at point 704 in
In the second step, the pseudocode applies a similar process for finding the minimal number of reduce slots R′ (i.e. the pair (M, R′) of point 706 in
In the third step, the pseudocode determines the intermediate values on a curve 708 between (M′,R) and (M,R′) (points B and C, respectively), such that the deadline D is met by the optimized Pig program P (using the performance model that considers overlap of concurrent jobs). Starting from point (M′,R), the pseudocode tries each allocation of map slots from M′ to M, and for each finds the minimal number of reduce slots R̂ that should be assigned to P for meeting its deadline (lines 10-12 of the pseudocode).
Next, the solution (Mmin,Rmin) (point 710 in
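Since the pseudocode itself is not reproduced here, the three-step search it describes can be sketched as follows. The completion-time function T(sm, sr) below is a stand-in toy model (assumed monotonically non-increasing in each argument); in the described system it would be the overlap-aware Pig performance model, and (m0, r0) would be the over-provisioned starting allocation obtained from the sequential model.

```python
# Sketch of the three-step search for a minimal (map slots, reduce
# slots) allocation meeting deadline D. T is a stand-in completion-time
# model, assumed monotone non-increasing in each argument.

def find_min_allocation(T, deadline, m0, r0):
    # (m0, r0): over-provisioned start point, assumed to satisfy T <= D.
    # Step 1: minimal map slots M' with reduce slots fixed at r0.
    m_prime = m0
    while m_prime > 1 and T(m_prime - 1, r0) <= deadline:
        m_prime -= 1
    # Step 2: minimal reduce slots R' with map slots fixed at m0.
    r_prime = r0
    while r_prime > 1 and T(m0, r_prime - 1) <= deadline:
        r_prime -= 1
    # Step 3: sweep map slots between M' and m0; for each, find the
    # minimal reduce slots meeting D, keeping the pair with the
    # smallest total number of slots.
    best = (m0, r_prime)
    for sm in range(m_prime, m0 + 1):
        sr = r0
        while sr > 1 and T(sm, sr - 1) <= deadline:
            sr -= 1
        if sm + sr < sum(best):
            best = (sm, sr)
    return best

# Toy model: T = 600/sm + 300/sr, deadline 50, starting at (40, 40).
T = lambda sm, sr: 600.0 / sm + 300.0 / sr
print(find_min_allocation(T, deadline=50.0, m0=40, r0=40))  # (20, 15)
```

With the toy model, the sweep settles on (20, 15): 600/20 + 300/15 = 50, and no feasible pair has a smaller slot total.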
Although a specific pseudocode is depicted above, it is noted that in alternative examples, other techniques or mechanisms can be used to find a resource allocation for a program, such as a Pig program, that meets a given deadline of the program, where a performance model is used that considers overlap of concurrent jobs.
Various techniques discussed above, such as techniques depicted in
Data and instructions are stored in respective storage devices, which are implemented as one or more computer-readable or machine-readable storage media. The storage media include different forms of memory including semiconductor memory devices such as dynamic or static random access memories (DRAMs or SRAMs), erasable and programmable read-only memories (EPROMs), electrically erasable and programmable read-only memories (EEPROMs) and flash memories; magnetic disks such as fixed, floppy and removable disks; other magnetic media including tape; optical media such as compact disks (CDs) or digital video disks (DVDs); or other types of storage devices. Note that the instructions discussed above can be provided on one computer-readable or machine-readable storage medium, or alternatively, can be provided on multiple computer-readable or machine-readable storage media distributed in a large system having possibly plural nodes. Such computer-readable or machine-readable storage medium or media is (are) considered to be part of an article (or article of manufacture). An article or article of manufacture can refer to any manufactured single component or multiple components. The storage medium or media can be located either in the machine running the machine-readable instructions, or located at a remote site from which machine-readable instructions can be downloaded over a network for execution.
In the foregoing description, numerous details are set forth to provide an understanding of the subject disclosed herein. However, implementations may be practiced without some or all of these details. Other implementations may include modifications and variations from the details discussed above. It is intended that the appended claims cover such modifications and variations.