Many enterprises (such as companies, educational organizations, and government agencies) employ relatively large volumes of data that are often subject to analysis. A substantial amount of the data of an enterprise can be unstructured data, which is data that is not in the format used in typical commercial databases. Existing infrastructure may not be able to efficiently handle the processing of relatively large volumes of unstructured data.
Some embodiments are described with respect to the following figures:
For processing relatively large volumes of unstructured data, a MapReduce framework, which provides a distributed computing platform, can be employed. Unstructured data refers to data not formatted according to a format of a relational database management system. An open-source implementation of the MapReduce framework is Hadoop. The MapReduce framework is increasingly being used across an enterprise for distributed, advanced data analytics and to provide new applications associated with data retention, regulatory compliance, e-discovery, litigation, or other issues. Diverse applications can be run over the same data sets to efficiently utilize the resources of large distributed systems.
Generally, the MapReduce framework includes a master node and multiple slave nodes. A MapReduce job submitted to the master node is divided into multiple map tasks and multiple reduce tasks, which are executed in parallel by the slave nodes. The map tasks are defined by a map function, while the reduce tasks are defined by a reduce function. Each of the map and reduce functions is a user-defined function that is programmable to perform target functionalities.
The map function processes corresponding segments of input data to produce intermediate results, where each of the multiple map tasks (that are based on the map function) processes a corresponding segment of the input data. For example, the map tasks process input key-value pairs to generate a set of intermediate key-value pairs. The reduce tasks (based on the reduce function) produce an output from the intermediate results. For example, the reduce tasks merge the intermediate values associated with the same intermediate key.
More specifically, the map function takes input key-value pairs (k1, v1) and produces a list of intermediate key-value pairs (k2, v2). The intermediate values associated with the same key k2 are grouped together and then passed to the reduce function. The reduce function takes an intermediate key k2 with a list of values and processes them to form a new list of values (v3), as expressed below.
map(k1,v1)→list(k2,v2).
reduce(k2,list(v2))→list(v3)
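The two signatures above can be illustrated with a minimal, single-process sketch. It assumes a word-count job, which is not taken from this description; the names `map_fn`, `reduce_fn`, and `run_job` are hypothetical, and a real framework would run the tasks in parallel across slave nodes rather than in one loop.

```python
from collections import defaultdict


def map_fn(k1, v1):
    """map(k1, v1) -> list(k2, v2): emit (word, 1) for every word in a line."""
    return [(word.lower(), 1) for word in v1.split()]


def reduce_fn(k2, values):
    """reduce(k2, list(v2)) -> list(v3): sum the counts collected for one word."""
    return [sum(values)]


def run_job(records):
    """Run the map function, group by intermediate key, then run the reduce function."""
    grouped = defaultdict(list)
    for k1, v1 in records:
        for k2, v2 in map_fn(k1, v1):
            grouped[k2].append(v2)  # values with the same k2 are grouped together
    return {k2: reduce_fn(k2, v2_list) for k2, v2_list in sorted(grouped.items())}


# Hypothetical input: (line identifier, line text) pairs.
records = [("doc1:0", "the quick fox"), ("doc1:1", "the lazy dog")]
result = run_job(records)
```

Here the input key k1 is a line identifier and is ignored by the map function, which is common for text inputs but is a choice made only for this illustration.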
Although reference is made to the MapReduce framework in some examples, it is noted that techniques or mechanisms according to some implementations can be applied in other distributed processing frameworks. More generally, map tasks are used to process input data to output intermediate results, based on a predefined function that defines the processing to be performed by the map tasks. Reduce tasks take as input partitions of the intermediate results to produce outputs, based on a predefined function that defines the processing to be performed by the reduce tasks. The map tasks are considered to be part of a map stage, whereas the reduce tasks are considered to be part of a reduce stage. In addition, although reference is made to unstructured data in some examples, techniques or mechanisms according to some implementations can also be applied to structured data formatted for relational database management systems.
The storage modules 102 can be implemented with storage devices such as disk-based storage devices or integrated circuit storage devices. In some examples, the storage modules 102 correspond to respective different physical storage devices. In other examples, plural ones of the storage modules 102 can be implemented on one physical storage device, where the plural storage modules correspond to different partitions of the storage device.
The system of
A “node” refers generally to processing infrastructure to perform computing operations. A node can refer to a computer, or a system having multiple computers. Alternatively, a node can refer to a CPU within a computer. As yet another example, a node can refer to a processing core within a CPU that has multiple processing cores. More generally, the system can be considered to have multiple processors, where each processor can be a computer, a system having multiple computers, a CPU, a core of a CPU, or some other physical processing partition.
In accordance with some implementations, the master node 110 is configured to perform scheduling of jobs on the slave nodes 112. The slave nodes 112 are considered the working nodes within the cluster that makes up the distributed processing environment.
Each slave node 112 has a fixed number of map slots and reduce slots, where map tasks are run in respective map slots, and reduce tasks are run in respective reduce slots. The number of map slots and reduce slots within each slave node 112 can be preconfigured, such as by an administrator or by some other mechanism. The available map slots and reduce slots can be allocated to the jobs. The map slots and reduce slots are considered the resources used for performing map and reduce tasks. A “slot” can refer to a time slot or alternatively, to some other share of a processing resource that can be used for performing the respective map or reduce task. Depending upon the load of the overall system, the number of map slots and number of reduce slots that can be allocated to any given job can vary.
The slave nodes 112 can periodically (or repeatedly) send messages to the master node 110 to report the number of free slots and the progress of the tasks that are currently running in the corresponding slave nodes. Based on the availability of free slots (map slots and reduce slots) and the rules of a scheduling policy, the master node 110 assigns map and reduce tasks to respective slots in the slave nodes 112.
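The assignment step can be sketched as follows. This is only an illustration under stated assumptions: the scheduling policy is taken to be first-in, first-out, which this description does not specify, and the function name `assign_tasks`, the node names, and the task identifiers are hypothetical. The function handles one slot type at a time, so it would be called once for map slots and once for reduce slots.

```python
from collections import deque


def assign_tasks(pending, free_slots):
    """Assign pending tasks to the free slots reported by the slave nodes.

    pending: deque of task identifiers in arrival order (assumed FIFO policy).
    free_slots: mapping of node name -> number of free slots it reported.
    Returns a list of (task, node) assignments; unassigned tasks stay in `pending`.
    """
    assignments = []
    for node, free in free_slots.items():
        for _ in range(free):
            if not pending:
                return assignments
            assignments.append((pending.popleft(), node))
    return assignments


# Hypothetical report: one free map slot on slave-a, two on slave-b.
pending_map_tasks = deque(["m1", "m2", "m3", "m4"])
assignments = assign_tasks(pending_map_tasks, {"slave-a": 1, "slave-b": 2})
```

With three free slots and four pending tasks, one task remains queued until a later report shows another free slot.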
Each map task processes a logical segment of the input data that generally resides on a distributed file system, such as the distributed file system 104 shown in
The reduce stage (that includes the reduce tasks) has three phases: shuffle phase, sort phase, and reduce phase. In the shuffle phase, the reduce tasks fetch the intermediate data from the map tasks. In the sort phase, the intermediate data from the map tasks are sorted. An external merge sort is used in case the intermediate data does not fit in memory. Finally, in the reduce phase, the sorted intermediate data (in the form of a key and all its corresponding values, for example) is passed on to the reduce function. The output from the reduce function is usually written back to the distributed file system 104.
The master node 110 of
In other implementations, the job profiler 120 and/or profile database 122 can be located at another node.
The master node 110 also includes a performance characteristic estimator 116 according to some implementations. The estimator 116 is able to produce an estimated performance characteristic, such as an estimated completion time, of a job, based on the corresponding job profile and resources (e.g., numbers of map slots and reduce slots) allocated to the job. The estimated completion time refers to either a total time duration for the job, or an estimated time at which the job will complete. In other examples, other performance characteristics of a job can be estimated, such as cost of the job, error rate of the job, and so forth.
As depicted in
A “map wave” refers to an iteration of the map stage. If the number of allocated map slots is greater than or equal to the number of map tasks, then the map stage can be completed in a single iteration (single wave). However, if the number of map slots allocated to the map stage is less than the number of map tasks, then the map stage would have to be completed in multiple iterations (multiple waves). Similarly, the number of iterations (waves) of the reduce stage is based on the number of allocated reduce slots as compared to the number of reduce tasks.
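The relationship between tasks, slots, and waves can be sketched as a ceiling division. This assumes every allocated slot runs one task per wave; in practice task durations vary, so waves are not strictly synchronized. The function name `num_waves` is hypothetical.

```python
def num_waves(num_tasks, num_slots):
    """Number of iterations (waves) needed to run num_tasks on num_slots slots."""
    if num_tasks < 0 or num_slots <= 0:
        raise ValueError("num_tasks must be >= 0 and num_slots must be > 0")
    # Integer ceiling division: a partially filled last wave still counts as a wave.
    return (num_tasks + num_slots - 1) // num_slots


single = num_waves(64, 64)    # slots >= tasks: one wave
multiple = num_waves(64, 16)  # fewer slots than tasks: several waves
partial = num_waves(65, 16)   # the one leftover task needs an extra wave
```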
Thus, it can be observed from the examples of
In accordance with some implementations, mechanisms are provided to estimate a job completion time of a job as a function of allocated resources. By being able to estimate a job completion time as a function of allocated resources, the master node 110 (
Next, a performance model is produced (at 304) based on the job profile and allocated amount of resources for the job (e.g., allocated number of map slots and allocated number of reduce slots). Using the performance model, a performance characteristic of the job is estimated (at 306). For example, this estimation can be performed by the performance characteristic estimator 116 in
In some implementations, the particular job is executed in a given environment (including a system having a specific arrangement of physical machines and respective map and reduce slots in the physical machines), and the job profile and performance model are applied with respect to the particular job in this given environment.
A job profile reflects performance invariants that are independent of the amount of resources assigned to the job over time, for each of the phases of the job: map, shuffle, sort, and reduce phases.
The map stage includes a number of map tasks. To characterize the distribution of the map task durations and other invariant properties, the following metrics can be specified in some examples:
(Mmin, Mavg, Mmax, AvgSizeMinput, SelectivityM), where
The duration of the map tasks is affected by whether the input data is local to the machine running the task (local node), or on another machine on the same rack (local rack), or on a different machine of a different rack (remote rack). These different types of map tasks are tracked separately. The foregoing metrics can be used to improve the prediction accuracy of the performance model and decision making when the types of available map slots are known.
As described earlier, the reduce stage includes the shuffle, sort and reduce phases. The shuffle phase begins only after the first map task has completed. The shuffle phase (of any reduce wave) completes when the entire map stage is complete and all the intermediate data generated by the map tasks have been shuffled to the reduce tasks.
The completion of the shuffle phase is a prerequisite for the beginning of the sort phase. Similarly, the reduce phase begins only after the sort phase is complete. Thus the profiles of the shuffle, sort, and reduce phases are represented by their average and maximum time durations. In addition, for the reduce phase, the reduce selectivity, denoted as SelectivityR, is computed, which is defined as the ratio of the reduce data output size to its data input size.
The shuffle phase of the first reduce wave may be different from the shuffle phase that belongs to the subsequent reduce waves (after the first reduce wave). This can happen because the shuffle phase of the first reduce wave overlaps with the map stage and depends on the number of map waves and their durations. Therefore, two sets of measurements are collected: (Shavg1,Shmax1) for a shuffle phase of the first reduce wave (referred to as the “first shuffle phase”), and (Shavgtyp,Shmaxtyp) for the shuffle phase of the subsequent reduce waves (referred to as “typical shuffle phase”). Since techniques according to some implementations are looking for the performance invariants that are independent of the amount of allocated resources to the job, a shuffle phase of the first reduce wave is characterized in a special way and the parameters (Shavg1 and Shmax1) reflect only durations of the non-overlapping portions (non-overlapping with the map stage) of the first shuffle. In other words, the durations represented by Shavg1 and Shmax1 represent portions of the duration of the shuffle phase of the first reduce wave that do not overlap with the map stage.
Thus, the job profile in the shuffle phase is characterized by two pairs of measurements:
(Shavg1,Shmax1), (Shavgtyp,Shmaxtyp).
If the job execution has only a single reduce wave, the typical shuffle phase duration is estimated using the sort benchmark (since the shuffle phase duration is defined entirely by the size of the intermediate results output by the map stage).
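One way to hold these measurements is a small record type, sketched below. The class and field names are hypothetical, the numeric values are invented for illustration, and the helper `non_overlapping` encodes one possible reading of "non-overlapping portion": the part of a first-wave shuffle that runs after the map stage has ended.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass(frozen=True)
class PhaseProfile:
    avg_s: float  # average task duration in the phase, in seconds
    max_s: float  # maximum task duration in the phase, in seconds


@dataclass(frozen=True)
class JobProfile:
    map_min_s: float              # Mmin
    map_avg_s: float              # Mavg
    map_max_s: float              # Mmax
    avg_map_input_size: float     # AvgSizeMinput
    map_selectivity: float        # SelectivityM
    first_shuffle: PhaseProfile   # (Shavg1, Shmax1), non-overlapping portion only
    typical_shuffle: PhaseProfile # (Shavgtyp, Shmaxtyp)
    sort: PhaseProfile
    reduce: PhaseProfile
    reduce_selectivity: float     # SelectivityR


def summarize(durations):
    """Collapse measured task durations into an (average, maximum) pair."""
    return PhaseProfile(avg_s=mean(durations), max_s=max(durations))


def non_overlapping(start, end, map_stage_end):
    """Portion of a shuffle interval that falls after the map stage has ended."""
    return max(0.0, end - max(start, map_stage_end))


# Hypothetical first-wave shuffle intervals (start, end) and map stage end time.
map_stage_end = 100.0
first_wave_shuffles = [(20.0, 110.0), (30.0, 125.0), (40.0, 105.0)]
first_shuffle = summarize(
    [non_overlapping(s, e, map_stage_end) for s, e in first_wave_shuffles]
)

profile = JobProfile(
    map_min_s=12.0, map_avg_s=30.0, map_max_s=45.0,
    avg_map_input_size=64.0, map_selectivity=0.5,
    first_shuffle=first_shuffle,
    typical_shuffle=summarize([28.0, 30.0, 35.0]),
    sort=summarize([4.0, 5.0, 6.0]),
    reduce=summarize([35.0, 40.0, 60.0]),
    reduce_selectivity=0.1,
)
```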
Once the job profile is provided, then a performance model that is based on the job profile can be produced (304 in
In some implementations, the performance model is characterized by lower and upper bounds for a makespan (a completion time of the job) of a given set of n (n>1) tasks that are processed by k (k>1) servers (or by k slots in a MapReduce environment). Let T1,T2, . . . , Tn be the durations of n tasks of a given job. Let k be the number of slots that can each execute one task at a time. The assignment of tasks to slots is done using a simple, online, greedy algorithm, e.g., assign each task to the slot with the earliest finishing time.
Let $\mu = \left(\sum_{i=1}^{n} T_i\right)/n$ and $\lambda = \max_i \{T_i\}$ be the mean and maximum durations of the n tasks, respectively. The makespan of the greedy task assignment is at least n·μ/k and at most (n−1)·μ/k+λ. The lower bound is trivial, as the best case is when all n tasks are equally distributed among the k slots (or the overall amount of work is processed as fast as possible by the k slots). Thus, the overall makespan (completion time of the job) is at least n·μ/k (lower bound of the completion time).
For the upper bound of the completion time for the job, the worst case scenario is considered, i.e., the longest task $\hat{T} \in \{T_1, T_2, \ldots, T_n\}$ with duration λ is the last task processed. In this case, the time elapsed before the last task is scheduled is at most $\left(\sum_{i=1}^{n-1} T_i\right)/k \le (n-1)\cdot\mu/k$. Thus, the makespan of the overall assignment is at most (n−1)·μ/k+λ. These bounds are particularly useful when λ ≪ n·μ/k, in other words, when the duration of the longest task is small as compared to the total makespan.
The difference between lower and upper bounds (of the completion time) represents the range of possible job completion times due to non-determinism and scheduling. As discussed below, these lower and upper bounds, which are part of the properties of the performance model, are used to estimate a completion time for a corresponding job J.
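The greedy assignment and the two bounds can be checked with a short simulation. This is a sketch, not the described implementation: the function names and the task durations are hypothetical, and a heap is used only as a convenient way to find the slot with the earliest finishing time.

```python
import heapq


def greedy_makespan(durations, k):
    """Assign each task, in order, to the slot with the earliest finishing time."""
    slots = [0.0] * k  # current finishing time of each slot
    heapq.heapify(slots)
    for t in durations:
        earliest = heapq.heappop(slots)
        heapq.heappush(slots, earliest + t)
    return max(slots)


def makespan_bounds(durations, k):
    """Lower bound n*mu/k and upper bound (n-1)*mu/k + lambda on the makespan."""
    n = len(durations)
    mu = sum(durations) / n
    lam = max(durations)
    return n * mu / k, (n - 1) * mu / k + lam


# Hypothetical task durations (seconds) and slot count.
durations = [4.0, 2.0, 7.0, 3.0, 5.0, 1.0, 6.0, 2.0]
k = 3
makespan = greedy_makespan(durations, k)
low, up = makespan_bounds(durations, k)
print(f"{low:.2f} <= {makespan:.2f} <= {up:.2f}")
```

For these durations the simulated makespan falls between the two bounds; reordering the task list changes the makespan but, per the argument above, should not move it outside the bounds.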
The given job J has a given profile created by the job profiler 120 (
Let Mavg and Mmax be the average and maximum time durations of map tasks (defined by the job J profile). Then, based on the Makespan theorem, the lower and upper bounds on the duration of the entire map stage (denoted as TMlow and TMup, respectively) are estimated as follows:
$$T_M^{low} = \frac{N_M}{S_M} \cdot M_{avg},$$

$$T_M^{up} = \frac{N_M - 1}{S_M} \cdot M_{avg} + M_{max}.$$
Stated differently, the lower bound of the duration of the entire map stage is based on the product of the average duration (Mavg) of map tasks and the ratio of the number of map tasks (NM) to the number of allocated map slots (SM). The upper bound of the duration of the entire map stage is based on a sum of the maximum duration of map tasks (Mmax) and the product of Mavg with (NM−1)/SM. Thus, it can be seen that the lower and upper bounds of durations of the map stage are based on properties of the job J profile relating to the map stage, and based on the allocated number of map slots.
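As a worked illustration, the map stage bounds can be evaluated for several slot allocations. The function name `map_stage_bounds` and all numeric values below are hypothetical; only the two formulas come from the description.

```python
def map_stage_bounds(n_m, s_m, m_avg, m_max):
    """Lower and upper bounds on the map stage duration.

    n_m: number of map tasks (NM); s_m: allocated map slots (SM);
    m_avg, m_max: average and maximum map task durations from the job profile.
    """
    if s_m <= 0:
        raise ValueError("at least one map slot must be allocated")
    low = n_m / s_m * m_avg
    up = (n_m - 1) / s_m * m_avg + m_max
    return low, up


# Hypothetical profile: 200 map tasks, 30 s average and 45 s maximum duration.
for s_m in (25, 50, 100):
    low, up = map_stage_bounds(200, s_m, 30.0, 45.0)
    print(f"S_M={s_m:3d}: {low:7.1f} s <= map stage <= {up:7.1f} s")
```

Doubling the number of allocated map slots halves the lower bound, while the upper bound shrinks more slowly because the Mmax term does not depend on the allocation.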
The reduce stage includes shuffle, sort and reduce phases. Similar to the computation of the lower and upper bounds of the map stage, the lower and upper bounds of time durations for each of the shuffle phase (TShlow,TShup), sort phase (TSortlow,TSortup), and reduce phase (TRlow,TRup) are computed. The application of the Makespan theorem is based on the average and maximum durations of the tasks in these phases (respective values of the average and maximum time durations of the shuffle phase, the average and maximum time durations of the sort phase, and the average and maximum time durations of the reduce phase) and the numbers of reduce tasks NR and allocated reduce slots SR, respectively. The formulae for calculating (TShlow,TShup), (TSortlow,TSortup), and (TRlow,TRup) are similar to the formulae for calculating TMlow and TMup set forth above, except that variables associated with the reduce tasks and reduce slots and the respective phases of the reduce stage are used instead.
The subtlety lies in estimating the duration of the shuffle phase. As noted above, the first shuffle phase is treated separately from the typical shuffle phase (which is a shuffle phase subsequent to the first shuffle phase), and the measurements of the first shuffle phase cover only the portion that does not overlap the map stage. The portion of the typical shuffle phase in the subsequent reduce waves (after the first reduce wave) is computed as follows:
where Shavgtyp is the average duration of a typical shuffle phase, and Shmaxtyp is the maximum duration of the typical shuffle phase. The formulae for the lower and upper bounds of the overall completion time of job J are as follows:
$$T_J^{low} = T_M^{low} + Sh_{avg}^{1} + T_{Sh}^{low} + T_{Sort}^{low} + T_R^{low},$$

$$T_J^{up} = T_M^{up} + Sh_{max}^{1} + T_{Sh}^{up} + T_{Sort}^{up} + T_R^{up},$$
where Shavg1 is the average duration of the first shuffle phase, and Shmax1 is the maximum duration of the first shuffle phase. TJlow and TJup represent optimistic and pessimistic predictions (lower and upper bounds) of the job J completion time. Thus, it can be seen that the lower and upper bounds of durations of the job J are based on properties of the job J profile and based on the allocated numbers of map and reduce slots. The properties of the performance model, which include TJlow and TJup in some implementations, are thus based on both the job profile as well as allocated numbers of map and reduce slots.
In some implementations, estimates based on the average value between the lower and upper bounds tend to be closer to the measured duration. Therefore, TJavg is defined as follows:
$$T_J^{avg} = \frac{T_J^{up} + T_J^{low}}{2}.$$
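The combination of the per-phase bounds into TJlow, TJup, and TJavg can be sketched as below. The per-phase bounds are taken as inputs rather than derived, the type and function names are hypothetical, and the numeric values are invented for illustration. The first shuffle phase is passed as a pair whose lower value is Shavg1 and whose upper value is Shmax1.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Bounds:
    low: float  # optimistic (lower bound) duration, in seconds
    up: float   # pessimistic (upper bound) duration, in seconds


def job_completion_bounds(map_stage, first_shuffle, shuffle, sort, reduce_phase):
    """Sum the per-phase lower bounds and upper bounds into job-level bounds."""
    phases = (map_stage, first_shuffle, shuffle, sort, reduce_phase)
    return Bounds(low=sum(p.low for p in phases), up=sum(p.up for p in phases))


def job_completion_avg(bounds):
    """Average of the lower and upper bounds, used as the point estimate."""
    return (bounds.up + bounds.low) / 2


# Hypothetical per-phase bounds for one allocation of map and reduce slots.
job = job_completion_bounds(
    map_stage=Bounds(120.0, 165.0),
    first_shuffle=Bounds(10.0, 20.0),  # (Shavg1, Shmax1)
    shuffle=Bounds(30.0, 50.0),
    sort=Bounds(5.0, 8.0),
    reduce_phase=Bounds(40.0, 60.0),
)
estimate = job_completion_avg(job)
```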
In some implementations, the value TJavg is considered the estimated completion time for job J (estimated at 306 in
The estimation of a performance characteristic of a job, such as its completion time, can be computed relatively quickly, since the calculations as discussed above are relatively simple. As a result, the master node 110 (
Machine-readable instructions of modules described above (including 116, 120, 122 in
Data and instructions are stored in respective storage devices, which are implemented as one or more computer-readable or machine-readable storage media. The storage media include different forms of memory including semiconductor memory devices such as dynamic or static random access memories (DRAMs or SRAMs), erasable and programmable read-only memories (EPROMs), electrically erasable and programmable read-only memories (EEPROMs) and flash memories; magnetic disks such as fixed, floppy and removable disks; other magnetic media including tape; optical media such as compact disks (CDs) or digital video disks (DVDs); or other types of storage devices. Note that the instructions discussed above can be provided on one computer-readable or machine-readable storage medium, or alternatively, can be provided on multiple computer-readable or machine-readable storage media distributed in a large system having possibly plural nodes. Such computer-readable or machine-readable storage medium or media is (are) considered to be part of an article (or article of manufacture). An article or article of manufacture can refer to any manufactured single component or multiple components. The storage medium or media can be located either in the machine running the machine-readable instructions, or located at a remote site from which machine-readable instructions can be downloaded over a network for execution.
In the foregoing description, numerous details are set forth to provide an understanding of the subject disclosed herein. However, implementations may be practiced without some or all of these details. Other implementations may include modifications and variations from the details discussed above. It is intended that the appended claims cover such modifications and variations.
Filing Document | Filing Date | Country | Kind | 371c Date |
---|---|---|---|---|
PCT/US2011/023438 | 2/2/2011 | WO | 00 | 7/30/2013 |