Some embodiments described herein relate to techniques for implementing a MapReduce system. MapReduce systems provide a framework for parallelizing processing tasks to be performed on large data sets. The MapReduce framework assigns processing resources to function as Mappers and Reducers, which execute customizable Map functions and Reduce functions, respectively. Mappers operate in parallel to process input data according to the Map function, and Reducers operate in parallel to process Mapper output according to the Reduce function, to produce output data from the MapReduce program.
A canonical example of a MapReduce program is known as “Word Count,” and operates to count the number of appearances of each word that occurs in a set of documents. The Map function assigns each unique word as a key, and counts each appearance of each word in a portion of the input data (e.g., one or more documents in the set of documents). The input data set is divided into splits such that each Mapper counts the appearances of words in a portion of the document set, and the Mappers operate in parallel to count the word appearances in the entire data set in a distributed fashion. The data output by the Mappers is in the form of [key, value] pairs, with the key in each pair representing a particular word, and the value in each pair representing a count of one or more appearances of that word in the input data split processed by the Mapper that generated that [key, value] pair. A Shuffle stage delivers the [key, value] pairs from the Mappers to the Reducers, with each Reducer being responsible for a particular set of keys (i.e., a particular subset of words out of the total set of unique words that occur in the input document set), and the Reducers operating in parallel to compute the total counts of all the words in the data set in a distributed fashion. The Reduce function sums the values received from all of the Mappers for a particular key, outputting the sum as the total count of appearances of that particular word in the input document set. The output data from all of the Reducers, organized by key, thus contains a total count of appearances of each unique word in the input data set.
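For illustration, a minimal, framework-independent sketch of the Word Count Map and Reduce functions is shown below. The class name, method signatures, and in-process shuffle are illustrative assumptions for exposition only, not a description of any particular MapReduce implementation.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.AbstractMap.SimpleEntry;

// Minimal, framework-independent sketch of the "Word Count" MapReduce program.
// The method names (map, reduce) and the pair representation are illustrative only.
public class WordCount {

    // Map function: emit a [word, 1] pair for every word in one split of input text.
    static List<Map.Entry<String, Integer>> map(String split) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String word : split.toLowerCase().split("\\W+")) {
            if (!word.isEmpty()) {
                pairs.add(new SimpleEntry<>(word, 1));
            }
        }
        return pairs;
    }

    // Reduce function: sum all counts received for a single key (word).
    static int reduce(String key, Iterable<Integer> values) {
        int total = 0;
        for (int v : values) {
            total += v;
        }
        return total;
    }

    public static void main(String[] args) {
        // Simulate one Mapper and one Reducer in-process for illustration.
        Map<String, List<Integer>> shuffled = new HashMap<>();
        for (Map.Entry<String, Integer> pair : map("the quick brown fox jumps over the lazy dog the end")) {
            shuffled.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
        }
        shuffled.forEach((word, counts) ->
                System.out.println(word + " -> " + reduce(word, counts)));
    }
}
```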
One type of embodiment is directed to apparatus comprising at least one processor configured to execute one or more MapReduce applications that cause the at least one processor to function as at least a Mapper in a MapReduce system, and at least one processor-readable storage medium storing processor-executable instructions that, when executed by the at least one processor, cause the at least one processor to perform a method comprising: accessing data stored in a file system implemented on at least one nonvolatile storage medium; and in response to input data being written to the file system by an application other than the one or more MapReduce applications, accessing a set of one or more Map functions applicable to the input data, executing at least one Map function of the one or more Map functions on the input data, and outputting at least one set of [key, value] pairs resulting from execution of the at least one Map function on the received input data.
Another type of embodiment is directed to a method for use with at least one processor configured to execute one or more MapReduce applications that cause the at least one processor to function as at least a Mapper in a MapReduce system, the method comprising: accessing data stored in a file system implemented on at least one nonvolatile storage medium; and in response to input data being written to the file system by an application other than the one or more MapReduce applications, accessing a set of one or more Map functions applicable to the input data, executing, via the at least one processor functioning as at least the Mapper in the MapReduce system, at least one Map function of the one or more Map functions on the input data, and outputting at least one set of [key, value] pairs resulting from execution of the at least one Map function on the received input data.
Another type of embodiment is directed to at least one processor-readable storage medium storing processor-executable instructions that, when executed, perform a method for use with at least one processor configured to execute one or more MapReduce applications that cause the at least one processor to function as at least a Mapper in a MapReduce system, the method comprising: accessing data stored in a file system implemented on at least one nonvolatile storage medium; and in response to input data being written to the file system by an application other than the one or more MapReduce applications, accessing a set of one or more Map functions applicable to the input data, executing, via the at least one processor functioning as at least the Mapper in the MapReduce system, at least one Map function of the one or more Map functions on the input data, and outputting at least one set of [key, value] pairs resulting from execution of the at least one Map function on the received input data.
Another type of embodiment is directed to apparatus comprising a processor configured to function as at least a Mapper in a MapReduce system, and a processor-readable storage medium storing processor-executable instructions that, when executed by the processor, cause the processor to perform a method comprising: generating a set of [key, value] pairs by executing a Map function on input data; and storing the set of [key, value] pairs in a storage system implemented on at least one data storage medium, the storage system being organized into a plurality of divisions with different divisions of the storage system storing [key, value] pairs corresponding to different keys, the storing comprising storing in a first division of the plurality of divisions both a first [key, value] pair corresponding to a first key handled by a first Reducer in the MapReduce system and a second [key, value] pair corresponding to a second key handled by a second Reducer in the MapReduce system.
Another type of embodiment is directed to a method comprising: generating, via a processor configured to function as at least a Mapper in a MapReduce system, a set of [key, value] pairs by executing a Map function on input data; and storing the set of [key, value] pairs in a storage system implemented on at least one data storage medium, the storage system being organized into a plurality of divisions with different divisions of the storage system storing [key, value] pairs corresponding to different keys, the storing comprising storing in a first division of the plurality of divisions both a first [key, value] pair corresponding to a first key handled by a first Reducer in the MapReduce system and a second [key, value] pair corresponding to a second key handled by a second Reducer in the MapReduce system.
Another type of embodiment is directed to at least one processor-readable storage medium storing processor-executable instructions that, when executed by a processor configured to function as at least a Mapper in a MapReduce system, perform a method comprising: generating a set of [key, value] pairs by executing a Map function on input data; and storing the set of [key, value] pairs in a storage system implemented on at least one data storage medium, the storage system being organized into a plurality of divisions with different divisions of the storage system storing [key, value] pairs corresponding to different keys, the storing comprising storing in a first division of the plurality of divisions both a first [key, value] pair corresponding to a first key handled by a first Reducer in the MapReduce system and a second [key, value] pair corresponding to a second key handled by a second Reducer in the MapReduce system.
Another type of embodiment is directed to apparatus comprising a processor configured to function as at least a first Reducer in a MapReduce system, and a processor-readable storage medium storing processor-executable instructions that, when executed by the processor, cause the processor to perform a method comprising: receiving a set of mapped [key, value] pairs output from a Mapper in the MapReduce system; identifying, within the set of mapped [key, value] pairs, one or more [key, value] pairs for whose keys the first Reducer is not responsible; and transferring the one or more identified [key, value] pairs to one or more other Reducers in the MapReduce system.
Another type of embodiment is directed to a method comprising: receiving, at a processor configured to function as at least a first Reducer in a MapReduce system, a set of mapped [key, value] pairs output from a Mapper in the MapReduce system; identifying, by the processor configured to function as at least the first Reducer, within the set of mapped [key, value] pairs, one or more [key, value] pairs for whose keys the first Reducer is not responsible; and transferring the one or more identified [key, value] pairs to one or more other Reducers in the MapReduce system.
Another type of embodiment is directed to at least one processor-readable storage medium storing processor-executable instructions that, when executed by a processor configured to function as at least a first Reducer in a MapReduce system, perform a method comprising: receiving a set of mapped [key, value] pairs output from a Mapper in the MapReduce system; identifying, within the set of mapped [key, value] pairs, one or more [key, value] pairs for whose keys the first Reducer is not responsible; and transferring the one or more identified [key, value] pairs to one or more other Reducers in the MapReduce system.
Another type of embodiment is directed to apparatus comprising at least one processor, and at least one processor-readable storage medium storing processor-executable instructions that, when executed by the at least one processor, cause the at least one processor to perform a method comprising: receiving a data packet including a set of mapped [key, value] pairs corresponding to a plurality of keys handled by a plurality of Reducers in a MapReduce system; and for each mapped [key, value] pair in the set of mapped [key, value] pairs, identifying a key corresponding to the respective mapped [key, value] pair, identifying a Reducer of the plurality of Reducers responsible for the identified key, and providing the respective mapped [key, value] pair to the identified Reducer for processing.
Another type of embodiment is directed to a method comprising: receiving, at at least one processor, a data packet including a set of mapped [key, value] pairs corresponding to a plurality of keys handled by a plurality of Reducers in a MapReduce system; and for each mapped [key, value] pair in the set of mapped [key, value] pairs, identifying, via the at least one processor, a key corresponding to the respective mapped [key, value] pair, identifying, via the at least one processor, a Reducer of the plurality of Reducers responsible for the identified key, and providing the respective mapped [key, value] pair to the identified Reducer for processing.
Another type of embodiment is directed to at least one processor-readable storage medium storing processor-executable instructions that, when executed by at least one processor, cause the at least one processor to perform a method comprising: receiving a data packet including a set of mapped [key, value] pairs corresponding to a plurality of keys handled by a plurality of Reducers in a MapReduce system; and for each mapped [key, value] pair in the set of mapped [key, value] pairs, identifying a key corresponding to the respective mapped [key, value] pair, identifying a Reducer of the plurality of Reducers responsible for the identified key, and providing the respective mapped [key, value] pair to the identified Reducer for processing.
Another type of embodiment is directed to apparatus comprising a processor configured to function as at least a Mapper in a MapReduce system, and a processor-readable storage medium storing processor-executable instructions that, when executed by the processor, cause the processor to perform a method comprising: generating mapped [key, value] pairs by executing a Map function on input data; collecting a set of the mapped [key, value] pairs corresponding to a plurality of keys handled by a plurality of Reducers in the MapReduce system; and transmitting the collected set of the mapped [key, value] pairs in a data packet to a device, local to the plurality of Reducers, responsible for routing the mapped [key, value] pairs in the data packet to the plurality of Reducers.
Another type of embodiment is directed to a method comprising: generating, via a processor configured to function as at least a Mapper in a MapReduce system, mapped [key, value] pairs by executing a Map function on input data; collecting a set of the mapped [key, value] pairs corresponding to a plurality of keys handled by a plurality of Reducers in the MapReduce system; and transmitting the collected set of the mapped [key, value] pairs in a data packet to a device, local to the plurality of Reducers, responsible for routing the mapped [key, value] pairs in the data packet to the plurality of Reducers.
Another type of embodiment is directed to at least one processor-readable storage medium storing processor-executable instructions that, when executed by a processor configured to function as at least a Mapper in a MapReduce system, perform a method comprising: generating mapped [key, value] pairs by executing a Map function on input data; collecting a set of the mapped [key, value] pairs corresponding to a plurality of keys handled by a plurality of Reducers in the MapReduce system; and transmitting the collected set of the mapped [key, value] pairs in a data packet to a device, local to the plurality of Reducers, responsible for routing the mapped [key, value] pairs in the data packet to the plurality of Reducers.
The accompanying drawings are not intended to be drawn to scale. In the drawings, each identical or nearly identical component that is illustrated in various figures is represented by a like numeral. For purposes of clarity, not every component may be labeled in every drawing. In the drawings:
MapReduce system, MapReduce program: As used herein, the term “MapReduce system” refers to a system implemented via hardware, software, or a combination of hardware and software, including processing resources configured to function as a set of one or more Mappers and a set of one or more Reducers. The Mappers are capable of processing data in parallel to each other, and the Reducers are capable of processing data in parallel to each other. Each Reducer is configured to process data received from the Mappers according to a customizable Reduce function, with the data organized into categories referred to as “keys.” Each Mapper is configured to process input data according to a customizable Map function, which outputs data that is paired with keys. Taken together, the Map function and the Reduce function are referred to as the “MapReduce program,” which is executed by the MapReduce system and may be user- or application-defined. In a MapReduce system with multiple Mappers, the input data to be processed by the MapReduce program is divided into portions referred to as “splits,” with each Mapper processing a different split or set of splits of the input data. In a Shuffle stage, the output data from the Mappers is distributed to the Reducers according to the keys, with different Reducers processing data corresponding to different keys. If there is a case in which a customized MapReduce program does not specify a Reduce function, then the output of the MapReduce system may be the mapped [key, value] pairs generated by the Mappers according to the specified Map function.
[End Glossary]
Step 12 is a second step of the example depicted in
The output of each map operation performed by the Mapper #1 (104) is a sequence of [key,value] pairs, where the key determines which Reducer will receive this [key,value] as input. These outputs are typically placed in an in-memory buffer on the Mapper (104). However, as more and more input data is processed, this buffer may approach overflow. To make room in the memory buffer space, the data held in the buffer must be moved, and this is traditionally done by moving the data to disk. The literature may use the term “spill” or “spill to disk” to describe this emptying of the buffer by moving data held in the buffer to disk. The spilling process performs two functions to organize the data prior to its movement to disk. The first is that the data is “partitioned,” which groups each [key,value] pair into a set of [key,value] pairs bound for a particular Reducer as determined by the key and the MapReduce Partition function, which may be implemented in a default manner or customized by the user. Typical MapReduce Partition functions assign each key to a single Reducer, with each Reducer being assigned multiple keys. Some embodiments described below may be useful for MapReduce applications suitable for execution on very many Mapper and Reducer processes, such as millions of Mapper processes and millions of Reducer processes. In such a case, a benefit of a Partition function may be to load-balance the Reducers so that each is equally likely to be assigned work ([key,value] inputs), and the particular Reducer process to which a [key,value] is assigned may not be important to the functioning of the MapReduce program. Another capability provided by the Partition function is that it allows separate Mapper processes to send [key,value] pairs with the same key to the same Reducer process without those Mapper processes having to communicate with each other. This is assured by having the separate Mapper processes use the same Partition function, which deterministically produces a Reducer process assignment (i.e., the partition) from the key of a [key,value] input pair.
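For illustration, a minimal sketch of a typical hash-based Partition function of the kind described above is shown below. The class and method names are illustrative assumptions; real frameworks may use different signatures and hash functions.

```java
// Hedged sketch of a typical default Partition function: a deterministic hash of the
// key, reduced modulo the number of Reducers. Because every Mapper applies the same
// function, pairs with the same key always land on the same Reducer without any
// Mapper-to-Mapper communication. The class and method names are illustrative.
public class HashPartitioner {

    // Returns the partition (Reducer index) in [0, numReducers) for the given key.
    static int getPartition(String key, int numReducers) {
        // Mask off the sign bit so hashCode() cannot produce a negative index.
        return (key.hashCode() & Integer.MAX_VALUE) % numReducers;
    }

    public static void main(String[] args) {
        int numReducers = 4;
        for (String key : new String[] {"apple", "banana", "apple", "cherry"}) {
            System.out.println(key + " -> Reducer " + getPartition(key, numReducers));
        }
    }
}
```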
Subsequent to the above partitioning step, and prior to spilling the data to disk, there is a step which sorts each partition's [key,value] pairs by key. An optional step not depicted in
Upon completion of step 16 of the example depicted in
Not only does each Mapper send output to multiple Reducers in the example of
In step 22 illustrated in
The ellipsis between steps 24 and 26 indicates that many iterations of the merge operation depicted in step 24 may be performed. The second-to-last merge sort operation is performed in step 26 by reading Partially Sorted Data (190) from Disk (168), sorting this data on the Reducer (148), and storing the Partially Sorted Data (192) back to Disk (168). The final sorting operation is performed in step 28 by reading Partially Sorted Data (198) from Disk (168), which the Reducer (148) then sorts. Once the data is completely sorted on the Reducer (or as it is being sorted, since the beginning of the data will be sorted prior to the end of the data when merge sort is used), the Reduce function can be performed by the Reducer. When a new key is encountered, all values assigned to that key can be passed as input (e.g., via an iterator) to a Reduce function execution on the Reducer (148), up until the first value assigned to a different key, which indicates the end of all values for the previous key.
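For illustration, the following sketch shows how a Reducer might walk a key-sorted stream of [key,value] pairs and invoke the Reduce function once per run of equal keys, as described above. The pair representation and the summing Reduce function are illustrative assumptions.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;

// Sketch of how a Reducer can walk a key-sorted stream of [key, value] pairs and
// invoke the Reduce function once per run of equal keys. Illustrative assumptions:
// String keys, Integer values, and a summing Reduce function as in Word Count.
public class SortedRunReducer {

    static void reducePerKey(List<Map.Entry<String, Integer>> sortedPairs) {
        String currentKey = null;
        List<Integer> values = new ArrayList<>();
        for (Map.Entry<String, Integer> pair : sortedPairs) {
            if (currentKey != null && !currentKey.equals(pair.getKey())) {
                // A new key marks the end of all values for the previous key.
                emit(currentKey, values);
                values = new ArrayList<>();
            }
            currentKey = pair.getKey();
            values.add(pair.getValue());
        }
        if (currentKey != null) {
            emit(currentKey, values);
        }
    }

    static void emit(String key, List<Integer> values) {
        int sum = values.stream().mapToInt(Integer::intValue).sum();
        System.out.println(key + " -> " + sum);
    }

    public static void main(String[] args) {
        reducePerKey(Arrays.asList(
                new SimpleEntry<>("apple", 2),
                new SimpleEntry<>("apple", 3),
                new SimpleEntry<>("banana", 1)));
    }
}
```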
The inventors have recognized that a major performance penalty paid by conventional MapReduce frameworks is caused by disk-based sorting, which can take hours to complete. (Indeed, reading all data that is on a hard drive once can itself take hours.) The inventors have further recognized that conventional disk-based sorting requires all of the data-to-be-sorted to be present on the disk before the sorting task can be accomplished. The inventors have appreciated that this limitation contributes to constraining conventional MapReduce to be performed as a batch process on a complete set of input data, and to making conventional MapReduce unusable for, e.g., streaming data. By contrast, some embodiments disclosed herein can begin executing MapReduce programs on input data as portions of a data set arrive at the file system serviced by the MapReduce system, e.g., before the entire input data set has been received at the file system. Some embodiments may reduce or eliminate disk-based sorting from the MapReduce system, through techniques described further below.
Accordingly, some embodiments described herein relate to techniques which may address one or more of the above-discussed shortcomings of traditional methods, and/or that may provide one or more of the foregoing benefits. However, aspects of the invention are not limited to any of these benefits, and it should be appreciated that some embodiments may not provide any of the above-discussed benefits and/or may not address any of the above-discussed deficiencies that the inventors have recognized in conventional techniques.
In some embodiments, a MapReduce system as described herein may optimistically elect to perform MapReduce programs on incoming input data so that results will be ready when a user or application later requests that such a program be run. When the request to run a given MapReduce program is later received, if that program has already been run on the input data, it need not be rerun, since the results of the MapReduce program may already have been output and may be available.
In some embodiments, the MapReduce system may be implemented via one or more MapReduce applications (e.g., software applications) executing on one or more processors, which may cause processing threads on the one or more processors to function as Mappers and/or Reducers in the MapReduce system. In some cases, a computer machine such as a server may have a single processor with a single processor core capable of a single thread of execution. In other cases, a server machine may have multiple processors, and/or a processor may have multiple cores, and/or a processor core may have multiple hardware threads (virtual cores). In some embodiments, a Mapper or Reducer may be executed by a processor thread. Thus, a single-thread, single-core, single-processor server may implement a single Mapper or Reducer at a given time (although it could implement one Mapper or Reducer during one time period and a different Mapper or Reducer during a different time period). A multi-processor and/or multi-core-per-processor and/or multi-thread-per-core server could potentially implement up to as many Mappers and/or Reducers as it has hardware threads in parallel at the same time.
In some embodiments, the MapReduce application(s) executing on one or more processors may monitor a file system to detect when data is written to the file system by one or more other applications (i.e., applications other than the MapReduce application(s)). The file system may be implemented on one or more nonvolatile storage media, such as a hard drive, a storage array, or any other suitable nonvolatile storage media. In some embodiments, the file system may represent a virtualized construct of logical volumes presenting an organization of the data that differs from how the data is physically stored in the hardware storage media. As such, in some embodiments, the MapReduce application(s) may monitor writes to the file system (e.g., to the abstraction layer) as opposed to the hardware storage media themselves. In other embodiments, the file system monitored by the MapReduce application(s) may simply be the nonvolatile storage media in which the data are stored.
In some embodiments, in response to input data being written to the file system by another application, one or more Mappers in the MapReduce system may access and execute one or more known Map functions on the input data. In this way, [key, value] pairs resulting from execution of the known Map function(s) can be precomputed so that they will be immediately available in the case of a later user or application request for the MapReduce program(s) including those Map function(s) to be performed on the input data. In some embodiments, the [key, value] pairs output by the Mapper(s) may be stored in one or more nonvolatile storage media, such as the media underlying the file system to which the input data is stored, one or more disks local to the Mapper(s), and/or any other suitable nonvolatile storage media. Alternatively or additionally, in some embodiments the output [key, value] pairs may be transferred to one or more Reducers in the MapReduce system for execution of the Reduce function and precomputation of the output of the MapReduce program. In some embodiments, a Map function or a MapReduce program (including both Map function and Reduce function) may be executed on input data as it arrives at the file system, e.g., as in the case of streaming data. In some embodiments, execution of a Map function on a portion of the input data stream may commence before other portions of the input data stream have been written to the file system by the other application. In some embodiments, furthermore, multiple Map functions or multiple MapReduce programs may be precomputed on input data upon its arrival at the file system, as described further below. In some embodiments, when a MapReduce program has been precomputed on input data but no request is later received for the MapReduce program to be executed (or its results to be output), the precomputed results may be discarded. For example, in some embodiments, precomputed MapReduce program results may be discarded after a suitable threshold time period has elapsed, or after a suitable threshold amount of precomputed data has been accumulated, or according to any other suitable criteria in the absence of receiving a request for the precomputed data. Discarding precomputed data may include deleting the data, transferring the data to a different location or file system, or otherwise disposing of the data in the context of non-use.
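For illustration, one possible way a MapReduce application could monitor a file system for writes by other applications and trigger precomputation of known Map functions is sketched below using the standard Java WatchService. The monitored path and the runKnownMapFunctions hook are hypothetical placeholders, not elements required by any embodiment.

```java
import java.nio.file.FileSystems;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardWatchEventKinds;
import java.nio.file.WatchEvent;
import java.nio.file.WatchKey;
import java.nio.file.WatchService;

// Hedged sketch of one way a MapReduce application could detect writes made to a
// monitored file system by other applications and kick off precomputation of known
// Map functions. The directory path and the runKnownMapFunctions hook are
// hypothetical placeholders.
public class InputWatcher {

    public static void main(String[] args) throws Exception {
        Path watched = Paths.get("/data/incoming");            // hypothetical mount point
        WatchService watcher = FileSystems.getDefault().newWatchService();
        watched.register(watcher, StandardWatchEventKinds.ENTRY_CREATE,
                StandardWatchEventKinds.ENTRY_MODIFY);

        while (true) {
            WatchKey key = watcher.take();                     // blocks until data is written
            for (WatchEvent<?> event : key.pollEvents()) {
                Path changed = watched.resolve((Path) event.context());
                runKnownMapFunctions(changed);                 // precompute [key, value] pairs
            }
            key.reset();
        }
    }

    static void runKnownMapFunctions(Path inputSplit) {
        // Placeholder: look up the Map functions applicable to this input and execute them.
        System.out.println("Precomputing Map output for " + inputSplit);
    }
}
```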
It should be appreciated that the foregoing description is by way of example only, and aspects of the invention are not limited to providing any or all of the above-described functionality, although some embodiments may provide some or all of the functionality described herein.
The aspects of the present invention described herein can be implemented in any of numerous ways, and are not limited to any particular implementation techniques. Thus, while examples of specific implementation techniques are described below, it should be appreciated that the examples are provided merely for purposes of illustration, and that other implementations are possible.
In some embodiments, an Initialization unit (310) may transmit Configuration Settings (315) to the Accelerator (220) during an initialization phase. In some embodiments, the Configuration Settings (315) may configure a storage of Known Map Functions (320). Once the Known Map Functions (320) have been initialized within the Accelerator (220), a selection of them, which may comprise all of them, may be loaded into the Currently Running Map Functions (330) unit. Alternatively, a load-balanced selection of MapReduce programs may be made such that output bandwidth to disk and computational resources are used in a balanced way. For example, the selection may be made such that the output bandwidth to disk is exhausted while computational resources remain unused, with a different selection of MapReduce programs run later exhibiting the opposite resource utilization. In some embodiments, one or more of the Known Map Functions (320) may be selected based on their being applicable to the Input Data (225). For example, in some embodiments, a set of one or more Map functions may be selected based on the Map function(s) being applicable to input data of a certain form, such as text data, numerical data, data from a particular domain or data source, etc. In some embodiments, a subset of the Known Map Functions (320) may be selected based on their being capable of execution on streaming input data.
As Input Data (225) arrives at the Accelerator, it is delivered to a Map Processor (300) (or processors). The Map Processor (300) loads a Map function from the set of Currently Running Map Functions (330). The Map Processor (300) then performs Input Splitting, Input Reading, and Mapping according to the specification of the loaded Map function. In some embodiments, the Map Processor (300) may implement a Java Virtual Machine (JVM) and the Map functions may be specified in Java and utilize a MapReduce library and JVM optimized for execution on the Accelerator's Map Processor(s) (300). Once a piece of Input Data (225) has been loaded, loading of the next piece of Input Data (225) may proceed while the current piece of Input Data (225) is being processed. During this process, each Currently Running Map Function (330) may be loaded in turn and executed on the current piece of Input Data (225). Once one Map Function completes on the current piece, the next Map Function may be loaded from element 330. If data arrives faster than it can be processed, the set of Currently Running Map Functions may be adjusted so that the Map Processor and output bandwidth (see Disk, 350) can keep up.
In some embodiments, the Map Processor may perform a segmentation operation on output [key,value] pairs that differs from the conventional Partitioning operation (
In some embodiments, there may be two different storage systems that may receive mapped [key, value] pairs—one implemented on one or more volatile storage media (e.g., a Mapper's internal or local memory, which in some embodiments may be dynamic random-access memory (DRAM), or other suitable volatile storage media), and one implemented on one or more nonvolatile storage media (e.g., a Mapper's hard disk, or other suitable nonvolatile storage media). In some embodiments, [key, value] pairs may first be collected in a set of buffers in the volatile storage system, and appropriately moved to a set of divisions (referred to herein as “buckets”) in the nonvolatile storage system as necessary or desired.
In some embodiments, each key may be assigned a segment which corresponds to a particular Bucket Buffer (342) in the set of Bucket Buffers (340) of the volatile storage system, in which [key, value] pairs corresponding to that key may be stored prior to arriving in the corresponding Disk Bucket (352) in the set of Disk Buckets (350) of the nonvolatile storage system. In some embodiments, the Disk Buckets (350) may lie on disk, which may not be capable of writing individual [key,value] data at distinct locations without some performance penalty, as the inventors have recognized. This is because disk storage today is primarily a sequential medium with random access times measured in milliseconds, such that only a hundred or so random accesses can be performed per second per disk. Contrast this with Random Access Memory (RAM), which measures random access time in nanoseconds, millions of which can be performed per second. The opposite of random access is sequential access. The inventors have appreciated that one way to achieve high disk IO bandwidth while still performing random accesses may be to make each random access perform a data operation that is of a sufficiently large size, for example, to balance the amount of time the storage medium spends seeking with the amount of time the storage medium spends writing. For a disk drive this might be, for example, 4 Megabytes (“4 MB”). For example, considering a disk having sequential read/write speeds of around 400 MB/sec and seek times around 6-10 ms, allowing for around 100 random seeks per second, balancing the write time against the seek time may result in a desired write size of 400/100=4 MB written per seek. Thus, in some embodiments the performance of the disk IO bandwidth relative to peak bandwidth (i.e., completely sequential) can be controlled by establishing an efficient write size, e.g., determined by Bucket Buffer size. At 4 MB it is not unreasonable to assume that a disk drive could support 100 writes (or reads) of 4 MB to random locations on disk per second. The bucket buffers, in some embodiments, may allow data destined for a particular Disk Bucket (352) to be pooled until it is of sufficiently large size to allow an efficient disk access (i.e., at a desired ratio of seek time to total (seek+write) time). In one embodiment, each Bucket Buffer (342) is 4 MB and the fullest Bucket Buffer (342) is constantly being emptied to its corresponding Disk Bucket (352) to avoid the buffers becoming too full (i.e., running out of space in a Bucket Buffer (342), which could require additional software handling that the inventors have recognized could lower performance in some cases). For example, in some embodiments, emptying of the next fullest buffer may begin in response to completion of the emptying of the previously fullest buffer.
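For illustration, a sketch of the "fullest bucket first" spill policy described above is shown below: mapped pairs accumulate in fixed-size in-memory bucket buffers, and the fullest buffer is emptied into its corresponding on-disk bucket so that each disk write is one large (e.g., approximately 4 MB) chunk. The buffer representation and the writeToDiskBucket stub are illustrative assumptions.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of the "fullest bucket first" spill policy: mapped pairs accumulate in
// fixed-size in-memory bucket buffers, and the fullest buffer is emptied into its
// corresponding on-disk bucket so each disk write is one large (~4 MB) chunk.
// Buffer sizes and the writeToDiskBucket stub are illustrative assumptions.
public class BucketBufferPool {

    static final int BUCKET_BUFFER_BYTES = 4 * 1024 * 1024;   // ~4 MB per buffer

    final List<List<byte[]>> buffers = new ArrayList<>();
    final int[] bytesUsed;

    BucketBufferPool(int numBuckets) {
        for (int i = 0; i < numBuckets; i++) {
            buffers.add(new ArrayList<>());
        }
        bytesUsed = new int[numBuckets];
    }

    // Append an encoded [key, value] record to the buffer for its assigned bucket.
    void add(int bucket, byte[] encodedPair) {
        buffers.get(bucket).add(encodedPair);
        bytesUsed[bucket] += encodedPair.length;
        if (bytesUsed[bucket] >= BUCKET_BUFFER_BYTES) {
            flushFullest();
        }
    }

    // Empty the fullest buffer into its disk bucket as one large write.
    void flushFullest() {
        int fullest = 0;
        for (int i = 1; i < bytesUsed.length; i++) {
            if (bytesUsed[i] > bytesUsed[fullest]) {
                fullest = i;
            }
        }
        writeToDiskBucket(fullest, buffers.get(fullest));       // one seek, one big write
        buffers.get(fullest).clear();
        bytesUsed[fullest] = 0;
    }

    void writeToDiskBucket(int bucket, List<byte[]> records) {
        // Placeholder for an appending write to the bucket's region/file on disk.
        System.out.println("Spilling bucket " + bucket);
    }

    public static void main(String[] args) {
        BucketBufferPool pool = new BucketBufferPool(1024);
        pool.add(926, "apple\t1".getBytes());                   // record lands in bucket 926's buffer
    }
}
```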
Given a desirable disk access chunk size (e.g. 4 MB), the number of buckets to support may be determined, in some embodiments, using a calculation technique. In this particular exemplary technique, the maximum amount of storage space on the Disk that will be utilized by the MapReduce system may first be determined. As an example, a 4 Terabyte (“4 TB”) disk drive may be fully dedicated to holding data for MapReduce, and the system may support up to its full utilization. In this case the working memory of the Map Processor (300), which may be the memory supporting the Bucket Buffers (340), memory internal to the Map Processor (300), or some other memory, may be established to be sufficiently large to hold an entire Disk Bucket (352) so that the keys in the Disk Bucket might be organized using the working memory (when it is not being otherwise used to store [key, value] pairs in Bucket Buffers 340), such as by creating a hash table of the key values. For example, in some embodiments, after processing of a Map function has completed on Input Data 225, and mapped [key, value] pairs have been assigned to and stored in Disk Buckets 350, the [key, value] pairs stored together in a Disk Bucket (352) may be read into memory, separated by their keys (e.g., using a hash table) into data bound for different Reducers, and then transferred to the appropriate Reducers for processing according to the Reduce function. The memory in which data from completed Disk Buckets is prepared for routing to Reducers may be the same memory in which Bucket Buffers 340 were previously implemented during processing of the Map function, or may be a different memory. In some embodiments, this working memory may be twice the size of a Disk Bucket (352), leaving empty room that allows a hash table to be implemented and operated efficiently.
In some embodiments in which the Bucket Buffers are implemented in the working memory of the Map Processor (300), desired sizes for separate Bucket Buffer (340) and working memory may be determined using any suitable calculation process, one non-limiting example of which is described below.
An exemplary calculation to determine the number of buckets to be supported may be performed iteratively by starting with a working memory that is too small or at least can be trivially supported in hardware. One can start an example of this process at 512 Megabytes (“512 MB”). Because one knows the desired size of each Bucket Buffer (342) is 4 MB (see above) one can divide the total Bucket Buffers (340) memory size (512 MB) by this (4 MB) to arrive at a number of buckets supported. In this case that is 512/4, which is 128 buckets. Suppose that each Disk Bucket (352) was written to disk many times during a set of MapReduce processes that have approximately filled each bucket on disk. In this case (in which each Bucket Buffer in memory has a corresponding Disk Bucket on disk) the 4 TB have been written to disk in 128 separate buckets. The size of each bucket on disk is 4 TB/128, which is 32 GB. The data in each bucket is unsorted and thus if the working memory of the Map Processor (300) is less than 32 GB then the data may be difficult to organize efficiently. The exemplary calculation continues by increasing the size of the working memory and Bucket Buffers (340) to 1 GB.
The above process is then performed again starting with a total memory size of 1 GB, and the size of the working memory and Bucket Buffers (340) is increased until it is of sufficient size to establish a Disk Bucket size that does not exceed the size capacity of the working memory. 1 GB/4 MB=256, therefore a 1 GB Bucket Buffers (340) and working memory would allow for 256 separate buckets (342, 352). If each of these buckets (352) has been nearly filled on Disk (350) then each bucket holds approximately 4 TB/256=16 GB. Since 16 GB may not be efficiently organized with Bucket Buffers (340) and working memory of size 1 GB, the size of the memory may be increased again.
At 2 GB the Bucket Buffers (340) and working memory support 2 GB/4 MB, which is 512 buckets. When Disk (350) is full, these buckets (352) will be 4 TB/512 in size, which is 8 GB. 8 GB is larger than the 2 GB memory and thus the size of the Bucket Buffers (340) and working memory may be increased again, this time to 4 GB. Now the Bucket Buffers (340) can support 4 GB/4 MB, which is 1024 buckets. When the disk is full, each Disk Bucket (352) will be about 4 TB/1024, which is 4 Gigabytes in size. Since the working memory is 4 GB, a full bucket can be read from disk and organized in memory. Thus 4 GB may be approximately the right size for the Bucket Buffers and working memory in this example. In some embodiments, once the appropriate size for the Bucket Buffers (340) has been determined, the hardware may be designed with such a memory. Thus this calculation may be performed at design time, in some embodiments. It is also possible to use the above calculation technique to determine what capacity of disk is supported by a given size of Bucket Buffers (340) and working memory (given, e.g., the desired disk access data chunk size, e.g., 4 MB). In other embodiments, however, it may not be possible or desirable to optimize the size of a Mapper's volatile or nonvolatile storage system at design time, and calculations may instead be performed later, given predetermined hardware capacities, to determine appropriate numbers and/or sizes of divisions (e.g., buckets, buffers) to implement in the storage system(s). In other embodiments, appropriate numbers and/or sizes of storage system divisions may be determined based on any suitable considerations other than hardware capacities, such as characteristics of a Map function and/or of input data, such as the number of keys to be supported, etc. Also, although examples described above have incorporated equal-sized storage system divisions and corresponding numbers of memory buffers and disk buckets, it should be appreciated that such designs are not required, and in other embodiments divisions may be established of unequal sizes and/or numbers.
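For illustration, the iterative sizing calculation walked through above can be expressed as the following sketch, which starts from a small working memory and doubles it until a full Disk Bucket (disk capacity divided by the number of approximately 4 MB Bucket Buffers that fit in memory) fits back into that memory. The constants mirror the example (4 TB disk, 4 MB chunk, 512 MB starting memory); they are not requirements.

```java
// Sketch of the iterative sizing calculation described above: starting from a small
// working memory, double it until a full disk bucket (disk capacity divided by the
// number of ~4 MB bucket buffers that fit in memory) fits back into that memory.
// The constants mirror the worked example (4 TB disk, 4 MB chunk, 512 MB start).
public class BucketSizing {

    public static void main(String[] args) {
        long diskBytes = 4L * 1024 * 1024 * 1024 * 1024;   // 4 TB of disk dedicated to buckets
        long chunkBytes = 4L * 1024 * 1024;                // desired ~4 MB write per seek
        long memoryBytes = 512L * 1024 * 1024;             // start at 512 MB

        while (true) {
            long numBuckets = memoryBytes / chunkBytes;    // buffers that fit in memory
            long bucketBytes = diskBytes / numBuckets;     // size of each full disk bucket
            System.out.printf("memory=%d MB buckets=%d bucket=%d GB%n",
                    memoryBytes >> 20, numBuckets, bucketBytes >> 30);
            if (bucketBytes <= memoryBytes) {
                break;                                     // a full bucket can be organized in memory
            }
            memoryBytes *= 2;                              // try a larger working memory
        }
    }
}
```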
According to any suitable Bucket Buffer emptying policy, such as the “fullest bucket first” priority scheme described previously, data in a bucket buffer (342) may be added to its corresponding Disk bucket (352) in some embodiments.
In some embodiments, a set of Map Processors (300) may be associated with a working memory of size 4 GB and Disk Buckets (350) of capacity 4 TB. In some embodiments, the disks may be physically implemented as four separate 1 TB 2.5″ drives, which may allow higher aggregate bandwidth and a higher number of disk seeks per second than a single 4 TB drive. In some embodiments, the Accelerator may use 4 GB of the Mapper's motherboard DRAM memory and interact with this memory directly over PCI Express, as well as with the disks via a Redundant Array of Independent Disks (RAID) controller also connected via PCI Express, which the Map Processor may have driver software to control.
In another embodiment, the RAID controller and Accelerator 220 may be connected to a PCI express switch that is separate from the Mapper's PCI express switch, so that the Disks (350), Accelerator (220), and RAID controller can all be integrated into the same chassis module. This may allow these components to be added to a Mapper as a single unit. In some embodiments, multiple such units may be added to a single Mapper depending on cost, workload, and desired performance. The PCI Express switch built into the Accelerator's housing could then provide a single uplink to the motherboard PCI Express Switch, enabling the unit to use a single interface to the motherboard.
In some embodiments, although a Disk may support 4 TB, it may be chosen slightly oversized and typical use may tend to fill each bucket half-full, or 2 GB each in the example above. In this case, the 4 GB of memory could be used to provide space for an efficient hash table for organizing an entire 2 GB worth of bucket data (352) (i.e., empty space may be available so that collisions may be sufficiently infrequent as to be efficient).
In another embodiment, the Map Processor (300) may interact directly with a Dual In-line Memory Module (DIMM) holding 4 GB of data. In another embodiment, a Field-Programmable Gate Array (FPGA) may connect to the same PCI Express switch as the Accelerator (220) and also to several DRAM modules that together comprise 4 GB. The FPGA may be configured to allow the Accelerator (220) to efficiently interact with the DRAM memory modules in the case that the Accelerator (220) does not have a direct interface for DRAM memory modules. Thus, in some embodiments, an Accelerator (220) that contains only a PCI Express switch interface may be integrated into the system using PCI-express-attached memory and PCI-express-attached disk, and this may all be integrated into a combined housing that exposes a single physical PCI Express interface that may be connected to the Mapper motherboard.
In some embodiments, the determination of which Bucket Buffer (342) a particular [key,value] pair is moved to by the Map Processor (300) may be performed by a deterministic hashing function that gives each Bucket Buffer (342) an equal chance of having a [key,value] pair added to it. Any suitable hashing functions may be used for this purpose; one non-limiting example is SHA-1 (“secure hash algorithm 1,” published by the National Institute of Standards and Technology) combined with a mod (remainder) function. For example, a key might hold the value 8512, which might be hashed to 7070, and then further hashed with a mod function so that it is within the bounds of the number of buffers. In the case that 1024 buckets are available, 7070 would be modded by 1024, resulting in a value of 926; thus the [key,value] would be placed in bucket 926. Some embodiments may add to a [key,value] pair a third attribute in the case that [key,value] pairs from different MapReduce algorithms share data structures such as Bucket Buffers (340) and Disk Buckets (350). An attribute indicating which MapReduce program the [key,value] has been produced by (and will be consumed by) may be added to the pair [key,value]. Logically this may be considered as transforming each [key,value] pair into a [key,value,program] triplet. The program attribute may determine which program will be loaded to process the [key,value] pair held by the triplet. In some embodiments the data structures that manage the organizing and routing of the keys may use an alternative key based on the original key but with the program attribute (or a hash of it) prepended (thus the new key may be a combination of the original key and the program attribute, in some embodiments).
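For illustration, a sketch of the bucket-assignment hashing described above is shown below: SHA-1 over the key, optionally prefixed with a program identifier when multiple MapReduce programs share the same bucket structures, reduced modulo the number of buckets. The exact encoding of the program prefix is an illustrative assumption.

```java
import java.math.BigInteger;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Sketch of the bucket-assignment hashing described above: SHA-1 over the key
// (optionally prefixed with a program identifier when several MapReduce programs
// share the same bucket structures), reduced modulo the number of buckets. The
// exact encoding of the program prefix is an illustrative assumption.
public class BucketAssigner {

    static int assignBucket(String programId, String key, int numBuckets) {
        try {
            MessageDigest sha1 = MessageDigest.getInstance("SHA-1");
            byte[] digest = sha1.digest((programId + ":" + key).getBytes(StandardCharsets.UTF_8));
            // Interpret the digest as a non-negative integer and take the remainder.
            return new BigInteger(1, digest).mod(BigInteger.valueOf(numBuckets)).intValue();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("SHA-1 not available", e);
        }
    }

    public static void main(String[] args) {
        // With 1024 buckets, every key deterministically maps to one of buckets 0..1023.
        System.out.println(assignBucket("wordcount", "8512", 1024));
    }
}
```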
The arrows in
In the example depicted in
In the example of
In the example of
The Bucket Buffers (445) are depicted in the example of
In some embodiments, Buffer Memory 440 may be implemented with one or more DRAM chips, e.g., integrated in a DIMM. In some embodiments, Map Processor 400 may be a processor including a set of cores connected by a network-on-chip. Each core may implement multiple hardware threads, in some embodiments, which may utilize fine-grain context switching to allow for high latency tolerance for network operations. In some embodiments, the Map Processor may include an FPGA. In some embodiments, the Map Processor may include an Application-Specific Integrated Circuit (ASIC). In some embodiments, a cacheless memory system may be integrated on the Map Processor, while in other embodiments, an incoherent cache may be implemented, which may provide better performance at a lower level of power consumption. In some embodiments, Key Hash Table 450 may be implemented using in-package memory coupled with the Map Processor; in other embodiments, Key Hash Table 450 may be implemented with on-chip Static Random-Access Memory (SRAM). The inventors have appreciated that SRAM may be more suitable if the size of a full Bucket Buffer is similar to the amount of on-chip SRAM that can be integrated on the Map Processor. In some embodiments, Known Combiner Algorithms 420 and/or Currently Running Combiner Algorithms 430 may be held in on-chip memory, such as SRAM, embedded DRAM, or on-chip Flash memory. In some embodiments, Single Bucket Buffer 475 may be held in on-chip embedded DRAM, SRAM, or a combination of both. In some embodiments, Post Buffer Combining Logic 480 may be implemented in software. In some embodiments, Disk Buckets 490 may be implemented with a RAID disk array, a single disk, and/or one or more Flash memory devices.
Each exemplary component depicted in
In some embodiments, buckets may be loaded in a deterministic order such that each Mapper selects the Next Bucket (507) to load in the same order. The inventors have appreciated that this may enable a Reducer to process all of the values assigned to the same key whenever the Reducer has loaded its portion of the X'th bucket from each of the Mappers. For example, a Reducer may perform Reduce operations on all of the [key,value] pairs sent to it after all of the Mappers have sent out their first Disk Bucket to the Reducers. Similarly, a Reducer may perform Reduce operations on all of the [key,value] pairs sent to it after all of the Mappers have sent out their second Disk Bucket to the Reducers, and so on. The inventors have appreciated that this technique may allow the Reducers in some embodiments to process data before all of the data has arrived. By processing data as it arrives, the Reducers may not require disk-based storage, thereby reducing the cost of the Reducers. The Reducers may furthermore be implementable at a finer granularity of computational elements (e.g., using 100-core processors that have no disk rather than 10-core processors that have 10 disks), which may have performance and cost advantages.
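For illustration, one possible piece of bookkeeping that exploits the deterministic bucket ordering is sketched below: a Reducer may run Reduce on everything received for round X once all M Mappers have reported sending their X'th bucket. The tracking scheme is an illustrative assumption, not a required protocol.

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of bookkeeping for the deterministic-bucket-ordering idea: because every
// Mapper empties its disk buckets in the same order, a Reducer can reduce everything
// received for round X as soon as all M Mappers have reported sending their X'th
// bucket. The tracking scheme below is an illustrative assumption.
public class BucketRoundTracker {

    final int numMappers;
    final Map<Integer, Integer> mappersDonePerRound = new HashMap<>();

    BucketRoundTracker(int numMappers) {
        this.numMappers = numMappers;
    }

    // Called when a Mapper signals that it has finished sending its bucket for a round.
    // Returns true when the round is complete and the Reducer may run Reduce on it.
    boolean mapperFinishedRound(int round) {
        int done = mappersDonePerRound.merge(round, 1, Integer::sum);
        return done == numMappers;
    }

    public static void main(String[] args) {
        BucketRoundTracker tracker = new BucketRoundTracker(3);
        System.out.println(tracker.mapperFinishedRound(0)); // false
        System.out.println(tracker.mapperFinishedRound(0)); // false
        System.out.println(tracker.mapperFinishedRound(0)); // true: reduce round 0 now
    }
}
```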
In the example illustrated in
In the example of
If a match has been found by unit 530 then the value held with that key in the Key Hash Table (525) is transmitted to the Combiner (540) (if the MapReduce program for that [key,value] has a Combiner associated with it) as the Value From Matching Key (533), in the example of
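For illustration, a sketch of hash-table-based combining of arriving mapped pairs is shown below: on a key match, the stored value and the incoming value are combined (here by summation, as in Word Count) and written back; otherwise the key is inserted. The table layout and the summing Combiner are illustrative assumptions.

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of hash-table-based combining as mapped pairs arrive: on a key match, the
// stored value and the incoming value are fed to the Combiner (here, a sum, as in
// Word Count) and the combined result is written back; otherwise the key is inserted.
// The table layout and the summing Combiner are illustrative assumptions.
public class CombiningKeyTable {

    final Map<String, Integer> keyTable = new HashMap<>();

    void accept(String key, int value) {
        Integer existing = keyTable.get(key);
        if (existing != null) {
            keyTable.put(key, combine(existing, value));   // match: combine and overwrite
        } else {
            keyTable.put(key, value);                      // no match: insert new entry
        }
    }

    int combine(int storedValue, int newValue) {
        return storedValue + newValue;                     // Word Count's Combiner is a sum
    }

    public static void main(String[] args) {
        CombiningKeyTable table = new CombiningKeyTable();
        table.accept("apple", 1);
        table.accept("apple", 1);
        table.accept("pear", 1);
        System.out.println(table.keyTable);                // {apple=2, pear=1}
    }
}
```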
Key matching in this example, as well as in the example of element 460 in
The Key Hash Table (525) is depicted in
In the example of
Conventionally, partitioning of keys has been performed such that each Reducer gets a partition. Consider lightweight Mappers with limited buffer resources in a system that includes very many Reducers, such as millions of servers. Conventionally, the Mapper must maintain a buffer for each Reducer, and these buffers must be able to hold at least as much data as the minimum data payload that can be sent efficiently over the network. For networks such as InfiniBand, this data payload may be around 4 kilobytes; however, there are examples in which larger transfer sizes of 8 kilobytes are required to achieve the highest network utilization. Thus, maintaining approximately 10 kB (10 kilobytes) of memory on the Mapper for each of millions of Reducers would require 10 GB of memory on the Mapper, which the inventors have recognized is a high overhead on existing systems and does not allow Mappers to be implemented at finer granularity (e.g., lower-frequency 100-core processors that are more power efficient than higher-frequency 10-core processors). By contrast, instead of maintaining a buffer for each Reducer on each Mapper, some embodiments as depicted in
In the example of
In some embodiments, the Mapper may transmit the mapped key-value pairs collected in a buffer for a Reducer Group (e.g., buffer 616) by packaging the mapped key-value pairs into a data packet for transmission on Network 625. In some embodiments, the data packet may be sized by selecting and including a number of the mapped key-value pairs from the buffer that makes the data packet a desired size for network transmission, e.g., depending on network factors such as protocol(s) used, available bandwidth, available connections, etc. In some embodiments, the size of an individual buffer in Buffer Memory 610 for sending data to a particular Reducer Group may be established to correspond to a desired data packet size, such that the buffer may be emptied into a data packet of desired size in response to the buffer becoming full.
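For illustration, a sketch of per-Reducer-Group buffering on the Mapper is shown below: mapped pairs are collected in one buffer per group of Reducers rather than one buffer per Reducer, and a buffer is packaged into a network packet once it reaches the desired packet size. The group assignment rule, packet size, and send stub are illustrative assumptions.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch of per-Reducer-Group buffering on the Mapper: instead of one buffer per
// Reducer, mapped pairs are collected in one buffer per group of Reducers, and a
// buffer is packaged into a network packet once it reaches the desired packet size.
// Group membership, the packet size, and the send stub are illustrative assumptions.
public class ReducerGroupBuffers {

    static final int PACKET_SIZE_BYTES = 8 * 1024;           // desired network payload size

    final int numGroups;
    final int reducersPerGroup;
    final Map<Integer, List<byte[]>> groupBuffers = new HashMap<>();
    final Map<Integer, Integer> groupBytes = new HashMap<>();

    ReducerGroupBuffers(int numGroups, int reducersPerGroup) {
        this.numGroups = numGroups;
        this.reducersPerGroup = reducersPerGroup;
    }

    void emit(String key, byte[] encodedPair) {
        int reducer = (key.hashCode() & Integer.MAX_VALUE) % (numGroups * reducersPerGroup);
        int group = reducer / reducersPerGroup;               // a group's Reducers share one buffer
        groupBuffers.computeIfAbsent(group, g -> new ArrayList<>()).add(encodedPair);
        int used = groupBytes.merge(group, encodedPair.length, Integer::sum);
        if (used >= PACKET_SIZE_BYTES) {
            sendPacketToGroup(group, groupBuffers.remove(group));
            groupBytes.put(group, 0);
        }
    }

    void sendPacketToGroup(int group, List<byte[]> packet) {
        // Placeholder for transmitting one packet to the group's local routing device.
        System.out.println("Sending " + packet.size() + " pairs to Reducer Group " + group);
    }

    public static void main(String[] args) {
        ReducerGroupBuffers buffers = new ReducerGroupBuffers(4, 250);  // 1000 Reducers, 4 groups
        buffers.emit("apple", "apple\t1".getBytes());
    }
}
```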
In some embodiments, once this communication has been sent, no additional communication to the larger network may be required to perform the final message routing. Thus the final message routing may not impose any additional burden on the Network (625) bisection bandwidth, which the inventors have appreciated may be advantageous since bisection bandwidth may cost a premium when very many Reducers are connected such as in the example depicted in
In some embodiments, the Reducers in a Reducer Group may be Reducers that are connected to the same Network Switch. In some other cases, a Reducer Group may be defined to include Reducers that are local to each other, e.g., in that they are able to communicate with each other with fewer intermediate network hops than is typically required for two arbitrary nodes on Network 625 to communicate. For example, in some embodiments, if half of the nodes on Network 625 require X hops to communicate with the other half of the nodes, then nodes that are able to communicate in fewer than X network hops may be considered to be local to each other. In some embodiments, nodes that are local to each other may be able to communicate with each other at higher data rate (higher bandwidth) than nodes that are not local to each other.
In the example illustrated in
In
In the example depicted in
Once the Receiving Buffer for Reducer #1 (648) of Reducer #C (665) receives the example data, it may be propagated to a Reduce Key Hash Table (652) for Reduce operations to commence. Buffer Memory (612) for other Map Processors such as Map Processor #M (623) may similarly perform operations to send mapped [key, value] pairs to intermediary devices (which may themselves be Reducers) for Reducer Groups, for subsequent Reducer-side routing of the data to the Reducers responsible for handling the associated keys. Similarly, many Reducer Groups may be supported, such as G Reducer Groups, which is depicted in
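For illustration, a sketch of Reducer-side routing by a device local to a Reducer Group is shown below: for each mapped pair in an incoming packet, the router identifies the key, determines which Reducer in the group is responsible for it, and appends the pair to that Reducer's receiving buffer. The pair type and routing rule are illustrative assumptions.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch of Reducer-side routing by a device local to a Reducer Group: for each
// mapped pair in an incoming packet, identify the key, determine which Reducer in
// the group is responsible for that key, and append the pair to that Reducer's
// receiving buffer. The pair type and routing rule are illustrative assumptions.
public class GroupLocalRouter {

    static class Pair {
        final String key;
        final String value;
        Pair(String key, String value) { this.key = key; this.value = value; }
    }

    final Map<Integer, List<Pair>> receivingBuffers;   // one buffer per Reducer in the group
    final int reducersInGroup;

    GroupLocalRouter(Map<Integer, List<Pair>> receivingBuffers) {
        this.receivingBuffers = receivingBuffers;
        this.reducersInGroup = receivingBuffers.size();
    }

    // Route every pair in a received packet to the Reducer responsible for its key.
    void routePacket(List<Pair> packet) {
        for (Pair pair : packet) {
            int reducer = (pair.key.hashCode() & Integer.MAX_VALUE) % reducersInGroup;
            receivingBuffers.get(reducer).add(pair);   // local hop only; no wide-area traffic
        }
    }

    public static void main(String[] args) {
        Map<Integer, List<Pair>> buffers = new HashMap<>();
        for (int r = 0; r < 16; r++) buffers.put(r, new ArrayList<>());   // 16 Reducers in this group
        GroupLocalRouter router = new GroupLocalRouter(buffers);
        router.routePacket(List.of(new Pair("apple", "1"), new Pair("pear", "1")));
    }
}
```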
In some embodiments, system components illustrated in
Step 700 begins the process depicted in
In Step 720 the Reducer function receives a previously unprocessed key from the Reduce Key Hash Table (650, 652) as input and processes the values associated with the input key. If this key is the last remaining unprocessed key assigned to the Reducer in the current bucket, then in Step 730 the Reducer requests that the Reduce Key Hash Table (650, 652) be filled with keys and values from the next bucket that has not yet been processed. Otherwise, Step 720 proceeds back to Step 710.
If Step 730 finds that all buckets have been processed then the process proceeds to Step 740 and ends, otherwise the process proceeds back to Step 720.
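For illustration, the Reducer control loop outlined in the steps above may be sketched as follows: fill the Reduce Key Hash Table from one bucket, run the Reduce function on every key it holds, and then move on to the next unprocessed bucket until none remain. The bucket representation and the summing Reduce function are illustrative assumptions.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.List;
import java.util.Map;

// Sketch of the Reducer control loop outlined in the steps above: fill the Reduce
// Key Hash Table from one bucket, run the Reduce function on every key it holds,
// then move to the next unprocessed bucket until none remain. The bucket
// representation and summing Reduce function are illustrative assumptions.
public class ReducerLoop {

    static void run(Deque<Map<String, List<Integer>>> unprocessedBuckets) {
        while (!unprocessedBuckets.isEmpty()) {
            Map<String, List<Integer>> reduceKeyHashTable = unprocessedBuckets.poll();
            for (Map.Entry<String, List<Integer>> entry : reduceKeyHashTable.entrySet()) {
                reduce(entry.getKey(), entry.getValue());       // process every key in this bucket
            }
        }
    }

    static void reduce(String key, List<Integer> values) {
        int sum = values.stream().mapToInt(Integer::intValue).sum();
        System.out.println(key + " -> " + sum);
    }

    public static void main(String[] args) {
        Deque<Map<String, List<Integer>>> buckets = new ArrayDeque<>();
        buckets.add(Map.of("apple", List.of(1, 2), "pear", List.of(3)));
        buckets.add(Map.of("plum", List.of(4)));
        run(buckets);
    }
}
```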
An illustrative implementation of a computer system 1000 that may be used in connection with some embodiments of the present invention is shown in
The above-described embodiments of the present invention can be implemented in any of numerous ways. For example, the embodiments may be implemented using hardware, software or a combination thereof. When implemented in software, the software code can be executed on any suitable processor or collection of processors, whether provided in a single computer or distributed among multiple computers. It should be appreciated that any component or collection of components that perform the functions described above can be generically considered as one or more controllers that control the above-discussed functions. The one or more controllers can be implemented in numerous ways, such as with dedicated hardware, or with general purpose hardware (e.g., one or more processors) that is programmed using microcode or software to perform the functions recited above.
In this respect, it should be appreciated that one implementation of embodiments of the present invention comprises at least one processor-readable storage medium (i.e., at least one tangible, non-transitory processor-readable medium, e.g., a computer memory (e.g., hard drive, flash memory, processor working memory, etc.), a floppy disk, an optical disc, a magnetic tape, or other tangible, non-transitory processor-readable medium) encoded with a computer program (i.e., a plurality of instructions), which, when executed on one or more processors, performs above-discussed functions of embodiments of the present invention. The processor-readable storage medium can be transportable such that the program stored thereon can be loaded onto any computer resource to implement aspects of the present invention discussed herein. In addition, it should be appreciated that the reference to a computer program which, when executed, performs above-discussed functions, is not limited to an application program running on a host computer. Rather, the term “computer program” is used herein in a generic sense to reference any type of computer code (e.g., software or microcode) that can be employed to program one or more processors to implement above-discussed aspects of the present invention.
The phraseology and terminology used herein is for the purpose of description and should not be regarded as limiting. The use of “including,” “comprising,” “having,” “containing,” “involving,” and variations thereof, is meant to encompass the items listed thereafter and additional items. Use of ordinal terms such as “first,” “second,” “third,” etc., in the claims to modify a claim element does not by itself connote any priority, precedence, or order of one claim element over another or the temporal order in which acts of a method are performed. Ordinal terms are used merely as labels to distinguish one claim element having a certain name from another element having a same name (but for use of the ordinal term), to distinguish the claim elements.
Having described several embodiments of the invention in detail, various modifications and improvements will readily occur to those skilled in the art. Such modifications and improvements are intended to be within the spirit and scope of the invention. Accordingly, the foregoing description is by way of example only, and is not intended as limiting. The invention is limited only as defined by the following claims and the equivalents thereto.
This application claims the benefit under 35 U.S.C. §119(e) of U.S. Provisional Application Ser. No. 61/898,942, filed on Nov. 1, 2013, and entitled “Efficient and Scalable MapReduce Precomputation System,” which is hereby incorporated by reference in its entirety.