In many instances, scripts can be run on distributed computing systems to process large volumes of data. One such distributed computing system is Hadoop. Programs for Hadoop are written in Java with a map/reduce architecture. The programs are prepared on local machines but must be generated specifically as Hadoop commands. The programs are then transferred (pushed) to a grid gateway computer, where they are stored temporarily, and are then executed on the grid of computers. While map/reduce programming provides a tool for large-scale computing, in many applications the map/reduce architecture cannot be utilized directly due to the complex processing required. Also, many developers prefer to use other programming languages, such as Perl or C++, for heavy-processing jobs on their local machines. Accordingly, many developers are looking for a way to utilize distributed computing systems as a resource for their familiar languages and tools.
To address the drawbacks and other limitations of the related art, the present application provides an improved method and system for distributed computing.
According to the method, input data may be stored on an input storage module. Mapper code can be loaded onto a map module and executed. The mapper code can load a mapper executable file onto the map module from a central storage unit and instantiate the mapper executable file. The mapper code can then pass the input data to the mapper executable file. The mapper executable file can generate mapped data based on the input data and pass the mapped data back to the mapper code.
In another aspect of the system, a reducer module can also be configured in a similar manner. In such a system, reducer code can be loaded onto a reducer module and executed. The reducer code can load a reducer executable file onto the reducer module and instantiate the reducer executable file. The reducer module can then pass the mapped data from the map module to the reducer executable file to generate result data. The result data may be passed back to the reducer code and stored in a result storage module.
Further objects, features and advantages of this application will become readily apparent to persons skilled in the art after a review of the following description, with reference to the drawings and claims that are appended to and form a part of this specification.
To address the issues noted above, a pushing mechanism and Hadoop streaming can be used to wrap all heavy-processing executable components, together with their local dependent data and libraries, that are developed off the grid using, for example, Perl or C++. A push script can set up an environment and run executable files through Hadoop streaming to leverage computing clusters for both map/reduce and non-map/reduce computing. As such, conventional code can also be used for large-scale computing on the grid. With good planning, many heavy-duty computing components developed off the grid may be reused through the pushing scripts.
In this information age, data is essential for understanding customer behavior and for making business decisions. Most large web-related companies, such as YAHOO!, Inc., AMAZON.COM, Inc., and EBAY Inc., spend an enormous amount of resources to build their own data warehouses for user tracking and decision-making purposes. Usually the amount of data collected from weblogs is on the scale of terabytes or petabytes. Processing such a large amount of data on a daily basis presents a significant challenge.
Since the debut of the Hadoop system, developers have leveraged this parallel computing system for processing large data applications. Hadoop is an open-source, Java-based, high-performance parallel computing infrastructure that utilizes thousands of commodity PCs to produce a significant amount of computing power. Map/reduce is the common programming style used for code development in the Hadoop system. It is also a software framework introduced by GOOGLE, Inc. to support parallel computing over large data sets on clusters of computers. Equipped with the Hadoop system and the map/reduce framework, many engineers, researchers, and scientists who need to process large data sets are migrating from proprietary clusters to standard distributed architectures, such as the Hadoop system. There is a need to let developers work in a preferred environment, while providing a way to push an application to a distributed computing environment when large data sets are processed.
Now referring to
The master module 20 coordinates which computer system is used for a particular mapping or reducing algorithm. The master module 20 also coordinates the setup of each computer system and the routing of data from one computer system to another. The master module 20 is able to coordinate the modules based on some basic structural rules without knowing how each map module 14 or reduce module 16 manipulates the data. The data is packaged and transferred between modules in key/value pairs. In addition, the flow is generally expected to model a map/reduce flow, with splits of the input data being provided to each map module 14 and result data being provided from the reduce modules 16. However, each map and reduce module 14, 16 acts as a black box to the master module 20; as such, the master module 20 does not need to know what type of processing occurs within each map and reduce module 14, 16. The structure provided in
Referring again to
Similar to computer 22, the computer 24 is in communication with the master module 20, as denoted by line 54. The master module 20 may download the mapper code 32 to the computer 24 for execution. The mapper code 32 downloads ancillary files including executable files 34, library files 36, and data files 38 from the central storage module 21. In addition, the computer 26 is in communication with the master module 20, as denoted by line 52. The master module 20 may download the mapper code 32 to the computer 26 for execution. The mapper code 32 downloads ancillary files including executable files 34, library files 36, and data files 38 from the central storage module 21 to computer 26.
The communication between each computer, including master module 20, as well as the input data storage module 12 and the result data storage module 18 may be implemented via a wired or wireless network, including but not limited to Ethernet and similar protocols and, for example, may be over the internet, local area networks, or other wide area networks. Other communication paths or channels may be included as well, but are not shown so as to not unduly complicate the drawing.
Within the standard framework of the distributed computing system 10, the mapping modules are also provided with the input data from the input data storage module 12. Accordingly, the computer 22 receives the data split 42, the computer 24 receives the data split 44, and the computer 26 receives the data split 46. The data splits 42, 44, and 46 are transferred to the computers 22, 24, and 26, respectively, in key/value format. Each computer 22, 24, and 26 runs the mapper code 32 to manipulate the input data. As discussed above, the mapper code 32 may download and run an executable file 34. The executable file 34, when instantiated, may create a buffer or data stream and pass a pointer to the stream back to the mapper code 32. As such, the input data from the mapper code 32 is passed through the stream to the executable file 34, where it may be manipulated by the executable file 34 and/or library files 36 and retrieved by the mapper code 32 through the stream. In addition, the executable file 34 and/or library files 36 may manipulate the input data based on data files 38, such as lookup tables, algorithm parameters, or other such data entities. The manipulated data, or mapped data, may be passed by the mapper code 32 to one or more of the reduce modules 16. The manipulated data may be transmitted directly to the reduce modules 16 in key/value format, or alternatively may be stored on the network in an intermediate data storage (not shown) where it can be retrieved by the reduce modules 16.
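By way of a non-limiting illustration, the following sketch shows the key/value stream contract assumed between the mapper code 32 and the executable file 34. The script below is a hypothetical placeholder written as a unix shell script; a real executable file developed off the grid in, for example, Perl or C++ would apply its heavy-processing algorithm in the same stream-in/stream-out manner.

    #!/bin/bash
    # Hypothetical placeholder for an executable file in the stream:
    # each input record arrives on stdin as a tab-separated key/value line,
    # and each mapped record is written back to stdout in the same format.
    while IFS=$'\t' read -r key value; do
      # A real off-the-grid executable would manipulate the pair here;
      # this placeholder simply passes the key/value pair through.
      printf '%s\t%s\n' "$key" "$value"
    done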
Similarly, the master module 20 can assign computer 62 and computer 64 as reducer modules 16. Computer 62 is in communication with the master module 20, as denoted by line 66. The master module 20 may download the reducer code 72 to the computer 62 for execution. Typically, the reducer code is self-contained and written in the Java programming language. In one embodiment of the present application, the reducer code 72 may be a unix script or similar macro that downloads ancillary files, including executable files 74, library files 76, and data files 78, from the central storage module 21. Using the unix script in the reducer code 72 to download the ancillary files 74, 76, 78 and instantiate the executable files 74 significantly reduces the time requirements on the master module 20 and allows the developer to utilize executable files 74 and library files 76 that would otherwise need to be recoded into a language supported by the distributed computing system 10.
Similar to computer 62, the computer 64 is in communication with the master module 20, as denoted by line 68. The master module 20 may download the reducer code 72 to the computer 64 for execution. The reducer code 72 downloads ancillary files, including executable files 74, library files 76, and data files 78, from the central storage module 21. Within the standard framework of the distributed computing system 10, the reducer modules 16 are also provided with the data from the mapper modules 14, as denoted by line 58. The data is transferred from the computers 22, 24, and 26 to computers 62 and 64 in key/value format. Each computer 62, 64 runs the reducer code 72 to manipulate the data from the mapper modules 14. As discussed above, the reducer code 72 may download and run an executable file 74. The executable file 74, when instantiated, may create a buffer or data stream and pass a pointer to the stream back to the reducer code 72. As such, the input data from the reducer code 72 is simply passed to the executable file 74, where it may be manipulated by the executable file 74 and/or library files 76 and retrieved by the reducer code 72 through the stream. In addition, the executable file 74 and/or library files 76 may manipulate the data from the mapper modules 14 based on data files 78, such as lookup tables, algorithm parameters, or other such data entities. The reduced data may be stored in the result data store 18 by the reducer code 72.
One method for implementing the distributed computing system is provided in
The following paragraphs describe the steps used to wrap the executable files and library files and push them into a Hadoop system. A push script may be written for a local development computer to control the Hadoop scripts described below. The push script may use Hadoop streaming commands to pass input data to the mapper code defined below in block 140, where non-map/reduce code is wrapped with unix shell commands. The push script may be run from the local development computer by issuing a remote call to the Hadoop system. Alternatively, the steps may be performed manually, in a less efficient manner.
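By way of a non-limiting example, a minimal sketch of such a push script is shown below. It assumes a typical Hadoop streaming installation; the streaming jar location, HDFS paths, and file names are hypothetical and would be adapted to the particular grid.

    #!/bin/bash
    # push.sh -- illustrative push script run from the local development computer.
    # The jar path, HDFS paths, and file names below are hypothetical examples.

    STREAMING_JAR=$HADOOP_HOME/contrib/streaming/hadoop-streaming.jar

    hadoop jar "$STREAMING_JAR" \
        -input   /user/dev/input \
        -output  /user/dev/output \
        -mapper  mapper.sh \
        -reducer reducer.sh \
        -file    mapper.sh \
        -file    reducer.sh \
        -cacheArchive 'hdfs:///user/dev/libs.tar#libs' \
        -cmdenv  BIG_PACKAGE=/user/dev/big_package.tar

In this sketch, the small library archive from block 110 is deployed to each node through the streaming cache/archive option, which is one way to obtain the automatic copying and unpacking described below, while the location of the large package from block 130 is passed to the wrapper scripts so that they can fetch it themselves, as described in blocks 146 and 156.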
In block 110, dependent libraries are packaged into a tar file for deployment. Typically, library files are relatively small and can easily be tarred into an archived file. When deployed into the Hadoop system, the library files will be copied and unpackaged onto each computing node by the Hadoop system automatically. As such, it is suggested that only small files be deployed within the Hadoop system in this manner.
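A minimal, hypothetical example of this packaging step (the directory names are illustrative only):

    # Block 110: archive the dependent library and tool files so that the Hadoop
    # system can copy and unpack them automatically on each computing node.
    tar -cf libs.tar lib/ tools/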
In block 120, large data sets, large tool files, executable files, large library files, and the like are packaged into a big package file (usually in tar format). Typically, the large files are required resources for running the needed algorithm. Sometimes the packaged resource file can be many gigabytes or larger. It would not be feasible to copy and deploy such a large file onto each Hadoop computing node automatically, as this would take up precious network bandwidth from the Hadoop system's master module. In block 140 and block 150, a way to deploy large required packages onto each computing node is provided without taking up much of the network bandwidth of the Hadoop system's master module.
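A corresponding hypothetical example of this packaging step:

    # Block 120: package the large resources (data sets, executables, tools, and
    # large libraries) into a single big package file.
    tar -cf big_package.tar data/ bin/ large_libs/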
In block 130, the standard Hadoop load command can be used to load the large package, generated in block 120, into a central Hadoop storage place so that each computing node can access this package file at run time.
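For example, using the standard Hadoop file system command (the destination path is hypothetical):

    # Block 130: load the big package into the central Hadoop storage so that each
    # computing node can fetch it at run time.
    hadoop fs -put big_package.tar /user/dev/big_package.tar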
In block 140, a simple unix shell script is provided for the mapper module that executes blocks 142 to 148. It should be noted that the mapper can run on any Hadoop machine, as each machine supports running a unix shell script by default.
Inside the mapper module, the library package from block 110 will be copied/deployed to the mapper computers; the library package is then unpackaged on each computing module so that the code can run with the corresponding dependent libraries and tools, as denoted by block 142.
In block 144, all environment variables required by the code are set by the mapper code.
Inside the mapper code, the standard Hadoop fetching command may be used to get the large package from block 120 and copy it onto each computing module, as denoted by block 146. Fetching the large package by each mapper module happens in parallel and utilizes the Hadoop infrastructure well, without putting a significant burden on the Hadoop system's master module, which would otherwise become a processing bottleneck.
In block 148, the code runs as if it were executed on a standalone development computer. The mapper code is able to run independently because all dependent data, executable files, and libraries were downloaded and deployed in the above steps.
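By way of a non-limiting illustration, a minimal sketch of such a mapper wrapper script, covering blocks 142 to 148, is shown below. It assumes the library archive from block 110 was deployed through the streaming cache mechanism; the BIG_PACKAGE path, archive names, directory layout, and executable name are hypothetical examples.

    #!/bin/bash
    # mapper.sh -- illustrative mapper wrapper (blocks 142 to 148).
    # Paths, archive names, and the executable name are hypothetical.

    # Block 142: the library package from block 110 has already been copied and
    # unpacked by the Hadoop system into the working directory (here as "libs").

    # Block 144: set the environment variables required by the wrapped code.
    export LD_LIBRARY_PATH="$PWD/libs/lib:$LD_LIBRARY_PATH"
    export PATH="$PWD/libs/tools:$PATH"

    # Block 146: fetch the large package from the central Hadoop storage in
    # parallel on this computing node, then unpack it locally.
    hadoop fs -get "${BIG_PACKAGE:-/user/dev/big_package.tar}" .
    tar -xf big_package.tar

    # Block 148: run the off-the-grid executable as if on a standalone computer;
    # Hadoop streaming supplies key/value input on stdin and collects stdout.
    ./bin/heavy_processing_exe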
In block 150, a simple unix shell script is provided for the reducer module that executes blocks 152 to 158. It should be noted that the reducer code can run on any Hadoop machine, as each machine supports running a unix shell script by default.
Inside the reducer module, the library package from block 110 will be copied/deployed to the reducer computers; the library package is then unpackaged on each computing module so that the code can run with the corresponding dependent libraries and tools, as denoted by block 152.
In block 154, all environment variables required by the code are set by the reducer code.
Inside the reducer code, the standard Hadoop fetching command may be used to get the large package from block 120 and copy it onto each computing module, as denoted by block 156. Fetching the large package by each reducer module happens in parallel and utilizes the Hadoop infrastructure well, without putting a significant burden on the Hadoop system's master module, which would otherwise become a processing bottleneck.
In block 158, the code runs as if it were executed on a standalone development computer. The reducer code is able to run independently because all dependent data, executable files, and libraries were downloaded and configured in the above steps.
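A corresponding reducer wrapper, sketched under the same hypothetical assumptions, mirrors the mapper wrapper above:

    #!/bin/bash
    # reducer.sh -- illustrative reducer wrapper (blocks 152 to 158); it mirrors
    # mapper.sh, so only the essential steps are repeated here.

    export LD_LIBRARY_PATH="$PWD/libs/lib:$LD_LIBRARY_PATH"          # block 154
    hadoop fs -get "${BIG_PACKAGE:-/user/dev/big_package.tar}" .     # block 156
    tar -xf big_package.tar

    # Block 158: the reducer executable consumes the mapped key/value pairs on
    # stdin and writes result data to stdout for the result data storage module.
    ./bin/heavy_reducer_exe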
After the mapper code and reducer code have successfully executed, the mapper code and reducer code remove the library files and the other files from the large package, as denoted in block 160. The master module is then able to reassign the computer to another task. The method ends in block 162.
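For example, the cleanup step may be as simple as the following (the file and directory names are hypothetical):

    # Block 160: remove the deployed library files and large package contents so
    # that the computer can be reassigned to another task.
    rm -rf big_package.tar data/ bin/ large_libs/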
To illustrate one implementation of the push mechanism, the following example is given with regard to
By applying this mechanism for high-performance computing in a distributed computing environment, such as Hadoop, it is possible to reuse previous work while leveraging the vast computing and storage power of the distributed computing system. Further, the wrapping/pushing mechanism can work for nearly any type of code developed under a Linux system. In addition, it provides an opportunity for developers to use a preferred language or architecture to develop modules for use in a distributed computing environment, even modules designed for complicated non-map/reduce problems.
The reducer module 318 runs the reducer code, for example a unix shell script, to download, unpack, and instantiate the executable files 320, as discussed above. The executable files 320 may return a pointer to a data stream initialized by the executable files 320. The reducer code in the reducer module 318 may pass impression/zip code data to the executable files 320 over the stream. The executable files 320 may manipulate the impression/zip code data, for example to determine the percentage of impressions in each state or other statistical information, for example related to the geographic region or other demographics. The executable files 320 may make calls to library files 322 or data tables 324 to aid in the transformation from the zip code data to the statistical data. As discussed above, the library files 322 and data tables 324 may be downloaded, unpacked, and instantiated together with the executable files 320. After the executable file 320 has obtained the statistical data, the statistical data may be passed back to the reducer module 318. The reducer module 318 can then pass the statistical data to the result data storage module 326.
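As a non-limiting sketch of what such a wrapped executable might do in this example, the following assumes impression counts keyed by zip code arriving on standard input as tab-separated "zip, count" lines, and a hypothetical tab-separated lookup table, zip_to_state.tsv, mapping zip codes to states; the file name and formats are illustrative only.

    #!/bin/bash
    # Hypothetical reducer executable for the impression/zip code example:
    # aggregate impression counts by state using the lookup table, then emit the
    # percentage of impressions observed in each state.
    awk -F'\t' '
      NR == FNR { state[$1] = $2; next }        # first file: load zip -> state
      { n[state[$1]] += $2; total += $2 }       # stdin: sum counts per state
      END { for (s in n) printf "%s\t%.2f\n", s, 100 * n[s] / total }
    ' zip_to_state.tsv -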
As such, the pushing mechanism and streaming described in this application can be utilized to wrap all heavy-duty components, with their local dependent data and libraries, that are developed off the grid using, for example, Perl or C++. A push script can submit the complicated commands through streaming into grid clusters to leverage each grid cluster for both map/reduce and non-map/reduce computing.
Any of the modules, servers, or engines described may be implemented in one or more general computer systems. One exemplary system is provided in
In an alternative embodiment, dedicated hardware implementations, such as application specific integrated circuits, programmable logic arrays and other hardware devices, can be constructed to implement one or more of the methods described herein. Applications that may include the apparatus and systems of various embodiments can broadly include a variety of electronic and computer systems. One or more embodiments described herein may implement functions using two or more specific interconnected hardware modules or devices with related control and data signals that can be communicated between and through the modules, or as portions of an application-specific integrated circuit. Accordingly, the present system encompasses software, firmware, and hardware implementations.
In accordance with various embodiments of the present disclosure, the methods described herein may be implemented by software programs executable by a computer system. Further, in an exemplary, non-limiting embodiment, implementations can include distributed processing, component/object distributed processing, and parallel processing. Alternatively, virtual computer system processing can be constructed to implement one or more of the methods or functionality as described herein.
Further, the methods described herein may be embodied in a computer-readable medium. The term “computer-readable medium” includes a single medium or multiple media, such as a centralized or distributed database, and/or associated caches and servers that store one or more sets of instructions. The term “computer-readable medium” shall also include any medium that is capable of storing, encoding or carrying a set of instructions for execution by a processor or that cause a computer system to perform any one or more of the methods or operations disclosed herein.
As a person skilled in the art will readily appreciate, the above description is meant as an illustration of the principles of this invention. This description is not intended to limit the scope or application of this invention in that the invention is susceptible to modification, variation and change, without departing from the spirit of this invention, as defined in the following claims.