This application claims the benefit of Indian Patent Application No. 5996/CHE/2013 filed Dec. 20, 2013, which is hereby incorporated by reference in its entirety.
The present invention relates generally to online analytical processing and, in particular, to a system and method for implementing a petabyte-scale online analytical processing solution using MapReduce.
Digitization of business functions and consumer adoption of digital channels have produced a deluge of information. Huge volumes of data are being generated at an increasing pace and in many forms and varieties. Data volumes, especially of unstructured data, are increasing exponentially. Organizations now deal with Big Data measured in petabytes or more, with periodic growth measured in terabytes. These large datasets are beyond the ability of traditional software tools to capture, store and process. A new set of Big Data technologies, such as MapReduce and NoSQL solutions, has therefore emerged that enables storage and processing of data at a higher order of magnitude and at much lower cost than was possible with traditional technologies.
Analytics and Reporting solutions help analyze data and generate reports to be consumed by business users. An OLAP solution is used to create cubes in a relational database, which are then used to generate reports. These solutions work well on structured data but are unable to handle unstructured or semi-structured data. Much of the data that carries information about customer behavior, such as customer click stream data in web logs, is not currently captured or analyzed for mining customer preferences. OLAP solutions also become expensive with large datasets. Big Data technologies help reduce costs while scaling to petabytes of data and handling unstructured data, resulting in significant innovations in Business Intelligence (BI) and Analytics. There is a requirement for an OLAP solution that stores and processes petabyte datasets and can help organizations gain more detailed insight into their problems. Big Data frameworks like Hadoop are being used to store and combine structured, semi-structured and unstructured data from multiple sources. The data is processed and analyzed using MapReduce programs to derive useful business insights.
Big Data analytics has been applied to real-world problems in organizations across verticals. One sample use case for a petabyte OLAP solution is Sentiment Analysis, in which unstructured social media content and social networking posts are used to determine user sentiment related to particular companies, brands or products. The analysis can range from macro-level sentiment down to individual user sentiment. Another use case is Fraud Detection, in which fraudulent activity is identified and flagged based on data from multiple sources, including customer behavior, historical and transactional data; this is a scenario that online payment companies are using. A third use case is Customer Churn Analysis, in which organizations use Big Data technologies to analyze customer behavior data and identify customer behavior patterns. Based on those patterns, the customers who are most likely to leave for a competing vendor or service can be identified.
The key challenge is storing and processing large volumes of data efficiently. Traditional enterprise data warehouse and analytics solutions use expensive hardware and cannot scale to petabytes of data. Another challenge is ease of use: providing an interface that allows business users to run OLAP queries on a large dataset stored over a distributed file system. Writing and executing MapReduce jobs requires the additional skill of knowing a programming or scripting language and so is difficult for business analysts. Traditional OLAP solutions support a number of OLAP query interfaces that provide a wide variety of aggregation and analytical functionalities. There is a need for such OLAP query interfaces to be developed over MapReduce solutions.
Traditional Online Analytical Processing (OLAP) solutions use relational databases to store and process data. A number of OLAP solutions are available in the market, such as Microsoft Analysis Services, Oracle Essbase, MicroStrategy, Mondrian, SAS, etc. These solutions support query languages such as MDX, XML for Analysis, OLE DB for OLAP or SQL to process data stored in relational databases.
These solutions are limited in that they cannot scale horizontally on commodity hardware to address the needs of next generation Big Data scenarios involving petabytes of data. Hadoop provides a way to leverage commodity hardware to scale horizontally, but it does not offer the abstractions that business analysts need. Hadoop requires developers to create MapReduce jobs and so is not easy to use for business analysts who do not have programming skills.
The above-stated methods describe different ways of parsing a SQL-like query string into MapReduce jobs. Some of them support specific aggregation functions on the data set. They do not, however, support the needs of Online Analytical Processing (OLAP), because the data models (cubes, dimensions, etc.) are different and the OLAP operations to be applied, such as aggregation and analytical functions, are different. Analytical processing can also involve applying complex machine learning algorithms, so SQL-based solutions are inadequate.
The present online analytical processing technique overcomes the above-mentioned limitations by implementing an OLAP solution that translates an OLAP QL query into one or more MapReduce jobs and executes them on a dataset stored in a distributed file system such as HDFS.
According to one embodiment of the present disclosure, a method for implementing an Online Analytical Processing (OLAP) solution using MapReduce is disclosed. The method involves receiving an OLAP query from a user through an OLAP-QL Driver. The received query is parsed by the compiler. The metadata information of the parsed query is then retrieved through the metadata manager. The parsed query is validated using a plan generator module, which generates a MapReduce job execution plan based on the retrieved metadata information. The next step is to identify the scope for optimization in the generated MapReduce job execution plan and to optimize the plan using the identified scope. The optimized MapReduce job execution plan is then executed using the execution engine, and finally the output data is stored in the cube specific distributed file system directory.
In an additional embodiment, a system for implementing an Online Analytical Processing (OLAP) solution using MapReduce is disclosed. The system includes a receiving module, a parsing module, a retrieving module, a validation module, an identification module, an execution module and a storage module. The receiving module is configured to receive an input OLAP query from a user. The parsing module is configured to parse the received input query. The retrieving module is configured to retrieve the metadata information of the parsed OLAP query through the metadata manager. The validation module is configured to validate the OLAP query and generate a MapReduce job execution plan based on the retrieved metadata information. The identification module is configured to identify the scope for optimization in the generated MapReduce job execution plan and to optimize the plan using the identified scope. The execution module is configured to execute the optimized MapReduce job execution plan using an execution engine, and the storage module is configured to store the output data in a cube specific distributed file system (DFS) directory.
In another embodiment, a computer readable storage medium for implementing an Online Analytical Processing (OLAP) solution using MapReduce is disclosed. The computer readable storage medium, which is not a signal, stores computer executable instructions for capturing an OLAP query from the user through an OLAP-QL driver, parsing the OLAP query, retrieving metadata information of the OLAP query through a metadata manager, validating the OLAP query and generating a MapReduce job execution plan based on the retrieved metadata information of the OLAP query, identifying the scope for optimization in the generated MapReduce job execution plan and optimizing the MapReduce job execution plan using the identified scope, executing the optimized MapReduce job execution plan using an execution engine, and storing the output data in a cube specific distributed file system (DFS) directory.
Various embodiments of the technology will, hereinafter, be described in conjunction with the appended drawings provided to illustrate, and not to limit the invention, wherein like designations denote like elements, and in which:
The foregoing has broadly outlined the features and technical advantages of the present disclosure in order that the detailed description of the disclosure that follows may be better understood. Additional features and advantages of the disclosure will be described hereinafter which form the subject of the claims of the disclosure. It should be appreciated by those skilled in the art that the conception and specific embodiment disclosed may be readily utilized as a basis for modifying or designing other structures for carrying out the same purposes of the present disclosure. It should also be realized by those skilled in the art that such equivalent constructions do not depart from the spirit and scope of the disclosure as set forth in the appended claims. The novel features which are believed to be characteristic of the disclosure, both as to its organization and method of operation, together with further objects and advantages will be better understood from the following description when considered in connection with the accompanying figures. It is to be expressly understood, however, that each of the figures is provided for the purpose of illustration and description only and is not intended as a definition of the limits of the present disclosure.
Exemplary embodiments of the present technique provide a system and method for implementing an Online Analytical Processing (OLAP) solution using MapReduce. This involves receiving the OLAP query from a user through an OLAP-QL Driver. The received query is then parsed. Next, the metadata information of the parsed OLAP query is retrieved using the metadata manager. The parsed OLAP query is then validated using a plan generator module, which generates a MapReduce job execution plan based on the retrieved metadata information. As the next step, an optimizer identifies a scope for optimization in the generated MapReduce job execution plan, and the plan is optimized using the identified scope. Thereafter, the optimized MapReduce job execution plan is executed using an execution engine. Finally, the output data is stored in a cube specific distributed file system (DFS) directory.
With reference to
The compiler 304 can be plugged in for parsing each of the OLAP query languages. It first validates the query for correct syntax. As the next step, it retrieves the required cube schema information from the query. The information retrieved includes the Fact name, Dimension names, Measures, Aggregation functions, Analytical functions and other axis details.
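By way of illustration only, the compiler's two steps (syntax validation, then extraction of cube schema details) might be sketched as follows. The query syntax shown is a hypothetical toy grammar invented for this sketch and is not the OLAP-QL of the disclosure; the names `ParsedQuery` and `parse_query` are likewise assumptions.

```python
import re
from dataclasses import dataclass, field

# Hypothetical toy syntax, NOT the OLAP-QL of the disclosure:
#   AGG(measure) [, AGG(measure) ...] FROM fact BY dim [, dim ...]
QUERY_RE = re.compile(
    r"^\s*(?P<measures>.+?)\s+FROM\s+(?P<fact>\w+)\s+BY\s+(?P<dims>.+?)\s*$",
    re.IGNORECASE,
)
MEASURE_RE = re.compile(r"^(?P<func>\w+)\((?P<measure>\w+)\)$")
SUPPORTED_FUNCS = {"sum", "count", "average", "min", "max"}


@dataclass
class ParsedQuery:
    fact: str
    dimensions: list
    aggregations: list = field(default_factory=list)  # (function, measure) pairs


def parse_query(text):
    """Validate the toy syntax and extract the cube schema details."""
    match = QUERY_RE.match(text)
    if match is None:
        raise ValueError("syntax error: expected 'AGG(m) FROM fact BY dim, ...'")
    aggregations = []
    for item in match.group("measures").split(","):
        m = MEASURE_RE.match(item.strip())
        if m is None or m.group("func").lower() not in SUPPORTED_FUNCS:
            raise ValueError(f"unsupported measure expression: {item.strip()!r}")
        aggregations.append((m.group("func").lower(), m.group("measure")))
    dimensions = [d.strip() for d in match.group("dims").split(",")]
    if not all(re.fullmatch(r"\w+", d) for d in dimensions):
        raise ValueError("syntax error: malformed dimension list")
    return ParsedQuery(match.group("fact"), dimensions, aggregations)
```

For example, `parse_query("sum(amount) FROM sales BY region, year")` would yield the fact name `sales`, the dimensions `region` and `year`, and one aggregation of the measure `amount`. A pluggable compiler could expose one such parser per supported query language.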
Plan generator 306 is used for receiving the parsed query with the retrieved cube schema details for generating an execution plan with one or more MapReduce jobs. The metadata store 308 is used to store the metadata schema information. The plan generator 306 retrieves the metadata schema information of the cube from the metadata store 308. The metadata store 308 could be a file system or a database. The following table illustrates the representative cube metadata with information related to the fact, dimensions, measures and functions.
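As a minimal sketch of how a plan generator could validate a parsed query against cube metadata and emit an ordered list of job steps, consider the following. The dictionary layout, the cube and entity names, the paths under `/olap` and the step names are all hypothetical and stand in for whatever schema the metadata store actually holds.

```python
# Hypothetical metadata store contents; the real schema layout is not
# specified here, so every key, name and path below is an assumption.
METADATA_STORE = {
    "sales_cube": {
        "fact": {"name": "sales", "location": "/olap/fact/sales"},
        "dimensions": {
            "region": {"location": "/olap/dimension/region"},
            "year": {"location": "/olap/dimension/year"},
        },
        "measures": ["amount"],
        "location": "/olap/cube/sales_cube",
    }
}


def generate_plan(cube_name, dimensions, aggregations, store=METADATA_STORE):
    """Validate a parsed query against cube metadata and list the plan steps."""
    if cube_name not in store:
        raise KeyError(f"unknown cube: {cube_name}")
    cube = store[cube_name]
    # Validation: every dimension and measure must exist in the cube schema.
    unknown_dims = [d for d in dimensions if d not in cube["dimensions"]]
    unknown_measures = [m for _, m in aggregations if m not in cube["measures"]]
    if unknown_dims or unknown_measures:
        raise ValueError(f"not in cube schema: {unknown_dims + unknown_measures}")

    # Plan: load the fact and each dimension, group, aggregate, then store.
    plan = [("load", cube["fact"]["location"])]
    plan += [("load", cube["dimensions"][d]["location"]) for d in dimensions]
    plan.append(("group", tuple(dimensions)))
    plan += [("aggregate", func, measure) for func, measure in aggregations]
    plan.append(("store", cube["location"]))
    return plan
```

In this sketch a plan is a flat list of steps; a real plan generator would presumably group such steps into one or more MapReduce jobs before handing the plan on.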
Consider that the Home directory of the OLAP is /olap. The metadata store 308 will contain the above mentioned metadata details for accessing the entities and cubes. Each Fact entity, dimension entity and cube is represented by a directory location in the datastore. The datastore will contain the content of the entities and cubes in the form of uncompressed or compressed text files. The optimizer 310 is used to identify the optimization options in the MapReduce jobs. The plan generated by the plan generator 306 is run through the optimizer 310 to check for opportunities to tweak the jobs for better performance and faster results. Optimization options could include choosing only the relevant attributes while fetching data, re-ordering the entities while fetching, optimizing joins, adding or removing jobs for performance enhancement, etc. One technique for optimizing joins is to generate a hash structure, such as a Bloom filter, on the map side and to use it to pass through only the data that is relevant for further processing in the join. Based on the optimizations identified in the earlier steps, an updated job execution plan is generated. The optimized job plan is sent to the Execution Engine 312, which uses the MapReduce framework for executing the jobs.
The Execution Engine 312 sits on top of the MapReduce Framework 314, which receives the updated job execution plan. Based on the plan, the framework spawns the mappers and reducers on the dataset. The Distributed File System (DFS) 316 is used to store the output of the MapReduce jobs and provide the results to the user.
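The Bloom filter join optimization mentioned above can be sketched as follows, assuming the join keys of the smaller (dimension) entity fit in a compact bit array that is made available to every mapper. The class, the function names and the sizing parameters are illustrative assumptions. A Bloom filter never yields false negatives, so no matching fact row is lost; it can yield false positives, so the join itself must still discard any non-matching rows that slip through.

```python
import hashlib


class BloomFilter:
    """Fixed-size Bloom filter: no false negatives, occasional false positives."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8 + 1)

    def _positions(self, key):
        # Derive several bit positions by salting a single hash function.
        for salt in range(self.num_hashes):
            digest = hashlib.sha256(f"{salt}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, key):
        return all(
            self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(key)
        )


def map_side_filter(fact_rows, dimension_keys, key_field):
    """Drop fact rows whose join key cannot match any dimension row."""
    bloom = BloomFilter()
    for key in dimension_keys:
        bloom.add(key)
    return [row for row in fact_rows if bloom.might_contain(row[key_field])]
```

Filtering on the map side in this way reduces the volume of data shuffled to the reducers, which is typically where a reduce-side join spends most of its time.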
The StoreAction 408 is used to store data into a particular directory location. For example, the final Cube output is stored in the cube directory location using this command. The CompressAction 410 is used to compress the data into a user-specified format based on the compression algorithm defined in the DFS. The AggregateAction 412 is used to perform aggregations on a measure in the given dataset. The functions supported are sum, count, average, min, max, etc. The SortAction 414 is used to provide a sorted dataset ordered by one or more attributes in ascending or descending order. The GroupAction 416 is used to group the output dataset based on the attributes specified. The SelectAction 418 is used to select specific attributes of a fact or dimension of a cube for further processing. The PredictAction is used for applying predictive analysis to predict the data. It is split into a PredictMapAction 424 and PredictReduceAction 420.
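The grouping, aggregation and sorting actions can be illustrated with a small in-memory simulation of the map, shuffle and reduce phases. This is a sketch under stated assumptions rather than the disclosed implementation: rows are modeled as Python dictionaries, and the function names `group_and_aggregate` and `sort_rows` are hypothetical.

```python
from collections import defaultdict

# Aggregation functions named in the description, keyed by name.
AGGREGATES = {
    "sum": sum,
    "count": len,
    "average": lambda values: sum(values) / len(values),
    "min": min,
    "max": max,
}


def group_and_aggregate(rows, group_by, measure, func):
    """Group rows by attributes, then reduce one measure per group."""
    # Map phase: emit (group key, measure value) pairs.
    pairs = [(tuple(row[a] for a in group_by), row[measure]) for row in rows]
    # Shuffle phase: collect the values that share a key.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    # Reduce phase: apply the aggregation function to each group.
    return {key: AGGREGATES[func](values) for key, values in groups.items()}


def sort_rows(rows, attributes, descending=False):
    """Order rows by one or more attributes, ascending by default."""
    return sorted(
        rows, key=lambda row: tuple(row[a] for a in attributes), reverse=descending
    )
```

In an actual MapReduce job the shuffle phase is performed by the framework, so only the map and reduce logic would need to be supplied by the corresponding actions.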
The FilterAction 426 is used to perform a filtering action based on one or more attributes in the given entity dataset. The LoadAction 428 is used to read data from a particular directory location. It is used to scan the fact entity and each of the specified dimension entities from the DFS. For example, the fact entity and dimension entity data are loaded from the text files in the respective directory locations. The FetchMetaDataAction 430 is used to retrieve the metadata information for a given entity, such as a cube, fact or dimension, from the metadata store. The metadata information would be located in a file system or a database.
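A load followed by a filter can be sketched as below. The sketch makes several assumptions that the description does not fix: a local directory stands in for the DFS directory of an entity, records are comma-delimited, compressed files are gzip files with a `.gz` suffix, and filtering is by attribute equality. The function names are hypothetical.

```python
import csv
import gzip
from pathlib import Path


def load_entity(directory, columns, delimiter=","):
    """Read every text file in an entity directory into a list of dict rows."""
    rows = []
    for path in sorted(Path(directory).iterdir()):
        if not path.is_file():
            continue
        # Assumption: compressed parts are gzip files named *.gz.
        opener = gzip.open if path.suffix == ".gz" else open
        with opener(path, "rt", newline="") as handle:
            for record in csv.reader(handle, delimiter=delimiter):
                rows.append(dict(zip(columns, record)))
    return rows


def filter_rows(rows, **conditions):
    """Keep rows whose attributes equal every given condition."""
    return [
        row for row in rows if all(row.get(k) == v for k, v in conditions.items())
    ]
```

Because the entity data is stored as text, every loaded value is a string; a fuller sketch would convert measures to numeric types using the column types held in the metadata.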
The above mentioned description is presented to enable a person of ordinary skill in the art to make and use the technology and is provided in the context of the requirement for obtaining a patent. Various modifications to the preferred embodiment will be readily apparent to those skilled in the art and the generic principles of the present technology may be applied to other embodiments, and some features of the present technology may be used without the corresponding use of other features. Accordingly, the present invention is not intended to be limited to the embodiments shown but is to be accorded the widest scope consistent with the principles and features described herein.
Number | Date | Country | Kind |
---|---|---|---|
5996/CHE/2013 | Dec 2013 | IN | national |