This invention relates generally to massively parallel processing (MPP) data storage systems and methods for big data applications, and more particularly to new and improved MPP system architectures comprising large clusters of commodity servers, and associated query execution models for accessing data in such systems.
Most successful companies use data to their advantage. The data are no longer limited to easily quantifiable facts, such as point-of-sale transaction data. Rather, companies retain, explore, analyze, and manipulate all the available information in their purview. Ultimately, they may analyze the data in search of evidence and insights that lead to new business opportunities or that leverage their existing strengths. This is the business value behind what is often referred to as “Big Data”.
Big data is “big” because it comprises massive quantities, frequently hundreds of terabytes or more, of both structured and unstructured data. Among the problems associated with such big data is the difficulty of quickly and efficiently analyzing the data to obtain relevant information. Conventional relational databases store structured data and have the advantage of being compatible with the structured query language (SQL), a widely used powerful and expressive data analysis language. Increasingly, however, much of big data is unstructured or multi-structured data for which conventional relational database architectures are unsuited, and for which SQL is unavailable. This has prompted interest in other types of data processing platforms.
The Apache Software Foundation's open source Hadoop distributed file system (HDFS) has rapidly emerged as one of the preferred solutions for big data analytics applications that grapple with vast repositories of unstructured or multi-structured data. It is flexible, scalable, inexpensive, and fault-tolerant, and it is well suited to textual pattern matching and batch processing, which has prompted its rapid adoption for big data applications. HDFS is a simple but extremely powerful distributed file system that can be implemented on a large cluster of commodity servers with thousands of nodes storing hundreds of petabytes of data, which makes it attractive for storing big data. However, Hadoop is not SQL compliant and, as such, does not have available to it the richness of expression and analytic capabilities of SQL systems. SQL-based platforms are better suited to near real-time numerical analysis and interactive data processing, whereas HDFS is better suited to batch processing of large unstructured or multi-structured data sets.
A problem with such distinctly different data processing platforms is how to combine the advantages of the two platforms by making data resident in one data store available to the platform with the best processing model. The attractiveness of Hadoop in handling large volumes of multi-structured data on commodity servers has led to its integration with MapReduce, a parallel programming framework that works with HDFS and allows users to express data analysis algorithms in terms of a limited number of functions and operators. It has also led to the development of SQL-like query engines, e.g., Hive, which compile a limited SQL dialect into MapReduce jobs. While this addresses some of the expressiveness shortcomings by affording some query functionality, it is slow and lacks the richness and analytical power of true SQL systems.
One reason for the slowness of HDFS with MapReduce is the necessity for access to metadata information needed for executing queries. In a distributed file system architecture such as HDFS, the data is distributed evenly across the multiple nodes. If the metadata required for queries is also distributed among many individual metadata stores on the multiple distributed nodes, it is quite difficult and time-consuming to maintain consistency in the metadata. An alternative approach is to use a single central metadata store that can be centrally maintained. Although a single metadata store can be used to address the metadata consistency problem, it has proven impractical in MPP database systems. A single central metadata store is subject to large numbers of concurrent accesses from multiple nodes running parallel queries, such as is the case with HDFS, and this approach does not scale well. The system slows rapidly as the number of concurrent accesses to the central store increases. Thus, while HDFS has many advantages for big data applications, it also has serious performance disadvantages. A similar problem exists in using a central metadata store in conventional MPP relational databases that require large numbers of concurrent accesses. What is needed is a different execution model and approach for executing queries in such distributed big data stores.
It is desirable to provide systems and methods that afford execution models and approaches for massively parallel query processing in distributed file systems that address the foregoing and other problems of MPP distributed data storage systems and methods, and it is to these ends that the present invention is directed.
Various embodiments of the invention are disclosed in the following detailed description and the accompanying drawings.
The invention can be implemented in numerous ways, including as a process; an apparatus; a system; a composition of matter; a computer program product embodied on a computer readable storage medium; and/or a processor, such as a processor configured to execute instructions stored on and/or provided by a memory coupled to the processor. In this specification, these implementations, or any other form that the invention may take, may be referred to as techniques. In general, the order of the steps of disclosed processes may be altered within the scope of the invention. Unless stated otherwise, a component such as a processor or a memory described as being configured to perform a task may be implemented as a general component that is temporarily configured to perform the task at a given time or a specific component that is manufactured to perform the task. As used herein, the term ‘processor’ refers to one or more devices, circuits, and/or processing cores configured to process data, such as computer program instructions.
A detailed description of one or more embodiments of the invention is provided below along with accompanying figures that illustrate the principles of the invention. The invention is described in connection with such embodiments, but the invention is not limited to any embodiment. The scope of the invention is limited only by the claims and the invention encompasses numerous alternatives, modifications and equivalents. Numerous specific details are set forth in the following description in order to provide a thorough understanding of the invention. These details are provided for the purpose of example and the invention may be practiced according to the claims without some or all of these specific details. For the purpose of clarity, technical material that is known in the technical fields related to the invention has not been described in detail so that the invention is not unnecessarily obscured.
This invention is particularly well adapted for use with a new MPP database system of the assignee of this invention comprising the native integration of a Hadoop HDFS and a massively parallel SQL distributed database with a massively parallel SQL query engine that affords true (full) SQL processing for Hadoop, and will be described in that context. It will be appreciated, however, that this is illustrative of only one utility of the invention and that the invention has applicability to other types of systems and methods.
The primary master 118, as will be described in more detail below, may be responsible for accepting queries from a client 134, planning queries, dispatching query plans to the segments for execution on the stored data in the distributed storage layer, and collecting the query results from the segments. The standby master 120 may be a warm backup for the primary master that takes over if the primary master fails. The primary master and the standby master may also be servers comprising conventional CPU, storage, memory and I/O components that execute instructions embodied in memory or other physical non-transitory computer readable storage media to perform the operations in accordance with the invention described herein. In addition to interfacing the segment hosts to the primary master and standby master, the network interconnect 114 also communicates tuples between execution processes on the segments.
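By way of example and not limitation, the following Python sketch illustrates the query lifecycle just described: a primary master plans a query, dispatches the plan to the segments in parallel, and collects the partial results. The class and method names (PrimaryMaster, Segment, plan_query) are hypothetical and are not drawn from any actual implementation.

```python
from concurrent.futures import ThreadPoolExecutor

class Segment:
    """Stands in for one segment host; it executes its portion of a plan."""
    def __init__(self, index):
        self.index = index

    def execute(self, plan):
        # A real segment would run the plan slices against its local data.
        return [("segment", self.index, "partial result for " + plan["sql"])]

class PrimaryMaster:
    def __init__(self, segments):
        self.segments = segments

    def plan_query(self, sql_text):
        # Placeholder for parsing, optimizing, and planning the query.
        return {"sql": sql_text, "slices": []}

    def run_query(self, sql_text):
        plan = self.plan_query(sql_text)
        # Dispatch the plan to every segment in parallel and wait for results.
        with ThreadPoolExecutor() as pool:
            partials = list(pool.map(lambda seg: seg.execute(plan), self.segments))
        # Collect and flatten the partial results from all segments.
        return [row for part in partials for row in part]

master = PrimaryMaster([Segment(i) for i in range(4)])
print(master.run_query("SELECT count(*) FROM sales"))
```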
As will be described in more detail below, when the primary master receives a query, it parses, optimizes and plans the query using a query planner and optimizer, which in a preferred embodiment are a SQL query planner and optimizer, and dispatches a query plan to the segments for execution. In accordance with the invention, after the query planning phase and prior to dispatch, the primary master converts the query plan to a self-described query plan that may comprise multiple slices. The self-described query plan is a self-contained plan that includes all of the information and metadata needed by each segment for execution of the plan. The self-described query plan includes, for instance, the locations of the files that store the tables accessed in the plan and the catalog information for the functions, operators and other objects in the query plan. In one embodiment, the information may also include the functions needed for processing the data. This information may be stored in a central metadata store 140 in the local file system or storage of the primary master from which it may be retrieved and inserted into the query plan to form the self-described query plan.
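By way of example and not limitation, the sketch below shows one way a plan could be made self-described by copying the relevant file locations and catalog entries out of a central metadata store into the plan itself before dispatch. The data structures and field names are hypothetical and serve only to illustrate the idea.

```python
from dataclasses import dataclass, field

@dataclass
class CentralMetadataStore:
    # table name -> file locations, plus catalog entries for functions/operators
    table_files: dict = field(default_factory=dict)
    catalog: dict = field(default_factory=dict)

@dataclass
class SelfDescribedPlan:
    slices: list
    table_files: dict   # locations of the files that store each table in the plan
    catalog: dict       # functions, operators, and other objects the plan references

def make_self_described(plan_slices, tables, objects, store):
    """Embed everything the segments need so they never contact the master's store."""
    return SelfDescribedPlan(
        slices=plan_slices,
        table_files={t: store.table_files[t] for t in tables},
        catalog={o: store.catalog[o] for o in objects},
    )

store = CentralMetadataStore(
    table_files={"sales": ["hdfs://nn/sales/part-0", "hdfs://nn/sales/part-1"]},
    catalog={"sum": {"kind": "aggregate"}},
)
plan = make_self_described([{"slice": 1, "op": "seqscan", "table": "sales"}],
                           ["sales"], ["sum"], store)
print(plan.table_files)
```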
There are a number of advantages to the invention. Since the self-described query plan is self-contained, it may contain all of the information needed for its execution. This obviates the need for any access by a segment to a central metadata store in order to execute the plan, thereby avoiding large volumes of network traffic and correspondingly slower response times, and it avoids the need for the segment hosts to store the necessary metadata locally. Moreover, metadata is typically small and conveniently stored in one central location, so metadata consistency can be easily maintained. Furthermore, since the metadata may be stored in a local file system on the primary master node, insertion of the metadata into the self-described query plan 136 is fast. Following generation, the self-described query plan may be broadcast to the segments 130 for execution, as indicated in the figure. In accordance with the invention, the segments may be stateless, i.e., they act as slave workers and have no need to maintain any state information for persistent files, functions and so on. This advantageously permits the segments to be relatively simple and fast.
Several optimizations are possible. One optimization is that the self-described query plan may be compressed prior to dispatch to decrease the network cost of broadcasting the plan. For a deeply sliced query on partitioned tables, the query plan may be quite large, for example, more than 100 MB, so it is preferable to decrease the size of the plan that must be broadcast by compressing it. In a preferred embodiment, a local read-only cache may be maintained on each segment (as will be described) to store static read-only information that seldom if ever changes, such as type information, built-in function information, query operators, etc. This information may include the functions and operators needed to execute the self-described query plan, e.g., built-in aggregate functions such as SUM, and various query operators, so that the plan itself need only contain a reference to the functions and operators, and identifiers for their argument or object values. The read-only cache may be initialized during a system bootstrap phase. Then, when a self-described query plan is constructed at the master, changeable metadata and information maintained in the master may be incorporated into the plan. Since any static read-only information can be obtained from the local caches on the segments, it is unnecessary to place that information into the plan that is broadcast.
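A minimal sketch of these two optimizations follows, assuming a hypothetical dictionary-based catalog and using pickle and zlib purely for illustration: objects already present in the segments' read-only caches are replaced by identifiers before the plan is compressed and broadcast, and each segment resolves those identifiers locally after decompression.

```python
import pickle
import zlib

# Populated on each segment during the bootstrap phase (illustrative content).
READ_ONLY_CACHE = {
    "op:sum": {"kind": "aggregate"},
    "op:hashjoin": {"kind": "join"},
}

def shrink_plan(plan):
    """Replace any object already in the segments' read-only cache with its key."""
    catalog = {k: v for k, v in plan["catalog"].items() if k not in READ_ONLY_CACHE}
    refs = [k for k in plan["catalog"] if k in READ_ONLY_CACHE]
    return {**plan, "catalog": catalog, "cached_refs": refs}

def dispatch(plan):
    # Compress the serialized plan before broadcasting it to the segments.
    return zlib.compress(pickle.dumps(shrink_plan(plan)))

def receive(payload):
    plan = pickle.loads(zlib.decompress(payload))
    # Resolve cached references from the local read-only cache, not the network.
    plan["catalog"].update({k: READ_ONLY_CACHE[k] for k in plan.pop("cached_refs")})
    return plan

plan = {"slices": [], "catalog": {"op:sum": {"kind": "aggregate"},
                                  "udf:discount": {"kind": "scalar"}}}
restored = receive(dispatch(plan))
print(sorted(restored["catalog"]))
```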
When each query executor 330 receives the query plan, it can determine which command in the query plan it should process. Because the query executors are set up by the query dispatcher after the self-described query plan has been generated, the query dispatcher knows the slice number and segment index information and may include this information in the query plan that is broadcast. Upon receiving the query plan, a query executor can look up the metadata information it needs to process the command. This metadata may be found either in the self-described query plan itself or in the read-only data in the read-only cache 334 in the segment. The self-described query plan may include an identifier or other indicator that designates the read-only information stored in the read-only cache that the query executor needs to execute the command. Once the query executor executes the command, it may follow the information in the command and send the results to a next-indicated query executor or return the results to the query dispatcher in the primary master. As will be appreciated, broadcasting the self-described query plan to each segment is a new execution model and process that enables full SQL functionality to be extended to distributed file systems such as HDFS.
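The segment-side lookup just described may be sketched as follows. The function names and plan fields are hypothetical assumptions; the sketch simply shows an executor selecting its commands by slice number and segment index, then resolving each referenced object first from the self-described plan and, failing that, from the local read-only cache.

```python
READ_ONLY_CACHE = {"op:hashagg": {"kind": "aggregate operator"}}

def resolve(plan, name):
    if name in plan["catalog"]:      # changeable metadata shipped inside the plan
        return plan["catalog"][name]
    if name in READ_ONLY_CACHE:      # static metadata cached at bootstrap
        return READ_ONLY_CACHE[name]
    raise KeyError(f"{name} is not available to this executor")

def run_slice(plan, slice_no, segment_index):
    # Pick out only the commands assigned to this slice and this segment.
    my_commands = [c for c in plan["slices"]
                   if c["slice"] == slice_no and c["segment"] == segment_index]
    for cmd in my_commands:
        op = resolve(plan, cmd["op"])
        # A real executor would run op on local data, then send the results to
        # the next-indicated executor or back to the query dispatcher.
        print("segment", segment_index, "runs", cmd["op"], "->", op["kind"])

plan = {"catalog": {}, "slices": [{"slice": 3, "segment": 0, "op": "op:hashagg"}]}
run_slice(plan, slice_no=3, segment_index=0)
```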
After the master generates the query plan shown in the figure, execution begins for Slice 1 at 410 with a sequential scan on a table containing sales data by query executors on two segments. Each query executor will only perform the sequential scan command on its own table data. For example, the query executor for Slice 1 on segment 1 will only perform the sequential scan on the sales table data for segment 1. Information on the portion of the table the query executor should process is obtained from the metadata embedded in the self-described query plan. After the query executor performs the sequential scan for Slice 1 at 410, it will perform a redistribute motion at 412 to distribute the data to the query executors on the two segments for Slice 2. Similarly, for Slice 2 the query executors will perform a sequential scan of a table of customer data at 420, hash the results at 422, perform a hash join operation at 424 with the results of the redistribute motion operation at 412, execute a hash aggregate command at 426, and perform a redistribute motion operation at 428. Finally, for Slice 3, the query executors will execute a hash aggregate command at 430 and a gather motion command at 432 to gather the results.
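By way of example and not limitation, the sliced plan walked through above might be pictured as the following structure, produced by a query that aggregates sales data joined to customer data. The exact SQL text, operator labels, and column names are illustrative assumptions rather than a reproduction of the figure.

```python
# Hypothetical representation of the three-slice plan described above.
EXAMPLE_PLAN = {
    "sql": ("SELECT c.region, sum(s.amount) "
            "FROM sales s JOIN customer c ON s.cust_id = c.id "
            "GROUP BY c.region"),
    "slices": [
        {"slice": 1, "ops": ["SeqScan(sales)", "RedistributeMotion"]},
        {"slice": 2, "ops": ["SeqScan(customer)", "Hash", "HashJoin",
                             "HashAggregate", "RedistributeMotion"]},
        {"slice": 3, "ops": ["HashAggregate", "GatherMotion"]},
    ],
}

for s in EXAMPLE_PLAN["slices"]:
    print("Slice", s["slice"], "->", " -> ".join(s["ops"]))
```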
As will be appreciated, the invention affords a new self-described query execution model and process that has significant advantages in increasing the speed and functionality of MPP data stores by decreasing network traffic and affording better control of metadata consistency. The process of the invention allows full advantage to be taken of Hadoop HDFS and other distributed file systems for multi-structured big data by extending to such systems a massively parallel SQL execution engine and its full SQL functionality.
While the foregoing has been with respect to preferred embodiments of the invention, it will be appreciated that changes to these embodiments may be made without departing from the principles and spirit of the invention, the scope of which is defined in the appended claims.
Although the foregoing embodiments have been described in some detail for purposes of clarity of understanding, the invention is not limited to the details provided. There are many alternative ways of implementing the invention. The disclosed embodiments are illustrative and not restrictive.
This application is a continuation of co-pending U.S. patent application Ser. No. 15/450,389, entitled SELF-DESCRIBED QUERY EXECUTION IN A MASSIVELY PARALLEL SQL EXECUTION ENGINE filed Mar. 6, 2017 which is incorporated herein by reference for all purposes, which is a continuation of U.S. patent application Ser. No. 13/853,060, entitled SELF-DESCRIBED QUERY EXECUTION IN A MASSIVELY PARALLEL SQL EXECUTION ENGINE filed Mar. 29, 2013, now U.S. Pat. No. 9,626,411, which is incorporated herein by reference for all purposes, which claims priority to U.S. Provisional Application No. 61/769,043, entitled INTEGRATION OF MASSIVELY PARALLEL PROCESSING WITH A DATA INTENSIVE SOFTWARE FRAMEWORK filed Feb. 25, 2013 which is incorporated herein by reference for all purposes.