Distributed storage systems enable databases, files, and other objects to be stored in a manner that distributes data across large clusters of commodity hardware. For example, Hadoop® is an open-source software framework to distribute data and associated computing (e.g., execution of application tasks) across large clusters of commodity hardware.
EMC Greenplum® provides a massively parallel processing (MPP) architecture for data storage and analysis. Typically, data is stored in segment servers, each of which stores and manages a portion of the overall data set.
Distributed systems, such as a distributed database or other storage system, typically embody and/or employ a “transaction model” to ensure that a single logical operation on the data, the processing of which may be performed by more than one node, is performed collectively in a manner that ensures certain properties: atomicity (modifications, potentially made by more than one node, either succeed or fail together), consistency (the database is never left in a “half-finished” state, but instead in a state wholly consistent with its rules), isolation (transactions are kept separate from each other until they are finished), and durability (once a transaction is “committed”, its effects on the data will not be lost due to power failure, etc.).
The two-phase commit protocol, or other distributed transaction commit protocols, are commonly used to implement global transactions in a parallel transactional MPP database system. These distributed transaction protocols are complicated to implement and require multiple interactions between the master and slave/worker nodes. In addition, each node typically must keep its own log.
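As a point of contrast, the classic two-phase commit flow described above can be sketched as follows. This is a minimal illustration, not any particular system's implementation; the class and function names are hypothetical. It shows the two message rounds (prepare, then commit/abort) and the per-node log that the text identifies as sources of complexity.

```python
# Hypothetical sketch of classic two-phase commit: the coordinator exchanges
# two rounds of messages with every participant, and each participant keeps
# its own log.

class Participant:
    def __init__(self, name, will_succeed=True):
        self.name = name
        self.will_succeed = will_succeed
        self.log = []          # each node must keep its own log

    def prepare(self):
        self.log.append("PREPARE")
        return self.will_succeed   # vote yes or no

    def commit(self):
        self.log.append("COMMIT")

    def abort(self):
        self.log.append("ABORT")

def two_phase_commit(participants):
    # Phase 1: voting round
    votes = [p.prepare() for p in participants]
    # Phase 2: decision round
    if all(votes):
        for p in participants:
            p.commit()
        return "committed"
    for p in participants:
        p.abort()
    return "aborted"
```

A single "no" vote in phase 1 forces an abort round to every participant, which is the multiple-interaction overhead the disclosed model avoids.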
Various embodiments of the invention are disclosed in the following detailed description and the accompanying drawings.
The invention can be implemented in numerous ways, including as a process; an apparatus; a system; a composition of matter; a computer program product embodied on a computer readable storage medium; and/or a processor, such as a processor configured to execute instructions stored on and/or provided by a memory coupled to the processor. In this specification, these implementations, or any other form that the invention may take, may be referred to as techniques. In general, the order of the steps of disclosed processes may be altered within the scope of the invention. Unless stated otherwise, a component such as a processor or a memory described as being configured to perform a task may be implemented as a general component that is temporarily configured to perform the task at a given time or a specific component that is manufactured to perform the task. As used herein, the term ‘processor’ refers to one or more devices, circuits, and/or processing cores configured to process data, such as computer program instructions.
A detailed description of one or more embodiments of the invention is provided below along with accompanying figures that illustrate the principles of the invention. The invention is described in connection with such embodiments, but the invention is not limited to any embodiment. The scope of the invention is limited only by the claims and the invention encompasses numerous alternatives, modifications and equivalents. Numerous specific details are set forth in the following description in order to provide a thorough understanding of the invention. These details are provided for the purpose of example and the invention may be practiced according to the claims without some or all of these specific details. For the purpose of clarity, technical material that is known in the technical fields related to the invention has not been described in detail so that the invention is not unnecessarily obscured.
A transaction model for a large-scale parallel analytic database system is disclosed. In various embodiments, a master node is responsible for keeping the transaction state of metadata for the entire distributed system and for maintaining data consistency across the whole cluster. Distributed processing units, sometimes referred to herein as “segments”, in various embodiments are stateless execution engines. The master node sends to each segment the system metadata required by that segment to execute its part of a query plan, and the segment returns to the master node that segment's query results and a metadata modification record reflecting changes made by that segment, if any, to the data in connection with executing the query. The master node implements a single node transaction model, and a transaction is committed only if all participating segments complete their transaction-related work successfully. If all succeed, the master uses the metadata modification records received from the respective participating segments to update the system metadata and commits the transaction. If any one or more participating segments fail, the transaction is aborted and none of the metadata modification records is written to the system metadata.
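The commit flow described above can be sketched as follows. This is an illustrative model only, not the patented implementation: the function names, the 100-byte append, and the file-to-EOF metadata mapping are hypothetical stand-ins. It shows the key property that metadata modification records from segments are applied on the master only if every participating segment succeeds.

```python
# Illustrative sketch: stateless segments return a metadata modification
# record; the master replays all records and commits only if every
# participating segment succeeds, otherwise no record is applied.

def run_segment(seg_id, file_lengths, fail=False):
    if fail:
        raise RuntimeError("segment %d failed" % seg_id)
    fname = "seg%d.dat" % seg_id
    # hypothetical record: this segment appended 100 bytes to its own file
    return {"file": fname, "new_eof": file_lengths.get(fname, 0) + 100}

def execute_transaction(master_metadata, num_segments, failures=frozenset()):
    records = []
    try:
        for seg_id in range(num_segments):
            records.append(
                run_segment(seg_id, master_metadata, fail=(seg_id in failures)))
    except RuntimeError:
        return "aborted", master_metadata      # abort: no record is applied
    committed = dict(master_metadata)
    for rec in records:                        # replay the records on the master
        committed[rec["file"]] = rec["new_eof"]
    return "committed", committed
```

Because the segments hold no transaction state, the abort path requires no cleanup messages to them; the master simply discards the records.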
When the master node 102 accepts a query, the query is parsed and planned according to the statistics of the tables it references, e.g., based on metadata 106. The planning phase produces a query plan, which is sliced into many slices. In the query execution phase, a “gang” or other grouping of segments is allocated to each slice to execute that slice.
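A rough sketch of the slicing and gang-allocation step, under assumptions not stated in the text: the plan is modeled as an ordered list of operator nodes, the slice boundaries are given, and the allocation policy simply runs every slice on every segment. All names here are illustrative.

```python
# Hedged sketch of plan slicing: cut a query plan into slices, then
# allocate a "gang" of segments to execute each slice.

def slice_plan(plan_nodes, slice_boundaries):
    """Cut an ordered list of plan nodes at the given boundary indices."""
    slices, start = [], 0
    for b in slice_boundaries:
        slices.append(plan_nodes[start:b])
        start = b
    slices.append(plan_nodes[start:])
    return slices

def allocate_gangs(slices, segment_ids):
    # simplest policy: every slice runs on every available segment
    return [{"slice": s, "gang": list(segment_ids)} for s in slices]
```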
In the example shown in
In various embodiments, a large-scale distributed database system such as the one shown in
In various embodiments, the master adopts a traditional single node transaction implementation; for example, in various embodiments a write ahead log (WAL) and multi-version concurrency control (MVCC) algorithms are used to implement transactions. The master is responsible for the metadata's consistency, isolation, atomicity, and durability. All modifications of metadata resulting from processing on segments are recorded, in various embodiments, on the local file system of the master.
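A minimal write-ahead-log sketch of the master's role, assuming an in-memory list as a stand-in for the master's local log file (all names hypothetical): each metadata change is logged before it is applied, and recovery replays only records belonging to committed transactions, which is what makes the single-node model durable.

```python
# Minimal WAL sketch: the master logs each metadata change before applying
# it, so committed transactions survive a crash; recovery replays the log.

class MasterWAL:
    def __init__(self):
        self.log = []        # stands in for the master's local log file
        self.metadata = {}

    def commit(self, txid, changes):
        self.log.append(("BEGIN", txid))
        for key, value in changes.items():
            self.log.append(("SET", txid, key, value))
        self.log.append(("COMMIT", txid))   # durable once this record lands
        self.metadata.update(changes)

    def recover(self):
        """Rebuild metadata from the log, keeping only committed txids."""
        committed = {rec[1] for rec in self.log if rec[0] == "COMMIT"}
        meta = {}
        for rec in self.log:
            if rec[0] == "SET" and rec[1] in committed:
                meta[rec[2]] = rec[3]
        return meta
```

Recovery here is exactly the single-node process: replay the master's own log and discard uncommitted work; segments, being stateless, take no part in it.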
Continuing with the example shown in
As the various segments to which portions of the query plan were assigned completed their work, each segment that successfully completed its work would send to the master a metadata modification record reflecting which changes, if any, that segment made to system data. For example, if a segment appended rows to a table or a portion thereof and saved those changes to an associated file, the metadata modification record may reflect a new EOF or other indication of valid file size and/or extent. As noted in
In various embodiments, modifications to metadata on segments are not visible on the master until the master receives all metadata modification records from the respective segments participating in the transaction and replays them on the master. In some embodiments, during execution of a portion of a query plan, metadata modification associated with processing performed by a segment is visible to that segment.
In some embodiments, if the master needs to make the metadata modifications visible on all segments during the course of an operation, the master may split the operation into multiple sub-operations and dispatch the sub-operations multiple times and/or as required to make the metadata modifications visible on all segments during the course of the overall operation.
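The splitting described above can be sketched as follows, under the assumption (not stated in the text) that each sub-operation is itself committed through the master before the next is dispatched, so that later sub-operations see earlier metadata changes. Function names are hypothetical.

```python
# Hypothetical sketch: an operation whose later steps must see earlier
# metadata changes is split into sub-operations, each committed through
# the master before the next one runs.

def dispatch(master_metadata, sub_op):
    """Run one sub-operation and commit its metadata delta immediately."""
    delta = sub_op(master_metadata)
    master_metadata.update(delta)  # committed, hence visible to the next sub-op
    return master_metadata

def run_split_operation(master_metadata, sub_ops):
    for op in sub_ops:
        dispatch(master_metadata, op)
    return master_metadata
```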
In various embodiments, the master recovers itself (e.g., after a failure) as if it were a single node system. All committed transactions must be recovered, and after recovery the system metadata is left in a consistent state. In various embodiments, segments do not need a recovery process, because they do not maintain any system state.
In various embodiments, the master adopts traditional single node methods, such as MVCC and locking, to enable multiple sessions to access the metadata concurrently.
In some embodiments, only an append operation is supported in the system. The master keeps track of the logical file length for each user-defined table's files. Each read session may get a different logical file length from metadata depending on the metadata's visibility. The system controls the visibility of user-defined tables by the visibility of the logical file length.
In some embodiments, each user-defined table has a set of files; each file can be appended to by one and only one write session, but can be read by multiple read sessions, even during appending. Different write sessions can append to the same user-defined table concurrently, but to different files of the table.
In various embodiments, since segments write all appended data to the file system permanently before committing a transaction, and the metadata's durability is protected by the master's transaction, all append operations take effect permanently once the master has committed the transaction.
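The visibility mechanism described in the preceding paragraphs can be sketched as follows, using an in-memory byte string as a stand-in for a segment's file (class and attribute names are illustrative). A writer may have durably appended bytes past the committed end-of-file, but readers see only data up to the logical file length recorded in metadata.

```python
# Sketch of visibility via logical file length: appended bytes become
# visible to readers only after the logical EOF is committed.

class AppendOnlyFile:
    def __init__(self):
        self.data = b""
        self.logical_eof = 0   # committed length, kept in master metadata

    def append(self, payload):
        self.data += payload   # durably written, but not yet visible

    def commit(self):
        self.logical_eof = len(self.data)

    def read(self):
        # concurrent readers see only committed bytes, even mid-append
        return self.data[:self.logical_eof]
```

This is why aborts need no undo on the segments: uncommitted bytes past the logical EOF are simply never exposed, and the next successful append overwrites or supersedes them.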
Using techniques disclosed herein, transactional qualities such as atomicity, isolation, consistency, and durability can be provided using a transaction model that is simple, relatively easy to implement, and requires relatively little interaction between the master and the segments.
Although the foregoing embodiments have been described in some detail for purposes of clarity of understanding, the invention is not limited to the details provided. There are many alternative ways of implementing the invention. The disclosed embodiments are illustrative and not restrictive.
This application claims priority to U.S. Provisional Patent Application No. 61/769,043 entitled INTEGRATION OF MASSIVELY PARALLEL PROCESSING WITH A DATA INTENSIVE SOFTWARE FRAMEWORK filed Feb. 25, 2013 which is incorporated herein by reference for all purposes.