Distributed storage systems enable databases, files, and other objects to be stored in a manner that distributes data across large clusters of commodity hardware. For example, Hadoop® is an open-source software framework to distribute data and associated computing (e.g., execution of application tasks) across large clusters of commodity hardware.
A database table or other large object may be stored in a distributed storage system as a set of files, each file comprising a portion of the object. In the Hadoop® distributed file system, for example, a file may be stored as a set of blocks. Typically, three copies of a block are stored, one on the host at which the data was written to the file, a second on another host on the same rack, and a third on a host in another rack. A storage master node stores metadata indicating the location of each of the copies.
To perform a “scan” operation in response to a query, for example to find records that match criteria specified in the query, data associated with rows of one or more database tables may have to be read, parsed, and analyzed.
Various embodiments of the invention are disclosed in the following detailed description and the accompanying drawings.
The invention can be implemented in numerous ways, including as a process; an apparatus; a system; a composition of matter; a computer program product embodied on a computer readable storage medium; and/or a processor, such as a processor configured to execute instructions stored on and/or provided by a memory coupled to the processor. In this specification, these implementations, or any other form that the invention may take, may be referred to as techniques. In general, the order of the steps of disclosed processes may be altered within the scope of the invention. Unless stated otherwise, a component such as a processor or a memory described as being configured to perform a task may be implemented as a general component that is temporarily configured to perform the task at a given time or a specific component that is manufactured to perform the task. As used herein, the term ‘processor’ refers to one or more devices, circuits, and/or processing cores configured to process data, such as computer program instructions.
A detailed description of one or more embodiments of the invention is provided below along with accompanying figures that illustrate the principles of the invention. The invention is described in connection with such embodiments, but the invention is not limited to any embodiment. The scope of the invention is limited only by the claims and the invention encompasses numerous alternatives, modifications and equivalents. Numerous specific details are set forth in the following description in order to provide a thorough understanding of the invention. These details are provided for the purpose of example and the invention may be practiced according to the claims without some or all of these specific details. For the purpose of clarity, technical material that is known in the technical fields related to the invention has not been described in detail so that the invention is not unnecessarily obscured.
Optimizing the scan operators of a query for a large-scale parallel analytic database system atop a distributed storage system is disclosed. In various embodiments, data location information is leveraged to accelerate scan operations, for example by assigning a task to scan a block of data to a processing segment on or near a host on which the block of data is stored.
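The locality preference described above can be illustrated with a minimal sketch. It assumes a hypothetical `Segment` record and a hypothetical `host_to_rack` mapping, neither of which is named in this description, and ranks candidate segments as same host, then same rack, then remote. It is an illustration of the idea only, not the disclosed scheduling algorithm.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Segment:
    # Hypothetical record for a processing segment; field names are assumptions.
    name: str
    host: str
    rack: str


def locality_rank(segment, replica_hosts, host_to_rack):
    """Return 0 if the segment is on a host holding a replica of the block,
    1 if it is on the same rack as a replica, and 2 otherwise."""
    if segment.host in replica_hosts:
        return 0
    replica_racks = {host_to_rack[h] for h in replica_hosts if h in host_to_rack}
    if segment.rack in replica_racks:
        return 1
    return 2


def pick_segment(segments, replica_hosts, host_to_rack):
    """Pick the segment closest to the block; ties go to the earliest segment."""
    return min(segments, key=lambda s: locality_rank(s, replica_hosts, host_to_rack))


segments = [
    Segment("seg1", "hostA", "rack1"),
    Segment("seg2", "hostB", "rack1"),
    Segment("seg3", "hostC", "rack2"),
]
host_to_rack = {"hostA": "rack1", "hostB": "rack1", "hostC": "rack2", "hostD": "rack2"}

# A block with a replica on hostB is scanned by the segment on hostB.
local_choice = pick_segment(segments, {"hostB"}, host_to_rack)
# A block whose only replica is on hostD falls back to a segment on the same rack.
rack_choice = pick_segment(segments, {"hostD"}, host_to_rack)
```

In this sketch the replica locations would come from the storage metadata, and a fuller version would also weigh each segment's available resources.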
In the example shown in
When the master node 102 accepts a query, the query is parsed and planned according to the statistics of the tables in the query, e.g., based on database system metadata 106 and storage metadata cache 107. The planning phase produces a query plan, which is divided into a number of slices. In the query execution phase, a “gang” or other grouping of segments is allocated to execute each slice. In some embodiments, the size of each gang is dynamically determined using knowledge of the data distribution (data block locations) and available resources (locations of segments having bandwidth available). In some embodiments, within the planning phase, a data locality based query optimization is performed for scan operators. In the example shown in
In various embodiments, two kinds of strategies may be used for dispatching, i.e., assigning tasks comprising a slice of a query plan. The first is to use a fixed number of QEs to execute each slice, for example a number of QEs that is equal to or less than the number of segments in the cluster. In various embodiments, the scheduling algorithm that matches QEs to segments considers the dynamically available resources and the data locality for scan nodes.
Given the total number of QE slots available for the query, the second strategy allows variable-size gangs. In typical analytical queries, high-level slices often do less work than low-level slices due to the bottom-up processing nature of a query plan. By assigning more QEs to low-level slices than to less processing-intensive upper-level slices, resources can be more fully utilized.
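One way to picture variable-size gangs is to divide the available QE slots among slices in proportion to an estimate of each slice's work. The sketch below is an assumption made for illustration: the function name `size_gangs`, the per-slice cost estimates, and the proportional (largest-remainder) split are not taken from this description, which does not specify how gang sizes are computed.

```python
def size_gangs(slice_costs, total_slots):
    """Split total_slots QE slots across slices in proportion to estimated work.

    slice_costs maps a slice identifier to a positive relative cost estimate
    (hypothetical input). Every slice receives at least one QE.
    """
    if total_slots < len(slice_costs):
        raise ValueError("need at least one QE slot per slice")

    # Reserve one QE per slice, then share the spare slots by cost.
    sizes = {s: 1 for s in slice_costs}
    spare = total_slots - len(slice_costs)
    total_cost = sum(slice_costs.values())
    shares = {s: spare * c / total_cost for s, c in slice_costs.items()}
    for s in sizes:
        sizes[s] += int(shares[s])

    # Hand any slots lost to rounding to the largest fractional remainders.
    leftover = total_slots - sum(sizes.values())
    by_remainder = sorted(
        slice_costs, key=lambda s: shares[s] - int(shares[s]), reverse=True
    )
    for s in by_remainder[:leftover]:
        sizes[s] += 1
    return sizes


# A low-level scan slice is estimated to do more work than the slices above it.
gangs = size_gangs({"scan": 6.0, "join": 3.0, "agg": 1.0}, total_slots=13)
```

Here the scan slice at the bottom of the plan receives the largest gang, which matches the observation that low-level slices tend to do more work than upper-level ones.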
Where “extra” is a configurable parameter between 0 and 1. In various embodiments, the “extra” parameter can make it possible for segments having good data locality to be assigned more data to scan.
In some embodiments, the available bandwidth of a segment is determined by comparing the volume of data currently assigned to be scanned by the segment to the maximum computed as indicated above, i.e.:
Available bandwidth = V_max − V_assigned
If the size of a data block to be processed is less than or equal to the available bandwidth, the segment is considered to have sufficient bandwidth available to scan the block.
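The bandwidth test can be expressed as a short sketch. The class name `SegmentLoad` and its fields are hypothetical, and the maximum volume is taken as a given input because its computation is not reproduced here. Only the comparison of block size against the maximum minus the assigned volume comes from the description above.

```python
from dataclasses import dataclass


@dataclass
class SegmentLoad:
    # Hypothetical bookkeeping for one segment; volumes are in bytes.
    v_max: int           # maximum scan volume for the segment (computed elsewhere)
    v_assigned: int = 0  # volume already assigned to the segment

    def available(self):
        """Available bandwidth = V_max - V_assigned."""
        return self.v_max - self.v_assigned

    def try_assign(self, block_size):
        """Assign the block if its size does not exceed the available bandwidth."""
        if block_size <= self.available():
            self.v_assigned += block_size
            return True
        return False


load = SegmentLoad(v_max=256)
first = load.try_assign(128)   # fits: 128 <= 256
second = load.try_assign(128)  # fits exactly: 128 <= 128
third = load.try_assign(1)     # rejected: no bandwidth left
```

A block whose size equals the remaining bandwidth is still accepted, consistent with the "less than or equal to" condition stated above.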
Referring further to
Using techniques disclosed herein, scan operations may be performed very efficiently in a large-scale distributed database stored in a distributed storage system, since most scan operators read data from local disks rather than pulling data from remote hosts, resulting in greatly decreased network traffic and accelerated scan operations.
Although the foregoing embodiments have been described in some detail for purposes of clarity of understanding, the invention is not limited to the details provided. There are many alternative ways of implementing the invention. The disclosed embodiments are illustrative and not restrictive.
This application claims priority to U.S. Provisional Patent Application No. 61/769,043, entitled INTEGRATION OF MASSIVELY PARALLEL PROCESSING WITH A DATA INTENSIVE SOFTWARE FRAMEWORK, filed Feb. 25, 2013, which is incorporated herein by reference for all purposes.
Number | Date | Country |
---|---|---|
61769043 | Feb 2013 | US |