Database management software may utilize various hierarchical data structures to index and store data. One approach used to predict and improve database performance is to estimate the amount of data it would return in response to a given predicate. Such estimates may be used to optimize a query plan.
As noted above, the performance of a database may be improved with an optimized query plan. The optimization of a query plan may be based upon estimates of the amount of data the query would process and an execution engine's ability to carry out the query. A query may comprise a predicate with criteria such that a response to the predicate contains data satisfying the criteria. Therefore, the performance of a query may be predicted by estimating the amount of data that would satisfy a predicate associated with the query. The data estimates may be obtained without actually executing the predicate since doing so may utilize a significant amount of resources. Thus, one way to estimate the amount of data returned in response to a predicate is to sample data from the database and generate histograms based on the samples. Unfortunately, the underlying data in a database may be very large and may be distributed unevenly. Therefore, estimates produced using a sampling or histogram approach may be off by multiple orders of magnitude.
One cause of inaccurate estimates may be that the samples are unrepresentative of the criteria. Another cause may be stale statistics or unnecessarily frequent updates to the data. When the underlying data changes, a histogram may need to be refreshed to accurately reflect the change. Current policies for triggering refresh of statistics may be heuristic and these policies may fail to refresh statistics when needed or may refresh too frequently. A further cause of inaccurate estimates may be the limited precision of histograms generally. As noted above, the actual distribution of data may be much more uneven than what can be represented by a histogram.
In view of the foregoing, various examples disclosed herein provide a system, non transitory computer-readable medium and method to estimate the performance of a database. In one aspect, the estimate may be calculated based at least partially on child nodes or content of a node located in a hierarchical data structure. Such data structures may be used to index a database. Rather than generating histograms based on samples that may or may not be representative of the criteria, the database index, which may be much more reliable than samples, may be used to generate the estimate. The nodes of the index may be utilized as equal area histograms, since the data associated with the nodes may be balanced in order to accelerate the searches. Moreover, hierarchical database indices may allow estimates to be calculated at various levels of precision. The aspects, features and advantages of the application will be appreciated when considered with reference to the following description of examples and accompanying figures. The following description does not limit the application; rather, the scope of the application is defined by the appended claims and equivalents.
Non-transitory CRM 112 may also comprise any one of many physical media such as, for example, electronic, magnetic, optical, electromagnetic, or semiconductor media. More specific examples of suitable non-transitory computer-readable media include, but are not limited to, a portable magnetic computer diskette such as floppy diskettes or hard drives, a read-only memory (“ROM”), an erasable programmable read-only memory, a portable compact disc or other storage devices that may be coupled to computer apparatus 100 directly or indirectly. Alternatively, non-transitory CRM 112 may be a random access memory (“RAM”) device or may be divided into multiple memory segments organized as dual in-line memory modules (“DIMMs”). The non-transitory CRM 112 may also include any combination of one or more of the foregoing and/or other devices as well. While only one processor and one non-transitory CRM are shown in
Estimator 114 may comprise any set of instructions to be executed directly (such as machine code) or indirectly (such as scripts) by processor 110. In this regard, the terms “instructions,” “scripts,” and “applications” may be used interchangeably herein. The computer executable instructions may be stored in any computer language or format, such as in object code or modules of source code. Furthermore, it is understood that the instructions may be implemented in the form of hardware, software, or a combination of hardware and software and that the examples herein are merely illustrative.
As will be discussed in more detail below, a database index may be arranged as a hierarchical, tree-like data structure that relies on splitting and merging to maintain its balance and optimize searches. In one example, estimator 114 may instruct processor 110 to read a request to estimate an amount of data that would fall within a scope of criteria in a predicate without actually executing the predicate. In another aspect, estimator 114 may instruct processor 110 to locate a node in the data structure whose child nodes or content fall within the scope of the criteria. In yet a further example, the estimate may be calculated based at least partially on the located node.
Working examples of the system, method, and non-transitory computer-readable medium are shown in
In block 202 of
A B-tree data structure may be configured to maintain its “balance” such that every path from the root node (e.g., root node 304) to a leaf node (e.g. leaf nodes 310, 312, and 314) is of the same length. Such balance may optimize a search for data beginning at the root node of the tree and ending at a leaf node. Particularly, a B-tree may be configured to maintain its balance when new data is added or deleted. That is, new data may be added or deleted in such a way that the balance is maintained. An interior node of a B tree may have a predetermined number of key entries that may be used as “separator keys” by a search engine. The keys allow a search engine to determine the correct path to the data sought after in the leaf nodes. The goal of database indices may be to optimize access time to data. As will be demonstrated below, a balanced index may also be used to provide accurate predicate response estimates.
In the example of
Referring back to
Referring back to
While the examples herein make reference to a single index data structure, it is understood that multiple indices may be used. The criteria of the estimate request may involve data located in various data sets each with its own index configuration. In this instance, a second node in a second data structure whose child nodes or content fall within the scope of the criteria may also be located. The estimate may be further calculated based at least partially on the child nodes or content of this second node. It is further understood that multiple indices may involve using a single index data structure multiple times (e.g., a self-join). Moreover, the examples herein are not limited to inner joins, but may also be applicable to semi-joins, outer joins, etc. Predicates across multiple indices may include conjunctive joins, disjunctive joins, unions, etc.
Advantageously, the above-described apparatus and method may provide more accurate estimates of database cardinality or selectivity by using the index of a database as its source of information. In this regard, balanced and well maintained index data structures may be used to provide estimates of cardinality or selectivity that are more precise that conventional techniques. In turn, users or consumers basing decisions on such estimates may be rest assured that the estimates are precise and reliable.
Although the disclosure herein has been described with reference to particular examples, it is to be understood that these examples are merely illustrative of the principles of the disclosure. It is therefore to be understood that numerous modifications may be made to the examples and that other arrangements may be devised without departing from the spirit and scope of the disclosure as defined by the appended claims. Furthermore, while particular processes are shown in a specific order in the appended drawings, such processes are not limited to any particular order unless such order is expressly set forth herein. Rather, processes may be performed in a different order or concurrently.
Filing Document | Filing Date | Country | Kind |
---|---|---|---|
PCT/US2013/070083 | 11/14/2013 | WO | 00 |
Publishing Document | Publishing Date | Country | Kind |
---|---|---|---|
WO2015/073003 | 5/21/2015 | WO | A |
Number | Name | Date | Kind |
---|---|---|---|
4774657 | Anderson | Sep 1988 | A |
5379422 | Antoshenkov | Jan 1995 | A |
5664171 | Agrawal | Sep 1997 | A |
5864841 | Agrawal | Jan 1999 | A |
5875445 | Antonshenkov | Feb 1999 | A |
6012054 | Seputis | Jan 2000 | A |
6108648 | Lakshmi | Aug 2000 | A |
6343288 | Lindsay | Jan 2002 | B1 |
6353826 | Seputis | Mar 2002 | B1 |
6421664 | Groeschel | Jul 2002 | B1 |
6738755 | Freytag | May 2004 | B1 |
6947934 | Chen | Sep 2005 | B1 |
7213012 | Jakobsson | May 2007 | B2 |
7313553 | Faunce | Dec 2007 | B2 |
7337167 | Muras | Feb 2008 | B2 |
7376638 | Gordon | May 2008 | B2 |
7805410 | Sasaki | Sep 2010 | B2 |
7930277 | Ge | Apr 2011 | B2 |
7987200 | Faunce | Jul 2011 | B2 |
8108399 | Dumitru | Jan 2012 | B2 |
8880511 | Kim | Nov 2014 | B2 |
20030018618 | Bestgen | Jan 2003 | A1 |
20040117396 | Avadhanam | Jun 2004 | A1 |
20040267713 | Chaudhuri | Dec 2004 | A1 |
20050160093 | Abdo | Jul 2005 | A1 |
20050228772 | Faunce | Oct 2005 | A1 |
20050240624 | Ge | Oct 2005 | A1 |
20060015809 | Hattori | Jan 2006 | A1 |
20060074875 | Faunce | Apr 2006 | A1 |
20060106777 | Faunce | May 2006 | A1 |
20060235818 | Muras | Oct 2006 | A1 |
20060271504 | Anderson | Nov 2006 | A1 |
20070136317 | Przywara | Jun 2007 | A1 |
20080065594 | Faunce | Mar 2008 | A1 |
20080071754 | Muras | Mar 2008 | A1 |
20080222087 | Balmin | Sep 2008 | A1 |
20080288444 | Edwards | Nov 2008 | A1 |
20090150413 | Basu | Jun 2009 | A1 |
20090299989 | Zhang | Dec 2009 | A1 |
20110029507 | Au | Feb 2011 | A1 |
20110145244 | Kim | Jun 2011 | A1 |
20130054567 | Graefe | Feb 2013 | A1 |
20130066880 | Schramm | Mar 2013 | A1 |
20130138679 | Kim | May 2013 | A1 |
20130151505 | Yoon | Jun 2013 | A1 |
20140101133 | Carston | Apr 2014 | A1 |
20140214793 | Tatemura | Jul 2014 | A1 |
20140229464 | Milenova | Aug 2014 | A1 |
20140379693 | May | Dec 2014 | A1 |
20150112966 | Tokuda | Apr 2015 | A1 |
Entry |
---|
Aoki, P.M., “Algorithms for Index-Assisted Selectivity Estimation”, Report No. USB//CSD-98-1021, Oct. 1998, 36 pages. (Year: 1998). |
Piatetsky-Shapiro et al., “Accurate Estimation of the Number of Tuples Satisfying a Condition”, ACM, 1984, pp. 256-276. (Year: 1984). |
Aoki P.M., “Algorithms for Index-Assisted Selectivity Estimation”, Report No. UCB//CSD-98-1021, Oct. 1998, Computer Science Division (EECS), University of California, Berkeley, California, pp. 1-36. (Year: 1998). |
Bradley M., “Guide to Database Performance and Tuning: Predicate Estimation”, Oracle Rdb Journal, Aug. 2003, pp. 1-52. (Year: 2003). |
M. Colgan, “The Oracle Optimizer Explain the Explain Plan,” An Oracle White Paper, May 2011, pp. 1-30, Oracle Corporation, Available at: <oracle.com/technetwork/database/bi-datawarehousing/twp-explain-the-explain-plan-052011-393574.pdf>. |
Lydia Parziale et al., “Best Practices for SAP BI Using DB2 9 for z/OS,” Information Management software, IBM, Third Edition, Mar. 2008, pp. Book Cover, i-XVI, 1-356, Book Spine, Book Back; Redbooks, Availablet at: <redbooks.ibm.com/redbooks/pdfs/sg246489.pdf>, 375 Pages. |
PCT Search Report/Written Opinion ˜ Application No. PCT/US2013/070083 dated Aug. 21, 2014 ˜ 12 pages. |
Number | Date | Country | |
---|---|---|---|
20160253384 A1 | Sep 2016 | US |