Modern database systems receive database queries from applications. The database queries specify particular data of one or more data sources. A database system determines a query execution plan based on a received database query and executes the query execution plan against the data sources to generate a result set.
A database system typically includes a query optimizer which selects one of several possible query execution plans for executing a query. For example, the query optimizer may determine an estimated cost of each of the several query execution plans and select a query execution plan associated with a smallest cost. The estimated costs may be determined in part based on database statistics. One such statistic is the estimated selectivity of the predicates of the received query.
Selectivity refers to a percentage of records of a table which are selected by the query predicates. Cardinality is determined by multiplying the selectivity by the number of records in the table. In order to estimate the selectivity/cardinality of particular query predicates for table R, the query may be executed against a random sampling of the records of table R and the number of sample records selected by the query is determined. The selected number is divided by the sample size to determine an estimated selectivity of the query on table R. An estimated cardinality for table R may then be determined by multiplying the estimated selectivity by a size of R. This technique is insufficient if the query selects no records of the sample table.
Conventional methods for addressing the above scenario is to determine a selectivity for each predicate of the query and to multiply the determined selectivities. The foregoing methods are often inaccurate, resulting in inaccurate determinations of execution plan costs and subsequent selection of a less-desirable query execution plan. Systems are desired to improve selectivity estimates.
The following description is provided to enable any person in the art to make and use the described embodiments and sets forth the best mode contemplated for carrying out some embodiments. Various modifications, however, will be readily apparent to those in the art.
Some embodiments provide improved estimates of selectivity via a statistical model which captures deviations between true and observed correlations of predicate selectivities. The statistical model allows balancing between the certainty of partial observations and the uncertainty of selectivities that cannot be modeled due to limitations in sampling sizes (i.e., zero qualifying tuples). Accordingly, in a zero qualifying tuple case, an initial selectivity estimate is determined and then adjusted based on the above-mentioned balancing.
Some embodiments may also provide improved estimates of selectivity in cases where the sample includes qualifying tuples and the true selectivity of a query predicate on the source table is known. Selectivities of query sub-expressions are determined in a traditional manner and these selectivities are adjusted in view of an estimated selectivity of a predicate determined based on the sample, the true selectivity of the predicate received from external sources, and a statistical model which incorporates observed correlations of predicate selectivities. The adjusted selectivities may then be used to estimate a selectivity of the full query on the source table.
More specifically, embodiments use partial correlations to improve the estimation accuracy in the case of zero qualifying tuples, exploiting statistical properties and Bayes' theorem:
p(A)=p(A|B)p(B)+p(A|
where p(
The Beta distribution may be derived for qualifying samples k≥1. Some embodiments further introduce an unbiased estimate for the zero qualifying tuple case and determine specific shape parameters in terms of the hypergeometric distribution. The inconsistency is resolved by adjusting the estimate based on the respective shape parameters, and a selectivity is derived using the adjusted estimate. Embodiments are further applicable to conjunctions including an arbitrary number of predicates.
Statistics server 110 determines an estimated selectivity of query 120 on table 130 based on sample table 140. Sample table 140 includes a subset of records of table 130 and may be generated using any sampling method that is or becomes known. Although embodiment are described herein with respect to a sample taken over a database table, embodiments may also be utilized to analyze a sample taken over a database view.
In some embodiments, statistics server 110 executes query 120 against sample table 140 and determines that query 120 selects very few or zero records of sample table 140. Statistics server 110 may generate an initial estimate of the selectivity of query 120 and then adjust the estimate based on a relative number of observations of two or more sub-expressions of query 120 within sample table 140. In other embodiments, the initial estimate is based on a significant number of selected records, but is adjusted based on externally-received information (e.g., an actual selectivity of a query predicate on table 130) and on a relative number of observations of two or more sub-expressions of query 140 within sample table 120.
Generally, server node 200 receives a query from client applications 230 and 240 and returns results thereto based on data stored within node 200. Node 200 executes program code to provide an application server and query processor 220. The application server provides services for executing server applications. For example, Web applications executing on an application server may receive HyperText Transfer Protocol (HTTP) requests from client applications 240 as shown in
Query processor 220 contains the actual stored data and engines for processing the data. Query processor 220 is responsible for processing Structured Query Language (SQL) and Multi-Dimensional eXpression (MDX) statements and may receive such statements directly from client applications 230.
Query processor 220 includes a query optimizer for use in determining query execution plans and a statistics server for determining statistics for estimating query execution plan costs. The statistics server may generate such statistics as described herein based on stored sample tables. For example, in response to reception of a query on a stored table (or view) of server node 200, the statistics server may estimate selectivity of the query on the table (or view) based on a pre-generated and stored sample table which corresponds to the table (or view). According to some embodiments, the statistics server of query processor 220 comprises an implementation of statistics server 110 to generate such estimates.
In some embodiments, the data of server node 200 may comprise one or more of conventional tabular data, row-based data, column-based data, and object-based data. Moreover, the data may be indexed and/or selectively replicated in an index to allow fast searching and retrieval thereof. Server node 200 may support multi-tenancy to separately support multiple unrelated clients by providing multiple logical database systems which are programmatically isolated from one another.
Server nodes 200 may implement an “in-memory” database, in which a full database stored in volatile (e.g., non-disk-based) memory (e.g., Random Access Memory). The full database may be persisted in and/or backed up to fixed disks (not shown). Embodiments are not limited to an in-memory implementation. For example, data may be stored in Random Access Memory (e.g., cache memory for storing recently-used data) and one or more fixed disks (e.g., persistent memory for storing their respective portions of the full database).
A query on a table is initially determined at S310. In the present example, the query includes a first predicate and a second predicate. The query may include additional predicates, as will be explained below. Next, at S320, a sample of the table is determined using any technique that is or becomes known.
Let R be a set of tuples defining a table or view. A sample S⊆R may be generated by drawing tuples from R uniformly at random and without replacement. The number of tuples in R is denoted by n:=|R| and the sample size is denoted by m:=|S|. The determined query q is defined as a conjunction of r simple predicates q:=p1∧p2 . . . ∧pr.
The result size of evaluating query q on relation R is defined as l:=|Rq|, corresponding to SELECT COUNT(*) FROM R WHERE q. Analogously, the result size of evaluating query q on sample S is k:=|Sq|. The total number of samples is given by the binomial coefficient
and the total number of samples of size m with exactly k qualifying tuples is
Since every sample is equally likely, the probability of observing k qualifying sample tuples is:
Accordingly, given k qualifying sample tuples in sample S, the number l of qualifying tuples of relation R may be estimated as
is the selectivity of query q on sample S.
A first conditional probability is determined at S330. The first conditional probability is a probability of selecting a tuple of the sample in which the first predicate and the second predicate are True. For example, it will be assumed that the source table is a relation R containing columns associated with car brand and car color. A sample of the source table is acquired at S320.
Characteristics 400 also indicate the conditional probability p(Audi|Blue), which is the probability of observing a tuple having the first predicate Audi in a sample prefiltered by the second predicate color=Blue. Specifically, 5 out of 10 Blue cars are also Audis and the first conditional probability p(Audi|Blue)=0.5.
A second conditional probability is determined at S340. The second conditional probability is a probability of selecting a tuple of the sample in which the first predicate is True and the second predicate is False. According to characteristics 400, 50 out of 100 tuples include Audi but a different color than Blue and therefore the second conditional probability p(Audi|
Next, at S350, the first conditional probability is adjusted. The first conditional probability is adjusted based on a number of tuples in the sample in which the second predicate is True, a number of tuples in the sample in which the second predicate is not True, and a number of tuples in the sample in which the first predicate is True and the second predicate is not True.
The adjustment at S350 takes into account a relative number of observations underlying each estimated probability, under the assumption that more observations corresponds to greater certainty. For example, both observed conditional probabilities p(Audi|Blue) and p(Audi|
According to some embodiments, the conditional probabilities may be modeled by a probability density function derived from a Bernoulli process for which the integral from 0 to the observed selectivity
is 0.5 and which integrates to 1. Such a function, referred to herein as Beta distribution B(a, b) with the right choice of parameters (a, b), coincides with the assumption that it is equally likely to under- and overestimate the real selectivity {tilde over (p)}(Audi|Blue) based on k. Plot 510 of
Accordingly, the adjustment of S350 may include balancing predicate selectivities according to their certainty. The intent of S350 is to adjust p(Audi|Blue) and p(Audi|
To illustrate, it is assumed that an external statistic is known, providing a true selectivity of a single query predicate on source table R. In the present example, the true selectivity {tilde over (p)}(Audi)=0.6 is assumed to be known. This value contradicts the observed sample where p(Audi)=0.5. While the observed p(Blue) might still be true despite this contradiction, the conditional probabilities p(Audi|Blue) and p(Audi|
Adjustments according to the present example are depicted in diagram 520 of
Derivation of the beta distribution B according to some embodiments is now described. Let S=s1, . . . , sm be the sample, where s1, . . . , sm are the respective tuples and q a query predicate. A bitvector X=x1, . . . , xm can be derived from S by evaluating q over S. The ith entry in X is set to 1 if and only if q satisfies the ith sample tuple from S. Traditional sampling based estimators would terminate at this point and return the estimate
However, embodiments are not only interested in the point estimate but also in the likelihood of the true selectivity {tilde over (p)} deviating from p. Since the real fraction of satisfying tuples {tilde over (p)} is unknown, it will be treated as a random variable of a function depending on X. In particular, the conditional density function ƒ ({tilde over (p)}=z|X) is desired, with Pr[(a≤{tilde over (p)}≤b)]=∫ab ƒ({tilde over (p)}=z|X)dz which is derived as follows.
First, the conditional probability is transformed using Bayes' rule:
Because m<<n holds, and the sample tuples are selected independently and uniformly at random, the probability Pr[X|{tilde over (p)}=z] can be modeled as Bernoulli process.
Pr[X|{tilde over (p)}=z]=zk(1−z)m-k
The function ƒ(z) is a probability distribution independent of X, which may be referred to as prior distribution. By setting ƒ(z)∝z−α(1−z)−α, α∈(0,1), the following probability density function conditioned on X is obtained:
This is known as the beta distribution B with shape parameters (a, b)=(k+1−α,m−k+1). The denominator of the above is independent of z and can be seen as a normalizing constant. Since the equation is a probability density function and 0≤z≤1, ∫01 ƒ({tilde over (p)}=z|X)dz=1 must hold true. Narrowing the integral's upper bound leads to the cumulative beta distribution (CDF):
CDF(x,B(a,b))=∫0xƒ({tilde over (p)}=z|X)dz=1−∫x1ƒ({tilde over (p)}=z|X)dz
The cumulative distribution therefore describes how likely {tilde over (p)} is underestimated (overestimated) given X. Since
is expected to be the median, α is sought such that:
A good closed form approximation of the median is as follows:
Substituting a=k+1−α, b=m−k+1−α with α=⅔ gives:
Given the approximation, it is concluded that for α=⅔ the median of the respective beta distribution indeed becomes
Since {tilde over (p)}(Audi)=0.6≠0.5=p(Audi), the above equation is no longer satisfied. S350 involves adjusting p(Audi|Blue) and p(Audi|
must hold (0<p(B)<1). Further, let =XA, XB be a matrix with XA, XB as bitvector where the ith entry of XA is set to 1 if and only if the ith sample tuple is an Audi and the ith entry of XB is set to 1 if and only if the ith sample tuple is a Blue. Deriving the distribution B for both conditional probabilities gives:
ƒ({tilde over (p)}(A|B)=z|)=ƒ(zA|B=z|X+)=BA|B
ƒ({tilde over (p)}(A|B)=z|)=ƒ(zA|
where X+ (X−) is the bit vector containing the ith entry of XA if and only if the ith entry of XB equals 1 (0).
S350 includes solving the imposed inconsistency in a way that is most consistent with the observed sample. p(A|B)′=zA|B, p′(A|
p(A)=p(B)zA|B+p(
CDF(zA|B,BA|B)=CDF(zA|
holds.
By substituting zA|B with
it is observed that the constrained solution to p(A)=p(B)zA|B+p(
Returning to process 300, the adjusted first conditional probability is used to determine the selectivity of the query on the table at S360. Specifically, the adjusted first conditional probability is multiplied with the probability of selecting a tuple of the sample in which the second predicate is true. In the present example, the selectivity of query Audi and Blue is determined by multiplying adjusted p(Audi|Blue) with p(Blue).
Process 300 may also be used to balance the certainty of observed correlations from the sample in a case where no tuples of the sample are selected by the query. Since the sample-derived conditional probability is zero, an unbiased initial estimate for l=|Rq| is determined. Second, the beta distribution is modeled with regard to l, where
is the median. Next, the initial estimate is adjusted according to the certainty of the (partial) correlations given by S. In other words, the certainty and uncertainty of the observations are balanced, where the uncertainty results from the limited sample size.
Providing an unbiased estimate for
with respect to the hypergeometric distribution is not trivial when k=0. The sample is assumed to be representative but too small, i.e.
(otherwise P0 could have been modelled by the sample). If it were possible to observe fractions of qualifying tuples, the fraction would be expected to fall between k=0 and k=1 qualifying sample tuples. Due to the lack of additional information, k0=0.5 may be an initial estimate as it is equidistant from k=0 and k=1. Accordingly, the probability of seeing less than 0.5 qualifying sample tuples should be equally likely to observing more than 0.5 tuples. Therefore, given n, m, X˜hypergeometric(n, m, l) and according to the preliminary considerations l is sought such that:
Pr[X≤k0=0.5]≥½ and Pr[X≥k0=0.5]≥½
However, X relates to a discrete distribution, and since there cannot be half a qualifying tuple and Pr[X≤k0=0.5]=Pr[X<k=1]=Pr[X=k=0], the above equation would hold true for every k0∈[0,1). In this case there is no unique solution and 0.5 refers to a so called weak median [13].
However, for {tilde over (x)}˜hypergeometric(n, 2m, l) it can be assumed that
Pr[{tilde over (X)}≤k=1]≥½ and Pr[{tilde over (X)}≥k=1]≥½
⇐Pr[{tilde over (X)}=k=0]=P(n,2m,k=0,l)=½
Therefore, by doubling the sample size k=1 becomes the strong median of {tilde over (X)}, giving raise to a unique solution of l. Accordingly, a number of qualifying relation tuples is determined which provides a 50% chance of observing at least one qualifying sample tuple when considering a sample twice the size. Given m, n, a solution to P(n, 2m, k=0, l)=0.5 is given by:
An initial and unbiased estimate for the fraction of qualifying tuples in view of a sample having no qualifying tuples is
Accordingly, the first conditional probability determined at S330 is
in the case or a sample having no qualifying tuples.
In order to adjust this first conditional probability at S350 according to partially observed correlations, it is first modeled as a distribution as described above. When no sample tuple qualifies for the given predicate, a beta distribution B0=B(a, b) is determined with
The closed form approximation for the median noted above may be used to derive the respective shape parameters (a, b). The approximation is however constrained by a, b>1, and k=0 is a special case in which the median needs to be
instead of
Substituting a according to the approximation gives
thus violating the constraint and leading to a significant approximation error of ˜10%. The inventors have numerically determined that a0=0.634, b0=m provides a consistently accurate approximation of the beta distribution shape parameters for the zero tuple case, where
It can be shown numerically that using a0=0.634, b0=m provides a relative error less than 1% for m>1, which is rapidly decreasing (less than 0.1% for m≥10).
The above discussion provides an unbiased initial estimate for non-qualifying samples where 0.5 tuples are the strong median. Considering one qualifying tuple as a strong median, it becomes straightforward to apply
directly. This provides a 50% chance of observing a non-qualifying sample and a 50% chance of observing at least one qualifying tuple, i.e.:
The following summarizes the shape parameters (a, b) for the beta distribution depending on the (partial) observation:
The above provides a beta distribution that models the uncertainty of an initial estimate p0 that can not be directly observed by the sample. Instead of setting p0=0 as observed in the zero qualifying tuple case,
is used. The use of
introduces an inconsistency to the above equation p(A)=p(A|B)p(B)+p(A|
Returning to the prior example of
Based on the example, the shape parameters for the second conditional density function (i.e., beta distribution) are derived at S340 as:
(a,b)=(kA
Accordingly, the second conditional density function p(Audi|
ƒ({tilde over (p)}(A|
with =XA, XB, where XA, XB are bit vectors of the respective sample columns and X+, X− are separated from XA with respect to the boolean factor Blue of XB.
Next, at S350 both conditional probabilities are treated as variables zAB, zA
CDF(zA|B,B(0.634,10))=CDF(zA|
Solving p(Audi)=zA|Bp(Blue)+zA|
As stated before, p(A|B) and p(A|
to solve p(Audi)=zA|Bp(Blue)+zA|
the problem can be reformulated as a solution to the following equation, which depends on only one variable:
This equation may be solved via Brent's method, which is a combination of the bisection method, the secant method and inverse quadratic interpolation. An open source implementation of via Brent's method is provided by the C++ Boost library. After solving the equation, either zA|
Brent's method requires as input a target function and bounds within which the minimum may be found. Some embodiments call Brent's method for two separate cases—representing both possibilities of
with each call having its own target function and bounds. According to some embodiments, the case for which the bounds are tighter is chosen for solving.
The following target function may be used in some embodiments:
ψ(zA|B):=((CDF(zA|B,BA|B)−CDF(zsub,BA|
where the cumulative distribution function can be computed efficiently for every value of zA|B (zsub) by calling Boost's implementation of the incomplete beta function. However, since calculating the difference of two vastly different floating point numbers is imprecise, the following target function may be used:
Some embodiments employ two approaches to narrow down the search space of Brent's method. First, both zA|B, zA|
Second, under the assumption that the sample is accurate but unable to model the correct selectivity due to its limited size, kobs−1<ktrue<kobs+1. Substituting k with k−1, k+1 gives the respective lower and upper bound for Brent's method.
In a case that the sample constrained by the predicate over attribute B is empty, i.e.
Then BA|B becomes the uniform distribution B(1,1). Since there is nothing to infer from the empty sample zA|B is set to 0 and 1 in order to derive the lower and upper bound for zA|
Considering a conjunction of two predicates (with no qualifying sample tuples), the selectivity estimation may be performed by intersecting two bit vectors and calling Brent's method with the aforementioned target function.
The above-described concepts may be used to estimate the combined selectivity of a conjunctive predicate, solely based on sampling. Generally, and as will be described with respect to process 700 of
If no sample tuple satisfies the full conjunction of all predicates, as many predicates as possible are combined to a subexpression which is satisfied by at least one tuple of the sample. The selectivity of the subexpression is then adjusted as described above based on the relative number of observations of the subexpression and of a next simple predicate. Next, the residual predicates are combined and connected to the previously-visited predicates by determining the selectivities of the residual subexpression and adjusting this selectivity based on the relative number of observations of each subexpression.
With respect to process 700, a query on a table is determined. The table includes a plurality of columns and the query includes a plurality of predicates respectively associated with each column. The following discussion of process 700 assumes that all single predicates have at least one qualifying entry in their respective bit vector, i.e., only (partial) conjunctions lead to a null vector or no qualifying sample tuple, respectively.
A sample of the table is determined at S710 as described above. Next, at S715 a bit vector is determined for each of those columns of the sample which are associated with a predicate of the query. A bit vector for a given column includes a “1” or a “0” for each row of the sample, depending on whether the predicate associated with the given column is satisfied by the value of the column at the row.
At S720, a bit vector of a current column (i.e., A) is intersected (i.e., logical AND) with the bit vector of a next column to generate a result bit vector.
It is determined at S740 that more predicate columns exist, flow returns to S720 to intersect the current result bit vector (associated with columns A and B) with the bit vector of a next column (column C).
Generally, the consecutive bit vectors are intersected as long as qualifying entries in the result bit vector can be observed. If it is eventually determined at S740 that no further columns exist, then the sample includes at least one qualifying tuple and the last-determined selectivity is returned. In some embodiments, if the final result bit vector indicates the presence of only one qualifying tuple in the sample, a selectivity of
may be returned as described above.
Returning to the example, a result bit vector is then formed at S720 based on the result bit vector associated with columns A, B and C and the bit vector associated with column D. Since this bit vector includes all “0”s, flow proceeds from S730 to S750. At S750, first conditional probability pA is determined as equal to the last-determined selectivity (i.e., corresponding to the partial selectivity of the predicates associated with columns A, B and C) and second conditional probability pB is determined as equal to the number of “1”s of the next column (column D) bit vector divided by the sample size.
Next, and as described above, the first conditional probability (i.e., pA=p(ABC) is adjusted at S755 based on the number of “1”s of the result bit vector (i.e., kABC
B
ABC|D
=B(a0=0.634,b0=mD),
B
ABC|D
=B(kABC
As previously discussed BABC|D, BABC|
Next, p(EFG|ABCD) is estimated by connecting the residual predicates and therefore the sample columns EFG.
At S765, pA2=p(EFG), pB2=p(ABC), and =XEFG, XABC. pA2 is then adjusted at S770 as described herein. More specifically, the first conditional probability (i.e., pA2) is adjusted at S770 based on the number of “1”s of the remaining columns combined bit vector (i.e., kEFG
B
1
=B(a0=0.634,b0=mABC),
B
2
=B(kEFG|
which are balanced as described above to return p(EFG|ABC). The selectivity estimate of the full conjunctive predicate then reads:
p(ABCDEFG)=p(EFG|ABC)p(ABC|D)p(D).
Instead of relying on a fixed order of predicates, the above algorithm may use a greedy heuristic. This implementation may take a set of simple predicates as input and return the enumerated predicates along with their combined selectivity. In the first phase, the predicates are enumerated according to the heuristic. If, for example, the predicates over ABC are already enumerated and adding any of the residual predicates leads to zero qualifying tuples, then the selectivity of the first subexpression that results in a non-qualifying sample is estimated. However, besides p(ABCD), estimates are provided for p(ABCE), p(ABCF) and p(ABCG). According to the heuristic, the subexpression with the best cost to selectivity ratio is selected, e.g. p(ABCD).
Then, assuming that qualifying tuples for the combined subexpression over EFG, p(ABCDE)=p(E|ABC)p(ABCD) and p(ABCDF), p(ABCDG) are estimated. If p(ABCDE) has been selected, the process continues with p(ABCDEF)=p(FE|ABC) (ABCD), etc., where all combinations of residual predicates (e.g., the subexpression over FE) are treated as a single predicate.
Enumeration is not limited to the greedy heuristic. The enumerator could separately shuffle subexpressions according to any policy while continuously proposing (partial) plans.
Server node 1400 includes processing unit(s) 1410 operatively coupled to communication device 1420, data storage device 1430, one or more input devices 1440, one or more output devices 1450 and memory 1460. Communication device 1420 may facilitate communication with external devices, such as an external network or a data storage device. Input device(s) 1440 may comprise, for example, a keyboard, a keypad, a mouse or other pointing device, a microphone, knob or a switch, an infra-red (IR) port, a docking station, and/or a touch screen. Input device(s) 1440 may be used, for example, to enter information into apparatus 1400. Output device(s) 1450 may comprise, for example, a display (e.g., a display screen) a speaker, and/or a printer.
Data storage device 1430 may comprise any appropriate persistent storage device, including combinations of magnetic storage devices (e.g., magnetic tape, hard disk drives and flash memory), optical storage devices, Read Only Memory (ROM) devices, etc., while memory 1460 may comprise Random Access Memory (RAM).
Application server 1431 and query processor 1432 may each comprise program code executed by processor(s) 1410 to cause server 1400 to perform any one or more of the processes described herein. Such processes may include estimating selectivities of queries on tables 1434 based on corresponding sample tables 1433. Embodiments are not limited to execution of these processes by a single computing device. Data storage device 1430 may also store data and other program code for providing additional functionality and/or which are necessary for operation of server 1400, such as device drivers, operating system files, etc.
The foregoing diagrams represent logical architectures for describing processes according to some embodiments, and actual implementations may include more or different components arranged in other manners. Other topologies may be used in conjunction with other embodiments. Moreover, each component or device described herein may be implemented by any number of devices in communication via any number of other public and/or private networks. Two or more of such computing devices may be located remote from one another and may communicate with one another via any known manner of network(s) and/or a dedicated connection. Each component or device may comprise any number of hardware and/or software elements suitable to provide the functions described herein as well as any other functions. For example, any computing device used in an implementation some embodiments may include a processor to execute program code such that the computing device operates as described herein.
Embodiments described herein are solely for the purpose of illustration. Those in the art will recognize other embodiments may be practiced with modifications and alterations to that described above.