The subject matter which is regarded as the invention is particularly pointed out and distinctly claimed in the claims at the conclusion of the specification. The foregoing and other objects, features, and advantages of the invention are apparent from the following detailed description taken in conjunction with the accompanying drawings in which:
The detailed description explains the preferred embodiments of the invention, together with advantages and features, by way of example with reference to the drawings.
Reference is made to
Reference is next made to
The method according to the invention, a hybrid approach to selectivity estimation for conjunctive predicates (HASE), makes consistent use of synopses and sample information when both are present. To achieve this goal, the method uses a novel estimation scheme built on a powerful mechanism called generalized raking. The method formalizes selectivity estimation in the presence of single-attribute synopses and sample information as a constrained optimization problem. By solving this problem, the method obtains a new set of weights associated with the sampled tuples, which has the advantageous property of reproducing the known selectivities when applied to the individual predicates.
It will be understood that the description presented herein will be mainly concerned with selectivity estimation for conjunctive predicates of the form Q=P1∧P2∧ . . . ∧Pm, where each component Pi is a simple predicate on a single attribute, taking the form of (attribute op constant) with op being one of the comparison operators <, ≦, =, ≠, ≧, or > (e.g., R.a=100 or R.a≦200).
For example, consider a sample of data with known selectivities of the individual predicates Pi. The method begins with an estimator constructed based on the sample only, without reference to any additional information, such as the HT estimator (Eq. (2)). For each tuple j in table R, in addition to the variable of interest yj, the method in accordance with the invention also associates with it an auxiliary vector xj to reflect the results of evaluating each Pi on j. For purposes of this example, each predicate Pi divides the tuples in R into two disjoint subsets: Di, containing the tuples that satisfy Pi, and its complement D̄i, containing those that do not.
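By way of illustration, the following is a minimal Python sketch of how such an auxiliary vector might be formed; the helper names, attribute names, and values are hypothetical, not part of the invention as claimed:

```python
import numpy as np

# A sketch (hypothetical names): the i-th entry of x_j is 1 if predicate P_i
# holds on tuple j and 0 otherwise, with a trailing constant entry of 1
# (matching t_{x,m+1} = 1 in Eq. (3) below).
def auxiliary_vector(row, predicates):
    """row: dict mapping attribute -> value; predicates: list of callables."""
    indicators = [1.0 if p(row) else 0.0 for p in predicates]
    return np.array(indicators + [1.0])

# Simple predicates of the form (attribute op constant):
predicates = [
    lambda r: r["a"] == 100,   # P1: R.a = 100
    lambda r: r["b"] <= 200,   # P2: R.b <= 200
]
x_j = auxiliary_vector({"a": 100, "b": 250}, predicates)  # -> [1., 0., 1.]
```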
Let txT=(tx1, . . . ,txm,tx,m+1)=1/N Σj∈R xj. Clearly, txi=1/N Σj∈R xji=si (1≦i≦m), the selectivity of predicate Pi, and tx,m+1=1. Therefore,
txT=(s1,s2, . . . ,sm,1)   (3)
Also, for purposes of this example, si can be obtained based on synopsis structures, and xj is observed for each tuple j ∈ S. This allows construction of a new estimator (the calibration estimator)
where the weights wj are as close to the weights dj=1/πj as possible according to some distance metric (recall that πj is the inclusion probability of j), and where
meaning that the weighted average of the observed xj has to reproduce the known selectivities si.
In light of the definition of xj and Eq. (3), Eq. (5) can be rewritten as
where sm+1=s. Now wj has a natural interpretation: it is the number of tuples "represented" by the sampled tuple j.
In the running example, Eq. (6) becomes
Although in general there can be many possible choices for the sets of weights {wj} satisfying the constraints in Eq. (6), the goal of the method is to select a set of new weights that are as close as possible to the original weights dj=1/πj, which enjoy the desirable property of producing unbiased estimates. By keeping the distance between the new weights and the original weights as small as possible, in accordance with one method of the invention, the new weights remain nearly unbiased. Thus, the method advantageously provides a constrained optimization solution as described herein.
The constrained optimization solution. Let D(x) be a distance function (with x=wj/dj) that measures the distance between the new weights wj and the original weights dj. The query optimizer ensures that D(x) satisfies the following requirements (for reasons that will become clear later): (i) D is positive and strictly convex, (ii) D(1)=D′(1)=0, and (iii) D″(1)=1. The optimization problem for the method is then:
subject to
Here, both xj and tx are as defined earlier. Since D(wj/dj) can have a large response to even a slight change in wj when dj is small, the query optimizer minimizes Σj∈S djD(wj/dj) instead of Σj∈S D(wj/dj) in order to dampen this effect. Also note that different distance functions can be used to measure the distance between {wj} and {dj}, as long as the distance function complies with conditions (i) to (iii).
Alternative methods of the invention can choose different distance functions. For example, the following two distance functions may be chosen for computational efficiency and interpretability. Both of these distance functions exhibit properties (i) to (iii):
The linear distance function:
and
The multiplicative distance function:
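The closed forms of these two functions are not reproduced in this text. The following is a minimal Python sketch assuming the standard generalized-raking forms, an assumption consistent with the derivatives D′(x)=x−1 and D′(x)=log x given later in this description:

```python
import numpy as np

def d_lin(x):
    """Assumed linear (chi-square) form: D(x) = (x - 1)^2 / 2.
    Check: D(1) = D'(1) = 0 and D''(x) = 1, satisfying (i)-(iii)."""
    return 0.5 * (x - 1.0) ** 2

def d_mul(x):
    """Assumed multiplicative (raking) form: D(x) = x*log(x) - x + 1.
    Check: D(1) = D'(1) = 0 and D''(x) = 1/x, so D''(1) = 1."""
    return x * np.log(x) - x + 1.0

# Inverse functions F of D', used later in w_j = d_j * F(x_j^T lambda):
f_lin = lambda z: z + 1.0      # from D'(x) = x - 1
f_mul = lambda z: np.exp(z)    # from D'(x) = log x
```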
In accordance with features of the present invention, the following methods may be used to solve the constrained optimization problem. One method for solving constrained optimization problems is the method of Lagrange multipliers. Note that the optimization problem can be stated as follows:
with respect to wj(j ∈ S),
where λ=(λ1, . . . ,λm,λm+1) is a vector of Lagrange multipliers. Differentiating Eq. (10) with respect to wj yields:
D′(wj/dj)−xjTλ=0 (11)
Then solve the system formed by Eq. (11) and (9) for wj. To do this, obtain from (11) that
wj=djF(xjTλ),   (12)
where F(x) is the inverse function of D′(x). Conditions (i)-(iii) dictate that the inverse function always exists and that F(0)=F′(0)=1. Substituting (12) into Eq. (9) results in the calibration equations
which can be solved numerically using Newton's method.
Then obtain successive estimates of λ, denoted by λk (k=0,1, . . . ), through the following iteration:
λk+1=λk+[φ′(λk)]−1φ(λk) (14)
Take λ0=0. Since one has
the first iteration yields λ1=(Σj∈S djxjxjT)−1(Σj∈S djxj−Ntx). The subsequent values of λk can be obtained following Eq. (14) until convergence.
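The following is a minimal Python sketch of this iterative solution. It assumes the calibration equations take the form Σj∈S djF(xjTλ)xj=Ntx (consistent with Eq. (3) and the expression for λ1 above) and uses the standard Newton update; since the definition of φ is not reproduced in this text, the sign convention may differ from Eq. (14):

```python
import numpy as np

def solve_calibration(X, d, t_x, N, F, F_prime, tol=1e-8, max_iter=50):
    """Newton's method for the calibration equations (a sketch).

    X:   (n, m+1) array of auxiliary vectors x_j for the sampled tuples
    d:   (n,) design weights d_j = 1/pi_j
    t_x: (m+1,) vector (s_1, ..., s_m, 1) from Eq. (3)
    F:   inverse function of D', with derivative F_prime
    """
    lam = np.zeros(X.shape[1])                     # lambda_0 = 0
    for _ in range(max_iter):
        z = X @ lam
        g = X.T @ (d * F(z)) - N * t_x             # residual of the equations
        if np.linalg.norm(g) < tol:
            break
        J = X.T @ (X * (d * F_prime(z))[:, None])  # Jacobian of the residual
        lam -= np.linalg.solve(J, g)
    return d * F(X @ lam), lam                     # calibrated weights, lambda

# For the linear function a single iteration suffices:
#   w, lam = solve_calibration(X, d, t_x, N,
#                              F=lambda z: z + 1.0,
#                              F_prime=lambda z: np.ones_like(z))
# The estimate is then s_cal = (w @ y) / N, with y the 0/1 indicator of Q.
```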
In summary, the method to estimate the selectivity of Q is presented in
From Table 1(a) and Table 1(b), it is seen that the true selectivity of Q is 0.05 (the cell corresponding to P1=true∧P2=true in
Clearly, the marginal frequencies obtained from the sample do not agree with the true marginal frequencies; therefore, calibration is needed. Applying the method shown in
wj≈60 for j ∈ S∩D1∩D2, wj≈102 for j ∈ S∩D1∩D̄2, and wj≈97 for j ∈ S∩D̄1∩D2.
The selectivity estimate can then be determined:
The estimation error is E(ŝcal)=|0.054−0.05|/0.05=8%. Compared with the error of the prior art synopsis-based estimate E(ŝhis)=260% and the error of the prior art sampling-based estimate E(ŝspl)=80%, this method represents a significant improvement in the estimation accuracy.
An alternative implementation. An alternative method is now presented for solving the calibration equations, which takes advantage of the intrinsic structure of the equations in (6) and does not require matrix inversion.
Since wj=djF(xjTλ), Eq. (6) becomes
Observe that the i-th equation (2≦i≦m) can be solved for λi, assuming all other λl (l≠i) are known, and the first and last equations can be solved for λ1 and λm+1, assuming all other λl (l≠1, l≠m+1) are known.
This method is shown in
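The following is a minimal Python sketch of such a coordinate-wise iteration for the multiplicative function F(z)=e^z, under two stated assumptions: each entry of xj is a 0/1 indicator satisfied by at least one sampled tuple, and all equations are cycled through uniformly rather than treating the first and last equations separately:

```python
import numpy as np

def solve_by_raking(X, d, t_x, N, tol=1e-8, max_sweeps=100):
    """Coordinate-wise (raking-style) solution of the calibration equations
    with F(z) = exp(z). Because each x_ji is a 0/1 indicator, the i-th
    equation can be solved for lambda_i in closed form with the other
    multipliers held fixed; no matrix inversion is required."""
    lam = np.zeros(X.shape[1])
    for _ in range(max_sweeps):
        max_change = 0.0
        for i in range(lam.size):
            partial = X @ lam - X[:, i] * lam[i]   # all terms except lambda_i
            denom = np.sum(d * np.exp(partial) * X[:, i])
            new_i = np.log(N * t_x[i] / denom)     # closed-form solution
            max_change = max(max_change, abs(new_i - lam[i]))
            lam[i] = new_i
        if max_change < tol:
            break
    return d * np.exp(X @ lam)                     # calibrated weights
```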
Distance measures. We now present the implications of the choice of the distance function D described earlier. In general, different distance functions result in different calibration estimators. However, it will be understood that regardless of the distance function used (as long as it complies with conditions (i)-(iii)), the resulting estimators are asymptotically equivalent.
Therefore, for medium- to large-sized samples (empirically, with sample size greater than 30), the choice of distance function does not have a heavy impact on the properties of the estimator; in general, only slight differences arise in the estimates produced by different functions.
The main difference between the distance functions is thus their computational efficiency as well as interpretability.
For the linear function, Dlin, D′(x)=x−1; therefore, the inverse function is F(z)=z+1. In
λ1=(Σj∈S djxjxjT)−1(Σj∈S djxj−Ntx).
Therefore, when the linear function is used, only one iteration is required, which makes the linear function the faster of the two distance functions considered here. A major drawback of this function is that the weights can be negative, which can lead to negative selectivity estimates. For instance, in the running example, suppose a sample of size 10 is taken from R and the observed frequencies are as follows: P1=true∧P2=true: 2; P1=true∧P2=false: 5; P1=false∧P2=true: 3; P1=false∧P2=false: 0. Solving the calibration equation results in wj=−500 for j ∈ S∩D1∩D2. Therefore, the selectivity estimate is ŝcal=2×(−500)/10000=−0.1. Negative weights and selectivity estimates have no natural interpretation and are thus undesirable. Note, however, that this usually occurs only for small-sized samples. When the sample size gets large, all estimators with distance functions satisfying conditions (i)-(iii) are asymptotically equivalent and give positive weights and selectivity estimates.
For the multiplicative function, Dmul, D′(x)=log x; the inverse function is therefore F(z)=ez. When the multiplicative function is used, more than one iteration may be required, but the method often converges after only a few iterations (typically two in our experiments). An advantage of using this function is that it always leads to positive weights, because wj=djF(xjTλ)=dj exp {xjTλ}>0.
Probabilistic bounds on the estimation error. Let πjl be the probability that both j and l are included in the sample, and let πjj=πj. Assume that the sampling scheme is such that the πjl's are strictly positive. Let β be a vector satisfying the equation
and let Δjl=πjl−πjπl and εj=yj−xjTβ. This gives the following result on the bounds of the estimation error: when the sample size is sufficiently large, for a given constant α ∈ (0,1), the selectivity sQ is bounded by (ŝcal−zα/2√V(ŝcal), ŝcal+zα/2√V(ŝcal)) with probability 1−α, where zα/2 is the upper α/2 point of the standard normal distribution, and V(ŝcal)=Σj∈R Σl∈R (Δjl/πjl)(wjεj)(wlεl).
Proof Sketch: When the linear distance function is used, wj=dj(1+xjTλ). It is known from the discussion above that the solution of the calibration equation converges at λ=(Σj∈S djxjxjT)−1(Σj∈S djxj−Ntx). Therefore, wj=dj[1+xjT(Σj∈S djxjxjT)−1(Σj∈S djxj−Ntx)]. Let β̂s be the solution to the equation
Then the estimator ŝcal can be written as
which takes the form of a generalized regression estimator (GREG). Applying results on the asymptotic variance of GREG yields the asymptotic variance of the estimator ŝcal:
Since it has been shown that all estimators with distance functions satisfying conditions (i)-(iii) are asymptotically equivalent, all such estimators have the same asymptotic variance V(ŝcal). When the sample S is large enough, the Central Limit Theorem applies. Therefore, for a given constant α ∈ (0,1), sQ is bounded by (ŝcal−zα/2√V(ŝcal), ŝcal+zα/2√V(ŝcal)) with probability 1−α.
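As an illustration, the following minimal Python sketch computes the interval from given values of ŝcal and V(ŝcal); the numeric inputs shown are hypothetical:

```python
import numpy as np
from scipy import stats

def confidence_interval(s_cal, V_cal, alpha=0.05):
    """Bounds on s_Q with probability 1 - alpha (a sketch): the interval is
    s_cal +/- z_{alpha/2} * sqrt(V(s_cal)), with z_{alpha/2} the upper
    alpha/2 point of the standard normal distribution."""
    z = stats.norm.ppf(1.0 - alpha / 2.0)
    half = z * np.sqrt(V_cal)
    return s_cal - half, s_cal + half

# Hypothetical values for illustration only:
lo, hi = confidence_interval(s_cal=0.054, V_cal=1e-4)
```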
Utilizing multi-attribute synopses. In the discussion, it has been assumed that there is prior knowledge of the selectivities si of individual predicates Pi based on single-attribute synopsis structures. However, it will be understood that the estimation procedure can be advantageously extended so that multi-attribute synopsis structures can also be utilized when they are present.
For example, suppose that a multi-dimensional synopsis exists on a set of attributes A. In accordance with one method of the invention, it is straightforward to derive lower-dimensional synopses from higher-dimensional ones, i.e., synopses on any subset(s) of A can be obtained from the synopsis on A. Let AQ be the set of attributes involved in query Q. If A∩AQ≠Ø, the synopsis on A can be utilized. Let U=A∩AQ, and let PU be the conjunction of the predicates in which attributes in U are involved. Then the selectivity sU of PU can be estimated based on the synopsis on U. We augment the auxiliary vector xj by an additional element reflecting whether j satisfies PU. Changes are also made accordingly to tx, with the addition of an element with value sU. The algorithms for solving the calibration equations presented above can then be applied in order to obtain ŝcal.
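The following is a minimal Python sketch of this augmentation step; the array names are hypothetical:

```python
import numpy as np

def augment_for_multiattr_synopsis(X, t_x, satisfies_PU, s_U):
    """Append to each auxiliary vector a 0/1 indicator of whether the
    sampled tuple satisfies P_U, and append the synopsis-derived
    selectivity s_U to t_x; the calibration solvers above then apply
    unchanged to the augmented inputs.

    satisfies_PU: (n,) 0/1 array, one entry per sampled tuple."""
    X_aug = np.column_stack([X, np.asarray(satisfies_PU, dtype=float)])
    t_x_aug = np.append(t_x, s_U)
    return X_aug, t_x_aug
```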
Experimental evaluation. This section reports the results of an experimental evaluation of the estimation methods disclosed herein. The following compares the accuracy of the methods in accordance with the invention with that of the synopsis-based and sampling-based approaches, using synthetic data sets as well as a real data set. The real data set used is the Census Income data.
Synthetic data are used to study the properties of the methods presented herein in a controlled manner. A large number of synthetic data sets are generated by varying the following parameters:
Data skew: The data in each attribute are generated from a Zipfian distribution with parameter z ranging from 0 (uniform distribution) to 3 (highly-skewed distribution). The number of distinct values in each attribute is fixed to 10.
Correlation: By default, the data are independently generated for each attribute. We introduce correlation between a pair of attributes by transforming the data such that the correlation coefficient between the two attributes is approximately ρ. The parameter ρ ranges from 0 to 1, representing an increasing degree of correlation. In particular, ρ=0 corresponds to the case where there is no correlation between the two attributes; ρ=1 indicates that the two attributes are fully dependent, i.e., knowing the value of one attribute enables one to perfectly predict the value of the other. This is achieved by first independently generating the data for both attributes (say, A1 and A2) and then performing the following transformation: for each tuple with A1=a1 and A2=a2, replace a2 by a1×ρ+a2×√(1−ρ2), suitably rounded. For three or more attributes, data are created such that the correlation coefficient between any pair of attributes is approximately ρ.
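The following is a minimal Python sketch of this generation procedure; clipping transformed values back into the attribute domain is an assumed reading of "suitably rounded":

```python
import numpy as np

def generate_correlated_pair(n, z, rho, n_distinct=10, seed=0):
    """Draw two attributes independently from a Zipfian distribution over
    n_distinct values (z = 0 gives the uniform distribution), then apply
    a2 <- a1*rho + a2*sqrt(1 - rho^2), suitably rounded, so that the pair
    has correlation coefficient approximately rho."""
    rng = np.random.default_rng(seed)
    values = np.arange(1, n_distinct + 1)
    probs = 1.0 / values.astype(float) ** z      # Zipfian frequencies
    probs /= probs.sum()
    a1 = rng.choice(values, size=n, p=probs)
    a2 = rng.choice(values, size=n, p=probs)
    a2 = np.rint(a1 * rho + a2 * np.sqrt(1.0 - rho**2)).astype(int)
    return a1, np.clip(a2, 1, n_distinct)        # clip: an assumed convention
```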
The real data set Census Income contains weighted census data extracted from the 1994 and 1995 population surveys conducted by the U.S. Census Bureau. It has 199,523 tuples and 40 attributes representing demographic and employment related information. Out of the 40 attributes, 7 are continuous, and 33 are nominal.
The following evaluates the methods presented herein on two different query workloads. The first set of queries consists of 100 range queries, where each predicate in the query takes the form of (attribute<=constant) with a randomly chosen constant. The second set of queries consists of 100 equality queries, where each predicate takes the form of (attribute=constant) with a randomly chosen constant.
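The following is a minimal Python sketch of such a workload generator; the attribute names and domains are hypothetical:

```python
import numpy as np

def make_query(attributes, domains, rng, op="<="):
    """One conjunctive query: a predicate (attribute op constant) per
    attribute, with the constant drawn at random from the domain."""
    return [(a, op, rng.choice(domains[a])) for a in attributes]

rng = np.random.default_rng(0)
domains = {"a": np.arange(1, 11), "b": np.arange(1, 11)}
range_queries = [make_query(["a", "b"], domains, rng) for _ in range(100)]
equality_queries = [make_query(["a", "b"], domains, rng, op="=")
                    for _ in range(100)]
```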
It will also be appreciated that simple random sampling is used as the sampling scheme in the experiments for both the sampling-based approach and the methods presented herein. All numbers reported are averages over 30 repetitions.
It will also be understood that the exact frequency distributions of individual attributes are used as the synopses, and that the absolute relative error defined in Eq. (1) is used as the error metric.
Results on synthetic data. In all experiments, similar trends are observed for both range and equality queries; thus only the results on range queries are reported because of space limitations.
First, the effects of various parameters are shown for the case of two attributes (i.e., only two predicates on two different attributes are involved in the query); then the effect of the number of attributes on the estimation accuracy is shown. The individual selectivities are obtained based on the frequencies of values in each attribute. Since the results indicate that the number of tuples T in the table does not have a significant effect on the accuracy of the estimators, only the results for T=100,000 are shown here.
Correlation. The effect of the correlation between attributes on the estimation accuracy is shown by varying the correlation coefficient ρ from 0 to 1, representing an increasing degree of correlation.
The accuracy of the methods in accordance with the present invention increases with the degree of correlation. Since the methods utilize sample information, when the degree of correlation increases, the number of distinct value combinations in the two attributes decreases, as the data become more "concentrated". Therefore, the sample space (containing all distinct value combinations) becomes smaller, and sampling becomes more efficient (i.e., for a given sample rate, it is more likely that the sample includes a tuple satisfying the query).
In addition, as the degree of correlation increases, the benefit of adjusting the weights in accordance with known single-attribute synopses becomes more evident. In the extreme case where the two attributes are fully dependent (ρ=1), the method essentially produces the exact selectivity, provided that there is at least one tuple in the sample satisfying the query.
To understand why this is the case, consider the following query: Q=P1∧P2=(A1=a)∧(A2=b). Full dependency dictates that if there is at least one tuple in the table satisfying this query, then for any other value c (c≠a) in A1 and d (d≠b) in A2, both (A1=a)∧(A2=d) and (A1=c)∧(A2=b) evaluate to false on every tuple. This implies that s=s1=s2.
Therefore, if in the auxiliary vector xj for tuple j we have xj1=1 (which corresponds to A1=a), then yj (the variable indicating whether j satisfies Q) must also be 1, and vice versa. Since s1 is known, we have
as a constraint in the optimization problem. If we can find a set of wj that satisfy this constraint, then the calibration estimator
must also yield s1, which means a perfect selectivity estimate.
One exception to this analysis is that when there is no tuple j ∈ S satisfying Q, it may no longer be possible to produce the exact estimate. In such cases, all yj (j ∈ S) are 0; therefore, regardless of the weights, the calibration estimator 1/N Σj∈S wjyj will also be zero, which may be different from the exact selectivity.
In all cases, the methods disclosed herein produce significantly more accurate estimates than the sampling-based method, with a 50%-100% reduction in error. Both distance functions give very close estimates, verifying the claim that estimators using different distance functions are asymptotically equivalent.
Data skew. The effect of data skew is studied by varying the Zipfian parameter z from 0 (uniform) to 3 (highly skewed); a typical result is shown in
Note that this situation is different from the case of increasing correlation as discussed above. The main effect of increasing the skew is a decrease in the frequencies of some value combinations, not necessarily reducing the number of value combinations present in the table. Increasing correlation, on the other hand, generally results in a reduction in the number of value combinations.
Another interesting observation from
Sample rate.
It is also worth noting that the methods of the present invention may achieve the same level of accuracy with a much smaller sample rate than that required by the sampling-based approach. For example, in
Number of attributes. The number of attributes involved in the query is varied from 2 to 5 to study the impact of the number of attributes on the estimation accuracy. A typical result is shown in
Note from
Results on real data. Since the Census Income data set has 40 attributes, there are 40×39=1560 ordered attribute pairs. One hundred attribute pairs are randomly chosen, and the accuracy of the methods disclosed herein is compared with that of the prior art approaches as the sample rate increases; the results are shown in
It will be understood that the capabilities of the present invention can be implemented in software, firmware, hardware or some combination thereof.
As one example, one or more aspects of the present invention can be included in an article of manufacture (e.g., one or more computer program products) having, for instance, computer usable media. The media has embodied therein, for instance, computer readable program code means for providing and facilitating the capabilities of the present invention. The article of manufacture can be included as a part of a computer system or sold separately.
Additionally, at least one program storage device readable by a machine, tangibly embodying at least one program of instructions executable by the machine to perform the capabilities of the present invention can be provided.
The flow diagrams depicted herein are just examples. There may be many variations to these diagrams or the steps (or operations) described therein without departing from the spirit of the invention. For instance, the steps may be performed in a differing order, or steps may be added, deleted or modified. All of these variations are considered a part of the claimed invention.
While the preferred embodiment of the invention has been described, it will be understood that those skilled in the art, both now and in the future, may make various improvements and enhancements which fall within the scope of the claims which follow. These claims should be construed to maintain the proper protection for the invention first described.