The subject matter which is regarded as the invention is particularly pointed out and distinctly claimed in the claims at the conclusion of the specification. The foregoing and other objects, features, and advantages of the invention are apparent from the following detailed description taken in conjunction with the accompanying drawings in which:
The detailed description explains the preferred embodiments of the invention, together with advantages and features, by way of example with reference to the drawings.
Reference is made to
Reference is next made to
The method according to the invention, a hybrid approach to selectivity estimation for conjunctive predicates (HASE), makes consistent use of synopses and sample information when both are present. To achieve this goal, the method uses a novel estimation scheme built on a powerful mechanism called generalized raking. The method formalizes selectivity estimation in the presence of single-attribute synopses and sample information as a constrained optimization problem. By solving this problem, the method obtains a new set of weights associated with the sampled tuples, which has the advantageous property of reproducing the known selectivities when applied to the individual predicates.
It will be understood that the description presented herein will be mainly concerned with selectivity estimation for conjunctive predicates of the form Q=P1∧P2∧ . . . ∧Pm, where each component Pi is a simple predicate on a single attribute, taking the form of (attribute op constant) with op being one of the comparison operators <, ≦, =, ≠, ≧, or > (e.g., R.a=100 or R.a≦200).
For example, consider a sample of data with known selectivities of the individual predicates Pi. The method begins with an estimator constructed based on the sample only, without reference to any additional information, such as the HT estimator (Eq. (2)). For each tuple j in table R, in addition to the variable of interest yj, the method in accordance with the invention also associates with it an auxiliary vector xj to reflect the results of evaluating each Pi on j. For purposes of this example, each predicate Pi divides the tuples in R into two disjoint subsets: Di, containing the tuples that satisfy Pi, and its complement D̄i, containing those that do not.
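By way of illustration, the following is a minimal Python sketch of how such an auxiliary vector might be formed; the helper names, attribute names, and values are hypothetical, not part of the invention as claimed:

```python
import numpy as np

# A sketch (hypothetical names): the i-th entry of x_j is 1 if predicate P_i
# holds on tuple j and 0 otherwise, with a trailing constant entry of 1
# (matching t_{x,m+1} = 1 in Eq. (3) below).
def auxiliary_vector(row, predicates):
    """row: dict mapping attribute -> value; predicates: list of callables."""
    indicators = [1.0 if p(row) else 0.0 for p in predicates]
    return np.array(indicators + [1.0])

# Simple predicates of the form (attribute op constant):
predicates = [
    lambda r: r["a"] == 100,   # P1: R.a = 100
    lambda r: r["b"] <= 200,   # P2: R.b <= 200
]
x_j = auxiliary_vector({"a": 100, "b": 250}, predicates)  # -> [1., 0., 1.]
```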
Let txT=(tx1, . . . ,txm,tx,m+1)=1/N Σj∈R xj. Clearly, txi=1/N Σj∈R xji=si (1≦i≦m), the selectivity of predicate Pi, and tx,m+1=1. Therefore,
txT=(s1,s2, . . . ,sm,1)   (3)
Also, for purposes of this example, si can be obtained based on synopsis structures, and xj is observed for each tuple j ∈ S. This allows construction of a new estimator (the calibration estimator)
where the weights wj are as close to the weights dj=1/πj as possible according to some distance metric (recall that πj is the inclusion probability of j), and where
meaning that the weighted average of the observed xj has to reproduce the known selectivities si.
In light of the definition of xj and Eq. (3), Eq. (5) can be rewritten as
where sm+1=s. Now wj has a natural interpretation: it is the number of tuples "represented" by the sampled tuple j.
In the running example, Eq. (6) becomes
Although in general there can be many possible choices for the sets of weights {wj} satisfying the constraints in Eq. (6), the goal of the method is to select a set of new weights that are as close as possible to the original weights dj=1/πj, which enjoy the desirable property of producing unbiased estimates. By keeping the distance between the new weights and the original weights as small as possible, in accordance with one method of the invention, the new weights remain nearly unbiased. Thus, the method advantageously provides a constrained optimization solution as described herein.
The constrained optimization solution. Let D(x) be a distance function (with x=wj/dj) that measures the distance between the new weights wj and the original weights dj. The query optimizer ensures that D(x) satisfies the following requirements (for reasons that will become clear later): (i) D is positive and strictly convex, (ii) D(1)=D′(1)=0, and (iii) D″(1)=1. The optimization problem for the method is then:
subject to
Here, both xj and tx are as defined earlier. Since D(wj/dj) can have a large response to even a slight change in wj when dj is small, the query optimizer minimizes Σj∈S djD(wj/dj) instead of Σj∈S D(wj/dj) in order to dampen this effect. Also note that different distance functions can be used to measure the distance between {wj} and {dj}, as long as the distance function complies with conditions (i) to (iii).
Alternative methods of the invention can choose different distance functions. For example, the following two distance functions may be chosen for computational efficiency and interpretability. Both of these distance functions exhibit properties (i) to (iii):
The linear distance function:
and
The multiplicative distance function:
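The closed forms of these two functions are not reproduced in this text. The following is a minimal Python sketch assuming the standard generalized-raking forms, an assumption consistent with the derivatives D′(x)=x−1 and D′(x)=log x given later in this description:

```python
import numpy as np

def d_lin(x):
    """Assumed linear (chi-square) form: D(x) = (x - 1)^2 / 2.
    Check: D(1) = D'(1) = 0 and D''(x) = 1, satisfying (i)-(iii)."""
    return 0.5 * (x - 1.0) ** 2

def d_mul(x):
    """Assumed multiplicative (raking) form: D(x) = x*log(x) - x + 1.
    Check: D(1) = D'(1) = 0 and D''(x) = 1/x, so D''(1) = 1."""
    return x * np.log(x) - x + 1.0

# Inverse functions F of D', used later in w_j = d_j * F(x_j^T lambda):
f_lin = lambda z: z + 1.0      # from D'(x) = x - 1
f_mul = lambda z: np.exp(z)    # from D'(x) = log x
```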
In accordance with features of the present invention, the following methods may be used to solve the constrained optimization problem. One method for solving constrained optimization problems is the method of Lagrange multipliers. Note that the optimization problem can be stated as follows:
with respect to wj(j ∈ S),
where λ=(λ1, . . . ,λm,λm+1) is a vector of Lagrange multipliers. Differentiating Eq. (10) with respect to wj yields:
D′(wj/dj)−xjTλ=0 (11)
Then solve the system formed by Eq. (11) and (9) for wj. To do this, obtain from (11) that
wj=djF(xjTλ),   (12)
where F(x) is the inverse function of D′(x). Conditions (i)-(iii) dictate that the inverse function always exists and that F(0)=F′(0)=1. Substituting (12) into Eq. (9) results in the calibration equations
which can be solved numerically using Newton's method.
Then obtain successive estimates of λ, denoted by λk (k=0,1, . . . ), through the following iteration:
λk+1=λk+[φ′(λk)]−1φ(λk) (14)
Take λ0=0. Since one has
the first iteration yields λ1=(Σj∈S djxjxjT)−1(Σj∈S djxj−Ntx). The subsequent values of λk can be obtained following Eq. (14) until convergence.
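The following is a minimal Python sketch of this iterative solution. It assumes the calibration equations take the form Σj∈S djF(xjTλ)xj=Ntx (consistent with Eq. (3) and the expression for λ1 above) and uses the standard Newton update; since the definition of φ is not reproduced in this text, the sign convention may differ from Eq. (14):

```python
import numpy as np

def solve_calibration(X, d, t_x, N, F, F_prime, tol=1e-8, max_iter=50):
    """Newton's method for the calibration equations (a sketch).

    X:   (n, m+1) array of auxiliary vectors x_j for the sampled tuples
    d:   (n,) design weights d_j = 1/pi_j
    t_x: (m+1,) vector (s_1, ..., s_m, 1) from Eq. (3)
    F:   inverse function of D', with derivative F_prime
    """
    lam = np.zeros(X.shape[1])                     # lambda_0 = 0
    for _ in range(max_iter):
        z = X @ lam
        g = X.T @ (d * F(z)) - N * t_x             # residual of the equations
        if np.linalg.norm(g) < tol:
            break
        J = X.T @ (X * (d * F_prime(z))[:, None])  # Jacobian of the residual
        lam -= np.linalg.solve(J, g)
    return d * F(X @ lam), lam                     # calibrated weights, lambda

# For the linear function a single iteration suffices:
#   w, lam = solve_calibration(X, d, t_x, N,
#                              F=lambda z: z + 1.0,
#                              F_prime=lambda z: np.ones_like(z))
# The estimate is then s_cal = (w @ y) / N, with y the 0/1 indicator of Q.
```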
In summary, the method to estimate the selectivity of Q is presented in
From Table 1(a) and Table 1(b), it is seen that the true selectivity of Q is 0.05 (the cell corresponding to P1=true∧P2=true in
Clearly, the marginal frequencies obtained from the sample do not agree with the true marginal frequencies; therefore, calibration is needed. Applying the method shown in
wj≈60 for j ∈ S∩D1∩D2, wj≈102 for j ∈ S∩D1∩D̄2, and wj≈97 for j ∈ S∩D̄1∩D2.
The selectivity estimate can then be determined:
The estimation error is E(ŝcal)=|0.054−0.05|/0.05=8%. Compared with the error of the prior art synopsis-based estimate E(ŝhis)=260% and the error of the prior art sampling-based estimate E(ŝspl)=80%, this method represents a significant improvement in the estimation accuracy.
An alternative implementation. An alternative method is now presented for solving the calibration equations, which takes advantage of the intrinsic structure of the equations in (6) and does not require matrix inversion.
Since wj=djF(xjTλ), Eq. (6) becomes
Observe that the i-th equation (2≦i≦m) can be solved for λi, assuming all other λl (l≠i) are known, and the first and last equations can be solved for λ1 and λm+1, assuming all other λl (l≠1, l≠m+1) are known.
This method is shown in
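The following is a minimal Python sketch of such a coordinate-wise iteration for the multiplicative function F(z)=e^z, under two stated assumptions: each entry of xj is a 0/1 indicator satisfied by at least one sampled tuple, and all equations are cycled through uniformly rather than treating the first and last equations separately:

```python
import numpy as np

def solve_by_raking(X, d, t_x, N, tol=1e-8, max_sweeps=100):
    """Coordinate-wise (raking-style) solution of the calibration equations
    with F(z) = exp(z). Because each x_ji is a 0/1 indicator, the i-th
    equation can be solved for lambda_i in closed form with the other
    multipliers held fixed; no matrix inversion is required."""
    lam = np.zeros(X.shape[1])
    for _ in range(max_sweeps):
        max_change = 0.0
        for i in range(lam.size):
            partial = X @ lam - X[:, i] * lam[i]   # all terms except lambda_i
            denom = np.sum(d * np.exp(partial) * X[:, i])
            new_i = np.log(N * t_x[i] / denom)     # closed-form solution
            max_change = max(max_change, abs(new_i - lam[i]))
            lam[i] = new_i
        if max_change < tol:
            break
    return d * np.exp(X @ lam)                     # calibrated weights
```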
Distance measures. We now present the implications of the choice of the distance function D described earlier. In general, different distance functions result in different calibration estimators. However, it will be understood that regardless of the distance function used (as long as it complies with conditions (i)-(iii)), the resulting estimators are asymptotically equivalent.
Therefore, for medium- to large-sized samples (empirically, with sample size greater than 30), the choice of distance function does not have a heavy impact on the properties of the estimator; in general, only slight differences arise in the estimates produced by different functions.
The main difference between the distance functions is thus their computational efficiency as well as interpretability.
For the linear function, Dlin, D′(x)=x−1; therefore, the inverse function is F(z)=z+1. In
λ1=(Σj∈S djxjxjT)−1(Σj∈S djxj−Ntx).
Therefore, when the linear function is used, only one iteration is required, which makes the linear function the faster of the two distance functions considered here. A major drawback of this function is that the weights can be negative, which can lead to negative selectivity estimates. For instance, in the running example, suppose a sample of size 10 is taken from R and the observed frequencies are as follows: P1=true∧P2=true: 2; P1=true∧P2=false: 5; P1=false∧P2=true: 3; P1=false∧P2=false: 0. Solving the calibration equation results in wj=−500 for j ∈ S∩D1∩D2. Therefore, the selectivity estimate is ŝcal=2×(−500)/10000=−0.1. Negative weights and selectivity estimates have no natural interpretation and are thus undesirable. Note, however, that this usually occurs only for small-sized samples. When the sample size gets large, all estimators with distance functions satisfying conditions (i)-(iii) are asymptotically equivalent and give positive weights and selectivity estimates.
For the multiplicative function, Dmul, D′(x)=log x; the inverse function is therefore F(z)=ez. When the multiplicative function is used, more than one iteration may be required, but the method often converges after only a few iterations (typically two in our experiments). An advantage of using this function is that it always leads to positive weights, because wj=djF(xjTλ)=dj exp {xjTλ}>0.
Probabilistic bounds on the estimation error. Let πjl be the probability that both j and l are included in the sample, and let πjj=πj. Assume that the sampling scheme is such that the πjl's are strictly positive. Let β be a vector satisfying the equation
and let Δjl=πjl−πjπl and εj=yj−xjTβ. This gives the following result on the bounds of the estimation error: when the sample size is sufficiently large, for a given constant α ∈ (0,1), the selectivity sQ is bounded by (ŝcal−zα/2√V(ŝcal), ŝcal+zα/2√V(ŝcal)) with probability 1−α, where zα/2 is the upper α/2 point of the standard normal distribution, and V(ŝcal)=Σj∈R Σl∈R (Δjl/πjl)(wjεj)(wlεl).
Proof Sketch: When the linear distance function is used, wj=dj(1+xjTλ). It is known from the discussion above that the solution of the calibration equation converges at λ=(Σj∈S djxjxjT)−1(Σj∈S djxj−Ntx). Therefore, wj=dj[1+xjT(Σj∈S djxjxjT)−1(Σj∈S djxj−Ntx)]. Let β̂s be the solution to the equation
Then the estimator ŝcal can be written as
which takes the form of a generalized regression estimator (GREG). Applying results on the asymptotic variance of GREG yields the asymptotic variance of the estimator ŝcal:
Since it has been shown that all estimators with distance functions satisfying conditions (i)-(iii) are asymptotically equivalent, all such estimators have the same asymptotic variance V(ŝcal). When the sample S is large enough, the Central Limit Theorem applies. Therefore, for a given constant α ∈ (0,1), sQ is bounded by (ŝcal−zα/2√V(ŝcal), ŝcal+zα/2√V(ŝcal)) with probability 1−α.
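As an illustration, the following minimal Python sketch computes the interval from given values of ŝcal and V(ŝcal); the numeric inputs shown are hypothetical:

```python
import numpy as np
from scipy import stats

def confidence_interval(s_cal, V_cal, alpha=0.05):
    """Bounds on s_Q with probability 1 - alpha (a sketch): the interval is
    s_cal +/- z_{alpha/2} * sqrt(V(s_cal)), with z_{alpha/2} the upper
    alpha/2 point of the standard normal distribution."""
    z = stats.norm.ppf(1.0 - alpha / 2.0)
    half = z * np.sqrt(V_cal)
    return s_cal - half, s_cal + half

# Hypothetical values for illustration only:
lo, hi = confidence_interval(s_cal=0.054, V_cal=1e-4)
```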
Utilizing multi-attribute synopses. In the discussion, it has been assumed that there is prior knowledge of the selectivities si of individual predicates Pi based on single-attribute synopsis structures. However, it will be understood that the estimation procedure can be advantageously extended so that multi-attribute synopsis structures can also be utilized when they are present.
For example, suppose that a multi-dimensional synopsis exists on a set of attributes A. In accordance with one method of the invention, it is straightforward to derive lower-dimensional synopses from higher-dimensional ones, i.e., synopses on any subset(s) of A can be obtained from the synopsis on A. Let AQ be the set of attributes involved in query Q. If A∩AQ≠Ø, the synopsis on A can be utilized. Let U=A∩AQ, and let PU be the conjunction of the predicates in which attributes in U are involved. Then the selectivity sU of PU can be estimated based on the synopsis on U. We augment the auxiliary vector xj by an additional element reflecting whether j satisfies PU. Changes are also made accordingly to tx, with the addition of an element with value sU. The algorithms for solving the calibration equations presented above can then be applied in order to obtain ŝcal.
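The following is a minimal Python sketch of this augmentation step; the array names are hypothetical:

```python
import numpy as np

def augment_for_multiattr_synopsis(X, t_x, satisfies_PU, s_U):
    """Append to each auxiliary vector a 0/1 indicator of whether the
    sampled tuple satisfies P_U, and append the synopsis-derived
    selectivity s_U to t_x; the calibration solvers above then apply
    unchanged to the augmented inputs.

    satisfies_PU: (n,) 0/1 array, one entry per sampled tuple."""
    X_aug = np.column_stack([X, np.asarray(satisfies_PU, dtype=float)])
    t_x_aug = np.append(t_x, s_U)
    return X_aug, t_x_aug
```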
Experimental evaluation. This section reports the results of an experimental evaluation of the estimation methods disclosed herein. The following compares the accuracy of the methods in accordance with the invention with that of the synopsis-based and sampling-based approaches, using synthetic data sets as well as a real data set. The real data set used is the Census Income data.
Synthetic data are used to study the properties of the methods presented herein in a controlled manner. A large number of synthetic data sets are generated by varying the following parameters:
Data skew: The data in each attribute are generated from a Zipfian distribution with parameter z ranging from 0 (uniform distribution) to 3 (highly-skewed distribution). The number of distinct values in each attribute is fixed to 10.
Correlation: By default, the data are independently generated for each attribute. We introduce correlation between a pair of attributes by transforming the data such that the correlation coefficient between the two attributes is approximately ρ. The parameter ρ ranges from 0 to 1, representing an increasing degree of correlation. In particular, ρ=0 corresponds to the case where there is no correlation between the two attributes; ρ=1 indicates that the two attributes are fully dependent, i.e., knowing the value of one attribute enables one to perfectly predict the value of the other. This is achieved by first independently generating the data for both attributes (say, A1 and A2) and then performing the following transformation: for each tuple with A1=a1 and A2=a2, replace a2 by a1×ρ+a2×√(1−ρ2), suitably rounded. For three or more attributes, data are created such that the correlation coefficient between any pair of attributes is approximately ρ.
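The following is a minimal Python sketch of this generation procedure; clipping transformed values back into the attribute domain is an assumed reading of "suitably rounded":

```python
import numpy as np

def generate_correlated_pair(n, z, rho, n_distinct=10, seed=0):
    """Draw two attributes independently from a Zipfian distribution over
    n_distinct values (z = 0 gives the uniform distribution), then apply
    a2 <- a1*rho + a2*sqrt(1 - rho^2), suitably rounded, so that the pair
    has correlation coefficient approximately rho."""
    rng = np.random.default_rng(seed)
    values = np.arange(1, n_distinct + 1)
    probs = 1.0 / values.astype(float) ** z      # Zipfian frequencies
    probs /= probs.sum()
    a1 = rng.choice(values, size=n, p=probs)
    a2 = rng.choice(values, size=n, p=probs)
    a2 = np.rint(a1 * rho + a2 * np.sqrt(1.0 - rho**2)).astype(int)
    return a1, np.clip(a2, 1, n_distinct)        # clip: an assumed convention
```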
The real data set Census Income contains weighted census data extracted from the 1994 and 1995 population surveys conducted by the U.S. Census Bureau. It has 199,523 tuples and 40 attributes representing demographic and employment related information. Out of the 40 attributes, 7 are continuous, and 33 are nominal.
The following evaluates the methods presented herein on two different query workloads. The first set of queries consists of 100 range queries, where each predicate in the query takes the form of (attribute<=constant) with a randomly chosen constant. The second set of queries consists of 100 equality queries, where each predicate takes the form of (attribute=constant) with a randomly chosen constant.
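The following is a minimal Python sketch of such a workload generator; the attribute names and domains are hypothetical:

```python
import numpy as np

def make_query(attributes, domains, rng, op="<="):
    """One conjunctive query: a predicate (attribute op constant) per
    attribute, with the constant drawn at random from the domain."""
    return [(a, op, rng.choice(domains[a])) for a in attributes]

rng = np.random.default_rng(0)
domains = {"a": np.arange(1, 11), "b": np.arange(1, 11)}
range_queries = [make_query(["a", "b"], domains, rng) for _ in range(100)]
equality_queries = [make_query(["a", "b"], domains, rng, op="=")
                    for _ in range(100)]
```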
It will also be appreciated that simple random sampling is used as the sampling scheme in the experiments for both the sampling-based approach and the methods presented herein. All numbers reported are averages over 30 repetitions.
It will also be understood that the exact frequency distributions of individual attributes are used as the synopses, and that the absolute relative error defined in Eq. (1) is used as the error metric.
Results on synthetic data. In all experiments, similar trends are observed for both range and equality queries; thus only the results on range queries are reported because of space limitations.
First, the effects of various parameters are shown for the case of two attributes (i.e., only two predicates on two different attributes are involved in the query); then the effect of the number of attributes on the estimation accuracy is shown. The individual selectivities are obtained based on the frequencies of values in each attribute. Since the results indicate that the number of tuples T in the table does not have a significant effect on the accuracy of the estimators, only the results for T=100,000 are shown here.
Correlation. The effect of the correlation between attributes on the estimation accuracy is shown by varying the correlation coefficient ρ from 0 to 1, representing an increasing degree of correlation.
The accuracy of the methods in accordance with the present invention increases with the degree of correlation. Since the methods utilize sample information, when the degree of correlation increases, the number of distinct value combinations in the two attributes decreases, as the data become more "concentrated". Therefore, the sample space (containing all distinct value combinations) becomes smaller, and sampling becomes more efficient (i.e., for a given sample rate, it is more likely that the sample includes a tuple satisfying the query).
In addition, as the degree of correlation increases, the benefit of adjusting the weights in accordance with known single-attribute synopses becomes more evident. In the extreme case where the two attributes are fully dependent (ρ=1), the method essentially produces the exact selectivity, provided that there is at least one tuple in the sample satisfying the query.
To understand why this is the case, consider the following query: Q=P1∧P2=(A1=a)∧(A2=b). Full dependency dictates that if there is at least one tuple in the table satisfying this query, then for any other value c (c≠a) in A1 and d (d≠b) in A2, both (A1=a)∧(A2=d) and (A1=c)∧(A2=b) evaluate to false on every tuple. This implies that s=s1=s2.
Therefore, if in the auxiliary vector xj for tuple j we have xj1=1 (which corresponds to A1=a), then yj (the variable indicating whether j satisfies Q) must also be 1, and vice versa. Since s1 is known, we have
as a constraint in the optimization problem. If we can find a set of wj that satisfy this constraint, then the calibration estimator
must also yield s1, which means a perfect selectivity estimate.
One exception to this analysis is that when there is no tuple j ∈ S satisfying Q, it may no longer be possible to produce the exact estimate. In such cases, all yj (j ∈ S) are 0; therefore, regardless of the weights, the calibration estimator 1/N Σj∈S wjyj will also be zero, which may be different from the exact selectivity.
In all cases, the methods disclosed herein produce significantly more accurate estimates than the sampling-based method, with a 50%-100% reduction in error. Both distance functions give very close estimates, verifying the claim that estimators using different distance functions are asymptotically equivalent.
Data skew. The effect of data skew is studied by varying the Zipfian parameter z from 0 (uniform) to 3 (highly skewed); a typical result is shown in
Note that this situation is different from the case of increasing correlation as discussed above. The main effect of increasing the skew is a decrease in the frequencies of some value combinations, not necessarily reducing the number of value combinations present in the table. Increasing correlation, on the other hand, generally results in a reduction in the number of value combinations.
Another interesting observation from
Sample rate.
It is also worth noting that the methods of the present invention may achieve the same level of accuracy with a much smaller sample rate than that required by the sampling-based approach. For example, in
Number of attributes. The number of attributes involved in the query is varied from 2 to 5 to study the impact of the number of attributes on the estimation accuracy. A typical result is shown in
Note from
Results on real data. Since the Census Income data set has 40 attributes, there are 40×39=1560 ordered attribute pairs. One hundred attribute pairs are randomly chosen, and the accuracy of the methods disclosed herein is compared with that of the prior art approaches as the sample rate increases; the results are shown in
It will be understood that the capabilities of the present invention can be implemented in software, firmware, hardware or some combination thereof.
As one example, one or more aspects of the present invention can be included in an article of manufacture (e.g., one or more computer program products) having, for instance, computer usable media. The media has embodied therein, for instance, computer readable program code means for providing and facilitating the capabilities of the present invention. The article of manufacture can be included as a part of a computer system or sold separately.
Additionally, at least one program storage device readable by a machine, tangibly embodying at least one program of instructions executable by the machine to perform the capabilities of the present invention can be provided.
The flow diagrams depicted herein are just examples. There may be many variations to these diagrams or the steps (or operations) described therein without departing from the spirit of the invention. For instance, the steps may be performed in a differing order, or steps may be added, deleted or modified. All of these variations are considered a part of the claimed invention.
While the preferred embodiment of the invention has been described, it will be understood that those skilled in the art, both now and in the future, may make various improvements and enhancements which fall within the scope of the claims which follow. These claims should be construed to maintain the proper protection for the invention first described.