A query optimizer generates query execution plans based on received database queries. Query execution plans may be generated based on database statistics, such as the estimated cardinality of the received query. Cardinality refers to a number of records of a table which are selected by a query, while selectivity refers to a percentage of records of a table which are selected by a query.
A basic approach for determining the selectivity of a query consisting of several predicates is to determine a selectivity for each predicate and to multiply the determined selectivities. This approach assumes that data values are uniformly distributed and the selectivities of predicates are independent. The maximum entropy method has been suggested as an approach for determining a selectivity estimate for a complete predicate conjunct based on several selectivity estimates for partial predicate conjuncts. Assuming three predicates p0, p1, p2 whose selectivities are estimated to be s0=0.5, s1=0.5, and s2=0.5, a selectivity for p0∧p1 of s01=0.4 and a selectivity for p1∧p2 of s12=0.1, the selectivity of the full predicate conjunct p0∧p1∧p2 may be determined as 0.08 using entropy maximization. In contrast, the independence assumption generates a selectivity estimate for the full predicate conjunct of 0.5*0.5*0.5=0.125, which is inconsistent because it is greater than the selectivity for p1∧p2.
The maximum entropy method derives selectivities for each combination of predicates of a complete predicate conjunct by finding the unique vector x=(x0, x1, . . . , xn−1)t, n=2^z, which maximizes the entropy function
H(x) = −Σi xi log xi,
subject to the constraints given by any known selectivities of various predicate conjuncts. An iterative scaling algorithm is conventionally used to solve the entropy maximization problem. This algorithm exhibits very slow convergence and a high asymptotic complexity (i.e., O(m^2*n)) in each iteration, where m is the number of known selectivities, z the number of predicates, and n=2^z. Accordingly, the use of an iterative scaling algorithm to solve the entropy maximization problem is typically unsuitable for use in modern database systems. Newton's method has been considered as an alternative approach, but exhibits similarly unsuitable performance, particularly as the number of known selectivities and/or the number of predicates increases.
Systems for efficiently solving an entropy maximization problem are desired. Such systems may preferably allow for implementation in a modern database management system to estimate selectivities without consumption of undue processing resources or processing time.
The following description is provided to enable any person in the art to make and use the described embodiments and sets forth the best mode contemplated for carrying out some embodiments. Various modifications, however, will be readily apparent to those in the art.
Some embodiments provide improved determination of selectivities via an efficient implementation of a Newton's method-based algorithm to solve the entropy maximization problem. Such an implementation may be executed by a Central Processing Unit (CPU) or a Graphics Processing Unit (GPU) to provide suitable performance in some database scenarios. Some embodiments further execute the implementation using a multi-threaded GPU in a manner providing additional performance benefits.
Statistics server 110 receives complete predicate conjunct 120, which may comprise a multi-predicate query (e.g., p0∧p1∧p2) for which an execution plan is to be generated. Based on known selectivities of one or more conjuncts of the predicates 130 (e.g., s02 (corresponding to p0∧p2), s2 (corresponding to p2)), statistics server 110 generates estimated selectivity for all predicate conjuncts 140.
In some embodiments, statistics server 110 generates an estimated selectivity for all predicate conjuncts 140 using a Newton's method-based algorithm to solve a maximum entropy problem. Conventional implementations of such an algorithm exhibit high complexity (e.g., O(n^2)), but an implementation according to some embodiments utilizes a recursive procedure that avoids redundant computations and results in a Newton's method-based algorithm with lower computational complexity (e.g., O(n log n)).
Generally, server node 200 receives a query from client applications 230 and 240 and returns results thereto based on data stored within node 200. Node 200 executes program code to provide an application server and query processor 220. The application server provides services for executing server applications. For example, Web applications executing on an application server may receive Hypertext Transfer Protocol (HTTP) requests from client applications 240.
Query processor 220 contains the actual stored data and engines for processing the data. Query processor 220 is responsible for processing Structured Query Language (SQL) and Multi-Dimensional eXpression (MDX) statements and may receive such statements directly from client applications 230.
Query processor 220 includes a query optimizer for use in determining query execution plans and a statistics server for determining statistics used to estimate query execution plan costs. The statistics server may generate such statistics based on other stored statistics as described herein. For example, in response to reception of a query consisting of a conjunct of several predicates on a stored table (or view) of server node 200, the statistics server may estimate selectivity of the query based on known selectivities of various conjuncts of the query predicates. According to some embodiments, the statistics server of query processor 220 comprises an implementation of statistics server 110 to generate such estimates.
In some embodiments, the data of server node 200 may comprise one or more of conventional tabular data, row-based data, column-based data, and object-based data. Moreover, the data may be indexed and/or selectively replicated in an index to allow fast searching and retrieval thereof. Server node 200 may support multi-tenancy to separately support multiple unrelated clients by providing multiple logical database systems which are programmatically isolated from one another.
Server node 200 may implement an “in-memory” database, in which a full database is stored in volatile (e.g., non-disk-based) memory (e.g., Random Access Memory). The full database may be persisted in and/or backed up to fixed disks (not shown). Embodiments are not limited to an in-memory implementation. For example, data may be stored in Random Access Memory (e.g., cache memory for storing recently-used data) and in one or more fixed disks (e.g., persistent memory for storing respective portions of the full database).
A query is initially determined at S310. The query may include a plurality of predicates as is known in the art and may be received by a statistics server for the purpose of estimating corresponding selectivities as described herein. Next, at S320, known selectivities of each of a set of conjuncts of the plurality of predicates are determined. The known selectivities may be pre-stored estimates and/or estimates calculated on-the-fly from single-column histograms, two-dimensional histograms, and/or sampling, as is known.
Selectivities of all conjuncts of the plurality of predicates are estimated at S330 based on the known selectivities determined at S320. These selectivities are represented below as b=Cx, where x represents the selectivities of all complete conjuncts as defined below. As also described below, Cx may be determined recursively to provide efficient determination of selectivities at S330. The recursive determination of Cx may be leveraged for efficient implementation of other steps of the present Newton's method-based algorithm.
Entropy maximization algorithms require a matrix-based representation of the subject maximum entropy problem, which is not conventionally defined in the case of selectivity estimation. The following description is intended to formalize the entropy maximization problem as a series of vector and matrix operations, from which efficient implementations will then be derived.
The table below summarizes aspects of the notation to be used in the following description.
Symbol | Meaning
z | number of predicates
N | {0, . . . , z−1}, the index set of the predicates
n | 2^z
T | set of indices of conjuncts with known selectivities (Ø∈T)
m | |T|, the number of known selectivities
β(X) | conjunction of the predicates with indices in X, and its selectivity
γ(X) | complete conjunct for X, and its selectivity
C | complete design matrix of dimension (n, n)
D | partial design matrix of dimension (m, n)
Using the above notation, a conjunctive query of z predicates (i.e., a complete predicate conjunct) may be represented as follows:
p0∧ . . . ∧pz−1
The predicates may consist of selection predicates or join predicates.
Let N={0, . . . , z−1} be the set of numbers from 0 to z−1. All subsets X⊆N can be represented as a bit-vector of length z denoted by bv(X) where the set bits indicate the indexes of those elements of N which are also included in the subset X. Further, this bit-vector can be interpreted as a binary number. No distinction is made between the bit-vector and the integer it represents and either may be used. For example, the notation i⊆j denotes the fact that i has a ‘1’ only in those positions where j has a ‘1’, i.e., j=i|j is True.
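As a quick illustration (a sketch, not part of the original text), this subset test on bit-vectors is a single bitwise check:

// i ⊆ j holds iff i has a '1' only in positions where j has a '1',
// which is exactly the condition (i | j) == j.
inline bool isSubset(unsigned i, unsigned j) { return (i | j) == j; }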
For any X⊆N, the following formula is defined:
β(X):=∧i∈Xpi
i.e., β(X) is the conjunction of all predicates pi whose index i is contained in X. The following table gives a complete overview for z=3, where the bits are ordered from least significant to most significant and where the first column gives the integer value and the bit-vector representation of the index set X and the second column the corresponding conjunction of predicates contained in X. β(i) may be used instead of β(X) if i is the bit-vector/integer representation of some X.
X (integer, bit-vector) | β(X)
0 (000) | (empty conjunct)
1 (100) | p0
2 (010) | p1
3 (110) | p0∧p1
4 (001) | p2
5 (101) | p0∧p2
6 (011) | p1∧p2
7 (111) | p0∧p1∧p2
The selectivity of β(X), i.e., the probability of β(X) being true, is, with a slight abuse of notation, also denoted by β(X). In the case of the empty set, the empty conjunct is always true (i.e., β(Ø)=β(0)=1).
A conjunction of literals containing all predicates either positively or negatively is called a complete conjunct. For z=3, the following table contains a list of all complete conjuncts:
X (integer, bit-vector) | γ(X)
0 (000) | ¬p0∧¬p1∧¬p2
1 (100) | p0∧¬p1∧¬p2
2 (010) | ¬p0∧p1∧¬p2
3 (110) | p0∧p1∧¬p2
4 (001) | ¬p0∧¬p1∧p2
5 (101) | p0∧¬p1∧p2
6 (011) | ¬p0∧p1∧p2
7 (111) | p0∧p1∧p2
Two different complete conjuncts can never be true simultaneously. The complete conjuncts have been indexed by their bit-vector representation, where a positive atom corresponds to ‘1’ and a negative atom corresponds to ‘0’. For a given X⊆N, the complete conjunct X is denoted by γ(X):
γ(X) := (∧i∈X pi) ∧ (∧i∈N\X ¬pi)
The probability of a complete conjunct γ(X) for some X being true is denoted by γ(X).
For a given X⊆N, the bit-vectors y of the complete conjuncts γ(y) contributing to β(X) can be expressed as all the bit-vectors y which contain a ‘1’ at least at those positions where the bit-vector representation bv(X) of X contains a ‘1’. That is,
{y|y⊇bv(X)}.
Consider X={0} (≙100). Then
β(X)=s(p0∧¬p1∧¬p2)+s(p0∧p1∧¬p2)+s(p0∧¬p1∧p2)+s(p0∧p1∧p2),
where s(p) denotes the selectivity of the complete conjunct p. For X={0,1} (≙110):
β(X)=s(p0∧p1∧¬p2)+s(p0∧p1∧p2).
As a special case, for X=Ø(≙000), all complete conjuncts contribute to β(Ø). Further, the sum thereof must be one. Consequently, it is assumed that the empty set is contained in the set of known selectivities T, i.e., Ø∈T.
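In code, β(X) may thus be obtained by summing over exactly these complete conjuncts. A small sketch under the bit-vector conventions above (with a hypothetical helper name):

#include <vector>

// Selectivity beta(X) for bit-vector bvX: sum of the selectivities x[y]
// of all complete conjuncts y with y ⊇ bv(X), i.e., (y & bvX) == bvX.
double beta(const std::vector<double>& x, unsigned bvX)
{
  double s = 0.0;
  for (unsigned y = 0; y < x.size(); ++y)
    if ((y & bvX) == bvX)
      s += x[y];
  return s;
}

For bvX=0 (X=Ø), every complete conjunct contributes and the result is one for any valid x, matching β(Ø)=1.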
In case T=2^N, all selectivities are known. Define n=2^z. Then, the complete design matrix C∈ℝ^(n×n) is defined as
C[i,j] := 1 if i⊆j, and 0 otherwise,
where indices are used in [0, 2^z−1]. C is unit upper triangular, nonsingular, positive definite, and persymmetric. For z=3:
C3 =
1 1 1 1 1 1 1 1
0 1 0 1 0 1 0 1
0 0 1 1 0 0 1 1
0 0 0 1 0 0 0 1
0 0 0 0 1 1 1 1
0 0 0 0 0 1 0 1
0 0 0 0 0 0 1 1
0 0 0 0 0 0 0 1
This design matrix assists in determining selectivities for positive conjuncts from probabilities for complete conjuncts. Let b=(β(0), . . . , β(n−1))t be the column vector containing the selectivities β(X) for all X∈2^N and x=(γ(0), . . . , γ(n−1))t be the column vector containing the selectivities for all complete conjuncts. Then,
Cx=b
holds.
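For small z, the relation Cx=b can be checked directly by materializing C. A sketch for illustration only (the efficient algorithms derived below never materialize C):

#include <vector>

// Materialize the complete design matrix: C[i][j] = 1 iff i ⊆ j.
std::vector<std::vector<int>> completeDesignMatrix(unsigned z)
{
  const unsigned n = 1u << z;
  std::vector<std::vector<int>> C(n, std::vector<int>(n, 0));
  for (unsigned i = 0; i < n; ++i)
    for (unsigned j = 0; j < n; ++j)
      if ((i | j) == j)
        C[i][j] = 1;
  return C;
}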
Now notation will be established to eliminate rows and columns in some matrix A. Let A∈ℝ^(n×n) be some matrix. Let T⊆{0, . . . , n−1}, m:=|T|, be a set of column indices. Then, the matrix where only the columns in T are retained is denoted A|c(T). Likewise, the matrix derived by retaining only the rows in T is denoted A|r(T). These operations can be expressed via matrix multiplication. For an index set T with m=|T|, the matrix Em,n,T∈ℝ^(m×n) is defined as
Em,n,T[i,j] := 1 if j=T[i], and 0 otherwise,
where T[i] denotes the i-th element of the sorted index set T. For example, for m=4, n=8, T={1,3,5,7}:
E4,8,T =
0 1 0 0 0 0 0 0
0 0 0 1 0 0 0 0
0 0 0 0 0 1 0 0
0 0 0 0 0 0 0 1
Then, for A∈ℝ^(n×n),
A|r(T)=Em,n,TA
A|c(T)=A(Em,n,T)t
holds. For a given subset T⊆{0, . . . , n−1} (of known selectivities), only those rows from the complete design matrix C for which there is an entry in T are retained. The problem specific (partial) design matrix D for T is defined as
D := C|r(T) = Em,n,TC ∈ ℝ^(m×n)
where m:=|T|. Clearly, the rank of D is m.
For z predicates, a given vector βT of known selectivities, and the set T of their indices, the problem is to find the solution to Dx=βT that maximizes the entropy of the solution vector x. That is, the problem to solve can be specified as
maximize H(x) = −Σi xi log xi subject to Dx = βT, x ≥ 0,
where n=2^z. Note that the entries of x must sum to one, but this constraint is implied since it is assumed that Ø∈T always holds.
The nine steps below comprise a Newton's method-based algorithm to solve the dual problem of the entropy maximization problem above, where it is supposed that the set {x∈ℝ^n: Dx=βT, x≥0} has a nonempty interior.
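As an explanatory note supplied here because the dual is referenced but not displayed above (standard convex duality, not necessarily the original formulation): the dual may be written as minimizing f(w) = Σi exp((Dtw)i − 1) − βTtw over w∈ℝ^m, with the primal solution recovered as xi = exp((Dtw)i − 1). The gradient of f is then Dx − βT and its Hessian is Ddiag(x)Dt, which explains the matrix A and the linear system appearing in Lines 4 and 5 below.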
In contrast to iterative scaling, a Newton-based approach to solve the optimization problem exhibits local quadratic convergence. As input, the algorithm receives the vector b of known selectivities, the set T of their indices, and some ε>0 to be used in the stop criterion. The algorithm returns the solution x maximizing the entropy and the vector Cx containing the β-selectivities for all possible predicate conjuncts. Although T does not occur in the algorithm, it is used in the definition of the design matrix D as noted above.
Newton(b(=βT), T, ε)
The steps in the algorithm differ vastly in complexity. The initializations of w and x have complexity O(m) and O(n), respectively, and are thus rather uncritical. The calculation of w=w+y in Line 6 has complexity O(m) and is thus also rather uncritical.
The calculation of A=Ddiag(x)Dt in Line 4 of Newton can be very expensive if implemented literally. Note that diag(x) is a diagonal (n×n)-matrix with x on its diagonal. Using standard matrix multiplication, the complexity of this step is O(m*n^2 + m^2*n). The present inventors have noted that diag(x) contains only zeroes off the diagonal and conceived of the following more efficient procedure, which leverages this structural condition and does not rely on materializing diag(x):
get_DdiagxDt(D, x)
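The body of this listing is not reproduced above. A sketch consistent with the described idea (an assumption, not the original listing) computes A[i][j] = Σk D[i][k]*x[k]*D[j][k] directly:

// A = D diag(x) D^t without materializing diag(x). D is m x n and A is
// m x m, both row-major; A is symmetric, so only the upper triangle is
// computed and mirrored. Complexity O(m^2 * n).
void get_DdiagxDt(const double* D, const double* x,
                  unsigned m, unsigned n, double* A)
{
  for (unsigned i = 0; i < m; ++i)
    for (unsigned j = i; j < m; ++j) {
      double a = 0.0;
      for (unsigned k = 0; k < n; ++k)
        a += D[i*n + k] * x[k] * D[j*n + k];
      A[i*m + j] = A[j*m + i] = a;
    }
}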
Ay=b−Dx is solved for y in Line 5 of Newton. Calculating Dx has complexity O(m*n). To solve the equation, note that the (m×m) matrix A=Ddiag(x)Dt calculated in Line 4 is symmetric, nonsingular, and positive definite. Thus, the efficient Cholesky decomposition can be applied to derive a lower triangular matrix L with A=LLt. Then, the solution y can be derived using forward and back substitution. The complexity of this procedure is O(m^3).
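For illustration, a dense Cholesky solve along these lines may be sketched as follows (a textbook formulation, not the original listing):

#include <cmath>
#include <vector>

// Solve A y = r for symmetric positive definite A (m x m, row-major) via
// A = L L^t followed by forward and back substitution. O(m^3).
void choleskySolve(std::vector<double> A, std::vector<double> r,
                   unsigned m, std::vector<double>& y)
{
  // Factorize: the lower triangle of A is overwritten with L.
  for (unsigned j = 0; j < m; ++j) {
    for (unsigned k = 0; k < j; ++k)
      A[j*m + j] -= A[j*m + k] * A[j*m + k];
    A[j*m + j] = std::sqrt(A[j*m + j]);
    for (unsigned i = j + 1; i < m; ++i) {
      for (unsigned k = 0; k < j; ++k)
        A[i*m + j] -= A[i*m + k] * A[j*m + k];
      A[i*m + j] /= A[j*m + j];
    }
  }
  // Forward substitution: L u = r (u overwrites r).
  for (unsigned i = 0; i < m; ++i) {
    for (unsigned k = 0; k < i; ++k)
      r[i] -= A[i*m + k] * r[k];
    r[i] /= A[i*m + i];
  }
  // Back substitution: L^t y = u.
  y.assign(m, 0.0);
  for (unsigned i = m; i-- > 0; ) {
    double v = r[i];
    for (unsigned k = i + 1; k < m; ++k)
      v -= A[k*m + i] * y[k];
    y[i] = v / A[i*m + i];
  }
}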
Dtw is calculated in Line 7, which has complexity O(m*n). Line 8, with complexity O(m), is also uncritical, as Dx has already been calculated in Line 5. In Line 9, the product of the complete design matrix C with the primal solution vector x is calculated. Using standard matrix multiplication, this step has complexity O(n^2).
The present inventors have derived a reduction-based algorithm for implementing the above Newton's method-based algorithm that avoids redundant computations, resulting in lower computational complexity than the above-described implementation thereof. The reduction-based algorithm may be executed at S330 and is based on a recursive definition of the complete design matrix C.
Assuming the complete design matrix for z predicates is denoted by Cz∈ℝ^(n×n) with n=2^z, then the recurrences
C0 = (1),
Cz+1 = [ Cz Cz ]
       [ 0  Cz ]
characterize the complete design matrix C. Another possibility to define C is to use the Kronecker product ⊗. That is, with
C1 = [ 1 1 ]
     [ 0 1 ],
what follows is
Cz+1=C1⊗Cz
It is desired to efficiently calculate Cx at S330 for some vector x∈ℝ^n to efficiently implement the Newton-based algorithm. If x∈ℝ^n is apportioned into two halves x1, x2∈ℝ^(n/2), it is observed that
Czx = [ Cz−1x1 + Cz−1x2 ]
      [ Cz−1x2          ]
The term Cz−1x2 occurs twice but has to be calculated only once. Based on this observation, a recursive procedure may be implemented to calculate Czx in O(z*2^z), i.e., O(n log n) substituting n=2^z. The algorithmic complexity of the Newton's method-based algorithm may therefore be advantageously reduced from O(n^2) to O(n log n).
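A recursive sketch of this procedure (illustrative; the iterative form preferred by some embodiments follows below) operates in place on x:

// In-place evaluation of C_z x for n = 2^z: the shared term C_{z-1} x2
// is computed only once, giving T(n) = 2T(n/2) + n/2, i.e., O(n log n).
void cxRecursive(double* x, unsigned n)
{
  if (n == 1) return;            // C_0 = (1)
  unsigned h = n / 2;
  cxRecursive(x, h);             // first half  -> C_{z-1} x1
  cxRecursive(x + h, h);         // second half -> C_{z-1} x2 (once)
  for (unsigned i = 0; i < h; ++i)
    x[i] += x[h + i];            // first half  -> C_{z-1}x1 + C_{z-1}x2
}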
An efficient iterative algorithm get_Cx may further be provided as listed below in order to avoid the overhead of recursion. It is assumed that the in/out argument Cx has been initialized with x. Further, vp_add is an AVX2-based implementation to add two vectors of length h.
void get_Cx(double* Cx, uint z)
{
  uint n = 1u << z;
  for (uint w = 2; w <= n; w <<= 1)      // width of current blocks
    for (uint s = 0; s < n; s += w) {    // start of first half
      uint h = w >> 1;                   // half of width
      uint t = s + h;                    // start of second half
      vp_add(Cx + s, Cx + t, h);         // first half += second half
    }
}
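Here, a scalar stand-in for the AVX2-based vp_add (an illustrative assumption; any routine adding the second vector element-wise into the first will do) may look as follows:

typedef unsigned int uint;

// vp_add: a[i] += b[i] for 0 <= i < h (the AVX2 version processes several
// doubles per instruction; this scalar fallback has identical semantics).
static void vp_add(double* a, const double* b, uint h)
{
  for (uint i = 0; i < h; ++i)
    a[i] += b[i];
}

For example, for z=2 and Cx initialized with x=(x0, x1, x2, x3), get_Cx first adds pairs (width 2) and then halves of the full vector (width 4), leaving Cx=(x0+x1+x2+x3, x1+x3, x2+x3, x3), which matches C2x.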
A procedure to efficiently calculate Ctx can be devised similarly by replacing Cx with Ctx and vp_add(Cx+s, Cx+t, h) with vp_add(Ctx+t, Ctx+s, h). This algorithm may be called get_Ctx.
To leverage the recursive characterization of C to efficiently calculate Dx and Dtx, it is noted that for n=2^z (z being the number of predicates), (1) the complete design matrix C is of dimension (n, n) and (2) the design matrix D is of dimension (m, n). In typical applications m will be much smaller than n=2^z.
Since the calculation of Cx in Line 9 of Newton can be implemented efficiently as described above, Dx=Em,n,TCx can also be evaluated efficiently based on D:=C|r(T)=Em,n,TC by first calculating Cx and then picking the components contained in T. This evaluation is performed only once during Newton to calculate the expressions Dx in Lines 5 and 8, and Cx in Line 9. Further, Ctx can be calculated efficiently using algorithm get_Ctx. Thus, calculating Dtw in Line 7 can be implemented efficiently by exploiting the fact that Dt=Ct(Em,n,T)t. w may be embedded into a vector w′ in ℝ^n via
w′[T[i]] = w[i], and w′[j] = 0 for j∉T
(0≤i<m, 0≤j<n), and algorithm get_Ctx is then applied to w′.
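The pick-and-embed steps are simple gather and scatter operations. A minimal sketch (hypothetical helper names, not from the original listings):

#include <vector>

// Gather Dx from a precomputed Cx: Dx[i] = (Cx)[T[i]] for sorted T.
std::vector<double> get_Dx(const std::vector<double>& Cx,
                           const std::vector<unsigned>& T)
{
  std::vector<double> Dx(T.size());
  for (std::size_t i = 0; i < T.size(); ++i)
    Dx[i] = Cx[T[i]];
  return Dx;
}

// Scatter w into w' in R^n: w'[T[i]] = w[i], all other entries zero.
// Applying get_Ctx to w' then yields D^t w.
std::vector<double> embed_w(const std::vector<double>& w,
                            const std::vector<unsigned>& T, unsigned n)
{
  std::vector<double> wp(n, 0.0);
  for (std::size_t i = 0; i < T.size(); ++i)
    wp[T[i]] = w[i];
  return wp;
}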
An efficient implementation of Line 4 will now be discussed. As Cx has already been calculated, it will be shown how to calculate (Ddiag(x)Dt) from Cx. Initially, an efficient algorithm to calculate (Cdiag(v)Ct) is provided.
It is noted that (diag(v)Ct)=(Cdiag(v))t. Further,
(Cdiag(x)Ct)[i,j] = Σk⊇i, k⊇j x[k] = Σk⊇(i|j) x[k].
Thus, using
(Cdiag(x)Ct)[i,j] = (Cx)[i|j],
(Cdiag(x)Ct) can be calculated from Cx. Since
Ddiag(x)Dt=(Em,n,TC)diag(x)(Em,n,TC)t=Em,n,T(Cdiag(x)Ct)(Em,n,T)t,
Cx can be used to fill (Ddiag(x)Dt)∈ℝ^(m×m) via
(Ddiag(x)Dt)[i,j]=(Cx)[T[i]|T[j]]
for 0≤i, j<m.
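A short sketch of this fill step (hypothetical helper name, consistent with the formula above):

// Fill A = D diag(x) D^t (m x m, row-major) from the precomputed vector
// Cx via (D diag(x) D^t)[i,j] = (Cx)[T[i] | T[j]]; O(m^2) once Cx is known.
void fill_DdiagxDt(const double* Cx, const unsigned* T,
                   unsigned m, double* A)
{
  for (unsigned i = 0; i < m; ++i)
    for (unsigned j = 0; j < m; ++j)
      A[i*m + j] = Cx[T[i] | T[j]];
}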
The above-described implementation of Newton can be implemented efficiently on a modern CPU or GPU. Implementation on a multi-threaded GPU, which provides even further efficiencies, will now be described. The description focuses on the implementation of Cx because, as described above, this operation is the core of the implementation of Lines 4, 7, 8 and 9 of Newton. Efficient GPU implementation of Line 5, the Cholesky decomposition, is also described.
Embodiments extend the implementation of get_Cx to support massive multi-threading.
Assuming a maximum number of threads per thread block of 1024, the first ten iterations can be performed without requiring communication between different thread blocks. During these ten iterations, the shared memory of the GPU is used and access to global memory is only required once when loading x into shared memory and once when writing Cx back to global memory. This arrangement is beneficial because shared GPU memory may offer lower latency and significantly higher bandwidth than global memory. Hence, for z≤10, embodiments may use a kernel using shared memory such as shown in listing 500.
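A minimal sketch of such a shared-memory kernel (an illustration under the stated assumptions, not listing 500 itself), assuming a single thread block of n=2^z≤1024 threads launched as getCxShared<<<1, n, n*sizeof(double)>>>(x, z):

// Each round adds the second half of every block of width 2h into its
// first half, exactly as in get_Cx. Thread i is in the first half of its
// block iff bit h of i is zero; s[i+h] is never written in that round,
// so a single __syncthreads() per round suffices.
__global__ void getCxShared(double* x, unsigned z)
{
  extern __shared__ double s[];
  const unsigned n = 1u << z;
  const unsigned i = threadIdx.x;
  s[i] = x[i];                        // one global read per thread
  for (unsigned h = 1; h < n; h <<= 1) {
    __syncthreads();
    if ((i & h) == 0)
      s[i] += s[i + h];
  }
  __syncthreads();
  x[i] = s[i];                        // one global write per thread
}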
For z>10, and again assuming a maximum number of threads per thread block of 1024, no efficient shared memory implementation is possible, as threads of one thread block would need to access shared memory allocated in another thread block. Consequently, all memory accesses should go to global memory and global synchronization through individual kernel launches is required. This global kernel is shown in listing 600.
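One round of this global-memory phase might be sketched as follows (an illustration, not listing 600; here the half-width h is passed as a parameter rather than encoded as a template argument):

// x[i] += x[i+h] for every i < n whose h-bit is zero, i.e., one addition
// round for block width 2h. Launching one kernel per round provides the
// required global synchronization between rounds.
__global__ void getCxGlobal(double* x, unsigned n, unsigned h)
{
  const unsigned i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n && (i & h) == 0)
    x[i] += x[i + h];
}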
As described above, solving Ay=b−Dx for y in Line 5 of Newton can be performed using Cholesky decomposition. The cuSolver library from the CUDA toolkit may be used for large problems, i.e., for m≥40. cusolverDnDpotrf is first called to factorize A in a kernel call. Then, the kernel cusolverDnDpotrs is called, with b−Dx passed as an argument, and y is received as a result.
As multiple kernel calls are involved in these steps, and each kernel call implies a call overhead of approximately 5-10 μs, a variant of the Cholesky decomposition using only a single kernel call may be implemented. This kernel may be used as a solver for small problems, i.e. m<40. The implementation calculates the solution of the system of equations via Gaussian elimination without pivoting. It is implemented to run in a single thread block using shared memory.
The following is a description of how the various kernels are combined to implement Newton on a GPU according to some embodiments. As shown in the pseudo code below, the initialization in Lines 1-3 and the main loop of Newton are realized in function NewtonGPU.
get_Cx_GPU(x, direction)
1. y = getCxShared(x, direction)
2. for w ← 1 to z−10
3.   Cx = getCxGlobal<10+w>(y, direction)
4. return Cx
buildMatrixA(b,x)
While the logic of the main loop is the same as in Newton, the code is organized to minimize the number of kernel calls. For example, in Line 5 both Ddiag(x)Dt and b−Dx are computed in a single kernel call to buildMatrixA. Cx is first computed by calling getCxShared and then, if z>10, getCxGlobal is called in a loop for every 10<w≤z. In the second step of kernel buildMatrixA, the elements for Dx and A=Ddiag(x)Dt are gathered from Cx as explained above.
Line 6 of the loop in function NewtonGPU is implemented using the cuSolver library of CUDA as described above for larger problems. Line 7 computes w=w−y using thrust::transform from Thrust, the CUDA C++ template library. Then, Line 8 fuses the computation of Dtw in Line 7 of Newton and the computation in Line 8 of Newton into a single kernel productOfDtw. This kernel first distributes vector w into x, and then productOfDtw uses the logic of get_Cx_GPU to compute Dtw using direction=1 as a parameter to handle the transposed matrix. As part of this computation the vectors uold, unew and x can also be calculated in the same kernel. After the call to productOfDtw the vector uold contains the element-wise delta of the last loop iteration. This vector is used in Line 9 to determine δ to check for convergence of the algorithm. The L∞ norm and ε=10^−8 may be used in a GPU implementation. If convergence is reached, the solution is returned in Line 11 by issuing one final call to get_Cx_GPU(x,0).
Server node 700 includes processing unit(s) 710 operatively coupled to communication device 720, data storage device 730, one or more input devices 740, one or more output devices 750 and memory 760. Communication device 720 may facilitate communication with external devices, such as an external network or a data storage device. Input device(s) 740 may comprise, for example, a keyboard, a keypad, a mouse or other pointing device, a microphone, a knob or switch, an infra-red (IR) port, a docking station, and/or a touch screen. Input device(s) 740 may be used, for example, to enter information into server node 700. Output device(s) 750 may comprise, for example, a display (e.g., a display screen), a speaker, and/or a printer.
Data storage device 730 may comprise any appropriate persistent storage device, including combinations of magnetic storage devices (e.g., magnetic tape, hard disk drives and flash memory), optical storage devices, Read Only Memory (ROM) devices, etc., while memory 760 may comprise Random Access Memory (RAM).
Application server 731 and query processor 732 may each comprise program code executed by processor(s) 710 to cause server 700 to perform any one or more of the processes described herein. Such processes may include estimating selectivities of queries on tables 734 based on statistics 733. Embodiments are not limited to execution of these processes by a single computing device. Data storage device 730 may also store data and other program code for providing additional functionality and/or which are necessary for operation of server 700, such as device drivers, operating system files, etc.
The foregoing diagrams represent logical architectures for describing processes according to some embodiments, and actual implementations may include more or different components arranged in other manners. Other topologies may be used in conjunction with other embodiments. Moreover, each component or device described herein may be implemented by any number of devices in communication via any number of public and/or private networks. Two or more of such computing devices may be located remote from one another and may communicate with one another via any known manner of network(s) and/or a dedicated connection. Each component or device may comprise any number of hardware and/or software elements suitable to provide the functions described herein as well as any other functions. For example, any computing device used in an implementation of some embodiments may include a processor to execute program code such that the computing device operates as described herein.
Embodiments described herein are solely for the purpose of illustration. Those in the art will recognize other embodiments may be practiced with modifications and alterations to that described above.