1. Field of the Invention
This invention relates generally to database management, and particularly to a method for detecting data attribute dependencies.
2. Description of Background
Datasets (e.g., files or other electronic collections of data) exhibit, often complex, dependency structures among their data attributes. Detecting these data attribute dependencies is important for a variety of purposes, such as database query optimization, data mining, metadata discovery, and database system management in general. For example, in the context of query optimization, dependency detection is needed for “statistics configuration.” Current approaches to detecting data attribute dependencies include so-called proactive approaches, in which all data is scanned or sampled to detect dependencies, and reactive approaches, in which data from query feedback (i.e., the results of queries) is analyzed to detect dependencies.
However, such proactive approaches can be inefficient or even unfeasible, e.g., because of high computational needs, such as to examine a large number of data attributes. Furthermore, such reactive approaches can be inefficient or unfeasible, e.g., because of instability when there is a limited number of feedback records, sensitivity to the order in which feedback records are processed, high complexity and computational needs (which may also make such approaches difficult to incorporate and/or maintain in commercial database management systems), and/or lack of flexibility to reduce computational needs for applications other than database query optimization. Therefore, an approach to detect data attribute dependencies is desirable that can be effectively incorporated into database management systems, is stable (e.g., providing accurate detection of data attribute dependencies regardless of the order in which feedback records are processed, even when the number of available feedback records is small), and is flexible (e.g., applicable to various applications in which detection of data attribute dependencies is needed).
A method for detecting data attribute dependencies is provided. An exemplary embodiment of the method includes obtaining at least one data attribute pair of a dataset to analyze for dependency, obtaining at least one query feedback record related to the data attribute pair, obtaining at least one observation of the data attribute pair from the query feedback record that includes a selectivity and at least one of a first marginal selectivity or a second marginal selectivity, completing the observation, if it does not include the first marginal selectivity and the second marginal selectivity, by estimating the missing marginal selectivity, adjusting the observation if needed to make it logically consistent among a plurality of observations of the data attribute pair, computing a statistic HM of the data attribute pair, determining whether the data attribute pair is dependent by comparing the statistic HM to a threshold value, determining a dependency measure of the data attribute pair by normalizing the statistic HM with respect to a normalizing factor, and saving the dependency measure of the data attribute pair to a system catalog.
Additional features and advantages are realized through the techniques of the present invention. Other embodiments and aspects of the invention are described in detail herein and are considered a part of the claimed invention. For a better understanding of the invention with advantages and features, refer to the description and to the drawings.
The subject matter that is regarded as the invention is particularly pointed out and distinctly claimed in the claims at the conclusion of the specification. The foregoing and other objects, features, and advantages of the invention are apparent from the following detailed description taken in conjunction with the accompanying drawings in which:
The detailed description explains the preferred embodiments of the invention, together with advantages and features, by way of example with reference to the drawings.
According to exemplary embodiments of the invention described herein, a method for detecting data attribute dependencies is provided. In accordance with such exemplary embodiments, an approach to detect data attribute dependencies is provided that can be effectively incorporated into database management systems, is stable (e.g., providing accurate detection of data attribute dependencies regardless of the order in which feedback records are processed, even when the number of available feedback records is small), and is flexible (e.g., applicable to various applications in which detection of data attribute dependencies is needed).
Turning now to the drawings in greater detail, wherein like reference numerals indicate like elements,
Exemplary server device 102 includes processor 104, input/output component(s) 106, and memory 108, which are in communication via bus 103. Processor 104 may include multiple (e.g., two or more) processors, which may, e.g., implement pipeline processing, and may also include cache memory (“cache”) and controls (not depicted). The cache may include multiple cache levels (e.g., L1, L2, etc.) that are on or off-chip from processor 104 (e.g., an L1 cache may be on-chip, an L2 cache may be off-chip, etc.). Input/output component(s) 106 may include one or more components that facilitate local and/or remote input/output operations to/from server device 102, such as a display, keyboard, modem, network adapter, ports, etc. (not depicted). Memory 108 includes software 110 for detecting data attribute dependencies, which is executable, e.g., by server device 102 via processor 104. Memory 108 may include other software, data etc. (not depicted).
Each sub-predicate that appears in the conjunctive predicate, e.g., “A=αi” in the above example, is a simple predicate. It is assumed that (αi,βi)≠(αj,βj) for j≠i. Each observation Oi is a set having one of the forms (i) Oi={fα
A QFR contains the observed cardinality for a (simple or conjunctive) predicate, together with the estimated cardinality computed by a database query optimizer. Whereas, the quantity fα
In block 208, each incomplete observation (e.g., each observation of the form, Oi={fα
For example, consider the case in which fα
The goal is to choose the estimate f*·β
In block 210, the completed observations are adjusted, as needed, to ensure “logical consistency” among them, For example, an observation Oi={fα
In block 212, a statistic HM of the data attribute pair is computed from the (completed and consistency-adjusted) observations in O={O1,O2, . . . ,On}. The value of the statistic HM equals zero when the independence assumption holds and the value of the statistic HM increases as the deviation from the independence assumption increases. For example, the statistic HM may be computed according to HM=MxtQx, where M is the number of elements in the dataset; “t” denotes the vector or matrix transpose operation; x=(x1,x2, . . . ,xn) is a column vector whose entries are given by xi=(fα
for 1≦i,j≦n. The matrix Q may be computed by first using known methods to compute the symmetric Schur decomposition of the matrix Σ: Σ=GtDG, where G is a real orthogonal matrix G and D=diag(d1,d2, . . . ,dn) is a diagonal matrix of non-negative numbers. Denote by r=r(Q) the rank of Q, which equals the number of strictly positive diagonal entries of both D and {tilde over (D)}. Then set Q=Gt{tilde over (D)}G, where {tilde over (D)}=diag({tilde over (d)}1,{tilde over (d)}2, . . . ,{tilde over (d)}n), with
for 1≦i≦n, where ε is a small nonnegative number, e.g., which may be chosen as equal to the precision of the computing device.
In some embodiments, the QFRs used for dependency detection and ranking are obtained as a sample of the records in a query feedback warehouse, in order to speed up the computation of HM, which is of order O(n3), where n is the number of observations in O. If, for some reason, the sampling approach does not provide a sufficient decrease in computation cost, then it is possible to further reduce the processing cost by incrementally and approximately maintaining the statistic HM. In some exemplary embodiments, known techniques for incrementally maintaining a singular value decomposition (“SVD”) may be applied, since the symmetric Schur decomposition is a special case of an SVD. An exemplary SVD updating method is the “folding-in” technique, which can be applied in the current setting as follows. Suppose that the dimension of Σ is currently n×n. If a new feedback record is obtained, then it is effectively needed to expand Σ by padding it with 2n+1 elements computed as discussed above for the computation of the vector x. This process can be viewed as appending an n×1 column vector y and then a 1×(n+1) row vector z. Recall that r=r(Q) is the number of positive diagonal entries of D, and fix a positive integer k≦r. By appropriately renumbering the feedback records (and hence permuting the rows and columns of Σ), it can be assumed that the diagonal elements of D appear in descending order from upper left to lower right. Denote by Dk the square diagonal submatrix including the first k rows and columns of D; observe that Dk is nonsingular. Let Gk denote the submatrix obtained from G by dropping all but the first k rows of G. Then the “folding-in” method proceeds by appending first the column vector ytGktDk−1 and then the row vector zGktDk−1 to Gk; the matrix Dk remains unchanged. The cost of this update is O(nk). Then HM≈MxktGktDk−1Gkxk; computing HM is an O(k2) operation. When k=r and there are no numerical roundoff errors, this process is exact; otherwise, error accrues. The updates can also be batched into blocks of m feedback records, i.e., the vectors y and z can be replaced by n×m and m×(n+m) matrices, respectively. If desired, more accurate (and expensive) updating schemes are possible.
In block 214, a dependency of the data attributes A and B is determined by comparing the statistic HM to a threshold value θ: attributes A and B are determined to be dependent if HM>θ, and independent if HM≦θ. In some exemplary embodiments, θ is the (1−p) quantile of a standard chi-squared distribution with r=r(Q) degrees of freedom, where p is the maximum allowable probability (e.g., specified by a user) of erroneously declaring attributes A and B dependent when A and B are actually independent. The intuition underlying this procedure is that, if M is large and if the elements t1,t2, . . . ,tM in the dataset were generated as independent and identically distributed samples from a hypothetical “superpopulation” distribution in which attributes A and B were truly independent with marginal frequencies equal to those actually observed, then HM would have approximately a standard chi-squared distribution with r degrees of freedom. Thus θ is chosen so that, under the foregoing “true independence” scenario, the probability that HM exceeds θ (so that A and B are erroneously declared dependent) does not exceed p. This type of superpopulation approach is standard in the theory of survey sampling. In the (unlikely) case in which complete feedback observations are available for all possible pairs (α,β)εDA×DB, then the foregoing procedure essentially reduces to the classical chi-squared test for independence.
In block 216, a dependency measure is computed for the attributes A and B that were determined to be dependent in block 214. Practical data sets may have a large set of dependent attribute pairs. Thus, there may be a need to rank each detected dependent attribute pair in order of decreasing dependency measure. Such a ranking measure for a given pair can be obtained by normalizing the statistic HM computed for that pair to obtain a normalized dependency measure of the form HM/z, where z is a normalizing factor. Normalization is needed to obtain fair comparisons between different data-attribute pairs, since the HM values for different pairs are obtained, in general, from different numbers of feedback observations. In some exemplary embodiments, z is chosen as the (1−p) quantile of the standard chi-squared distribution with r=r(Q) degrees of freedom. Based on experimentation, values of p between 0.005 and 0.05 yield acceptable results. Other possible choices of z include:
1. Table cardinality: z1=M. Observe that, when comparing attribute pairs in the same dataset, this normalization is equivalent to using the “raw” value of HM.
2. Minimum number of distinct values in the full dataset: z2=min(|DA|,|DB|). This normalization is basically the normalization for standard chi-squared analysis of independence.
3. Minimum number of distinct values in the feedback warehouse: z3=min(|D′A|,|D′B|). Here, D′A (⊂DA) is the number of distinct values of attribute A appearing in the feedback records, and similarly for D′B. This normalization is basically the “feedback version” of the normalization in 2.
4. Minimum number of distinct values used to compute HM: z4=min(nA,nB). Here, nA is the number of distinct αi values actually used in computing HM, and similarly for nB.
5. Degrees of freedom: z5=r. This normalization can also be viewed as a feedback version of 2 above.
6. Courant-Fischer bound: z6=M∥x∥2/d*, where d* is the smallest positive diagonal element of the matrix D used to compute HM and ∥x∥2 is the sum of the squares of the elements of the vector x used to compute HM. This value of z is an upper bound on HM, and therefore will normalize HM to lie in the range [0,1]. This normalization can be numerically unstable, however, because d* can be close to 0. A desirable normalization z, i.e., the (1−p) chi-squared quantile, can be viewed as a rough approximation of z6; whereas, z6 represents an upper bound on HM, the quantity z represents a stable, approximate upper bound that is exceeded with low probability.
In block 218, the dependency measure of the data attribute pair is saved, e.g., to a system catalog for use in a database system. If there are a plurality (e.g., two or more) of data attribute pairs, then the ranked dependency measures of the data attribute pairs are saved. In some exemplary embodiments, if space in the system catalog is limited, the dependency measures of the k most dependent attribute pairs are saved, where the number k may be determined by a user.
Exemplary computer system 100 and server device 102 are illustrated and described with respect to various components, modules, etc. for exemplary purposes. It should be understood that other variations, combinations, or integrations of such elements that provide the same features, functions, etc. are included within the scope of embodiments of the invention.
The flow diagram described herein is just an example. There may be many variations to this diagram or the blocks (or operations) thereof without departing from the spirit of embodiments of the invention. For instance, the blocks may be performed in a differing order, or blocks may be added, deleted or modified. All of these variations are considered a part of the claimed invention. Furthermore, although an exemplary execution of the flow diagram blocks is described with respect to the exemplary computer system 100 and server device 102, execution of the flow diagram blocks may be implemented with other hardware and/or software architectures that provide the same features, functions, etc. in accordance with exemplary embodiments of the invention.
Exemplary embodiments of the invention can be implemented in hardware, software, or a combination of both. Those embodiments implemented in software may, for example, include firmware, resident software, microcode, etc. Exemplary embodiments of the invention may also be implemented as a computer program product accessible from a computer-usable or computer-readable medium providing program code for use by or in connection with a computer or other instruction execution system. In this regard, a computer-usable or computer-readable medium can be any apparatus that can contain, store, communicate, propagate, or transport the program for use in connection with the instruction execution system, apparatus, or device.
The computer-usable or computer-readable medium can be an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system (apparatus, device, etc.) or a propagation medium. Examples of a computer-readable medium include a semiconductor or solid state memory, magnetic tape, a removable computer diskette, a random access memory (RAM), a read-only memory (ROM), a rigid magnetic disk, or an optical disk. Some current examples of optical disks include compact disk-read only memory (CD-ROM), compact disk-read/write (CD-R/W), or digital video disk (DVD).
A data processing system suitable for storing and/or executing program code can include at least one processor coupled directly or indirectly to memory elements through a system bus. The memory elements can include local memory employed during actual execution of the program code, bulk storage, or cache memories that provide temporary storage of at least some program code to reduce the number of times the code needs to be retrieved from bulk storage during execution.
Input/output (I/O) devices (e.g., keyboards, displays, pointing devices, etc.) can be coupled to the data processing system either directly or through intervening I/O controllers. Network adapters may also be coupled to the data processing system to allow the system to be coupled to other data processing systems or remote printers or storage devices through intervening private or public networks. Telephonic modems, cable modems, and ethernet cards are a few examples of the currently available types of network adapters.
While exemplary embodiments of the invention have been described, it will be understood that those skilled in the art, both now and in the future, may make various improvements and enhancements which fall within the scope of the claims that follow. These claims should be construed to maintain the proper protection for the invention first described.