The present invention relates generally to the field of data management and, in particular, to approximate query processing over compact data synopses.
There is a lot of interest in approximate query and request processing over compact, precomputed data synopses to address the problem of dealing with complex queries over massive amounts of data in interactive decision-support and data-exploration environments. For several of these application scenarios, exact answers are not required and users may, in fact, prefer fast, approximate answers to their queries. Examples include the initial, exploratory drill-down queries in ad-hoc data mining systems, where the goal is to quickly identify the interesting regions of the underlying database, or aggregation queries in decision-support systems, where the full precision of the exact answer is not needed and the first few digits of precision suffice (e.g., the leading digits of a total in the millions or the nearest percentile of a percentage).
Haar wavelets are a mathematical tool for the hierarchical decomposition of functions with several successful applications in signal and image processing. A number of recent studies have also demonstrated the effectiveness of the Haar wavelet decomposition as a data-reduction tool for database problems, including selectivity estimation and approximate query and request processing over massive relational tables and data streams. Briefly, the decomposition process is applied over an input data set along with a thresholding procedure in order to obtain a compact data synopsis comprising a selected small set of Haar wavelet coefficients. Several research studies have demonstrated that fast and accurate approximate query and request processing engines can be designed to operate solely over such precomputed compact wavelet synopses.
The Haar wavelet decomposition was originally designed with the objective of minimizing the overall root-mean-squared error (i.e., the L2-norm) in the data approximation. However, recent work on probabilistic wavelet synopses also demonstrates their use for optimizing other error metrics, including the maximum relative error in the approximate reconstruction of individual data values, which is a metric for query answers and enables meaningful, non-trivial error guarantees for reconstructed values. While the use of the traditional Haar wavelet decomposition gives the user no knowledge of whether a particular answer is highly accurate or off by many orders of magnitude, the use of probabilistic wavelet synopses provides the user with an interval in which the exact answer is guaranteed to lie.
Despite the surge of interest in wavelet-based data reduction and approximation in database systems, relatively little attention has been paid to the application of wavelet techniques to complex tabular data sets with multiple measures (multiple numeric entries for each table cell). Such massive, multi-measure tables arise naturally in several application domains, including online analytical processing (OLAP) environments and time-series analysis/correlation systems. As an example, a corporate sales database may tabulate, for each available product, (1) the number of items sold, (2) revenue and profit numbers for the product, and (3) costs associated with the product, such as shipping and storage costs. Similarly, a network-traffic monitoring system takes readings on each time-tick from a number of distinct elements, such as routers and switches, in the underlying network, and typically several measures of interest need to be monitored (e.g., input/output traffic numbers for each router or switch interface) even for a fixed network element. Both of these types of applications may be characterized not only by the potentially very large domain sizes for some dimensions (e.g., several thousands of time ticks or different products sold), but also by the huge amounts of collected data.
Recently, the idea of extended-wavelet coefficients was introduced as a flexible, space-efficient storage format for extending conventional wavelet-based summaries to the context of multi-measure data sets. However, the synopsis-construction techniques can only be used to minimize (for a given space budget) the weighted sum of the overall L2-norm errors for each measure. Still, given the pitfalls and shortcomings of L2-error-optimized wavelet synopses for building effective approximate query processing engines, there is a clear need for more sophisticated wavelet-based summarization techniques for multi-measure data that can be specifically optimized for different error metrics (such as the relative error metric).
Various deficiencies of the prior art are addressed by various exemplary embodiments of the present invention of probabilistic wavelet synopsis for multiple measures, including algorithms for constructing effective probabilistic wavelet-synopses over multi-measure data sets and techniques that can accommodate a number of different error metrics, including the relative-error metric, thus enabling meaningful error guarantees on the accuracy of the approximation for individual measure values. By operating on all measures simultaneously, exemplary embodiments judiciously allocate the available space to all measures based on the difficulty of accurately approximating each one, and exploit storage dependencies among coefficient values to achieve improved storage utilization and, therefore, improve accuracy in data reconstruction over prior techniques that operate on each measure individually.
One embodiment is a method for probabilistic wavelet synopses for data sets with multiple measures. In response to a request, a wavelet synopsis is constructed that minimizes an error metric for a data domain having multiple measures. The wavelet synopsis includes extended wavelet coefficients. Space is allocated by applying a probabilistic thresholding technique that is based on unbiased randomized rounding of the extended wavelet coefficients. The probabilistic thresholding accounts for storage dependencies among the extended wavelet coefficients and selects rounding values such that the error metric is minimized, while not exceeding a prescribed space limit for the probabilistic wavelet synopsis. An approximation in response to the request is provided.
Another embodiment is a method for probabilistic wavelet synopses for multiple measures, where a synopsis space is allocated to extended wavelet coefficients in an error tree based on marginal error gains by, at each step, attempting to allocate additional space to a subset of the extended wavelet coefficients that results in a largest reduction in a maximum normalized standard error (NSE2) per unit of space used. Estimated current and potential maximum NSE2 values are calculated at a root coefficient of the error tree for each data measure and an approximation to the maximum minimization problem for the extended wavelet coefficients is provided.
The teachings of the present invention can be readily understood by considering the following detailed description in conjunction with the accompanying drawings, in which:
The invention will be primarily described within the general context of an embodiment of wavelet synopses for multiple measures, however, those skilled in the art and informed by the teachings herein will realize that the invention is applicable to approximation, uncertainty, and probabilistic databases, benchmarking and performance evaluation, data cleaning, transformation, migration, and lineage, data mining and knowledge discovery, data models, semantics, and query languages, data privacy and security, data stream and publish-subscribe systems, data warehousing and OLAP, digital libraries, embedded, sensor, and mobile databases, metadata management, middleware and workflow management, multimedia databases, optimization, performance, availability, and reliability, parallel, distributed, and heterogeneous databases, peer-to-peer and networked data management, personalized information systems, physical database design, indexing, and tuning, replication, caching, and view management, scientific, biological, and statistical databases, spatial, temporal, and real-time databases, storage and transaction management, text databases and information retrieval, user interfaces and data visualization, web information and Web services, extensible markup language (XML) and semi-structured databases and many different kinds of data management.
Probabilistic Wavelets over Multiple Measures
Formulation and Exact PODP Solution
The problem of constructing probabilistic wavelet synopses over multi-measure data sets using the space-efficient extended wavelet coefficient format is formally defined below. Utilizing this more involved storage format for coefficients forces non-trivial dependencies between thresholding decisions made across different measures, thus significantly increasing the complexity of probabilistic coefficient thresholding. More specifically, these dependencies cause the principle of optimality based on a total ordering of partial solutions that is required by the earlier single-measure dynamic programming (DP) solutions to be violated, rendering these techniques inapplicable in the multi-measure setting. Thus, exemplary embodiments include a probabilistic thresholding scheme for multi-measure data sets based on the idea of an exact partial-order DP (PODP) formulation. Briefly, the PODP solution generalizes earlier single-measure DP schemes to data sets with M measures by using an M-component vector objective and an M-component less-than partial order to prune sub-problem solutions that cannot possibly be part of an optimal solution.
Fast, Greedy Approximate Probabilistic-Thresholding Algorithm
Given the very high space and time complexities of exemplary embodiments of the PODP algorithm, an exemplary embodiment of a novel, greedy approximation algorithm (GreedyRel) is used for probabilistic coefficient thresholding over multi-measure data. Briefly, the GreedyRel heuristic exploits the error-tree structure for Haar wavelet coefficients in greedily allocating the available synopsis space based on the idea of marginal error gains. More specifically, at each step, GreedyRel identifies, for each error subtree, the subset of wavelet coefficients that is expected to give the largest per-space reduction in the error metric, and allocates space to the best such subset overall (i.e., in the entire tree). The time and space complexities of GreedyRel are only linear in the number of measures involved and the data-set size and, in fact, are also significantly lower than those of earlier DP algorithms for the single-measure case. Note that the complexities of the earlier DP algorithms are, even for the single-measure case, at least quadratic in the domain size, making GreedyRel a practical solution for constructing accurate probabilistic wavelet synopses over large data sets, even when only a single measure is present.
Experimental Results Verifying the Effectiveness of the Approach
Results from an extensive experimental study of exemplary embodiments are provided with both synthetic and real-life data sets. The results validate the approach, demonstrating that (1) the algorithms easily outperform naive approaches based on optimizing individual measures independently, typically producing errors that are up to a factor of seven smaller than prior techniques; and (2) the greedy thresholding scheme provides near-optimal and, at the same time, very fast and scalable solutions to the probabilistic wavelet synopsis construction problem.
The Haar Wavelet Transform
Wavelets are a useful mathematical tool for hierarchically decomposing functions in ways that are both efficient and theoretically sound. Broadly speaking, the wavelet decomposition of a function consists of a coarse overall approximation along with detail coefficients that influence the function at various scales. Suppose the one-dimensional data vector A is given with A containing the N=8 data values A=[2,2,0,2,3,5,4,4]. The Haar wavelet transform of A can be computed as follows. First, the values are averaged together pairwise to get a new lower-resolution representation of the data with the following average values [2,1,4,4]. In other words, the average of the first two values (that is, 2 and 2) is 2, that of the next two values (that is, 0 and 2) is 1, and so on. Some information has been lost in this averaging process. To be able to restore the original values of the data array, some detail coefficients are stored that capture the missing information. In Haar wavelets, these detail coefficients are simply the differences of the (second of the) averaged values from the computed pairwise averages. Thus, in the simple example, the detail coefficient for the first pair of values is 0 since 2−2=0, and for the second pair it is −1 since 1−2=−1. Note that no information has been lost in this process; it is fairly simple to reconstruct the eight values of the original data array from the lower-resolution array containing the four averages and the four detail coefficients. Recursively applying the above pairwise averaging and differencing process on the lower-resolution array containing the averages, the following full decomposition is obtained: resolution 3 holds the original data values [2,2,0,2,3,5,4,4]; resolution 2 holds the averages [2,1,4,4] and detail coefficients [0,−1,−1,0]; resolution 1 holds the averages [3/2,4] and detail coefficients [1/2,0]; and resolution 0 holds the overall average [11/4] and detail coefficient [−5/4].
The wavelet transform (also known as the wavelet decomposition) of A is the single coefficient representing the overall average of the data values followed by the detail coefficients in the order of increasing resolution. Thus, the one-dimensional Haar wavelet transform of A is given by WA=[11/4, −5/4,1/2,0,0,−1,−1,0]. Each entry in WA is called a wavelet coefficient. The main advantage of using WA instead of the original data vector A is that for vectors containing similar values most of the detail coefficients tend to have very small values. Thus, eliminating such small coefficients from the wavelet transform (i.e., treating them as zeros) introduces only small error when reconstructing the original data, resulting in a very effective form of lossy data compression. Furthermore, the Haar wavelet decomposition can also be extended to multi-dimensional data arrays through natural generalizations of the one-dimensional decomposition process described above. Multi-dimensional Haar wavelets have been used in a wide variety of applications, including approximate query answering over complex decision-support data sets.
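The averaging-and-differencing process just described can be captured in a few lines of code. The following minimal Python sketch (the function name and the use of exact fractions are illustrative choices, not part of the original disclosure) reproduces WA for the example vector.

```python
from fractions import Fraction

def haar_transform(values):
    """One-dimensional Haar wavelet transform by repeated pairwise
    averaging and differencing (detail = pairwise average minus second value)."""
    data = [Fraction(v) for v in values]   # exact arithmetic keeps values such as 11/4
    details = []                           # detail coefficients, coarsest level last in each prepend
    while len(data) > 1:
        averages = [(a + b) / 2 for a, b in zip(data[0::2], data[1::2])]
        details = [avg - b for avg, b in zip(averages, data[1::2])] + details
        data = averages
    return data + details                  # [overall average, details in order of increasing resolution]

print(haar_transform([2, 2, 0, 2, 3, 5, 4, 4]))
# -> WA = [11/4, -5/4, 1/2, 0, 0, -1, -1, 0]
```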
Error Tree and Conventional Wavelet Synopses
A helpful tool for exploring the properties of the Haar wavelet decomposition is the error tree structure, which is a hierarchical structure built based on the wavelet transform process.
Given a node u in an error tree T, let path(u) denote the set of all proper ancestors of u in T (i.e., the nodes on the path from u to the root of T, including the root but not u) with non-zero coefficients. A property of the Haar wavelet decomposition is that the reconstruction of any data value di depends only on the values of coefficients on path(di); more specifically, di = Σcj∈path(di) δij·cj, where δij = +1 if di is in the left child subtree of cj or cj is the overall-average (root) coefficient, and δij = −1 otherwise.
The support region for a coefficient ci is defined as the set of (contiguous) data values that ci is used to reconstruct; the support region for a coefficient ci is uniquely identified by its coordinate i.
Given a limited amount of storage for building a wavelet synopsis of the input data array A, a thresholding procedure retains a certain number B<<N of the coefficients as a highly-compressed approximate representation of the original data (the remaining coefficients are implicitly set to 0). Conventional coefficient thresholding is a deterministic process that seeks to minimize the overall root-mean-squared error (L2 error norm) of the data approximation by retaining the B largest wavelet coefficients in absolute normalized value. L2 coefficient thresholding has also been the method of choice for the bulk of existing work on Haar-wavelet applications in the data-reduction and approximate query processing domains.
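A minimal sketch of this conventional L2 thresholding follows; it assumes the usual Haar normalization in which a coefficient at resolution level l is scaled by 1/√(2^l), a convention not spelled out in the text above, so the exact scaling is an illustrative assumption.

```python
import math

def keep_top_b(wavelet_coeffs, B):
    """Conventional L2 thresholding: retain the B coefficients that are largest
    in absolute normalized value; all others are implicitly treated as zero."""
    def level(i):
        # coefficient 0 (overall average) and coefficient 1 sit at the coarsest level 0;
        # detail coefficient i >= 1 sits at level floor(log2(i))
        return 0 if i == 0 else int(math.log2(i))

    def normalized(i):
        # dividing by sqrt(2^level) ranks coefficients by their orthonormal magnitude
        return abs(wavelet_coeffs[i]) / math.sqrt(2 ** level(i))

    ranked = sorted(range(len(wavelet_coeffs)), key=normalized, reverse=True)
    return {i: wavelet_coeffs[i] for i in ranked[:B] if wavelet_coeffs[i] != 0}

# keep_top_b([11/4, -5/4, 1/2, 0, 0, -1, -1, 0], 3) keeps coefficients 0, 1 and one of {5, 6}
```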
Probabilistic Wavelet Synopses
Unfortunately, wavelet synopses optimized for overall L2 error using the above-described process may not always be the best choice for approximate query processing systems. Such conventional wavelet synopses suffer from several problems, including the introduction of severe bias in the data reconstruction and wide variance in the quality of the data approximation, as well as the lack of non-trivial guarantees for individual approximate answers. A probabilistic wavelet synopsis addresses these shortcomings. This approach constructs data summaries from wavelet-transform arrays. In a nutshell, the idea is to apply a probabilistic thresholding process based on randomized rounding that randomly rounds coefficients either up to a larger rounding value or down to zero, so that the value of each coefficient is correct on expectation. More formally, each non-zero wavelet coefficient ci is associated with a rounding value λi and a corresponding retention probability yi=ci/λi such that 0<yi≦1, and the value of coefficient ci in the synopsis becomes a random variable Ci∈{0, λi}, where

Ci = λi with probability yi, and Ci = 0 with probability 1−yi.
In other words, a probabilistic wavelet synopsis essentially rounds each non-zero wavelet coefficient ci independently to either λi or zero by flipping a biased coin with success probability yi. Note that the above rounding process is unbiased; that is, the expected value of each rounded coefficient is E[Ci]=λi·yi+0·(1−yi)=ci, i.e., the actual coefficient value, while its variance is Var(Ci)=yi·λi²−ci²=ci·(λi−ci)=ci²·(1−yi)/yi, and the expected size of the synopsis is simply E[[synopsis]]=Σi yi, i.e., the expected number of retained (non-zero) coefficients.
Thus, since each data value can be reconstructed as a simple linear combination of wavelet coefficients, by linearity of expectation, probabilistic wavelet synopses guarantee unbiased approximations of individual data values as well as of range-aggregate query answers.
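The randomized-rounding step itself can be illustrated with the short sketch below; it assumes the rounding values lambdas have already been chosen by whichever thresholding strategy is in use, and all names are illustrative.

```python
import random

def round_coefficients(coeffs, lambdas, rng=random.Random(0)):
    """Unbiased randomized rounding: coefficient c_i is retained as lambda_i
    with probability y_i = c_i / lambda_i and rounded down to zero otherwise,
    so that E[C_i] = y_i * lambda_i = c_i."""
    rounded = {}
    for i, c in coeffs.items():
        if c == 0:
            continue
        y = c / lambdas[i]                 # retention probability, 0 < y <= 1
        assert 0 < y <= 1, "rounding value must not be smaller in magnitude than c_i"
        if rng.random() < y:
            rounded[i] = lambdas[i]        # retained (rounded up to lambda_i)
        # else: implicitly zero (not stored in the synopsis)
    return rounded
```

On expectation, the retained values reproduce the original coefficients exactly, which is the unbiasedness property relied upon above.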
There are several different algorithms for building probabilistic wavelet synopses. The coefficient rounding values {λi} need to be selected such that some desired error metric for the data approximation is minimized, while not exceeding a prescribed space limit B for the synopsis (i.e., E[[synopsis]]≦B). These strategies are based on formulating appropriate dynamic-programming (DP) recurrences over the Haar error-tree that explicitly minimize either (a) the maximum normalized standard error (MinRelVar) or (b) the maximum normalized bias (MinRelBias) for each reconstructed value in the data domain. The rationale for these probabilistic error metrics is that they are directly related to the maximum relative error (with an appropriate sanity bound s, whose role is to ensure that relative-error numbers are not unduly dominated by small data values) in the approximation of individual data values based on the synopsis; that is, both the MinRelVar and MinRelBias schemes try to (probabilistically) control the quantity

maxi { |d̂i−di| / max{|di|, s} },

where d̂i denotes the data value reconstructed based on the wavelet synopsis. Note, of course, that d̂i is again a random variable, defined as the ±1 summation of all (independent) coefficient random variables on path(di). Bounding the maximum relative error in the approximation also allows for meaningful error guarantees to be provided on reconstructed data values.
To accomplish this, the DP algorithms seek to minimize the maximum normalized standard error (NSE) in the data reconstruction, defined as

maxi { √Var(d̂i) / max{|di|, s} },

where Var(d̂i)=Σcj∈path(di) Var(j,yj). The algorithms also naturally extend to multi-dimensional data and wavelets, with a running time of O(Nz·2^D·qB·(q·log(qB)+D·2^D)) (Nz being the number of nodes with at least one non-zero coefficient value, N being the maximum domain size, and D being the number of dimensions), an overall space requirement of O(Nz·2^D·qB), and an in-memory working-set size of O(2^D·qB·log N). Note that for synopsis spaces B=O(Nz), the above running time and space complexities are at least quadratic in the number of tuples.
Extended Wavelet Coefficients
The wavelet coefficients can be stored as tuples with D+1 fields, where D is the dimensionality of the data array. Each of these tuples contains the D coordinates of the stored wavelet coefficient (one per dimension), which are used to determine the coefficient's support region, and the stored coefficient value. In multi-measure data sets, storage dependencies among different coefficient values may arise. This occurs because two or more coefficient values for different measures may correspond to the same coefficient coordinates, which results in duplicating the storage of these coordinates. This storage duplication increases with the number of the data set's dimensions due to the increased size of the coefficient coordinates.
To alleviate these shortcomings, the notion of an extended wavelet coefficient is introduced. For a data set comprising M measures, an extended wavelet coefficient is a flexible, space-efficient storage format that can be used to store any subset of up to M coefficient values for each combination of coefficient coordinates. Briefly, this is achieved through the use of a bitmap of size M, which helps determine exactly the subset of coefficient values that has been stored; thus, the ith bitmap bit is set if and only if the coefficient value for the ith measure has been stored (1≦i≦M). More formally, each extended wavelet coefficient is defined as a triplet (C, β, V) consisting of (1) the coordinates C of the coefficient; (2) a bitmap β of size M, where the ith bit denotes the existence or absence of a coefficient value for the ith measure; and (3) the set of stored coefficient values V. The (coordinates, bitmap) pair is referred to as the extended wavelet coefficient's header.
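The triplet (C, β, V) can be represented by a small container along the following lines; this is a sketch, and the field names and the use of Python lists are illustrative choices rather than the storage layout of the original format.

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class ExtendedCoefficient:
    coords: Tuple[int, ...]      # coefficient coordinates, one per dimension
    bitmap: List[bool]           # bitmap of size M; bit j set iff measure j's value is stored
    values: List[float]          # the stored coefficient values, in measure order

    def value_for_measure(self, j):
        """Return the stored value for measure j (1-based), or None if absent."""
        if not self.bitmap[j - 1]:
            return None
        # position of the value = number of set bits before bit j
        return self.values[sum(self.bitmap[: j - 1])]
```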
Probabilistic Wavelets for Multiple Measures
Problem Formulation and Overview
It has been demonstrated that exploiting storage dependencies among coefficient values can lead to better storage utilization (i.e., more useful coefficient values stored for the same space bound) and, therefore, improved accuracy for query answers. However, those algorithms can only be applied towards minimizing the overall L2 error of the approximation, not towards other error metrics, such as the maximum relative error, which is relevant for providing approximate query answers. On the other hand, while the known work utilized the notion of probabilistic wavelet synopses to propose algorithms that minimize the maximum relative error of the approximation, none of these algorithms can exploit storage dependencies between coefficient values to construct effective probabilistic wavelet synopses for multi-measure data sets.
In exemplary embodiments of the present invention, the notion of the extended wavelet coefficients and the probabilistic wavelet synopses are utilized as helpful tools to develop algorithms that seek to minimize the maximum relative error of the approximation in multi-measure data sets. To simplify the exposition, this description first focuses primarily on the one-dimensional case and then extensions to multi-dimensional wavelets are described.
Expected Size of Extended Coefficients
The sharing of the common header space (i.e., coordinates+bitmap) among coefficient values introduces non-trivial dependencies in the thresholding process across coefficients for different measures. To be more precise, consider a data set with M measures and let cij denote the Haar coefficient value corresponding to the jth measure at coordinate i and let yij denote the retention probability for cij in the synopsis. Also, let ECi be the extended wavelet coefficient at coordinate i and let H denote the space required by an extended coefficient header. The unit of space is set equal to the space required to store a single coefficient value (e.g., size of a float) and all space requirements are expressed in terms of this unit. The expected space requirement of the extended coefficient ECi is computed as:

E[[ECi]] = Σj=1..M yij + H·(1 − Πj=1..M (1 − yij))    (2)
The first summand in the above formula captures the expected space for all (non-zero) individual coefficient values at coordinate i. The second summand captures the expected header overhead. To see this, note that if at least one coefficient value is stored, then a header space of H must also be allotted. And, of course, the probability of storing ≧1 coefficient values is just one minus the probability that none of the coefficients is stored.
Equation (2) demonstrates that the sharing of header space amongst the individual coefficient values cij for different measures creates a fairly complex dependency of the overall extended-coefficient space requirement on the individual retention probabilities yij. Given a space budget B for the wavelet synopsis, exploiting header-space sharing and this storage dependency across different measures is crucial for achieving effective storage utilization in the final synopsis. This implies that the exemplary embodiments of the probabilistic-thresholding strategies for allocating synopsis space cannot operate on each measure individually; instead, the space allocation must explicitly account for the storage dependencies across groups of coefficient values (corresponding to different measures). This significantly complicates the design of probabilistic-thresholding algorithms for extended wavelet coefficients.
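A minimal sketch of evaluating equation (2), under the stated convention that one unit of space equals one stored coefficient value, is shown below; the function name and argument layout are illustrative.

```python
def expected_extended_size(y, H):
    """Expected space of one extended coefficient, in coefficient-value units.

    y : list of retention probabilities y_i1, ..., y_iM (zero for absent values)
    H : header space (coordinates + bitmap) in the same units
    Implements E[|EC_i|] = sum_j y_ij + H * (1 - prod_j (1 - y_ij)).
    """
    prob_none_stored = 1.0
    for y_ij in y:
        prob_none_stored *= (1.0 - y_ij)
    return sum(y) + H * (1.0 - prob_none_stored)

# Example: two measures with y = [0.5, 0.5] and H = 1 give 1 + 0.75 = 1.75 units,
# versus 2 * (0.5 + 1 * 0.5) = 2.0 units if each measure paid for its own header.
```

The example at the bottom illustrates why exploiting header sharing matters: the shared header is paid for only when at least one of the coefficient values is actually stored.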
Problem Statement and Approach
A goal is to minimize the maximum relative reconstruction error for each individual data value; this also allows exemplary embodiments to provide meaningful guarantees on the accuracy of each reconstructed value. More formally, an aim is to produce estimates d̂ij of the data values dij, for each coordinate i and measure index j, such that |d̂ij−dij|≦ε·max{|dij|, sj}, for given per-measure sanity bounds sj>0, where the error bound ε>0 is minimized subject to the given space budget for the synopsis. Since probabilistic thresholding implies that d̂ij is again a random variable, an argument based on the Chebyshev bound shows that minimizing the maximum NSE across all measures (or, equivalently, the maximum NSE2) guarantees a maximum relative error bound that is satisfied with high probability. Thus, the probabilistic-thresholding problem for extended wavelet coefficients may be defined as follows.
Maximum NSE Minimization for Extended Coefficients
Find the retention probabilities yij for coefficients cij that minimize the maximum NSE for each reconstructed data value across all measures; that is, minimize

maxi,j { √Var(d̂ij) / max{|dij|, sj} },

subject to the constraints 0<yij≦1 for all non-zero cij and E[[synopsis]] = Σi E[[ECi]] ≦ B, where the expected size E[[ECi]] of each extended coefficient is given by equation (2).
The above maximum NSE minimization problem for multi-measure data is addressed by exemplary embodiments of the present invention. Exemplary embodiments of algorithms exploit both the error-tree structure of the Haar decomposition and the above-described storage dependencies (equation (2)) for extended coefficients in order to intelligently assign retention probabilities {yij} to non-zero coefficients within the overall space-budget constraint B. The exemplary embodiments also rely on quantizing the space allotments to integer multiples of 1/q, where q>1 is an integer input parameter; that is, the constraint 0<yij≦1 is replaced by yij∈{1/q, 2/q, . . . , 1} in the above problem formulation.
An Algorithm Formulation: Partial-Order Dynamic Programming
Consider an input data set with M measures. An exemplary embodiment of the present invention includes a partial-order dynamic programming (PODP) algorithm that processes the nodes in the error tree bottom-up and calculates for each node i and each space budget 0≦Bi≦B to be allocated to the extended wavelet coefficient values in the node's entire subtree, a collection of incomparable solutions. Each such solution R[i, Bi] is an M-component vector of NSE2 values corresponding to all M measures for the data values in the subtree rooted at node i and assuming a total space of Bi allotted to extended coefficients in that subtree. A goal of the PODP algorithm is, of course, to minimize the maximum component of the vector R[root,B]; that is, minimize maxk=1, . . . ,M{R[root,B]k}.
A complication in the optimization problem is that, for a given synopsis space budget, these M per-measure NSE values are not independent and cannot be optimized individually; this is, again, due to the intricate storage dependencies that arise between the approximation at different measures because of the shared header space (equation (2)). As already described, the thresholding algorithm exploits these dependencies to ensure effective synopsis-space utilization. This implies that the thresholding schemes need to treat these M-component NSE vectors as a unit during the optimization process.
Let dm
To ensure optimality, the bottom-up computation of the DP recurrence cannot afford to maintain just the locally-optimal partial solution for each subtree. In other words, merely tabulating the R[i,B] vector with the minimum maximum component for each internal tree node and each possible space allotment is not sufficient—more information needs to be maintained and explored during the bottom-up computation. As a simple example, consider the scenario depicted in
of the coefficient values of node i are depicted when total space y=yi1+yi2 has been allocated to them and for the data values in the left subtree of node i. It is easy to see that, in this example, even though R′[2i,B−y] is locally-suboptimal at node 2i (because its maximal component is larger than that of R[2i,B−y]), it gives a superior overall solution of [1+2,3+0.5]=[3,3.5] at node i, when combined with i's local variance vector.
In the exemplary embodiment of the PODP algorithm, unlike most other DP solutions, the conventional principle of optimality based on a total ordering of partial solutions is no longer applicable. Thus, locally-suboptimal R[i,B]'s (i.e., with large maximum component NSE2s) cannot be safely pruned, because they may, in fact, be part of an optimal solution higher up in the tree. However, there does exist a safe pruning criterion based on a partial ordering of the R[i,B] vectors defined through the M-component less-than operator ≦M, which is defined over M-component vectors u, v as follows:

u ≦M v if and only if ui≦vi, ∀i∈{1, . . . , M}.

For a given coordinate i and space allotment B, we say that a partial solution R′[i,B] is covered by another partial solution R[i,B] if and only if R[i,B] ≦M R′[i,B]. It is easy to see that, in this case, R′[i,B] can be safely pruned from the set of partial solutions for the (i,B) combination, because, intuitively, R[i,B] can always be used in its place to give an overall solution of at least as good quality.
In the exemplary embodiment of the partial-order dynamic programming (PODP) solution to the maximum NSE minimization problem for extended coefficients, the partial, bottom-up computed solutions R[i,B] are M-component vectors of per-measure NSE2 values for coefficient subtrees, and such partial solutions are only pruned based on the ≦M partial order. Thus, for each coordinate-space combination (i,B), the exemplary embodiment of the PODP algorithm tabulates a collection R[i,B] of incomparable solutions that represent the boundary points of ≦M:

R[i,B] = { R[i,B] : for any other R′[i,B]∈R[i,B], R[i,B] ≰M R′[i,B] and R′[i,B] ≰M R[i,B] }.
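The covering test and the resulting pruning can be illustrated as follows; this is a sketch, and the quadratic scan over candidate vectors is for clarity only (an actual implementation would maintain the incomparable frontier incrementally).

```python
def covers(u, v):
    """u <=_M v : u is component-wise no worse than v."""
    return all(ui <= vi for ui, vi in zip(u, v))

def prune_covered(solutions):
    """Keep only partial solutions not covered by any other solution."""
    frontier = []
    for r in solutions:
        if any(covers(other, r) for other in solutions if other is not r):
            continue                      # r is covered and can be safely discarded
        frontier.append(r)
    return frontier

# prune_covered([[3, 1], [1, 2], [2, 3]]) keeps all three (pairwise incomparable),
# while prune_covered([[3, 1], [1, 2], [3, 2]]) drops [3, 2], which [1, 2] covers.
```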
Of course, for each allotment of space B to the coefficient subtree rooted at node i, the exemplary embodiment of the PODP algorithm needs to iterate over all partial solutions computed in R[i,B] in order to compute the full set of (incomparable) partial solutions for node i's parent in the tree. Similarly, at leaves or intermediate root nodes, all possible space allotments {yij} to each individual measure are considered and the overall space requirements of the extended coefficient are estimated using equation (2). Using an integer parameter q>1 to quantize possible space allotments introduces some minor complications with respect to the shared header space (e.g., some small space fragmentation) that the exemplary embodiment of the algorithm handles.
The main drawback of the exemplary embodiment of the PODP-based solution is the dramatic increase in time and space complexity compared to the single-measure case. PODP relies on a much stricter, partial-order criterion for pruning suboptimal solutions, which implies that the sets of incomparable partial solutions R[i,B] that need to be stored and explored during the bottom-up computation can become very large. For instance, in the simple case of a leaf coefficient, it is easy to see that the number of options to consider can be as high as O(qM), compared to only O(q) in the single-measure case; furthermore, this number of possibilities can grow extremely fast (in the worst case, exponentially) as partial solutions are combined up the error tree.
A Fast, Greedy, Approximation Algorithm
Given the very high running-time and space complexities of the exemplary embodiment of the PODP-based solution described above, an exemplary embodiment of an effective approximation algorithm to the maximum NSE minimization problem for extended coefficients is provided. This exemplary embodiment is a very efficient, greedy, heuristic algorithm (termed GreedyRel) for this optimization problem. Briefly, GreedyRel tries to exploit some of the properties of dynamic-programming solutions, but allocates the synopsis space to extended coefficients greedily based on the idea of marginal error gains. An idea is to try, at each step, to allocate additional space to a subset of extended wavelet coefficients in the error tree that results in the largest reduction in the target error metric (i.e., maximum NSE2) per unit of space used.
The exemplary embodiment of the GreedyRel algorithm relies on three operations: (1) estimating the maximum per-measure NSE2 values at any node of the error tree; (2) estimating the best marginal error gain for any subtree by identifying the subset of coefficients in the subtree that are expected to give the largest per-space reduction in the maximum NSE2; and (3) allocating additional synopsis space to the best overall subset of extended coefficients (in the entire error tree). Let Tij denote the error subtree (for the jth measure) rooted at cij.
Estimating Maximum NSE2 at Error-Tree Nodes
In order to determine the potential reduction in the maximum squared NSE due to extra space, the exemplary embodiment of GreedyRel first needs to obtain an estimate for the current maximum NSE2 at any error-tree node. GreedyRel computes an estimated maximum NSE2 G[i,j] over any data value for the jth measure in the Tij subtree, using the recurrence:

G[i,j] = max{ G[2i,j] + Var(cij,yij)/Norm(2i,j), G[2i+1,j] + Var(cij,yij)/Norm(2i+1,j) },

where Norm(k,j) denotes the normalization term for the subtree rooted at node k.
The estimated maximum NSE2 value is the maximum of two costs calculated for the node's two child subtrees, where each cost sums the estimated maximum NSE2 of the subtree and the node's variance divided by the subtree normalization term. While one can easily show that in the optimal solution the maximum NSE2 in a subtree will occur for the smallest data value (the proof is based on similar arguments to the single-measure case), the above recurrence is only meant to provide an easy-to-compute estimate for a node's maximum NSE2 (under a given space allotment) that GreedyRel can use.
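A direct reading of this recurrence can be sketched as follows; G is assumed to be a dictionary keyed by (node, measure), and var(i, j) and norm(k, j) stand for the variance and normalization terms used in the description (these names are illustrative, not taken from the original pseudo code).

```python
def estimate_G(i, j, G, var, norm, num_coeff_nodes):
    """Bottom-up estimate of G[i, j]: the maximum NSE^2 for measure j over the
    data values in the subtree T_ij, as the larger of the two child costs."""
    costs = []
    for child in (2 * i, 2 * i + 1):
        # children that are data values (not coefficient nodes) contribute no subtree error
        child_G = G[(child, j)] if child < num_coeff_nodes else 0.0
        costs.append(child_G + var(i, j) / norm(child, j))
    G[(i, j)] = max(costs)
    return G[(i, j)]
```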
Estimating the Best Marginal Error Gains for Subtrees
Given an error subtree Tij (for the jth measure), the exemplary embodiment of the GreedyRel algorithm computes a subset potSet[i,j] of coefficient values in Tij, which, when allotted additional space, are estimated to provide the largest per-space reduction of the maximum squared NSE over all data values in the Tij subtree. The exemplary embodiments of the algorithms allocate the retention probabilities in multiples of 1/q, where q>1. Let G[i,j] be the current estimated maximum NSE2 for Tij (as described above) and let Gpot[i,j] denote the potential estimated maximum NSE2 for Tij, assuming that the retention probabilities of all coefficient values in potSet[i,j] are increased by a (minimal) additional amount of 1/q. Also, let potSpace[i,j] denote the increase in the overall synopsis size, i.e., the cumulative increase in the space for the corresponding extended coefficients, when allocating the extra space to the coefficient values in potSet[i,j]. The exemplary embodiment of the GreedyRel algorithm computes potSpace[i,j] and estimates the best error-gain subsets potSet[i,j] through the underlying error-tree structure.
Consider a coefficient value ckj∈potSet[i,j]. Based on equation (2), it is easy to see that an increase of δkj in the retention probability ykj of ckj increases the expected space of the corresponding extended coefficient ECk by:

δkj·(1 + H·Πl≠j (1 − ykl))    (4)

The total extra space potSpace[i,j] for all coefficient values in potSet[i,j] can be obtained by adding the results of equation (4) for each of these values (with δkj=1/q).
The marginal error gain for potSet[i,j] is then simply estimated as gain(potSet[i, j])=(G[i, j]−Gpot[i, j])/ potSpace[i, j].
To estimate the potSet[i,j] sets and the corresponding Gpot[i,j] (and gain()) values at each node, GreedyRel performs a bottom-up computation over the error-tree structure. For a leaf coefficient cij, the only possible choice is potSet[i,j]={cij}, which can result in a reduction in the maximum NSE2 if cij≠0 and yij<1 (otherwise, the variance of the coefficient is already 0 and can be safely ignored). In this case, the new maximum NSE2 estimate Gpot[i,j] at cij is simply obtained by re-evaluating the variance of cij with the increased retention probability yij+1/q, divided by the corresponding normalization term.
For a non-leaf coefficient cij, GreedyRel considers three distinct cases of forming potSet[i,j] and selects the one resulting in the largest marginal error gain estimate: (1) potSet[i,j]={cij} (i.e., select only cij for additional storage); (2) potSet[i,j]=potSet[k,j], where k∈{2i,2i+1} is such that G[i,j]=G[k,j]+Var(cij, yij)/Norm(k,j) (i.e., select the potSet from the child subtree whose estimated maximum NSE2 determines the current maximum NSE2 estimate at cij); and (3) potSet[i,j]=potSet[2i,j]∪potSet[2i+1,j] (i.e., select the union of the potSets from both child subtrees). Among the above three choices, GreedyRel selects the one resulting in the largest value for gain(potSet[i,j]) and records the choice made for coefficient cij (1, 2, or 3) in a variable chij. In order to estimate gain(potSet[i,j]) for each choice, GreedyRel uses the following estimates for the new maximum NSE2 Gpot[i,j] at cij (index k is defined as in case (2) above and l={2i,2i+1}−{k}):
As an example, consider the scenario depicted in
Note that GreedyRel does not need to store the coefficient sets potSet[i,j] at each error-tree node. These sets can be reconstructed on the fly, by traversing the error-tree structure, examining the value of the chij variable at each node cij, and continuing along the appropriate subtrees of the node, until reaching nodes with chij=1.
Distributing the Available Synopsis Space
After completing the above-described steps, the exemplary embodiment of the GreedyRel algorithm has computed the estimated current and potential maximum NSE2 values G[0,j] and Gpot[0,j] (along with the corresponding potSet and potSpace) at the root coefficient (node 0) of the error tree, for each data measure j. Because one objective is to minimize the maximum squared NSE among all measures over the entire domain, GreedyRel selects, at each step, the measure jmax with the maximum estimated NSE2 value at the root node (i.e., jmax=arg maxj{G[0,j]}), and proceeds to allocate additional space of potSpace[0,jmax] to the coefficients in potSet[0,jmax]. This is done in a recursive, top-down traversal of the error tree, starting from the root node and proceeding as follows (i denotes the current node index): (1) if chijmax=1, then increase the retention probability yijmax by 1/q; (2) if chijmax=2, then recurse to the child subtree Tk, k∈{2i, 2i+1}, through which the maximum NSE2 estimate G[i,jmax] is computed at node i; and (3) if chijmax=3, then recurse to both child subtrees T2i and T2i+1. Furthermore, after each of the above steps, the new G, Gpot, potSpace and ch values at node i are computed. These quantities need to be evaluated for all measures because, due to the space dependencies among the coefficient values, the increase in the retention probability for measure jmax may alter the ch values for the other measures.
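The overall space-distribution loop can be sketched as follows; the traverse routine (which walks potSet[0, jmax] top-down and refreshes the per-node quantities, including the root estimates G_root, in place) is assumed, and all names are illustrative rather than taken from the original pseudo code.

```python
import heapq

def distribute_space(G_root, traverse, budget, num_measures, q):
    """Greedy space distribution: at each step pick the measure whose estimated
    maximum NSE^2 at the root is largest and give extra space (in 1/q quanta)
    to its best coefficient subset via a top-down traversal of the error tree."""
    space_left = budget
    # max-heap over measures, keyed by the current root estimate G[0, j]
    heap = [(-G_root[j], j) for j in range(num_measures)]
    heapq.heapify(heap)
    while space_left > 0 and heap:
        _, j_max = heapq.heappop(heap)
        occupied = traverse(0, j_max, q, space_left)    # allocates along potSet[0, j_max]
        if occupied <= 0:                               # nothing further can be allocated
            break
        space_left -= occupied
        heapq.heappush(heap, (-G_root[j_max], j_max))   # G_root[j_max] refreshed by traverse
    return space_left
```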
Time and Space Complexity
For each of the N error-tree nodes, the exemplary embodiment of GreedyRel maintains the variables G[i,j], Gpot[i,j], potSpace[i,j], and chij. Thus, the space requirements per node are O(M), resulting in a total space complexity of O(NM).
In the bottom-up initialization phase (steps 1-6), GreedyRel computes, for each error-tree node, the values of the G[i,j], Gpot[i,j], potSpace[i,j], and chij variables (for each measure j). Each of these O(M) calculations can be done in O(1) time, making the total cost of the initialization phase O(NM). Then, note that each time GreedyRel allocates space to a set of K coefficients, the allocated space is ≧K×1/q (see equation (4)). To reach these K coefficients, GreedyRel traverses exactly K paths of maximum length O(log N). For each visited node, the new values of G, Gpot, potSpace, and ch are computed, which requires O(M) time. Finding the measure jmax with the maximum estimated NSE2 value at the root requires time O(log M) when using a heap structure to store just the G[0,j] values. Thus, GreedyRel distributes space ≧K×1/q in time O(KM log N+log M), making the amortized time per space quantum 1/q equal to O(M log N+(log M)/K)=O(M log N). Because the total number of such quanta that are distributed is Bq, the overall running time complexity of GreedyRel is O(NM+BMq log N).
Finally, exemplary embodiments of the GreedyRel algorithm naturally extend to multiple dimensions with a modest increase of D×2^D in their running time complexity. These extensions, along with the extensions of PODP to multiple dimensions, are described below. Because the number of non-zero coefficient values in multi-dimensional data sets may be significantly larger than the number of tuples, a thresholding step limits the space needed by the algorithm. This thresholding step can, of course, also be used in the one-dimensional case to further reduce the running time and space requirements of the exemplary embodiment of the GreedyRel algorithm. This step can be performed without introducing any reconstruction bias. Table 2 contains a synopsis of the running time and space complexities of the exemplary embodiment of the GreedyRel algorithm and the MinRelVar algorithm, where Nz denotes the number of error-tree nodes containing at least one non-zero coefficient value and maxD denotes the maximum domain size among all dimensions.
Flow Charts of Exemplary Embodiments
have been checked at 410. If so, the list of calculated results R[i,b] is returned at 404. Otherwise, the next eligible combination is checked at 412. It is determined whether all space allotments
to the left subtree have been checked at 414. If not, the next eligible allotment bL to left subtree is considered at 416 and leftList and rightList are computed, i.e., leftList=ComputeR[2i,bL] and
at 418. Then, at 420, it is determined whether all pairs in leftList, rightList have been checked. If so, then the list of calculated results R[i,b] is returned at 404. Otherwise, the next eligible pair is considered and the potential solution R[i,b] is computed at 422. Then, it is determined whether the new solution is dominated by other calculated solutions. If so, then control flows back to 420. Otherwise, the solution is stored in R[i,b] and any stored dominated solutions are removed at 426.
After 608, another loop at 614 is performed while spaceLeft>0; if OccupiedSpace≦0 at 616, the array y of retention probabilities is returned at 624. Inside the loop at 618, the measure j∈{1, . . . , M} is found with a maximum value of G[0,j]. Then, OccupiedSpace=traverse(0,j,q,y,spaceLeft) at 620. (See
At 706, it is determined whether chij=2. If not, control flows to 720, where index k of the subtree that determines the value of G[i,j] is found and l denotes the index of the other subtree. Otherwise, at 722, index k of the subtree that determines the value of G[i,j] is found and, at 724, allocatedSpace=traverse(k,j,q,y,spaceLeft); then, control flows to 712.
After 720, where index k of the subtree that determines the value of G[i,j] is found and l denotes the index of the other subtree, allocatedSpace=traverse(k,j,q,y,spaceLeft) at 726. Then, it is determined whether spaceLeft>allocatedSpace at 728. If not, control flows to 712. Otherwise, allocatedSpace=allocatedSpace+traverse(l,j,q,y,spaceLeft−allocatedSpace) at 730 and, then, control flows to 712.
At 712, G[i,j], Gpot[i,j], potSpace[i,j] and chij are recomputed, and allocatedSpace is returned at 714.
Experimental Study
An extensive experimental study was conducted of exemplary embodiments of algorithms for constructing probabilistic synopses over data sets with multiple measures. One objective in the study was to evaluate both the scalability and the obtained accuracy of the exemplary embodiment of the GreedyRel algorithm for a large variety of both real-life and synthetic data sets containing multiple measures.
The study demonstrated that an exemplary embodiment of the GreedyRel algorithm is a highly scalable solution that provides near-optimal results and improved accuracy for individual reconstructed answers. This exemplary embodiment of the GreedyRel algorithm provided a fast and highly-scalable solution for constructing probabilistic synopses over large multi-measure data sets. Unlike earlier schemes, such as PODP, this GreedyRel algorithm scales linearly with the domain size, making it a viable solution for large real-life data sets. This GreedyRel algorithm consistently provided near-optimal solutions when compared to PODP, demonstrating that it constitutes an effective technique for constructing accurate probabilistic synopses over large multi-measure data sets. Compared to earlier approaches that operate on each measure individually, this GreedyRel algorithm significantly reduced the maximum relative error of the approximation and, thus, was able to offer significantly tighter error guarantees. These improvements were typically a factor of two but, in many cases, maximum relative errors up to 7 times smaller were observed.
Experimental Techniques and Parameter Settings
The experimental study compared an exemplary embodiment of the GreedyRel algorithm and a PODP algorithm for constructing probabilistic data synopses over multi-measure data sets, along with a technique called IndDP that partitioned the available space equally over the measures and, then, operated on each measure individually by utilizing a dynamic programming MinRelVar algorithm. To provide a more fair comparison to the IndDP algorithm, the majority of the experiments included data sets where all the measured quantities exhibited similar characteristics, thus yielding a uniform partitioning of the synopsis space over all the measures as the appropriate space allocation technique. Experiments were also performed with a GreedyL2 algorithm, which is designed to minimize the average sum squared error in multi-measure data sets. However, the GreedyL2 algorithm consistently exhibited significantly larger errors than the exemplary embodiments. A parameter in the exemplary embodiments of the algorithms was the quantization parameter q, which was assigned a value of 10 for the GreedyRel and IndDP algorithms and a smaller value of 4 for the PODP algorithm to reduce its running time. Moreover, the sanity bound of each measure was set to the 5%-quantile value for the measure's data values.
Experimental Data Sets
Several one-dimensional synthetic multi-measure data sets were used in experiments. A Zipfian data generator was used to produce Zipfian distributions of various skews, such as a low skew of about 0.5 to a high skew of about 1.5, with the sum of values for each measure set at about 200,000. Each Zipfian distribution was assigned one of three possible shapes: NoPerm, Normal, or PipeOrgan. NoPerm was the typical Zipfian distribution, where smaller domain values were assigned higher values for the measured quantities. Normal resembled a bell-shaped normal distribution, with higher (lower) values at the center (endpoints) of the domain. PipeOrgan assigned higher (lower) data values to the endpoints (middle) of the domain. In all cases, the centers of the M distributions were shifted and placed in random points of the domain. Also considered were several different combinations of used Zipfian distributions. In an AllNoPerm combination, all M of the Zipfian distributions had the NoPerm shape. Similarly, in an AllNormal combination, all M of the Zipfian distributions had the Normal shape. Finally, in a Mixed combination, ⅓ of the M distributions had the NoPerm shape, ⅓ had the Normal shape, and the remaining ⅓ had the PipeOrgan shape. The results presented are indicative of the multiple possible combinations of the parameters.
In the experimental study, a real-life data set was also used. A weather data set contained meteorological measurements obtained by a station at the University of Washington. This was a one-dimensional data set for which the following six quantities were extracted: wind speed, wind peak, solar irradiance, relative humidity, air temperature, and dewpoint temperature.
Approximation Error Metric
In all cases, the study focused on the maximum relative error of the approximation, because it provided guaranteed error-bounds for the reconstruction of any individual data value and was the error metric that the exemplary embodiments of the algorithms tried to minimize.
Comparing PODP and GreedyRel
The accuracy and running time of the exemplary embodiment of the GreedyRel algorithm was compared to the PODP algorithm. In
Running Time Comparison of GreedyRel and IndDP
Accuracy Comparison of GreedyRel and IndDP in Synthetic Data Sets.
For the synthetic data sets, a domain size of 256 was used. The obtained accuracy in terms of the maximum error of the approximation for the GreedyRel and the IndDP algorithms and six representative combinations of synthetic data sets is presented. These six combinations arise from considering Zipfian distributions with skew 0.6 and 1, along with the three possible combinations of the used Zipfian distributions (i.e., AllNoPerm, AllNormal, and Mixed). The synthetic data sets in this section contain six measures/distributions.
Consider the six possible combinations arising from distributions having skew equal to 1. In
Similar results were also observed for the six combinations of synthetic data sets, arising from setting the skew of the distributions to 0.6. In
Accuracy Comparison of GreedyRel and IndDP in Real Data Sets
In
In summary, exemplary embodiments of effective techniques for building wavelet synopses over multi-measure data sets are provided. These techniques seek to minimize, given a storage constraint, the maximum relative error of reconstructing any data value among all measures. The difficulty of the problem compared to the single-measure case is demonstrated and a partial-order dynamic programming (PODP) solution is provided. Given the high time and space complexities of PODP, a fast and scalable approximate algorithm is provided that greedily allocates synopsis space based on the idea of marginal error gains. Experimental evaluation demonstrated that the GreedyRel algorithm provided near-optimal solutions while, at the same time, outperforming prior techniques based on optimizing each measure independently. GreedyRel is a viable solution, even in the single-measure case, for constructing accurate probabilistic wavelet synopses over large data sets.
Pseudo Code for GreedyRel Algorithm
Table 1 below shows pseudo code for an exemplary embodiment of the GreedyRel algorithm. In the later steps of this algorithm, the available synopsis space may become smaller than potSpace[i, jmax]; in this case, rather than recursing on both child subtrees of a node (when chijmax=3), this algorithm first recurses on the child causing the maximum estimated squared NSE, and then recurses on the other child with any remaining space (steps 12-16 of traverse).
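Because Table 1 itself is not reproduced here, the following sketch approximates the traverse recursion from the description above; the state object, its helpers (max_child, other_child, recompute), and the dictionary layout are illustrative assumptions, not the original pseudo code.

```python
def traverse(i, j, q, y, space_left, state):
    """Top-down allocation along potSet[i, j]; returns the space actually allocated.

    state is assumed to hold ch, pot_space and helpers for locating child subtrees;
    state.recompute(i) is assumed to refresh G, Gpot, potSpace and ch (for all
    measures) at node i, as described in the text."""
    if state.ch[(i, j)] == 1:                      # give this coefficient value 1/q more space
        y[(i, j)] += 1.0 / q
        allocated = state.pot_space[(i, j)]
    elif state.ch[(i, j)] == 2:                    # only the child determining G[i, j]
        k = state.max_child(i, j)
        allocated = traverse(k, j, q, y, space_left, state)
    else:                                          # ch == 3: both child subtrees
        k = state.max_child(i, j)                  # child with the maximum estimated NSE^2 first
        other = state.other_child(i, k)
        allocated = traverse(k, j, q, y, space_left, state)
        if space_left > allocated:                 # then the other child, with whatever remains
            allocated += traverse(other, j, q, y, space_left - allocated, state)
    state.recompute(i)                             # storage dependencies: refresh all measures at i
    return allocated
```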
Extensions to Multi-Dimensional Wavelets
Exemplary embodiments of the present invention extend to multi-dimensional data. For a D-dimensional data set, the error-tree structure becomes significantly more complex. Each node in the error tree (besides the root node) corresponds to a set of (at most) 2^D−1 wavelet coefficients with the same support region, but different signed contributions for each region quadrant. Furthermore, each error-tree node i (besides the root node) may have up to 2^D children, corresponding to the quadrants of the common support region for all coefficients in i.
Extending PODP
Exemplary embodiments of the PODP algorithm for multi-dimensional data sets generalize the corresponding multi-dimensional MinRelVar strategy in a way analogous to the one-dimensional case. PODP needs to consider, at each internal node of the error tree, the optimal allocation of space to the ≦2^D−1 wavelet coefficients of the node and its ≦2^D child subtrees. The extension of PODP to multi-dimensional data sets is therefore a fairly simple adaptation of the multi-dimensional MinRelVar algorithm. However, PODP needs to maintain, for each node i and each possible space allotment B, a collection R[i,B] of incomparable solutions. This, once again, makes the time/space requirements of PODP significantly higher than those of MinRelVar.
Extending GreedyRel
The first modification involved in extending an exemplary embodiment of the GreedyRel algorithm to multi-dimensional data sets has to do with the computation of G[i,j], which now involves examining the estimated NSE2 values over the ≦2^D child subtrees and maintaining the maximum such estimate. Let S(i) denote the set of the ≦2^D−1 coefficients of node i and let i1, . . . , ip be the indexes of i's child nodes in the error tree. Then,

G[i,j] = maxk=1, . . . ,p { G[ik,j] + Σc∈S(i) Var(cj,ycj)/Norm(ik,j) },

where cj and ycj denote, respectively, the measure-j value and retention probability of coefficient c.
Another modification involves the estimation of marginal error gains at each node. A total of three possible choices for forming potSet[i,j] for each (node, measure) combination were described above. Each node now has up to 2^D child subtrees, resulting in a total of 2^D+1 possible choices of forming potSet[i,j]. The first choice is to increase the retention probability for measure j of one of the ≦2^D−1 coefficients in node i. In this case, include in potSet[i,j] the coefficient in node i that is expected to exhibit the largest marginal gain for measure j. For each of the remaining 2^D possible choices of forming potSet[i,j], the kth choice (1≦k≦2^D) considers the marginal gain of increasing the retention probabilities in the child subtrees through which the k maximum NSE2 values occur, as estimated in the right-hand side of the above equation for G[i,j]. At each node, the computation of Gpot[i,j], potSpace[i,j], and chij incurs a worst-case time cost of O(D×2^D) due to the possible ways of forming potSet[i,j] and the sorting operation of 2^D quantities. Let N denote the total number of cells in the multi-dimensional data array and maxD denote the maximum domain size of any dimension. Then, the running time complexity of GreedyRel becomes O(D×2^D×(NM+BMq log maxD)). Of course, in most real-life scenarios using wavelet-based data reduction, the number of dimensions is typically a small constant (e.g., 4-6), while the number of cells N can be exponential (O(maxD^D)) in the number of dimensions.
Improving the Complexity of GreedyRel
In the wavelet decomposition process of a multi-dimensional data set, the number of non-zero coefficients produced may be significantly larger than the number Nz of non-zero data values. One adaptive coefficient thresholding procedure retains at most Nz wavelet coefficients without introducing any reconstruction bias. Using this procedure, the MinRelVar algorithm can be modified so that its running time and space complexity have a dependency on Nz and not on N (i.e., the total number of cells in the multi-dimensional data array). It is thus desirable to modify the GreedyRel algorithm in a similar way, in order to decrease its running time and space requirements.
Let Nz denote the number of error-tree nodes that contain non-zero coefficient values, possibly after the afore-mentioned thresholding process. For any node in the error tree that contains only zero coefficient values and has at most one node with non-zero coefficient values in its subtree, no computation is needed. Equivalently, exemplary embodiments of the algorithm compute G, Gpot values in: nodes containing non-zero coefficient values, or nodes that contain only zero coefficient values but are the least common ancestor of at least two non-zero tree nodes beneath them in the error tree.
Let k be a node that is the only node in its subtree with non-zero coefficient values. The G, Gpot values in the descendant nodes of k do not need to be considered, because they will be zero. An observation is that for any ancestor of k that contains just a single non-zero error-tree node beneath it (which is certainly within the subtree of node k), no computation is necessary, because the G, Gpot values of k can be used instead. An additional computation is needed in any node n with zero coefficient values that has at least two non-zero error-tree nodes beneath it (in different subtrees). In this case, the G, Gpot values of node n need to be calculated, using as input the G, Gpot values of its non-zero descendant tree nodes. It is easy to demonstrate that at most Nz−1 such nodes may exist. Thus, the GreedyRel algorithm needs to calculate the G, Gpot values in at most O(2Nz−1)=O(Nz) nodes, thus yielding running time and space complexities of O(D×2^D×(NzM+BMq log maxD)) and O(NzM), respectively. In order to implement the algorithm as described, the Nz coefficients need to be sorted based on their postorder numbering in the error tree. This requires an additional O(Nz log Nz) time for the sorting process. However, this cost is often significantly smaller than the benefits of having running time and space dependencies based on Nz, rather than on N.
The processor 2130 cooperates with conventional support circuitry such as power supplies, clock circuits, cache memory and the like as well as circuits that assist in executing the software routines stored in the memory 2140. As such, it is contemplated that some of the steps discussed herein as software methods may be implemented within hardware, for example, as circuitry that cooperates with the processor 2130 to perform various method steps. The computer 2100 also contains input/output (I/O) circuitry that forms an interface between the various functional elements communicating with the computer 2100.
Although the computer 2100 is depicted as a general purpose computer that is programmed to perform various functions in accordance with the present invention, the invention can be implemented in hardware as, for example, an application specific integrated circuit (ASIC) or field programmable gate array (FPGA). As such, the process steps described herein are intended to be broadly interpreted as being equivalently performed by software, hardware, or a combination thereof.
The present invention may be implemented as a computer program product wherein computer instructions, when processed by a computer, adapt the operation of the computer such that the methods and/or techniques of the present invention are invoked or otherwise provided. Instructions for invoking the inventive methods may be stored in fixed or removable media, transmitted via a data stream in a broadcast media or other signal bearing medium, and/or stored within a working memory within a computing device operating according to the instructions.
While the foregoing is directed to various embodiments of the present invention, other and further embodiments of the invention may be devised without departing from the basic scope thereof. As such, the appropriate scope of the invention is to be determined according to the claims, which follow.