A database is a collection of information. A relational database is a database that is perceived by its users as a collection of tables. Each table arranges items and attributes of the items in rows and columns respectively. Each table row corresponds to an item (also referred to as a record or tuple), and each table column corresponds to an attribute of the item (referred to as a field, an attribute type, or field type). To retrieve information from a database, the user of a database system constructs a query. A query contains one or more operations that specify information to retrieve from the database. The system scans tables in the database to execute the query.
A database system can optimize a query by arranging the order of query operations. In databases, summaries or synopses are used because it is unreasonable to store or scan the entire table during optimization. Query optimizers may use summaries to get quick, albeit approximate, cardinality estimates of certain columns to generate query execution plans based on a cost model. Data summaries that are efficient to produce and that adequately represent the initial data provide for effective optimization.
Features and advantages of examples of systems, methods and devices will become apparent by reference to the following detailed description and drawings.
Many applications, including digital image processing and database searching, for example, require access to large quantities of data. Because of the large data quantities involved, these applications may employ some form of data compression so as to reduce the necessary storage and bandwidth requirements for associated hardware, and allow data extraction for editing, processing, and targeting of particular devices and operations. Many compression/decompression mechanisms are “lossy,” meaning the decompressed data are not exactly the same as the initial data before compression. However, lossy compression/decompression mechanisms are able to achieve a much greater compression ratio than conventional lossless methods. Furthermore, the loss of data may have little effect on the specific application.
Compression mechanisms may use an encoding technique known as a wavelet transformation. Wavelets are mathematical functions that parse data into different frequency components and then compare each component with a resolution matched to its scale. Wavelets are better suited than traditional Fourier (e.g., discrete cosine transform or DCT) methods in analyzing physical situations where the signal contains discontinuities and sharp spikes, as often is the case in image processing. Wavelets also are useful when analyzing large quantities of structured and unstructured data, as might exist in a large database.
The basis functions of the wavelet transforms are small waves or wavelets developed to match signal discontinuities. The wavelet basis function of a wavelet has finite support of a specific width. Wider wavelets examine larger regions of the signal and resolve low frequency details accurately, while narrower wavelets examine a small region of the signal and resolve spatial details accurately. Wavelet-based compression has the potential for satisfactory compression ratios while being computationally efficient.
The wavelet transform uses short waves that start and stop with different time and spatial resolutions, while the DCT or the fast Fourier transform (FFT) have a set of fixed and well defined basis functions over an infinite support. For a specified transform, some wavelet mechanisms do not use a specific formula for the basis function but rather a set of mathematical requirements such as a smoothness condition and vanishing moment condition. Thus, a first step in wavelet analysis may be to determine and designate wavelet bases to use, and that determination may depend on the specific application. Designing and finding such wavelet bases to be designated for use is not a trivial task because it is mathematically involved. Fortunately, numerous researchers have designed a number of specific wavelet basis functions, with Daubechies wavelets and Haar wavelets being two examples.
Database search applications may use data compression mechanisms to optimize search results, contend with limited bandwidth, and reduce the need for data storage, which in turn may reduce costs associated with the database searches. For example, with an ever increasing need for business intelligence to support critical decision-making, and for compliance with ever-increasing legal regimes as exemplified by the Sarbanes-Oxley legislation, it may be useful for companies to store and access (query) large amounts of structured and unstructured data. Data compression techniques may be used to improve the efficiency of information management with regard to storage, transmission, and analysis. Real-time business intelligence may put a heavy burden on the query optimization in database systems.
The algorithms used to implement compression in database search applications may be implemented in both software and hardware.
Generally, data processing system 10 may provide for compression of initial dataset 20. An example of a method for compressing the initial dataset is illustrated in the flow chart of
Other structures are possible with the database system 40. For example, computing platform 42 may operate autonomously of any direct human intervention to make queries 48 of the database 44. In other words, a suitably programmed processor may execute routine or periodic queries, or may execute episodic queries based on the occurrence of pre-specified events. In either case, the end result is a query 48 that the computing platform 42 presents to components of the database search system 46.
The database search system 46 may include a query optimizer 50, a compression module 52, and an execution engine 54. The compression module 52 is described in detail with reference to
The compression module 52 may use wavelet-based compression techniques combined with selectable thresholding (i.e., thresholding based on a pre-determined level of accuracy) to enable the query optimizer 50 to construct an efficient query plan given specific inputs (the query 48) from the computing platform 42. The optimizer 50 may sample data in various columns or tables of the database 44 and then use the compression module 52 to perform a wavelet transform and threshold operation to determine an efficient query plan.
In
The wavelet transform may use a wavelet prototype function, often called a mother wavelet, and denoted herein by Ψ. To complete the specification of a wavelet, another term, called a father wavelet, or scaling function, and denoted herein by φ, may be used. The mother wavelet may be obtained by scaling and translating the father wavelet. Temporal analysis may be performed with a contracted, high-frequency version of the mother wavelet, while frequency analysis may be performed using a dilated, low-frequency version of the same mother wavelet. Because the initial signal or function can be represented in terms of a wavelet expansion, using coefficients in a linear combination of the wavelet function, data operations can be performed using just the corresponding wavelet coefficients. Thus, a wavelet expansion of a function may include a set of basis elements that are obtained through a series of dilation and translation operations. The resulting vectors may be orthogonal to one another over a compact support.
Dilating and translating the mother wavelet may result in a sequence of nested vector subspaces leading to multi-resolution analysis (MRA). Multi-resolution analysis may involve decomposing data at various levels known as resolutions. The data at a certain resolution are composed by combining their “cousins” at a higher level of resolution. In other words, the wavelet coefficients may be obtained at various levels, approximating the data with greater and greater precision as resolution increases. Conversely, as the resolution decreases, the result may be smoothed versions of the signal, or data.
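The multi-resolution idea can be illustrated with a minimal sketch. It assumes the Haar basis with orthonormal scaling and a signal whose length is a power of two; the text does not prescribe either choice, and the function names `haar_step`, `haar_decompose`, and `haar_reconstruct` are hypothetical.

```python
import math

def haar_step(values):
    """One level of the orthonormal Haar transform: scaled pairwise sums and differences."""
    s = math.sqrt(2.0)
    approx = [(values[i] + values[i + 1]) / s for i in range(0, len(values), 2)]
    detail = [(values[i] - values[i + 1]) / s for i in range(0, len(values), 2)]
    return approx, detail

def haar_decompose(signal):
    """Return (coarsest approximation, detail bands ordered from coarsest to finest)."""
    n = len(signal)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    approx, details = list(signal), []
    while len(approx) > 1:
        approx, detail = haar_step(approx)
        details.insert(0, detail)
    return approx, details

def haar_reconstruct(approx, details, levels=None):
    """Rebuild the signal from the `levels` coarsest detail bands (all bands by default)."""
    s = math.sqrt(2.0)
    current = list(approx)
    for idx, detail in enumerate(details):
        # Bands beyond the requested resolution are replaced by zeros.
        use = detail if (levels is None or idx < levels) else [0.0] * len(detail)
        nxt = []
        for a, d in zip(current, use):
            nxt.extend([(a + d) / s, (a - d) / s])
        current = nxt
    return current

signal = [4.0, 6.0, 10.0, 12.0, 8.0, 6.0, 5.0, 5.0]
approx, details = haar_decompose(signal)
coarse = haar_reconstruct(approx, details, levels=1)  # smoothed version of the signal
full = haar_reconstruct(approx, details)              # all resolutions: the signal itself
```

Using fewer detail bands yields a smoother approximation, and using all of them recovers the signal, which mirrors the description of precision increasing with resolution.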
Compression may begin after the selection of a suitable wavelet basis by a database architect in advance of processing any queries. The applicable wavelet bases may be selected from a list of available wavelet bases. The wavelet basis may be used to model the attribute distribution f(x), which can be described by the equation
where the set $\{\alpha, \beta_{jk}\}$ represents the set of wavelet coefficients. Applying the wavelet transform to a dataset $X = [x_1, x_2, x_3, \ldots, x_N]$ of size N, the wavelet coefficients can be generically represented by the set $\{c_i\}_{i=1}^{N}$, letting $m = \log_2(N)$. These coefficients may measure the degree of association between the attribute values and the wavelet basis. In many applications, the majority of wavelet coefficients will be negligible in magnitude and thus do not contribute to describing the data subjected to the wavelet transform. Therefore, these non-value-added coefficients can be discarded. The resulting set $\{c_i\}_{i=1}^{k}$, $k \ll N$, may represent the compressed set of wavelet coefficients. The process of discarding small magnitude coefficients to achieve the compression is known as thresholding.
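A conventional wavelet expansion consistent with the symbols used here (α for the scaling coefficient, β_jk for the detail coefficients, φ and Ψ for the father and mother wavelets) may be written as follows. The index ranges and the dilation convention, in which j=1 is the finest level, are assumptions rather than a quotation of the original equation.

```latex
f(x) = \alpha\,\varphi(x) + \sum_{j}\sum_{k} \beta_{jk}\,\Psi_{jk}(x),
\qquad
\Psi_{jk}(x) = 2^{-j/2}\,\Psi\!\left(2^{-j}x - k\right)
```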
Hard thresholding (HRT) may be applied to the dataset X=[x1, x2, x3, . . . , xN] of size N, and may select from the dataset X the “k” coefficients that are largest in magnitude. The choice of the k coefficients may be in relation to a threshold “λ” computed based on the standard deviation (σ) of the detail coefficients (the βs) at the highest resolution (j=1). The threshold λ may be an estimate of the magnitude of noise. An estimate of σ is given by
where $\mathrm{med}(c_i)$ is the median of the coefficients and $N_1$ denotes the number of coefficients at the finest level. The threshold may be given by $\lambda = \hat{\delta}\sqrt{2\log(N)/N}$, as described by Donoho and Johnstone in Ideal Spatial Adaptation by Wavelet Shrinkage, Stanford University, 1992. The coefficients ($c_i$) in absolute value that exceed λ may be retained and those that are less than λ may be set to zero. In other words, the hard thresholding may return coefficients $c_i\,\chi$, where $\chi$ is the indicator function for $|c_i| > \lambda$.
Soft thresholding (SFT) may be based on the idea of “wavelet shrinkage.” Similar to HRT, SFT sets to zero all those coefficients whose absolute values are smaller than λ, and then shrinks the coefficients that exceed λ in absolute value towards zero. The surviving coefficients may be given by $c_i^* = \operatorname{sgn}(c_i)(|c_i| - \lambda)$, where sgn( ) is the standard signum function. The term $c_i^*$ may be computed if $(|c_i| - \lambda)$ is positive; otherwise its value may be set to zero. When $\hat{\delta}$ is the standard deviation of the noise coefficients, $(|c_i| - \lambda)$ are the de-noised values of the remaining coefficients.
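The two thresholding rules can be sketched as below. The threshold follows the form quoted in the text, λ = (noise estimate)·sqrt(2 log(N)/N). The helper `mad_sigma` uses the common median-absolute-value estimate divided by 0.6745, which is an assumption standing in for the estimator described here; all function names are hypothetical. Coefficients with magnitude exactly equal to λ are zeroed in this sketch.

```python
import math
import statistics

def universal_threshold(coeffs, sigma_hat):
    """lambda = sigma_hat * sqrt(2*log(N)/N), with N the number of coefficients."""
    n = len(coeffs)
    return sigma_hat * math.sqrt(2.0 * math.log(n) / n)

def mad_sigma(finest_details):
    """Assumed noise estimate: median absolute value of finest details / 0.6745."""
    return statistics.median(abs(c) for c in finest_details) / 0.6745

def hard_threshold(coeffs, lam):
    """Keep coefficients with |c| > lam unchanged; set the rest to zero."""
    return [c if abs(c) > lam else 0.0 for c in coeffs]

def soft_threshold(coeffs, lam):
    """Zero coefficients with |c| <= lam; shrink survivors toward zero by lam."""
    return [math.copysign(abs(c) - lam, c) if abs(c) > lam else 0.0 for c in coeffs]

coeffs = [5.0, -3.0, 0.4, -0.2, 1.0, 0.1, -0.05, 0.3]
lam = 0.5  # illustrative value, chosen by hand rather than computed
hard = hard_threshold(coeffs, lam)  # large coefficients survive unchanged
soft = soft_threshold(coeffs, lam)  # survivors are pulled toward zero by lam
```

Hard thresholding preserves the magnitude of retained coefficients, while soft thresholding trades a small bias in every survivor for a smoother reconstruction.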
Both hard and soft thresholding may be conservative and therefore retain coefficients that do not contribute significantly to the energy in the data. This is less desirable for compression as more non-zero coefficients have to be stored for decompression later. Both hard and soft thresholding techniques may discard coefficients based on the threshold λ and produce near zero error in reconstruction. However, the threshold λ is a fixed value. Flexibility is desirable in some applications because space may be a more important factor than perfect reconstruction.
An energy-based thresholding (EBT) approach may use the cumulative energy (squares of coefficients) to capture information in the data. The graphing parameters may be given by the set
where “i” indexes the ordered wavelet coefficients c(1)≧ . . . ≧c(k), and
is the cumulative energy up to the coefficient c(k). A representative plot is given in
Turning to
The input module 80 also may receive a value of desired accuracy for the query 48, expressed as a percentage value ε. This percentage value ε, or accuracy, may be used to determine the degree of compression to apply to the dataset X, in a manner analogous to picking a point on the graph of
In one example, coefficient generator 82 may include a bootstrap sample generator 96 and a wavelet coefficient generator 98. In this example, let X=[x1, x2, . . . , xN] be the initial dataset or signal. Bootstrap sample generator 96 may obtain a series of B bootstrap samples of the signal, each of size n:
where n<N.
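Drawing the B bootstrap samples can be sketched as follows. The sketch assumes independent draws with replacement, which is the usual meaning of a bootstrap sample but is not spelled out in the text; the function name and the `seed` parameter are hypothetical.

```python
import random

def bootstrap_samples(signal, num_samples, sample_size, seed=None):
    """Return `num_samples` rows, each holding `sample_size` values drawn with replacement."""
    if not 0 < sample_size < len(signal):
        raise ValueError("sample_size must satisfy 0 < n < N")
    rng = random.Random(seed)  # local generator so results are repeatable per seed
    return [rng.choices(signal, k=sample_size) for _ in range(num_samples)]

signal = [4.0, 6.0, 10.0, 12.0, 8.0, 6.0, 5.0, 5.0]
samples = bootstrap_samples(signal, num_samples=5, sample_size=4, seed=42)  # B=5, n=4
```

Each row plays the role of one bootstrap sample of size n, so the result has B rows and n columns.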
From the bootstrap samples M1 of the signal, wavelet coefficient generator 98 may apply the wavelet transform to produce a series of wavelet coefficients:
In another example, coefficient generator 82 may include a wavelet coefficient generator 100 and a bootstrap sample generator 102. Wavelet coefficient generator 100 may apply the wavelet transform to initial dataset or signal X to obtain the coefficients: C=[C1, C2, . . . CN]. From the set of coefficients C, bootstrap sample generator 102 may obtain B bootstrap samples each of size n:
Following generation of the set M1 or M2 by coefficient generator 82, multiplier 84 may square each coefficient to obtain $c_{ij}^2$. Ranking module 86 may then order the squared coefficients in ascending order of magnitude, $c_{(i1)}^2, c_{(i2)}^2, \ldots, c_{(in)}^2$, for each row i. As explained above, these may represent the energies in the sampled signal.
For each row of coefficients $\{c_{(ij)}^2\}_{j=1}^{n}$, i=1, 2, . . . , B, cumulative distribution generator 88 may compute an empirical cumulative distribution given by:
where c generically denotes a wavelet coefficient.
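The squaring, ranking, and cumulative-distribution steps for a single row can be sketched as below. The sketch assumes the distribution is weighted by energy, so that each value is the fraction of total energy carried by squared coefficients up to that rank; a count-based empirical distribution would replace the energy sums with counts. The function names are hypothetical.

```python
def ranked_energies(coeffs):
    """Square each coefficient and sort ascending (the multiplier and ranking steps)."""
    return sorted(c * c for c in coeffs)

def cumulative_energy_fraction(energies):
    """For ascending energies, return the running share of total energy at each rank."""
    total = sum(energies)
    if total == 0:
        return [0.0] * len(energies)
    running, out = 0.0, []
    for e in energies:
        running += e
        out.append(running / total)
    return out

row = [3.0, -1.0, 0.5, -0.5, 2.0]   # one row of sampled coefficients
energies = ranked_energies(row)      # ascending squared coefficients
fractions = cumulative_energy_fraction(energies)
```

In this example the three smallest coefficients together carry only about a tenth of the energy, which is the kind of imbalance the thresholding step exploits.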
From the cumulative distribution, the quantile generator may determine a quantile for each row. For example, if the selected accuracy ε corresponds to a quantile value of 10%, the 10th quantile, Q(0.10)i, may be determined for row i. The 10th quantile, as an example, may identify the coefficients that contribute less than or equal to 10% of the energy. The 10% value is chosen as an illustration. The user can select the appropriate quantile based on prior knowledge or data behavior. Repeating the determination of the quantile for each bootstrap sample, i=1, 2, . . . , B, a set of quantiles, Q(0.10)1, Q(0.10)2, . . . , Q(0.10)B, may be obtained.
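A per-row quantile and its average over the bootstrap rows can be sketched as follows. The sketch reads the quantile as the largest squared coefficient for which the cumulative share of energy stays at or below the chosen level; this energy-weighted reading and the function names are assumptions.

```python
def energy_quantile(coeffs, q):
    """Largest squared coefficient whose cumulative energy share stays <= q (0.0 if none)."""
    energies = sorted(c * c for c in coeffs)
    total = sum(energies)
    running, quantile = 0.0, 0.0
    for e in energies:
        running += e
        if total == 0 or running / total > q:
            break
        quantile = e
    return quantile

def average_quantile(rows, q):
    """Arithmetic mean of the per-row quantiles over all bootstrap rows."""
    return sum(energy_quantile(r, q) for r in rows) / len(rows)

rows = [
    [3.0, -1.0, 0.5, -0.5, 2.0],
    [4.0, 1.0, -0.8, 0.5, 2.0],
]
per_row = [energy_quantile(r, 0.10) for r in rows]
threshold = average_quantile(rows, 0.10)
```

The returned values are on the squared (energy) scale, so a coefficient c would be compared against the threshold through c*c rather than through its magnitude.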
Coefficient selector 92 may then compute an average quantile: $\bar{Q}(0.10) = \frac{1}{B}\sum_{i=1}^{B} Q(0.10)_i$
The average quantile,
Finally, initial signal X may be reconstructed by decompression module 94 by applying an inverse transform to the coefficient subset c′ (Hard) or c* (Soft) to obtain an approximate vector X*.
A summary of the foregoing process is illustrated as a method 110 in
The coefficients in the B sets of wavelet coefficients may be squared in a step 122 and the B sets of squared coefficients may be ordered in a step 124. The cumulative distribution function for each set of squared coefficients may then be obtained in a step 126. The quantile of each set of squared coefficients may then be determined in a step 128, with the quantile corresponding to the received accuracy level. An average quantile may then be determined in a step 130. In a step 132, the average quantile may be used as a threshold to delete low-energy coefficients from a set of wavelet coefficients obtained from the initial dataset. Hard or Soft types of thresholding may be used. A representation of the initial dataset may be obtained in a step 134 by reconstructing the dataset using the compressed wavelet coefficients when queries are processed.
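The steps above can be sketched end to end. The sketch makes several assumptions that the text leaves open: the Haar basis with a power-of-two signal length, bootstrapping of the coefficients rather than of the signal, an energy-weighted quantile, and hard thresholding. The function names, the default B of 200, and the default sample size of N/2 are hypothetical.

```python
import math
import random

SQRT2 = math.sqrt(2.0)

def haar_forward(signal):
    """Orthonormal Haar transform: [approximation, coarsest detail, ..., finest details]."""
    approx, details = list(signal), []
    while len(approx) > 1:
        pairs = list(zip(approx[0::2], approx[1::2]))
        details = [(a - b) / SQRT2 for a, b in pairs] + details
        approx = [(a + b) / SQRT2 for a, b in pairs]
    return approx + details

def haar_inverse(coeffs):
    """Invert haar_forward by merging one detail band per level."""
    current, pos = coeffs[:1], 1
    while pos < len(coeffs):
        detail = coeffs[pos:pos + len(current)]
        pos += len(current)
        current = [v for a, d in zip(current, detail)
                   for v in ((a + d) / SQRT2, (a - d) / SQRT2)]
    return current

def energy_quantile(coeffs, q):
    """Largest squared coefficient whose cumulative energy share stays <= q."""
    energies = sorted(c * c for c in coeffs)
    total = sum(energies)
    running, quantile = 0.0, 0.0
    for e in energies:
        running += e
        if total == 0 or running / total > q:
            break
        quantile = e
    return quantile

def compress(signal, accuracy, num_samples=200, sample_size=None, seed=0):
    """Return (hard-thresholded coefficients, averaged energy threshold)."""
    coeffs = haar_forward(signal)
    n = sample_size or max(1, len(coeffs) // 2)
    rng = random.Random(seed)
    quantiles = [energy_quantile(rng.choices(coeffs, k=n), accuracy)
                 for _ in range(num_samples)]
    threshold = sum(quantiles) / num_samples
    kept = [c if c * c > threshold else 0.0 for c in coeffs]
    return kept, threshold

signal = [float(v) for v in
          [10, 10, 10, 10, 11, 10, 10, 10, 50, 50, 50, 50, 50, 51, 50, 50]]
kept, threshold = compress(signal, accuracy=0.01)
approx_signal = haar_inverse(kept)  # representation rebuilt from retained coefficients
```

Because the transform in this sketch is orthonormal, the squared reconstruction error equals the energy of the discarded coefficients, so the accuracy level directly bounds what the threshold is allowed to remove from each bootstrap row.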
The quality of the compression procedure may be determined by computing a mean square error (MSE)/distortion metric denoted by D and given by: $D = \frac{1}{N}(X - X^*)^T (X - X^*)$
where the symbol T denotes the transpose operation.
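The distortion metric can be computed with a few lines. The sketch assumes the inner product of the error vector with itself is divided by the vector length N, as the term mean square error suggests; the function name is hypothetical.

```python
def mse_distortion(original, reconstructed):
    """Mean of squared element-wise errors between X and its approximation X*."""
    if len(original) != len(reconstructed):
        raise ValueError("vectors must have the same length")
    return sum((x - y) ** 2 for x, y in zip(original, reconstructed)) / len(original)

x = [1.0, 2.0, 3.0, 4.0]
x_star = [1.0, 2.0, 2.0, 4.0]   # one element is off by 1
d = mse_distortion(x, x_star)
```

A value of zero indicates perfect reconstruction, and larger values indicate that more of the signal was lost to thresholding.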
In summary, quantile-based wavelet compression provides a reliable framework for accurately characterizing, compressing, and reconstructing the initial data vector X.
Besides using the bootstrap compression process for optimizing queries on a database, the data processing systems described may be used for datasets or signals occurring in other applications. For example, they may be used in a digital camera for in-camera image processing and storage, and for subsequent image transmission, if desired. As applied to a digital camera, the data input module 62 may provide an input data file X, and may include an analog signal capture mechanism and an analog-to-digital converter. A preprocessing module 64 may convert a digitized RGB signal into another format, such as Y, Cb, Cr (luminance and color difference signals). Wavelet transformation, compression, and decompression module 66 may apply a wavelet transform basis function to the Y, Cb, Cr data to produce a set of transform coefficients, and discard certain of the coefficients, thereby achieving a measure of compression. The retained coefficients may then be used to reconstruct an output data file X* (e.g., a thumbnail image), as well as a data file Y which may be an interpolated version of the output data file X*, which may be used to produce a color image that is viewable by a human user. Control data module 68 may be used to select the degree of compression used by the module 66. The application 70 may then simply be an in-camera memory that stores compressed images (e.g., the data file X*). The data output module 72 may be used to display the thumbnail image (data file X*) on, for example, an in-camera LCD display. Data output module 74 may be used to transmit the data file Y, such as when downloading file Y to a computer.