The present invention is related to the field of relational database technology. OLAP technology is commonly attributed with the ability to provide analysts with rapid access to summary, aggregated data views of a single large multi-dimensional database, and is recognized for its ability to provide knowledge representation and discovery in high-dimensional relational databases. OLAP tools can provide intuitive and graphical access to the massively complex set of possible summary views available in large relational structured data repositories. However, the ability to handle such data complexity also presents a wide-ranging, combinatorially vast space of options that can seem impossible to comprehend and/or analyze. Accordingly, there is a need for knowledge discovery techniques that guide users' knowledge discovery tasks and that assist in finding relevant patterns, trends, and anomalies.
Embodiments of the present invention address the challenge of navigating a combinatorially vast space of data views of a multi-dimensional database by casting the space of data views as a combinatorial object comprising all projections and subsets, and by casting the discovery of analyst-significant data views as a search process over that object. Statistical information-theoretic measures are provided with the object and are sufficient to support a combinatorial optimization process. Accordingly, users can be guided, or taken automatically, across a permutation of the dimensions by searching for successive data views having two or more dimensions.
As used herein, a multi-dimensional database comprises a plurality of records with dimensions and is stored on a memory device. An exemplary multi-dimensional database is an online analytical processing (OLAP) database. A data view can refer to a subset of dimensions and data records from a multi-dimensional database and can represent a portion of the database that is significant to an analyst. In some embodiments, the data view comprises at most two dimensions because analysts typically experience difficulty comprehending additional dimensions.
In a particular embodiment of the present invention, the method for discovering portions of a multi-dimensional database that are significant to an analyst is computer-implemented and includes specifying a data view having at least two dimensions and all records of the database. A plurality of operation iterations are then performed on the data view, wherein each iteration is a chain operation, a hop operation, or an anti-hop operation. The operation iterations cease upon satisfaction of a termination criterion. Examples of the termination criterion can include, but are not limited to, a command from an analyst, a uniform distribution of all remaining records across all remaining dimensions, a lack of remaining dimensions, or a lack of remaining records. The resulting data view can then be presented to an analyst.
A chain operation can comprise calculating a chain statistical significance measure for each value of each of the dimensions in the data view, selecting one or more chain values for a dimension in the view, adding the chain values to a filter, and removing the dimension of the chain values from the view. Exemplary chain statistical significance measures can include, but are not limited to, Hellinger distance, Hellinger distance augmented by p-value significance, relative entropy, and generalized alpha divergence. In some embodiments, the selecting of one or more chain values occurs automatically based on the values having maximal chain statistical significance measures.
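The chain operation described above can be sketched as follows. This is a minimal Python illustration, not part of the claimed method: records are assumed to be dictionaries keyed by dimension name, the Hellinger distance uses the 1/√2 normalization, and all function names are hypothetical.

```python
import math

def hellinger(p, q):
    """Hellinger distance between two discrete distributions (dicts of probabilities)."""
    keys = set(p) | set(q)
    return math.sqrt(0.5 * sum((math.sqrt(p.get(k, 0.0)) - math.sqrt(q.get(k, 0.0))) ** 2
                               for k in keys))

def chain_select(records, view_dims):
    """For each value of each dimension in the view, compare the conditional
    distribution over the remaining view dimensions against the marginal, and
    return the (distance, dimension, value) whose Hellinger distance is maximal.
    The selected value would be added to the filter and its dimension removed."""
    best = None
    for dim in view_dims:
        others = [d for d in view_dims if d != dim]
        # marginal distribution over the other view dimensions
        marg = {}
        for r in records:
            key = tuple(r[d] for d in others)
            marg[key] = marg.get(key, 0) + 1
        total = sum(marg.values())
        marg = {k: v / total for k, v in marg.items()}
        # conditional distribution for each value of `dim`
        for val in {r[dim] for r in records}:
            sub = [r for r in records if r[dim] == val]
            cond = {}
            for r in sub:
                key = tuple(r[d] for d in others)
                cond[key] = cond.get(key, 0) + 1
            n = sum(cond.values())
            cond = {k: v / n for k, v in cond.items()}
            g = hellinger(cond, marg)
            if best is None or g > best[0]:
                best = (g, dim, val)
    return best
```

In an analyst-guided embodiment, the ranked distances would instead be presented for manual selection rather than taking the automatic maximum.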
A hop operation can comprise calculating a hop statistical significance measure, relative to the dimensions in the view and constrained by the filter, for each of the dimensions that is neither in the data view nor in the filter. The hop operation can further comprise selecting a hop dimension from the dimensions that are not in the view or in the filter and adding the hop dimension to the data view. Exemplary hop statistical significance measures can include, but are not limited to, conditional entropy and model likelihood metric. In some embodiments, the selecting of a hop dimension occurs automatically based on the dimensions having minimal hop statistical significance measures.
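The automatic hop selection can be sketched as follows, a minimal illustration under the assumption that records are dictionaries keyed by dimension name; the helper names are hypothetical and the filter constraint is assumed to have already been applied to the record list.

```python
import math
from collections import Counter

def conditional_entropy(records, target_dim, given_dim):
    """H(target | given) in bits, estimated from record counts."""
    n = len(records)
    joint = Counter((r[given_dim], r[target_dim]) for r in records)
    given = Counter(r[given_dim] for r in records)
    h = 0.0
    for (g, t), c in joint.items():
        p_joint = c / n          # joint probability of (given, target)
        p_cond = c / given[g]    # conditional probability of target given g
        h -= p_joint * math.log2(p_cond)
    return h

def hop_select(records, kept_dim, candidate_dims):
    """Pick the candidate dimension with minimal conditional entropy against
    the dimension retained in the view (i.e., the most constrained one)."""
    return min(candidate_dims,
               key=lambda d: conditional_entropy(records, kept_dim, d))
```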
An anti-hop operation can comprise calculating an anti-hop statistical significance measure, relative to other dimensions in the view and constrained by the filter, for each of the dimensions in the view. Exemplary anti-hop statistical significance measures can include, but are not limited to, relative entropy. The anti-hop operation can further comprise selecting an anti-hop dimension from the dimensions in the view and removing the anti-hop dimension from the view. In some embodiments, the selecting of an anti-hop dimension occurs automatically based on maximal relative entropy.
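One plausible reading of the anti-hop measure is sketched below: for each dimension in the view, compare its distribution under the current filter with its unfiltered marginal, and drop the dimension where the relative entropy is maximal. This Python sketch is illustrative only; the specification does not fix the reference distribution, so that choice is an assumption here.

```python
import math
from collections import Counter

def relative_entropy(p, q):
    """Kullback-Leibler divergence D(p || q) in bits; assumes q > 0 wherever p > 0."""
    return sum(pv * math.log2(pv / q[k]) for k, pv in p.items() if pv > 0)

def dim_distribution(records, dim):
    """Empirical distribution of a single dimension's values."""
    c = Counter(r[dim] for r in records)
    n = sum(c.values())
    return {k: v / n for k, v in c.items()}

def antihop_select(all_records, filtered_records, view_dims):
    """Return the view dimension whose filtered distribution diverges most
    from its unfiltered marginal (an assumed reading of the anti-hop measure)."""
    return max(view_dims,
               key=lambda d: relative_entropy(dim_distribution(filtered_records, d),
                                              dim_distribution(all_records, d)))
```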
In a preferred embodiment, a hop operation and a chain operation are performed in alternating order.
Embodiments of the present invention can be utilized at various degrees of automation for the analyst user. For example, in some embodiments, the data view can be initially populated with dimensions arbitrarily rather than relying on an analyst to specify the initial dimensions. Similarly, prior to performing the plurality of operation iterations, an empty filter can be created and arbitrarily populated with values for a dimension. In another example, while the chain, hop, and anti-hop operations can proceed substantially automatically as described above, the selection of one or more chain values, the selection of a hop dimension, or the selection of an anti-hop dimension can occur manually based on input from an analyst. When the selections are manual, the chain, hop, and/or anti-hop statistical significance measures can be considered by the analyst or they can be disregarded in favor of the analyst's knowledge or preference.
An analyst guided approach can involve the present invention presenting suggested options, which the analyst can accept or override with manual selections.
The purpose of the foregoing abstract is to enable the United States Patent and Trademark Office and the public generally, especially the scientists, engineers, and practitioners in the art who are not familiar with patent or legal terms or phraseology, to determine quickly from a cursory inspection the nature and essence of the technical disclosure of the application. The abstract is neither intended to define the invention of the application, which is measured by the claims, nor is it intended to be limiting as to the scope of the invention in any way.
Various advantages and novel features of the present invention are described herein and will become further readily apparent to those skilled in this art from the following detailed description. In the preceding and following descriptions, the various embodiments, including the preferred embodiments, have been shown and described. Included herein is a description of the best mode contemplated for carrying out the invention. As will be realized, the invention is capable of modification in various respects without departing from the invention. Accordingly, the drawings and description of the preferred embodiments set forth hereafter are to be regarded as illustrative in nature, and not as restrictive.
Embodiments of the invention are described below with reference to the following accompanying drawings.
a is a plot showing Hellinger distances of rows and columns against their marginals.
b is a plot showing relative entropy of months against each other significant dimension, given the RPM role=ECCF.
The following description includes the preferred best mode of one embodiment of the present invention. It will be clear from this description of the invention that the invention is not limited to these illustrated embodiments but that the invention also includes a variety of modifications and embodiments thereto. Therefore, the present description should be seen as illustrative and not limiting. While the invention is susceptible of various modifications and alternative constructions, it should be understood that there is no intention to limit the invention to the specific form disclosed, but, on the contrary, the invention is to cover all modifications, alternative constructions, and equivalents falling within the spirit and scope of the invention as defined in the claims.
The following description of the present invention uses a mathematical formalism that is similar to the mathematical tools required to analyze OLAP databases, but is different in a number of ways as well. For example, projections, I, on dimensions and restrictions, J, on records are combined into a lattice-theoretical object called a view, D_{I,J}. Furthermore, OLAP concerns databases organized around collections of variables which can be distinguished as: dimensions, which have a hierarchical structure, and whose Cartesian product forms the data cube's schema; and measures, which can be numerically aggregated within different slices of that schema. The present description considers cubes with a single integral measure, which in some embodiments is the count of a number of records in the underlying database. However, any numerical measure could yield, through appropriate normalization, frequency distributions for use in the view discovery technique of the present invention.
The following examples and description are given in the context of an analyst and/or decision-maker responsible for analyzing a large relational database of records of events of personal vehicles, cargo vehicles, and others passing through radiation portal monitors (RPM) at US ports of entry. In OLAP database methodology, data cubes are multi-dimensional models of an underlying relational database. They are built by identifying a number of dimensions representing categories of interest from the database, each with a possibly hierarchical structure, and then forming their cross-product to represent all possible combinations of values of those dimensions, thus facilitating aggregation of critical quantities over multiple projections of interest. In this example database, the dimensions used included dimensions for multiple time representations, spatial hierarchies of collections of RPMs at different locations, and RPM attributes such as vendor. In this context, a vast collection of different views, focusing on different combinations of dimensions, and different subsets of records, are available to the user.
Operations that can be performed in the view lattice of data tensor cubes can be described according to the following. Let ℕ:={1, 2, . . . } and [N]:={1, 2, . . . , N}. For some N∈ℕ, define a data cube as an N-dimensional tensor 𝒟:=(X, c), where:
Let M:=Σx∈X c(x) be the total number of records in the database. Then 𝒟 also has relative frequencies f on the cells, so that f:X→[0,1], where f(x):=c(x)/M,
and thus Σx∈Xf(x)=1. An example of a data tensor with simulated data for our RPM cube is shown in Table 1, for X={X1, X2, X3}={RPM Manufacturer, Location, Month}, with RPM Mfr={Ludlum, SAIC}, Location={New York, Seattle, Miami}, and Month={January, February, March, April}, so that N=3. The table shows the counts c(x), so that M=74, and the frequencies f(x).
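The normalization of counts into frequencies can be illustrated in a few lines. The counts below are hypothetical stand-ins, since the actual Table 1 values are not reproduced here:

```python
# Illustrative counts for a tiny slice of the RPM cube; these numbers are
# hypothetical, not the actual Table 1 data.
counts = {
    ('Ludlum', 'Seattle', 'January'): 5,
    ('Ludlum', 'Miami', 'February'): 3,
    ('SAIC', 'New York', 'March'): 2,
}
M = sum(counts.values())                            # total record count M
freq = {cell: c / M for cell, c in counts.items()}  # f(x) = c(x) / M
```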
At any time, it is possible to look at a projection of 𝒟 along a sub-cross-product involving only certain dimensions with indices I⊂[N]. Call I a projector, and denote x↓I=⟨xki⟩i∈I∈X↓I, where X↓I:=×i∈IXi, as a projected vector and data schema. One can write x↓i for x↓{i}, and for projectors I⊂I′ and vectors x, x′∈X, x↓I⊂x′↓I′ is used to mean ∀i∈I, x↓i=x′↓i.
Count and frequency functions carry over to the projected count and frequency functions denoted c[I]:X↓I→{0, 1, . . . } and f[I]:X↓I→[0,1], so that
c[I](x↓I)=Σx′↓I=x↓I c(x′) (1)
f[I](x↓I)=Σx′↓I=x↓I f(x′) (2)
and Σx↓I∈X↓I f[I](x↓I)=1. In other words, the counts (resp. frequencies) are added over all vectors x′∈X such that x′↓I=x↓I. This is just the process of building the I-marginal over f, seen as a joint distribution over the Xi for i∈I.
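The marginalization in equations (1) and (2) can be sketched concretely; this is an illustrative Python fragment in which cells are tuples indexed by dimension position:

```python
from collections import defaultdict

def project(counts, I):
    """Build the I-marginal c[I] of a count function: sum counts over all
    cells that agree on the coordinates indexed by I (cf. equation (1))."""
    out = defaultdict(int)
    for cell, c in counts.items():
        out[tuple(cell[i] for i in I)] += c
    return dict(out)
```

The same function applied to frequencies rather than counts yields equation (2), since the total mass is preserved by the summation.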
Any set of record indices J⊂[M] is called a filter. Then the filtered count function can be considered cJ:X→{0, 1, . . . } and frequency function fJ:X→[0,1], whose values are reduced by the restriction to J⊂[M], now determining
M′:=Σx∈X cJ(x)=|J|≤M. (3)
The frequencies fJ can be renormalized over the resulting M′ to derive fJ(x):=cJ(x)/M′, (4)
so that still Σx∈XfJ(x)=1. Finally, when both a projector I and a filter J are available, then cJ[I]:X↓I→{0, 1, . . . } and fJ[I]:X↓I→[0,1] are defined analogously, where now Σx↓I∈X↓I fJ[I](x↓I)=1. Given a data cube 𝒟, denote D_{I,J} as a view of 𝒟, restricting attention to just the J records projected onto just the I dimensions X↓I, and determining counts cJ[I] and frequencies fJ[I].
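The filtering and renormalization of equations (3) and (4) can be sketched as follows; this is an illustrative Python fragment in which a filter J is given as a predicate on records (standing in for an MDX where clause), and all names are hypothetical:

```python
from collections import Counter

def filter_view(records, predicate):
    """Apply a filter J given as a predicate on records: c_J counts only the
    retained records, and the frequencies f_J are renormalized over M' = |J|."""
    kept = [r for r in records if predicate(r)]
    m_prime = len(kept)                 # M' = |J| as in equation (3)
    c_j = Counter(kept)                 # filtered counts c_J
    f_j = {cell: v / m_prime for cell, v in c_j.items()}  # equation (4)
    return c_j, f_j, m_prime
```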
In a lattice-theoretical context, each projector I⊂[N] can be cast as a point in the Boolean lattice BN of dimension N called a projector lattice. Similarly, each filter J⊂[M] is a point in a Boolean lattice BM called a filter lattice. Thus each view D_{I,J} maps to a unique node in the view lattice B:=BN×BM=2^[N]×2^[M], the Cartesian product of the projector and filter lattices.
Operations on data views can then be defined as transitions from an initial view D_{I,J} to another D_{I′,J} or D_{I,J′}, corresponding to a move in the view lattice B:
Projection: Removal of a dimension so that I′=I\{i} for some i∈I. This corresponds to moving a single step down in B, and to marginalization in statistical analyses. This results in ∀x′↓I′∈X↓I′,
cJ[I′](x′↓I′)=Σx↓I⊃x′↓I′ cJ[I](x↓I). (5)
This is also identified as an “anti-hop” operation.
Extension: Addition of a dimension so that I′=I∪{i} for some i∉I. This corresponds to moving a single step up in B, which results in a disaggregating or distributing of information about the I dimensions over the I′\I dimensions. Notationally, this is the converse of (5), so that ∀x↓I∈X↓I,
Σx′↓I′⊃x↓I cJ[I′](x′↓I′)=cJ[I](x↓I).
This is also identified as a “hop” operation.
Filtering: Removal of records by strengthening the filter, so that J′⊂J. This corresponds to moving potentially multiple steps down in B.
Flushing: Addition of records by weakening (reversing, flushing) the filter, so that J′⊃J. This corresponds to moving potentially multiple steps up in B.
Repeated view operations thus map to trajectories in B. Consider the example shown in
Regarding relational expressions and background filtering, typically M>>N, so that there are far more records than dimensions (in the present example, M=74 >3=N). In principle, filters J defining which records to include in a view can be specified arbitrarily, for example through any SQL or MDX where clause, or through OLAP operations like top n, including the n records with the highest value of some feature. In practice, filters are specified as relational expressions in terms of the dimensional values, as expressed in MDX where clauses. An example of a filter can include where RPM Mfr=“Ludlum” and (Month<=“February” and Month>=“January”), using chronological order on the Month variable to determine a filter J specifying just those 20 out of the total possible 74 records. For notational purposes, sometimes these relational expressions will be used to indicate the corresponding filters.
Note that each relational filter expression references a certain set of variables, in this case RPM Mfr and Month, denoted as R⊂[N]. Compared to the projector I, R naturally divides into two groups of variables:
Foreground: Those variables in Rf:=R∩I which appear in both the filter expression and are included in the current projection.
Background: Those variables in Rb:=R\I which appear only in the filter expression, but are not part of the current projection.
The portions of filter expressions involving foreground variables restrict the rows and columns displayed in the OLAP tool. Filtering expressions can have many sources, such as Show Only or Hide. It is common in full (hierarchical) OLAP to select a collection of siblings within a particular sub-branch of a hierarchical dimension. For example, for a spatial dimension, a user of an OLAP database software system such as ProClarity might select All→USA→California, or its children California→Cities, all siblings. Those portions of filter expressions involving background variables, however, do not change which rows or columns are displayed, but only serve to reduce the values shown in cells. In ProClarity, these are shown in the Background pane.
Table 2 shows the results of four view operations from the example data in Table 1, including a projection I={1,2,3}→I′={1,2}, a filter using relational expressions, and a filter using a non-relational expression. Table 2d shows a hybrid result of applying both the projector I′={1,2} and the relational filter expression where RPM Mfr=“Ludlum” and (Month<=“February” and Month>=“January”). Compare this to Table 2a, where there is only a quantitative restriction for the same dimensionality because of the use of a background filter. Here I={RPM Mfr, Location}, R={RPM Mfr, Month}, Rf={RPM Mfr}, Rb={Month}, M′=20.
In some instances, the filter J is fixed and the superscript on f is suppressed. The frequencies f:X→[0,1] represent joint probabilities f(x) over the dimensions Xi, so that conditional frequencies f[I1|I2]:=f[I1∪I2]/f[I2] can be defined. Individual vectors can be described as follows.
f[I1|I2](x) is the probability of the vector x↓I1∪I2 restricted to the I1∪I2 dimensions given that it is known that one can only choose vectors whose restriction to I2 is x↓I2. Note that f[I1|φ](x)=f[I1](x),f[φ|I2]≡1, and since f[I1|I2]=f[I1\I2|I2], in general assume that I1 and I2 are disjoint.
The concept of a view can then be extended to a conditional view D_{I1|I2,J} as a view on 𝒟 which is further equipped with the conditional frequency fJ[I1|I2]. Conditional views live in a different combinatorial structure than the view lattice B. Describing I1|I2 and J in a conditional view requires three sets I1,I2⊂[N] and J⊂[M], with I1 and I2 disjoint. So define A:=3^[N]×2^[M], where 3^[N] is a graded poset with the following structure:
that is, the i-element subsets of [N], split into two parts, where
An element in the poset 3^[N] corresponds to an I1|I2 by letting I1 (resp. I2) be the elements to the left (resp. right) of the |. This poset is called 3^[N] because its size is 3^N, and it corresponds to partitioning [N] into three disjoint sets: the first being I1, the second being I2, and the third being [N]\(I1∪I2). The structure 3^[2] is shown in
For a view D_{I,J}∈B, which is identified with its frequency fJ[I], or a conditional view D_{I1|I2,J}∈A, which is identified with its conditional frequency fJ[I1|I2], the aim is to measure how “interesting” or “unusual” the view is, as measured by departures from a null model. Such measures can be used for combinatorial search over the view structures B, A to identify noteworthy features in the data. The entropy of an unconditional view D_{I,J},
H(fJ[I]):=−Σx∈X↓I fJ[I](x)log(fJ[I](x)),
is a well-established measure of the information content of that view. A view has maximal entropy when every slot has the same expected count. Given a conditional view D_{I1|I2,J}, we define the conditional entropy H(fJ[I1|I2]) to be the expected entropy of the conditional distribution fJ[I1|I2], which operationally is related to the unconditional entropy as
H(fJ[I1|I2]):=H(fJ[I1∪I2])−H(fJ[I2]).
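The identity above can be checked numerically on a small joint distribution; the following is an illustrative Python fragment with a hypothetical two-dimensional distribution:

```python
import math

def entropy(dist):
    """Shannon entropy in bits of a distribution given as a dict of probabilities."""
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)

def marginal(joint, idx):
    """Marginalize a joint distribution (tuple-keyed) onto the coordinates idx."""
    out = {}
    for cell, p in joint.items():
        key = tuple(cell[i] for i in idx)
        out[key] = out.get(key, 0.0) + p
    return out

# hypothetical joint distribution over (X1, X2)
joint = {(0, 0): 0.5, (0, 1): 0.25, (1, 1): 0.25}
h_joint = entropy(joint)             # H(f[I1 ∪ I2])
h_x2 = entropy(marginal(joint, [1])) # H(f[I2])
h_cond = h_joint - h_x2              # H(f[I1 | I2]) by the identity above
```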
Given two views of the same dimensionality I, but with different filters J and J′, the relative entropy (Kullback-Leibler divergence)
D(fJ[I]∥fJ′[I]):=Σx∈X↓I fJ[I](x)log(fJ[I](x)/fJ′[I](x))
is a well-known measure of the similarity of fJ[I] to fJ′[I]. D is zero if and only if fJ[I]=fJ′[I], but it is not a metric because it is not symmetric: in general, D(fJ[I]∥fJ′[I])≠D(fJ′[I]∥fJ[I]).
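The asymmetry is easy to observe numerically; this illustrative Python fragment uses two hypothetical two-point distributions:

```python
import math

def kl(p, q):
    """Relative entropy D(p || q) in bits; assumes q > 0 wherever p > 0."""
    return sum(pv * math.log2(pv / q[k]) for k, pv in p.items() if pv > 0)

p = {'a': 0.9, 'b': 0.1}
q = {'a': 0.5, 'b': 0.5}
d_pq = kl(p, q)   # D(p || q)
d_qp = kl(q, p)   # D(q || p) -- generally a different number
```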
D is a special case of a larger class of α-divergence measures between distributions. Given two probability distributions P and Q, write their densities with respect to the dominating measure μ=P+Q as p=dP/d(P+Q) and q=dQ/d(P+Q). For any α∈ℝ, the α-divergence is
Dα(P∥Q):=(1/(α(1−α)))(1−∫ p^α q^(1−α) dμ).
The α-divergence is convex with respect to both p and q, is non-negative, and is zero if and only if p=q μ-almost everywhere. For 0<α<1, the α-divergence is bounded. The limit when α→1 returns the relative entropy between P and Q. There are other special cases that are of interest here:
In particular, the Hellinger metric √(D1/2) is symmetric in both p and q, and satisfies the triangle inequality. We prefer the Hellinger distance over the relative entropy because it is a bona fide metric and remains bounded. In our case and notation, we have the Hellinger distance as
G(fJ[I], fJ′[I]):=√(D1/2)=√(2 Σx∈X↓I (√(fJ[I](x))−√(fJ′[I](x)))²).
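The metric properties can be checked numerically. The sketch below uses the 1/√2 normalization that bounds the distance by 1; the document's own normalization constant may differ, but symmetry, boundedness, and the triangle inequality hold under either convention:

```python
import math

def hellinger(p, q):
    """Hellinger distance with the 1/sqrt(2) normalization (bounded by 1)."""
    keys = set(p) | set(q)
    return math.sqrt(0.5 * sum((math.sqrt(p.get(k, 0.0)) - math.sqrt(q.get(k, 0.0))) ** 2
                               for k in keys))

p = {'a': 1.0}              # point mass on 'a'
q = {'b': 1.0}              # point mass on 'b' (disjoint support)
r = {'a': 0.5, 'b': 0.5}    # uniform over both
```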
Based on the data views, conditional views, and information measures described herein, a variety of user-guided, and/or automated, navigational tasks can be embodied by the present invention. For example, “drill-down paths” can be described as creating a series of views with projectors I1⊃I2⊃I3 of increasingly specified dimensional structure. In practice, many analysts are challenged by complex views of high dimensionality, while still needing to explore many possible data interactions. Accordingly, embodiments of the present invention can restrict analysts to two-dimensional views only, producing a sequence of projectors I1, I2, I3 where |Ik|=2 and |Ik∩Ik+1|=1, thus effecting a permutation of the variables Xi.
An arbitrary permutation of the i∈ can be assumed so that one can refer to the dimensions X1, X2, . . . , XN in order. The choice of the initial variables X1, X2 is a free parameter to the method, acting as a kind of “seed”.
One thing that is critical to note is the following. Consider a view D_{I,J} which is then filtered to include only records for a particular member x0i0∈Xi0 of some dimension i0∈I.
Notationally, it can be said that D_{I,J′}=D_{I\{i0},J′}. Under the normal convention that 0·log(0)=0, the information measures H and G above are insensitive to the addition of zeros in the distribution. This allows for a comparison of the view to any other view of dimensionality I\{i0}.
This is illustrated in Table 3 through the continuing example, now with the filter where Location=“Seattle”. Although formally still an RPM Mfr×Location×Month cube, in fact this view lives in the RPM Mfr×Month plane, and so can be compared to the RPM Mfr×Month marginal.
Finally, some caution is necessary when the relative entropy D(fJ[I]∥fJ′[I]) or Hellinger distance G(fJ[I],fJ′[I]) is calculated from data, as their magnitudes between empirical distributions are strongly influenced by small sample sizes. To counter spurious effects, in preferred embodiments, each calculated measure can be supplemented with the probability, under the null hypothesis that the row has the same distribution as the marginal, of observing an empirical value larger than or equal to the actual value. When that probability is large, say greater than 5%, the value can be considered spurious and set to zero before proceeding with the algorithm.
In the instant example, a hop operation and a chain operation can be performed in alternating order (i.e., a hop-chain operation). One way of performing hop-chain view discovery is described below.
1. Set the initial filter to J=[M] (all records). Set the initial projector I={1,2}, determining the initial view fJ[I] as just the initial X1×X2 grid.
2. For each row xk1∈X1, calculate the Hellinger distance of its conditional distribution against the X2 marginal, G(fX2|X1=xk1, fX2), and retain the maximum row value G1, taken over all xk1∈X1. Similarly, for each column xk2∈X2, calculate G(fX1|X2=xk2, fX1), retaining the maximum column value G2, taken over all xk2∈X2.
3. The user can be prompted to select either a row x01∈X1 or a column x02∈X2. Since G1 (resp. G2) represents the row (resp. column) with the largest distance from its marginal, selecting the global maximum max(G1, G2) might be most appropriate; or the selection can be made automatically. Letting x′0 be the selected value from the selected variable (row or column) i′∈I, J′ is then set to where Xi′=x′0, and this is placed in the background filter.
4. Let i″∈I be the variable not selected by the user, so that I={i′,i″}.
5. For each dimension i′″∈[N]\I, that is, for each dimension which is neither in the background filter Rb={i′} nor retained in the view through the projector {i″}, calculate the conditional entropy of the retained view fJ′[{i″}] against that variable: H(fJ′[{i″}|{i′″}]).
6. The user is prompted to select a new variable i′″∈[N]\I to add to the projector {i″}. Since the minimizer of H(fJ′[{i″}|{i′″}]) over i′″ represents the variable with the most constraint against i″, that may be the most appropriate selection, or it can be selected automatically.
7. Let I′={i″,i′″}. Note that I′ is a sibling to I in B, thus the name “hop-chaining”.
8. Let I′,J′ be the new I,J and go to step 2.
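Steps 1-8 can be sketched end to end as a fully automated loop. The following is a minimal Python illustration with hypothetical record dictionaries; the user prompts, p-value screening, and OLAP plumbing described above are deliberately omitted:

```python
import math
from collections import Counter

def hellinger(p, q):
    """Hellinger distance (1/sqrt(2) normalization) between probability dicts."""
    keys = set(p) | set(q)
    return math.sqrt(0.5 * sum((math.sqrt(p.get(k, 0.0)) - math.sqrt(q.get(k, 0.0))) ** 2
                               for k in keys))

def dist(records, dim):
    """Empirical distribution of one dimension's values."""
    c = Counter(r[dim] for r in records)
    n = sum(c.values())
    return {k: v / n for k, v in c.items()}

def cond_entropy(records, target, given):
    """H(target | given) in bits, estimated from record counts."""
    n = len(records)
    joint = Counter((r[given], r[target]) for r in records)
    g = Counter(r[given] for r in records)
    return -sum((c / n) * math.log2(c / g[gv]) for (gv, _), c in joint.items())

def hop_chain(records, dims, steps):
    """Automated hop-chaining: alternately fix the most anomalous value of one
    view dimension (chain) and swap in the outside dimension with minimal
    conditional entropy against the retained dimension (hop)."""
    view = list(dims[:2])          # step 1: initial projector = first two dims
    background = {}                # accumulated background filter
    trajectory = []
    for _ in range(steps):
        # steps 2-4 (chain): find the (dimension, value) in the view whose
        # conditional distribution over the other view dimension is farthest
        # from that dimension's marginal
        best = None
        for d in view:
            other = view[0] if d == view[1] else view[1]
            marg = dist(records, other)
            for v in set(r[d] for r in records):
                sub = [r for r in records if r[d] == v]
                g = hellinger(dist(sub, other), marg)
                if best is None or g > best[0]:
                    best = (g, d, v, other)
        _, d, v, kept = best
        background[d] = v
        records = [r for r in records if r[d] == v]
        # steps 5-7 (hop): bring in the outside dimension most constrained
        # against the retained dimension
        outside = [x for x in dims if x != kept and x not in background]
        if not outside or not records:
            break                  # termination criteria from the description
        new = min(outside, key=lambda x: cond_entropy(records, kept, x))
        view = [kept, new]
        trajectory.append((tuple(view), dict(background)))
    return trajectory
```

Each trajectory entry pairs the new two-dimensional projector with the background filter accumulated so far, mirroring the sequence of views shown below.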
Keeping in mind the arbitrary permutation of the Xi, then the repeated result of applying this method is a sequence of hop-chaining steps in the view lattice, building up an increasing background filter:
1. I={1,2}, J=[M]
2. I′={2,3}, J′= where X1=x01
3. I″={3,4}, J″= where X1=x01, X2=x02
4. I′″={4,5}, J′″= where X1=x01, X2=x02, X3=x03
In a particular example of the hop-chain operation, ProClarity® is used in conjunction with SQL Server Analysis Services (SSAS) 2005 and the R statistical platform v. 2.7 (see http://www.r-project.org). ProClarity® is a visual analytics tool that provides a flexible and friendly GUI environment with extensive API support which is used to gather current display contents and query context for row, column and background filter selections. R is currently used in either batch or interactive mode for statistical analysis and development. Microsoft Visual Studio .Net 2005® is used to develop plug-ins to ProClarity® to pass ProClarity® views to R for hop-chain calculations.
A first view of the data set used in the instant example is shown in
For purposes of this description, only a single step for the hop-chaining procedure against the alarm summary data cube is shown.
The distributions for roles seem to vary at most by overall magnitude, rather than shape, while the distributions for months appear almost identical. However, the accompanying plot shows the Hellinger distances G of the rows and columns against their marginals.
The subsequent view for X2=Months×X3=Day of Month is then shown in
While a number of embodiments of the present invention have been shown and described, it will be apparent to those skilled in the art that many changes and modifications may be made without departing from the invention in its broader aspects.
This application claims priority from U.S. Provisional Patent Application No. 61/262,403, entitled Methods for Discovering Significant Portions of a Multi-Dimensional Database, filed Nov. 18, 2009.
This invention was made with Government support under Contract DE-AC0576RL01830 awarded by the U.S. Department of Energy. The Government has certain rights in the invention.