The present invention relates to data analysis and, more particularly, to situations in which a “conservation law” exists between related quantities.
Given a pair of numerical data sequences, call them a and b, and an ordered attribute t such as time, the quantities obey a conservation law when the sum of the values in a up to T equals the sum of the values in b up to T, for every value T of t. Thus, current flowing into and out of a circuit node obeys a conservation law because the amount of current flowing into the node equals the amount of current flowing out.
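As a minimal sketch of the exact law just stated, the check amounts to comparing running totals of the two sequences. The function names are illustrative assumptions and do not appear in the specification.

```python
from itertools import accumulate

def obeys_conservation_law(a, b):
    """Return True if every prefix sum of a equals the matching prefix sum of b."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return all(x == y for x, y in zip(accumulate(a), accumulate(b)))

def first_violation(a, b):
    """Return the 1-based position T of the first unequal prefix sum, or None."""
    for t, (x, y) in enumerate(zip(accumulate(a), accumulate(b)), start=1):
        if x != y:
            return t
    return None
```

Real data rarely passes such a strict test, which is why the remainder of the text measures the degree of agreement instead.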
Thus in accordance with the present invention, we have recognized that even when two numerical sequences do not strictly obey a conservation law, it can be useful to carry out a conservation-law-based analysis of such data and, more specifically, to generate data indicating the degree to which the conservation law is or is not obeyed or satisfied. Examples of such pairs of sequences are a) the number of inbound packets at an IP router versus the number of outbound packets; b) the number of persons entering a building versus the number of persons leaving; and c) the dollar value of the charges run up by a group of credit card holders versus the payments made by those credit card holders.
We have thus recognized that a conservation law is often an abstract ideal against which real data may be calibrated. Short-term deviations naturally occur from delays and measurement inaccuracies (credit card holders carry a balance; two people who enter a building at the same time may leave at different times; packets entering an IP router may be buffered therein and may thus encounter delays between the router input and output ports; etc.). Major violations may be caused by unusual phenomena or data quality problems; e.g., the reported number of people entering a building may be consistently higher than the number of people leaving (via the front door) if there exists an unmonitored side exit.
We have thus further recognized that it would be useful to discover for which portions, or “subsets,” of the data a conservation law approximately holds (or fails) and to summarize them in a semantically meaningful way. Such a summarization allows one to, for example, quantify the extent of missing or delayed events that the data represents.
To this end, and in accordance with an aspect of the invention, we define what we call a conservation dependency. A conservation dependency is, in essence, an underlying conservation law coupled with a tableau. The tableau provides information about the degree to which particular subsets of the data do or do not satisfy the underlying conservation law. A tableau describing subsets that satisfy the law is referred to as a “hold tableau,” and one describing subsets that do not is referred to as a “fail tableau.” Specifically, a hold tableau identifies subsets of the data (e.g., ranges of time for a time-ordered sequence of data) that satisfy the underlying conservation law to at least a specified degree, or “confidence.” By contrast, a fail tableau identifies subsets of the data that do not satisfy the underlying conservation law to at least a specified degree. Which type of tableau is the more useful for understanding properties of the data will depend on the nature of the data.
Consider, for example, sequences of credit card charges and payments over time for a given bank. In a general sense, the aggregate amount of charges over time tends to be equal to the aggregate amount of payments because most people pay their bills. Thus the conservation law “charges=payments” is an appropriate model to consider when analyzing such data. However, the aggregate of charges up to any given point in time is going to exceed the aggregate of payments up to that same point in time because of payment grace periods and other factors such as spending habits. Thus in this setting the conservation law “charges=payments” holds only approximately.
That being said, seeing, through a tableau such as that shown
More rigorously, assume a data set comprising a pair of numerical sequences a={a1, a2, . . . ai, . . . an} and b={b1, b2, . . . bi . . . bn} with ai, bi≧0 and for which the ith pair of values, (ai, bi), is associated with the ith value, ti, of an ordered attribute t={t1, t2, . . . ti, . . . tn}.
Given such a data set, the invention provides a tableau which comprises one or more subsets of values of the ordered attribute t that meet at least a first specified criterion, that criterion being that, for at least a specified fraction ŝ of the data set, a confidence measure for the pairs of values associated with each subset in the tableau is a) at least equal to a confidence value ĉ (when we want to provide a hold tableau) or b) no more than a confidence value ĉ (when we want to provide a fail tableau). The confidence measure for the pairs of values associated with each subset in the tableau is a measure of the degree to which those pairs of values deviate from an exact conservation law for the data in question.
In illustrative embodiments of the invention, the confidence measure for the pairs of values {(ai, bi), (ai+1, bi+1), . . . , (aj, bj)} for any interval {i, i+1, . . . , j} is a function of an area between two curves: A={A1, A2, . . . Ai, . . . An}, derived from a, where A0=0 and A_i = Σ_{l≦i} a_l, and B={B1, B2, . . . Bi, . . . Bn}, derived from b, where B0=0 and B_i = Σ_{l≦i} b_l. That area, in particular, is the area between a segment of curve A between Ai and Aj and a segment of curve B between Bi and Bj.
a) shows a fail tableau, pursuant to the principles of the present invention, for credit card data in New Zealand;
b) shows credit card charges and payments for the Bank of New Zealand for Decembers between 1981-2008;
c) show Bank of New Zealand credit card charges/payments data over a period of years depicted in shows credit card charges and payments for the Bank of New Zealand for Januaries between 1981-2008;
a) through 6(c) illustrate the computer running time of algorithms implementing the present invention; and
Many constraints and rules have been used to represent data semantics and analyze data quality: functional, inclusion and sequential dependencies [2, 6, 11], association rules [1], etc. We identify a broad class of tools that are not covered by existing constraints, where a “conservation law” exists between related quantities. Given a pair of numerical sequences, call them a and b, and an ordered attribute t such as time, an exact conservation law states that the sum of the values in a up to T equals the sum of the values in b up to T, for all t=T. We propose a new class of tools—conservation dependencies—to express and validate such laws.
Conservation laws are obeyed to varying degrees in practice, with strong agreement in physics (current flowing in=current flowing out of a circuit), less so in traffic monitoring (number of inbound packets at an IP router=number of outbound packets; number of persons entering a building=number of persons leaving), and perhaps even less so in finance (charges=payments). Indeed, conservation laws are often abstract ideals against which real data may be calibrated. Short-term deviations naturally occur from delays and measurement inaccuracies (credit card holders carry a balance; two people who enter a building at the same time may leave at different times, etc.). Major violations may be caused by unusual phenomena or data quality problems; e.g., the reported number of people entering a building is consistently higher than the number of people leaving (via the front door) if there exists an unmonitored side exit. When it is unknown for which portions of the data the law approximately holds (or fails), discovering these subsets and summarizing them in a semantically meaningful way is important.
Consider sequences of credit card charges and payments over time. Because of payment grace periods and other factors such as spending habits, we expect the “charges=payments” law to hold only approximately. The degree (confidence) to which this law is satisfied should depend on the duration and magnitude of any imbalance, and can help understand customer patterns. Taking the ratio of total payments and total charges in a given time interval is insufficient as it does not take delays into account: a customer who regularly pays her bills would have the same confidence as one who carried a balance throughout the time interval and then paid all her bills at the end. Furthermore, since we may only have access to aggregated data due to privacy issues (e.g., total monthly charges and payments for groups of customers), it may not be possible to correlate individual charges with payments and calculate the average payment delay. Instead, at any time T, we assert that the cumulative charges and payments up to T cancel each other out.
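The contrast between the two customers can be made concrete with a small sketch. The data values and function names are invented for illustration, and the cumulative measure shown is only the whole-sequence special case (baseline of zero) of the area-based confidence defined formally later in the text.

```python
from itertools import accumulate

def ratio_confidence(payments, charges):
    """Total payments divided by total charges; blind to when payments arrive."""
    return sum(payments) / sum(charges)

def cumulative_confidence(payments, charges):
    """Area under cumulative payments divided by area under cumulative charges."""
    A = list(accumulate(payments))
    B = list(accumulate(charges))
    return sum(A) / sum(B)

charges = [10, 10, 10, 10]
prompt_payer = [10, 10, 10, 10]   # pays each bill as it arrives
late_payer = [0, 0, 0, 40]        # carries a balance, then pays everything at the end

for name, payments in [("prompt", prompt_payer), ("late", late_payer)]:
    print(name,
          ratio_confidence(payments, charges),
          cumulative_confidence(payments, charges))
```

Both customers have a ratio of 1, but the cumulative measure separates them: the prompt payer scores 1 while the late payer scores 40/100, because the cumulative payment curve stays at zero for most of the interval.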
In
We refer to the above measure as the balance model since it penalizes for the outstanding balance existing just before a given interval. We may also ignore prior imbalance as a way of dealing with permanently missing or lost values. For example, we can ask if a customer would have a good credit rating if the outstanding debt at the beginning of the interval were either forgiven or paid down, while penalizing new debt in the usual way. In
Our second example is network traffic monitoring, where the traffic entering a node (e.g., Internet router, road intersection, building), aggregated over the incoming edges (links, roads, doorways) should equal the traffic exiting the node, aggregated over the outgoing edges [12].
We remark that our confidence measure is different from that used for time series similarity. To achieve high confidence with respect to a conservation dependency, the two cumulative curves must track each other closely, but need not be similar in terms of the pattern and shape-based properties commonly used for matching (e.g., translation and scale invariance, warping). Conversely, two time series considered similar could violate the conservation law.
We formulate conservation dependencies (CDs) and propose measures to quantify the degree to which a CD is satisfied. We assume that the quantities participating in the CD, such as charges and payments, are obvious from the application. However, it is typically not obvious where in a large data set the CD holds (which is important for data mining and understanding data semantics) or fails (which is important for data quality analysis).
Definition 1 A (ĉ, ŝ)-hold tableau is a collection of subsets, each of confidence at least ĉ, whose union has size at least a fraction ŝ of the size of the universe. (In the simplest case, the subsets are just intervals of time.) We say simply hold tableau if ĉ, ŝ are understood. Similarly, a (ĉ, ŝ)-fail tableau is a collection of subsets, each of confidence at most ĉ, whose union has size at least a fraction ŝ of the size of the universe; we say fail tableau if ĉ, ŝ are understood. A tableau is either a hold tableau or a fail tableau.
Definition 2 The tableau discovery problem for CDs is to find a smallest (or almost smallest) hold or fail tableau for the given dataset.
Typically, ŝ and ĉ are supplied by the user or a domain expert. The minimality condition is crucial to producing easy-to-read, concise tableaux that capture the most significant patterns in the data.
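Definition 1 can be read as a two-part test, sketched below for the simplest case in which the subsets are intervals [i, j] over positions 1..n. The `confidence` argument is a placeholder for whichever confidence function is in use; it and the function names are assumptions made for this illustration.

```python
def covered_positions(intervals):
    """Return the set of positions covered by the union of closed intervals."""
    covered = set()
    for i, j in intervals:
        covered.update(range(i, j + 1))
    return covered

def is_hold_tableau(intervals, confidence, n, c_hat, s_hat):
    """Every interval has confidence >= c_hat and the union covers >= s_hat * n."""
    return (all(confidence(i, j) >= c_hat for i, j in intervals)
            and len(covered_positions(intervals)) >= s_hat * n)

def is_fail_tableau(intervals, confidence, n, c_hat, s_hat):
    """Every interval has confidence <= c_hat and the union covers >= s_hat * n."""
    return (all(confidence(i, j) <= c_hat for i, j in intervals)
            and len(covered_positions(intervals)) >= s_hat * n)
```

Overlapping intervals are counted once, since support is defined on the size of the union rather than on the sum of interval lengths.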
There are two technical issues in tableau generation for CDs: identifying intervals with confidence above (resp., below) ĉ and constructing minimal tableaux using some of them. The second issue corresponds to the partial set cover problem. Prior work shows how to use the greedy algorithm for partial set cover to choose a collection of intervals, each of confidence above (resp., below) ĉ, whose union has size at least ŝn [11]. (The greedy algorithm is guaranteed, when the input is a collection of intervals, to generate a set of size at most a small constant times the size of the optimal such collection, but in practice, the algorithm produces solutions of size close to optimal.) The first issue is constraint-specific and requires novel algorithms, which we propose in this paper.
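The second step can be illustrated with a plain greedy heuristic that repeatedly takes the candidate interval adding the most uncovered positions. This is a simplified sketch written for this discussion, not the specific procedure of [11]; the function name and the quadratic bookkeeping are assumptions chosen for clarity rather than speed.

```python
import math

def greedy_partial_cover(intervals, n, s_hat):
    """Pick intervals greedily until at least ceil(s_hat * n) positions are covered.

    intervals: list of (i, j) pairs with 1 <= i <= j <= n, all assumed to
    already meet the confidence requirement. Returns the chosen intervals,
    or None if the target coverage cannot be reached.
    """
    target = math.ceil(s_hat * n)
    covered = set()
    chosen = []
    remaining = list(intervals)
    while len(covered) < target:
        best, best_gain = None, 0
        for i, j in remaining:
            gain = len(set(range(i, j + 1)) - covered)
            if gain > best_gain:
                best, best_gain = (i, j), gain
        if best is None:          # nothing left that adds new coverage
            return None
        chosen.append(best)
        covered.update(range(best[0], best[1] + 1))
        remaining.remove(best)
    return chosen
```

For example, with candidates [1, 4], [3, 6], [7, 10] and [5, 5] over n=10 and ŝ=0.8, the sketch selects [1, 4] and then [7, 10], which together cover eight positions.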
Let n be the length of each sequence participating in the given CD. We define a candidate interval [i, j], where 1≦i≦j≦n, as one whose confidence is above (resp., below) ĉ. Since larger intervals are better when constructing minimal tableaux (fewer are needed to cover the desired fraction of the data), we are only interested in left-maximal candidate intervals, i.e., for each left endpoint, the longest possible right endpoint satisfying confidence. There are at most n maximal candidates, but an exhaustive algorithm tests all Θ(n²) possible intervals. We propose approximation algorithms that run in O((1/ε) n log n) time (under a reasonable assumption), and return maximal intervals with confidence above a slightly lower threshold ĉ/(1+ε) for hold tableaux, or below a slightly higher threshold of (1+ε)ĉ for fail tableaux. The running time of our algorithms depends on the area under the cumulative curves, but we will show that they are over an order of magnitude faster on real data sets than an exhaustive solution, even for small ε.
We also consider the CD tableau discovery problem when many pairs of sequences are given (e.g., one per customer or one per network node), each labeled with attributes such as customer marital status and city of residence, or router name, location and model. Here, each tableau entry is an interval on the ordered attribute, plus a pattern on the label attributes that identifies a cluster of pairs of sequences. For example, Table 1 identifies two subsets of a sequence of inbound and outbound network traffic: Router100 (of type Edge) measurements taken between January 1 and 15, and measurements from all Backbone routers (the “*” denotes all values of a given attribute) between January 12 and 31.
Finally, we illustrate the utility of CDs and their discovered tableaux, and the efficiency of our algorithms, on a variety of real data sets. These include the credit card and building entrance/exit data shown in
The remainder of this paper is organized as follows. Section 2 discusses related work. In Section 3, we give formal problem statements. In Section 4, we present the tableau algorithms. Section 5 presents experimental results, and Section 6 concludes the paper.
Our work is most closely related to existing work on tableau discovery for functional [4, 7, 10] and sequential dependencies [11]. While we borrow the general idea of tableau discovery, we propose novel constraint and confidence metrics, and novel algorithms for efficiently identifying candidate intervals. In particular, even if the interval finding algorithm from [11] could be adapted to CDs, it would not be applicable because our confidence definitions violate one of its assumptions, namely that the confidence of two highly overlapping intervals of similar size be similar (to see this, construct an interval I′ by starting with I and adding a single arbitrarily large “charge” without a corresponding “payment”).
Concepts similar to conservation laws have been discussed in the context of network traffic analysis [9, 12], clustering data for deduplication [3], and consistent query answering under “aggregation constraints” [5]. This paper is the first to address confidence metrics and tableau discovery for CDs.
The data mining literature investigated finding intervals that satisfy various properties, e.g., mining support-optimized association rules [8], where the objective is to find intervals of an array having a mean value above a specified minimum. That solution is not compatible with our confidence definitions, but can express two related metrics. The first metric divides the total “payments” by the total “charges” in a given interval. As discussed in Section 1.1, this definition does not account for delays. The second definition divides the total cumulative payments by the total cumulative charges in a given interval, which amounts to piecewise computation of the areas under the cumulative curves, where the “baseline” is always at zero. However, intervals near the end of the sequence have relatively larger areas under the cumulative curves, thereby skewing the resulting confidence (in contrast, our measures use variable baselines). In Section 5, we experimentally show that these measures are not effective at detecting violations in the conservation law.
Let a=a1, a2, . . . , an and b=b1, b2, . . . , bn, with ai, bi≧0, be two sequences with respect to some ordered attribute such as time; we assume uniform spacing on this attribute between adjacent indices. When a and b are governed by a conservation law, a can be thought of as containing counts of responses to events whose counts are contained in b (e.g., a=payments and b=charges). Given an interval {i, i+1, . . . , j}, which we write as [i, j], let a[i, j] denote the subsequence of a that includes entries in the positions i through j inclusive. We also define the derived cumulative time series A=(A0, A1, . . . , An) from a where A0=0 and A_i = Σ_{l≦i} a_l. B is defined analogously based on b. We assume B dominates A, that is, Bi≧Ai for all i. Even if this is not the case, there are various (domain-specific) ways of preprocessing to satisfy this assumption; e.g., in the credit card example, we can account for prepayment toward future charges by setting A′l:=min {Al, Bl} and B′l:=Bl for all 1≦l≦n.
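A short sketch of this setup follows: it builds the cumulative series with the leading zero entry and applies the min-based preprocessing described for the credit card example. The function names are illustrative assumptions.

```python
from itertools import accumulate

def cumulative(seq):
    """Return [0, s1, s1+s2, ...], so that index i holds the sum of the first i values."""
    return [0] + list(accumulate(seq))

def enforce_dominance(A, B):
    """Cap A at B position by position so that B dominates A; B is returned unchanged."""
    capped = [min(x, y) for x, y in zip(A, B)]
    return capped, list(B)

payments = [5, 1, 1]   # a: an early prepayment exceeds the charges so far
charges = [3, 3, 3]    # b
A = cumulative(payments)
B = cumulative(charges)
A_prime, B_prime = enforce_dominance(A, B)
```

Here A is [0, 5, 6, 7] and B is [0, 3, 6, 9]; the preprocessing lowers the second entry of A to 3 so that the cumulative payments never exceed the cumulative charges.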
We define the confidence of a given CD within the scope of interval I=[i, j] in terms of the area between the curves A and B in this interval, normalized by the area under B, down to a “baseline,” to be formally defined below. This gives the “divergence” of A with respect to B; confidence is then defined as the complement.
Definition 3 Given two cumulative sequences A and B, the confidence conf(I) of A with respect to B in interval I=[i, j] (1≦i≦j≦n) is defined as

conf(I) = 1 − [Σ_{l=i}^{j} (B_l − A_l)] / [Σ_{l=i}^{j} (B_l − A_{i−1})].

Note that 0≦conf(I)≦1. Here Ai−1, the cumulative amount from the response curve up to but not including i, is the baseline from which we measure area under B, so that the gap between A and B to just before i is taken into account. Clearly, intervals with different starting points may have different baselines. Alternatively, this formula can be written as the ratio of the area under A to the area under B, in the interval [i, j], using the same baseline Ai−1:

conf(I) = [Σ_{l=i}^{j} (A_l − A_{i−1})] / [Σ_{l=i}^{j} (B_l − A_{i−1})].
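The ratio form described above (area under A over area under B, both measured from the baseline Ai−1) can be sketched directly. The function name and the sample values are assumptions for illustration; the cumulative lists carry a leading zero so that list index l corresponds to position l.

```python
def conf_balance(A, B, i, j):
    """Balance-model confidence of [i, j]: areas above the baseline A[i-1]."""
    base = A[i - 1]
    area_a = sum(A[l] - base for l in range(i, j + 1))
    area_b = sum(B[l] - base for l in range(i, j + 1))
    if area_b == 0:
        raise ValueError("confidence is undefined when the area under B is zero")
    return area_a / area_b

# Cumulative payments and charges for a = [2, 3, 1, 6] and b = [4, 2, 3, 3].
A = [0, 2, 5, 6, 12]
B = [0, 4, 6, 9, 12]
print(conf_balance(A, B, 1, 4))   # 25/31
print(conf_balance(A, B, 3, 4))   # 8/11
```

For [3, 4] the baseline is A2=5, so the areas are (6−5)+(12−5)=8 and (9−5)+(12−5)=11. Equivalently, the divergence is the gap area (9−6)+(12−12)=3 divided by 11, and 1−3/11 gives the same value.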
We call the above confidence measure the balance model (and will denote it as conf_b when there is ambiguity as to which confidence function is used) because it penalizes for the balance B_{i−1} − A_{i−1} existing just before i. We may wish to compensate for this balance, so that only events (and responses) that occur within [i, j] affect confidence. However, only disparities not due to delay should be compensated for since resolution of such delays may occur within [i, j]. Therefore, we use the smallest difference between A and B in the suffix from i to n as an estimate of loss, since this difference is guaranteed not to be due to delay. Thus, the total compensation Δ_i at i is defined as min_{i≦l≦n} {B_l − A_l}.
There are two natural ways to discount an outstanding balance at i. We can give a credit to A to inject responses to events, or we can debit B to cancel unmatched events. In the former (referred to as the credit model), A is credited by Δi at i. In the latter (referred to as the debit model), B is debited by Δi at i. Either way, note that our choice of Δ ensures that B still dominates A.
Choosing one model over another depends on the aspects that one wishes to highlight. We can use the credit model if we suspect missing data in A, or wish to calculate what the confidence would have been had the unmatched events seen responses at i. The debit model is more appropriate when events in B may have been spuriously reported, or when we wish to calculate the confidence had those events not occurred.
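Definition 4 Given two cumulative sequences A and B, the confidence conf_c(I) of A with respect to B in interval I=[i, j], with credit applied at i, is defined as

conf_c(I) = [Σ_{l=i}^{j} (A_l + Δ_i − A_{i−1})] / [Σ_{l=i}^{j} (B_l − A_{i−1})].

Definition 5 Given two cumulative sequences A and B, the confidence conf_d(I) of A with respect to B in interval I=[i, j], with debit applied at i, is defined as

conf_d(I) = [Σ_{l=i}^{j} (A_l − A_{i−1})] / [Σ_{l=i}^{j} (B_l − Δ_i − A_{i−1})].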
We state the following simple claim without proof. Given an interval I, conf_b(I)≦conf_d(I)≦conf_c(I).
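The ordering in this claim can be checked numerically on a small example. The sketch below assumes one reading of the three models: the credit model adds Δi to every value of A in the interval, the debit model subtracts Δi from every value of B, and all areas are measured from the baseline Ai−1. The function names and data are illustrative, and positive denominators are assumed.

```python
def compensation(A, B, i):
    """Delta_i: the smallest gap B[l] - A[l] over the suffix l = i..n."""
    return min(B[l] - A[l] for l in range(i, len(A)))

def three_confidences(A, B, i, j):
    """Return balance, credit and debit confidences of [i, j] as a dict."""
    base = A[i - 1]
    width = j - i + 1
    delta = compensation(A, B, i)
    area_a = sum(A[l] - base for l in range(i, j + 1))
    area_b = sum(B[l] - base for l in range(i, j + 1))
    return {
        "balance": area_a / area_b,
        "credit": (area_a + width * delta) / area_b,   # A raised by delta
        "debit": area_a / (area_b - width * delta),    # B lowered by delta
    }

# Cumulative payments and charges for a = [2, 3, 1, 4] and b = [4, 2, 3, 3].
A = [0, 2, 5, 6, 10]
B = [0, 4, 6, 9, 12]
c = three_confidences(A, B, 3, 4)
print(c)
```

On [3, 4] the compensation is min(3, 2)=2, giving 6/11 for the balance model, 6/7 for the debit model and 10/11 for the credit model, in agreement with the stated ordering.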
In practice, there may be natural delays between events and responses, e.g., credit card payments can be made up to a month after a bill arrives. When delays are constant and persistent, their impact on confidence can be removed by simple preprocessing: we set A′i:=Ai+s, for a time shift of s, and compute the confidence using curves A′ and B. Finding the right shift length is an important problem but is outside the scope of this paper; we assume that s, if any, is given.
We now define the interval candidate generation problem, which finds a set of left-maximal intervals (for each i, the largest j approximately satisfying confidence).
Definition 6 The CD Interval Discovery Problem is, given two derived cumulative sequences A and B with Bi≧Ai for all i, and confidence threshold ĉ, to find all maximal intervals (if any exist), all of whose confidence is at least ĉ (for a hold tableau), resp., at most ĉ (for a fail tableau), as measured by conf_b, conf_c, or conf_d.
We then use the partial interval cover algorithm from [11] to construct a minimal tableau from the discovered maximal intervals (recall Definition 2).
So far, we assumed that the input consists of a single pair of sequences. Due to space constraints, we discuss interval and tableau generation for multiple pairs of sequences in Appendix B.
In this section, we present novel algorithms to (almost) generate, for each i, either a largest j≧i such that the confidence of [i, j] is at least ĉ, in the case of a hold tableau, or, in the case of a fail tableau, a largest j≧i such that the confidence of [i, j] is at most ĉ, under the balance, credit, and debit models. (The “almost” is explained below.) We begin with hold tableaux, exhibiting later the changes necessary to handle fail tableaux.
We are given two integral sequences, a=a1, a2, . . . , an and b=b1, b2, . . . , bn, as well as the confidence threshold parameter ĉ. Using linear time for preprocessing, the cumulative time series A and B can be obtained. A naive algorithm may consider all quadratically many intervals and compute the confidence for each chosen interval individually to check if it satisfies the threshold ĉ. This leads to a Θ(n³)-time algorithm. However, again by simple linear-time preprocessing, the time complexity of this naive exhaustive search can be reduced to Θ(n²). For large datasets a quadratic-time algorithm is infeasible to run. Hence, our goal is to design scalable algorithms in the same spirit as in previous works on other kinds of dependencies and association rules [8, 10, 11].
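One way to obtain the quadratic bound is to precompute prefix sums of the cumulative series, so that each interval's areas follow from a few subtractions. The sketch below does this for hold intervals under the balance model; the function name and the choice of model are assumptions made for illustration.

```python
from itertools import accumulate

def exhaustive_hold_intervals(a, b, c_hat):
    """For each start i, find the largest j with balance confidence >= c_hat.

    Returns a dict {i: j}; starts with no qualifying interval are omitted.
    """
    n = len(a)
    A = [0] + list(accumulate(a))
    B = [0] + list(accumulate(b))
    SA = list(accumulate(A))   # SA[k] = A[1] + ... + A[k], since A[0] = 0
    SB = list(accumulate(B))
    result = {}
    for i in range(1, n + 1):
        base = A[i - 1]
        for j in range(n, i - 1, -1):          # try the longest interval first
            width = j - i + 1
            area_a = SA[j] - SA[i - 1] - width * base
            area_b = SB[j] - SB[i - 1] - width * base
            if area_b > 0 and area_a >= c_hat * area_b:
                result[i] = j
                break
    return result

print(exhaustive_hold_intervals([2, 3, 1, 6], [4, 2, 3, 3], 0.8))
```

With ĉ=0.8 this example reports [1, 4], [2, 4] and [4, 4]; no interval starting at 3 qualifies, since its best confidence is 8/11.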
The intervals [i, j] that we report have confidence at least ĉ/(1+ε), though maybe not at least the target ĉ. We make this approximation to guarantee fast running time. However, if the longest interval beginning at i and having confidence at least ĉ is [i, j*], the algorithm will report an interval [i, j′] with j′≧j* having confidence at least ĉ/(1+ε). We give one simple generic algorithm that works for all three models (balance, credit, and debit) and for other choices of baseline as well.
We have 0=A0≦A1≦A2≦A3≦ . . . ≦An and 0=B0≦B1≦B2≦B3≦ . . . ≦Bn, satisfying Ai≦Bi for all i. From these two sequences, two more integral sequences, H_1^A, H_2^A, H_3^A, . . . , H_n^A and H_1^B, H_2^B, H_3^B, . . . , H_n^B, are defined in a problem-dependent way. These sequences must satisfy A_l − H_i^A ≧ 0 and B_l − H_i^B ≧ 0 for all l≧i.
The reader can verify that these H_i^A and H_i^B values correspond to the baseline values already discussed earlier, and (hence) that A_l − H_i^A ≧ 0 and B_l − H_i^B ≧ 0 for all l≧i. In all three cases, all the n H_i^A and n H_i^B values can be computed in O(n) time.
Definition 7 Define areaA(i, j) = Σ_{l=i}^{j} (A_l − H_i^A) and areaB(i, j) = Σ_{l=i}^{j} (B_l − H_i^B). (Note that the subscript on the H terms is i, not l.) Recall that the confidence conf(i, j)=areaA(i, j)/areaB(i, j), provided the denominator is positive.
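The baselines for all three models can be produced in a single backward pass that maintains the suffix minimum of B−A. The sketch below assumes the following reading of the models: balance uses Ai−1 for both baselines, credit lowers the A baseline by Δi, and debit raises the B baseline by Δi. The credit baseline matches the expression given later in the text; the debit baseline is an inference from the description of debiting B. All names are illustrative.

```python
def baselines(A, B, model):
    """Return (H_A, H_B), indexed 1..n, for 'balance', 'credit' or 'debit'."""
    n = len(A) - 1
    delta = [0] * (n + 1)
    running = float("inf")
    for l in range(n, 0, -1):              # suffix minimum of B[l] - A[l]
        running = min(running, B[l] - A[l])
        delta[l] = running
    H_A = [None] * (n + 1)
    H_B = [None] * (n + 1)
    for i in range(1, n + 1):
        base = A[i - 1]
        if model == "balance":
            H_A[i], H_B[i] = base, base
        elif model == "credit":
            H_A[i], H_B[i] = base - delta[i], base
        elif model == "debit":
            H_A[i], H_B[i] = base, base + delta[i]
        else:
            raise ValueError("unknown model: " + model)
    return H_A, H_B

def confidence(A, B, H_A, H_B, i, j):
    """areaA(i, j) / areaB(i, j), with both baselines taken at i."""
    area_a = sum(A[l] - H_A[i] for l in range(i, j + 1))
    area_b = sum(B[l] - H_B[i] for l in range(i, j + 1))
    return area_a / area_b

A = [0, 2, 5, 6, 10]
B = [0, 4, 6, 9, 12]
credit_HA, credit_HB = baselines(A, B, "credit")
debit_HA, debit_HB = baselines(A, B, "debit")
```

Index 0 of each returned list is unused so that list positions line up with the 1-based indices of the text.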
The algorithm is very simple (see Appendix C for pseudocode):
Since the algorithm seeks the largest ril, an efficient heuristic is to try the largest possible ril first, then the second largest, and so on, stopping as soon as it finds an interval of confidence at least ĉ/(1+ε).
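The algorithm listing is not reproduced in this text, so the following is only a guess at the general shape of such a scheme and not the authors' pseudocode. It assumes integral data and the balance model, takes as candidate right endpoints the largest j whose areaB(i, j) stays within each of the thresholds 1, (1+ε), (1+ε)², and so on, and tests candidates from the largest downward against the relaxed threshold ĉ/(1+ε). The function name and the use of binary search are assumptions, and the sketch makes no attempt to match the running time stated in the text.

```python
from itertools import accumulate

def approx_hold_intervals(a, b, c_hat, eps):
    """For each start i, return some j with balance confidence >= c_hat / (1 + eps)."""
    n = len(a)
    A = [0] + list(accumulate(a))
    B = [0] + list(accumulate(b))
    SA = list(accumulate(A))
    SB = list(accumulate(B))

    def area(S, base, i, j):
        return S[j] - S[i - 1] - (j - i + 1) * base

    result = {}
    for i in range(1, n + 1):
        base = A[i - 1]
        total = area(SB, base, i, n)
        if total <= 0:
            continue
        candidates = set()
        threshold = 1.0
        while True:
            lo, hi = i - 1, n              # lo = i - 1 stands for "no endpoint fits"
            while lo < hi:                 # largest j with areaB(i, j) <= threshold
                mid = (lo + hi + 1) // 2
                if area(SB, base, i, mid) <= threshold:
                    lo = mid
                else:
                    hi = mid - 1
            if lo >= i:
                candidates.add(lo)
            if threshold >= total:
                break
            threshold *= 1 + eps
        for j in sorted(candidates, reverse=True):
            area_b = area(SB, base, i, j)
            if area_b > 0 and area(SA, base, i, j) * (1 + eps) >= c_hat * area_b:
                result[i] = j
                break
    return result

print(approx_hold_intervals([2, 3, 1, 6], [4, 2, 3, 3], 0.8, 0.25))
```

The binary search is valid because areaB(i, j) never decreases as j grows. If [i, j*] has confidence at least ĉ, the candidate for the first threshold at or above areaB(i, j*) ends at or after j*, has at least as much area under A, and has less than (1+ε) times the area under B, so it passes the relaxed test.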
We prove the following theorem in Appendix D.
Assuming that the areaB(1, n) values are only polynomially large in n (or can be scaled downward along with the A's so as to become so), the lg areaB(1, n) factor in the running time is only logarithmic in n. Indeed, if areaB(1, n) were exponential in n, one would need n bits of storage just to represent one areaB value, and arithmetic on such numbers could not be done in constant time.
The next theorem, which we prove in Appendix E, shows that the values produced by the algorithm are accurate.
To explain why we want j′≧j*, suppose an optimal tableau uses, say, m intervals, each of the form [i, j*], each of confidence at least ĉ, whose union covers at least a specified fraction, ŝ of {1, 2, . . . , n}. We may assume that each j* is largest such that [i, j*] has confidence at least ĉ. By the “no-false-negatives” property, the algorithm will generate an interval [i, j′], j′≧j*, so that [i, j′]⊃[i, j*], and hence there will exist a tableau of at most m intervals [i, j′] produced by the algorithm (with intervals having confidence at least ĉ/(1+ε), not ĉ).
For fail tableaux, we want, ideally, to generate intervals [i, j′], where j′ is largest such that conf(i, j′)≦ĉ (as opposed to conf(i, j′)≧ĉ, in the case of a hold tableau). We instead generate intervals with confidence at most ĉ(1+ε). It is important to note that while we now want confidence bounded above by ĉ (or ĉ(1+ε)), if the optimal interval is [i, j*], we still need j′≧j*, not j′≦j*. The reason is that, once again, we want to know that if the optimal tableau consists of m intervals, then there is a collection of m algorithm-generated intervals of equally high support.
The generic algorithm for fail tableaux uses areaA(i, j) in place of areaB(i, j), on which the generic algorithm above relies. We will need to treat the balance and debit models differently from the credit model. We start with the former two.
Here is the algorithm to choose intervals for fail tableaux in the balance or debit model (see Appendix F for pseudocode):
We prove an analogue of Theorem 2 in Appendix G.
For running time, we state the following analogue of Theorem 1, which follows from the monotonicity of H_i^A in the balance and debit models.
The algorithm for fail tableaux using the balance and debit models relied on monotonicity of H_i^A, which, in the credit model, equals A_{i−1} − min_{k≧i} {B_k − A_k} and which is provably not monotonic. The solution is to use the breakpoints sil defined for the balance model. Let us define areaAb(i, j) and areaAc(i, j) to be areaA(i, j) in the balance and credit model, respectively. Specifically, areaAb(i, j) = Σ_{l=i}^{j} [A_l − A_{i−1}] and areaAc(i, j) = Σ_{l=i}^{j} [A_l − (A_{i−1} − min_{k≧i} {B_k − A_k})]. Define areaBc(i, j) to be (as expected) the area for B between i and j in the credit model (specifically, areaBc(i, j) = Σ_{l=i}^{j} [B_l − A_{i−1}]). Confidence conf_c(i, j) is then defined to be areaAc(i, j)/areaBc(i, j). The algorithm is (see Appendix F for pseudocode):
The next two results, proved in Appendices H and I, respectively, characterize the efficiency and correctness of the above algorithm.
We now show the effectiveness of conservation dependencies in capturing potential data quality issues and mining interesting patterns. We also investigate the trade-off between performance and tableau quality with respect to ε, and demonstrate the scalability of the proposed algorithm. Choosing an appropriate confidence threshold is domain-specific and outside the scope of this work; we experimented with different values of ĉ for the purpose of these experiments. Experiments were performed on a 2.2 GHz dual-core Pentium PC with 4 GB of RAM. All the algorithms were implemented in C. We used the following data sources.
NZ-Credit-Card has a confidence close to one, so the entire sequence is reported in the hold tableau (this is with the payments curve shifted ahead by 1 month to compensate for the standard credit card grace period).
The above result suggests that December charges were higher than December payments, but January payments were higher than January charges.
We also tested the interval-finding algorithm from [8]. Recall from Section 2 that we can use this algorithm on sequences of either instantaneous values or cumulative amounts. In both cases, the hold tableau contains the entire data set with ĉ close to 1. With ĉ=0.8 on instantaneous values, the fail tableau contains a single interval of length one (January 1981). Since the magnitudes of monthly charges and payments have increased over time, this result reflects that the difference between charges and payments was proportionately larger in 1981. With ĉ=0.9, we get only three small intervals: January-March 1981, December 2003 and December 2008. If, instead, we use cumulative amounts, the fail tableau contains only January-February 1981 with ĉ=0.8, and January-May 1981 with ĉ=0.9. To explain this, recall from Section 2 that this measure uses a baseline of zero for each interval, meaning that intervals starting later in the sequence end up with artificially high confidences that are well above ĉ and therefore are not selected for the fail tableau.
This data set exhibits a persistent violation of the conservation law (recall
Next, we examine the network monitoring data for data quality problems of the form illustrated in
We now zoom in on Router-7. It appears that the links which were not being monitored up to time 3610 started being monitored afterward. To confirm this, we single out the curves corresponding to this router and show two hold tableaux in Table 4. Interestingly, only three short intervals have confidence above 0.99, suggesting that even if all links are monitored correctly, small violations of the conservation law are normal. These could happen for many reasons: delays at the router, corrupted packets getting dropped at the router, etc. Using ĉ=0.9 yields a longer interval that only slightly overlaps with the “bad” interval from the fail tableau.
Since our algorithms test a relaxed confidence threshold, the left-maximal intervals they return may not appear in the exact set of left-maximal intervals. We now examine the impact of the relaxation factor ε on how these intervals differ in practice. Using the People-Count data set with the credit model, we generated hold intervals using a variety of values for ĉ greater than 0.99, and fail intervals using a variety of values for ĉ less than 0.8. We then measured how well these intervals overlapped with those from the exact set, with overlap computed using the Jaccard coefficient. That is, for each I generated by our algorithm, we found I* from the exact set maximizing |I∩I*|/|I∪I*|.
Table 5 summarizes the results for fail intervals using ĉ=0.8 as the average Jaccard coefficient value. We obtain coefficients close to one, indicating that each approximate interval highly overlaps with at least one exact interval. Similar results were obtained for hold intervals and with other choices of ĉ.
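The overlap measurement described here can be sketched for closed integer intervals as follows. The function names are illustrative assumptions, and the averaging step reflects one straightforward reading of how the per-interval values are summarized.

```python
def interval_jaccard(x, y):
    """Jaccard coefficient of two closed integer intervals (i, j)."""
    (i1, j1), (i2, j2) = x, y
    intersection = max(0, min(j1, j2) - max(i1, i2) + 1)
    union = (j1 - i1 + 1) + (j2 - i2 + 1) - intersection
    return intersection / union

def best_match(interval, exact_intervals):
    """Return the exact interval with the highest Jaccard overlap."""
    return max(exact_intervals, key=lambda e: interval_jaccard(interval, e))

def average_best_jaccard(approx_intervals, exact_intervals):
    """Average, over approximate intervals, of the best overlap with the exact set."""
    scores = [interval_jaccard(a, best_match(a, exact_intervals))
              for a in approx_intervals]
    return sum(scores) / len(scores)
```

For instance, [1, 10] and [6, 15] share five positions out of fifteen in their union, giving a coefficient of 1/3, while identical intervals give 1.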
We proposed conservation dependencies that express conservation laws between pairs of related quantities. We presented several ways of quantifying the extent to which conservation laws hold using various confidence measures, and gave efficient approximation algorithms for the tableau discovery problem, i.e., finding subsets that satisfy (or fail) a supplied conservation dependency given some confidence threshold. Using real data sets, we demonstrated order-of-magnitude performance improvements, and the utility of tableau discovery for conservation dependencies in the context of data mining and data quality analysis. The reported tableaux are concise, easy to understand, and suggest interesting subsets of the data for further analysis.
This paper dealt with tableau discovery in the off-line model, where the data are available beforehand. An interesting direction for future work is to study on-line tableau discovery, where we incrementally maintain a compact tableau over a given conservation dependency as new data arrive.
We now give worked examples based on
Suppose that the interval in question is [3, 8] (technically, [3, 9)), as illustrated in
Now, con fb is the area between the cumulative payments and the baseline divided by the area between the cumulative bills and the baseline. This works out to
To compute con fc, we need to shift the cumulative payment curve up by seven, as illustrated on the left of
To compute con fd, we shift the cumulative bills curve down by seven, as shown on the right of
Now, we consider the two related confidence metrics that the interval-finding algorithm from [8] can compute (recall Section 2). The first metric simply adds up the individual bills and payments within the given interval. The total payments in the interval [3, 8] are 6+8+7+4+3+20=48 and the total bills are 11+13+6+6+5+9=50. The resulting confidence is 48/50=0.96.
As already mentioned, this confidence metric does not account for delays. In our example, this gives a higher confidence than all of our three models because it does not capture the fact that a large payment of 20 was made at the end of the interval to cover several outstanding bills. The second possible metric divides the area under the cumulative payment curve by the area under the cumulative bills curve (without the notion of a baseline). This gives
which is higher than our con fb. As already mentioned, if we do not take baselines into account, we overestimate the confidence of intervals that do not start at 1; the later the starting point, the more severe the overestimate.
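The first of these two metrics can be checked directly from the per-step payments and bills quoted above for the interval [3, 8]. The Python sketch below is illustrative only; sum_confidence is a hypothetical helper name. It also shows why this metric ignores delays: reordering the payments inside the interval leaves the value unchanged.

```python
def sum_confidence(payments, bills):
    """Ratio of total payments to total bills within an interval."""
    return sum(payments) / sum(bills)


# Per-step values for the interval [3, 8], as quoted in the text.
payments = [6, 8, 7, 4, 3, 20]
bills = [11, 13, 6, 6, 5, 9]

c = sum_confidence(payments, bills)
print(round(c, 2))

# Moving the large final payment to the start does not change the result,
# so the metric cannot tell an early payment from a late one.
reordered = [20, 6, 8, 7, 4, 3]
print(sum_confidence(reordered, bills) == c)
```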
B Tableau Discovery with Multiple Pairs of Sequences
In Sections 3 and 4, we addressed the tableau discovery problem for conservation dependencies on a single pair of sequences. We now discuss the case in which many pairs of sequences are given in the input. In the credit card example, there may be millions of users for whom we have separate charge and payment time series; in the network traffic example, a different pair of incoming and outgoing traffic measurements may be given for each router. As before, the objective will be to generate a minimal tableau that covers some fraction of the data, using subsets that all exceed (in case of a hold tableau) or fall below (in case of a fail tableau) a confidence threshold.
With a single pair of sequences, the only subsets (patterns) that were allowed in the tableaux were intervals on the ordered attribute. We now extend the allowed pattern space so that we can represent intervals in semantically meaningful clusters of pairs of sequences. We assume that each pair of sequences in the input is associated with a set L of label attributes, e.g., age group, gender and marital status for credit card customers, or router name, location and type for network monitoring. With each pair of sequences, we associate a descriptor tuple P, with a schema consisting of the set of label attributes L. Let P[k] be the projection of P onto the attribute k.
Definition 8 A tableau pattern p is a tuple of size |L|+1 with a schema consisting of the label attributes and the ordered attribute t, such as time. For each label attribute k ∈ L, the value taken on by p, call it p[k], is a constant from the domain of k or a special symbol “*”. For the ordered attribute, p[t] may be an arbitrary interval.
Definition 9 A tableau pattern p matches a descriptor tuple P if for all k ∈ L such that p[k]≠*, p[k]=P[k].
Thus, a tableau pattern p identifies an interval within one or more pairs of sequences that match p's labeling attributes. For example, two patterns consisting of L={Router type, Router name} and t=time interval were shown in Table 1. Note that the “*” symbol acts as a wildcard and matches all values of the given labeling attribute. Also, note that patterns may overlap.
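Definitions 8 and 9 can be sketched as follows. This is a minimal illustration, assuming that the label part of a pattern and a descriptor tuple are both represented as Python dictionaries keyed by label attribute; the attribute names and values shown are hypothetical and are not taken from Table 1.

```python
WILDCARD = "*"


def matches(pattern_labels, descriptor):
    """Return True if the pattern matches the descriptor tuple: every
    non-wildcard label in the pattern must equal the descriptor's value."""
    return all(
        value == WILDCARD or descriptor[attr] == value
        for attr, value in pattern_labels.items()
    )


# A pattern is its label part plus an interval on the ordered attribute t.
pattern = ({"router_type": "core", "router_name": WILDCARD}, (100, 200))

descriptors = [
    {"router_type": "core", "router_name": "Router-7"},
    {"router_type": "edge", "router_name": "Router-9"},
]

labels, interval = pattern
selected = [d for d in descriptors if matches(labels, d)]
print(selected)
```

The interval component is not involved in matching; it only selects which positions of the matched pairs of sequences the pattern covers.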
Having defined the space of possible patterns, we now show how to compute the confidence of any such pattern. Observe that a pattern selects an interval from a cluster of pairs of sequences. Intuitively, we calculate the confidence (with respect to a conservation dependency) of such an interval by adding up all the corresponding cumulative sequences in the cluster, and transforming them into one new pair of “joint” cumulative sequences. Formally, for each pair of cumulative sequences $A^{(k)}$ and $B^{(k)}$, whose descriptor tuples match the given tableau pattern, we derive a pair of superposed sequences A and B, where $A_l := \sum_k A_l^{(k)}$ and $B_l := \sum_k B_l^{(k)}$. The confidence of the given pattern then corresponds to con fb, con fc and con fd computed on A and B within the interval specified in the pattern. In other words, the resulting confidence is the “average” confidence over all the pairs of sequences (in the given interval) that match the given tableau pattern.
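The superposition step can be sketched as an element-wise sum over the matching pairs. The Python below is an illustration only: superpose is a hypothetical helper, the input is assumed to be the list of cumulative pairs already selected by a pattern, and the numbers are made up.

```python
def superpose(cumulative_pairs):
    """Add up matching cumulative sequences position by position.

    cumulative_pairs: list of (A_k, B_k) tuples, all of the same length n.
    Returns the joint pair (A, B), where A[l] is the sum of A_k[l] over k,
    and likewise for B.
    """
    n = len(cumulative_pairs[0][0])
    joint_a = [sum(a[l] for a, _ in cumulative_pairs) for l in range(n)]
    joint_b = [sum(b[l] for _, b in cumulative_pairs) for l in range(n)]
    return joint_a, joint_b


# Two made-up pairs of cumulative sequences.
pairs = [([1, 3, 6], [2, 4, 7]), ([0, 5, 5], [1, 6, 9])]
A, B = superpose(pairs)
print(A, B)
```

Any of the confidence measures can then be evaluated on the joint pair exactly as for a single pair of sequences.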
We are now ready to state the minimal tableau discovery problem for conservation dependencies when multiple pairs of sequences are provided in the input:
Definition 10 Let m be the number of pairs of sequences, each of length n, believed to obey a conservation law. Let ŝ and ĉ be user-supplied support and confidence thresholds, respectively. The minimal tableau discovery problem is to find the fewest patterns of the form described above, whose union has size at least ŝmn, such that the confidence of each pattern is above (hold tableau) or below (fail tableau) ĉ, with confidence as defined above.
An exhaustive algorithm for computing a minimal tableau in this situation is to examine all $\Theta(n^2)$ intervals for each possible pattern on the label set L (i.e., each possible combination of wildcards and constants), and then run a greedy partial set cover algorithm using all the candidate intervals as input. We can reduce the number of intervals to examine by re-using the algorithms proposed in Section 4. Furthermore, we can combine our algorithms with the on-demand tableau discovery algorithm for conditional functional dependencies (CFDs) that was proposed in [10]. The idea is to examine the most general patterns on the label attributes, starting with all-stars, and try more specific patterns (by replacing one “*” at a time with a constant) only when the general pattern does not satisfy the confidence requirement.
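The greedy partial set cover step mentioned above can be sketched as follows. This is a generic textbook-style greedy heuristic written for illustration, not necessarily the exact procedure used here: each candidate pattern is assumed to be given as the set of (pair index, position) cells it covers, and greedy_partial_cover and the pattern names are hypothetical. The greedy choice yields a small tableau but is not guaranteed to yield the fewest possible patterns.

```python
def greedy_partial_cover(candidates, required):
    """Pick candidates greedily until at least `required` cells are covered.

    candidates: dict mapping a pattern name to the set of cells it covers.
    required:   number of cells to cover (e.g. support threshold * m * n).
    """
    covered = set()
    chosen = []
    remaining = dict(candidates)
    while len(covered) < required and remaining:
        # Candidate that adds the most cells not yet covered.
        best = max(remaining, key=lambda p: len(remaining[p] - covered))
        gain = remaining[best] - covered
        if not gain:
            break  # nothing left can improve the coverage
        chosen.append(best)
        covered |= gain
        del remaining[best]
    return chosen, covered


# Made-up candidates over cells (pair index, position).
candidates = {
    "p1": {(0, 1), (0, 2), (0, 3)},
    "p2": {(0, 3), (0, 4), (0, 5)},
    "p3": {(1, 1)},
}
chosen, covered = greedy_partial_cover(candidates, required=5)
print(chosen, len(covered))
```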
We now give the proof of Theorem 1, which states that the running time of our hold tableau interval selection algorithm, for the balance, credit and debit models, is $O(n \log_{1+\varepsilon} \mathrm{area}_B(1, n))$, which is $O((1/\varepsilon)\, n \lg \mathrm{area}_B(1, n))$ if ε≦1.
First, note that computing $\mathrm{area}_A(i, j)$, $\mathrm{area}_B(i, j)$, and $\mathrm{conf}(i, j)$ are constant-time operations, since we can precompute $S_j=\sum_{l=1}^{j} A_l$ and $T_j=\sum_{l=1}^{j} B_l$. Then:
$\mathrm{area}_A(i, j)=(S_j-S_{i-1})-(j-i+1)H_i^A$,
$\mathrm{area}_B(i, j)=(T_j-T_{i-1})-(j-i+1)H_i^B$, and
$\mathrm{conf}(i, j)=\mathrm{area}_A(i, j)/\mathrm{area}_B(i, j)$.
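The constant-time area computation above can be sketched with prefix sums. The Python below is an illustration under stated assumptions: sequences are stored 1-based with index 0 as padding, make_area is a hypothetical helper, the numbers are made up, and both baselines are set to the value of A just before the interval start purely for the sake of the example.

```python
from itertools import accumulate


def make_area(C, H):
    """Return area(i, j) for cumulative sequence C and baselines H.

    C[1..n] holds the cumulative values (C[0] is padding); H[i] is the
    baseline used for intervals that start at position i.
    """
    # prefix[j] = C[1] + ... + C[j], with prefix[0] = 0.
    prefix = [0] + list(accumulate(C[1:]))

    def area(i, j):
        return (prefix[j] - prefix[i - 1]) - (j - i + 1) * H[i]

    return area


# Made-up cumulative sequences, 1-based with padding at index 0.
A = [0, 2, 5, 9, 14]
B = [0, 4, 8, 12, 16]
# Assumption for illustration: H[i] = A[i-1] for both curves.
H = [None] + A[:-1]

area_A = make_area(A, H)
area_B = make_area(B, H)


def conf(i, j):
    return area_A(i, j) / area_B(i, j)


print(area_A(2, 4), area_B(2, 4), conf(2, 4))
```

After the one-time prefix-sum pass, each call to area or conf performs a fixed number of arithmetic operations regardless of the interval length.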
Now, the only remaining issue is how to compute the ril's quickly. For this, we need to make an assumption regarding the HiA's and HiB's.
Lemma 7 Suppose ri−1,l≦ril for all i, l. Then the total time to compute all ril's is O(n log1+ε areaB(1, n)).
Proof. A candidate integer $x$ equals $r_{il}$ if and only if $\mathrm{area}_B(i, x)\le(1+\varepsilon)^l$ and $\mathrm{area}_B(i, x+1)>(1+\varepsilon)^l$ (or $x=n$).
We show how, given $l\le\lceil\log_{1+\varepsilon} \mathrm{area}_B(1, n)\rceil$, to compute all $n$ values $r_{1l}, r_{2l}, r_{3l}, \ldots, r_{nl}$ in a total of $O(n)$ time. (Hence the total running time, over all values of $l$, will be $O(n \log_{1+\varepsilon} \mathrm{area}_B(1, n))$.)
We start with x=1 and increment x until r1l=x is found. We then search for r2l, starting with x=r1l (which is safe because r1l≦r2l), in each step testing x and incrementing x, until r2l is found. We continue in this way, starting the search for ri+1,l at the value ril just found, until rnl is found.
The key point is that x is never decreased. The number of times x is increased cannot exceed n, and in iterations in which x is not increased, we change from seeking ril to seeking ri+1,l; this can happen at most n times, making for a total of at most 2n iterations. ▪
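The scan described in this proof can be sketched as a two-pointer loop for one fixed level. The Python below is illustrative only: right_endpoints is a hypothetical helper, bound stands for the level value, the area function is assumed to be nondecreasing in its second argument, and the right endpoints are assumed to be nondecreasing in i as in the hypothesis of Lemma 7. As an added convention, an endpoint of i-1 is reported when even the single position i exceeds the bound.

```python
def right_endpoints(area_b, n, bound):
    """For each i in 1..n, find the largest x in [i, n] with
    area_b(i, x) <= bound, never moving x backwards."""
    r = [0] * (n + 1)  # r[0] is unused padding
    x = 0
    for i in range(1, n + 1):
        # Resume from the previous answer (but never before i - 1).
        x = max(x, i - 1)
        while x < n and area_b(i, x + 1) <= bound:
            x += 1
        r[i] = x
    return r


# Toy area function for illustration: a sum of positive weights, which is
# nondecreasing in x and gives nondecreasing right endpoints.
w = [0, 3, 1, 4, 1, 5]
prefix = [0]
for value in w[1:]:
    prefix.append(prefix[-1] + value)


def toy_area(i, x):
    return prefix[x] - prefix[i - 1]


r = right_endpoints(toy_area, n=5, bound=5)
print(r[1:])
```

Because x only moves forward, the inner loop advances at most n times in total, which matches the linear bound argued in the proof.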
Lemma 8 If HiB≧Hi−1B for all i, then ri−1,l ≦ril for all i, l.
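Proof. In general, $r_{il}$ is the maximum $j$ such that $\sum_{l'=i}^{j}(B_{l'}-H_i^B)\le(1+\varepsilon)^l$, and $r_{i-1,l}$ is the maximum $j$ such that $\sum_{l'=i-1}^{j}(B_{l'}-H_{i-1}^B)\le(1+\varepsilon)^l$. Hence, provided the dropped term $B_{i-1}-H_{i-1}^B$ is nonnegative, $\sum_{l'=i}^{r_{i-1,l}}(B_{l'}-H_i^B)\le\sum_{l'=i}^{r_{i-1,l}}(B_{l'}-H_{i-1}^B)\le\sum_{l'=i-1}^{r_{i-1,l}}(B_{l'}-H_{i-1}^B)\le(1+\varepsilon)^l$, where the first inequality uses $H_i^B\ge H_{i-1}^B$. Thus $j=r_{i-1,l}$ satisfies the condition defining $r_{il}$, so $r_{i-1,l}\le r_{il}$. ▪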
Lemma 9 For the balance, credit, and debit models, HiB≧Hi−1B.
Proof. For the balance and credit models, since HiB=Ai−1, we need simply show that Ai−1≧Ai−2, which is obvious since A is nondecreasing. For the debit model, HiB=Ai−1+mink≧i{Bk−Ak}. From the fact that Ai−2≦Ai−1 and mink≧i−1{Bk−Ak}≦mink≧i{Bk−Ak}, we infer that Hi−1B≦HiB. ▪
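The debit-model baseline used in this proof can be computed for all i in one backward pass using suffix minima, which also makes the monotonicity easy to check numerically. The Python below is an illustrative sketch: debit_baselines is a hypothetical helper, sequences are 1-based with index 0 holding the value before position 1, and the numbers are made up.

```python
def debit_baselines(A, B):
    """Return H with H[i] = A[i-1] + min over k >= i of (B[k] - A[k])."""
    n = len(A) - 1
    # suffix_min[k] = min over k' >= k of (B[k'] - A[k']).
    suffix_min = [float("inf")] * (n + 2)
    for k in range(n, 0, -1):
        suffix_min[k] = min(B[k] - A[k], suffix_min[k + 1])
    return [None] + [A[i - 1] + suffix_min[i] for i in range(1, n + 1)]


# Made-up cumulative sequences with A nondecreasing.
A = [0, 2, 5, 9, 14]
B = [0, 4, 8, 12, 16]

H = debit_baselines(A, B)
print(H[1:])

# Numerical check of the lemma on this example: baselines never decrease.
print(all(H[i - 1] <= H[i] for i in range(2, len(H))))
```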
Theorem 1 now follows immediately from Lemmas 7, 8, and 9.
Next, we prove Theorem 2, which guarantees that our hold tableau interval selection algorithm (1) returns no false positives, and (2) returns no false negatives.
The first part is trivial, as it is obvious from the algorithm that if the output includes an interval [i, j], then con f(i, j)≧ĉ/(1+ε). Modulo the distinction between ĉ and ĉ/(1+ε), there are no “false positives.”
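For the second part, suppose some interval $[i, j^*]$ has $\mathrm{conf}(i, j^*)\ge\hat{c}$, and define $h$ such that $(1+\varepsilon)^{h-1}<\mathrm{area}_B(i, j^*)\le(1+\varepsilon)^h$. Since $r_{ih}$ is the largest index $j$ such that $\mathrm{area}_B(i, j)\le(1+\varepsilon)^h$, and $\mathrm{area}_B(i, j^*)\le(1+\varepsilon)^h$, it follows that $r_{ih}\ge j^*$. The algorithm did compute the confidence of interval $[i, r_{ih}]$. We now prove that $\mathrm{conf}(i, r_{ih})\ge\hat{c}/(1+\varepsilon)$, and hence the algorithm will report interval $[i, r_{ih}]$ (or a longer interval, also of confidence at least $\hat{c}/(1+\varepsilon)$).
Now $\mathrm{area}_B(i, r_{ih})\le(1+\varepsilon)^h$ and $\mathrm{area}_B(i, j^*)>(1+\varepsilon)^{h-1}$. Therefore $\mathrm{area}_B(i, r_{ih})/\mathrm{area}_B(i, j^*)<1+\varepsilon$. Since $\mathrm{area}_A(i, j)$ does not decrease as $j$ grows and $r_{ih}\ge j^*$, it follows that
$$\mathrm{conf}(i, r_{ih})=\frac{\mathrm{area}_A(i, r_{ih})}{\mathrm{area}_B(i, r_{ih})}\ge\frac{\mathrm{area}_A(i, j^*)}{(1+\varepsilon)\,\mathrm{area}_B(i, j^*)}=\frac{\mathrm{conf}(i, j^*)}{1+\varepsilon}\ge\frac{\hat{c}}{1+\varepsilon},$$
where the last step holds because $[i, j^*]$ has confidence at least $\hat{c}$.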
Here, we prove Theorem 3, which guarantees that our fail tableau interval selection algorithm for the balance and debit models (1) returns no false positives, and (2) returns no false negatives.
The first part is trivial, as it is obvious from the algorithm that if the output includes an interval [i, j], then con f(i, j)≦ĉ(1+ε).
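For the second part, suppose some interval $[i, j^*]$ has $\mathrm{conf}(i, j^*)\le\hat{c}$, and define $h$ such that $(1+\varepsilon)^{h-1}<\mathrm{area}_A(i, j^*)\le(1+\varepsilon)^h$. Since $s_{ih}$ is the largest index $j$ such that $\mathrm{area}_A(i, j)\le(1+\varepsilon)^h$, and $\mathrm{area}_A(i, j^*)\le(1+\varepsilon)^h$, it follows that $s_{ih}\ge j^*$. The algorithm did compute the confidence of interval $[i, s_{ih}]$. We now prove that $\mathrm{conf}(i, s_{ih})\le\hat{c}(1+\varepsilon)$, and hence the algorithm will report interval $[i, s_{ih}]$ (or a longer interval, also of confidence at most $\hat{c}(1+\varepsilon)$).
Now $\mathrm{area}_A(i, s_{ih})\le(1+\varepsilon)^h$ and $\mathrm{area}_A(i, j^*)>(1+\varepsilon)^{h-1}$. Therefore $\mathrm{area}_A(i, s_{ih})/\mathrm{area}_A(i, j^*)<1+\varepsilon$. Since $\mathrm{area}_B(i, j)$ does not decrease as $j$ grows and $s_{ih}\ge j^*$, it follows that
$$\mathrm{conf}(i, s_{ih})=\frac{\mathrm{area}_A(i, s_{ih})}{\mathrm{area}_B(i, s_{ih})}\le\frac{(1+\varepsilon)\,\mathrm{area}_A(i, j^*)}{\mathrm{area}_B(i, j^*)}=(1+\varepsilon)\,\mathrm{conf}(i, j^*)\le\hat{c}(1+\varepsilon),$$
where the last step holds because $[i, j^*]$ has confidence at most $\hat{c}$.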
Next, we prove Theorem 5, which applies to the fail tableau interval selection algorithm. First we prove that si−1,l≦sil for all i, l; then we prove that the overall running time is O(n log1+ε areaAc(1, n)).
Because sil is defined using the balance model, the inequality si−1,l≦sil follows from the monotonicity of HiA in the balance model.
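It is obvious that the running time is O(n log1+ε areaAb(1, n)); what is not obvious is that the running time is O(n log1+ε areaAc(1, n)). Yet areaAb(1, n)≦areaAc(1, n), because the credit model shifts the cumulative curve A upward relative to the balance model, so the second bound follows from the first.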
Finally, we prove Theorem 6, which guarantees that our fail tableau interval selection algorithm for the credit model (1) returns no false positives, and (2) returns no false negatives.
The first part is still trivial, as it is obvious from the algorithm that if the output includes an interval [i, j], then con fc(i, j)≦ĉ(1+ε).
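For the second part, suppose some interval $[i, j^*]$ has $\mathrm{conf}_c(i, j^*)\le\hat{c}$, and define $h$ such that $(1+\varepsilon)^{h-1}<\mathrm{area}_A^b(i, j^*)\le(1+\varepsilon)^h$. Since $s_{ih}$ is the largest index $j$ such that $\mathrm{area}_A^b(i, j)\le(1+\varepsilon)^h$, and $\mathrm{area}_A^b(i, j^*)\le(1+\varepsilon)^h$, it follows that $s_{ih}\ge j^*$. The algorithm did compute the confidence of interval $[i, s_{ih}]$. We now prove that $\mathrm{conf}_c(i, s_{ih})\le\hat{c}(1+\varepsilon)$, and hence the algorithm will report interval $[i, s_{ih}]$ (or a longer interval, also of $\mathrm{conf}_c$-confidence at most $\hat{c}(1+\varepsilon)$).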
We now prove that areaAb(i, sih)≦(1+ε)areaAb(i, j*) and sih−i+1≦(1+ε)(j*−i+1), which, as we will see, will complete the proof.
First, areaAb(i, sih)≦(1+ε)h and areaAb(i, j*)>(1+ε)h−1. Therefore areaAb(i, sih)/areaAb(i, j*)<1+ε.
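Second, because $A_l$ is nondecreasing in $l$, and $j^*\le s_{ih}$, the average value of $A_l$ over the interval $[i, j^*]$ is at most the average value of $A_l$ over the interval $[i, s_{ih}]$. Therefore
$$\frac{\mathrm{area}_A^b(i, j^*)}{j^*-i+1}\le\frac{\mathrm{area}_A^b(i, s_{ih})}{s_{ih}-i+1},$$
which rearranges to $(s_{ih}-i+1)/(j^*-i+1)\le\mathrm{area}_A^b(i, s_{ih})/\mathrm{area}_A^b(i, j^*)<1+\varepsilon$.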
It follows that sih−i+1<(1+ε)(j*−i+1).
The rest is smooth sailing. We have
The foregoing merely illustrates the principles of the invention. For example, the invention is illustrated in the context of two data sequences a={a1, a2, . . . ai, . . . an} and b={b1, b2, . . . bi, . . . bn} for which the ith pair of values, (ai, bi), is associated with the ith value, ti, of an ordered attribute t={t1, t2, . . . ti, . . . tn}. However, the principles of the invention can be extended to contexts in which there are more than two data sequences, so long as the data in those sequences can be transformed into two sequences that are expected to obey a conservation law. Such an application might be, for example, where one desires to analyze the entry/exit data of a building into which people enter through a main door, exit through that main door, and also may exit through an exit-only side door.
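One simple way to perform such a transformation for the building example is to add the two exit counts at each time step, so that total exits can be compared with entries. The Python below is only a sketch of that idea; the helper name combine and all of the counts are made up for illustration.

```python
def combine(*sequences):
    """Add several equal-length sequences position by position."""
    return [sum(values) for values in zip(*sequences)]


# Made-up per-interval counts for illustration.
entries = [5, 6, 4]
main_door_exits = [3, 5, 2]
side_door_exits = [1, 0, 4]

# The two exit sequences collapse into a single outbound sequence, which
# can then be paired with the entries sequence for the analysis.
exits = combine(main_door_exits, side_door_exits)
print(exits)
```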
It will thus be appreciated that those skilled in the art will be able to devise various alternative implementations which, even if not shown or described herein, embody the principles of the invention and thus are within its spirit and scope.