Conservation dependencies

Information

  • Patent Application
  • 20120130935
  • Publication Number
    20120130935
  • Date Filed
    November 23, 2010
  • Date Published
    May 24, 2012
Abstract
Given a set of data for which a conservation law is an appropriate characterization, “hold” and/or “fail” tableaux are provided for the underlying conservation law, thereby providing a conservation dependency whereby portions of the data for which the law approximately holds or fails can be discovered and summarized in a semantically meaningful way.
Description
BACKGROUND AND SUMMARY

The present invention relates to data analysis and, more particularly, to situations in which a “conservation law” exists between related quantities.


Given a pair of numerical data sequences, call them a and b, and an ordered attribute t such as time, the quantities obey a conservation law when the sum of the values in a up to T equals the sum of the values in b up to T, for all t=T. For example, current flowing into and out of a circuit node obeys a conservation law because the amount of current flowing into the node equals the amount of current flowing out.


In accordance with the present invention, we have recognized that even when two numerical sequences do not strictly obey a conservation law, it can be useful to carry out a conservation-law-based analysis of such data and, more specifically, to generate data indicating the degree to which the conservation law is or is not obeyed or satisfied. Examples of such pairs of sequences are a) the number of inbound packets at an IP router versus the number of outbound packets; b) the number of persons entering a building versus the number of persons leaving; and c) the dollar value of the charges run up by a group of credit card holders versus the payments made by those credit card holders.


We have thus recognized that a conservation law is often an abstract ideal against which real data may be calibrated. Short-term deviations naturally occur from delays and measurement inaccuracies (credit card holders carry a balance; two people who enter a building at the same time may leave at different times; packets entering an IP router may be buffered therein and may thus encounter delays between the router input and output ports, etc.). Major violations may be caused by unusual phenomena or data quality problems; e.g., the reported number of people entering a building may be consistently higher than the number of people leaving (via the front door) if there exists an unmonitored side exit.


We have thus further recognized that it would be useful to discover for which portions, or “subsets,” of the data a conservation law approximately holds (or fails) and to summarize them in a semantically meaningful way. Such a summarization allows one to, for example, quantify the extent of missing or delayed events that the data represents.


To this end, and in accordance with an aspect of the invention, we define what we call a conservation dependency. A conservation dependency is, in essence, an underlying conservation law coupled with a tableau. The tableau provides information about the degree to which particular subsets of the data do or do not satisfy the underlying conservation law, the former being referred to as a “hold tableau” and the latter being referred to as a “fail tableau.” Specifically, a hold tableau identifies subsets of the data (e.g., ranges of time for a time-ordered sequence of data) that satisfy the underlying conservation law to at least a specified degree, or “confidence.” By contrast, a fail tableau identifies subsets of the data that do not satisfy the underlying conservation law to at least a specified degree. Which type of tableau is the more useful for understanding properties of the data will depend on the nature of the data.


Consider, for example, sequences of credit card charges and payments over time for a given bank. In a general sense, the aggregate amount of charges over time tends to be equal to the aggregate amount of payments because most people pay their bills. Thus the conservation law “charges=payments” is an appropriate model to consider when analyzing such data. However, the aggregate of charges up to any given point in time is going to exceed the aggregate of payments up to that same point in time because of payment grace periods and other factors such as spending habits. Thus in this setting the conservation law “charges=payments” holds only approximately.


That being said, seeing, through a tableau such as that shown in FIG. 5, the degree (confidence) to which the conservation law “charges=payments” is or is not satisfied can help the bank understand customer patterns. Specifically, the appearance in a fail tableau of particular time periods (corresponding to subsets of the data) for which the data's confidence is below some level ĉ means that the conservation law is relatively unsatisfied during those time periods, meaning, in turn, that during those time periods the total outstanding balance owed to the bank is relatively high. Thus seeing from the fail tableau that the conservation law “charges=payments” is relatively unsatisfied during the holiday shopping season in November and December would be a confirmation that that is a time when people regularly fall behind on payments. Or seeing from a fail tableau that, for a given confidence level ĉ, there are more time periods in recent years than in less recent years where outstanding balances are high would suggest that people are having an increasingly difficult time keeping up with their debts.


More rigorously, assume a data set comprising a pair of numerical sequences a={a1, a2, . . . ai, . . . an} and b={b1, b2, . . . bi . . . bn} with ai, bi≧0 and for which the ith pair of values, (ai, bi), is associated with the ith value, ti, of an ordered attribute t={t1, t2, . . . ti, . . . tn}.


Given such a data set, the invention provides a tableau which comprises one or more subsets of values of the ordered attribute t that meet at least a first specified criterion. That criterion is that, for at least a specified fraction ŝ of the data set, a confidence measure for the pairs of values associated with each subset in the tableau is a) at least equal to a confidence value ĉ (when we want to provide a hold tableau) or b) no more than a confidence value ĉ (when we want to provide a fail tableau). The confidence measure for the pairs of values associated with each subset in the tableau is a measure of the degree to which those pairs of values deviate from an exact conservation law for the data in question.


In illustrative embodiments of the invention, the confidence measure for the pairs of values {(ai, bi), (ai+1, bi+1), . . . , (aj, bj)} for any interval {i, i+1, . . . , j} is a function of an area between two curves: A={A1, A2, . . . Ai, . . . An}, derived from a, where A0=0 and Ai is the sum of all aj with j≦i, and B={B1, B2, . . . Bi, . . . Bn}, derived from b, where B0=0 and Bi is the sum of all bj with j≦i. That area, in particular, is the area between a segment of curve A between Ai and Aj and a segment of curve B between Bi and Bj.





DRAWINGS


FIG. 1 is a chart of illustrative data to which the present invention can be applied, the chart showing aggregated credit card charges and payments over a period of time in New Zealand, taken as an example;



FIG. 2 is a conceptual diagram showing other illustrative data to which the present invention can be applied, specifically counts of incoming and outgoing traffic of a node, such as a packet router;



FIG. 3 is a chart of illustrative data to which the present invention can be applied, the chart showing cumulative entrance and exit counts for a building;



FIG. 4 illustrates confidence measures used in the illustrative embodiments;



FIG. 5(a) shows a fail tableau, pursuant to the principles of the present invention, for credit card data in New Zealand;



FIG. 5(b) shows credit card charges and payments for the Bank of New Zealand for Decembers between 1981-2008;



FIG. 5(c) shows credit card charges and payments for the Bank of New Zealand for Januaries between 1981-2008;



FIGS. 6(a) through 6(c) illustrate the computer running time of algorithms implementing the present invention; and



FIG. 7 shows a system in which the present invention is illustratively implemented.





DETAILED DESCRIPTION
1 INTRODUCTION

Many constraints and rules have been used to represent data semantics and analyze data quality: functional, inclusion and sequential dependencies [2, 6, 11], association rules [1], etc. We identify a broad class of tools that are not covered by existing constraints, where a “conservation law” exists between related quantities. Given a pair of numerical sequences, call them a and b, and an ordered attribute t such as time, an exact conservation law states that the sum of the values in a up to T equals the sum of the values in b up to T, for all t=T. We propose a new class of tools—conservation dependencies—to express and validate such laws.


Conservation laws are obeyed to varying degrees in practice, with strong agreement in physics (current flowing in=current flowing out of a circuit), less so in traffic monitoring (number of inbound packets at an IP router=number of outbound packets; number of persons entering a building=number of persons leaving), and perhaps even less so in finance (charges=payments). Indeed, conservation laws are often abstract ideals against which real data may be calibrated. Short-term deviations naturally occur from delays and measurement inaccuracies (credit card holders carry a balance; two people who enter a building at the same time may leave at different times, etc.). Major violations may be caused by unusual phenomena or data quality problems; e.g., the reported number of people entering a building is consistently higher than the number of people leaving (via the front door) if there exists an unmonitored side exit. When it is unknown for which portions of the data the law approximately holds (or fails), discovering these subsets and summarizing them in a semantically meaningful way is important.


1.1 MOTIVATING EXAMPLES

Consider sequences of credit card charges and payments over time. Because of payment grace periods and other factors such as spending habits, we expect the “charges=payments” law to hold only approximately. The degree (confidence) to which this law is satisfied should depend on the duration and magnitude of any imbalance, and can help understand customer patterns. Taking the ratio of total payments and total charges in a given time interval is insufficient as it does not take delays into account: a customer who regularly pays her bills would have the same confidence as one who carried a balance throughout the time interval and then paid all her bills at the end. Furthermore, since we may only have access to aggregated data due to privacy issues (e.g., total monthly charges and payments for groups of customers), it may not be possible to correlate individual charges with payments and calculate the average payment delay. Instead, at any time T, we assert that the cumulative charges and payments up to T cancel each other out.
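The weakness of the simple ratio can be seen in a small sketch. The monthly figures and the helper names ratio_confidence and cumulative_gap are illustrative assumptions, not data from the text. A customer who pays every month and one who pays everything at the end get the same ratio, while the gap between the cumulative curves separates them.

```python
from itertools import accumulate

def ratio_confidence(payments, charges):
    """Total payments divided by total charges over the whole interval."""
    return sum(payments) / sum(charges)

def cumulative_gap(payments, charges):
    """Sum over time of (cumulative charges - cumulative payments)."""
    cum_pay = list(accumulate(payments))
    cum_chg = list(accumulate(charges))
    return sum(c - p for c, p in zip(cum_chg, cum_pay))

charges = [100, 100, 100, 100]        # hypothetical monthly charges
prompt_payer = [100, 100, 100, 100]   # pays the bill every month
late_payer = [0, 0, 0, 400]           # carries a balance, pays at the end

ratio_prompt = ratio_confidence(prompt_payer, charges)
ratio_late = ratio_confidence(late_payer, charges)
gap_prompt = cumulative_gap(prompt_payer, charges)
gap_late = cumulative_gap(late_payer, charges)
print(ratio_prompt, ratio_late, gap_prompt, gap_late)
```

The two ratios coincide, whereas the accumulated gap is zero only for the customer who pays promptly.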


In FIG. 1, we transform the (non-negative, but not necessarily non-decreasing) monthly charges and payments from the Reserve Bank of New Zealand, aggregated over all customers (www.rbnz.govt.nz/statistics/monfin), into (non-decreasing) cumulative totals. A 12-month excerpt is shown, during which there is a gap between the cumulative charges and payments at the end of every month; the initial gap corresponds to past unpaid charges. The duration and magnitude of violations in the illustrated interval (April-July) is related to the area between the charge and payment curves as a fraction of the area under the charge curve. However, the height of the curves at the beginning of an interval, and therefore the area underneath, depends on the total charges and payments from the past. In particular, the confidence of the illustrated interval would be above zero even if no payments were made (i.e., if the payment curve followed the horizontal line labeled “baseline,” whose height corresponds to the height of the payment curve at the beginning of the interval). Instead, we take the area between the curves divided by the area between the top curve and the baseline.


We refer to the above measure as the balance model since it penalizes for the outstanding balance existing just before a given interval. We may also ignore prior imbalance as a way of dealing with permanently missing or lost values. For example, we can ask if a customer would have a good credit rating if the outstanding debt at the beginning of the interval were either forgiven or paid down, while penalizing new debt in the usual way. In FIG. 1, this corresponds to either shifting the charge curve down to the payment curve or shifting the payment curve up to the charge curve, so that both curves start at the same point at the beginning of the interval. We call the former the debit model, since it removes a charge equal to the outstanding balance, and the latter the credit model, since it credits a payment to cover the outstanding balance. We will formalize the models and give examples in Section 3.


Our second example is network traffic monitoring, where the traffic entering a node (e.g., Internet router, road intersection, building), aggregated over the incoming edges (links, roads, doorways) should equal the traffic exiting the node, aggregated over the outgoing edges [12]. FIG. 2 illustrates a node with four bidirectional edges, whose weights correspond to inbound and outbound traffic aggregated over some time interval. Observe that the conservation law only holds at the node level, not for individual edges. That is, the law can only be observed by first (topologically) aggregating multiple measurements to form a and b. Prolonged violations may indicate problems with the measuring infrastructure: sensors may be malfunctioning, a new link may be attached to a router, but is not being monitored, etc. Temporary violations are also useful to report as they may identify interesting or suspicious events causing delays at the node. As in the previous example, there is usually no one-to-one mapping between individual inbound and outbound packets: sensors report aggregated counts, network traffic summaries are aggregated due to privacy and efficiency issues, etc.



FIG. 3 plots two curves over a 30-day interval (48 “ticks” per day): the cumulative number of persons entering and exiting the front door of a building at the University of California, Irvine (archive.ics.uci.edu/ml/datasets/CalIt2+Building+People+Counts), with counts reported by an optical sensor every half hour. As before, we have transformed the inputs into cumulative counts. There are five noticeable “steps” per week, corresponding to activity during weekdays and little traffic during weeknights and weekends. Interestingly, the cumulative curves continually diverge over time, indicating a persistent violation (perhaps there exists an unmonitored side exit). As before, the confidence of the illustrated interval is related to the area between the entrance and exit curves and the area between the entrance curve and the baseline. However, the credit and debit models are more appropriate here, so as to compensate for the gradually increasing divergence from the conservation law. In the balance model, intervals near the end of the sequences would have disproportionately low confidence because the accrued gap keeps growing over time.


We remark that our confidence measure is different from that used for time series similarity. To achieve high confidence with respect to a conservation dependency, the two cumulative curves must track each other closely, but need not be similar in terms of the pattern and shape-based properties commonly used for matching (e.g., translation and scale invariance, warping). Conversely, two time series considered similar could violate the conservation law.


1.2 CONTRIBUTIONS AND ORGANIZATION

We formulate conservation dependencies (CDs) and propose measures to quantify the degree to which a CD is satisfied. We assume that the quantities participating in the CD, such as charges and payments, are obvious from the application. However, it is typically not obvious where in a large data set the CD holds (which is important for data mining and understanding data semantics) or fails (which is important for data quality analysis).


Definition 1 A (ĉ, ŝ)-hold tableau is a collection of subsets, each of confidence at least ĉ, whose union has size at least a fraction ŝ of the size of the universe. (In the simplest case, the subsets are just intervals of time.) We say simply hold tableau if ĉ, ŝ are understood. Similarly, a (ĉ, ŝ)-fail tableau is a collection of subsets, each of confidence at most ĉ, whose union has size at least a fraction ŝ of the size of the universe; we say fail tableau if ĉ, ŝ are understood. A tableau is either a hold tableau or a fail tableau.
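As a minimal sketch of Definition 1 for the simplest case (subsets are intervals over positions 1..n), the check below tests both conditions: the per-interval confidence bound and the coverage of the union. The function name is_tableau and the toy confidence values are assumptions for illustration; any confidence function can be passed in.

```python
def is_tableau(intervals, confidence, n, c_hat, s_hat, kind="hold"):
    """Check Definition 1 for inclusive intervals (i, j) over positions 1..n.

    `confidence` is a caller-supplied callable (i, j) -> float.
    `kind` is "hold" (each confidence >= c_hat) or "fail" (each <= c_hat).
    """
    if kind == "hold":
        bound_ok = all(confidence(i, j) >= c_hat for i, j in intervals)
    else:
        bound_ok = all(confidence(i, j) <= c_hat for i, j in intervals)
    covered = set()
    for i, j in intervals:
        covered.update(range(i, j + 1))   # union of the intervals
    return bound_ok and len(covered) >= s_hat * n

# Hypothetical confidences for two overlapping intervals.
toy_conf = {(1, 4): 0.9, (3, 7): 0.85}
lookup = lambda i, j: toy_conf[(i, j)]

holds = is_tableau([(1, 4), (3, 7)], lookup, n=10, c_hat=0.8, s_hat=0.5)
too_small = is_tableau([(1, 4), (3, 7)], lookup, n=10, c_hat=0.8, s_hat=0.75)
print(holds, too_small)
```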


Definition 2 The tableau discovery problem for CDs is to find a smallest (or almost smallest) hold or fail tableau for the given dataset.


Typically, ŝ and ĉ are supplied by the user or a domain expert. The minimality condition is crucial to producing easy-to-read, concise tableaux that capture the most significant patterns in the data.



FIG. 5A in Section 5 shows a fail tableau for the credit card data from FIG. 1, with ĉ=0.8. It suggests that people regularly fall behind on payments during the holiday shopping season in November and December. In Table 2, we show a fail tableau for the building entrance data from FIG. 3 using the credit model with ĉ=0.9. This tableau identifies temporary violations, corresponding to events that occurred in the building, when a group of people entered the building and stayed until the event was over. Notably, these findings were not obvious from manual inspection of FIGS. 1 and 3, or from the raw counts—the instantaneous charges and payments or entrances and exits are almost never equal, and it is not clear if or when they cancel each other out.


There are two technical issues in tableau generation for CDs: identifying intervals with confidence above (resp., below) ĉ and constructing minimal tableaux using some of them. The second issue corresponds to the partial set cover problem. Prior work shows how to use the greedy algorithm for partial set cover to choose a collection of intervals, each of confidence above (resp., below) ĉ, whose union has size at least ŝn [11]. (The greedy algorithm is guaranteed, when the input is a collection of intervals, to generate a set of size at most a small constant times the size of the optimal such collection, but in practice, the algorithm produces solutions of size close to optimal.) The first issue is constraint-specific and requires novel algorithms, which we propose in this paper.
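The greedy step for partial set cover can be sketched as follows. This is a simplified illustration, not the algorithm of [11]: it recomputes gains with Python sets rather than using any interval-specific data structure, and the name greedy_partial_cover and the candidate intervals are assumptions.

```python
def greedy_partial_cover(candidates, n, s_hat):
    """Pick inclusive intervals (i, j) greedily until at least s_hat * n
    positions of 1..n are covered. Returns None if the target is unreachable."""
    target = s_hat * n
    covered, chosen = set(), []
    remaining = list(candidates)

    def gain(iv):
        # number of not-yet-covered positions this interval would add
        return len(set(range(iv[0], iv[1] + 1)) - covered)

    while len(covered) < target:
        if not remaining:
            return None
        best = max(remaining, key=gain)   # ties go to the earliest candidate
        if gain(best) == 0:
            return None
        covered.update(range(best[0], best[1] + 1))
        chosen.append(best)
        remaining.remove(best)
    return chosen

# Hypothetical candidate intervals, all assumed to meet the confidence bound.
tableau = greedy_partial_cover([(1, 4), (3, 8), (7, 10), (2, 3)], n=10, s_hat=0.75)
print(tableau)
```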


Let n be the length of each sequence participating in the given CD. We define a candidate interval [i, j], where 1≦i≦j≦n, as one whose confidence is above (resp., below) ĉ. Since larger intervals are better when constructing minimal tableaux (fewer are needed to cover the desired fraction of the data), we are only interested in left-maximal candidate intervals, i.e., for each left endpoint, the longest possible right endpoint satisfying confidence. There are at most n maximal candidates, but an exhaustive algorithm tests all Θ(n^2) possible intervals. We propose approximation algorithms that run in O((1/ε) n log n) time (under a reasonable assumption), and return maximal intervals with confidence above a slightly lower threshold ĉ/(1+ε) for hold tableaux, or below a slightly higher threshold of (1+ε)ĉ for fail tableaux. The running time of our algorithms depends on the area under the cumulative curves, but we will show that they are over an order of magnitude faster on real data sets than an exhaustive solution, even for small ε.


TABLE 1
Example tableau on network traffic data

Router type    Router name    Time interval
Edge           Router100      Jan 1-Jan 15
Backbone       *              Jan 12-Jan 31


We also consider the CD tableau discovery problem when many pairs of sequences are given (e.g., one per customer or one per network node), each labeled with attributes such as customer marital status and city of residence, or router name, location and model. Here, each tableau entry is an interval on the ordered attribute, plus a pattern on the label attributes that identifies a cluster of pairs of sequences. For example, Table 1 identifies two subsets of a sequence of inbound and outbound network traffic: Router100 (of type Edge) measurements taken between January 1 and 15, and measurements from all Backbone routers (the “*” denotes all values of a given attribute) between January 12 and 31.
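A tableau entry of this kind can be read as a predicate over labeled measurements. The sketch below encodes the two rows described above and tests records against them. The dictionary layout, the use of day-of-January integers, and the router names other than Router100 are assumptions made for illustration.

```python
def matches(row, record):
    """True if `record` falls under tableau row `row`; "*" matches any value."""
    for attr in ("router_type", "router_name"):
        if row[attr] != "*" and row[attr] != record[attr]:
            return False
    start, end = row["interval"]          # inclusive range of days in January
    return start <= record["day"] <= end

tableau = [
    {"router_type": "Edge", "router_name": "Router100", "interval": (1, 15)},
    {"router_type": "Backbone", "router_name": "*", "interval": (12, 31)},
]

records = [
    {"router_type": "Edge", "router_name": "Router100", "day": 10},
    {"router_type": "Backbone", "router_name": "Router7", "day": 20},   # hypothetical router
    {"router_type": "Edge", "router_name": "Router200", "day": 10},     # hypothetical router
]

covered = [any(matches(row, rec) for row in tableau) for rec in records]
print(covered)
```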


Finally, we illustrate the utility of CDs and their discovered tableaux, and the efficiency of our algorithms, on a variety of real data sets. These include the credit card and building entrance/exit data shown in FIGS. 1 and 3, as well as network traffic summaries and job submission logs on a computing cluster.


The remainder of this paper is organized as follows. Section 2 discusses related work. In Section 3, we give formal problem statements. In Section 4, we present the tableau algorithms. Section 5 presents experimental results, and Section 6 concludes the paper.


2 RELATED WORK

Our work is most closely related to existing work on tableau discovery for functional [4, 7, 10] and sequential dependencies [11]. While we borrow the general idea of tableau discovery, we propose novel constraint and confidence metrics, and novel algorithms for efficiently identifying candidate intervals. In particular, even if the interval finding algorithm from [11] could be adapted to CDs, it would not be applicable because our confidence definitions violate one of its assumptions, namely that the confidence of two highly overlapping intervals of similar size be similar (to see this, construct an interval I′ by starting with I and adding a single arbitrarily large “charge” without a corresponding “payment”).


Concepts similar to conservation laws have been discussed in the context of network traffic analysis [9, 12], clustering data for deduplication [3], and consistent query answering under “aggregation constraints” [5]. This paper is the first to address confidence metrics and tableau discovery for CDs.


The data mining literature investigated finding intervals that satisfy various properties, e.g., mining support-optimized association rules [8], where the objective is to find intervals of an array having a mean value above a specified minimum. That solution is not compatible with our confidence definitions, but can express two related metrics. The first metric divides the total “payments” by the total “charges” in a given interval. As discussed in Section 1.1, this definition does not account for delays. The second definition divides the total cumulative payments by the total cumulative charges in a given interval, which amounts to piecewise computation of the areas under the cumulative curves, where the “baseline” is always at zero. However, intervals near the end of the sequence have relatively larger areas under the cumulative curves, thereby skewing the resulting confidence (in contrast, our measures use variable baselines). In Section 5, we experimentally show that these measures are not effective at detecting violations in the conservation law.


3 PRELIMINARIES

Let a=⟨a1, a2, . . . , an⟩ and b=⟨b1, b2, . . . , bn⟩, with ai, bi≧0, be two sequences with respect to some ordered attribute such as time; we assume uniform spacing on this attribute between adjacent indices. When a and b are governed by a conservation law, a can be thought of as containing counts of responses to events whose counts are contained in b (e.g., a=payments and b=charges). Given an interval {i, i+1, . . . , j}, which we write as [i, j], let a[i, j] denote the subsequence of a that includes entries in the positions i through j inclusive. We also define the derived cumulative time series A=(A0, A1, . . . , An) from a, where A0=0 and A_i=Σ_{j≦i} a_j (the sum of a1 through ai). B is defined analogously based on b. We assume B dominates A, that is, Bi≧Ai for all i. Even if this is not the case, there are various (domain-specific) ways of preprocessing to satisfy this assumption; e.g., in the credit card example, we can account for prepayment toward future charges by setting A′l:=min {Al, Bl} and B′l:=Bl for all 1≦l≦n.
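The derived series and the dominance preprocessing can be sketched in a few lines. The function names cumulative and enforce_dominance and the sample values are assumptions for illustration.

```python
from itertools import accumulate

def cumulative(seq):
    """Return [X_0, X_1, ..., X_n] with X_0 = 0 and X_i = seq[0] + ... + seq[i-1]."""
    return [0] + list(accumulate(seq))

def enforce_dominance(A, B):
    """Clip A so that B_l >= A_l for every l (A'_l = min(A_l, B_l)); B is unchanged."""
    return [min(x, y) for x, y in zip(A, B)], list(B)

a = [5, 0, 10]   # hypothetical responses (e.g., payments), including a prepayment
b = [3, 4, 6]    # hypothetical events (e.g., charges)

A = cumulative(a)
B = cumulative(b)
A_clipped, B_same = enforce_dominance(A, B)
print(A, B, A_clipped)
```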


3.1 CONFIDENCE MEASURES

We define the confidence of a given CD within the scope of interval I=[i, j] in terms of the area between the curves A and B in this interval, normalized by the area under B, down to a “baseline,” to be formally defined below. This gives the “divergence” of A with respect to B; confidence is then defined as the complement.


Definition 3 Given two cumulative sequences A and B, the confidence conf(I) of A with respect to B in interval I=[i, j] (1≦i≦j≦n) is defined as

$$\mathrm{conf}(I) = 1 - \frac{\sum_{l=i}^{j} (B_l - A_l)}{\sum_{l=i}^{j} (B_l - A_{i-1})}.$$





Note that 0≦conf(I)≦1. Here A_{i−1}, the cumulative amount from the response curve up to but not including i, is the baseline from which we measure area under B, so that the gap between A and B to just before i is taken into account. Clearly, intervals with different starting points may have different baselines. Alternatively, this formula can be written as the ratio of the area under A to the area under B, in the interval [i, j], using the same baseline A_{i−1}:

$$\mathrm{conf}(I) = \frac{\sum_{l=i}^{j} (A_l - A_{i-1})}{\sum_{l=i}^{j} (B_l - A_{i-1})}.$$
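The two forms of the balance-model confidence can be checked numerically. The following is a minimal sketch on made-up cumulative series; the function names are assumptions, and A and B are lists indexed 0..n with A[0]=B[0]=0.

```python
def conf_balance(A, B, i, j):
    """Balance-model confidence of [i, j] as 1 minus the normalized gap."""
    base = A[i - 1]                                        # baseline A_{i-1}
    gap = sum(B[l] - A[l] for l in range(i, j + 1))        # area between the curves
    area_b = sum(B[l] - base for l in range(i, j + 1))     # area under B down to the baseline
    return 1 - gap / area_b

def conf_balance_ratio(A, B, i, j):
    """The same confidence written as area under A over area under B."""
    base = A[i - 1]
    area_a = sum(A[l] - base for l in range(i, j + 1))
    area_b = sum(B[l] - base for l in range(i, j + 1))
    return area_a / area_b

A = [0, 2, 4, 7, 9]     # hypothetical cumulative responses
B = [0, 3, 6, 8, 10]    # hypothetical cumulative events, B dominates A

c1 = conf_balance(A, B, 2, 4)
c2 = conf_balance_ratio(A, B, 2, 4)
print(c1, c2)
```

Here the gap is 4 and the area under B above the baseline is 18, so both forms give 14/18.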




We call the above confidence measure the balance model (and will denote it as conf_b when there is ambiguity as to which confidence function is used) because it penalizes for the balance B_{i−1}−A_{i−1} existing just before i. We may wish to compensate for this balance, so that only events (and responses) that occur within [i, j] affect confidence. However, only disparities not due to delay should be compensated for since resolution of such delays may occur within [i, j]. Therefore, we use the smallest difference between A and B in the suffix from i to n as an estimate of loss, since this difference is guaranteed not to be due to delay. Thus, the total compensation Δ_i at i is defined as min_{i≦l≦n}{B_l−A_l}.
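Since the compensation at i is a minimum over a suffix, all n values can be obtained in a single backward pass. A minimal sketch follows; the function name compensation and the sample series are assumptions.

```python
def compensation(A, B):
    """Return delta with delta[i] = min over l in i..n of (B[l] - A[l]).

    A and B are cumulative lists indexed 0..n; delta[0] is unused (None).
    """
    n = len(A) - 1
    delta = [None] * (n + 1)
    running = B[n] - A[n]
    for i in range(n, 0, -1):            # scan the suffix from right to left
        running = min(running, B[i] - A[i])
        delta[i] = running
    return delta

A = [0, 2, 4, 7, 9]     # hypothetical cumulative responses
B = [0, 3, 6, 8, 12]    # hypothetical cumulative events
delta = compensation(A, B)
print(delta)
```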


There are two natural ways to discount an outstanding balance at i. We can give a credit to A to inject responses to events, or we can debit B to cancel unmatched events. In the former (referred to as the credit model), A is credited by Δi at i. In the latter (referred to as the debit model), B is debited by Δi at i. Either way, note that our choice of Δ ensures that B still dominates A.


Choosing one model over another depends on the aspects that one wishes to highlight. We can use the credit model if we suspect missing data in A, or wish to calculate what the confidence would have been had the unmatched events seen responses at i. The debit model is more appropriate when events in B may have been spuriously reported, or when we wish to calculate the confidence had those events not occurred.


Definition 4 Given two cumulative sequences A and B, the confidence conf_c(I) of A with respect to B in interval I=[i, j], with credit applied at i, is defined as

$$\mathrm{conf}_c(I) = \frac{\sum_{l=i}^{j} (A_l - A_{i-1} + \Delta_i)}{\sum_{l=i}^{j} (B_l - A_{i-1})}.$$




Definition 5 Given two cumulative sequences A and B, the confidence conf_d(I) of A with respect to B in interval I=[i, j], with debit applied at i, is defined as

$$\mathrm{conf}_d(I) = \frac{\sum_{l=i}^{j} (A_l - A_{i-1})}{\sum_{l=i}^{j} (B_l - A_{i-1} - \Delta_i)}.$$
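The three models can be compared side by side on a small example. This sketch assumes cumulative lists indexed 0..n with B dominating A; the function name confidences and the data are illustrative, and the compensation is recomputed directly from its definition rather than precomputed.

```python
def confidences(A, B, i, j):
    """Return (conf_b, conf_c, conf_d) for the inclusive interval [i, j]."""
    n = len(A) - 1
    delta = min(B[l] - A[l] for l in range(i, n + 1))   # compensation at i
    base = A[i - 1]                                     # baseline A_{i-1}
    width = j - i + 1
    area_a = sum(A[l] - base for l in range(i, j + 1))
    area_b = sum(B[l] - base for l in range(i, j + 1))
    conf_b = area_a / area_b                            # balance model
    conf_c = (area_a + width * delta) / area_b          # credit: lift A by delta
    conf_d = area_a / (area_b - width * delta)          # debit: lower B by delta
    return conf_b, conf_c, conf_d

A = [0, 2, 4, 7, 9]      # hypothetical cumulative responses
B = [0, 4, 7, 9, 12]     # hypothetical cumulative events

conf_b, conf_c, conf_d = confidences(A, B, 2, 3)
print(conf_b, conf_d, conf_c)
```

For this interval the compensation is 2, giving 7/12 under the balance model, 7/8 under the debit model, and 11/12 under the credit model.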




We state the following simple claim without proof. Given an interval I, conf_b(I)≦conf_d(I)≦conf_c(I).
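One way to see why the claim holds is sketched below. The shorthand X, Y and D is introduced here for convenience and is not notation from the text; the argument assumes Y − D > 0 so that the debit-model confidence is defined.

```latex
\begin{align*}
X &= \sum_{l=i}^{j} (A_l - A_{i-1}), \qquad
Y = \sum_{l=i}^{j} (B_l - A_{i-1}), \qquad
D = (j-i+1)\,\Delta_i \ge 0, \\
\mathrm{conf}_b &= \frac{X}{Y}, \qquad
\mathrm{conf}_d = \frac{X}{Y-D}, \qquad
\mathrm{conf}_c = \frac{X+D}{Y}. \\
\intertext{Since $D \ge 0$, we have $Y - D \le Y$ and hence $\mathrm{conf}_b \le \mathrm{conf}_d$.
Each term satisfies $B_l - A_l \ge \Delta_i$ for $l \ge i$, so $Y - X = \sum_{l=i}^{j}(B_l - A_l) \ge D$, and}
(X+D)(Y-D) - XY &= D\,(Y - X - D) \ge 0
\;\Longrightarrow\;
\frac{X}{Y-D} \le \frac{X+D}{Y},
\end{align*}
```

which gives the second inequality.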



FIGS. 4A and 4B show sequences of charges and payments, zooming in on I=[3, 8] (technically, I=[3, 9) since we want to include the contribution of the divergence at time 8). On the left, we compute conf_c(I) by shifting the cumulative payment curve up by seven, which is the smallest difference between the cumulative charges and payments in the suffix. On the right, conf_d(I) moves the cumulative charge curve down by seven. In Appendix A, we work out that conf_b(I)=0.64, conf_d(I)=0.79 and conf_c(I)=0.83, and compare these numbers to the confidence according to the algorithm from [8] (recall Section 2).


In practice, there may be natural delays between events and responses, e.g., credit card payments can be made up to a month after a bill arrives. When delays are constant and persistent, their impact on confidence can be removed by simple preprocessing: we set A′_i:=A_{i+s}, for a time shift of s, and compute the confidence using curves A′ and B. Finding the right shift length is an important problem but is outside the scope of this paper; we assume that s, if any, is given.


3.2 TABLEAU GENERATION

We now define the interval candidate generation problem, which finds a set of left-maximal intervals (for each i, the largest j approximately satisfying confidence).


Definition 6 The CD Interval Discovery Problem is, given two derived cumulative sequences A and B with Ai≦Bi for all i, and confidence threshold ĉ, to find all maximal intervals (if any exist), all of whose confidence is at least ĉ (for hold tableau), resp., at most ĉ (for fail tableau), as measured by either conf_b, conf_c, or conf_d.


We then use the partial interval cover algorithm from [11] to construct a minimal tableau from the discovered maximal intervals (recall Definition 2).


So far, we assumed that the input consists of a single pair of sequences. Due to space constraints, we discuss interval and tableau generation for multiple pairs of sequences in Appendix B.


4 ALGORITHMS

In this section, we present novel algorithms to (almost) generate, for each i, either a largest j≧i such that the confidence of [i, j] is at least ĉ, in the case of a hold tableau, or, in the case of a fail tableau, a largest j≧i such that the confidence of [i, j] is at most ĉ, under the balance, credit, and debit models. (The “almost” is explained below.) We begin with hold tableaux, exhibiting later the changes necessary to handle fail tableaux.


4.1 HOLD TABLEAU INTERVAL SELECTION

We are given two integral sequences, a=⟨a1, a2, . . . , an⟩ and b=⟨b1, b2, . . . , bn⟩, as well as the confidence threshold parameter ĉ. Using linear time for preprocessing, the cumulative time series A and B can be obtained. A naive algorithm may consider all quadratically-many intervals and compute the confidence for each chosen interval individually to check if it satisfies the threshold ĉ. This leads to a Θ(n^3)-time algorithm. However, again by simple linear-time preprocessing, the time complexity of this naive exhaustive search can be reduced to Θ(n^2). For large datasets a quadratic-time algorithm is infeasible to run. Hence, our goal is to design scalable algorithms in the same spirit as in previous works on other kinds of dependencies and association rules [8, 10, 11].
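The quadratic exhaustive search can be sketched as follows for the balance model. The text does not spell out the linear-time preprocessing; this sketch assumes it is a prefix sum over the cumulative series, which lets each interval's areas be evaluated in constant time. The function name and the data are illustrative.

```python
from itertools import accumulate

def exhaustive_hold_intervals(a, b, c_hat):
    """For each left endpoint i, find the largest j such that the balance-model
    confidence of [i, j] is at least c_hat, by testing every interval."""
    n = len(a)
    A = [0] + list(accumulate(a))      # cumulative responses, A[0] = 0
    B = [0] + list(accumulate(b))      # cumulative events, B[0] = 0
    SA = list(accumulate(A))           # SA[k] = A_0 + ... + A_k
    SB = list(accumulate(B))
    result = {}
    for i in range(1, n + 1):
        base = A[i - 1]
        for j in range(n, i - 1, -1):  # try the longest interval first
            width = j - i + 1
            area_a = SA[j] - SA[i - 1] - width * base
            area_b = SB[j] - SB[i - 1] - width * base
            if area_b > 0 and area_a >= c_hat * area_b:
                result[i] = j
                break
    return result

payments = [1, 2, 0, 3]    # hypothetical responses a
charges = [2, 2, 2, 2]     # hypothetical events b
maximal = exhaustive_hold_intervals(payments, charges, c_hat=0.5)
print(maximal)
```

Left endpoint 3 is absent from the result because neither [3, 3] nor [3, 4] reaches the threshold.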


The intervals [i, j] that we report have confidence at least ĉ/(1+ε), though maybe not at least the target ĉ. We make this approximation to guarantee fast running time. However, if the longest interval beginning at i and having confidence at least ĉ is [i, j*], the algorithm will report an interval [i, j′] with j′≧j* having confidence at least ĉ/(1+ε). We give one simple generic algorithm, that works for all three models, balance, credit, and debit, and much more.


We have 0=A_0≦A_1≦A_2≦A_3≦ . . . ≦A_n and 0=B_0≦B_1≦B_2≦B_3≦ . . . ≦B_n, satisfying A_i≦B_i for all i. From these two sequences, two more integral sequences, ⟨H_1^A, H_2^A, H_3^A, . . . , H_n^A⟩ and ⟨H_1^B, H_2^B, H_3^B, . . . , H_n^B⟩, are defined, in a problem-dependent way. These sequences must satisfy A_l−H_i^A≧0 and B_l−H_i^B≧0, for all l≧i.

    • 1. In the balance model, H_i^A=H_i^B=A_{i−1};
    • 2. in the credit model, H_i^A=A_{i−1}−min_{k≧i}{B_k−A_k} and H_i^B=A_{i−1}; and
    • 3. in the debit model, H_i^A=A_{i−1} and H_i^B=A_{i−1}+min_{k≧i}{B_k−A_k}.


The reader can verify that these HiA and HiB values correspond to the baseline values already discussed earlier, and (hence) that Al−HiA≧0 and Bl−HiB≧0 for all l≧i. In all three cases, all the n HiA and n HiB values can be computed in O(n) time.
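As an illustration of the linear-time claim, the baselines can be obtained from a single right-to-left pass that maintains the suffix minimum of B_k − A_k. The following is a minimal sketch, not the authors' C implementation; the function names `cumulative` and `baselines` and the 1-indexed list layout (index 0 holds A_0 = B_0 = 0) are assumptions made for this example.

```python
from itertools import accumulate

def cumulative(seq):
    """Return [0, s1, s1+s2, ...]; index i holds the sum of the first i values."""
    return [0] + list(accumulate(seq))

def baselines(A, B, model):
    """Compute H^A and H^B (1-indexed, index 0 unused) for one of the three models.

    A and B are cumulative sequences with A[0] = B[0] = 0 and A[i] <= B[i].
    """
    n = len(A) - 1
    # suffix_min[i] = min over k >= i of (B[k] - A[k]), filled right to left in O(n).
    suffix_min = [0] * (n + 2)
    suffix_min[n + 1] = float("inf")
    for i in range(n, 0, -1):
        suffix_min[i] = min(B[i] - A[i], suffix_min[i + 1])
    HA = [0] * (n + 1)
    HB = [0] * (n + 1)
    for i in range(1, n + 1):
        if model == "balance":
            HA[i], HB[i] = A[i - 1], A[i - 1]
        elif model == "credit":
            HA[i], HB[i] = A[i - 1] - suffix_min[i], A[i - 1]
        elif model == "debit":
            HA[i], HB[i] = A[i - 1], A[i - 1] + suffix_min[i]
        else:
            raise ValueError("unknown model: " + model)
    return HA, HB
```

For example, with the made-up inputs a = [1, 2, 1] and b = [2, 2, 2], the cumulative sequences are A = [0, 1, 3, 4] and B = [0, 2, 4, 6], and the debit model yields H^B = [·, 1, 2, 5].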


Definition 7 Define $\mathrm{area}_A(i, j)=\sum_{l=i}^{j}(A_l-H_i^A)$ and $\mathrm{area}_B(i, j)=\sum_{l=i}^{j}(B_l-H_i^B)$. (Note that the subscript on the H terms is i, not l.) Recall that the confidence is $\mathrm{conf}(i, j)=\mathrm{area}_A(i, j)/\mathrm{area}_B(i, j)$, provided the denominator is positive.


The algorithm is very simple (see Appendix C for pseudocode):

    • 1. For each i=1, 2, 3, . . . , n, for each l=0, 1, 2, 3, . . . , $\lceil\log_{1+\varepsilon}\mathrm{area}_B(i, n)\rceil$, do:
      • Calculate $r_{il}$, the largest j such that $\mathrm{area}_B(i, j)\le(1+\varepsilon)^l$. (How to do so quickly is explained in Appendix D.)
    • 2. For each i=1, 2, 3, . . . , n, do:
      • Calculate $\mathrm{conf}(i, r_{il})$ for all l, and output interval $[i, r_{il}]$ for the largest $r_{il}$ such that $\mathrm{conf}(i, r_{il})\ge\hat{c}/(1+\varepsilon)$, if such an $r_{il}$ exists. (How to do so quickly is explained in Appendix D.)


Since the algorithm seeks the largest ril, an efficient heuristic is to try the largest possible ril first, then the second largest, and so on, stopping as soon as it finds an interval of confidence at least ĉ/(1+ε).
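The two steps and the largest-first heuristic can be sketched as follows. This is an illustrative sketch under stated assumptions, not the pseudocode of Appendix C: each breakpoint is located by binary search (valid because the area for B is nondecreasing in j), which is simpler than, but asymptotically slower than, the amortized scan of Appendix D. The function name `hold_intervals`, the 1-indexed lists, and the convention that a zero-area interval has confidence 1 (taken from Appendix C) are assumptions of this example.

```python
from itertools import accumulate

def hold_intervals(A, B, HA, HB, c_hat, eps):
    """Return [(i, j)] with conf(i, j) >= c_hat / (1 + eps), j a geometric breakpoint.

    A, B: cumulative sequences with A[0] = B[0] = 0.
    HA, HB: baselines indexed by the start position i (index 0 unused).
    """
    n = len(A) - 1
    S = list(accumulate(A))  # S[j] = A[1] + ... + A[j], since A[0] = 0
    T = list(accumulate(B))

    def area_a(i, j):
        return (S[j] - S[i - 1]) - (j - i + 1) * HA[i]

    def area_b(i, j):
        return (T[j] - T[i - 1]) - (j - i + 1) * HB[i]

    def conf(i, j):
        denom = area_b(i, j)
        return 1.0 if denom == 0 else area_a(i, j) / denom

    result = []
    for i in range(1, n + 1):
        total = area_b(i, n)
        lmax = 0
        while (1 + eps) ** lmax < total:  # smallest level whose bound covers [i, n]
            lmax += 1
        for level in range(lmax, -1, -1):  # largest breakpoints first
            bound = (1 + eps) ** level
            lo, hi = i - 1, n  # lo = i - 1 stands for "no valid j"
            while lo < hi:  # largest j with area_b(i, j) <= bound
                mid = (lo + hi + 1) // 2
                if area_b(i, mid) <= bound:
                    lo = mid
                else:
                    hi = mid - 1
            if lo >= i and conf(i, lo) >= c_hat / (1 + eps):
                result.append((i, lo))
                break  # smaller levels can only give shorter intervals
    return result
```

Because the breakpoints shrink as the level decreases, stopping at the first qualifying level returns the longest reported interval for each i.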


We prove the following theorem in Appendix D.

  • Theorem 1 The total running time of the given algorithm for the balance, credit, or debit models is $O(n \log_{1+\varepsilon}\mathrm{area}_B(1,n))$, which is $O(\frac{1}{\varepsilon}\, n \lg \mathrm{area}_B(1,n))$ if ε≦1.


Assuming that the areaB(1, n) values are only polynomially large in n (or can be scaled downward along with the A's so as to become so), the lg areaB(1, n) factor in the running time is only logarithmic in n. Indeed, if areaB(1, n) were exponential in n, one would need n bits of storage just to represent one areaB value, and arithmetic on such numbers could not be done in constant time.
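The second bound in Theorem 1 is a change of logarithm base. A short derivation, writing X for area_B(1, n) and using only the standard inequality ln(1+ε) ≥ ε ln 2 for 0 < ε ≤ 1 (which holds because ln(1+ε) is concave and the two sides agree at ε = 0 and ε = 1):

```latex
\log_{1+\varepsilon} X
  \;=\; \frac{\ln X}{\ln(1+\varepsilon)}
  \;\le\; \frac{\ln X}{\varepsilon \ln 2}
  \;=\; \frac{1}{\varepsilon}\,\lg X ,
  \qquad 0 < \varepsilon \le 1,\; X \ge 1 .
```

Multiplying by n gives the stated $O(\frac{1}{\varepsilon}\, n \lg X)$ running time.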


The next theorem, which we prove in Appendix E, shows that the values produced by the algorithm are accurate.

  • Theorem 2 1. (No false positives) If the algorithm outputs an interval [i, j], then con f(i, j)≧ĉ/(1+ε).
    • 2. (No false negatives) Given i, let j*≧i be largest such that con f(i, j*)≧ĉ (if such a j*≧i exists). Then the algorithm produces an interval [i, j′], j′≧j*, having confidence at least ĉ/(1+ε).


To explain why we want j′≧j*, suppose an optimal tableau uses, say, m intervals, each of the form [i, j*], each of confidence at least ĉ, whose union covers at least a specified fraction ŝ of {1, 2, . . . , n}. We may assume that each j* is largest such that [i, j*] has confidence at least ĉ. By the “no-false-negatives” property, the algorithm will generate an interval [i, j′], j′≧j*, so that [i, j′] ⊇ [i, j*], and hence there will exist a tableau of at most m intervals [i, j′] produced by the algorithm (with intervals having confidence at least ĉ/(1+ε), not ĉ).


4.2 FAIL TABLEAU INTERVAL SELECTION

For fail tableaux, we want, ideally, to generate intervals [i, j′], where j′ is largest such that con f(i, j′)≦ĉ (as opposed to con f(i, j′)≧ĉ, in the case of hold tableau). We instead generate intervals with confidence at most ĉ(1+ε). It is important to note that while we now want confidence bounded above by ĉ (or ĉ(1+ε)), if the optimal interval is [i, j*], we still need j′≧j*, not j′≦j*. The reason is that, once again, we want to know that if the optimal tableau consists of m intervals, then there is a collection of m algorithm-generated intervals of equally high support.


The generic algorithm above is driven by areaB(i, j); for fail tableaux, the generic algorithm is driven by areaA(i, j) instead. We will need to treat the balance and debit models differently from the credit model. We start with the former two.


4.2.1 FAIL TABLEAUX IN THE BALANCE AND DEBIT MODELS

Here is the algorithm to choose intervals for fail tableaux in the balance or debit model (see Appendix F for pseudocode):

    • 1. For each i=1, 2, 3, . . . , n, for each l=0, 1, 2, 3, . . . , $\lceil\log_{1+\varepsilon}\mathrm{area}_A(i, n)\rceil$, do:
      • Calculate $s_{il}$, the largest j such that $\mathrm{area}_A(i, j)\le(1+\varepsilon)^l$.
    • 2. For each i=1, 2, 3, . . . , n, do:
      • Calculate $\mathrm{conf}(i, s_{il})$ for all l, and output interval $[i, s_{il}]$ for the largest $s_{il}$ such that $\mathrm{conf}(i, s_{il})\le\hat{c}(1+\varepsilon)$, if such an $s_{il}$ exists.


We prove an analogue of Theorem 2 in Appendix G.

  • Theorem 3 1. (No false positives) If the algorithm outputs an interval [i, j], then con f(i, j)≦ĉ(1+ε).
    • 2. (No false negatives) Given i, let j*≧i be largest such that con f(i, j*)≦ĉ (if such a j*≧i exists). Then the algorithm produces an interval [i, j′], j′≧j*, having confidence at most ĉ(1+ε).


For running time, we state the following analogue of Theorem 1, which follows from the monotonicity of HiA in the balance and debit models.

  • Theorem 4 The total running time of the algorithm to select intervals for fail tableaux in the balance and debit models is $O(n \log_{1+\varepsilon}\mathrm{area}_A(1, n))$.


4.2.2 FAIL TABLEAUX IN THE CREDIT MODEL

The algorithm for fail tableaux using the balance and debit models relied on monotonicity of $H_i^A$, which, in the credit model, equals $A_{i-1}-\min_{k\ge i}\{B_k-A_k\}$ and which is provably not monotonic. The solution is to use the breakpoints $s_{il}$ defined for the balance model! Let us define $\mathrm{area}_A^b(i, j)$ and $\mathrm{area}_A^c(i, j)$ to be $\mathrm{area}_A(i, j)$ in the balance and credit model, respectively. Specifically, $\mathrm{area}_A^b(i, j)=\sum_{l=i}^{j}[A_l-A_{i-1}]$ and $\mathrm{area}_A^c(i, j)=\sum_{l=i}^{j}[A_l-(A_{i-1}-\min_{k\ge i}\{B_k-A_k\})]$. Define $\mathrm{area}_B^c(i, j)$ to be (as expected) the area for B between i and j in the credit model (specifically, $\mathrm{area}_B^c(i, j)=\sum_{l=i}^{j}[B_l-A_{i-1}]$). Confidence $\mathrm{conf}_c(i, j)$ is then defined to be $\mathrm{area}_A^c(i, j)/\mathrm{area}_B^c(i, j)$. The algorithm is (see Appendix F for pseudocode):

    • 1. For each i=1, 2, 3, . . . , n, for each l=0, 1, 2, 3, . . . , $\lceil\log_{1+\varepsilon}\mathrm{area}_A^b(i, n)\rceil$, do:
      • Calculate $s_{il}$ as in the algorithm above for fail tableaux in the balance model; specifically, $s_{il}$ is the largest j such that $\mathrm{area}_A^b(i, j)\le(1+\varepsilon)^l$.
    • 2. For each i=1, 2, 3, . . . , n, do:
      • Calculate $\mathrm{conf}_c(i, s_{il})$ for all l, and output interval $[i, s_{il}]$ for the largest $s_{il}$ such that $\mathrm{conf}_c(i, s_{il})\le\hat{c}(1+\varepsilon)$, if such an $s_{il}$ exists.
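The key idea, breakpoints taken from the balance-model area but confidence evaluated in the credit model, can be sketched as below. This is a simplified illustration and makes no attempt to match the running time of Theorem 5: for each i it rescans the breakpoints from scratch. The function name `credit_fail_intervals`, the 1-indexed lists, and the confidence-1 convention for zero-area intervals are assumptions of this example.

```python
from itertools import accumulate

def credit_fail_intervals(A, B, c_hat, eps):
    """Return [(i, j)] with credit-model confidence <= c_hat * (1 + eps)."""
    n = len(A) - 1
    S = list(accumulate(A))
    T = list(accumulate(B))
    # gap[i] = min over k >= i of (B[k] - A[k])
    gap = [0] * (n + 2)
    gap[n + 1] = float("inf")
    for i in range(n, 0, -1):
        gap[i] = min(B[i] - A[i], gap[i + 1])

    def area_ab(i, j):  # balance-model area for A
        return (S[j] - S[i - 1]) - (j - i + 1) * A[i - 1]

    def area_ac(i, j):  # credit-model area for A: baseline lowered by gap[i]
        return area_ab(i, j) + (j - i + 1) * gap[i]

    def area_bc(i, j):  # credit-model area for B
        return (T[j] - T[i - 1]) - (j - i + 1) * A[i - 1]

    def conf_c(i, j):
        denom = area_bc(i, j)
        return 1.0 if denom == 0 else area_ac(i, j) / denom

    result = []
    for i in range(1, n + 1):
        total = area_ab(i, n)
        candidates = []  # distinct breakpoints s_il, in increasing order
        bound, j = 1.0, i - 1
        while True:
            while j < n and area_ab(i, j + 1) <= bound:
                j += 1
            if j >= i and (not candidates or candidates[-1] != j):
                candidates.append(j)
            if bound >= total:
                break
            bound *= 1 + eps
        for j in reversed(candidates):  # longest candidate first
            if conf_c(i, j) <= c_hat * (1 + eps):
                result.append((i, j))
                break
    return result
```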


The next two results, proved in Appendices H and I, respectively, characterize the efficiency and correctness of the above algorithm.

  • Theorem 5 The overall running time is $O(n \log_{1+\varepsilon}\mathrm{area}_A^b(1,n))$, which is $O(n \log_{1+\varepsilon}\mathrm{area}_A^c(1, n))$.
  • Theorem 6 1. (No false positives) If the algorithm outputs an interval [i, j], then con fc(i, j)≦ĉ(1+ε).
    • 2. (No false negatives) Given i, let j*≧i be largest such that con fc(i, j*)≦ĉ (if such a j*≧i exists). Then the algorithm produces an interval [i, j′], j′≧j*, having confidence con fc(i, j′)≦ĉ(1+ε).


5 EXPERIMENTS

We now show the effectiveness of conservation dependencies in capturing potential data quality issues and mining interesting patterns. We also investigate the trade-off between performance and tableau quality with respect to ε, and demonstrate the scalability of the proposed algorithm. Choosing an appropriate confidence threshold is domain-specific and outside the scope of this work; we experimented with different values of ĉ for the purpose of these experiments. Experiments were performed on a 2.2 GHz dual-core Pentium PC with 4 GB of RAM. All the algorithms were implemented in C. We used the following data sources.

    • People-Count (archive.ics.uci.edu/ml/machine-learning-databases/event-detection/CalIt2.data): counts of persons entering and exiting the front door of a building at the University of California, Irvine, as measured by an optical sensor. The entrance and exit sequences have 5040 data points each (one measurement every half an hour for 15 weeks). We also use a list of events scheduled in this building during this time period (archive.ics.uci.edu/ml/machine-learning-databases/event-detection/CalIt2.events).
    • NZ-Credit-Card (www.rbnz.govt.nz/statistics/monfin/): monthly aggregated credit card charges and payments reported by the Reserve Bank of New Zealand from January 1981 to August 2009.
    • TCP packet traces (ita.ee.lbl.gov/html/traces.html): we use the DEC-PKT-3 trace; n=177802. Here, there is a conservation law between the number of SYN packets (requests to start a TCP connection) and FIN plus RST packets (connection termination).
    • Network Monitoring: counts of incoming and outgoing traffic per router for several hundred routers, collected from a large network every five minutes for roughly 2 weeks (3800 measurements per router).
    • Job Log (gwa.ewi.tudelft.nl/pmwiki/pmwiki.php?n=Workloads.Gwa-t-1): trace of 1124772 jobs submitted to a grid computing cluster, coarsened to a time sequence at regular intervals. Here the conservation law is between the number of submitted jobs and the number of completed jobs.


5.1 UTILITY OF CONSERVATION DEPENDENCIES
Balance Model, New Zealand NZ-Credit-Card Data

NZ-Credit-Card has a confidence close to one, so the entire sequence is reported in the hold tableau (this is with the payments curve shifted ahead by 1 month to compensate for the standard credit card grace period). FIG. 5A shows the fail tableau with ĉ=0.8, using the balance model to find time periods with high outstanding debt. We see more intervals from the recent years, suggesting that the total outstanding balance has risen. Also, the consistent appearance of the interval November-December suggests that people charge more than they pay during the holiday shopping season in November and December, at least between 2004 and 2007. Perhaps not surprisingly, November-December of 2008 was not reported, when credit card debt was low due to dampened consumption during an economic recession [13]. There are no tableau intervals ending in January, indicating that unpaid charges from November and December were paid by January (i.e., adding January to a low-confidence November-December interval boosts confidence above 0.8, else intervals of the form November-January would have been included in the fail tableau instead).


The above result suggests that December charges were higher than December payments, but January payments were higher than January charges. FIGS. 5B and 5C confirm this, plotting charges and payments made only in Decembers (resp., Januaries) of each year. In the December plot, charges dominate payments, especially in the last five years. In the January plot, payments dominate charges.


We also tested the interval-finding algorithm from [8]. Recall from Section 2 that we can use this algorithm on sequences of either instantaneous values or cumulative amounts. In both cases, the hold tableau contains the entire data set with ĉ close to 1. With ĉ=0.8 on instantaneous values, the fail tableau contains a single interval of length one (January 1981). Since the magnitudes of monthly charges and payments have increased over time, this result reflects that the difference between charges and payments was proportionately larger in 1981. With ĉ=0.9, we get only three small intervals: January-March 1981, December 2003 and December 2008. If, instead, we use cumulative amounts, the fail tableau contains only January-February 1981 with ĉ=0.8, and January-May 1981 with ĉ=0.9. To explain this, recall from Section 2 that this measure uses a baseline of zero for each interval, meaning that intervals starting later in the sequence end up with artificially high confidences that are well above ĉ and therefore are not selected for the fail tableau.


Credit Model, People-Count Data

This data set exhibits a persistent violation of the conservation law (recall FIG. 3). As a result, only short intervals from the beginning of the sequence appear in hold tableaux. For fail tableaux in the balance model, all later intervals have low confidence from accrued imbalances (likely due to people occasionally exiting through a side door and therefore not being recorded properly). Instead, we use the credit model to ignore past imbalance (i.e., assume that people currently in the building have left). During July and August, sixteen events were scheduled in this building. In the time intervals surrounding these events, we expect a temporary discrepancy between the number of people entering the building and the number of people leaving. We generated a fail tableau with ĉ=0.6 and found that its intervals correspond to these known events. Note that since we are using the credit model, any reported intervals have low confidence due to divergence within them and not due to imbalances carried over from the past. In Table 2, we report the maximal intervals corresponding to five days in August during which at least one known event took place. Our intervals closely match the known events, whose durations are shown on the left. On August 4th, in addition to the two intervals spanning the event, we also report the interval 12:30-14:30. This is likely because of the imbalance of traffic occurring during lunch time. To validate this, we generated maximal intervals for other days in this data set, when no event was scheduled, and found that either no intervals were returned, or some intervals in between 11:30 and 15:00 were obtained.









TABLE 2
Selected events (left) and corresponding fail tableau intervals from the same day (right) in People-Count data, using the credit model with ĉ = 0.6

| Event date and time | Tableau interval(s) from the same day |
|---|---|
| August 2, 15:30-16:30 | 15:30-16:00 |
| August 4, 16:30-17:30 | 12:30-14:30, 16:00-17:00, 16:30-17:30 |
| August 9, 08:00-16:00 | 06:30-17:00 |
| August 9, 11:00-14:00 | |
| August 12, 08:00-11:00 | 06:30-12:00 |
| August 18, 08:00-17:00 | 06:00-13:30, 12:00-16:30, 13:30-17:30 |
| August 18, 18:00-20:30 | 17:00-19:00, 17:30-20:30 |









Debit Model, Network Monitoring Data

Next, we examine the network monitoring data for data quality problems of the form illustrated in FIG. 2, where some links are not monitored, causing an imbalance between the incoming and outgoing traffic passing through a router. The entire data set has a confidence of approximately 0.9, and contains several hundred pairs of sequences, one for each router, labeled with the router name and type (see Appendix B for more details on tableau discovery with multiple pairs of time series). We use the debit model for fail tableaux, which is appropriate here since it subtracts incoming packets whose outgoing counterparts have not been measured. Table 3 shows a fail tableau with ĉ=0.5, on router type, name, and time interval (with time represented as integers from one to 3800). Note that all but one violating router report an imbalance throughout all 3800 time coordinates.


We now zoom in on Router-7. It appears that the links which were not being monitored up to time 3610 started being monitored afterward. To confirm this, we single out the curves corresponding to this router and show two hold tableaux in Table 4. Interestingly, only three short intervals have confidence above 0.99, suggesting that even if all links are monitored correctly, small violations of the conservation law are normal. These could happen for many reasons: delays at the router, corrupted packets getting dropped at the router, etc. Using ĉ=0.9 yields a longer interval that only slightly overlaps with the “bad” interval from the fail tableau.

TABLE 3
Fail tableau for the network monitoring data set, using the debit model with ĉ = 0.5

| Type | Name | Interval |
|---|---|---|
| Core | Router-1 | 1-3800 |
| Core | Router-10 | 1-3800 |
| Core | Router-12 | 1-3800 |
| Edge | Router-6 | 1-3800 |
| Hub | Router-25 | 1-3800 |
| Edge | Router-7 | 1-3610 |

TABLE 4
Hold tableaux for Router-7 using the debit model

| confidence above 0.99 | confidence above 0.9 |
|---|---|
| 3650-3660 | 3530-3800 |
| 3660-3680 | |
| 3790-3800 | |


5.2 QUALITY OF APPROXIMATION

Since our algorithms test a relaxed confidence threshold, clearly it is possible that left-maximal intervals returned by our algorithm may not exist in the exact set of left-maximal intervals. We now examine the impact of the relaxation factor ε on how these intervals differ in practice. Using the People-Count data set with the credit model, we generated hold intervals using a variety of values for ĉ greater than 0.99, and fail intervals using a variety of values for ĉ less than 0.8. We then measured how well these intervals overlapped with those from the exact set, with overlap computed using the Jaccard coefficient. That is, for each I generated by our algorithm, we found I* from the exact set maximizing










$$\frac{|I \cap I^*|}{|I \cup I^*|}.$$




Table 5 summarizes the results for fail intervals using ĉ=0.8 as the average Jaccard coefficient value. We obtain coefficients close to one, indicating that each approximate interval highly overlaps with at least one exact interval. Similar results were obtained for hold intervals and with other choices of ĉ.
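For concreteness, the overlap measure can be computed as in the following sketch. It assumes, as an illustration only, that each interval is a closed range of integer time points given as a `(lo, hi)` pair; the function names `jaccard` and `best_overlap` are hypothetical.

```python
def jaccard(first, second):
    """Jaccard coefficient |I ∩ I*| / |I ∪ I*| of two closed integer intervals."""
    (a, b), (c, d) = first, second
    inter = max(0, min(b, d) - max(a, c) + 1)
    union = (b - a + 1) + (d - c + 1) - inter
    return inter / union

def best_overlap(interval, exact_set):
    """Largest Jaccard coefficient between `interval` and any exact interval."""
    return max(jaccard(interval, exact) for exact in exact_set)
```

Averaging `best_overlap` over all intervals reported by the approximation algorithm gives one number per value of ε, which is the kind of summary reported in Table 5.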









TABLE 5
Interval overlap using People-Count data

| ε | 0.001 | 0.002 | 0.005 | 0.01 |
|---|---|---|---|---|
| Jaccard | 0.9986 | 0.9987 | 0.979 | 0.9519 |










5.3 PERFORMANCE AND SCALABILITY


FIGS. 6A through 6C compare the running time of our algorithms against an exhaustive algorithm with quadratic running time. We only report the wallclock time for generating candidate maximal intervals for a hold tableau, and exclude the preprocessing and tableau construction times, which are both linear. Time is measured in seconds using the Unix clock command. On the left, we plot the performance for several values of ε on various prefixes of Job Log data, using the balance model and a ĉ slightly higher than the confidence value of the entire data. Our algorithm scales gracefully, and, even for small values of ε such as 0.001, is an order of magnitude faster. The middle figure plots the running times to generate hold tableau intervals using DEC-PKT-3 data for various confidence models and values of ε; the figure on the right repeats this for fail tableaux. Again, we use a confidence threshold that is slightly higher than the confidence of the overall data set. As before, we observe an order-of-magnitude performance improvement, even for small ε. In all three plots, note the linear dependence of running time on 1/ε for positive ε.


6 CONCLUSIONS

We proposed conservation dependencies that express conservation laws between pairs of related quantities. We presented several ways of quantifying the extent to which conservation laws hold using various confidence measures, and gave efficient approximation algorithms for the tableau discovery problem, i.e., finding subsets that satisfy (or fail) a supplied conservation dependency given some confidence threshold. Using real data sets, we demonstrated order-of-magnitude performance improvements, and the utility of tableau discovery for conservation dependencies in the context of data mining and data quality analysis. The reported tableaux are concise, easy to understand, and suggest interesting subsets of the data for further analysis.


This paper dealt with tableau discovery in the off-line model, where the data are available beforehand. An interesting direction for future work is to study on-line tableau discovery, where we incrementally maintain a compact tableau over a given conservation dependency as new data arrive.


APPENDIX
A Worked Examples of Calculating the Confidence of Conservation Dependencies

We now give worked examples based on FIG. 4, of calculating the confidence in different models, including those supported by the interval finding algorithm from [8].


Suppose that the interval in question is [3, 8] (technically, [3, 9)), as illustrated in FIG. 4. The area under the cumulative bills curve within this interval is 29+42+48+54+59+68=300. The area under the cumulative payments curve in this interval is 19+27+34+38+41+61=220. Note that the baseline, i.e., the cumulative payments just before the start of the interval, is 13. Thus, the area under the baseline in this interval is 6*13=78.


Now, con fb is the area between the cumulative payments and the baseline divided by the area between the cumulative bills and the baseline. This works out to








$$\frac{220-78}{300-78}=\frac{142}{222}\approx 0.64.$$





To compute con fc, we need to shift the cumulative payment curve up by seven, as illustrated on the left of FIG. 4. This increases the area between cumulative payments and the baseline by 7*6=42. Thus, we need to add 42 to the numerator that we used for con fb, while keeping the same denominator. This gives







$$\mathrm{conf}_c=\frac{220-78+42}{300-78}=\frac{184}{222}\approx 0.83.$$







To compute con fd, we shift the cumulative bills curve down by seven, as shown on the right of FIG. 4. This decreases the area between cumulative bills and the baseline by 7*6=42, so we subtract 42 from the denominator that we used for con fb, while keeping the same numerator. We get







$$\mathrm{conf}_d=\frac{220-78}{300-78-42}=\frac{142}{180}\approx 0.79.$$







Now, we consider the two related confidence metrics that the interval-finding algorithm from [8] can compute (recall Section 2). The first metric simply adds up the individual bills and payments within the given interval. The total payments in the interval [3, 8] are 6+8+7+4+3+20=48 and the total bills are 11+13+6+6+5+9=50. The resulting confidence is







$$\frac{48}{50}=0.96.$$





As already mentioned, this confidence metric does not account for delays. In our example, this gives a higher confidence than all of our three models because it does not capture the fact that a large payment of 20 was made at the end of the interval to cover several outstanding bills. The second possible metric divides the area under the cumulative payment curve by the area under the cumulative bills curve (without the notion of a baseline). This gives








$$\frac{220}{300}\approx 0.73,$$




which is higher than our con fb. As already mentioned, if we do not take baselines into account, we overestimate the confidence of intervals that do not start at 1; the later the starting point, the more severe the overestimate.
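The arithmetic of this appendix can be replayed in a few lines. The sketch below uses only the numbers quoted above for the interval [3, 8]; the variable names are chosen for this example.

```python
# Cumulative values inside the interval [3, 8], as quoted in the text.
bills_cum = [29, 42, 48, 54, 59, 68]
payments_cum = [19, 27, 34, 38, 41, 61]
baseline = 13  # cumulative payments just before the interval starts

length = len(bills_cum)
area_bills = sum(bills_cum)        # 300
area_payments = sum(payments_cum)  # 220
area_baseline = baseline * length  # 78

# Smallest gap between the two cumulative curves inside the interval (7 here).
shift = min(b - p for b, p in zip(bills_cum, payments_cum))

conf_b = (area_payments - area_baseline) / (area_bills - area_baseline)
conf_c = (area_payments - area_baseline + shift * length) / (area_bills - area_baseline)
conf_d = (area_payments - area_baseline) / (area_bills - area_baseline - shift * length)

# The two metrics supported by the interval-finding algorithm from [8].
payments = [6, 8, 7, 4, 3, 20]
bills = [11, 13, 6, 6, 5, 9]
conf_instantaneous = sum(payments) / sum(bills)
conf_no_baseline = area_payments / area_bills

print(round(conf_b, 2), round(conf_c, 2), round(conf_d, 2),
      round(conf_instantaneous, 2), round(conf_no_baseline, 2))
```

The printed values are 0.64, 0.83, 0.79, 0.96 and 0.73, in that order.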


B Tableau Discovery with Multiple Pairs of Sequences


In Sections 3 and 4, we addressed the tableau discovery problem for conservation dependencies on a single pair of sequences. We now discuss the case in which many pairs of sequences are given in the input. In the credit card example, there may be millions of users for whom we have separate charge and payment time series; in the network traffic example, a different pair of incoming and outgoing traffic measurements may be given for each router. As before, the objective will be to generate a minimal tableau that covers some fraction of the data, using subsets that all exceed (in case of a hold tableau) or fall below (in case of a fail tableau) a confidence threshold.


With a single pair of sequences, the only subsets (patterns) that were allowed in the tableaux were intervals on the ordered attribute. We now extend the allowed pattern space so that we can represent intervals in semantically meaningful clusters of pairs of sequences. We assume that each pair of sequences in the input is associated with a set L of label attributes, e.g., age group, gender and marital status for credit card customers, or router name, location and type for network monitoring. With each pair of sequences, we associate a descriptor tuple P, with a schema consisting of the set of label attributes L. Let P[k] be the projection of P onto the attribute k.


Definition 8 A tableau pattern p is a tuple of size |L|+1 with a schema consisting of the label attributes and the ordered attribute t, such as time. For each label attribute k ∈ L, the value taken on by p, call it p[k], is a constant from the domain of k or a special symbol “*”. For the ordered attribute, p[t] may be an arbitrary interval.


Definition 9 A tableau pattern p matches a descriptor tuple P if for all k ∈ L such that p[k]≠*, p[k]=P[k].


Thus, a tableau pattern p identifies an interval within one or more pairs of sequences that match p's labeling attributes. For example, two patterns consisting of L={Router type, Router name} and t=time interval were shown in Table 1. Note that the “*” symbol acts as a wildcard and matches all values of the given labeling attribute. Also, note that patterns may overlap.


Having defined the space of possible patterns, we now show how to compute the confidence of any such pattern. Observe that a pattern selects an interval from a cluster of pairs of sequences. Intuitively, we calculate the confidence (with respect to a conservation dependency) of such an interval by adding up all the corresponding cumulative sequences in the cluster, and transforming them into one new pair of “joint” cumulative sequences. Formally, for each pair of cumulative sequences $A^{(k)}$ and $B^{(k)}$ whose descriptor tuples match the given tableau pattern, we derive a pair of superposed sequences A and B, where $A_l:=\sum_k A_l^{(k)}$ and $B_l:=\sum_k B_l^{(k)}$. The confidence of the given pattern then corresponds to $\mathrm{conf}_b$, $\mathrm{conf}_c$ and $\mathrm{conf}_d$ computed on A and B within the interval specified in the pattern. In other words, the resulting confidence is the “average” confidence over all the pairs of sequences (in the given interval) that match the given tableau pattern.
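Pattern matching (Definition 9) and the superposition step can be sketched as follows. This is an illustrative sketch only: representing descriptors and the label part of a pattern as dictionaries, and the names `matches` and `superpose`, are assumptions of this example, and the interval p[t] is assumed to be handled separately.

```python
WILDCARD = "*"

def matches(pattern, descriptor):
    """True if every non-wildcard label in `pattern` equals the descriptor's value."""
    return all(value == WILDCARD or descriptor.get(attr) == value
               for attr, value in pattern.items())

def superpose(pattern, pairs):
    """Sum the cumulative sequences of all pairs whose descriptor matches `pattern`.

    `pairs` is a list of (descriptor, A, B) with equal-length cumulative sequences.
    Returns (A_joint, B_joint), or None if no pair matches.
    """
    selected = [(A, B) for descriptor, A, B in pairs if matches(pattern, descriptor)]
    if not selected:
        return None
    A_joint = [sum(column) for column in zip(*(A for A, _ in selected))]
    B_joint = [sum(column) for column in zip(*(B for _, B in selected))]
    return A_joint, B_joint
```

For instance, with made-up traffic values for two Edge routers and one Core router, the pattern `{"type": "Edge", "name": "*"}` selects and sums only the two Edge pairs; the confidence of the pattern is then computed on the joint sequences exactly as for a single pair.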


We are now ready to state the minimal tableau discovery problem for conservation dependencies when multiple pairs of sequences are provided in the input:


Definition 10 Let m be the number of pairs of sequences, each of length n, believed to obey a conservation law. Let ŝ and ĉ be user-supplied support and confidence thresholds, respectively. The minimal tableau discovery problem is to find the fewest patterns of the form described above, whose union has size at least ŝmn, such that the confidence of each pattern is above (hold tableau) or below (fail tableau) ĉ, with confidence as defined above.


An exhaustive algorithm for computing a minimal tableau in this situation is to examine all $\Theta(n^2)$ intervals for each possible pattern on the label set L (i.e., each possible combination of wildcards and constants), and then run a greedy partial set cover algorithm using all the candidate intervals as input. We can reduce the number of intervals to examine by re-using the algorithms proposed in Section 4. Furthermore, we can combine our algorithms with the on-demand tableau discovery algorithm for conditional functional dependencies (CFDs) that was proposed in [10]. The idea is to examine the most general patterns on the label attributes, starting with all-stars, and try more specific patterns (by replacing one “*” at a time with a constant) only when the general pattern does not satisfy the confidence requirement.


C Algorithm for Generating Intervals for Hold Tableaux

















Input: a = <a1, a2, . . . , an>, b = <b1, b2, . . . , bn> and ĉ
Output: set of candidate intervals C
C := ∅;
A0 := 0; B0 := 0;
for i from 1 to n do
    Ai := Ai−1 + ai; Bi := Bi−1 + bi;
for i from n down to 1 do
    compute HiA and HiB depending on model;
lmax := ⌈log_{1+ε} areaB(1, n)⌉;
initialize r1, r2, . . . , rlmax to 1;
for each i from 1 to n do
    jmax := 0;
    for each l from 1 to lmax do
        while areaB(i, rl + 1) ≦ (1 + ε)^l do
            rl++;
            if rl ≧ n then
                break;
        if areaB(i, rl) = 0 then
            conf(i, rl) := 1;
        else
            conf(i, rl) := areaA(i, rl)/areaB(i, rl);
        if conf(i, rl) ≧ ĉ/(1 + ε)
            then jmax := rl;
    if jmax > 0 then
        C := C ∪ {[i, jmax]};










D Proof of Theorem 1

We now give the proof of Theorem 1, which states that the running time of our hold tableau interval selection algorithm, for the balance, credit and debit models, is $O(n \log_{1+\varepsilon}\mathrm{area}_B(1, n))$, which is $O(\frac{1}{\varepsilon}\, n \lg \mathrm{area}_B(1, n))$ if ε≦1.


First, note that computing areaA(i, j), areaB(i, j), and con f(i, j) are constant-time operations, since we can precompute $S_j=\sum_{l=1}^{j}A_l$ and $T_j=\sum_{l=1}^{j}B_l$. Then:





areaA(i, j)=(Sj−Si−1)−(j−i+1)HiA,





areaB(i, j)=(Tj−Ti−1)−(j−i+1)HiB, and





con f(i, j)=areaA(i, j)/areaB(i, j).
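A minimal sketch of this precomputation, checked against the direct sums of Definition 7 in the usage note below. The function name `make_area_functions` and the 1-indexed list layout are assumptions of this example; the confidence-1 convention for a zero denominator follows the pseudocode of Appendix C.

```python
from itertools import accumulate

def make_area_functions(A, B, HA, HB):
    """Return constant-time area_a, area_b and conf functions.

    A, B: cumulative sequences with A[0] = B[0] = 0.
    HA, HB: baselines indexed by the start position i (index 0 unused).
    """
    S = list(accumulate(A))  # S[j] = A[1] + ... + A[j], because A[0] = 0
    T = list(accumulate(B))

    def area_a(i, j):
        return (S[j] - S[i - 1]) - (j - i + 1) * HA[i]

    def area_b(i, j):
        return (T[j] - T[i - 1]) - (j - i + 1) * HB[i]

    def conf(i, j):
        denom = area_b(i, j)
        # Convention from Appendix C: a zero-area interval has confidence 1.
        return 1.0 if denom == 0 else area_a(i, j) / denom

    return area_a, area_b, conf
```

On any small input, `area_a(i, j)` should equal `sum(A[l] - HA[i] for l in range(i, j + 1))`, which is a convenient sanity check when adapting the sketch.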


Now, the only remaining issue is how to compute the ril's quickly. For this, we need to make an assumption regarding the HiA's and HiB's.


Lemma 7 Suppose $r_{i-1,l}\le r_{il}$ for all i, l. Then the total time to compute all $r_{il}$'s is $O(n \log_{1+\varepsilon}\mathrm{area}_B(1, n))$.


Proof. A candidate integer x equals $r_{il}$ if and only if $\mathrm{area}_B(i, x)\le(1+\varepsilon)^l$ and $\mathrm{area}_B(i, x+1)>(1+\varepsilon)^l$ (or x=n).


We show how, given $l\le\lceil\log_{1+\varepsilon}\mathrm{area}_B(1, n)\rceil$, to compute all n values $r_{1l}, r_{2l}, r_{3l}, \ldots, r_{nl}$ in a total of O(n) time. (Hence the total running time, over all values of l, will be $O(n \log_{1+\varepsilon}\mathrm{area}_B(1, n))$.)


We start with x=1 and increment x until r1l=x is found. When it is, we try to find r2l, starting with x=r1l (which is safe because r1l≦r2l), in each step testing x and incrementing x, until r2l is found. Once it is, we try to find r3l, starting with x=r2l (which is safe because r2l≦r3l), in each step testing x and incrementing x, until r3l is found. We continue in this way, starting the search for ri+1,l at the value ril just found. We continue in this way until rnl is found.


The key point is that x is never decreased. The number of times x is increased cannot exceed n, and in iterations in which x is not increased, we change from seeking ril to seeking ri+1,l; this can happen at most n times, making for a total of at most 2n iterations. ▪
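The scan in this proof can be sketched for a single level as follows. It is a sketch under the lemma's monotonicity assumption, not the authors' code: the area function is passed in as a callable, `bound` stands for (1+ε)^l, and the value i − 1 is used (as an assumption of this example) to signal that no j ≥ i satisfies the bound.

```python
def breakpoints_for_level(area_b, n, bound):
    """Return r with r[i] = largest j such that area_b(i, j) <= bound (i - 1 if none).

    Assumes r[i-1] <= r[i], so the pointer x only ever moves to the right and the
    total number of loop iterations over all i is at most 2n.
    """
    r = [0] * (n + 1)  # index 0 unused
    x = 0
    for i in range(1, n + 1):
        x = max(x, i - 1)  # resume from the previous breakpoint
        while x < n and area_b(i, x + 1) <= bound:
            x += 1
        r[i] = x
    return r
```

Calling this once per level l and collecting the results gives all the breakpoints within the time bound of the lemma, provided each area evaluation takes constant time.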


Lemma 8 If HiB≧Hi−1B for all i, then ri−1,l ≦ril for all i, l.


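Proof. In general, $r_{il}$ is the maximum j such that $\sum_{k=i}^{j}(B_k-H_i^B)\le(1+\varepsilon)^l$ and $r_{i-1,l}$ is the maximum j such that $\sum_{k=i-1}^{j}(B_k-H_{i-1}^B)\le(1+\varepsilon)^l$. Dropping the nonnegative term $B_{i-1}-H_{i-1}^B$ from the latter sum gives $\sum_{k=i}^{r_{i-1,l}}(B_k-H_{i-1}^B)\le(1+\varepsilon)^l$. If $H_i^B\ge H_{i-1}^B$, then $\sum_{k=i}^{r_{i-1,l}}(B_k-H_i^B)\le(1+\varepsilon)^l$, from which it follows that $r_{il}\ge r_{i-1,l}$. ▪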


Lemma 9 For the balance, credit, and debit models, HiB≧Hi−1B.


Proof. For the balance and credit models, since $H_i^B=A_{i-1}$, we need simply show that $A_{i-1}\ge A_{i-2}$, which is obvious since A is nondecreasing. For the debit model, $H_i^B=A_{i-1}+\min_{k\ge i}\{B_k-A_k\}$. From the fact that $A_{i-2}\le A_{i-1}$ and $\min_{k\ge i-1}\{B_k-A_k\}\le\min_{k\ge i}\{B_k-A_k\}$, we infer that $H_{i-1}^B\le H_i^B$. ▪


Theorem 1 now follows immediately from Lemmas 7, 8, and 9.


E Proof of Theorem 2

Next, we prove Theorem 2, which guarantees that our hold tableaux interval selection algorithm (1) returns no false positives, and (2) returns no false negatives.


The first part is trivial, as it is obvious from the algorithm that if the output includes an interval [i, j], then con f(i, j)≧ĉ/(1+ε). Modulo the distinction between ĉ and ĉ/(1+ε), there are no “false positives.”


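For the second part, define h such that $(1+\varepsilon)^{h-1}<\mathrm{area}_B(i, j^*)\le(1+\varepsilon)^h$. Since $r_{ih}$ is the largest index j such that $\mathrm{area}_B(i, j)\le(1+\varepsilon)^h$, and $\mathrm{area}_B(i, j^*)\le(1+\varepsilon)^h$, it follows that $r_{ih}\ge j^*$. The algorithm did compute the confidence of interval $[i, r_{ih}]$. We now prove that $\mathrm{conf}(i, r_{ih})\ge\hat{c}/(1+\varepsilon)$, and hence the algorithm will report interval $[i, r_{ih}]$ (or a longer interval, also of confidence at least $\hat{c}/(1+\varepsilon)$). Because $r_{ih}\ge j^*$ and every term $A_l-H_i^A$ is nonnegative, $\mathrm{area}_A(i, r_{ih})\ge\mathrm{area}_A(i, j^*)$, so

$$\mathrm{conf}(i, r_{ih})=\frac{\mathrm{area}_A(i, r_{ih})}{\mathrm{area}_B(i, r_{ih})}\ge\frac{\mathrm{area}_A(i, j^*)}{\mathrm{area}_B(i, r_{ih})}.$$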






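Now $\mathrm{area}_B(i, r_{ih})\le(1+\varepsilon)^h$ and $\mathrm{area}_B(i, j^*)>(1+\varepsilon)^{h-1}$. Therefore $\mathrm{area}_B(i, r_{ih})/\mathrm{area}_B(i, j^*)<1+\varepsilon$. It follows that

$$\mathrm{conf}(i, r_{ih})\ge\frac{\mathrm{area}_A(i, j^*)}{(1+\varepsilon)\,\mathrm{area}_B(i, j^*)}=\frac{1}{1+\varepsilon}\,\mathrm{conf}(i, j^*)\ge\frac{1}{1+\varepsilon}\,\hat{c}.$$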








F Algorithm for Generating Intervals for Fail Tableaux

















Input: a = <a1, a2, . . . , an>, b = <b1, b2, . . . , bn> and ĉ
Output: set of candidate intervals C
C := ∅;
A0 := 0; B0 := 0;
for i from 1 to n do
    Ai := Ai−1 + ai; Bi := Bi−1 + bi;
for i from n down to 1 do
    compute HiA and HiB depending on model;
lmax := ⌈log1+ε areaA(1, n)⌉;
initialize r1, r2, . . . , rlmax to 1;
for each i from 1 to n do
    jmax := 0;
    for each l from 1 to lmax do
        while areaA(i, rl + 1) ≦ (1 + ε)^l do
            rl++;
            if rl ≧ n then
                break;
        if areaB(i, rl) = 0 then
            conf(i, rl) := 1;
        else
            conf(i, rl) := areaA(i, rl)/areaB(i, rl);
        if conf(i, rl) ≦ ĉ(1 + ε)
            then jmax := rl;
    if jmax > 0 then
        C := C ∪ {[i, jmax]};
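The listing above can be sketched in runnable form as follows. This is an illustrative sketch under stated assumptions, not a definitive implementation: it fixes the balance model with both areas measured from the baseline Ai−1, computes lmax by repeated multiplication instead of a logarithm to avoid rounding issues, checks the array bound before evaluating areaA(i, rl + 1), and never lets a pointer fall to the left of i. The function names `balance_model` and `fail_candidate_intervals` are hypothetical.

```python
def balance_model(a, b):
    """Return (area_A, area_B, conf) closures for the balance model.

    Assumption: both areas are measured from the baseline A_{i-1}.
    Intervals are 1-indexed and inclusive, as in the listing.
    """
    n = len(a)
    A, B = [0.0], [0.0]      # cumulative sums, A[0] = B[0] = 0
    for x, y in zip(a, b):
        A.append(A[-1] + x)
        B.append(B[-1] + y)
    SA, SB = [0.0], [0.0]    # running sums of A and B for O(1) area queries
    for i in range(1, n + 1):
        SA.append(SA[-1] + A[i])
        SB.append(SB[-1] + B[i])

    def area_A(i, j):
        return SA[j] - SA[i - 1] - (j - i + 1) * A[i - 1]

    def area_B(i, j):
        return SB[j] - SB[i - 1] - (j - i + 1) * A[i - 1]

    def conf(i, j):
        denom = area_B(i, j)
        return 1.0 if denom == 0 else area_A(i, j) / denom

    return area_A, area_B, conf


def fail_candidate_intervals(a, b, c_hat, eps):
    """Generate candidate fail intervals [i, jmax], one per start index i."""
    n = len(a)
    area_A, _, conf = balance_model(a, b)

    # Smallest lmax >= 1 with (1+eps)^lmax >= area_A(1, n).
    total = area_A(1, n)
    lmax = 1
    while (1 + eps) ** lmax < total:
        lmax += 1

    r = [1] * (lmax + 1)     # r[l] for l = 1..lmax; r[0] is unused
    candidates = []
    for i in range(1, n + 1):
        jmax = 0
        for l in range(1, lmax + 1):
            r[l] = max(r[l], i)  # assumption: an endpoint is never left of i
            # Advance r[l] while the next endpoint stays within level l.
            while r[l] < n and area_A(i, r[l] + 1) <= (1 + eps) ** l:
                r[l] += 1
            if conf(i, r[l]) <= c_hat * (1 + eps):
                jmax = r[l]
        if jmax > 0:
            candidates.append((i, jmax))
    return candidates
```

Because each pointer r[l] only moves to the right as i increases, the inner while loop performs at most n advances per level over the whole run, which is the property the running-time analysis relies on.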










G Proof of Theorem 9

Here, we prove Theorem 3, which guarantees that our fail tableau interval selection algorithm for the balance and debit models (1) returns no false positives, and (2) returns no false negatives.


The first part is trivial, as it is obvious from the algorithm that if the output includes an interval [i, j], then conf(i, j)≦ĉ(1+ε).


For the second part, define h such that (1+ε)^(h−1)<areaA(i, j*)≦(1+ε)^h. Since sih is the largest index j such that areaA(i, j)≦(1+ε)^h, and areaA(i, j*)≦(1+ε)^h, it follows that sih≧j*. The algorithm did compute the confidence of interval [i, sih]. We now prove that conf(i, sih)≦ĉ(1+ε), and hence the algorithm will report interval [i, sih] (or a longer interval, also of confidence at most ĉ(1+ε)).

$$\mathrm{conf}(i, s_{ih}) = \frac{\mathrm{area}_A(i, s_{ih})}{\mathrm{area}_B(i, s_{ih})} \le \frac{\mathrm{area}_A(i, s_{ih})}{\mathrm{area}_B(i, j^*)}.$$






Now areaA(i, sih)≦(1+ε)^h and areaA(i, j*)>(1+ε)^(h−1). Therefore areaA(i, sih)/areaA(i, j*)<1+ε. It follows that

$$\mathrm{conf}(i, s_{ih}) \le \frac{(1+\varepsilon)\,\mathrm{area}_A(i, j^*)}{\mathrm{area}_B(i, j^*)} = (1+\varepsilon)\,\mathrm{conf}(i, j^*) \le (1+\varepsilon)\,\hat{c}.$$








H Proof of Theorem 11

Next, we prove Theorem 5, which applies to the fail tableau interval selection algorithm. First we prove that si−1,l≦sil for all i, l; then we prove that the overall running time is O(n log1+ε areaAc(1, n)).


Because sil uses the balance model in its definition, the fact that si−1,l≦sil follows from the monotonicity of HiA in the balance model.


It is obvious that the running time is O(n log1+ε areaAb(1, n)); what is not obvious is that the running time is O(n log1+ε areaAc(1, n)). Yet

$$\mathrm{area}_A^b(1, n) = \sum_{l=1}^{n} \left[ A_l - A_{i-1} \right] \le \sum_{l=1}^{n} \left[ A_l - \left( A_{i-1} - \min_{k \ge i} \{ B_k - A_k \} \right) \right] = \mathrm{area}_A^c(1, n).$$
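The monotonicity claim si−1,l≦sil is what allows a single pointer per level that never moves left. A minimal sketch of that idea for one level l is given below, assuming the balance-model area areaAb(i, j)=Σ[Al−Ai−1] summed over l from i to j. The names `balance_area`, `pointer_endpoints` and `brute_force_endpoints` are hypothetical, and the value i−1 is used here to denote "no valid endpoint" (the empty interval). The pointer version is compared against a brute-force scan, and the number of pointer advances is counted.

```python
def balance_area(a):
    """Return area_A^b(i, j) = sum_{l=i..j} (A_l - A_{i-1}), 1-indexed."""
    n = len(a)
    A = [0.0]
    for x in a:
        A.append(A[-1] + x)
    S = [0.0]                      # running sums of A for O(1) queries
    for i in range(1, n + 1):
        S.append(S[-1] + A[i])

    def area(i, j):
        return S[j] - S[i - 1] - (j - i + 1) * A[i - 1]

    return area


def pointer_endpoints(a, eps, level):
    """Largest endpoint per start index, using one never-retreating pointer."""
    n = len(a)
    area = balance_area(a)
    bound = (1 + eps) ** level
    ends, advances, s = [], 0, 0
    for i in range(1, n + 1):
        s = max(s, i - 1)          # empty interval [i, i-1] has area 0
        while s < n and area(i, s + 1) <= bound:
            s += 1
            advances += 1
        ends.append(s)
    return ends, advances


def brute_force_endpoints(a, eps, level):
    """Same endpoints, recomputed from scratch for every start index."""
    n = len(a)
    area = balance_area(a)
    bound = (1 + eps) ** level
    ends = []
    for i in range(1, n + 1):
        best = i - 1
        for j in range(i, n + 1):
            if area(i, j) <= bound:
                best = j
        ends.append(best)
    return ends
```

Since the pointer only increases and is bounded by n, the while loop runs at most n times per level in total, so the work over all levels is proportional to n times the number of levels.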









I Proof of Theorem 12

Finally, we prove Theorem 6, which guarantees that our fail tableau interval selection algorithm for the credit model (1) returns no false positives, and (2) returns no false negatives.


The first part is still trivial, as it is obvious from the algorithm that if the output includes an interval [i, j], then confc(i, j)≦ĉ(1+ε).


For the second part, define h such that (1+ε)^(h−1)<areaAb(i, j*)≦(1+ε)^h. Since sih is the largest index j such that areaAb(i, j)≦(1+ε)^h, and areaAb(i, j*)≦(1+ε)^h, it follows that sih≧j*. The algorithm did compute the confidence of interval [i, sih]. We now prove that confc(i, sih)≦ĉ(1+ε), and hence the algorithm will report interval [i, sih] (or a longer interval, also of confc-confidence at most ĉ(1+ε)).

$$\begin{aligned}
\mathrm{conf}^c(i, s_{ih}) &= \frac{\mathrm{area}_A^c(i, s_{ih})}{\mathrm{area}_B^c(i, s_{ih})} \\
&= \frac{\mathrm{area}_A^b(i, s_{ih}) + (s_{ih} - i + 1)\,\Delta_i}{\mathrm{area}_B^c(i, s_{ih})} \quad \left(\text{where } \Delta_i = \min_{k \ge i} \{ B_k - A_k \}\right) \\
&\le \frac{\mathrm{area}_A^b(i, s_{ih}) + (s_{ih} - i + 1)\,\Delta_i}{\mathrm{area}_B^c(i, j^*)} \quad \left(\text{because } j^* \le s_{ih}\right).
\end{aligned}$$




We now prove that areaAb(i, sih)≦(1+ε)areaAb(i, j*) and sih−i+1≦(1+ε)(j*−i+1), which, as we will see, will complete the proof.


First, areaAb(i, sih)≦(1+ε)^h and areaAb(i, j*)>(1+ε)^(h−1). Therefore areaAb(i, sih)/areaAb(i, j*)<1+ε.


Second, because Al is nondecreasing in l, and j*≦sih, the average value of Al over the interval [i, j*] is at most the average value of Al over the interval [i, sih]. Therefore

$$\frac{\sum_{l=i}^{j^*} A_l}{j^* - i + 1} \le \frac{\sum_{l=i}^{s_{ih}} A_l}{s_{ih} - i + 1}$$

and hence

$$\frac{\sum_{l=i}^{j^*} \left[ A_l - A_{i-1} \right]}{j^* - i + 1} \le \frac{\sum_{l=i}^{s_{ih}} \left[ A_l - A_{i-1} \right]}{s_{ih} - i + 1}.$$

From this, we have

$$\frac{\mathrm{area}_A^b(i, j^*)}{j^* - i + 1} \le \frac{\mathrm{area}_A^b(i, s_{ih})}{s_{ih} - i + 1},$$

so that

$$\frac{s_{ih} - i + 1}{j^* - i + 1} \le \frac{\mathrm{area}_A^b(i, s_{ih})}{\mathrm{area}_A^b(i, j^*)} < 1 + \varepsilon.$$






It follows that sih−i+1<(1+ε)(j*−i+1).


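The rest is smooth sailing. We have

$$\begin{aligned}
\mathrm{conf}^c(i, s_{ih}) &\le \frac{(1+\varepsilon)\,\mathrm{area}_A^b(i, j^*) + \left[ (1+\varepsilon)(j^* - i + 1) \right] \Delta_i}{\mathrm{area}_B^c(i, j^*)} \\
&= \frac{(1+\varepsilon)\,\mathrm{area}_A^c(i, j^*)}{\mathrm{area}_B^c(i, j^*)} \\
&= (1+\varepsilon)\,\mathrm{conf}^c(i, j^*) \\
&\le (1+\varepsilon)\,\hat{c}.
\end{aligned}$$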
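For concreteness, the credit-model quantities used in this proof can be computed as in the following sketch. It rests on assumptions read off this section rather than on a stated definition: Δi is taken as the minimum of Bk−Ak over k≧i, areaAc(i, j) is taken as areaAb(i, j)+(j−i+1)Δi, and areaBc(i, j) is measured from the baseline Ai−1. The function name `credit_model` and the convention of returning confidence 1 when the denominator is zero (borrowed from the listing in Section F) are illustrative choices.

```python
def credit_model(a, b):
    """Return (delta, conf_b, conf_c) for 1-indexed, inclusive intervals.

    delta[i] = min_{k >= i} (B_k - A_k), computed as a suffix minimum.
    conf_b is the balance-model confidence; conf_c adds the credit term.
    """
    n = len(a)
    A, B = [0.0], [0.0]            # cumulative sums, A[0] = B[0] = 0
    for x, y in zip(a, b):
        A.append(A[-1] + x)
        B.append(B[-1] + y)

    delta = [0.0] * (n + 2)        # delta[1..n] are used
    delta[n + 1] = float("inf")    # sentinel for the suffix minimum
    for i in range(n, 0, -1):
        delta[i] = min(delta[i + 1], B[i] - A[i])

    def area_a_balance(i, j):
        return sum(A[l] - A[i - 1] for l in range(i, j + 1))

    def area_b(i, j):
        return sum(B[l] - A[i - 1] for l in range(i, j + 1))

    def conf_b(i, j):
        denom = area_b(i, j)
        return 1.0 if denom == 0 else area_a_balance(i, j) / denom

    def conf_c(i, j):
        denom = area_b(i, j)
        if denom == 0:
            return 1.0
        # Credit model: every position in [i, j] is credited with delta[i].
        return (area_a_balance(i, j) + (j - i + 1) * delta[i]) / denom

    return delta, conf_b, conf_c
```

With a = [1, 0, 0, 2] and b = [2, 2, 2, 0], the gaps Bk−Ak are 1, 3, 5, 3, so Δ1 = 1 and Δ2 = Δ3 = Δ4 = 3; the interval [2, 3] then has balance-model confidence 0 and credit-model confidence 6/8.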










REFERENCES



  • [1] R. Agrawal, T. Imielinski, A. Swami: Mining Association Rules between Sets of Items in Large Databases. SIGMOD 1993: 207-216.

  • [2] L. Bravo, W. Fan, S. Ma: Extending Dependencies with Conditions. VLDB 2007: 243-254.

  • [3] S. Chaudhuri, A. Das Sarma, V. Ganti, R. Kaushik: Leveraging aggregate constraints for deduplication. SIGMOD 2007: 437-448.

  • [4] F. Chiang, R. Miller: Discovering data quality rules. PVLDB 1(1): 1166-1177 (2008).

  • [5] S. Flesca, F. Furfaro, F. Parisi: Consistent Query Answers on Numerical Databases Under Aggregate Constraints. DBPL 2005: 279-294.

  • [6] W. Fan, F. Geerts, X. Jia, A. Kementsietsidis: Conditional functional dependencies for capturing data inconsistencies. ACM Trans. Database Syst. 33(2): (2008).

  • [7] W. Fan, F. Geerts, L. Lakshmanan, M. Xiong: Discovering Conditional Functional Dependencies. ICDE 2009: 1231-1234.

  • [8] T. Fukuda, Y. Morimoto, S. Morishita, T. Tokuyama: Mining Optimized Association Rules for Numeric Attributes. PODS 1996: 182-191.

  • [9] L. Golab, T. Johnson, N. Koudas, D. Srivastava, D. Toman: Optimizing away joins on data streams. SSPS 2008: 48-57.

  • [10] L. Golab, H. Karloff, F. Korn, D. Srivastava, B. Yu: On generating near-optimal tableaux for conditional functional dependencies. PVLDB 1(1): 376-390 (2008).

  • [11] L. Golab, H. Karloff, F. Korn, A. Saha, D. Srivastava: Sequential Dependencies. PVLDB 2(1): 574-585 (2009).

  • [12] F. Korn, S. Muthukrishnan, Y. Zhu: Checks and Balances: Monitoring Data Quality Problems in Network Traffic Databases. VLDB 2003: 536-547.

  • [13] M. Slade: Dramatic slowdown in credit card debt. New Zealand Herald, 1 May 2009. www.nzherald.co.nz.




FIG. 7 shows a system 70 in which the present invention is illustratively implemented. The system includes a computer or workstation 71 having a hard disk or other non-transitory computer-readable storage medium 72 and a processor 73. Among the information stored in medium 72 is a database 711 containing data records and a body of software, denoted as tableau generating software 713. That software, more particularly, comprises machine-readable instructions that, when executed by processor 73, cause computer or workstation 71 to implement the invention. Computer or workstation 71 also includes other conventional elements (not shown) that those skilled in the art will recognize as desirable for implementing the invention as described herein.


The foregoing merely illustrates the principles of the invention. For example, the invention is illustrated in the context of two data sequences a={a1, a2, . . . ai, . . . an} and b={b1, b2, . . . bi . . . bn} for which the ith pair of values, (ai, bi), is associated with the ith value, ti, of an ordered attribute t={t1, t2, . . . ti, . . . tn}. However, the principles of the invention can be extended to contexts in which there are more than two data sequences, so long as the data in those sequences can be transformed into two sequences that are expected to obey a conservation law. Such an application might be, for example, where one desires to analyze the entry/exit data of a building into which people enter through a main door, exit through that main door, and also may exit through an exit-only side door.
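The building example can be reduced to two sequences by adding the two exit counts per time step. The sketch below illustrates that transformation only; the function name `to_conservation_pair` and the argument names are hypothetical, and the counts in the usage note are made-up illustrative values.

```python
def to_conservation_pair(entries_main, exits_main, exits_side):
    """Combine three per-time-step count sequences into two.

    Returns (inbound, outbound), where outbound sums both exit doors,
    so that the pair is expected to obey a conservation law.
    """
    if not (len(entries_main) == len(exits_main) == len(exits_side)):
        raise ValueError("all sequences must cover the same time steps")
    inbound = list(entries_main)
    outbound = [m + s for m, s in zip(exits_main, exits_side)]
    return inbound, outbound
```

For instance, entries [5, 3, 0] with main-door exits [1, 2, 2] and side-door exits [0, 1, 2] yield the pair ([5, 3, 0], [1, 3, 4]), whose totals agree.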


It will thus be appreciated that those skilled in the art will be able to devise various alternative implementations which, even if not shown or described herein, embody the principles of the invention and thus are within its spirit and scope.

Claims
  • 1. A computer-implemented method comprising carrying out a conservation-law-based analysis of data that do not strictly obey a conservation law.
  • 2. The method of claim 1 wherein said method includes generating data indicating a selected one of a) the degree to which the data obeys a strict conservation law and b) the degree to which the data does not obey said strict conservation law.
  • 3. The method of claim 2 wherein the data comprises a set of numerical sequences a={a1, a2, . . . ai, . . . an} and b={b1, b2, . . . bi . . . bn} with ai, bi, . . . ≧0, for which an ith set of values, (ai, bi) is associated with the ith value, ti, of an ordered attribute t={t1, t2, . . . ti, . . . tn}, and wherein the strict conservation law is that the sums of the values in one or more of the sequences up to a value T of the ordered attribute t equals the sums of the values in the others of the sequences for all t=T.
  • 4. A computer-implemented method comprising generating a tableau for a data set, wherein the data set comprises a pair of numerical sequences a={a1, a2, . . . ai, . . . an} and b={b1, b2, . . . bi . . . bn} with ai, bi≧0 and for which the ith pair of values, (ai, bi), is associated with the ith value, ti, of an ordered attribute t={t1, t2, . . . ti, . . . tn}, wherein the tableau comprises one or more subsets of values of the ordered attribute t that meet at least a first specified criterion, wherein the first specified criterion is that, for at least a specified fraction ŝ of the data set, a confidence measure for the pairs of values associated with each subset in the tableau is a selected one of being equal to a) at least a confidence value ĉ and b) no more than a confidence value ĉ, wherein said confidence measure for the pairs of values associated with each subset in the tableau is a measure of the degree to which those pairs of values deviate from an exact conservation law.
  • 6. The method of claim 4 wherein for any interval {i, i+1, . . . , j}, the confidence measure for the pairs of values {(ai, bi), (ai+1, bi+1), . . . , (aj, bj)} is a function of an area between two curves A={A1, A2, . . . Ai, . . . An} from a where A0=0 and Ai=Σj≦i aj and B={B1, B2, . . . Bi, . . . Bn} from b where B0=0 and Bi=Σj≦i bj, and wherein said area is the area between a segment of curve A between Ai and Aj and a segment of curve B between Bi and Bj.
  • 7. A non-transitory computer-readable medium having stored thereon instructions which, when implemented by a processor, generate a tableau for a data set, wherein the data set comprises a pair of numerical sequences a={a1, a2, . . . ai, . . . an} and b={b1, b2, . . . bi . . . bn} with ai, bi≧0 and for which the ith pair of values, (ai, bi), is associated with the ith value, ti, of an ordered attribute t={t1, t2, . . . ti, . . . tn}, wherein the tableau comprises one or more subsets of values of the ordered attribute t that meet at least a first specified criterion, wherein the first specified criterion is that, for at least a specified fraction ŝ of the data set, a confidence measure for the pairs of values associated with each subset in the tableau is a selected one of being equal to a) at least a confidence value ĉ and b) no more than a confidence value ĉ, wherein said confidence measure for the pairs of values associated with each subset in the tableau is a measure of the degree to which those pairs of values deviate from an exact conservation law.
  • 8. The non-transitory computer-readable medium of claim 7 wherein said exact conservation law is that the sum of the values in the sequence a up to a value T of the ordered attribute t equals the sum of the values in sequence b for all t=T.
  • 9. The non-transitory computer-readable medium of claim 8 wherein for any interval {i, i+1, . . . , j}, the confidence measure for the pairs of values {(ai, bi), (ai+1, bi+1), . . . , (aj, bj)} is a function of an area between two curves A={A1, A2, . . . Ai, . . . An} from a where A0=0 and Ai=Σj≦i aj and B={B1, B2, . . . Bi, . . . Bn} from b where B0=0 and Bi=Σj≦i bj, and wherein said area is the area between a segment of curve A between Ai and Aj and a segment of curve B between Bi and Bj.