The present invention relates to analysis of computerized data streams in general, and in particular to a computerized method for detecting change points in data streams.
Modern computing technology makes it possible to gather and process large quantities of data in a variety of fields such as finance, commerce, and operations. In some cases, efficient and quick analysis of such high-speed data streams can be very valuable for detecting a change in trends or conditions as early as possible. Click-through stream mining in e-commerce, where the goal of the application is to predict shopping behavior or the effect of advertising, is one notable example. Additional examples of high-speed data streams include computerized production environment monitoring applications whose goal is failure detection, traffic monitoring applications that give driving recommendations or on-line alerts, and power grid applications for detecting changes in load profiles and forecasts. In all those scenarios analysis is best done on-line, at the speed at which the data is arriving, as a delay in analysis would often translate into a delayed response, which can be costly.
In almost every one of these scenarios, the data streams are affected in one way or another by human behavior, which itself changes in response to the physical world (time of day or season), fashion, fads, psychological reasons, action by trendsetters, current events, or the economy. Any data stream analysis algorithm must therefore take into account and respond to the non-stationary nature of data distribution.
Furthermore, in many application domains, the change in the underlying distribution of the data is the most interesting event of all. In e-commerce, it can be the result of a change in the competitive scenario. In computerized environment monitoring, it can signal the spread of a new type of failure—such as a new computer virus. Lastly, in stock trading it may signal the move from a bull to a bear market or vice versa. Changes in the mechanism which generates the data are denoted concept drifts. They are especially important because they evoke a need for new responses, different from those dictated by models which were learned before the change occurred.
Most data stream mining algorithms acknowledge the need to handle concept drifts. Two approaches are prevalent: one is to discard old observations; the other is to relearn the model, or parts of the model, when a concept drift becomes evident. However, most data stream mining algorithms rely on a decline in the performance of the model as the indication of concept drift. This method, while sometimes effective, has no statistical backing and therefore can be expected to yield inferior results compared to statistically based change point detection algorithms.
From a statistical point of view, the change point detection problem can be solved optimally by computing the prefix of the current sequence of samples which maximizes the probability that the suffix was sampled from a different distribution. This can be done subject to a set of assumptions on the distribution of the samples (e.g., that it is Normal) and of changes (e.g., that their arrival rate is Poissonian). This approach is, however, impractical for a large number of samples. The state of the art in statistical change point detection on data streams is therefore to use the Page-Hinkley test (PHT), whose run-time is linear in the number of samples. In a streaming setup that would mean maintaining a test statistic of constant size and performing O(1) updates to it per new sample. Naturally, run-time performance like this can only be achieved at a significant cost in terms of false alarm rate, the number of samples needed to detect a change, and the accuracy at which the change point is detected.
The present invention relates to an alternative to PHT which relies on the best practice of solving the more informed problem of testing whether two sets of samples were derived from the same distribution. The algorithms of the invention make use of the unique convergence properties of two-sample tests to probabilistically find the point which maximizes their value. That point closely approximates the change point. As both analysis and experiments show, the probabilistic algorithm of the invention maintains just O(1) candidate change points and their related aggregate information. Therefore, it requires only O(1) update operations per new sample, which is comparable with PHT. However, because the two-sample tests used by the invention are much more powerful than PHT, and because the probabilistic algorithm of the invention does not degrade that power significantly, the algorithm of the invention is far better than PHT both in terms of the false negative to false positive rate and in terms of the accuracy with which it locates the change point. This superiority is further exemplified in a simple application in which the algorithm monitors the mean of a piecewise stationary data stream at far better accuracy than that achieved using PHT or other previous approaches.
Notations
Let Xn={x0, x1, . . . , xn} be a prefix of an open-ended stream of samples such that xi∈D. For each point i in the prefix, denote the samples x0, . . . , xi−1 the head of the prefix and the samples xi, . . . , xn the tail of the prefix. When, for some point in the stream, the head and the tail follow different distributions, that point is denoted xc.
All of the tests described herein measure a test statistic on the stream and indicate a change whenever that statistic exceeds a user-provided constant λ. The timeliness of a test is the minimal n larger than c at which the test statistic exceeds λ. The run length of a test is the n at which the test statistic first exceeds λ even though no change has occurred (i.e., n<c). Since the run length depends on random variations in the data, we usually refer to the average run length (ARL), which is its average over multiple executions. In all of the algorithms discussed herein, the test indicates not only the fact of the change but also the point xmax at which it suspects the change occurred. The difference of that point from the actual change point, |max−c|, is the test accuracy.
Let f be a two-sample test statistic. We denote by fi(n) the same test statistic as applied to the head and the tail of a prefix of size n, relative to the ith point. Note that because fi(n) is not independent of either fi(n−1) or fj(n) for j≠i, the original statistical meaning of f is lost. The test statistics retain, however, important convergence properties, as discussed further below.
The Page-Hinkley Test (PHT)
The Page-Hinkley test (PHT) is based on the concept of the log-likelihood ratio. The key statistical property of this ratio is that a change in the mean of the data is reflected as a change in the sign of the mean value of the log-likelihood ratio. That is, the ratio exhibits a negative drift before the change and a positive drift after the change. This difference in behavior is the key to detecting the change.
PHT assumes that the observed samples follow a normal distribution. It also assumes that the true mean μ before change is known. This is usually not the case in real-life data, but it is possible to estimate the mean by averaging the observed samples.
Let μn denote the sample mean of the samples x0, x1, . . . , xn. PHT involves a cumulative variable
$$U_n = \sum_{i=0}^{n} \left( x_i - \mu_n - \delta \right),$$
defined as the difference between the observed samples xi and their sample mean μn, cumulated up to step n, where δ is a minimum change magnitude to be detected, which is selected a priori. The minimum value
$$m_n = \min_{0 \le k \le n} U_k$$
of this variable is also computed and updated on-line. The difference between the variable and its minimum value, Un−mn, is the test statistic that is monitored. When this difference is greater than the given threshold λ, the test alerts that an increase in the mean has occurred. Increasing λ causes fewer false alarms, but might delay the detection of some change points or miss them altogether. Given that a change is detected, the estimated change point, xmax, is the sample at which the minimum value mn was last obtained.
Since the mean can either decrease or increase, PHT can be executed twice to detect changes in both directions (see Alg.1).
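The one-sided test described above can be sketched in Python as follows. This is a minimal illustration and not Alg. 1 itself: the class and attribute names are hypothetical, and the sample mean is updated incrementally at every step (an assumption made here to keep each update O(1)) rather than recomputed over the whole prefix.

```python
class PageHinkley:
    """One-sided Page-Hinkley test for an increase in the mean (illustrative sketch)."""

    def __init__(self, delta, lam):
        self.delta = delta   # minimum change magnitude, chosen a priori
        self.lam = lam       # alert threshold (lambda)
        self.n = 0           # number of samples seen so far
        self.mean = 0.0      # running sample mean (assumption: updated incrementally)
        self.u = 0.0         # cumulative variable U_n
        self.m = 0.0         # running minimum m_n
        self.argmin = 0      # index at which the minimum was last obtained

    def update(self, x):
        """Feed one sample; return the estimated change index on alert, else None."""
        self.mean += (x - self.mean) / (self.n + 1)
        self.u += x - self.mean - self.delta
        if self.u <= self.m:          # "<=" keeps the LAST index attaining the minimum
            self.m = self.u
            self.argmin = self.n
        self.n += 1
        if self.u - self.m > self.lam:
            return self.argmin
        return None
```

To detect changes in both directions, one option is to run a second instance on the negated samples (feeding `-x`), so that a decrease in the mean appears as an increase to that instance.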
The χ2 Two-Sample Test
The χ2 two-sample test is a standard statistical tool for comparing two samples over the same categorical domain C. For two samples, one of size S, with Si samples in every category Ci∈C, and the other of size R, with Ri samples in every category Ci∈C, the χ2 test requires that a simple statistic, Eq. 1, be computed.
The predominant characteristic of the χ2 test is that if the two samples are derived from the same (unknown) distribution, the statistic, itself a random variable, follows a known distribution: the χ2 distribution with |C|−1 degrees of freedom. If, on the other hand, the two samples come from distributions in which the means of some categories are different, then the statistic tends to grow as the two samples grow.
When applied to the head and the tail of the prefix of a stream, as denoted above, the χ2 test statistic, χi2, can be rewritten according to Eq. 1 as:
To simplify the explanation, we consider below the simple case in which there are only two categories. The method of the invention generalizes directly to the χ2 test with more than two categories, and that generalization can be applied by any person skilled in the art.
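As an illustration of the two-category case, the sketch below computes a two-sample χ2 statistic for the head and the tail of a 0/1 prefix. It assumes the common form of the statistic for samples of unequal size, in which each category contributes (√(R/S)·Si − √(S/R)·Ri)² / (Si + Ri); this may be written differently from Eq. 1 and Eq. 2. The function names are hypothetical.

```python
import math

def chi2_two_sample(s_counts, r_counts):
    """Two-sample chi-square statistic from per-category counts of two samples.

    Assumed form (unequal sample sizes):
        sum_i (sqrt(R/S) * S_i - sqrt(S/R) * R_i)^2 / (S_i + R_i)
    """
    S, R = sum(s_counts), sum(r_counts)
    if S == 0 or R == 0:
        return 0.0  # undefined for an empty sample; treated here as "no evidence"
    total = 0.0
    for s, r in zip(s_counts, r_counts):
        if s + r > 0:  # a category absent from both samples contributes nothing
            total += (math.sqrt(R / S) * s - math.sqrt(S / R) * r) ** 2 / (s + r)
    return total

def chi2_at_point(bits, i):
    """Statistic for the head bits[:i] versus the tail bits[i:] of a 0/1 prefix."""
    head, tail = bits[:i], bits[i:]
    return chi2_two_sample([head.count(0), head.count(1)],
                           [tail.count(0), tail.count(1)])
```

For a prefix of ten zeros followed by ten ones, `chi2_at_point(bits, 10)` is larger than `chi2_at_point(bits, 5)`, which matches the intuition that the statistic peaks at the change point.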
The Student's Two-Sample t-Test
Like the two-sample χ2 test, Student's two-sample t-test determines whether the mean has changed between two samples. However, Student's t-test applies to real-valued samples rather than categorical ones. Let nS, X̂S, and νS be the number of samples, the sample mean, and the unbiased estimator of the variance of one sample, and let nR, X̂R, and νR be the same aggregates for the other sample, respectively. The Student's t-test statistic is:
When the test is applied to the head and the tail of a prefix of a stream, Ti can be written as:
The aggregates i, X̂S, and νS require no update when a new sample is taken. The aggregates n, X̂R and νR can be updated incrementally by using the aggregates sumRn and sumRn2. The sample mean is
$$\hat{X}_R = \frac{\mathit{sumR}_n}{n_R},$$
where sumRn=sumRn−1+xn. The unbiased estimator of the variance is
$$\nu_R = \frac{\mathit{sumR}^2_n - (\mathit{sumR}_n)^2 / n_R}{n_R - 1},$$
where the sum of squares is updated as
$$\mathit{sumR}^2_n = \mathit{sumR}^2_{n-1} + x_n^2.$$
The test is considered valid when each sample is indeed random, the samples are independent, and the samples follow a normal distribution with an unknown mean.
The predominant characteristic of Student's t-test is that if both samples are derived from the same unknown distribution, then the test statistic has a known distribution—Student's t distribution with the degrees of freedom calculated using
If, on the other hand, the two samples come from distributions in which the mean is different, then the value computed by the test statistic tends to grow with every increase in sample sizes.
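The incremental computation can be sketched as follows. The sketch assumes the unequal-variance (Welch) form of the t statistic and the Welch-Satterthwaite approximation for the degrees of freedom; both are assumptions, since the exact forms of Eq. 3 and of the degrees-of-freedom formula are not reproduced in the text. The function names are hypothetical, and the statistic is returned signed, so a caller comparing against a threshold may want its absolute value.

```python
import math

def mean_and_variance(n, total, total_sq):
    """Sample mean and unbiased variance from a count, a sum and a sum of squares."""
    mean = total / n
    var = (total_sq - total * total / n) / (n - 1)
    return mean, var

def welch_t(n_s, sum_s, sum_s2, n_r, sum_r, sum_r2):
    """Signed two-sample t statistic and approximate degrees of freedom.

    Assumption: unequal-variance (Welch) form; requires n_s, n_r >= 2 and
    at least one sample with non-zero variance.
    """
    mean_s, var_s = mean_and_variance(n_s, sum_s, sum_s2)
    mean_r, var_r = mean_and_variance(n_r, sum_r, sum_r2)
    a, b = var_s / n_s, var_r / n_r
    t = (mean_s - mean_r) / math.sqrt(a + b)
    dof = (a + b) ** 2 / (a * a / (n_s - 1) + b * b / (n_r - 1))
    return t, dof
```

Because only counts, sums, and sums of squares are needed, each candidate's tail can be updated in constant time per new sample.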
Confidence Intervals on the Mean
Let R be a sample of size n which follows the binomial distribution Bin(n, p). If p̂ is the sample mean of R, then the normal approximation interval estimates that, with probability greater than 1−α, the value of p is in the range
$$\hat{p} \pm Z_{1-\alpha/2} \sqrt{\frac{\hat{p}(1-\hat{p})}{n}} \qquad (5)$$
Here, Z1−α/2 denotes the 1−α/2 percentile of a standard normal distribution N(0, 1).
If R follows the normal distribution N(μ, σ2), and p̂ and sd are the unbiased estimators of the mean and the standard deviation of R, the approximation interval estimates that, with probability greater than 1−α, the value of the actual mean μ is in the range:
$$\hat{p} \pm t^{*}_{1-\alpha/2} \frac{sd}{\sqrt{n}} \qquad (6)$$
Here, t*1−α/2 denotes the 1−α/2 percentile of Student's t distribution.
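As a small worked example of the binomial case, the sketch below computes the normal approximation interval using only the Python standard library. The function name is hypothetical. The normal-distribution case would additionally need a percentile of Student's t distribution, which the standard library does not provide; a library routine such as SciPy's `scipy.stats.t.ppf` could be used for that.

```python
import math
from statistics import NormalDist

def binomial_mean_interval(successes, n, alpha):
    """Normal approximation interval for the mean p of n Bernoulli samples."""
    p_hat = successes / n
    z = NormalDist().inv_cdf(1 - alpha / 2)   # the 1 - alpha/2 percentile of N(0, 1)
    half_width = z * math.sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - half_width, p_hat + half_width
```

With 50 successes in 100 samples and α=0.05, the interval is roughly 0.40 to 0.60.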
It is an object of the present invention to present a computerized method for detecting a change point in a data stream.
It is another object of the present invention to present a computerized method for detecting a change point in a data stream by using a two-sample test on candidate points of the data stream.
The present invention thus relates to a computerized method for detecting a change point in a data stream by testing whether two sets of samples from the data stream were derived from the same distribution, wherein the test uses the unique convergence properties of the two sample tests to probabilistically find the point which maximizes their value, said point closely approximating the change point.
In some embodiments, the test used is the χ2 two-sample test.
In some embodiments, the method comprises the steps of:
(i) maintaining a list of candidate change points in the data stream, and relevant aggregate information;
(ii) adding each new point in the data stream as candidate;
(iii) computing an upper bound and a lower bound on the long term value of the χ2 two-sample test for every candidate in the list;
(iv) purging from the list candidates whose long term upper bound value is lower than the long term lower bound values of other candidates, with high probability; and
(v) indicating a change point when one candidate exceeds a given threshold.
In some embodiments, the relevant aggregate information comprises the number of points, the number of occurrences of data from different categories, or other statistics which can be incrementally updated with every new sample.
In some embodiments, the test used is the Student's t-test.
In some embodiments, the method comprises the steps of:
(i) maintaining a list of candidate change points in the data stream, and relevant aggregate information;
(ii) adding each new point in the data stream as candidate;
(iii) computing an upper bound and a lower bound on the long term value of the Student's-t two-sample test for every candidate in the list;
(iv) purging from the list candidates whose long term upper bound value is lower than the long term lower bound values of other candidates, with high probability; and
(v) indicating a change point when the test value for one candidate exceeds a given threshold.
In some embodiments, the relevant aggregate information comprises the number of points, the sum of the data, the sum of the squares of the data, or other statistics which can be incrementally updated with every new sample.
In some embodiments, the test used is the mean estimation algorithm.
In some embodiments, the method comprises the steps of:
(i) maintaining the sum of the data and number of samples;
(ii) updating said sum and number with every new data sample;
(iii) removing from said sum and number the sum and number of the data in the first set of the data for the candidate which indicates a change;
(iv) using the current sum and number to compute the average which is the estimation for the mean; and
(v) indicating a change point when the test value for one candidate exceeds a given threshold.
In some embodiments, the test used is any two-sample test.
In another aspect, the present invention relates to a non-transitory computer-usable medium having computer readable instructions stored thereon for execution by a processor to perform a computerized method for detecting a change point in a data stream by testing whether two sets of samples from the data stream were derived from the same distribution, wherein the test uses the unique convergence properties of the two sample tests to probabilistically find the point which maximizes their value, said point closely approximating the change point.
FIGS. 3a-3b are graphs of a typical experiment,
FIGS. 7a-7b are graphs showing the cost average of the ProTO-χ2 experiment of
FIGS. 10a-10b are graphs of the cost average of ProTO-T, showing that ProTO-T uses fewer than one thousand candidates before the change.
In the following detailed description of various embodiments, reference is made to the accompanying drawings that form a part thereof, and in which are shown by way of illustration specific embodiments in which the invention may be practiced. It is understood that other embodiments may be utilized and structural changes may be made without departing from the scope of the present invention.
Convergence Properties of χi2(n) and Ti(n)
Below, the long-term behavior of χi2(n) and Ti(n) is examined as n grows toward infinity, and it is shown how to induce an upper and a lower bound for the value to which each of χi2(n) and Ti(n) will converge. The expected dominance of the change point statistic is also analyzed.
An Upper and a Lower Bound for the Projected Test Statistic
Assume that the samples of a stream follow the Bernoulli distribution and that the sample mean of the head of a point i is q̂ while the actual mean of its tail is p. The χ2 test statistic for a point i has a useful convergence property: since the sample mean of the tail tends to p as n grows, χi2(n) will eventually tend to a constant which depends only on the difference of q̂ from p and on the size of the head:
Similarly, if the samples of a stream follow the normal distribution and the sample mean of the head of a point i is X̂S while the actual mean of its tail is μR, the Student's t-test statistic for a point i will eventually tend to a constant:
Eq. 7 and Eq. 8 induce an upper and a lower bound for the value to which χi2(n) and Ti(n) will converge, respectively. If at sample n the sample mean of the head of a point i is q̂ and the average of its tail is q̂n, then by replacing p with the confidence interval in Eq. 5 we gain a confidence interval on the limit of χi2(n). As a result, the maximal expected value (i.e., the upper bound), χiu, of χi2(n) is
The minimal expected value (i.e., the lower bound), χil, of χi2(n) has two different cases. If
then it might be as low as zero. Otherwise it is Eq. 10:
Similarly, if at sample n X̂S and νS are the sample mean and the unbiased estimator of the variance of the head, respectively, and X̂n and sdn are the average and the standard deviation of the tail, respectively, then by replacing μR with the confidence interval in Eq. 6 we gain a confidence interval on the limit of Ti(n). As a result, the maximal expected value, Tiu, of Ti(n) is
The minimal expected value, Til, of Ti(n) has two different cases. If X̂S∈
then it might be as low as zero. Otherwise it is Eq. 12:
Expected Dominance of the Change Point Statistic
Consider a sequence of samples coming from a piecewise stationary random source. Assume that this random source is binomial and that at time c there is a change. Assume also that the samples before time c follow the binomial distribution Bin(c, q) and that the samples after time c follow the binomial distribution Bin(n−c, p). Consider three different points in that sequence: the change point c, c+m, and c−m (see
Assume that at sample n the sample mean of the head for point c is q̂ while the sample mean of its tail is p̂. Since p̂ tends to p as n grows, χc2(n) will eventually tend to a constant according to Eq. 7.
Similarly, assume that the sample mean of the head for point c+m is qc+m while the sample mean of its tail is pc+m. As can be seen in
Since, pc+m tends to p as n grows, χc+m2(n) will eventually tend to a constant:
Similarly, assume that the sample mean of the head for point c−m is qc−m while the sample mean of its tail is pc−m. As can be seen in
Now, consider the chances that χc2(n) is dominated by either χc+m2(n) or χc−m2(n). For this to happen, Eq. 13 should be greater than Eq. 7. The resulting inequality has two roots: the first root occurs when p̃ is greater than
Using the Hoeffding bound, it can be shown that this probability can be bounded from above by Eq. 15, which decreases exponentially as the ratio of c² to m increases. It follows that if the change occurs after a significant number of samples, then the change point statistic is likely to eventually dominate nearby points.
The second root occurs when p̃ is lower than p−½(p−q̂). Using the Hoeffding bound, it can be shown that this probability can be bounded by Eq. 16, which decreases exponentially as m increases. Again, the chances that the change point statistic will dominate that of nearby points are overwhelming.
Point c−m can be similarly analyzed. Note that the first m samples in the tail of the point c−m follow the distribution Bin (m, p) and the c−m samples in its head follow the distribution Bin (c−m, q). Consider the chances that χc2(n) is dominated by χc−m2(n). For this to happen, Eq. 14 should be greater than Eq. 7. The resulting inequality has two roots: the first root occurs when
Similarly, the second root occurs when
Expected Dominance When No Change Occurs
Our analysis is also valid when no change occurs in the distribution of the random source. In this case, the greater the length of the head of a point, the closer q̂ is to p. Consider, instead of c, the point max for which |q̂−p| is maximal. Now, Eqs. 15 to 18 can all equally be applied to the difference between χmax2(n) and χmax−m2(n), χmax+m2(n), with the same consequences. It follows that even when no change occurs, one point is likely to dominate.
The analysis provided here has two limitations. First, it considers a single pair of points when in reality there are multiple interdependent points. Dependency among points could mean that if one point's statistic overshadows the statistic of c, so will the statistics of other points. However, central to our purpose is that the chances that any point would ever dominate the one which has the maximal χ2 value diminish exponentially with the distance between those points. Second, the analysis provided here is limited to the simpler test, the χ2 test. Nonetheless, our experiments reveal no real difference between Student's t-test and the χ2 test, and thus hint that the analysis might hold for Student's t-test as well.
Change Point Detection Using the χ2 Two-Sample Test
The Probabilistic Test Optimization algorithm, ProTO-χ2 (see Alg. 2), maintains a set of candidate change points C. Every candidate i∈C has two pairs of aggregates: Si0 and Si1 for the head, and Ri0 and Ri1 for the tail. At every new sample xn, the algorithm increases either Ri0 or Ri1 for every candidate i∈C, depending on whether xn is zero or one. Then, the algorithm recalculates χi2(n) according to Eq. 2, and recalculates χil and χiu according to Eq. 10 and Eq. 9, respectively, with
The last step taken after every new sample xn is to update the candidate set. A new candidate is first added to C, whose tail aggregates are zero and whose head aggregates are the sums of the respective head and tail aggregates of the first candidate in C. Then, the algorithm reviews the candidate set and purges unneeded candidates according to the following criteria: Let max denote the candidate whose statistic, χmax2(n), is the highest among those in C. Also, let red denote the candidate whose lower bound statistic, χredl, is the highest lower bound in C. As can be seen in
ProTO-χ2 retains any candidate iεC whose χiu is greater than χredl, as these are the candidates whose χi2(n) might eventually exceed that of both candidates max and red. All the other candidates in C are then discarded. ProTO-χ2 also checks whether the candidate max has passed the threshold λ. If it has, an alert is indicated with the suspected change point indicated to be max.
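The purge-and-alert step can be sketched as follows. This is a minimal illustration rather than Alg. 2: the `Candidate` record and the function name are hypothetical, and the statistic and its two bounds are assumed to have been computed already for every candidate.

```python
from dataclasses import dataclass

@dataclass
class Candidate:
    index: int     # stream position of the candidate change point
    stat: float    # current test statistic of the candidate
    lower: float   # projected lower bound on its long-term value
    upper: float   # projected upper bound on its long-term value

def purge_and_check(candidates, lam):
    """Keep candidates whose upper bound exceeds the highest lower bound.

    Returns (kept_candidates, alert_index); alert_index is None when the
    maximal statistic has not passed the threshold lam.
    """
    if not candidates:
        return [], None
    best = max(candidates, key=lambda c: c.stat)    # the candidate "max"
    red = max(candidates, key=lambda c: c.lower)    # the candidate "red"
    kept = [c for c in candidates if c.upper > red.lower]
    alert = best.index if best.stat > lam else None
    return kept, alert
```

The rule is applied literally here: a candidate survives only if its upper bound is strictly greater than the highest lower bound in the set. Whether max and red should be retained unconditionally is a design choice that this sketch leaves open.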
Change Point Detection Using the Student's t-Test
ProTO-T (see Alg. 3) is very similar to ProTO-χ2. The main difference is in the aggregates it maintains for every candidate and in the statistic computed for each one. Every candidate i∈C has two pairs of aggregates: sumSi and sumSi2 for the head, and sumRi and sumRi2 for the tail. At the arrival of a new sample xn, all the aggregates in the tail of candidate i are updated as follows: sumRi←sumRi+xn and sumRi2←sumRi2+(xn)2. Similarly to ProTO-χ2, ProTO-T also recalculates, for every candidate i, Ti(n) according to Eq. 4, and recalculates Til and Tiu according to Eq. 12 and Eq. 11, respectively, with X̂n≐X̂R and sdn≐√νR.
At every new sample xn, ProTO-T also creates a new candidate and adds it to the set C. The tail aggregates of the new candidate are empty and its head aggregates are the sums of the respective head and tail aggregates of the first candidate in C, which are computed as follows: sumSi←sumSfirst+sumRfirst and sumSi2←sumSfirst2+sumRfirst2 (it should be noted that the sum of sumSi and sumRi is the same for all i, as is the sum of sumSi2 and sumRi2). Then, the algorithm locates the candidates max, with the maximal Tmax(n) value, and red, whose Tredl is maximal, and purges redundant candidates in the same way ProTO-χ2 does. Finally, ProTO-T indicates a change at max if Tmax(n) surpasses λ.
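The aggregate bookkeeping described above can be sketched as follows. The names `TCandidate` and `observe` are hypothetical, purging is omitted, and the handling of the very first sample (when the candidate set is still empty) is an assumption of this sketch.

```python
from dataclasses import dataclass

@dataclass
class TCandidate:
    index: int            # position of the first sample in this candidate's tail
    sum_s: float          # head: sum of samples
    sum_s2: float         # head: sum of squared samples
    sum_r: float = 0.0    # tail: sum of samples
    sum_r2: float = 0.0   # tail: sum of squared samples

def observe(candidates, x, n):
    """Process sample x_n: update every tail, then append a new candidate."""
    for c in candidates:
        c.sum_r += x
        c.sum_r2 += x * x
    if candidates:
        first = candidates[0]
        head, head2 = first.sum_s + first.sum_r, first.sum_s2 + first.sum_r2
    else:
        head, head2 = x, x * x   # assumption: bootstrap from the first sample
    candidates.append(TCandidate(index=n + 1, sum_s=head, sum_s2=head2))
```

After each call, the head sum plus the tail sum is the same for every candidate, as is the sum of the two sums of squares, which is the invariant noted in the text.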
The Mean Estimation Algorithm
Computation of the mean in various scenarios is often used as a toy example, a demonstrator, in data mining. Valuable in itself, this example is also strongly related to a family of clustering algorithms—k-means. In the context of change point detection, we are interested in the benefits of ProTO for mean estimation in piecewise stationary streams. Building on the algorithmic framework of ProTO, the ProTO-Mean algorithm computes an approximation of the mean as the average of all samples seen since the last change.
The main difference between the ProTO-T and the ProTO-Mean algorithms is on line 5: whenever an alert is identified, the ProTO-Mean algorithm treats all of the samples that preceded the indicated change point as if they came from a different distribution. Thus, candidates generated before the indicated change point are discarded. Candidates generated at and after the suspected change point must have the aggregates of the samples gathered before the change point discarded from their head. Since these aggregates are exactly the head aggregates of the candidate which produced the alert, the ProTO-Mean algorithm simply deducts the head aggregates of max from the head aggregates of every candidate. Since ProTO-Mean treats all candidates generated at and after the suspected change point as if they were created after the suspected change point, it deducts max from every candidate i (see line 5(b)iii). Furthermore, the output of ProTO-Mean is the percentage of the sample mean of the head and the sample mean of the tail of any candidate (see Alg. 4).
ProTO-Mean can be compared with an adaptation of PHT for mean estimation. Whenever an alert is indicated, the PHT-Mean algorithm treats all of the samples that preceded the indicated change point as if they came from a different distribution. Thus, PHT-Mean is restarted whenever a change is detected. The output of PHT-Mean is the percentage of the sample mean μn (see Alg. 5).
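The core bookkeeping of change-aware mean estimation, maintaining a sum and a count and deducting the head aggregates of the alerting candidate, can be sketched as follows. This is a simplified illustration and not Alg. 4 or Alg. 5; the class and method names are hypothetical, and the change detector itself is assumed to exist elsewhere.

```python
class ChangeAwareMean:
    """Mean of the samples seen since the last indicated change point."""

    def __init__(self):
        self.total = 0.0   # sum of the samples currently counted
        self.count = 0     # number of samples currently counted

    def add(self, x):
        self.total += x
        self.count += 1

    def discard_head(self, head_sum, head_count):
        """On an alert, drop the samples that preceded the change point."""
        self.total -= head_sum
        self.count -= head_count

    def estimate(self):
        return self.total / self.count if self.count else None
```

For example, after four samples of 0 and two samples of 10, the estimate is 20/6; once the detector reports a change with a head of four samples summing to 0, the estimate becomes 10.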
Experimental Validation
This section describes a series of experiments comparing the average run length, the accuracy, the timeliness and the cost of ProTO to those of PHT.
Typical Experiment
In a typical experiment with the ProTO-T algorithm, random data is sampled from a standard normal distribution for 20,000 samples. Then, at sample 20,000, the mean of the random source is changed by Δ=0.5%. As
FIG. 3b describes the same typical experiment with PHT. As the figure shows, the PHT statistic value, (Un−mn), is generally lower than 20 until sample 20,000, when it begins climbing. At sample 21,500, the PHT statistic value crosses the chosen alert threshold λ. As in the previous experiment, increasing λ would reduce the number of false alarms (two false alarms are evident: at sample 9,500 and at sample 17,000), but would also delay detection of the change.
The accuracy of the change time estimation is also interesting. For PHT, 500 samples separate sample 19,500, in which the last minimum value mn was obtained, and the change point. In comparison, for ProTO-T the candidate with the maximal statistic value which first crosses the chosen alert threshold is the one created at sample 20,006.
The cost of the ProTO-T is proportional to the number of candidate change points it maintains. Since that number has random properties, it is presented in terms of its cumulative distribution.
The performance of a change point detection algorithm is measured in terms of its timeliness (when, if ever, it detects the change), accuracy (how closely it points to the change point) and cost (in our case, the number of candidates it maintains). However, timeliness and accuracy must be presented relative to the rate of false positives, because they can easily be traded against a higher rate of false positives. Thus, in our performance measurement the full range of the tradeoff of accuracy vs. ARL and timeliness vs. ARL is investigated. Similarly, the cost of the algorithm can be reduced at the expense of accuracy and timeliness, and thus our results present that tradeoff. In the performance graphs we also added a line indicating the performance point achieved at reasonable average costs. We prefer this presentation to the three-dimensional graphs (e.g., Accuracy vs. ARL vs. Cost) otherwise required.
Experiment with ProTO-χ2
In the following experiment, random data is sampled for every controlled data stream from the same binomial distribution for 200,000 samples. Then, at sample 200,000, the mean of the random source is changed by Δ. We ran ProTO-χ2 over one hundred different controlled data streams for each value of Δ.
Complementing this view is the cost average of the ProTO-χ2. As
Because the magnitude of change Δ does not affect the cost, we report here only the cost average for Δ=1%. We can see that the accuracy average deteriorates as the cost average decreases. This is because ProTO-χ2 retains fewer candidates; thus, it is less likely that one of them would point accurately to the change point. The timeliness average also deteriorates as the cost average decreases, for the same reason.
The horizontal solid line in
ProTO-T and PHT
In the following experiment, we compared ProTO-T with PHT. Our results show that ProTO-T outperforms PHT in the proportion of both accuracy and timeliness to ARL. We also show that the cost of ProTO-T is asymptotic to that of PHT, which is constant per new data sample. What is notable here is that ProTO-T provided better accuracy and timeliness for an acceptable cost.
In this experiment, random data is sampled for every controlled data stream from the same standard normal distribution, for 200,000 samples. Then, at sample 200,000, the mean of the random source is changed by Δ. We ran the experiment over one hundred different controlled data streams for each Δ.
Complementing this view is the cost average of ProTO-T. As
The horizontal solid line in
Mean Monitoring
We compared the ProTO-Mean algorithm to PHT-Mean. Analysis of the utility of the algorithm becomes much simpler when it is given a specific application. Here, the utility metric can be taken directly from the application domain. Furthermore, cases in which the algorithm fails to detect a change altogether or falsely alarms have a simple, measurable, effect on performance. The utility metric of the mean estimation algorithm is measured by the distance of the estimated mean from the actual mean.
A typical experiment with the mean estimation algorithm is presented in
ProTO-Mean and PHT-Mean can be further compared with a trivial algorithm for mean estimation which maintains a sliding window of fixed size. On every new sample it recalculates the average of the last samples seen in that window.
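A minimal sketch of such a sliding-window baseline is given below. The class name is hypothetical, and a running total is kept instead of re-summing the window on every sample; this is an implementation choice of the sketch that produces the same average.

```python
from collections import deque

class SlidingWindowMean:
    """Average of the most recent `size` samples."""

    def __init__(self, size):
        self.window = deque(maxlen=size)
        self.total = 0.0

    def add(self, x):
        """Add a sample and return the average over the current window."""
        if len(self.window) == self.window.maxlen:
            self.total -= self.window[0]   # the oldest sample is about to be evicted
        self.window.append(x)
        self.total += x
        return self.total / len(self.window)
```

Unlike the change-aware estimators, this baseline forgets old samples at a fixed rate regardless of whether a change has occurred.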
Complementing this view is the cost distribution of the ProTO-Mean algorithm. As
Appendix: General Applicability of the ProTO Algorithm
The ProTO algorithmic framework might be applicable to many statistical two-sample tests.
We have shown, by way of example, how to apply the framework to the χ2 two-sample test and to the Student's two-sample t-test. However, many two-sample tests determine whether there is a difference between the two samples based on the same idea: the convergence of the test statistic value is very different for two samples from the same unknown distribution than for two samples from different, unknown distributions. A person skilled in the art will immediately perceive how to apply the algorithms of the invention to other two-sample tests. Several examples follow:
The parametric two-sample Z-test compares the means of the two samples to determine whether there is a difference between the two samples. If the two samples are derived from the same normal distribution, then the test statistic value has a known distribution—the normal distribution. If, however, the two samples come from different distributions, then the test statistic value tends to a constant as one of the samples grows. The Z-test statistic is
where
The two-sample Kolmogorov-Smirnov test (KS-test) is used to test whether two samples come from the same distribution. The two-sample KS-test uses the maximal distance between the cumulative frequency distributions of the two samples as the test statistic. The KS-test statistic is $D_{n,n'} = \sup_x |F_{1,n}(x) - F_{2,n'}(x)|$,
where $F_{1,n}$ and $F_{2,n'}$ are the empirical distribution functions of the first and the second sample respectively. If the two samples are derived from the same unknown distribution, then the test statistic value has a known specific distribution, the Kolmogorov distribution. Otherwise, it tends to a constant as one of the samples grows.
The two-sample F-test is designed to test whether the two samples have the same variance. It does this by considering a decomposition of the variability in terms of sums of squares. The F-test statistic is defined as the ratio of two scaled sums of squares reflecting different sources of variability and is computed as $F = S_1^2 / S_2^2$,
where $S_1^2$ is the larger sample variance and $S_2^2$ is the smaller sample variance. If the two samples have the same variance, then the test statistic value has a known specific distribution, the F-distribution. Otherwise, it tends to a constant as one of the samples grows.
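The three statistics listed above can be computed as in the following sketch. The function names are hypothetical. The Z statistic is written in its common textbook form with known population variances, which is an assumption because the exact expression is not reproduced in this text.

```python
import bisect
import math
from statistics import mean, variance


def z_statistic(a, b, var_a, var_b):
    """Two-sample Z statistic, assuming known population variances var_a, var_b."""
    return (mean(a) - mean(b)) / math.sqrt(var_a / len(a) + var_b / len(b))


def ks_statistic(a, b):
    """Maximal distance between the empirical distribution functions of a and b."""
    sa, sb = sorted(a), sorted(b)
    d = 0.0
    # The supremum is attained at one of the observed points.
    for x in sa + sb:
        fa = bisect.bisect_right(sa, x) / len(sa)
        fb = bisect.bisect_right(sb, x) / len(sb)
        d = max(d, abs(fa - fb))
    return d


def f_statistic(a, b):
    """Ratio of the larger to the smaller unbiased sample variance."""
    va, vb = variance(a), variance(b)
    return max(va, vb) / min(va, vb)
```

Each function returns only the test statistic. Comparing it against the corresponding reference distribution, and guarding against a zero variance in `f_statistic`, is left out of the sketch.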
Resource Optimization
Our approach to the problem of delayed detection is to dynamically manage both the number of windows and their sizes. We decide to stop collecting statistics for some time-windows based on the estimated probability that they will be the first to alert on a change. As a result, the computational cost of our approach is variable. Approaches such as those of Kifer et al. and PHT have a constant computational cost, which may be preferred over a variable cost. By choosing to ignore a large number of time-windows, we can limit the computational cost to a constant, equivalent to that of PHT.
In further research, two improvements to the basic ProTO algorithms will be tested. One is to purge the time-window whose upper bound statistic is the lowest whenever the number of current time-windows exceeds a user-predefined constant (see early results below). The other is to induce confidence bounds on the difference between the test statistics of two time-windows instead of bounding a single test statistic of one time-window. Such an improvement makes the bounds tighter and therefore reduces the cost (see early results below).
Early Results in Change Point Detection in Multidimensional Streams
The statistical two-sample test called Hotelling's T² is designed for detecting changes in the mean of multidimensional data streams.
Two-Sample Hotelling's T² Test
Assume that the observations in the prefix follow a multivariate normal distribution Z ~ Np(μ, Σ), where μ is the mean vector and Σ is the covariance matrix. Let
The two-sample Hotelling's T² is the multivariate analog of the two-sample t-test in uni-variate statistics. It is used to compare two populations in order to determine whether the mean vector has changed between the two samples. Let n1,
The predominant characteristic of Hotelling's T² test is that if both samples are derived from the same multivariate normal distribution Z ~ Np(μ, Σ) with unknown μ and Σ, then the test statistic is χ² distributed with p degrees of freedom. If, on the other hand, the two samples come from distributions whose mean vectors differ, then the test statistic is no longer χ² distributed and its value will be significantly larger. The test holds for large sample sizes, such that n1+n2−p>40.
When the test is applied to the head and the tail of a prefix of a stream, T²(n) can be written as:
As each new observation xn arrives, the aggregates i,
where τn=τn−1+xn. The unbiased sample covariance matrix is computed as
where ωn=ωn−1+xnx′n.
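The incremental aggregates can be sketched as follows. The class name `RunningMoments` is hypothetical, and the covariance expression used here, $S_n = (\omega_n - \tau_n \tau_n' / n) / (n - 1)$, is the standard identity for the unbiased sample covariance in terms of these two sums. It is assumed, not confirmed, to match the expression referred to above.

```python
class RunningMoments:
    """Maintain tau_n (sum of x) and omega_n (sum of x x') for p-dimensional data."""

    def __init__(self, p):
        self.p = p
        self.n = 0
        self.tau = [0.0] * p
        self.omega = [[0.0] * p for _ in range(p)]

    def update(self, x):
        """tau_n = tau_{n-1} + x_n and omega_n = omega_{n-1} + x_n x_n'."""
        if len(x) != self.p:
            raise ValueError("observation has the wrong dimension")
        self.n += 1
        for j in range(self.p):
            self.tau[j] += x[j]
            for k in range(self.p):
                self.omega[j][k] += x[j] * x[k]

    def mean(self):
        return [t / self.n for t in self.tau]

    def covariance(self):
        """Unbiased sample covariance computed from the two aggregates."""
        if self.n < 2:
            raise ValueError("need at least two observations")
        n = self.n
        return [
            [(self.omega[j][k] - self.tau[j] * self.tau[k] / n) / (n - 1)
             for k in range(self.p)]
            for j in range(self.p)
        ]
```

Each update costs O(p²) time and no past observations are stored. The sum-of-squares form can lose precision when the mean is large relative to the spread, so a production implementation might prefer a centered update.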
Simultaneous Confidence Intervals for the Mean
Simultaneous confidence intervals are a group of intervals, each covering one component of the mean vector, that hold jointly with 100(1−α)% confidence. It is assumed that there is a multivariate normal population Z ~ Np(μ, Σ). A random sample of n multivariate observations is collected, where n−p>40. Based on the sample data,
where Skk are the (k, k) elements of the sample covariance matrix. Here, $\chi^2_{\alpha,p}$ denotes the α percentile of the χ² distribution with p degrees of freedom.
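As an illustration, the sketch below computes large-sample simultaneous intervals in the common textbook form $\bar{x}_k \pm \sqrt{\chi^2_{\alpha,p} \, S_{kk} / n}$. This form is an assumption, since the exact expression is not reproduced in this text. The function name and the choice to pass the χ² critical value as a parameter are also hypothetical.

```python
import math


def simultaneous_intervals(mean, s_diag, n, chi2_crit):
    """Return one (low, high) interval per component of the mean vector.

    mean      -- sample mean vector
    s_diag    -- diagonal elements S_kk of the sample covariance matrix
    n         -- number of observations
    chi2_crit -- chi-square critical value for p degrees of freedom,
                 supplied by the caller (e.g. from a table)
    """
    intervals = []
    for m, s_kk in zip(mean, s_diag):
        half_width = math.sqrt(chi2_crit * s_kk / n)
        intervals.append((m - half_width, m + half_width))
    return intervals
```

For p=2 and α=0.05, the upper 5% critical value of the χ² distribution with two degrees of freedom is approximately 5.991.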
Algorithmic Improvements
Maintaining a User Predefined Constant Number of Time-Windows
We choose, without loss of generality, to apply the algorithmic improvement within the ProTO-T framework. The improvement maintains a user-predefined constant number of time-windows. Similar to ProTO-T, ProTO-FC (see Alg. 6) maintains a set of time-windows C. Every time-window i ∈ C has two pairs of aggregates: sumSi and sumSi2 for the head, and sumRi and sumRi2 for the tail. At the arrival of a new sample xn, all the aggregates in the tail of time-window i are updated as follows: sumRi←sumRi+xn and sumRi2←sumRi2+(xn)2. The algorithm also recalculates, for every time-window i, Ti(n) according to the following Eq.
and recalculates Tiu according to the following Eq.
with $\hat{X}_n \doteq \hat{X}_R$ and $sd_n \doteq \sqrt{\nu_R}$.
At every new sample xn, ProTO-FC also creates a new time-window and adds it to the set C. The tail aggregates of the new time-window are empty and its head aggregates are the sums of the respective head and tail aggregates of the first time-window in C, which are computed as follows: sumSi←sumSfirst+sumRfirst and sumSi2←sumSfirst2+sumRfirst2. Then, the algorithm locates the time-window max, which has the maximal |Tmax(n)| value. It also locates the time-window γ, whose Tγu is minimal.
Unlike ProTO-T, ProTO-FC purges the time-window whose upper bound statistic is the lowest, Tγu, whenever the number of current time-windows, |C|, exceeds a user-provided constant, η. Finally, ProTO-FC indicates a change at max if |Tmax(n)| surpasses λ.
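The bookkeeping described above can be sketched as follows. This is an illustration under stated assumptions, not Alg. 6 itself. The statistic Ti(n) is taken to be a Welch-style two-sample t statistic. The upper bound Tiu is a caller-supplied function, and the default `placeholder_upper_bound` is a stand-in that is not the bound defined by the equations referred to above. Head and tail sample counts are stored explicitly. Running totals replace the sum of the head and tail aggregates of the first time-window, since the two are equal by construction. All class and function names are hypothetical.

```python
import math


class Window:
    """Head aggregates are frozen at creation; tail aggregates grow per sample."""

    def __init__(self, n_head, sum_s, sum_s2):
        self.n_s, self.sum_s, self.sum_s2 = n_head, sum_s, sum_s2
        self.n_r, self.sum_r, self.sum_r2 = 0, 0.0, 0.0


def t_statistic(w):
    """Welch-style t statistic between head and tail (assumed form of T_i(n))."""
    if w.n_s < 2 or w.n_r < 2:
        return 0.0
    mean_s, mean_r = w.sum_s / w.n_s, w.sum_r / w.n_r
    var_s = max((w.sum_s2 - w.n_s * mean_s ** 2) / (w.n_s - 1), 0.0)
    var_r = max((w.sum_r2 - w.n_r * mean_r ** 2) / (w.n_r - 1), 0.0)
    denom = math.sqrt(var_s / w.n_s + var_r / w.n_r)
    return (mean_r - mean_s) / denom if denom > 0.0 else 0.0


def placeholder_upper_bound(w):
    """Stand-in for T_i^u: protects very young windows, else uses |T_i(n)|."""
    return math.inf if w.n_r < 2 else abs(t_statistic(w))


class ProtoFCSketch:
    def __init__(self, eta, lam, upper_bound=placeholder_upper_bound):
        self.eta, self.lam, self.upper_bound = eta, lam, upper_bound
        self.windows = []
        self.n, self.total, self.total2 = 0, 0.0, 0.0

    def update(self, x):
        """Process one sample; return the suspected change point or None."""
        self.n += 1
        self.total += x
        self.total2 += x * x
        for w in self.windows:  # update every tail
            w.n_r += 1
            w.sum_r += x
            w.sum_r2 += x * x
        # New window: head holds everything seen so far, tail is empty.
        self.windows.append(Window(self.n, self.total, self.total2))
        if len(self.windows) > self.eta:  # purge the lowest upper bound
            gamma = min(self.windows, key=self.upper_bound)
            self.windows.remove(gamma)
        best = max(self.windows, key=lambda w: abs(t_statistic(w)))
        return best.n_s if abs(t_statistic(best)) > self.lam else None
```

Because one time-window is added per sample and at most one is purged, |C| never exceeds η after the first η samples, so the per-sample cost is bounded by a constant that depends only on η.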
Inducing Confidence Bounds on the Difference Between the Test Statistics of Two Time-Windows
Consider two different points in the stream, 1st and 2nd, where 2nd>1st without loss of generality. We examine the long-term behavior of $T^2_{2nd-1st}(n)$ as n grows toward infinity, and show how to induce an upper and a lower bound for the value to which $T^2_{2nd-1st}(n)$ will converge. Let h1 and
Furthermore, let L be the distance (i.e. the number of observations) between these two points. Therefore, the average of those L observations,
Let also, □ be the difference between the sample mean vector of the head of 1st and the sample mean vector of the tail of 2nd (i.e., □=
and the sample mean vector of the tail of 2nd
□ is a variable which monitors the true change in the mean of the data distribution, while φ monitors the noise. As a result,
Eq. 24 induces an upper and a lower bound for the value to which $T^2_{2nd-1st}(n)$ will converge. Replacing both □ and φ with the simultaneous confidence intervals in Eq. 24 gives us simultaneous confidence intervals on the limit of $T^2_{2nd-1st}(n)$. Let
for k={I, j}. As a result, the maximal expected value (i.e., the upper bound), $T^u_{2nd-1st}$, of $T^2_{2nd-1st}(n)$ is
Similarly, the minimal expected value (i.e., the lower bound), $T^l_{2nd-1st}$, of $T^2_{2nd-1st}(n)$ is:
A possible improvement is to induce confidence bounds on the difference between the test statistics of two time-windows instead of bounding a single test statistic of one time-window. Here, we choose to use Hotelling's T² test as a plug-in for our algorithm for detecting changes in uni-variate streams (i.e., p=1). Note that in this case, Eq. 25 can be written as:
Similarly, Eq. 26 can be written as:
ProTO-T² (see Alg. 7) maintains a set of time-windows C. Every time-window i ∈ C has two pairs of aggregates: $\tau_i^h$ and $\omega_i^h$ for the head, and $\tau_i^t$ and $\omega_i^t$ for the tail. At the arrival of a new observation, xn, all the aggregates in the tail of time-window i are updated as follows: $\tau_i^t \leftarrow \tau_i^t + x_n$ and $\omega_i^t \leftarrow \omega_i^t + x_n^2$. Then, the algorithm recalculates $T^2(n)$ according to Eq. 20.
The last step taken after every new observation xn is to update the time-window set. A new time-window is first added to C, whose tail aggregates are zero and whose head aggregates are the sums of the respective head and tail aggregates of any one of the time-windows in C. Note that the sum of $\tau_i^h$ and $\tau_i^t$ is the same for all i, as is the sum of $\omega_i^h$ and $\omega_i^t$. For instance, let ψ be the first time-window in C; the head aggregates of the new time-window i are then computed as follows: $\tau_i^h \leftarrow \tau_\psi^h + \tau_\psi^t$ and $\omega_i^h \leftarrow \omega_\psi^h + \omega_\psi^t$.
The method by which ProTO-T² reviews the time-window set and purges the unneeded time-windows is different from that of ProTO-T: for each pair of time-windows, 1st and 2nd in C, calculate the bounds $T^l_{2nd-1st}$ and $T^u_{2nd-1st}$ according to Eqs. 27 and 28 respectively. If the upper bound $T^u_{2nd-1st}$ is lower than zero, remove time-window 2nd from C, since its statistic is not expected to exceed that of 1st. Conversely, if the lower bound $T^l_{2nd-1st}$ is greater than zero, remove time-window 1st from C. Lastly, the algorithm also checks whether the time-window max has passed the threshold λ. If it has, an alert is raised and the suspected change point is reported as max.
Many alterations and modifications may be made by those having ordinary skill in the art without departing from the spirit and scope of the invention. Therefore, it must be understood that the illustrated embodiment has been set forth only for the purposes of example and that it should not be taken as limiting the invention as defined by the following invention and its various embodiments.
Therefore, it must be understood that the illustrated embodiment has been set forth only for the purposes of example and that it should not be taken as limiting the invention as defined by the following claims. For example, notwithstanding the fact that the elements of a claim are set forth below in a certain combination, it must be expressly understood that the invention includes other combinations of fewer, more or different elements, which are disclosed above even when not initially claimed in such combinations. A teaching that two elements are combined in a claimed combination is further to be understood as also allowing for a claimed combination in which the two elements are not combined with each other, but may be used alone or combined in other combinations. The excision of any disclosed element of the invention is explicitly contemplated as within the scope of the invention.
The words used in this specification to describe the invention and its various embodiments are to be understood not only in the sense of their commonly defined meanings, but to include by special definition in this specification structure, material or acts beyond the scope of the commonly defined meanings. Thus if an element can be understood in the context of this specification as including more than one meaning, then its use in a claim must be understood as being generic to all possible meanings supported by the specification and by the word itself.
The definitions of the words or elements of the following claims are, therefore, defined in this specification to include not only the combination of elements which are literally set forth, but all equivalent structure, material or acts for performing substantially the same function in substantially the same way to obtain substantially the same result. In this sense it is therefore contemplated that an equivalent substitution of two or more elements may be made for any one of the elements in the claims below or that a single element may be substituted for two or more elements in a claim. Although elements may be described above as acting in certain combinations and even initially claimed as such, it is to be expressly understood that one or more elements from a claimed combination can in some cases be excised from the combination and that the claimed combination may be directed to a sub-combination or variation of a sub-combination.
Insubstantial changes from the claimed subject matter as viewed by a person with ordinary skill in the art, now known or later devised, are expressly contemplated as being equivalently within the scope of the claims. Therefore, obvious substitutions now or later known to one with ordinary skill in the art are defined to be within the scope of the defined elements.
The claims are thus to be understood to include what is specifically illustrated and described above, what is conceptually equivalent, what can be obviously substituted and also what essentially incorporates the essential idea of the invention.
Filing Document | Filing Date | Country | Kind | 371c Date |
---|---|---|---|---|
PCT/IL2011/000764 | 9/27/2011 | WO | 00 | 6/14/2013 |
Publishing Document | Publishing Date | Country | Kind |
---|---|---|---|
WO2012/042521 | 4/5/2012 | WO | A |
Number | Name | Date | Kind |
---|---|---|---|
20070082075 | Xu | Apr 2007 | A1 |
20110014625 | Belinsky et al. | Jan 2011 | A1 |
20110045053 | Shen et al. | Feb 2011 | A1 |
Entry |
---|
Keogh et al; “An online algorithm for segmenting time series” Data Mining, 2001. ICDM 2001, Proceedings IEEE International Conference, pp. 289-296. (2001). |
Charu C. Aggarwal et al; “A framework for diagnosing changes in evolving data streams” Proceedings of the 2003 ACM SIGMOD international conference on Management of data, SIGMOD '03, pp. 575-586. (2003). |
Daniel Kifer et al; “Detecting change in data streams” Proceedings of the Thirtieth international conference on Very large data bases—vol. 30, pp. 180-191. (2004). |
Murad Badarna et al; “Detecting Mean Changes in Data Streams” Data Mining Workshops (ICDMW), 2011 IEEE 11th International Conference, pp. 568-572. (2011). |
International Search Report for PCT Patent Application No. PCT/IL2011/000764, Filed on Sep. 27, 2011. |
Number | Date | Country | |
---|---|---|---|
20130262368 A1 | Oct 2013 | US |
Number | Date | Country | |
---|---|---|---|
61386752 | Sep 2010 | US |