Companies or other organizations often gather data into data repositories, such as databases or data warehouses, for analysis to discover hidden data attributes, trends, patterns, or other characteristics. Such analysis is referred to as data mining, which is performed by companies or other organizations for planning purposes, for better understanding of customer behavior, or for other purposes.
It is often useful to detect a “structural” or “systematic” change in observed data from a particular data source or database. A “systematic” or “structural” change in data results from some change in a particular system that produced the data, where such change results from an underlying change in the system rather than from changes due to normal operation of the system. The term “systematic change” is often used in the industry context, whereas the term “structural change” is often used in the economics context. In this description, the terms “systematic change” and “structural change” are used interchangeably and refer to any change in data that results from a change in the system that produced the data.
Detecting a systematic change of data involves change-point detection, which identifies the point in time of the change. Conventionally, change-point detection has employed a model that assumes a constant mean for observed data values before the change, a different constant mean for the observed data values after the change, and a constant variance for the observed data values. A shift in the calculated constant means or constant variance has conventionally been used as an indication that a systematic change has occurred.
Some other forms of change-point algorithms detect change points based on comparing aggregate values (computed from aggregations of data values) against a threshold. With such algorithms, a change point can be detected based on the crossing of the threshold by the aggregate values. However, it is often difficult to accurately set an optimal threshold value. An incorrectly set threshold may result in inaccurate or late detection of a change point.
Some embodiments of the invention are described with reference to the following figures:
The one or plural CPUs 102 are coupled to a storage 104 (which can include volatile memory, non-volatile memory, and/or a mass storage device). The computer 110 also includes a database management module 106 that is executable on the one or plural CPUs 102. Alternatively, the database management module 106 can be executable on a computer that is separate from the computer 110 on which the change-point detection module 100 is executed. The database management module 106 manages the access (read or write) of data stored in a database 112. The database 112 can be implemented in storage device(s) connected to the computer 110, or alternatively, the database 112 can be implemented in a server or other computer coupled over a data network, such as data network 114.
The computer 110 communicates over the data network 114 through a network interface 116. Example devices or systems that are coupled to the data network 114 include a client 118 and one or plural data sources 120. The data sources 120 (which can be associated with different organizations, departments within an organization, or other types of entities) are able to collect data that is then transmitted over the data network 114 and through the computer 110 for storing in the database 112.
The change-point detection module 100 checks for a systematic change in data stored in the database 112. Examples of data that can be stored in the database 112 include retail or wholesale sales data, invoice data, production volume data, inventory data, revenue data, financial data, cost data, quality control data, and other forms of data. In response to detecting a systematic change in data, the change-point detection module 100 is able to provide an alert (e.g., an alarm) to a user of a time point (also referred to as a “change point”) at which the systematic change in data occurred. Note that the change-point detection module 100 is also able to check for systematic changes in data of other databases aside from database 112.
As noted above, a “systematic change” or “structural change” in data results from an underlying change in the system that produced the data rather than from normal operation of the system, and the two terms are used interchangeably in this description.
In some embodiments, the change-point detection module 100 detects a change point in a time series of data values (stored in the database 112 or elsewhere) by first computing aggregate values corresponding to the data values. The time series of data values is also referred to as a time series of “observations” or “observed data values.” Aggregate values are computed by performing aggregation of the observed data values. The aggregate values are also represented as a time series. In one embodiment, the aggregate values are cumulative sum values. In other embodiments, other types of aggregate values based on other forms of aggregation (e.g., average, minimum, maximum, etc.) can be employed.
In accordance with some embodiments, the change-point detection module 100 performs linear fitting (such as linear regression fitting) onto curve segments representing the aggregate values. In some embodiments, at least two curve segments representing the aggregate values are defined. The curve segments are segments of a curve representing the time series of aggregate values (e.g., cumulative sum values). Linear fitting is performed to fit line segments onto the respective curve segments representing the aggregate values. In one embodiment, linear fitting is performed by building linear regression models with respect to the curve segments. In other embodiments, other forms of fitting can be performed, including non-linear fitting.
Each curve segment represents a respective set of aggregate values. For example, if a curve representing a time series of aggregate values is divided into two curve segments, then the two curve segments represent two respective sets of the aggregate values (also referred to as “aggregate value sets”). If the time series of aggregate values is divided into two aggregate value sets, these two aggregate value sets are referred to collectively as a pair of aggregate value sets. In other embodiments, a time series of aggregate values can be divided into a larger number of aggregate value sets. Change point detection is based on the fittings (e.g., linear fittings) performed by the change-point detection module 100 with respect to the aggregate value sets (two or more). In the ensuing discussion, change point detection is discussed in the context of dividing a time series of aggregate values into two (a pair of) aggregate value sets. However, it is noted that the described techniques are applicable to embodiments in which the time series of aggregate values is divided into greater than two aggregate value sets.
In the analysis according to an embodiment performed by the change-point detection module 100 to find a change point, multiple pairs of aggregate value sets are defined. The numbers of members of the aggregate value sets are varied in the multiple pairs of aggregate value sets such that the aggregate value sets in one pair have differing numbers of members than aggregate value sets in another pair. For example, for a time series of n aggregate values that is divided into two aggregate value sets, a first pair of aggregate value sets can have a first aggregate value set with m aggregate values, and a second aggregate value set with n−m aggregate values. In a second pair of aggregate value sets, the number of aggregate values in a first aggregate value set is k (k≠m), and the number of aggregate values in a second aggregate value set is n−k. A linear fitting is performed with respect to each of the first and second pairs of aggregate value sets. Additional pairs of aggregate value sets are further defined, with further fittings performed on these additional pairs of aggregate value sets.
The fittings performed on the multiple pairs of aggregate value sets are compared to identify an optimal fit, which in turn identifies the pair of aggregate value sets (from among the multiple pairs of aggregate value sets) associated with that fit. This identified pair of aggregate value sets provides the indication of the change point (the time point at which a systematic change in observed data values occurs).
In some embodiments, the comparisons to identify an optimal fit are based on a goodness-of-fit analysis performed for each of the linear regression models built for respective pairs of aggregate value sets. Measures of the goodness-of-fit analyses are then computed and compared to determine the optimal fit from among the linear fits performed on the multiple pairs of aggregate value sets. A goodness-of-fit measure is computed for how well each line segment (for the linear fitting) fits onto the corresponding curve segment representing an aggregate value set. The goodness-of-fit measure can be one of any number of measures, including R-squared, adjusted R-squared, AIC (Akaike's Information Criterion), BIC (Bayesian Information Criterion), and other goodness-of-fit measures.
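As an illustration only, two of the measures named above can be sketched as follows. The function names are hypothetical, and the AIC expression is one common least-squares form (stated up to an additive constant), not a form prescribed by this description. Note that a higher R-squared indicates a better fit, whereas a lower AIC indicates a better fit.

```python
import math


def r_squared(observed, fitted):
    """R-squared: 1 - SSE/SST. Assumes the observed values are not all equal."""
    mean = sum(observed) / len(observed)
    sse = sum((o - f) ** 2 for o, f in zip(observed, fitted))
    sst = sum((o - mean) ** 2 for o in observed)
    return 1.0 - sse / sst


def aic_least_squares(observed, fitted, num_params=2):
    """AIC for a least-squares fit, up to an additive constant.

    num_params=2 is an assumption (slope and intercept of a line).
    Requires a strictly positive residual sum of squares.
    """
    n = len(observed)
    sse = sum((o - f) ** 2 for o, f in zip(observed, fitted))
    return n * math.log(sse / n) + 2 * num_params
```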
In some implementations, once a change point is detected, an alert provided by the change-point detection module 100 can be presented to a display monitor 122 (that is able to display a graphical user interface or GUI 124) or an audio output device 126 of the computer 110. Thus, the change-point detection module 100 is able to provide a visual alert, an audio alert, or both to a user in response to a systematic change in data. The display monitor 122 is coupled to a video controller 128 in the computer 110, and the audio output device 126 is coupled to an audio interface 130 in the computer 110. Alternatively, the change-point detection module 100 is also able to communicate an alert of a systematic data change over the data network 114 to a remote computer, such as the client 118. The alert enables a user to act upon the systematic change in data. The alert can be in the form of a report or other indication.
A process performed by the change-point detection module 100 according to an embodiment is illustrated in
The observed data values are depicted in
Next, a grand mean of the observed data values,
The change-point detection module 100 also computes (at 206) aggregate values, in one example cumulative sum (CUSUM) values. In other examples, other types of aggregations can be performed, such as aggregations associated with a generalized likelihood ratio (GLR) algorithm or other aggregation algorithms.
The cumulative sums are calculated based on the residual values $r_i$, with a time series of the cumulative sums represented as $\{c_t\}$, where $c_t = \sum_{i=1}^{t} r_i$, for $t = 1, 2, \ldots, n$.
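A minimal sketch of the cumulative sum computation follows. It assumes that each residual is the observed data value minus the grand mean of the observed data values; that definition and the function name are assumptions made for illustration.

```python
from itertools import accumulate


def cusum_of_residuals(observations):
    """Return [c_1, ..., c_n], where c_t is the sum of residuals r_1..r_t.

    Assumption: r_i = observation_i - grand mean of all observations.
    """
    grand_mean = sum(observations) / len(observations)
    residuals = [y - grand_mean for y in observations]
    return list(accumulate(residuals))
```

Because the residuals are taken about the grand mean, the last cumulative sum is zero (up to floating-point rounding).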
In accordance with some embodiments of the invention, at least two sets of cumulative sum values are defined (at 208), a first cumulative sum set corresponding to a first curve segment of the curve 304, and a second cumulative sum set corresponding to a second curve segment. In other embodiments, additional sets of cumulative sum values can be defined. The cumulative sum sets described here refer to the aggregate value sets discussed above. The time series of cumulative sum values are partitioned into the following pair of cumulative sum sets: {c1, . . . , ct-1}, {ct, . . . , cn}. The first cumulative sum set of the pair includes cumulative sum values c1, . . . , ct-1, and the second cumulative sum set of the pair includes cumulative sum values ct, . . . cn, where the value of t is selected from a possible change point (PCP) set {2, 3, . . . , n−1, n}. Thus, for example, if the value of t is 5, then the first cumulative sum set includes cumulative sum values c1, c2, c3, and c4, and the second cumulative sum set includes c5, . . . , cn. If the value of t is 2, then the first cumulative sum set includes one cumulative sum value c1, and the second cumulative sum set includes cumulative sum values c2, . . . , cn. The t value is varied (by selecting from the PCP set) to vary the numbers of members in the first and second cumulative sum sets in different pairs of the cumulative sum sets. Effectively, each cumulative sum set in the pair contains a number of members that is based on the value of t. In performing the change-point detection analysis, the change-point detection module 100 varies the value of t to obtain multiple pairs of cumulative sum sets.
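The partitioning over the possible change point set can be sketched as below. The sketch assumes the cumulative sums are held in a zero-based Python list, so that c1 is stored at index 0; the function name is hypothetical.

```python
def cusum_pairs(c):
    """Yield (t, first_set, second_set) for every t in the PCP set {2, ..., n}.

    first_set holds c_1..c_(t-1) and second_set holds c_t..c_n,
    with c_1 stored at list index 0.
    """
    n = len(c)
    for t in range(2, n + 1):
        yield t, c[: t - 1], c[t - 1 :]
```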
At step 208, a first value of t is selected to define the first pair of cumulative sum sets, where t is selected from t=2, . . . , n. A linear regression model is built (at 210) by the change-point detection module 100 for each of the two cumulative sum sets in the pair. Building a linear regression model for each of the two cumulative sum sets refers to performing a linear regression fitting onto curve segments of the curve 304. Other types of fitting, including non-linear fitting, can be performed in other embodiments.
An example of linear fitting onto the two curve segments is depicted in
More formally, in building a linear regression model for each of the cumulative sum sets, the response variable is the cumulative sum values in the set, and the explanatory variable is the corresponding time point values. Linear regression models the relationship between two variables (the response variable and the explanatory variable) by fitting a linear equation to data (in this case, the aggregated data values, e.g., cumulative sum values). Note that for t=2 or t=n, regression is not performed on the set with one cumulative sum value, but is performed on the other set. Specifically, when t=2, the first cumulative sum set in the pair contains only a single value c1, taken at time point 1, and thus a regression fitting does not have to be performed for this set. The second cumulative sum set in the pair contains values {c2, c3, . . . , cn}, taken at time points 2, 3, . . . , n, and a linear regression fit is performed on this set. When t=n, the first cumulative sum set in the pair contains values c1, c2, . . . , cn-1, taken at time points 1, 2, . . . , n−1, and a linear regression fitting is performed on this set. The second cumulative sum set in the pair in this case has only a single value cn, taken at time point n, and a regression fitting does not have to be performed for this set. For the other values of t (t=3, . . . , n−1), each of the two cumulative sum sets in the pair contains more than one value, and a regression fitting is performed on each of the sets.
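The regression on a single cumulative sum set can be sketched with ordinary least squares, as below. The function name and the choice to return None for a single-value set (where no regression is performed) are illustrative assumptions.

```python
def fit_line(times, values):
    """Fit values = intercept + slope * time by ordinary least squares.

    Returns (slope, intercept), or None when the set has fewer than two
    values and no regression is performed.
    """
    n = len(times)
    if n < 2:
        return None
    t_mean = sum(times) / n
    v_mean = sum(values) / n
    sxx = sum((t - t_mean) ** 2 for t in times)
    sxy = sum((t - t_mean) * (v - v_mean) for t, v in zip(times, values))
    slope = sxy / sxx
    intercept = v_mean - slope * t_mean
    return slope, intercept
```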
Next, the change-point detection module 100 computes (at 212) a goodness-of-fit measure for each linear regression model. In other words, a goodness-of-fit measure is computed for how well each line segment 310A, 310B fits onto the corresponding curve segment representing a cumulative sum set. Two goodness-of-fit measures are computed, one for each cumulative sum set in a pair, for the current value of t. These two goodness-of-fit measures are summed to form an overall goodness-of-fit measure for the two line segments partitioned at the time point t. Note that for t=2 or t=n, only one linear regression model is built, so that only one goodness-of-fit measure is computed and used as the overall goodness-of-fit measure. A better fit is indicated by a lower value of the goodness-of-fit measure in some implementations. In other implementations, a better fit is indicated by a higher or some other value of the goodness-of-fit measure. The overall goodness-of-fit measure is referred to as a detection measurability value (DMV).
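One way to form the overall measure is sketched below. The residual sum of squares is used here as the per-set goodness-of-fit measure; that choice is an assumption for illustration (a lower value indicates a better fit), and any of the measures named earlier could be substituted. A set for which no regression was performed is represented by None and contributes nothing to the sum. The function names are hypothetical.

```python
def sum_squared_error(times, values, slope, intercept):
    """Residual sum of squares of a fitted line over one cumulative sum set."""
    return sum((v - (intercept + slope * t)) ** 2 for t, v in zip(times, values))


def detection_measurability_value(per_set_measures):
    """Sum the per-set measures into the overall measure (DMV).

    A None entry marks a single-value set with no regression model.
    """
    return sum(m for m in per_set_measures if m is not None)
```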
The change-point detection module 100 next checks (at 214) to determine if all values of t (from the PCP set) have been considered. If not, another value of t is selected (at 216), and the change-point detection module 100 proceeds back to step 208 to repeat steps 208, 210, and 212.
As the value of t is varied, different line segments are fitted onto the respective curve segments corresponding to the changing cumulative sum sets. For example, as depicted in
If all values of t have been considered, as determined at 214, the change-point detection module 100 identifies (at 218) the pair of cumulative sum sets associated with regression models having the optimal DMV (the lowest DMV in implementations where a lower value indicates a better fit). This pair of cumulative sum sets corresponds to a particular value of t, which is identified as the change point. In some embodiments, if a single value for the change point is desired, then the change point is identified as the time point where the DMV attains its optimal (e.g., minimum or maximum) value. This change point is output (at 220) by the change-point detection module 100 (such as in the form of an alert).
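The overall procedure for a single change point can be sketched end to end as follows. This is an illustrative sketch under stated assumptions, not a definitive implementation: residuals are taken about the grand mean, the per-set goodness-of-fit measure is the residual sum of squares (lower is better), a single-value set contributes zero, and all function names are hypothetical.

```python
from itertools import accumulate


def line_sse(times, values):
    """Residual sum of squares of a least-squares line; 0.0 for a single value."""
    n = len(times)
    if n < 2:
        return 0.0  # assumption: no regression, so no contribution
    t_mean = sum(times) / n
    v_mean = sum(values) / n
    sxx = sum((t - t_mean) ** 2 for t in times)
    sxy = sum((t - t_mean) * (v - v_mean) for t, v in zip(times, values))
    slope = sxy / sxx
    intercept = v_mean - slope * t_mean
    return sum((v - (intercept + slope * t)) ** 2 for t, v in zip(times, values))


def detect_change_point(observations):
    """Return (best_t, dmv_by_t) for t in the possible change point set {2..n}."""
    n = len(observations)
    grand_mean = sum(observations) / n
    cusum = list(accumulate(y - grand_mean for y in observations))
    times = list(range(1, n + 1))
    dmv_by_t = {}
    for t in range(2, n + 1):
        # First set: c_1..c_(t-1); second set: c_t..c_n.
        first = line_sse(times[: t - 1], cusum[: t - 1])
        second = line_sse(times[t - 1 :], cusum[t - 1 :])
        dmv_by_t[t] = first + second
    best_t = min(dmv_by_t, key=dmv_by_t.get)
    return best_t, dmv_by_t
```

For example, `detect_change_point([10, 11, 9, 10, 11, 20, 21, 19, 20, 21])` returns the value of t with the lowest summed measure together with the measure for every candidate t, so the full profile can be inspected rather than only the selected point.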
On the other hand, if a range of time points is desired, then the following is performed. A confidence level, such as 1−α=90%, is set to identify the most likely values for the change point in the possible change point set. Then the quantile value of the possible change point set is computed at level α. In the example where the confidence level is 1−α=90%, the quantile value of the possible change point set at level α (10% in this example) is computed by finding the values of the DMV that are within 10% of the minimum DMV (in other words, these values of the DMV satisfy the set confidence level). An example is illustrated in
By performing linear fitting onto curve segments representing respective aggregate value sets, identification of a change point (or plural possible change points) is based on goodness-of-fit measurements so that a threshold value does not have to be predefined. As a result, without having to predefine a threshold value, false alarms or detection delays associated with inaccurately set threshold values can be avoided or reduced.
The change-point detection module 100 of
Data and instructions (of the software) are stored in respective storage devices (such as storage 104 in
In the foregoing description, numerous details are set forth to provide an understanding of the present invention. However, it will be understood by those skilled in the art that the present invention may be practiced without these details. While the invention has been disclosed with respect to a limited number of embodiments, those skilled in the art will appreciate numerous modifications and variations therefrom. It is intended that the appended claims cover such modifications and variations as fall within the true spirit and scope of the invention.