To better understand operations within an enterprise (such as a company, educational organization, government agency, and so forth), the enterprise may collect information regarding various aspects of such operations. For example, monitors may be added to information technology (IT) systems to gather data during operation of the IT systems. The enterprise may also collect information regarding business aspects of the enterprise, such as information regarding offerings (goods and/or services) provided by the enterprise.
It is desirable to analyze the data to perform anomaly detection, such as to detect failure conditions, errors, or any other condition that the enterprise may wish to address. However, such data analysis is complicated by the presence of seasonality (or seasonal effects) in the received data.
Some embodiments of the invention are described with respect to the following figures:
To allow for accurate analysis of temporal data collected regarding an enterprise, it is desired that seasonal effects (or seasonality) of the temporal data be identified. A seasonal effect refers to a time-dependent pattern in the temporal data collected over time (in a time series), where the pattern tends to repeat every season (or cycle) of a certain length. The length can be seconds, minutes, hours, days, months, years, and so forth. Seasonal behavior in the temporal data can be based on different usage patterns, internal processes of systems, or other factors. For example, user volume often shows daily and weekly cycles, corresponding to typical access patterns to the system.
Without identifying seasonality in the temporal data, some analysis performed on the temporal data may produce inaccurate results, such as false alarms. Analysis may be performed on temporal data for anomaly detection, such as to identify failure conditions, errors, or any other condition that the enterprise may wish to address.
The seasonality detection algorithm according to some embodiments does not assume that seasonal effects are based on a static season, such as an hour, day, or week. Instead, the seasonality detection algorithm according to some embodiments is able to consider seasons of arbitrary varying lengths, and to identify one of the seasons representing the seasonality effect in the temporal data. For example, different possible seasons considered can start at one hour and continue in increments of an hour until some maximum season size (e.g., a week, month, or year).
An error score is used to assist in selection of one of the seasons as representative of the seasonality in the temporal data, where the error score is derived from statistical measures computed on the temporal data while taking into account the corresponding season. Thus, for the multiple seasons being considered (candidate seasons), multiple corresponding error scores are produced. The computation of the error scores is performed in different ways depending upon whether the temporal data is continuous temporal data or discrete temporal data (as discussed further below). The candidate season associated with the lowest (or most optimal) error score is selected as the most likely to represent the seasonal effect in the temporal data. In another embodiment, likelihood scores can be used in place of error scores.
The seasonality detection algorithm according to some embodiments is able to perform seasonality identification even if there are gaps in the temporal data. In addition, the seasonality detection algorithm is able to tolerate noisy input data relatively well. Moreover, the seasonality detection algorithm works on temporal data (continuous or discrete) without fixed (regular) sampling intervals.
The seasonality detection algorithm also receives (at 104) a set of candidate seasons to test. For example, the candidate seasons can be seasons within a range of hours from 0 to some target number of hours. Thus, the candidate seasons can be a 1-hour season, a 2-hour season, a 4-hour season, a 15-hour season, a 40-hour season, and so forth. The number of candidate seasons tested can be relatively large, in view of the fact that the seasonality detection algorithm is relatively simple and thus can be performed in a timely fashion. By being able to consider a relatively large number of candidate seasons of arbitrary lengths, more accurate identification of the seasonality in the temporal data can be achieved. Each candidate season is referred to as season k, where k=1 to numSeasons, and where numSeasons≥2 represents the number of seasons being considered.
Block 106 represents the processing performed for each candidate season k, as follows.
Next, the samples in the received temporal data are assigned (at 110) to the corresponding buckets, based on the time of each sample. The time of a particular sample falls within one of the buckets. In the example above, if the time of the particular sample occurs between 15 minutes and 29 minutes after the hour, then the particular sample would be assigned to the second bucket 202B.
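As an illustrative sketch of the bucket assignment at 110 (the function name and the (time, value) representation of samples are assumptions for illustration, not part of the original description), each sample's timestamp is reduced modulo the candidate season length and mapped to a bucket; because placement depends only on the timestamp, gaps and irregular sampling intervals are handled naturally:

```python
from collections import defaultdict

def assign_to_buckets(samples, season_len, bucket_len):
    """Assign (time, value) samples to buckets of one candidate season.

    season_len is the candidate season length and bucket_len the bucket
    width, both in the same time units as the sample timestamps.
    """
    buckets = defaultdict(list)
    for t, v in samples:
        phase = t % season_len                     # position within the season
        buckets[int(phase // bucket_len)].append(v)
    return buckets
```

For example, with a 60-minute candidate season and 15-minute buckets, a sample at time 75 falls into the second bucket (phase 15).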
Next, an error score for season k, error(k), is computed (at 112) based on the data samples in the buckets of season k. To determine an error score, different processing is performed depending on whether the temporal data is continuous temporal data or discrete temporal data, as described further below.
The processing of block 106 is repeated for each of the candidate seasons considered, such that corresponding error scores are produced for corresponding candidate seasons.
The error scores of the candidate seasons are then compared (at 114). An indication of the minimum error score can then be output (at 116). For example, the error scores of the corresponding candidate seasons may be stored in an error vector, and the indication that is output at 116 can be an index into the error vector. The output index (or other indication) that identifies a corresponding season can then be used in later processing to identify the seasonality of the temporal data. In a different embodiment, instead of selecting the minimum error score, a score having another optimal value (e.g., maximum score) can be selected—the score with “optimal” value depends on the type of score calculated.
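The overall selection loop (one error score per candidate season, then the index of the minimum) can be sketched as follows; score_fn is a hypothetical stand-in for either the deviation-based or the entropy-based error computation, and the function names are illustrative assumptions:

```python
def select_season(samples, candidate_seasons, score_fn):
    """Score each candidate season and return the index of the minimum
    error score along with the full error vector."""
    errors = [score_fn(samples, k) for k in candidate_seasons]
    best_index = min(range(len(errors)), key=errors.__getitem__)
    return best_index, errors
```

The returned index plays the role of the indication output at 116: it identifies which candidate season most likely represents the seasonality.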
As noted above, computation of the error scores is different depending upon whether the temporal data is continuous or discrete data.
For continuous temporal data, a statistical measure (e.g., the median) of the data samples in each bucket is first computed (at 402).
The absolute deviations between the data samples of a bucket and the statistical measure (e.g., median) of the bucket are then calculated (at 404). These absolute deviations calculated for a particular bucket are summed to produce a corresponding deviation sum: Deviation_Sum(i)=Σj|Di(j)−medi|, where medi represents the median for bucket i, Di(j) represents data sample j in bucket i, and i=1 to Nb, where Nb represents the number of buckets of season k.
The deviation sums, Deviation_Sum(i), for the buckets are in turn aggregated (at 406), such as by summing, to produce an error score, error(k), for the corresponding candidate season k: error(k)=Σi Deviation_Sum(i), for i=1 to Nb.
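The continuous-data error computation (per-bucket median, absolute deviations, and their aggregation into error(k)) can be sketched as below; the bucket representation as a dictionary of numeric lists is an illustrative assumption:

```python
import statistics

def continuous_error(buckets):
    """error(k) for continuous data: sum over all buckets of the absolute
    deviations of each bucket's samples from that bucket's median."""
    total = 0.0
    for values in buckets.values():
        med = statistics.median(values)                # statistical measure (402)
        total += sum(abs(v - med) for v in values)     # deviation sum (404, 406)
    return total
```

A season that matches the true periodicity groups similar samples into the same buckets, so the per-bucket deviations, and hence error(k), tend to be small.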
In an alternative implementation, to avoid overfitting, the seasonality detection algorithm performs n-fold cross validation when computing the absolute deviations between data samples in a bucket and the corresponding statistical measure of the bucket. With n-fold cross validation, the data samples in each bucket are partitioned into n groups randomly (n>1). The statistical measure is then calculated on n−1 groups, with the absolute deviation computed on the remaining group. The process is then repeated for each of the n groups.
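The n-fold cross-validation variant for a single bucket might look like the following sketch; the fold construction and the fixed seed are illustrative assumptions, since the description specifies only that the groups are formed randomly:

```python
import random
import statistics

def cv_deviation_sum(values, n=3, seed=0):
    """n-fold cross-validated deviation sum for one bucket: the median is
    computed on n-1 folds and deviations are taken on the held-out fold."""
    rng = random.Random(seed)
    idx = list(range(len(values)))
    rng.shuffle(idx)                          # random partition into n groups
    folds = [idx[i::n] for i in range(n)]
    total = 0.0
    for held in folds:
        train = [values[i] for f in folds if f is not held for i in f]
        if not train or not held:
            continue
        med = statistics.median(train)        # median on the other n-1 groups
        total += sum(abs(values[i] - med) for i in held)
    return total
```

Holding out each group in turn keeps the median from being fitted to the very samples it is scored against, which is what mitigates overfitting.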
The entropy of the data samples in each bucket is then computed (at 504) based on the estimated probability mass function (PMF) of the bucket: entropy(i)=−Σv pv log pv, where pv represents the estimated probability of discrete value v in bucket i.
Next, the error score of season k is computed (at 506) as the average entropy of all buckets: error(k)=(1/Nb)Σi entropy(i), for i=1 to Nb.
Several techniques can be employed to estimate pv, the probability of a discrete value v in a bucket. One such technique applies smoothing to the empirical counts of the values in the bucket.
The smoothing above adds a small pseudo count, s, to each value in each bucket to ensure that when the number of data samples in a bucket is small, the distribution is close to uniform. In some embodiments, the value of s is chosen as a function of the number of buckets (Nb).
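A smoothed-PMF entropy computation for discrete data can be sketched as below; the pseudo count s=0.5 and the explicit alphabet parameter are illustrative assumptions, since the description leaves the exact choice of s open:

```python
import math
from collections import Counter

def bucket_entropy(values, s=0.5, alphabet=None):
    """Entropy of a bucket's discrete values under a smoothed PMF; the
    pseudo count s pushes sparse buckets toward a uniform distribution."""
    alphabet = alphabet or sorted(set(values))
    counts = Counter(values)
    total = len(values) + s * len(alphabet)
    entropy = 0.0
    for v in alphabet:
        p = (counts[v] + s) / total
        entropy -= p * math.log(p)
    return entropy

def discrete_error(buckets, s=0.5):
    """error(k) for discrete data: average entropy over all buckets."""
    entropies = [bucket_entropy(vals, s) for vals in buckets.values()]
    return sum(entropies) / len(entropies)
```

A well-chosen season concentrates each bucket on few values, giving low entropy; a poorly chosen season mixes values and raises the average entropy.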
Seasonality detection software 602 is executable on the processor 604. The seasonality detection software 602 is able to take as input temporal data 610 stored in the storage media 608 to identify a seasonality of the temporal data 610. The temporal data 610 may be received by the computer 600 through a network interface 606 of the computer 600 (from remote sources, such as monitors).
The computer 600 further includes a baseline estimator 612 executable on the processor 604. The baseline estimator 612 is used to perform baseline estimation of the temporal data once the dominant season has been found, which is the season associated with the minimum score discussed above.
Next, upper and lower thresholds for each bucket can be set (at 708) based on the computed statistics. Multiple threshold levels (more than two) can be used based on the levels of anomalies to be detected.
Once the thresholds are set according to the baseline estimation above, as each new data sample is received, it is mapped to a corresponding one of the buckets, and then compared to the thresholds of the mapped bucket to classify the new data sample as normal or abnormal.
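Because the description sets thresholds "based on the computed statistics" without fixing a formula, the sketch below assumes per-bucket mean plus or minus n standard deviations as one plausible choice; the function names and parameters are illustrative:

```python
import statistics

def bucket_thresholds(buckets, n_sigma=3.0):
    """Per-bucket (lower, upper) thresholds: mean +/- n_sigma * stdev.
    The mean/stdev choice is one example of 'computed statistics'."""
    thresholds = {}
    for b, values in buckets.items():
        mu = statistics.mean(values)
        sigma = statistics.pstdev(values)
        thresholds[b] = (mu - n_sigma * sigma, mu + n_sigma * sigma)
    return thresholds

def classify(t, v, season_len, bucket_len, thresholds):
    """Map a new sample to its bucket and flag it as normal or abnormal."""
    lo, hi = thresholds[int((t % season_len) // bucket_len)]
    return 'normal' if lo <= v <= hi else 'abnormal'
```

Additional threshold pairs at other multiples of the standard deviation would give the multiple anomaly levels mentioned above.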
Various enhancements of the algorithms discussed above may be provided. For example, the temporal data may be associated with trends. In some implementations, such trends in the temporal data are detected, such as by computing a periodic median of the temporal data and checking for a linear trend by estimating the best linear regression over aggregated data. Removing trends from the temporal data allows for more accurate identification of seasonality in the temporal data.
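The linear-regression part of the trend removal can be sketched as an ordinary least-squares fit over (time, value) samples; this omits the periodic-median aggregation mentioned above, and the function name is an illustrative assumption:

```python
def detrend(samples):
    """Remove a best-fit linear trend from (time, value) samples via
    ordinary least squares, returning detrended (time, value) pairs."""
    n = len(samples)
    ts = [t for t, _ in samples]
    vs = [v for _, v in samples]
    t_mean = sum(ts) / n
    v_mean = sum(vs) / n
    denom = sum((t - t_mean) ** 2 for t in ts)
    slope = (sum((t - t_mean) * (v - v_mean) for t, v in samples) / denom
             if denom else 0.0)
    # Subtract the fitted trend; the mean level of the series is preserved.
    return [(t, v - slope * (t - t_mean)) for t, v in samples]
```

After detrending, the bucket medians reflect the seasonal pattern rather than the drift, which is what enables the more accurate seasonality identification.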
Also, in some cases, the temporal data may be associated with multiple seasons. One technique of detecting multiple seasons is to detect the most dominant season using the algorithms as discussed above. This most dominant season is then removed from the temporal data, such as by using filtering, averaging, or other technique. Then the next most dominant season is identified, and the process is repeated.
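Subtracting each bucket's median is one concrete instance of the "filtering, averaging, or other technique" for removing the dominant season; the sketch below uses that choice for illustration, with assumed names and sample representation:

```python
import statistics
from collections import defaultdict

def remove_season(samples, season_len, bucket_len):
    """Remove the dominant season by subtracting each bucket's median
    from the samples that fall in that bucket."""
    buckets = defaultdict(list)
    for t, v in samples:
        buckets[int((t % season_len) // bucket_len)].append(v)
    medians = {b: statistics.median(vs) for b, vs in buckets.items()}
    return [(t, v - medians[int((t % season_len) // bucket_len)])
            for t, v in samples]
```

Running the detection algorithm again on the residual samples then surfaces the next most dominant season, and the process repeats.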
It is also possible that the seasons are non-linear—in other words, the seasons do not repeat but change over time. To address this, the time scale could be warped to a linear scale and then the techniques according to some embodiments can be applied.
Instructions of the software described above (including the seasonality detection software 602 and the baseline estimator 612) are loaded for execution on a processor, such as the processor 604.
Data and instructions (of the software) are stored in respective storage devices, which are implemented as one or more computer-readable or computer-usable storage media. The storage media include different forms of memory including semiconductor memory devices such as dynamic or static random access memories (DRAMs or SRAMs), erasable and programmable read-only memories (EPROMs), electrically erasable and programmable read-only memories (EEPROMs) and flash memories; magnetic disks such as fixed, floppy and removable disks; other magnetic media including tape; and optical media such as compact disks (CDs) or digital video disks (DVDs). Note that the instructions of the software discussed above can be provided on one computer-readable or computer-usable storage medium, or alternatively, can be provided on multiple computer-readable or computer-usable storage media distributed in a large system having possibly plural nodes. Such computer-readable or computer-usable storage medium or media is (are) considered to be part of an article (or article of manufacture). An article or article of manufacture can refer to any manufactured single component or multiple components.
In the foregoing description, numerous details are set forth to provide an understanding of the present invention. However, it will be understood by those skilled in the art that the present invention may be practiced without these details. While the invention has been disclosed with respect to a limited number of embodiments, those skilled in the art will appreciate numerous modifications and variations therefrom. It is intended that the appended claims cover such modifications and variations as fall within the true spirit and scope of the invention.
Filing Document | Filing Date | Country | Kind | 371c Date |
---|---|---|---|---|
PCT/US2009/050513 | 7/14/2009 | WO | 00 | 9/22/2011 |
Publishing Document | Publishing Date | Country | Kind |
---|---|---|---|
WO2011/008198 | 1/20/2011 | WO | A |
Number | Name | Date | Kind |
---|---|---|---|
6189005 | Chakrabarti et al. | Feb 2001 | B1 |
20020107841 | Hellerstein et al. | Aug 2002 | A1 |
20040123191 | Salant et al. | Jun 2004 | A1 |
20050066241 | Gross et al. | Mar 2005 | A1 |
20060195423 | Sastry et al. | Aug 2006 | A1 |
20060195444 | Sastry et al. | Aug 2006 | A1 |
20060224356 | Castelli et al. | Oct 2006 | A1 |
20070226626 | Yap et al. | Sep 2007 | A1 |
20090024427 | Shan | Jan 2009 | A1 |
20090182701 | Berger et al. | Jul 2009 | A1 |
20090271406 | Wong et al. | Oct 2009 | A1 |
20090276469 | Agrawal et al. | Nov 2009 | A1 |
20100293124 | Berger et al. | Nov 2010 | A1 |
Entry |
---|
James F. Allen, “Maintaining Knowledge about Temporal Intervals”, ACM, 1983. |
Cao et al, “Spatio-temporal Data Reduction with Deterministic Error Bound”, 2005. |
Donjerkovic et al, “Dynamic Histograms: Capturing Evolving Data Sets”, 1999. |
Max J. Egenhofer, “Temporal Relations of Intervals with a Gap”, 2007. |
Faloutsos et al, “Fast Sequence Matching in Time-Series Databases”, 1994. |
Frank Hoppner, “Discovery of Temporal Patterns Learning Rules about the Qualitative Behavior of Time Series”, 2007. |
Goh et al, “Effect of Temporal Interval Between Scan Acquisitions on Quantitative Vascular Parameters in Colorectal Cancer: Implications for Helical Volumetric Perfusion CT Techniques”, 2008. |
Guha et al, “Data-Streams and Histograms”, ACM, 2001. |
Han et al, “Efficient Mining of Partial Periodic Patterns in Time Series Database”, 1999. |
Hetzer et al, “Integrated Information System for Inter-Domain QoS Monitoring, Modelling and Verification”, 2002. |
Lacouture et al, “Absolute Identification of Temporal Intervals: Preliminary Data”, 2001. |
Laxman et al, “Discovering Frequent Episodes and Learning Hidden Markov Models: A Formal Connection”, IEEE, 2005. |
Laxman et al, “A survey of temporal data mining”, 2006. |
Lee et al, “Mining temporal interval relational rules from temporal data”, 2008. |
Ramaswamy et al, “On the Discovery of Interesting Patterns in Association Rules”, Proceedings of the 24th VLDB Conference, 1998. |
Rossana et al, “Temporal Aggregation and Economic Time Series”, Journal of Business & Economic Statistics, vol. 13, No. 4, 1995. |
Stephen M. Shellman, “Time Series Intervals and Statistical Inference: The Effects of Temporal Aggregation on Event Data Analysis”, 2004. |
Sitzmann et al, “Improving Temporal Joins Using Histograms”, 2000. |
Toumba et al, “Pattern based spatio-temporal Quality of Service analysis for capacity planning”, 2003. |
Xiaobai Yao, “Research Issues in Spatio-temporal Data Mining”, 2003. |
Kawasaki et al., A Model Selection Approach to Detect Seasonal Unit Roots, Dec. 9, 1996 (20 pages). |
Ira Cohen et al., HP, Capturing, Indexing, Clustering, and Retrieving System History, Oct. 2005 (15 pages). |
Gunjan K. Gupta et al., Detecting Seasonal Trends and Cluster Motion Visualization for Very High Dimensional Transactional Data, Apr. 2001 (17 pages). |
Bianca Zadrozny et al., Obtaining Calibrated Probability Estimates from Decision Trees and Naive Bayesian Classifiers, Jun. 2001 (8 pages). |
Number | Date | Country | |
---|---|---|---|
20120016886 A1 | Jan 2012 | US |