Modern organizations often utilize a system landscape consisting of distributed computing systems providing various computing services. For example, in order to implement desired functionality, an organization may deploy services within on-premise data centers (which themselves may be located in disparate geographic locations) and within data centers provided by one or more infrastructure-as-a-service (IaaS) providers. A system landscape may also include computing systems operated by third parties, which are accessed using region-specific access points defined by the third parties. Any number of the computing systems may comprise cloud-based systems (e.g., providing services using scalable-on-demand virtual machines).
Anomalies are rare items, events or observations that differ significantly from normal system states. Anomalous behavior of technical components (e.g., network adapters, containers) within a system landscape contributes negatively to the overall operational cost of the landscape. It is therefore desirable to efficiently detect and classify anomalies which occur within a system landscape. Once detected, additional processes may determine whether an anomaly represents a problem and, if so, automatically initiate resolution of the problem.
Sensors may be used to generate streams of data (e.g., time-series data of metric values) which together represent the state of computing systems within a system landscape. It is desirable to use the data streams to detect anomalies and to determine the root cause of each detected anomaly. In theory, a classifier may be trained to perform this classification task. However, due to the complexity of this classification task, a vast amount of labeled data is required to achieve the desired precision and recall of the classifier. Labeling large data sets consisting of data streams with root causes is expensive and requires expert knowledge. Moreover, since anomalies are rare, acquisition of sufficient amounts of labeled data associated with each root cause may be practically impossible.
Re-occurring problems are often associated with a limited number of root causes. Metrics that are relevant for a specific re-occurring problem may be determined, for example based on practical experiences of domain experts, and used to define tailored data sets for training an anomaly detection and classification system which is specific to the problem. Such problem-specific data sets allow division of the above-described complex classification task into independent lighter-weight binary classification tasks which can determine the degree to which a new instance of a tailored data set is “normal” or “anomalous”.
Due to the labeling problems described above, approaches for implementing these binary classification tasks must use unsupervised learning techniques. These approaches include the use of recurrent artificial neural networks (RNNs) such as Long Short-Term Memory (LSTM) networks. These networks predict future samples of a “normal” time series, and the distance between actual values and the predicted samples can be interpreted as a score indicating the extent of anomalous behavior. Other classification approaches such as Nearest Neighbors or Local Outlier Factor determine density measures which may assist in the identification of outliers (i.e., abnormal states) of a system. Local Outlier Probability provides an anomaly score in the range of [0, 1], which can be interpreted as the probability of an instance of a data set being “anomalous” to a given level of confidence.
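The scoring behavior of such density-based approaches may be illustrated with a brief sketch. The synthetic data, the number of metrics, and the choice of scikit-learn's LocalOutlierFactor are illustrative assumptions only, not part of any embodiment described herein.

```python
# Illustrative sketch: a density-based method yields outlier scores,
# not binary anomaly decisions. Data and parameters are assumptions.
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.default_rng(0)
normal = rng.normal(0.0, 1.0, size=(500, 4))   # 500 "normal" instances, 4 metrics
spikes = rng.normal(6.0, 0.5, size=(5, 4))     # 5 rare anomalous instances
data = np.vstack([normal, spikes])

lof = LocalOutlierFactor(n_neighbors=20)
lof.fit(data)
scores = -lof.negative_outlier_factor_         # higher score => more outlying

# The scores rank instances by outlyingness; deciding which instances
# are anomalies still requires a separately determined threshold.
```

As discussed below, the scores alone leave the normal/anomalous decision open; a threshold must still be determined to make that decision.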
The above approaches determine an anomaly-related value associated with an instance of a data set but do not provide a binary decision indicating whether the instance is “normal” or “anomalous”. Rather, a user or other downstream process must interpret the anomaly-related value, for example by determining a threshold and comparing the value against the threshold to determine whether the value represents normal behavior or an anomaly. A poorly-determined threshold may result in the determination of too many or too few anomalies.
Determination of an appropriate threshold is difficult. The appropriate threshold may vary over time, across different computing systems and across different problem types. Systems are desired for efficiently and dynamically determining an appropriate score threshold for classifying an instance of a data set as an anomaly.
The following description is provided to enable any person in the art to make and use the described embodiments. Various modifications, however, will remain readily apparent to those in the art.
Some embodiments provide a system to estimate suitable score thresholds in unsupervised environments without labeled data sets. Some embodiments assume that an arbitrary data set is dominated by normal system states even if it includes some anomalies. Accordingly, embodiments may produce new, or “surrogate”, data sets that are similar to acquired arbitrary data sets but which do not include anomalies. Such generated surrogates can then be used as a reference to discriminate normal from anomalous states without the need for a priori data labeling.
Some embodiments train a generative model to generate a surrogate data set from an arbitrary data set. Even if anomalies exist in the arbitrary data set, these anomalies do not reflect main characteristics of the data set and therefore will not be generalized by a well-sized generative model. In other embodiments, an original set of time-series data is transformed into the Fourier-space and the phases of the Fourier-components are randomly shuffled. The shuffled data set is then transformed back into a time-series data set using the inverse Fourier-transformation. The resulting surrogate data set exhibits the same spectral characteristics as the original data set, but without anomalies. Any anomalies which exist in the original data set are blurred along the time series.
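The Fourier-based surrogate generation described above may be sketched as follows. The helper name, the synthetic series, and the injected anomaly are illustrative assumptions.

```python
# Sketch of Fourier-based surrogate generation: keep each frequency
# component's magnitude (and thus the spectrum), randomize its phase.
import numpy as np

def phase_shuffled_surrogate(x, rng=None):
    """Return a surrogate series with the same amplitude spectrum as x."""
    if rng is None:
        rng = np.random.default_rng()
    spectrum = np.fft.rfft(x)
    phases = rng.uniform(0.0, 2.0 * np.pi, size=spectrum.shape)
    phases[0] = 0.0    # keep the DC component real
    phases[-1] = 0.0   # keep the Nyquist component real (even-length input)
    surrogate_spectrum = np.abs(spectrum) * np.exp(1j * phases)
    return np.fft.irfft(surrogate_spectrum, n=len(x))

t = np.linspace(0.0, 10.0, 1000)
series = np.sin(2.0 * np.pi * t)
series[500] += 5.0                      # inject a point anomaly
surrogate = phase_shuffled_surrogate(series, np.random.default_rng(42))
# The spike's energy is blurred across the whole surrogate series,
# while the spectral characteristics of the original are preserved.
```

Because only phases are randomized, the surrogate exhibits the same power spectrum as the original series, matching the property the passage relies upon.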
Anomaly detection as described herein can be used to enrich top-level alerting, such as violation of service level agreements. Embodiments may accelerate root cause analysis in the case of incidents. Moreover, embodiments may facilitate the labeling of historical data sets, which may then be used for supervised training of a system to perform complex classification tasks.
Computing system 110 may comprise any number of hardware and software components which may provide functionality to one or more users (not shown). Computing system 110 may comprise a landscape of disparate cloud-based services, a single computer server, a cluster of servers, and any other combination that is or becomes known.
Computing system 110 generates metric-related data. Such data may be related to metrics associated with resource consumption (e.g., CPU utilization, memory utilization, bandwidth consumption), hardware performance (e.g., read/write speeds, bandwidth, CPU speed), application performance (e.g., queries served per second, number of simultaneous sessions), and any other metrics that are or become known. The data generated for each metric may comprise time-series data, and may be generated at different respective time intervals.
Monitoring system 120 may comprise any suitable system to receive the metric-related data generated by computing system 110. Monitoring system 120 may query computing system 110 for selected metric-related data, may subscribe to the selected metric-related data, may receive metric-related data pushed from computing system 110, or may acquire the metric-related data therefrom using any suitable protocol. Monitoring system 120 may execute an application for recording real-time metric data in a time series database using an HTTP pull model, such as but not limited to Prometheus.
Monitoring system 120 provides time-series data of each of a plurality of metrics to anomaly detection system 130. In the illustrated example, data for each metric associated with a first time point (e.g., M0t0, M1t0, . . . , M9t0, given metrics M0-M9) is provided, followed by data for each metric associated with a next relevant time point (e.g., M0t1, M1t1, . . . , M9t1), and so on. Embodiments are not limited thereto. For example, monitoring system 120 may provide the data for each metric to anomaly detection system 130 as an independent time-series (e.g., M0t0, M0t1, . . . , M0tn; M1t0, M1t1, . . . , M1tn; . . . M9t0, M9t1, . . . , M9tn). In cases where the data is generated by computing system 110 at high sampling rates, and in order to reduce processing costs, monitoring system 120 may provide time-series data based on a reasonable time delta Δt (e.g., M0t0, M0(t0+1*Δt), M0(t0+2*Δt), . . . , M0(t0+n*Δt)) if a higher sampling rate is not required for anomaly detection.
Monitoring system 120 may perform any suitable processing on the data prior to providing the data to anomaly detection system 130, including but not limited to noise reduction, normalization, and filtering. For example, the time series may be normalized using a standard scaler with average 0 before serialization, in order to avoid artifacts due to the different scales of different metrics M. Pre-processing may also or alternatively be performed by system 130. In a similar regard, the processes attributed herein to system 130 may be performed in whole or in part by monitoring system 120 according to some embodiments.
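The standard-scaler normalization mentioned above might look as follows. The metric values, their units, and the use of scikit-learn's StandardScaler are illustrative assumptions.

```python
# Sketch: per-metric standard scaling so differently-scaled metrics
# (e.g. CPU percentage vs. bytes) contribute comparably. The values
# below are synthetic assumptions.
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)
raw = np.column_stack([
    rng.normal(70.0, 5.0, size=200),    # e.g. CPU utilization in percent
    rng.normal(3e9, 1e8, size=200),     # e.g. memory consumption in bytes
])
scaled = StandardScaler().fit_transform(raw)
# Each column now has mean 0 and unit variance, avoiding artifacts
# caused by the vastly different scales of the two metrics.
```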
Anomaly detection system 130 determines a score threshold for use in identifying a past anomaly. Generally, anomaly detection system 130 trains a score generator to generate an outlier score based on the received time-series data, generates surrogate time-series data based on the received time-series data, inputs the surrogate time-series data to the trained score generator to determine an outlier score threshold for identification of an anomaly, and compares outlier scores generated by the trained score generator based on the received time-series data to the outlier score threshold to identify anomalies in the received time-series data.
More particularly, according to some embodiments, data windowing component 132 of anomaly detection system 130 generates training data instances based on the data received from monitoring system 120. Each training data instance includes time-series data of each relevant metric for a given respective time period. The time periods associated with two or more training data instances may partially overlap (thus the use of the term “windowing”).
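The overlapping-window instance construction can be sketched as follows. The window size, stride, and helper name are illustrative assumptions.

```python
# Sketch of overlapping-window instance construction: each training
# instance flattens every metric's values over one time window.
import numpy as np

def windowed_instances(data, window, stride):
    """data: (timesteps, n_metrics) -> (n_instances, window * n_metrics)."""
    starts = range(0, data.shape[0] - window + 1, stride)
    return np.array([data[s:s + window].ravel() for s in starts])

data = np.arange(20.0).reshape(10, 2)          # 10 time steps, 2 metrics
instances = windowed_instances(data, window=4, stride=2)
# Because stride < window, consecutive instances share two time steps,
# i.e., their time periods partially overlap.
```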
Unsupervised learning system 134 trains score generator 135 to output an outlier score based on the training data instances. Score generator 135 may comprise an RNN as described above, a Nearest Neighbors algorithm, an Isolation Forest algorithm, a Local Outlier Factor algorithm or a Local Outlier Probability algorithm, for example. Unsupervised learning system 134 may comprise any suitable system to train the selected type of score generator 135.
Training of score generator 135 may include input of training data, acquisition of resulting output, modification of score generator 135 based on the output, and determination to terminate training upon satisfaction of a given target (e.g., an accuracy level, an elapsed time period, a number of iterations). The trained score generator 135 may receive a data instance consisting of the values of several metrics at different time points and output an outlier score predicting a degree to which the data instance represents an anomaly. As noted above, the outlier score does not indicate whether the data instance does or does not represent an anomaly.
Surrogate data generator 136 generates surrogate time-series data for each metric, based on the time-series data of that metric which was received from monitoring system 120. Surrogate data generator 136 may comprise a generative model or an autoencoder trained to generate time-series data having similar spectral characteristics to input time-series data, but without including any anomalies. According to some embodiments, and for each metric, surrogate data generator 136 applies a Fourier transform to the associated acquired time-series data, performs phase-shuffling of the resulting frequency components, and applies an inverse Fourier transform to return the data to the time domain.
The surrogate time-series data for each metric is then input to the trained score generator 135. As described above, data windowing component 132 creates surrogate data instances consisting of surrogate time-series data of each metric for a given respective time period. The time periods may be the same time periods used to generate the training data instances used to train score generator 135. The surrogate data instances are input to score generator 135, resulting in an outlier score for each surrogate data instance (and for each time period represented by each surrogate data instance).
Anomaly threshold determination component 138 determines a score threshold based on the outlier scores generated based on the surrogate data instances. In some embodiments, the score threshold is equal to the highest of the generated outlier scores. Embodiments are not limited thereto. For example, if several outlier scores are much higher than the other outlier scores, the score threshold may be determined as the smallest of the several higher outlier scores, the average of the several higher outlier scores, or in any other manner.
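The threshold determination and subsequent anomaly identification may be sketched with the simplest rule described above (maximum surrogate score). All score values below are illustrative assumptions.

```python
# Sketch: take the maximum outlier score over the surrogate instances
# as the anomaly threshold, then flag training instances above it.
import numpy as np

surrogate_scores = np.array([0.41, 0.38, 0.52, 0.47, 0.44])
threshold = surrogate_scores.max()                  # simplest rule

training_scores = np.array([0.40, 0.55, 0.43, 0.97, 0.50])
anomalies = np.flatnonzero(training_scores > threshold)
# Instances with scores 0.55 and 0.97 exceed the threshold of 0.52
# and are therefore flagged as anomalies.
```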
Anomaly detection system 130 then identifies those training data instances which, when input to trained score generator 135, result in an outlier score greater than the determined threshold. The identified training data instances are each determined to represent an anomaly. Stated differently, an anomaly is detected for each of the time periods represented by the identified training data instances. In contrast to prior systems which merely output an outlier score, an appropriate user, administrator or department may be automatically notified of the detected anomalies without requiring further processing or user judgment.
The above process may be repeated successively after the acquisition of new time-series data by monitoring system 120. A subsequent execution of the process may use some of the time-series data which was used during a previous execution of the process, or all new time-series data. Since score generator 135 is re-trained during the subsequent execution using at least partially new training data instances, and because the generated surrogate time-series data differs from the previous execution, the newly-determined score threshold may differ from the previously-determined score threshold. This dynamic determination of the score threshold may advantageously adapt to natural, innocuous changes to computing system 110 over time and may result in more reliable detection of anomalies than prior systems.
System 100 shows domain expert 140 in communication with system 130. Domain expert 140 may provide a particular set of metrics which are believed to be indicative of a particular type of system problem. According to some embodiments, these metrics are the metrics whose time-series data is used to train score generator 135 and is converted to surrogate time-series data. Accordingly, the anomalies detected by system 130 as described above are indicative of a potential occurrence of the particular type of problem. Any notifications of the detected anomalies may therefore also include an identification of the problem.
Anomaly detection system 130 may perform the process described above for different sets of metrics which are indicative of different problems. Each of these processes may be performed in parallel. One or more metrics may be included in two or more of such sets of metrics.
Initially, at S205, a plurality of metrics associated with a problem type are determined. The plurality of metrics may be determined by a domain expert. Computing systems may generate hundreds of metrics, many of which may be irrelevant to detection of a particular type of problem. Leveraging the knowledge of a domain expert, the plurality of metrics determined at S205 may be narrowed to a set of indicators of a specific problem, thereby specializing a general problem detector into a detector of specific problem classes.
At S210, time-series data of each of the plurality of metrics is acquired. The data may be acquired by a monitoring component which obtains the data from one or more computing systems, and/or directly from the one or more computing systems. The acquired time-series data includes, for each metric, a value of the metric at different points in time. Training data instances are determined from overlapping windows of the time-series data at S215. The windows need not overlap in some embodiments.
A system is trained at S220 using unsupervised learning to generate an outlier score. The training is based on the training data instances determined at S215. The outlier score represents a degree to which an input instance might represent an anomaly, but does not indicate whether the data instance does or does not represent an anomaly. The training is unsupervised because the training data instances are unlabeled (i.e., not associated with an indication of whether or not they are examples of an anomaly). The score-generating system may comprise an RNN, a Nearest Neighbors algorithm, an Isolation Forest algorithm, a Local Outlier Factor algorithm or a Local Outlier Probability algorithm, for example.
Surrogate time-series data for each metric is generated at S225. The surrogate time-series data may be generated by a generative model or an autoencoder trained to generate time-series data having similar characteristics to input time-series data, but without including any anomalies. In other embodiments, S225 includes applying a Fourier transform to time-series data of a metric, phase-shuffling the resulting frequency components, and applying an inverse Fourier transform to return the data to the time domain. These steps are performed independently for the time-series data of each metric, to generate surrogate time-series data for each metric.
Input data instances are determined from overlapping windows of the surrogate time-series data at S230. The input data instances may be determined in the same manner as described above with respect to the training data instances, and using the same time periods used to generate the training data instances.
The input data instances are input to the trained score generator at S235 to generate an outlier score for each input data instance. For example, an outlier score may be generated for each of input data instances 710, 720 and 730.
A largest one of the outlier scores is identified at S240 for use as the score threshold.
Next, at S245, the training data instances which, when input to the trained score generator, result in an outlier score greater than the determined threshold are identified.
In some cases, surrogate time-series data might contain anomaly-like patterns due to random interferences in phase space. These patterns may result in determination of a larger-than-optimal score threshold. To address this possibility, S225-S240 could be repeated several times for the same time window, with the final threshold set to an average of the intermediately-determined score thresholds. Alternatively, and to avoid extra calculations per time window, some embodiments determine the threshold as a weighted average of the last n determined thresholds (e.g., n=8) with exponential decay.
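The exponentially-decayed weighted average over the last n determined thresholds can be sketched as follows. The decay factor, the helper name, and the example history are illustrative assumptions.

```python
# Sketch: smooth the threshold over the last n determinations with
# exponentially decaying weights, so a single spurious surrogate
# score does not dominate. The decay factor is an assumption.
import numpy as np

def smoothed_threshold(recent_thresholds, decay=0.5):
    """recent_thresholds: oldest first; the newest receives weight 1."""
    t = np.asarray(recent_thresholds, dtype=float)
    weights = decay ** np.arange(len(t) - 1, -1, -1)
    return float(np.sum(weights * t) / np.sum(weights))

history = [0.60, 0.58, 0.90, 0.62]   # last n=4 thresholds, oldest first
threshold = smoothed_threshold(history)
# The one-off spike (0.90) is damped instead of dominating the threshold.
```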
At S250, a notification of an anomaly is generated for each of the identified training data instances. The notification may include the time periods associated with the training data instances, values of each of the plurality of metrics during the time periods, and an identification of the associated problem type.
Flow returns to S210 to acquire the next time-series data of each of the plurality of metrics. The process then continues as described above, using only newly-acquired time-series data or also some of the time-series data which was used during the previous execution of process 200. Accordingly, the system is re-trained at S220 using at least partially new training data instances, and the generated surrogate time-series data differ from those generated during the previous execution. Therefore, the score threshold next determined at S240 may differ from the previously-determined score threshold.
In some embodiments, after a first execution of process 200, flow cycles between S210, S245 and S250 to acquire data and identify anomalies based on the currently-determined outlier score threshold. During this execution, the threshold will not change (i.e., adapt to newly-seen data) but the time required to identify anomalies will be reduced. Retraining (i.e., execution of all steps S210-S250) may then be executed occasionally.
As mentioned above, different instances of process 200 may be performed for different sets of metrics associated with different problem types. Each of these instances may be performed in parallel. Moreover, two or more of these different sets of metrics may include one or more of the same metrics.
Anomaly detection system 950 and its components 952 through 958 may operate as described above with respect to anomaly detection system 130. In this regard, domain experts 960 and 970 may each provide system 950 with a respective set of metrics for each of one or more problem types. Anomaly detection system 950 may execute in parallel an instance of process 200 for each problem type, using the respective set of metrics for each problem type. It is assumed that the time-series metric-related data provided by monitoring system 940 includes all of the metrics of each set of metrics, and that a particular metric may be included in one or more of the sets of metrics.
In some embodiments, monitoring system 940 provides time-series metric-related data as well as identifiers of the systems 910-930 from which the data of each time-series was acquired. Anomaly detection system 950 may use this information to train computing system-dependent models and determine computing system-dependent thresholds as described above.
Application server 1010 and database server 1020 may operate to provide one or more services to users. In this regard, application server 1010 and database server 1020 may comprise an implementation of computing system 110, or of computing systems 910 and 920.
Monitoring system 1030 receives metric-related time series data from each of application server 1010 and database server 1020. Anomaly detection system 1040 receives this data (or a subset thereof) from monitoring system 1030. Anomaly detection system 1040 may operate as described herein to identify anomalies based on the received time-series data.
The foregoing diagrams represent logical architectures for describing processes according to some embodiments, and actual implementations may include more or different components arranged in other manners. Other topologies may be used in conjunction with other embodiments. Moreover, each component or device described herein may be implemented by any number of devices in communication via any number of other public and/or private networks. Two or more of such computing devices may be located remote from one another and may communicate with one another via any known manner of network(s) and/or a dedicated connection. Each component or device may comprise any number of hardware and/or software elements suitable to provide the functions described herein as well as any other functions. For example, any computing device used in an implementation of a system according to some embodiments may include a processor to execute program code such that the computing device operates as described herein.
All systems and processes discussed herein may be embodied in program code stored on one or more non-transitory computer-readable media. Such media may include, for example, a hard disk, a DVD-ROM, a Flash drive, magnetic tape, and solid-state Random Access Memory (RAM) or Read Only Memory (ROM) storage units. Embodiments are therefore not limited to any specific combination of hardware and software.
Embodiments described herein are solely for the purpose of illustration. Those in the art will recognize that other embodiments may be practiced with modifications and alterations to that described above.
Other Publications:

Kriegel, Hans-Peter, et al., “LoOP: Local Outlier Probabilities”, CIKM '09, Nov. 2-6, 2009, pp. 1649-1652.

Wang, Lin, et al., “Log-based Anomaly Detection from Multi-view by Associating Anomaly Scores with User Trust”, 2021 IEEE 20th International Conference on Trust, Security and Privacy in Computing and Communications (TrustCom), 2021, pp. 643-650, DOI 10.1109/TrustCom53373.2021.00096.