Metric alert rules are used to proactively detect service problems. Many of today's alerts are applied to various metrics generated by a service and rely on manually defined threshold values. An effective alert rule fires when the metric does not behave as expected while not creating too many false positives. Configuring static thresholds is a complex task, requiring the service owner to learn the historical behavior of each metric, apply deep domain knowledge of the service, and predict what value ranges should be considered within the norm. The challenge scales up when a metric has one or more dimensions slicing it into multiple time series with different normal behaviors. In the dynamic environment in which modern services operate, services undergo frequent updates, and there are frequent changes to the way services are consumed. This requires an ongoing adjustment of static thresholds, which means repeating the complex task every time a change happens.
Forecasting future metric values based on past behavior is widely used in alerting systems, where a prediction mechanism provides not only a single predicted value for a future timestamp but also a range around that value representing the model's estimate of the possible error in the prediction. It is important that this uncertainty range be estimated accurately for the system to provide valuable anomaly detections. Too wide a range makes the prediction not useful, while too narrow a range results in an excessive number of anomalies.
This Summary is provided to introduce a selection of concepts in a simplified form that are further described below in the Detailed Description. This Summary is not intended to identify key features or essential features of the claimed subject matter, nor is it intended to be used to limit the scope of the claimed subject matter.
Methods, systems, apparatuses, and computer-readable storage mediums described herein are configured to provide dynamic thresholds for alerting users of anomalous resource usage of computing resources. The dynamic thresholds may be based on the historical behavior of compute metrics (or a time series obtained therefor) associated with the computing resources and a detected seasonality in that time series. The seasonality is detected based on an analysis of several different time series combinations that are based on the original time series, which advantageously increases the probability of successful seasonality detection. Based on characteristics of the time series, a model for generating dynamic thresholds may be determined. The dynamic thresholds track the detected seasonality of the compute metrics, rather than being a static (or straight-line) threshold. As utilization of the computing resources continues, the determined thresholds are applied to the compute metrics. If the determined thresholds are exceeded, an alert indicating an anomalous resource usage (which may be indicative of an issue with respect to the computing resource(s)) may be provided to a user. The dynamic threshold may be adjusted (e.g., tightened or relaxed) based on a confidence level of the detected seasonality. This advantageously reduces the number of false alerts.
Further features and advantages, as well as the structure and operation of various example embodiments, are described in detail below with reference to the accompanying drawings. It is noted that the example implementations are not limited to the specific embodiments described herein. Such example embodiments are presented herein for illustrative purposes only. Additional implementations will be apparent to persons skilled in the relevant art(s) based on the teachings contained herein.
The accompanying drawings, which are incorporated herein and form a part of the specification, illustrate example embodiments of the present application and, together with the description, further serve to explain the principles of the example embodiments and to enable a person skilled in the pertinent art to make and use the example embodiments.
The features and advantages of the implementations described herein will become more apparent from the detailed description set forth below when taken in conjunction with the drawings, in which like reference characters identify corresponding elements throughout. In the drawings, like reference numbers generally indicate identical, functionally similar, and/or structurally similar elements. The drawing in which an element first appears is indicated by the leftmost digit(s) in the corresponding reference number.
The present specification and accompanying drawings disclose numerous example implementations. The scope of the present application is not limited to the disclosed implementations, but also encompasses combinations of the disclosed implementations, as well as modifications to the disclosed implementations. References in the specification to “one implementation,” “an implementation,” “an example embodiment,” “example implementation,” or the like, indicate that the implementation described may include a particular feature, structure, or characteristic, but every implementation may not necessarily include the particular feature, structure, or characteristic. Moreover, such phrases are not necessarily referring to the same implementation. Further, when a particular feature, structure, or characteristic is described in connection with an implementation, it is submitted that it is within the knowledge of persons skilled in the relevant art(s) to implement such feature, structure, or characteristic in connection with other implementations whether or not explicitly described.
In the discussion, unless otherwise stated, adjectives such as “substantially” and “about” modifying a condition or relationship characteristic of a feature or features of an implementation of the disclosure, should be understood to mean that the condition or characteristic is defined to within tolerances that are acceptable for operation of the implementation for an application for which it is intended.
Furthermore, it should be understood that spatial descriptions (e.g., “above,” “below,” “up,” “left,” “right,” “down,” “top,” “bottom,” “vertical,” “horizontal,” etc.) used herein are for purposes of illustration only, and that practical implementations of the structures described herein can be spatially arranged in any orientation or manner.
Numerous example embodiments are described as follows. It is noted that any section/subsection headings provided herein are not intended to be limiting. Implementations are described throughout this document, and any type of implementation may be included under any section/subsection. Furthermore, implementations disclosed in any section/subsection may be combined with any other implementations described in the same section/subsection and/or a different section/subsection in any manner.
Embodiments described herein provide dynamic thresholds for alerting users of anomalous resource usage of computing resources. The dynamic thresholds may be based on the historical behavior of compute metrics (or a time series obtained therefor) associated with the computing resources and a detected seasonality in that time series. The seasonality is detected based on an analysis of several different time series combinations that are based on the original time series, which advantageously increases the probability of successful seasonality detection. Based on characteristics of the time series, a model for generating dynamic thresholds may be determined. The dynamic thresholds track the detected seasonality of the compute metrics, rather than being a static (or straight-line) threshold. As utilization of the computing resources continues, the determined thresholds are applied to the compute metrics. If the determined thresholds are exceeded, an alert indicating an anomalous resource usage (which may be indicative of an issue with respect to the computing resource(s)) may be provided to a user. The dynamic threshold may be adjusted (e.g., tightened or relaxed) based on a confidence level of the detected seasonality. This advantageously reduces the number of false alerts.
The foregoing techniques advantageously enable the automatic detection of seasonal behavior and automatically set the thresholds such that an alert will be triggered only on deviation from the expected seasonal behavior. For example, alerts based on dynamic thresholds will not be triggered if a service is regularly idle on the weekends and then spikes every Monday. The techniques described herein recognize this seasonality and generate the dynamic thresholds based thereon. Static thresholds, on the other hand, are not very effective for such seasonal metrics. Instead, static thresholds issue alerts during spikes caused by seasonal behaviors, and as a result, unnecessary diagnostics are performed on the associated compute resource. This in turn causes significant downtime with respect to the compute resource. Accordingly, the techniques described herein improve the functionality of a system in which such compute resources are included, as any issues with the compute resources are accurately detected (and thus resolvable), while also avoiding unnecessary downtime due to false positives.
Moreover, the embodiments described herein improve the functioning of the computing devices for which the metrics are being obtained. For instance, conventional techniques that utilize static thresholds may mask legitimate issues. If the static threshold is set large enough to accommodate a large seasonal spike, then anomalous behaviors may go undetected. As such, a user may never be alerted when such a behavior occurs and thus may never remedy the issue. This may have a detrimental effect on the computing device. For instance, the computing device may be suffering from abnormal memory usage and/or network usage, which would go unnoticed by the user. Accordingly, the computing device may operate much more slowly and/or may be unable to properly handle requests. In contrast, because the embodiments described herein dynamically track metrics based on their seasonality, such a situation is avoided.
For example,
Clusters 102A, 102B and 102N may form a network-accessible server set. Each of clusters 102A, 102B and 102N may comprise a group of one or more nodes and/or a group of one or more storage nodes. For example, as shown in
In an embodiment, one or more of clusters 102A, 102B and 102N may be co-located (e.g., housed in one or more nearby buildings with associated components such as backup power supplies, redundant data communications, environmental controls, etc.) to form a datacenter, or may be arranged in other manners. Accordingly, in an embodiment, one or more of clusters 102A, 102B and 102N may be a datacenter in a distributed collection of datacenters.
Each of node(s) 108A-108N, 112A-112N and 114A-114N may be configured to execute one or more software applications (or “applications”) and/or services and/or manage hardware resources (e.g., processors, memory, etc.), which may be utilized by users (e.g., customers) of the network-accessible server set. Node(s) 108A-108N, 112A-112N and 114A-114N may also be configured for specific uses. For example, as shown in
Dynamic threshold-based alert engine 118 may be configured to determine dynamic thresholds for alerting users of anomalous resource usage of resources maintained by system 100. For instance, a monitor may obtain metrics associated with resources, such as, but not limited to, operating systems, applications, services executing on one or more of nodes 108A-108N, 112A-112N and/or 114A-114N, hardware and virtual resources maintained by the network-accessible server set (e.g., nodes 108A-108N, 112A-112N and/or 114A-114N, virtual machines, central processing units (CPUs), storage (e.g., storage disks 122), memories, etc.), and/or I/O, network bandwidth, power, etc., associated therewith. The metrics may represent numerical data values that describe an aspect of such resources at a particular point of time. For example, the metrics may represent CPU usage, a number of requests issued by a particular application or service, memory or storage utilization, etc. Such metrics may be collected at regular intervals (e.g., each second, each minute, each hour, each day, etc.) and may be aggregated as a time series (i.e., a series of data points indexed in time order). The monitor may collect multiple days' or weeks' worth of data to obtain the historical behavior of the metric. The time series for each metric may be stored in a storage, such as storage disks 122.
Dynamic threshold-based alert engine 118 may analyze the historical behavior of the metric (i.e., the time series) to determine a seasonal pattern (i.e., a seasonality) therein. A seasonal pattern is a characteristic of the time series in which the data experiences regular or predictable changes that occur at a particular time interval, such as hourly, daily, weekly, etc. Examples of seasonal patterns include, but are not limited to, increased network traffic on weekdays compared to weekends, increased network traffic during business hours compared to non-business hours, a daily spike in CPU and/or storage utilization (e.g., due to a backup process), etc. The foregoing may be determined by generating several different time series combinations based on the original time series. Additional details regarding determining a seasonal pattern are described below in Subsection A.
The historical time series and/or determined seasonal pattern for a given metric may be utilized by a model selector, which is configured to automatically select a modeler for generating dynamic thresholds with regard to the metric. The model selector may utilize the determined seasonal pattern and/or the diversity of values of the metric to determine which model best fits. Examples of modelers include, but are not limited to, a low dispersion-based modeler, a seasonal adjusted boxplot-based modeler and a Box-Cox transformation-based modeler. The selected modeler is utilized to determine the dynamic thresholds for the metric. Additional details regarding the model selector are described below in Subsection B.
As the computing resources continue to operate, the monitor continues to obtain computing metrics associated with such resources. The determined thresholds are applied to such computing metrics. If the determined thresholds are exceeded, an alert indicating anomalous resource usage with respect to the computing resource(s) may be provided to a user (e.g., via computing device 104).
A user may access dynamic threshold-based alert engine 118 via computing device 104, for example to enable dynamic threshold generation and/or to receive anomalous resource usage alerts. As shown in
A. Seasonal Pattern Detection
Monitor 202 may obtain metrics associated with resources 218, such as, but not limited to, operating systems, applications, services executing on one or more of nodes 108A-108N, 112A-112N and/or 114A-114N, hardware and virtual resources maintained by the network-accessible server set (e.g., nodes 108A-108N, 112A-112N and/or 114A-114N, virtual machines, central processing units (CPUs), storage (e.g., storage disks 122), memories, etc.), and/or I/O, network bandwidth, power, etc., associated therewith. Such metrics may be collected at regular intervals and may be aggregated as a time series 216. Monitor 202 may collect multiple days' or weeks' worth of data to obtain the historical behavior of the metric. The time series for each metric may be stored in storage 214, which may be an example of storage disk(s) 122.
Seasonality detector 204 may be configured to analyze the historical behavior of the metric (i.e., time series 216) to determine a seasonal pattern (i.e., a seasonality) therein. Using known techniques to detect seasonality in a time series is problematic in many real-world scenarios due to noise in the metric that prevents the seasonality from being detected (e.g., by using Fast Fourier Transforms (FFTs)). To overcome this, seasonality detector 204 may generate several different time series combinations that are generated based on the original time series (e.g., time series 216), which advantageously increases the probability of detecting the seasonality. An FFT may be applied to each combination to detect seasonality for each of the generated combinations. The foregoing may be performed using unsupervised machine learning-based techniques, which utilize time series 216 during a training phase in which the seasonal pattern is detected. In accordance with an embodiment, the training phase may be performed approximately every 24 hours. In accordance with such an embodiment, data newly available since the last training phase is added while trailing data is omitted. In accordance with a further embodiment, the history time span used for training may be 10 days, except when weekly seasonality is detected, in which case 28 days of history are used. As will be described below, once the seasonal pattern is detected (e.g., via the training phase), dynamic thresholds may be generated based thereon, and the dynamic thresholds may be applied to current compute metrics to detect anomalous behavior. Such techniques may continuously learn a particular metric's behavior and adapt to metric changes. That is, the seasonal pattern and amount of data used for training determined for a particular metric may change over time as the behavior of the metric being monitored changes.
Each time series combination may be generated based on a combination of one or more parameters. The parameter(s) may include, but are not limited to, a clipped (or non-clipped) version of the time series and/or one or more filtered versions of the non-clipped and/or clipped version of the time series, where the time series are filtered based on different window sizes.
For instance, time series 216 may be clipped by clipper 206. Clipper 206 may be configured to remove outlying data points of time series 216 (e.g., to remove spikes in the metric). For instance, clipper 206 may be configured to remove a certain percentage of the highest and lowest values of time series 216 (e.g., 5% of the highest and lowest values) to generate a clipped time series 220. Each of time series 216 and clipped time series 220 may be provided to time-based filter(s) 208.
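For illustration, the clipping operation described above might be implemented as in the following sketch. It is a non-limiting example in Python; the 5% cutoff mirrors the example above, and whether outlying values are dropped or capped is an implementation choice not dictated by the description (this sketch caps them so the series stays aligned in time for the later filtering steps).

```python
import numpy as np

def clip_time_series(values, lower_pct=5, upper_pct=95):
    """Suppress outlying spikes by capping values outside the given percentiles.

    Capping (rather than dropping) the outliers keeps the series time-aligned
    for the subsequent window-based filtering.
    """
    lower = np.percentile(values, lower_pct)
    upper = np.percentile(values, upper_pct)
    return np.clip(values, lower, upper)

# Example: two spikes that would otherwise dominate the spectrum are pulled back
# toward the bulk of the data.
raw = np.array([10.0, 11.0, 9.0, 250.0, 10.5, 9.5, 0.1, 10.2])
clipped = clip_time_series(raw)
```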
Time-based filter(s) 208 may be configured to perform a filtering (or smoothing) function on time series 216 based on different window sizes. The filtering function is configured to reduce the noise in the metric, while preserving the seasonal pattern. The different window sizes may be computed to match the seasonal spans that are frequently recurring in time series 216 (e.g., hourly, daily, weekly, etc.). For instance, time-based filter(s) 208 may generate a first filtered time series 222 based on time series 216 in accordance with a first window size (e.g., hourly). In particular, time-based filter(s) 208 may generate first filtered time series 222 by performing a filtering function on time series 216 that, for each data point in time series 216, combines adjacent data points with the data point to determine an average value. The average values are used to generate first filtered time series 222. Time-based filter(s) 208 may generate a second filtered time series 224 and a third filtered time series 226 based on time series 216 in accordance with a second window size (e.g., daily) and a third window size (e.g., weekly), respectively, in a similar manner as described with reference to first filtered time series 222. However, when generating second filtered time series 224, time-based filter(s) 208 may combine adjacent data points that are farther from the given data point than those utilized to generate first filtered time series 222. Similarly, when generating third filtered time series 226, time-based filter(s) 208 may combine adjacent data points that are farther from the given data point than those utilized to generate second filtered time series 224.
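A corresponding sketch of the window-based filtering is shown below. The window sizes (12, 288 and 2,016 samples for hourly, daily and weekly spans) assume a 5-minute sampling interval, which is an illustrative assumption rather than a requirement of the embodiments.

```python
import numpy as np

def moving_average(values, window):
    """Centered moving-average filter: reduces noise while preserving seasonal structure."""
    kernel = np.ones(window) / window
    return np.convolve(values, kernel, mode="same")

# Window sizes matched to hourly, daily and weekly seasonal spans, assuming one
# sample every 5 minutes (12, 288 and 2016 samples, respectively).
windows = {"hourly": 12, "daily": 288, "weekly": 2016}
# filtered = {name: moving_average(time_series, w) for name, w in windows.items()}
```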
Time-based filter(s) 208 may also be configured to perform a filtering function on clipped time series 220 based on different window sizes in a similar manner as described above. For instance, as shown in
Combiner 210 may be configured to combine time series 216 with first filtered time series 222 to generate a first combined time series 234, combine time series 216 with second filtered time series 224 to generate a second combined time series 236, combine time series 216 with third filtered time series 226 to generate a third combined time series 238, combine clipped time series 220 with first filtered, clipped time series 228 to generate a fourth combined time series 240, combine clipped time series 220 with second filtered, clipped time series 230 to generate a fifth combined time series 242, and combine clipped time series 220 with third filtered, clipped time series 232 to generate a sixth combined time series 244. Combiner 210 may perform a Cartesian multiplication operation to perform the above-referenced combinations to generate first combined time series 234, second combined time series 236, third combined time series 238, fourth combined time series 240, fifth combined time series 242, and sixth combined time series 244.
Transformer 212 may be configured to perform an FFT on each of time series 216, clipped time series 220, first combined time series 234, second combined time series 236, third combined time series 238, fourth combined time series 240, fifth combined time series 242, and sixth combined time series 244 to detect seasonality for each of these time series. The foregoing process considerably increases the probability of detecting the seasonality in at least one of the generated time series. It is noted that transformer 212 attempts to find seasonality in the original time series (i.e., time series 216) so as not to hinder the seasonality detection in the event that clipped time series 220 no longer includes the seasonality due to the clipping operation performed by clipper 206. If a seasonal pattern is detected, transformer 212 outputs a detected seasonal pattern 246. In the event that more than one seasonal pattern is detected (e.g., a daily seasonality and a weekly seasonality), the longest seasonal pattern (e.g., the weekly seasonality) is used to model the data because the patterns are multiples of each other. When modeling, for example, a weekly seasonal pattern that also has a daily seasonality, the entire daily seasonality is contained within the weekly seasonal pattern.
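As a rough sketch, a dominant period may be extracted from each candidate series with an FFT as shown below. The peak-strength test used to decide whether a spectral peak is significant is an illustrative assumption, not the specific criterion used by transformer 212.

```python
import numpy as np

def detect_period(values, min_strength=4.0):
    """Return the dominant period (in samples) if a clear spectral peak exists, else None."""
    detrended = np.asarray(values, dtype=float) - np.mean(values)
    spectrum = np.abs(np.fft.rfft(detrended))
    freqs = np.fft.rfftfreq(len(detrended))
    spectrum[0] = 0.0  # ignore the zero-frequency (mean) component
    peak = np.argmax(spectrum)
    # Illustrative significance test: the peak must stand well above the spectrum's median.
    if freqs[peak] == 0 or spectrum[peak] < min_strength * np.median(spectrum[1:]):
        return None
    return int(round(1.0 / freqs[peak]))

# Applied to the original, clipped and combined series; the longest detected period
# (e.g., weekly rather than daily) would be kept, as described above.
# candidate_periods = [detect_period(ts) for ts in candidate_series]
```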
It is noted that the window sizes utilized by time-based filter(s) 208 are purely exemplary and that any window size (e.g., monthly, yearly, etc.) may be utilized to determine seasonality for different time frames (e.g., monthly seasonality, yearly seasonality, etc.).
Accordingly, a seasonal pattern may be determined in a time series in many ways. For example,
Flowchart 300 begins with step 302. In step 302, a predetermined percentage of the highest and lowest values from a time series is removed to generate a clipped time series. For example, with reference to
In step 304, the time series is filtered in accordance with at least one window size to generate at least one filtered time series. For example, with reference to
In step 306, the clipped time series is filtered in accordance with the at least one window size to generate at least one filtered, clipped time series. For example, with reference to
In step 308, the seasonal pattern is determined based on applying a respective transform to the time series, the clipped time series, a combination of the time series and the at least one filtered time series, and a combination of the clipped time series and the at least one filtered, clipped time series. For example, with reference to
B. Model Selection for Generating Dynamic Thresholds
Once seasonal pattern 246 is detected from time series 216 using the techniques described above with reference to Subsection A, seasonal pattern 246 and time series 216 (e.g., the most recent data values collected for time series 216) may be provided to a model selector to determine the optimal model for generating dynamic thresholds for a particular metric. For example,
Model selector 402 may be configured to perform a statistical analysis based on time series 216 and/or seasonal pattern 246 to automatically determine which modeler should be utilized to determine the automatic thresholds. For instance, model selector 402 may analyze the range, mean, variance, standard deviation, spread, etc., of time series 216 and/or seasonal pattern 246. Low dispersion-based modeler 406 may be selected if the analysis indicates that time series 216 and/or seasonal pattern 246 is relatively constant and/or rarely changes. If time series 216 and/or seasonal pattern 246 is relatively variable (e.g., has a sinusoidal pattern, high variance, etc.), model selector 402 may select one of seasonal adjusted boxplot-based modeler 408 or Box-Cox transformation-based modeler 410. Model selector 402 may select Box-Cox transformation-based modeler 410 if time series 216 and/or seasonal pattern 246 has a majority of positive values (e.g., does not include values that are less than or equal to 0) and may select seasonal adjusted boxplot-based modeler 408 if time series 216 and/or seasonal pattern 246 includes values that are less than or equal to 0.
Box-Cox transformation-based modeler 410 has an inherent limitation dealing with non-positive values. Thus, non-positive values may be removed from time series 216 before applying Box-Cox transformation-based modeler 410. In some metrics, a substantial part of the data (more than 80%) is zeros. A common example of such a metric is a request count or network traffic for processes that are active only during part of the day. In such situations, Box-Cox transformation-based modeler 410 performs poorly because too few values remain after the zeros are removed. Accordingly, seasonal adjusted boxplot-based modeler 408 may be utilized in certain scenarios.
Generally, Box-Cox transformation-based modeler 410 generates more accurate thresholds than seasonal adjusted boxplot-based modeler 408, as more data points are analyzed (as will be described below with reference to Subsection B.2). Seasonal adjusted boxplot-based modeler 408 may be utilized as a fallback modeler in the event that time series 216 and/or seasonal pattern 246 includes values that are less than or equal to 0.
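The selection rules described above can be summarized in code roughly as follows. The dispersion measure (coefficient of variation) and its cutoff are illustrative assumptions; the "more than 80% zeros" figure is taken from the description above.

```python
import numpy as np

def select_modeler(values, dispersion_cutoff=0.05, zero_fraction_cutoff=0.8):
    """Return the name of the modeler suggested by the selection rules above.

    Dispersion is approximated here by the coefficient of variation; the cutoff
    values are placeholders for illustration only.
    """
    values = np.asarray(values, dtype=float)
    mean = np.mean(values)
    dispersion = np.std(values) / abs(mean) if mean != 0 else np.inf

    if dispersion < dispersion_cutoff:
        return "low_dispersion"                      # relatively constant metric
    mostly_zeros = np.mean(values == 0) > zero_fraction_cutoff
    if np.any(values <= 0) or mostly_zeros:
        return "seasonal_adjusted_boxplot"           # fallback for non-positive or sparse data
    return "box_cox"                                 # strictly positive, variable metric
```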
Additional details regarding seasonal adjusted boxplot-based modeler 408 and Box-Cox transformation-based modeler 410 are described below with reference to Subsections B.1 and B.2, respectively.
After determining which modeler to utilize, the selected modeler automatically generates the dynamic thresholds. In the event that the analysis determines that none of the modelers are applicable, then no dynamic thresholds are automatically generated. It is noted that model bank 404 may store any number of modelers and that low dispersion-based modeler 406, seasonal adjusted boxplot-based modeler 408 and Box-Cox transformation-based modeler 410 are some examples of the modelers that may be utilized to generate thresholds.
Accordingly, a modeler for generating dynamic thresholds may be selected in many ways. For example,
Flowchart 500 begins with step 502. In step 502, a determination is made as to whether the data values of the time series are relatively constant or have a relatively high variance. For example, with reference to
In step 504, the low dispersion-based modeler is selected. For example, with reference to
In step 506, one of the seasonal adjusted boxplot-based modeler or the Box-Cox transformation-based modeler is selected. For example, with reference to
In accordance with one or more embodiments, selecting one of the seasonal adjusted boxplot-based modeler or the Box-Cox transformation-based modeler comprises determining whether the data values of the time series comprise non-positive values. Based at least on determining that the data values of the time series comprise non-positive values, the seasonal adjusted boxplot-based modeler is selected. Based at least on determining that the data values of the time series do not comprise non-positive values, the Box-Cox transformation-based modeler is selected. For example, with reference to
1. Seasonal Adjusted Boxplot Modeler
Bin-based time series generator 602 may be configured to de-seasonalize seasonal pattern 246 to generate a plurality of different time series based on seasonal pattern 246. Each generated time series may be based on a particular bin (or bucket) associated with seasonal pattern 246. A bin may represent a particular time interval. For example, each bin may represent a particular hour of a given day. In this case, the number of bins would be 24 (1 bin for each hour of the day), and therefore, the number of time series generated would also be 24. In another example, each bin may represent a ten-minute interval. In a scenario in which seasonal pattern 246 comprises 7 days of data (or 10,080 minutes' worth of data), the number of bins would be 1,008, and therefore, the number of time series generated would also be 1,008. It is noted that any time interval may be utilized.
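For illustration, assigning metric samples to such bins might look like the sketch below, which keys each sample on its time of day and could additionally key on the day of week for weekly seasonality. The helper and its parameters are hypothetical.

```python
import numpy as np

def build_bin_series(values, timestamps, bin_minutes=60):
    """Group metric samples into time-of-day bins.

    With 60-minute bins this yields up to 24 per-bin series; 10-minute bins yield
    144 per day (keying additionally on the weekday gives the 1,008 weekly bins
    mentioned above). `timestamps` is assumed to hold datetime.datetime objects.
    """
    bins = {}
    for ts, value in zip(timestamps, values):
        bin_id = (ts.hour * 60 + ts.minute) // bin_minutes
        bins.setdefault(bin_id, []).append(float(value))
    return {bin_id: np.asarray(vals) for bin_id, vals in bins.items()}
```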
For a particular bin, bin-based time series generator 602 may determine each value of the metric for that bin in seasonal pattern 246. For example,
Referring again to
The most straightforward and popular method for generating thresholds is the “3-sigma” rule, according to which one estimates the sample mean and standard deviation and sets the boundaries at 3 standard deviations from the mean. However, it is well known that the sample mean and standard deviation are non-robust statistics, prone to significant errors in the presence of outliers.
To deal with this caveat, several methods based on robust statistics have been proposed. In accordance with an embodiment, threshold determiner 604 utilizes a modified version of Tukey's method (a.k.a. boxplot) to estimate the boundaries of the normal behavior of the data.
Given that P25 and P75 are the 25th and 75th percentiles of the data, the classic Tukey's test sets the higher and lower boundaries at P75+K·(P75−P25) and P25−K·(P75−P25), respectively. K in this equation is a predetermined factor, and usually K=1.5 defines boundaries for “mild outliers” and K=3 defines boundaries for “extreme outliers”.
In the modified version, since it is not very rare that the interquartile range (P75−P25) in telemetry data is 0, the inter-percentile range between the 90th and 10th percentiles (P90−P10) is used instead. Thus, the maximum and minimum thresholds may be set at P90+K̃·(P90−P10) and P10−K̃·(P90−P10), respectively.
The K̃ (≈0.55) factor may be determined so that, if the data were normally distributed, the minimum and/or maximum thresholds would be the same as those set by the classic Tukey's method with K=1.5. In addition, minimum and/or maximum thresholds for different detection sensitivity levels can be set by using a factor larger than K̃. In accordance with an embodiment, 1.5×K̃ and 2×K̃ are used for medium and extreme outliers, respectively. Furthermore, to account for possible skewness of the data, an adjusted boxplot may be used that adds a compensating factor that is a function of the medcouple.
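In code, the per-bin bounds described above might be computed as in the following sketch. The sensitivity scaling follows the 1×, 1.5× and 2× factors mentioned above, while the medcouple-based skewness adjustment is omitted for brevity.

```python
import numpy as np

def bin_thresholds(bin_values, k_tilde=0.55, sensitivity=1.0):
    """Lower/upper bounds for one bin using the modified Tukey rule described above.

    Uses the P10/P90 inter-percentile range instead of the interquartile range;
    sensitivity of 1.0, 1.5 or 2.0 corresponds to mild, medium and extreme outliers.
    The adjusted-boxplot (medcouple) skewness compensation is not included here.
    """
    p10, p90 = np.percentile(bin_values, [10, 90])
    margin = k_tilde * sensitivity * (p90 - p10)
    return p10 - margin, p90 + margin
```

Composing the per-bin bounds in time order then yields the continuous, seasonality-tracking threshold described next.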
Threshold determiner 604 provides the determined thresholds (shown as thresholds 610A-610N) to dynamic threshold generator 606. Dynamic threshold generator 606 may be configured to compose (or combine) each of thresholds 610A-610N to generate a continuous, dynamic threshold 612, which tracks the seasonality of seasonal pattern 246. For instance, with reference to
Accordingly, a seasonal adjusted boxplot-based modeler may be utilized to generate dynamic thresholds in many ways. For example,
Flowchart 900 begins with step 902. In step 902, a plurality of bins for the seasonal pattern is determined. For example, with reference to
In step 904, for each bin of the plurality of bins, data values from the time series are selectively assigned to the bin. For example, with reference to
In step 906, for each bin of the plurality of bins, a bin-based time series is generated based on the assigned data values for the bin. For example, with reference to
In step 908, for each bin of the plurality of bins, a bin-based threshold is determined based on the bin-based time series. For example, with reference to
In step 910, each of the bin-based thresholds are combined to generate the threshold. For example, with reference to
2. Box-Cox Transformation-Based Modeler
When forecasting a seasonal metric, several values of the same seasonal phase are usually required to establish a forecasting threshold, since forecasting metric values based on past behavior requires estimating the variation of the metric. Variation cannot be computed from a single value or estimated reliably from very few measurements. For some seasonality spans, such as weekly patterns, going several weeks back to build a forecasting model may be wrong in terms of adapting to changes; in such cases, the use of several seasons is simply misleading. There is therefore a need to reduce the time span used in building the forecast model.
A different approach is to use the entire data set to predict the variation and treat it as a constant variation in all the seasonal phases. This approach is also problematic since, in many cases, the variation of service metrics is correlated with the signal values. For example, the variation of traffic data is higher during business hours than during the night or weekends. As a result, after decomposing the seasonal behavior of the data, the random (or noisy) components remain. Such components might have constant variance or variance that changes over time.
By applying Box-Cox transformation-based modeler 410 to the seasonally decomposed variants of the data, all the residuals are changed to a common space in which the bandwidth of the forecast belt can be estimated. At that point, an adjusted boxplot can be applied to the residuals. A backward (or inverse) transform is then applied to go back to the original metric values, yielding a ready model with seasonally-adjusted boxplot thresholds. The foregoing advantageously enables the creation of a forecasting seasonal model using very few seasons.
Seasonal decomposer 1002 may be configured to decompose the seasonal behavior of time series 216. For instance, seasonal decomposer 1002 may remove seasonal pattern 246 from time series 216. This may be performed by subtracting seasonal pattern 246 from time series 216. The remaining portion of time series 216 represents a natural random variation (or residual data) of time series 216. For instance,
The issue with the residual data is that it has a varying degree of noise. When performing statistical analysis, it is desired to have the same level of noise all the time. Accordingly, a power transformation is performed to stabilize the noise in the residual data. For example, with reference to
where y_i and y_i^(λ) are the i-th data point in the original and transformed scale, respectively. This transformation formula is defined such that the transformation is continuous in λ as it approaches 0. The appropriate λ for a given time series is estimated by maximizing the log likelihood function of the residuals, assuming they are normally distributed.
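For illustration, assuming the power transformation referenced above is the standard Box-Cox transform (y^(λ) = (y^λ − 1)/λ for λ ≠ 0 and ln(y) for λ = 0), the stabilization step might be sketched as follows. Because residuals obtained by subtracting the seasonal pattern can be non-positive, the sketch shifts them before transforming; that shift is an implementation assumption, not something specified above.

```python
import numpy as np
from scipy import stats

def stabilize_residuals(time_series, seasonal_pattern):
    """Subtract the seasonal component and Box-Cox-transform the residuals.

    scipy.stats.boxcox fits lambda by maximizing the log-likelihood, matching the
    estimation criterion described above. The positive shift applied beforehand is
    an assumption made so the sketch runs on arbitrary residuals.
    """
    residuals = np.asarray(time_series, dtype=float) - np.asarray(seasonal_pattern, dtype=float)
    shift = 1.0 - residuals.min() if residuals.min() <= 0 else 0.0
    transformed, lam = stats.boxcox(residuals + shift)
    return transformed, lam, shift
```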
Dynamic threshold generator 1006 may be configured to generate a continuous, dynamic threshold 1012, which tracks the seasonality of seasonal pattern 246. For example, dynamic threshold generator 1006 may be configured to generate at least one of a minimum threshold and a maximum threshold for transformed residual data 1010. Because transformed residual data 1010 is relatively constant, the minimum and/or maximum thresholds may be relatively constant threshold(s). After determining the minimum and/or maximum thresholds, dynamic threshold generator 1006 may perform an inverse transformation on the minimum and maximum thresholds to reintroduce the variance (i.e., the transform performed by variance stabilizer 1004 is reversed). Thereafter, the transformed minimum and/or maximum thresholds are combined with seasonal pattern 246, thereby resulting in minimum and/or maximum thresholds (i.e., dynamic thresholds 1012) that track seasonal pattern 246.
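Continuing the sketch, constant bounds computed in the transformed space can be mapped back and recombined with the seasonal pattern as follows. The P10/P90-based bound is reused here for consistency with the boxplot modeler and is an assumption; the exact bound used by dynamic threshold generator 1006 is not reproduced above.

```python
import numpy as np
from scipy.special import inv_boxcox

def seasonal_dynamic_thresholds(seasonal_pattern, transformed, lam, shift, k=1.5):
    """Flat bounds on the stabilized residuals, inverse-transformed and re-seasonalized."""
    p10, p90 = np.percentile(transformed, [10, 90])
    lower_t = p10 - k * (p90 - p10)
    upper_t = p90 + k * (p90 - p10)
    # Undo the Box-Cox transform and the positive shift applied during stabilization.
    # (Very extreme bounds can fall outside the invertible range of the transform.)
    lower = inv_boxcox(lower_t, lam) - shift
    upper = inv_boxcox(upper_t, lam) - shift
    seasonal = np.asarray(seasonal_pattern, dtype=float)
    return seasonal + lower, seasonal + upper   # thresholds track the seasonal pattern
```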
Accordingly, a Box-Cox transformation-based modeler may be utilized to generate dynamic thresholds in many ways. For example,
Flowchart 1200 begins with step 1202. In step 1202, the seasonal pattern is decomposed from the time series to obtain residual data associated with the time series. For example, with reference to
In step 1204, a Box-Cox transform is applied to the residual data to stabilize a variance of the residual data to generate transformed residual data. For example, with reference to
In step 1206, a transformed residual data-based threshold is determined based on the transformed residual data. For example, with reference to
In step 1208, an inverse Box-Cox transform is applied to the transformed residual data-based threshold to reintroduce the variance, thereby generating a transformed threshold. For example, with reference to
In step 1210, the seasonal pattern is combined with the transformed threshold to generate the threshold. For example, with reference to
C. Method for Issuing Alerts Indicative of Anomalous Resource Usage Based on Dynamic Thresholds
Once the dynamic thresholds are determined for a particular metric, the determined dynamic thresholds may be utilized to determine whether the metric exhibits anomalous behavior (e.g., an excessive amount or abnormally low number of requests, an excessive usage of CPU cycles, memory and/or storage, etc.) as the corresponding computing resource(s) continue to operate. If the determined thresholds are exceeded, an alert indicating anomalous resource usage with respect to the computing resource(s) may be provided to a user.
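As a simple illustration of the monitoring step, the check might look like the following sketch; `notify` stands in for whatever alerting channel is configured (e-mail, SMS, a telephone call, or an automation hook), and the function names are hypothetical.

```python
def check_metric(timestamp, value, lower, upper, notify):
    """Compare an observed metric value against its dynamic thresholds and alert if exceeded.

    `lower` and `upper` are the dynamic threshold values for this timestamp's seasonal
    bin; `notify` is any callable that delivers the indication to the user.
    """
    if value < lower or value > upper:
        notify(f"Anomalous resource usage at {timestamp}: "
               f"{value:.2f} outside expected range [{lower:.2f}, {upper:.2f}]")
        return True
    return False
```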
For example,
Flowchart 1300 begins with step 1302. In step 1302, a time series of data values corresponding to a metric associated with a computing resource is obtained. For example, with reference to
In step 1304, a seasonal pattern in the time series is detected. For example, with reference to
In accordance with one or more embodiments, the seasonal pattern may be determined in accordance with system 200 of
In step 1306, a statistical analysis of the time series is performed. For example, with reference to
In step 1308, a modeler is selected from among a plurality of different modelers based on results of the statistical analysis. For example, with reference to
In accordance with one or more embodiments, the plurality of different modelers comprises at least one of a low dispersion-based modeler, a seasonal adjusted boxplot-based modeler, or a Box-Cox transformation-based modeler. For example, with reference to
In accordance with one or more embodiments, the modeler may be selected in accordance with system 400 of
In step 1310, the selected modeler is utilized to generate a threshold based on the seasonal pattern. For example, with reference to
In step 1312, the metric associated with the computing resource is monitored to determine whether the metric exceeds the threshold. For example, with reference to
In step 1314, an indication is provided based at least on determining that the metric exceeds the threshold. For example, with reference to
In accordance with one or more embodiments, providing the indication includes issuing an alert. The alert (e.g., indication 1418) may be issued to computing device 104, as shown in
In accordance with one or more embodiments, indication 1418 may trigger one or more actions to be performed with respect to resources 1410. For instance, additional resources of resources 1410 may be automatically allocated to handle an excessive amount of network requests, CPU usage, memory usage, etc. to compensate for the anomalous behavior.
D. Adjusting Dynamic Thresholds Based on Confidence Approximation
The dynamic threshold(s) described above may suffer from overfitting (i.e., the threshold(s) may conform too closely to the seasonality detected during the training phase). If the metric values being monitored conform to this threshold, then no issue arises. However, in many cases, the monitored metric values tend to differ from the metric values collected during the training phase. As such, this may cause false alerts to be issued. This issue becomes more apparent the more complex the detected seasonality.
In accordance with an embodiment, the dynamic threshold(s) may be adjusted based on a confidence level of the seasonality pattern detected (i.e., how confident that the seasonality pattern is accurate). Generally, there is a gap between the baseline training error and the baseline-applied period error. The baseline training error may be represented by the interpercentile range (IPR) calculated for the metric values obtained during the training phase. The IPR may be based on the residuals of the medians of the metric values collected during the training phase. The IPR is the bandwidth around the medians that are observed during the training phase. The dynamic threshold(s) described above are based on this IPR. The baseline-applied period error is represented by the IPR estimated for metric values to be received after the training phase on which the dynamic threshold(s) are applied. The gap may be referred to as the optimism (i.e., the optimism about the performance of the dynamic threshold(s)).
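For illustration, the baseline training error might be computed roughly as below, using the per-bin series from Subsection B.1. Basing the IPR on the 10th and 90th percentiles of the residuals around the per-bin medians is an assumption made for consistency with the rest of the description.

```python
import numpy as np

def training_ipr(bin_series):
    """Baseline training error: inter-percentile range of residuals around per-bin medians.

    `bin_series` maps each seasonal bin to the array of training values assigned to it.
    """
    residuals = np.concatenate(
        [values - np.median(values) for values in bin_series.values()])
    p10, p90 = np.percentile(residuals, [10, 90])
    return p90 - p10
```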
Intuitively, the more confident the algorithm is in the baseline's accuracy, the tighter the dynamic threshold(s) that are generated. Formally, the thresholds are a function of the baseline and a distance between it and the training data. Seasonal patterns are a predictable change in a metric baseline in a set interval, such as hourly, daily or weekly patterns. As described above, detecting seasonal patterns reduces false positives and reduces the number of rules needed to capture metric behavior and its implication on the health of a resource. The higher the order of the detected seasonal pattern, the greater the model complexity. The influence of the number of samples (available metric history) and model complexity on the gap between the training error and the applied period error (also referred to as the test error) is visualized in
To adjust the dynamic threshold(s), an IPR is estimated for data received after the training phase. The IPR may be determined in accordance with Equations 2-4, which are described below:
where IPRtest represents the IPR estimated for data received after the training phase, IPRtrain represents the IPR estimated for data received during the training phase, period represents the detected seasonality (e.g., weekly, daily, hourly, etc.), and N represents the number of non-null (e.g., non-zero) samples of the observed metric used to determine the seasonality.
Since the IPR measurement is more robust (less likely to overfit), the constant value ‘2’ is replaced with a smaller value. In accordance with an embodiment, a constant value of ‘1.75’ is used, as shown below in Equation 5:
The value in the parenthetical of Equation 5 may be referred to as the IPR factor. Accordingly, to determine IPRtest, IPRtrain is multiplied by the IPR factor. The adjusted dynamic threshold(s) are based on IPRtest. In particular, the dynamic threshold(s) are either tightened or relaxed based on IPRtest. A relatively smaller value for IPRtest is indicative of a greater confidence in the detected seasonal pattern, and a relatively larger value for IPRtest is indicative of a lesser confidence in the detected seasonal pattern. In other words, the confidence is represented by the ratio between how many samples are utilized and the complexity of the seasonality. If the seasonality is relatively complex, but the number of samples is relatively low, the confidence will be lower. If the seasonality is relatively simple, and the number of samples is relatively high, the confidence will be higher. The higher the confidence, the tighter the dynamic threshold(s) will be adjusted. The lower the confidence, the looser the dynamic threshold(s) will be adjusted.
Consider the following example, in which 10 days of training data are utilized and are sampled every 5 minutes. In accordance with Equation 5, a non-seasonal pattern (i.e., period=1) would result in an IPR factor of approximately 1, an hourly seasonal pattern would result in an IPR factor of approximately 1, a daily seasonal pattern would result in an IPR factor of approximately 1.175, a first weekly seasonal pattern (e.g., 6 weeks) would result in an IPR factor of approximately 1.291, and a second weekly seasonal pattern (e.g., 3 weeks) would result in an IPR factor of approximately 1.583. Accordingly, the more complex the seasonal pattern, the greater the value of the IPR factor. The greater the value of the IPR factor, the more the dynamic threshold(s) are adjusted to be more relaxed.
The resulting IPR factor is multiplied by IPRtrain to determine IPRtest. The dynamic threshold(s) are then determined based on the determined IPRtest.
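Although Equations 2-5 are not reproduced here, the worked example above is consistent with an IPR factor of the form 1 + 1.75·period/N. The following sketch uses that form as an assumption.

```python
def estimate_test_ipr(ipr_train, period_samples, n_samples, constant=1.75):
    """Estimate IPRtest from IPRtrain, the detected period and the non-null sample count.

    Assumes Equation 5 takes the form IPRtest = IPRtrain * (1 + constant * period / N).
    With 10 days of 5-minute samples (N = 2880) and a daily period of 288 samples this
    gives a factor of 1 + 1.75 * 288 / 2880 = 1.175, matching the example above.
    """
    ipr_factor = 1.0 + constant * period_samples / n_samples
    return ipr_train * ipr_factor
```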
It is noted that while the embodiments described herein disclose techniques for adjusting dynamic thresholds, the embodiments described herein are not so limited and such techniques may be utilized to adjust other entities.
Accordingly, a dynamic threshold may be adjusted in many ways. For example,
Flowchart 1600 begins with step 1602. In step 1602, a dynamic threshold is generated based on a seasonal pattern detected in a time series of data values corresponding to a metric associated with a computing resource. For example, with reference to
In accordance with one or more embodiments, the dynamic threshold is generated during a training phase in which the seasonal pattern is detected in the time series of data values.
In step 1604, the generated dynamic threshold is adjusted based on a confidence level of the seasonal pattern. For example, with reference to
In accordance with one or more embodiments, the confidence level is based on a number of data values in the time series and a period associated with the detected seasonal pattern.
In step 1606, the metric associated with the computing resource is monitored to determine whether the metric exceeds the adjusted dynamic threshold. For example, with reference to
In step 1608, an indication is provided based at least on determining that the metric exceeds the adjusted dynamic threshold. For example, with reference to
In accordance with one or more embodiments, providing the indication includes issuing an alert. The alert (e.g., indication 1418) may be issued to computing device 104, as shown in
In accordance with one or more embodiments, the indication causes an automatic allocation of additional computing resources.
In accordance with one or more embodiments, the adjustment of the generated dynamic threshold is based on statistical features associated with the time series of data values received during the training phase. For example,
Flowchart 1800 begins with step 1802. In step 1802, a first statistical feature associated with the time series of data values received during a training phase in which the seasonal pattern is detected is determined, the generated dynamic threshold being determined based on the first statistical feature. For example, with reference to
In step 1804, a second statistical feature for a subsequent time series of data values to be received after the training phase completes is estimated, the second statistical feature being based on the first statistical feature and the confidence level, the adjusted dynamic threshold being determined based on the second statistical feature. For example, with reference to
In accordance with one or more embodiments, the confidence level is based on one or more of a number of data values in the time series or a period associated with the detected seasonal pattern. For example, with reference to
In accordance with one or more embodiments, the first statistical feature is a first interpercentile range associated with the time series of data values received during the training phase, and the second statistical feature is a second interpercentile range that is estimated for a subsequent time series of data values to be received after the training phase. For example, with reference to
In accordance with one or more embodiments, the generated dynamic threshold is adjusted a first amount based on the confidence level being relatively high, and wherein the generated dynamic threshold is adjusted a second amount based on the confidence level being relatively low, wherein the first amount is greater than the second amount. For example, with reference to
As shown in
System 2000 also has one or more of the following drives: a hard disk drive 2014 for reading from and writing to a hard disk, a magnetic disk drive 2016 for reading from or writing to a removable magnetic disk 2018, and an optical disk drive 2020 for reading from or writing to a removable optical disk 2022 such as a CD ROM, DVD ROM, BLU-RAY™ disk or other optical media. Hard disk drive 2014, magnetic disk drive 2016, and optical disk drive 2020 are connected to bus 2006 by a hard disk drive interface 2024, a magnetic disk drive interface 2026, and an optical drive interface 2028, respectively. The drives and their associated computer-readable media provide nonvolatile storage of computer-readable instructions, data structures, program modules and other data for the computer. Although a hard disk, a removable magnetic disk and a removable optical disk are described, other types of computer-readable memory devices and storage structures can be used to store data, such as flash memory cards, digital video disks, random access memories (RAMs), read only memories (ROM), and the like.
A number of program modules may be stored on the hard disk, magnetic disk, optical disk, ROM, or RAM. These program modules include an operating system 2030, one or more application programs 2032, other program modules 2034, and program data 2036. In accordance with various embodiments, the program modules may include computer program logic that is executable by processing unit 2002 to perform any or all of the functions and features of nodes 108A-108N, 112A-112N, and/or 114A-114N, storage node(s) 110, computing device 104, and dynamic threshold-based alert engine 118 of
A user may enter commands and information into system 2000 through input devices such as a keyboard 2038 and a pointing device 2040 (e.g., a mouse). Other input devices (not shown) may include a microphone, joystick, game controller, scanner, or the like. In one embodiment, a touch screen is provided in conjunction with a display 2044 to allow a user to provide user input via the application of a touch (as by a finger or stylus for example) to one or more points on the touch screen. These and other input devices are often connected to processing unit 2002 through a serial port interface 2042 that is coupled to bus 2006, but may be connected by other interfaces, such as a parallel port, game port, or a universal serial bus (USB). Such interfaces may be wired or wireless interfaces.
Display 2044 is connected to bus 2006 via an interface, such as a video adapter 2046. In addition to display 2044, system 2000 may include other peripheral output devices (not shown) such as speakers and printers.
System 2000 is connected to a network 2048 (e.g., a local area network or wide area network such as the Internet) through a network interface 2050, a modem 2052, or other suitable means for establishing communications over the network. Modem 2052, which may be internal or external, is connected to bus 2006 via serial port interface 2042.
As used herein, the terms “computer program medium,” “computer-readable medium,” and “computer-readable storage medium” are used to generally refer to memory devices or storage structures such as the hard disk associated with hard disk drive 2014, removable magnetic disk 2018, removable optical disk 2022, as well as other memory devices or storage structures such as flash memory cards, digital video disks, random access memories (RAMs), read only memories (ROM), and the like. Such computer-readable storage media are distinguished from and non-overlapping with communication media (do not include communication media or modulated data signals). Communication media typically embodies computer-readable instructions, data structures, program modules or other data in a modulated data signal such as a carrier wave. The term “modulated data signal” means a signal that has one or more of its characteristics set or changed in such a manner as to encode information in the signal. By way of example, and not limitation, communication media includes wireless media such as acoustic, RF, infrared and other wireless media. Embodiments are also directed to such communication media.
As noted above, computer programs and modules (including application programs 2032 and other program modules 2034) may be stored on the hard disk, magnetic disk, optical disk, ROM, or RAM. Such computer programs may also be received via network interface 2050, serial port interface 2042, or any other interface type. Such computer programs, when executed or loaded by an application, enable system 2000 to implement features of embodiments discussed herein. Accordingly, such computer programs represent controllers of the system 2000. Embodiments are also directed to computer program products comprising software stored on any computer useable medium. Such software, when executed in one or more data processing devices, causes a data processing device(s) to operate as described herein. Embodiments may employ any computer-useable or computer-readable medium, known now or in the future. Examples of computer-readable mediums include, but are not limited to memory devices and storage structures such as RAM, hard drives, floppy disks, CD ROMs, DVD ROMs, zip disks, tapes, magnetic storage devices, optical storage devices, MEMs, nanotechnology-based storage devices, and the like.
In alternative implementations, system 2000 may be implemented as hardware logic/electrical circuitry or firmware. In accordance with further embodiments, one or more of these components may be implemented in a system-on-chip (SoC). The SoC may include an integrated circuit chip that includes one or more of a processor (e.g., a microcontroller, microprocessor, digital signal processor (DSP), etc.), memory, one or more communication interfaces, and/or further circuits and/or embedded firmware to perform its functions.
A method is described herein. The method includes: generating a dynamic threshold based on a seasonal pattern detected in a time series of data values corresponding to a metric associated with a computing resource; adjusting the generated dynamic threshold based on a confidence level of the detected seasonal pattern; monitoring the metric associated with the computing resource to determine whether the metric exceeds the adjusted dynamic threshold; and providing an indication based at least on determining that the metric exceeds the adjusted dynamic threshold.
In one implementation of the foregoing method, the confidence level is based on one or more of a number of data values in the time series or a period associated with the detected seasonal pattern.
In one implementation of the foregoing method, the generated dynamic threshold is adjusted a first amount based on the confidence level being relatively high, and the generated dynamic threshold is adjusted a second amount based on the confidence level being relatively low, wherein the first amount is greater than the second amount.
In one implementation of the foregoing method, said adjusting comprises: determining a first statistical feature associated with the time series of data values received during a training phase in which the seasonal pattern is detected, the generated dynamic threshold being determined based on the first statistical feature; and estimating a second statistical feature for a subsequent time series of data values to be received after the training phase completes, the second statistical feature being determined based on the first statistical feature and the confidence level, the adjusted dynamic threshold being based on the second statistical feature.
In one implementation of the foregoing method, the first statistical feature is a first interpercentile range associated with the time series of data values received during the training phase, and the second statistical feature is a second interpercentile range that is estimated for a subsequent time series of data values received after the training phase.
In one implementation of the foregoing method, said providing the indication includes issuing an alert.
In one implementation of the foregoing method, the indication comprises at least one of: an e-mail message; a telephone call; or a short messaging service message.
In one implementation of the foregoing method, the indication causes an automatic allocation of additional computing resources.
A system in accordance with any of the embodiments described herein is also disclosed. The system includes: at least one processor circuit; and at least one memory that stores program code configured to be executed by the at least one processor circuit, the program code comprising: a modeler configured to generate a dynamic threshold based on a seasonal pattern detected in a time series of data values corresponding to a metric associated with a computing resource; a dynamic threshold adjuster configured to adjust the generated dynamic threshold based on a confidence level of the detected seasonal pattern; and a monitor configured to: monitor the metric associated with the computing resource to determine whether the metric exceeds the adjusted dynamic threshold; and provide an indication based at least on determining that the metric exceeds the adjusted dynamic threshold.
In one implementation of the foregoing system, the confidence level is based on one or more of a number of data values in the time series or a period associated with the detected seasonal pattern.
In one implementation of the foregoing system, the dynamic threshold adjuster is configured to adjust the generated dynamic threshold a first amount based on the confidence level being relatively high, and the dynamic threshold adjuster is configured to adjust the generated dynamic threshold a second amount based on the confidence level being relatively low, wherein the first amount is greater than the second amount.
In one implementation of the foregoing system, the modeler is configured to determine a first statistical feature associated with the time series of data values received during a training phase in which the seasonal pattern is detected, the generated dynamic threshold being determined based on the first statistical feature; and the dynamic threshold adjuster comprises a statistical feature determiner that is configured to estimate a second statistical feature for a subsequent time series of data values to be received after the training phase completes, the second statistical feature being based on the first statistical feature and the confidence level, the adjusted dynamic threshold being determined based on the second statistical feature.
In one implementation of the foregoing system, the first statistical feature is a first interpercentile range associated with the time series of data values received during the training phase, and the second statistical feature is a second interpercentile range that is estimated for a subsequent time series of data values received after the training phase.
In one implementation of the foregoing system, the monitor is configured to provide the indication by issuing an alert.
In one implementation of the foregoing system, the indication comprises at least one of: an e-mail message; a telephone call; or a short messaging service message.
In one implementation of the foregoing system, the indication causes an automatic allocation of additional computing resources.
A computer-readable storage medium having program instructions recorded thereon that, when executed by at least one processor, perform a method is also described herein. The method includes: generating a dynamic threshold based on a seasonal pattern detected in a time series of data values corresponding to a metric associated with a computing resource; adjusting the generated dynamic threshold based on a confidence level of the detected seasonal pattern; monitoring the metric associated with the computing resource to determine whether the metric exceeds the adjusted dynamic threshold; and providing an indication based at least on determining that the metric exceeds the adjusted dynamic threshold.
In one implementation of the foregoing computer-readable storage medium, the confidence level is based on one or more of a number of data values in the time series or a period associated with the detected seasonal pattern.
In one implementation of the foregoing computer-readable storage medium, the generated dynamic threshold is adjusted a first amount based on the confidence level being relatively high, and the generated dynamic threshold is adjusted a second amount based on the confidence level being relatively low, wherein the first amount is greater than the second amount.
In one implementation of the foregoing computer-readable storage medium, said adjusting comprises: determining a first statistical feature associated with the time series of data values received during a training phase in which the seasonal pattern is detected, the generated dynamic threshold being determined based on the first statistical feature; and estimating a second statistical feature for a subsequent time series of data values to be received after the training phase completes, the second statistical feature being based on the first statistical feature and the confidence level, the adjusted dynamic threshold being determined based on the second statistical feature.
While various example embodiments have been described above, it should be understood that they have been presented by way of example only, and not limitation. It will be understood by those skilled in the relevant art(s) that various changes in form and details may be made therein without departing from the spirit and scope of the embodiments as defined in the appended claims. Accordingly, the breadth and scope of the disclosure should not be limited by any of the above-described example embodiments, but should be defined only in accordance with the following claims and their equivalents.
This application claims priority to U.S. Provisional Patent Application No. 62/878,997, filed Jul. 26, 2019, entitled “Confidence Approximation-based Dynamic Thresholds for Anomalous Computing Resource Usage Detection,” the entirety of which is incorporated by reference herein.