Large scale and cloud datacenters are becoming increasingly popular, as they offer computing resources for multiple tenants at a very low cost on an attractive pay-as-you-go model. Many small and medium businesses are turning to these cloud datacenters, not only for occasional large computational tasks, but also for their IT jobs. This helps them eliminate the expensive, and often very complex, task of building and maintaining their own infrastructure. To fully realize the benefits of resource sharing, these cloud datacenters must scale to huge sizes. The larger the number of tenants, and the larger the number of virtual machines and physical servers, the better the chances for higher resource efficiencies and cost savings. Increasing the scale alone, however, cannot fully minimize the total cost as a great deal of expensive human effort is required to configure the equipment, to operate it optimally, and to provide ongoing management and maintenance. A good fraction of these costs reflect the complexity of managing system behavior, including anomalous system behavior that may arise in the course of system operations.
The online detection of anomalous system behavior caused by operator errors, hardware/software failures, resource over-/under-provisioning, and similar causes is a vital element of system operations in these large scale and cloud datacenters. Given their ever-increasing scale, coupled with the increasing complexity of software, applications, and workload patterns, anomaly detection techniques in large scale and cloud datacenters must be scalable to the large amount of monitoring data (i.e., metrics) and the large number of components. For example, if 10 million cores are used in a large scale or cloud datacenter with 10 virtual machines per node, the total amount of metric data generated can reach the exascale range (on the order of 10^18). These metrics may include Central Processing Unit ("CPU") cycles, memory usage, bandwidth usage, and any other suitable metrics.
The anomaly detection techniques currently used in industry are often ad hoc or specific to certain applications, and they may require extensive tuning for sensitivity and/or to avoid high rates of false alarms. An issue with threshold-based methods, for instance, is that they detect anomalies only after they occur instead of noticing their impending arrival. Further, potentially high false alarm rates can result from monitoring only individual metrics rather than combinations of metrics. Other recently developed techniques can be unresponsive due to their use of complex statistical techniques and/or may suffer from a relative lack of scalability because they mine immense amounts of non-aggregated metric data. In addition, their analyses often require prior knowledge about applications, service implementation, or request semantics.
The present application may be more fully appreciated in connection with the following detailed description taken in conjunction with the accompanying drawings, in which like reference characters refer to like parts throughout.
Anomaly detection techniques for large scale and cloud datacenters are disclosed. The anomaly detection techniques are able to analyze multiple metrics at different levels of abstraction (i.e., hardware, software, system, middleware, or applications) without prior knowledge of workload behavior and datacenter topology. The metrics may include Central Processing Unit (“CPU”) cycles, memory usage, bandwidth usage, operating system (“OS”) metrics, application metrics, platform metrics, service metrics and any other suitable metric.
The datacenter may be organized horizontally in terms of components that include cores, sockets, node enclosures, racks, and containers. Further, each physical core may have a plurality of software applications organized vertically in terms of a software stack that includes components such as applications, virtual machines ("VMs"), OSs, and hypervisors or virtual machine monitors ("VMMs"). Each one of these components may generate an enormous amount of metric data regarding its performance. These components are also dynamic, as they can become active or inactive on an ad hoc basis depending upon user needs. For example, heterogeneous applications such as map-reduce, social networking, e-commerce solutions, multi-tier web applications, and video streaming may all be executed on an ad hoc basis and have vastly different workload and request patterns. The online management of VMs and power adds to this dynamism.
In one embodiment, anomaly detection is performed with a parametric Gini-coefficient based technique. As generally described herein, a Gini coefficient is a measure of statistical dispersion or inequality of a distribution. Each node (physical or virtual) in the datacenter runs a Gini-based anomaly detector that takes raw monitoring data (e.g., OS, application, and platform metrics) and transforms the data into a series of Gini coefficients. Anomaly detection is then applied on the series of Gini coefficients. Gini coefficients from multiple nodes may be aggregated together in a hierarchical manner to detect anomalies on the aggregated data.
In another embodiment, anomaly detection is performed with a non-parametric Tukey based technique that determines outliers in a set of data. Data is divided into ranges and thresholds are constructed to flag anomalous data. The thresholds may be adjusted by a user depending on the metric being monitored. This Tukey based technique is lightweight and improves over standard Gaussian assumptions in terms of performance while exhibiting good accuracy and low false alarm rates.
It is appreciated that, in the following description, numerous specific details are set forth to provide a thorough understanding of the embodiments. However, it is appreciated that the embodiments may be practiced without limitation to these specific details. In other instances, well known methods and structures may not be described in detail to avoid unnecessarily obscuring the description of the embodiments. Also, the embodiments may be used in combination with each other.
An example core for use with datacenter 100 and cloud 200 is shown in the accompanying drawings.
The sheer magnitude of a cloud datacenter (e.g., cloud datacenter 200) requires that anomaly detection techniques handle multiple metrics at the different levels of abstraction (i.e., hardware, software, system, middleware, or applications) present at the datacenter. Furthermore, anomaly detection techniques for a large scale and cloud datacenter also need to accommodate workload characteristics and patterns, including day-of-the-week and hour-of-the-day patterns of workload behavior. The anomaly detection techniques also need to be aware of and address the dynamic nature of datacenter systems and applications, including dealing with application arrivals and departures, changes in workload, and system-level load balancing through, say, virtual machine migration. In addition, the anomaly detection techniques must exhibit good accuracy and low false alarm rates for meaningful results.
As appreciated by one of skill in the art, the statistical-based anomaly detection framework 400 may be implemented in a distributed manner in the datacenter, such that each node (physical or virtual) may run an anomaly detection module 410. The anomaly detection from multiple nodes may be aggregated together in a hierarchical manner to detect anomalies on the aggregated data.
The normalization module 505 receives metrics from a metrics collection module (e.g., metrics collection module 405) and normalizes the samples of each metric collected within a look-back window.
It is appreciated that the normalization module 505, the binning module 510, the Gini coefficient module 515, and the threshold module 520 are implemented to process data for a single computational node in a large scale and cloud datacenter. Detecting anomalies in the entire datacenter requires the data from multiple nodes to be evaluated. That is, the anomaly detection needs to be aggregated along the hierarchy in the datacenter (e.g., the hierarchy of cores, node enclosures, racks, and containers described above).
The anomaly detection aggregation is implemented in the aggregation module 525. In various embodiments, the aggregation may be performed in different ways, such as, for example, in a bin-based aggregation 530, a Gini-based aggregation 535, or a threshold-based aggregation 540. In the bin-based aggregation 530, the aggregation module 525 combines the information from the binning module 510 running in each node. In the Gini-based aggregation 535, the aggregation module 525 combines the Gini coefficients from the multiple nodes. And in the threshold-based aggregation 540, the aggregation module 525 combines the results for the threshold comparisons performed in the multiple nodes.
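By way of illustration only, the following Python sketch shows one possible realization of these three aggregation modes; the function names and data layouts are hypothetical and are not part of the framework described herein.

from typing import Callable, List

def bin_based_aggregation(per_node_bins: List[List[int]]) -> List[int]:
    # Combine the bin-index vectors reported by each node into a single
    # higher-dimensional m-event by concatenation.
    aggregated = []
    for bins in per_node_bins:
        aggregated.extend(bins)
    return aggregated

def gini_based_aggregation(per_node_ginis: List[float],
                           compute_gini: Callable[[List[float]], float]) -> float:
    # Combine per-node Gini coefficients; in this simplified sketch the
    # per-node values are treated directly as a series over which a new
    # Gini coefficient is computed (the full scheme, described later,
    # first bins them into an m-event vector).
    return compute_gini(per_node_ginis)

def threshold_based_aggregation(per_node_alarms: List[bool]) -> bool:
    # Combine per-node threshold results: report an anomaly if any node
    # has raised an alarm.
    return any(per_node_alarms)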
The anomaly alarm module 545 generates an alarm when the Gini coefficient for the given look-back window exceeds the threshold. The alarm and the detected anomalies may be indicated to a user in the dashboard module (e.g., dashboard module 415).
The operation of the anomaly detection module 500 is illustrated in more detail in a flow chart in the accompanying drawings. The collected metrics are first normalized in the normalization module 505 as:

vi′=(vi−μi)/σi  (Eq. 1)

where μ is the mean and σ is the standard deviation of the collected metrics within the look-back window and i represents the metric type.
After normalization, data binning is performed (605) in the binning module 510 by hashing each normalized sample value into a bin. A value range [0,r] is predefined and split into m equal-sized bins indexed from 0 to m−1. Another bin indexed m is defined to capture values that are outside the value range (i.e., greater than r). Each of the normalized values is put into the m bin if its value is greater than r, or into a bin with index given by the floor of the sample value divided by (r/m) otherwise, that is:

Bi=m if vi′>r, and Bi=⌊vi′/(r/m)⌋ otherwise  (Eq. 2)

where Bi is the bin index for the normalized sample value vi′. Both m and r are pre-determined statistically and can be configurable parameters.
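A minimal Python sketch of these normalization and binning steps follows; the function names, the handling of normalized values below the value range, and the default values of r and m are illustrative assumptions.

import math

def normalize(samples):
    # Z-score normalization of one metric's samples within a look-back
    # window (Eq. 1): v' = (v - mean) / standard deviation.
    n = len(samples)
    mean = sum(samples) / n
    std = (sum((v - mean) ** 2 for v in samples) / n) ** 0.5
    if std == 0:
        std = 1.0  # assumption: avoid division by zero for constant metrics
    return [(v - mean) / std for v in samples]

def bin_index(v_norm, r=3.0, m=10):
    # Hash a normalized sample into one of m equal-sized bins over the
    # value range [0, r] (Eq. 2); values above r go to the extra bin m.
    if v_norm > r:
        return m
    if v_norm < 0:
        return 0  # assumption: values below the range are clamped to bin 0
    return min(m - 1, int(math.floor(v_norm / (r / m))))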
It is appreciated that if the node for which the metrics were collected, normalized, and binned is not a root node (610), that is, it is a leaf in the datacenter hierarchy tree, its bin indices may be sent up the hierarchy to the aggregation module 525 so that anomaly detection can also be performed on the aggregated data.
Once the samples of the collected metrics within the look-back window are pre-processed and transformed into a series of bin index numbers, an m-event is generated that combines the transformed values from multiple metric types into a single vector for each time instance. More specifically, an m-event Et of a single machine at time t can be formulated with the following vector description:
Et=⟨Bt1, Bt2, . . . , Btk⟩  (Eq. 3)

where Btj is the bin index number for the jth metric at time t for a total of k metrics. Two m-events Ea and Eb have the same vector value if they are created on the same machine and Baj=Bbj, ∀j∈[1,k]. It is appreciated that each node in the datacenter may send its m-event with bin indices to the aggregation module 525 for bin-based aggregation 530. The aggregation module 525 combines the bin indices to form higher dimensional m-events and calculates the Gini coefficient and threshold based on those m-events.
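As a brief illustration of Eq. 3, assuming a bin_index function like the sketch above, per-metric bin-index series can be combined into one m-event per time instance as follows:

def build_m_events(binned_metrics):
    # binned_metrics is a list of k series of bin indices, one series per
    # metric type, each of length n (the look-back window size). The result
    # is a list of n m-events, each a tuple (Bt1, Bt2, ..., Btk).
    return [tuple(bins_at_t) for bins_at_t in zip(*binned_metrics)]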
The calculation of a Gini coefficient starts by defining a random variable E as an observation of m-events within a look-back window with a size of, say, n samples. The outcomes of this random variable E are v m-event vector values {e1, e2, . . . , ev}, where v<n when there are m-events with the same value in the n samples. For each of these v values, a count of the number of occurrences of that ei in the n samples is kept. This count is designated ni and represents the number of m-events having the vector value ei.
A Gini coefficient G for the look-back window is then calculated (625) as follows:
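The equation itself (Eq. 4) is not reproduced above. One dispersion measure consistent with the definitions of n, v, and ni, shown here purely as an assumption rather than as the original formula, is the Gini impurity of the empirical m-event distribution, G(E)=1−Σi=1..v(ni/n)². A corresponding Python sketch:

from collections import Counter

def gini_coefficient(m_events):
    # Gini coefficient of the m-event distribution within a look-back
    # window. Assumption: the Gini impurity form 1 - sum((n_i/n)^2) is
    # used here because the original Eq. 4 is not reproduced in this text.
    n = len(m_events)
    counts = Counter(m_events)  # n_i for each distinct m-event value e_i
    return 1.0 - sum((c / n) ** 2 for c in counts.values())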
It is appreciated that each node in the datacenter may send its Gini coefficient to the aggregation module 525 for Gini-based aggregation 535. The aggregation module 525 then creates an m-event vector with k elements. Element i of this vector is the bin index number associated with the Gini coefficient value for the ith node. An aggregated Gini coefficient is then computed as the Gini coefficient of this m-event vector within the look-back window. Anomaly detection can then be checked for this aggregated value.
To detect anomalies within the look-back window, the Gini coefficient above needs to be compared to a threshold. In one embodiment, the threshold T is a Gini standard deviation dependent threshold and can be calculated (630) as follows:
where μG is the average Gini coefficient value over all sliding look-back windows, calculated asymptotically from the look-back window using the statistical Cramér Delta method, and σG is the estimated standard deviation of the Gini coefficient, obtained by also applying the Delta method. The Delta method uses a Taylor series approximation of the Gini coefficient and obtains approximations to the standard deviations of intractable functions such as the Gini coefficient function in Eq. 4.
It is appreciated that this threshold computation, by using the estimated standard deviation σG, delivers an estimate of the variability of the Gini coefficient. It is this variability that allows anomalies to be detected. If the Gini coefficient G(E) exceeds this threshold value T (either G(E)>T or G(E)<−T), then an anomaly alarm is raised (635) and the user or operator monitoring the datacenter is notified (such as, for example, by displaying the alarm and the detected anomaly in the dashboard module 415).
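One plausible reading of this check, sketched below in Python under stated assumptions (the Delta-method estimates of μG and σG are replaced by simple empirical estimates over past windows, and the threshold T is taken to be a tunable multiple of σG), is:

def gini_anomaly_alarm(gini_history, current_gini, c=3.0):
    # Raise an alarm when the current window's Gini coefficient deviates
    # from its running mean by more than the threshold T = c * sigma_G.
    # The empirical mean and standard deviation over past windows stand in
    # for the Delta-method estimates of mu_G and sigma_G; the multiplier c
    # is an assumed tuning parameter.
    if len(gini_history) < 2:
        return False
    mu_g = sum(gini_history) / len(gini_history)
    sigma_g = (sum((g - mu_g) ** 2 for g in gini_history) / len(gini_history)) ** 0.5
    threshold = c * sigma_g
    deviation = current_gini - mu_g
    return deviation > threshold or deviation < -threshold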
It is appreciated that a threshold-based aggregation 540 may also be implemented to aggregate anomaly detection for multiple nodes. In this case, anomalies are detected if any one of the nodes has an anomaly alarm.
It is further appreciated that the above parametric-based anomaly detection technique using the Gini coefficient and a Gini standard deviation dependent threshold is computationally lightweight. In addition, the Gini standard deviation threshold enables an entirely new automated approach to anomaly detection that can be systematically applied to multiple metrics across multiple nodes in large scale and cloud datacenters. The anomaly detection can be applied numerous times to metrics collected within sliding look-back windows.
The non-parametric anomaly detection module 700 is implemented with a data quartile module 705, a Tukey thresholds module 710, and an anomaly alarm module 715. The data quartile module 705 divides the collected metrics into quartiles for analysis. The Tukey thresholds module 710 defines Tukey thresholds for comparison with the quartile data. The comparisons are performed in the anomaly alarm module 715.
The operation of the anomaly detection module 700 is illustrated in more detail in a flow chart in the accompanying drawings. The collected metrics within the look-back window are first divided into quartiles in the data quartile module 705, with Q1 denoting the first quartile and Q3 the third quartile.
Next, two Tukey thresholds are defined, a lower threshold T1 and an upper threshold Tn:
T1=Q1−k|Q3−Q1|  (Eq. 6)

Tn=Q3+k|Q3−Q1|  (Eq. 7)
where k is an adjustable tuning parameter that controls the size of the lower and upper thresholds. It is appreciated that k can be metric-dependent and adjusted by a user based on the distribution of the metric. A typical range for k may be from 1.5 to 4.5.
The data in the quartiles is compared to the lower and upper Tukey thresholds (810) so that any data outside the threshold range (815) triggers an anomaly detection alarm. Given a sample x of a given metric in the look-back window, an anomaly is detected (on the upper end of the data range) when:
x≧Q3+(k/2)|Q3−Q1|  (Eq. 8)
or (on the lower end of the data range) when:
x≦Q1−(k/2)|Q3−Q1|  (Eq. 9)
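A minimal Python sketch of the Tukey-based check, assuming linear-interpolation quartiles from Python's statistics module and flagging samples outside the thresholds of Eqs. 6 and 7, follows:

import statistics

def tukey_anomalies(samples, k=1.5):
    # Q1 and Q3 are the first and third quartiles of the samples in the
    # look-back window; k is the metric-dependent tuning parameter
    # (typically 1.5 to 4.5). Samples outside [T1, Tn] are flagged.
    q1, _, q3 = statistics.quantiles(samples, n=4)
    spread = abs(q3 - q1)
    lower = q1 - k * spread   # T1 (Eq. 6)
    upper = q3 + k * spread   # Tn (Eq. 7)
    return [x for x in samples if x < lower or x > upper]

With k=1.5 this reduces to the classical Tukey fences used for box-plot outlier detection.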
It is appreciated that this non-parametric anomaly detection approach based on the Tukey technique is also computationally lightweight. The Tukey thresholds may be metric-dependent and computed a priori, thus improving the performance and efficiency of automated anomaly detection in large scale and cloud datacenters. Both the parametric (i.e., Gini-based) and the non-parametric (i.e., Tukey-based) anomaly detection approaches discussed herein provide good responsiveness, are applicable across multiple metrics, and have good scalability properties.
It is appreciated that the previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present disclosure. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the disclosure. Thus, the present disclosure is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.