This disclosure is directed to computational systems and methods for analyzing performance of virtual machines.
In recent years, virtual machines (“VMs”) have become increasingly used in datacenter operations and in other large-scale computing environments. A VM is a software-implemented abstraction of a physical machine, such as a computer, which is presented to the application layer of the system. A VM may be based on a specification of a hypothetical computer and may be designed to recreate a computer architecture and function of a physical computer. In datacenters, VMs are often used in server consolidation. For example, a typical non-virtualized application server may achieve between 5% and 10% utilization, but a virtualized application server that hosts multiple VMs can achieve between 50% and 80% utilization. As a result, virtual clusters composed of multiple VMs can be hosted on fewer servers, translating into lower costs for hardware acquisition, maintenance, energy consumption and cooling system usage. The VMs in a virtual cluster may be interconnected logically by a virtual network across several physical networks.
In order to monitor the performance of VMs, datacenters generate streams of telemetry data. Each stream is composed of metrics that represent different aspects of the behavior of an application, a VM, or a physical machine. For example, virtual machine monitors can be used to produce a stream of telemetry data composed of hundreds of real and synthesized metrics associated with a VM. The telemetry streams may be sampled at very high rates. As a result, the telemetry datasets can be very large, containing hundreds of metrics for each VM resulting in aggregate data volumes that scale with the number of VMs monitored. The telemetry data size and high sample rates strain efforts to store, process, and analyze the telemetry data stream.
This disclosure is directed to systems and methods for mining streams of telemetry data in order to identify virtual machines (“VMs”), discover relationships between groups of VMs, and evaluate VM performance problems. The systems and methods transform streams of raw telemetry data consisting of resource usage and VM-related metrics into information that may be used to identify each VM, determine which VMs are similar based on their telemetry data patterns, and determine which VMs are similar based on their patterns of resource consumption. The similarity patterns can be used to group VMs that run the same applications and diagnose and debug VM performance.
Systems and methods described below model virtual machine (“VM”) metrics in order to obtain VM performance related information. In particular, the systems and methods receive streams of raw telemetry data associated with each VM and determine VM similarity. In other words, the systems and methods automatically identify patterns associated with groups of VMs and application workloads. A fingerprint is constructed for each VM. The fingerprint identifies the VM and characterizes the VM's performance. A fingerprint may also be used to identify performance problems of a VM, to compare the performance of the VM to the performance of other VMs in order to obtain information about compatible co-location, and to compare clusters of VMs run by different hosts in order to identify factors that degrade performance.
The power of similarity relationships stems from the additional context that similarity provides. For example, VMs that should ostensibly be “similar” because the VMs run the same applications (version, configuration etc.) or perform the same task but appear in practice to be dissimilar can be used to signal a possible performance issue. The quantitative or qualitative “distance” between a VM and its expected cohort may be used to explain or diagnose the discrepancy. Analogously, the distance between a VM and another cohort can be used to explain why the VMs are dissimilar. Moreover, groups of similar VMs may help redefine the notion of normal and abnormal VM performance.
Fingerprints are constructed to determine the relationships between VMs. These relationships (neighborhoods of similarity) between VMs based on their telemetry may then be used to explain performance variations, such as explaining why certain VMs that should ostensibly be similar behave as if they are not. The fingerprints scale with the number of metrics considered, not the number of machines, which is important for use in large clusters.
The methods described below use VM similarity rather than historical observations in order to provide additional context for anomaly detection and diagnosis. The use of similarity also allows users to attempt diagnosis before an extensive history has been collected by comparing a VM with its nearest neighbors. The methods described below use clustering techniques from statistical machine learning to automatically detect instances of similar VMs (i.e., a neighborhood) and then examine the behavior of key telemetry metrics of all the VMs in that neighborhood to detect, explain, and diagnose differences between the metrics.
It should be noted at the outset that streams of telemetry data and data output from the systems and methods for analyzing the streams of telemetry data described below are not, in any sense, abstract or intangible. Instead, the data is necessarily digitally encoded and stored in a physical data-storage computer-readable medium, such as an electronic memory, mass-storage device, or other physical, tangible, data-storage device and medium. It should also be noted that the currently described data-processing and data-storage methods cannot be carried out manually by a human analyst, because of the complexity and vast numbers of intermediate results generated for processing and analysis of even quite modest amounts of data. Instead, the methods described herein are necessarily carried out by electronic computing systems on electronically or magnetically stored data, with the results of the data processing and data analysis digitally encoded and stored in one or more tangible, physical, data-storage devices and media.
A stream of raw telemetry data is collected for each VM. The telemetry data may be generated by a virtual machine monitor (“VMM”), which may be an application, firmware or hardware that runs VMs and generates metrics in the form of a stream of telemetry data associated with each VM. A performance manager manages statistics collected for each of the VMs and provides an application programming interface (“API”) for querying performance counters. A performance counter is the value of a metric at a particular point in time. The telemetry data stream associated with each VM is composed of performance counter values collected for each metric in intervals of time called epochs. In other words, each metric is sampled a number of times within an epoch. An order statistic is used to identify features of each VM. For example, the data collected for each VM is stored in a computer-readable medium and can be represented in the following format:
subscript L is a VM integer index that ranges from 0 to N;
index i identifies the metric and ranges from 0 to d; and
M is the number of performance counter values sampled in an epoch.
The order statistic applied to the performance counter values associated with the metric i is called a feature and is calculated over each epoch. The feature can be a percentile, minimum, maximum, sample average or sample median. In the following description, the feature is the median of the metric values sampled over an epoch. Performance counters may be sampled at different frequencies, such as every 20 seconds, one minute, or five minutes, and the epoch is longer than the interval between samples. Each epoch is queried for the performance counter values that occurred within it. For example, consider a raw telemetry data stream composed of 300 metrics for a VM. Assume every epoch is 1 hour in duration and the performance manager samples the 300 metrics every 20 seconds. Each metric is then sampled 3600/20 = 180 times per epoch, so a 300 metric by 180 sample value matrix (i.e., 54,000 sample values) is generated for each epoch.
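The feature calculation in the example above can be sketched in a few lines of Python. This is a minimal illustration rather than the disclosed implementation: the function name `epoch_features` and the synthetic counter values are assumptions made for the example, and only the 300 metric, 20 second, 1 hour figures come from the text.

```python
from statistics import median

def epoch_features(samples):
    """Return one median per metric for a single VM and a single epoch.

    `samples` holds one row per metric; each row lists the performance
    counter values sampled for that metric during the epoch.
    """
    return [median(row) for row in samples]

# A one-hour epoch sampled every 20 seconds yields 3600 / 20 = 180 samples.
SAMPLES_PER_EPOCH = 3600 // 20
NUM_METRICS = 300

# Synthetic counters (made up for illustration): metric p ramps upward from p.
samples = [[p + k for k in range(SAMPLES_PER_EPOCH)] for p in range(NUM_METRICS)]

# One feature (the median) per metric: a 300-element feature vector.
features = epoch_features(samples)
```

Swapping `median` for `min`, `max`, or a percentile function would give the other order statistics mentioned above without changing the rest of the pipeline.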
Next, for each epoch, the metrics are pre-processed to discard constant-valued metrics and system metrics across the VMs. As a result, substantially constant metrics are discarded from the sample data matrices. The union of non-constant metrics is considered for further analysis below. Note that a metric that is constant for one VM but variable for another VM is not discarded, because the sample data matrices of each VM are separately pre-processed for each epoch.
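One way to read the pre-processing step is as a per-VM constancy check followed by a union across VMs. The sketch below assumes that interpretation; the function names and the tolerance parameter `tol` (standing in for "substantially constant") are hypothetical and do not appear in the description.

```python
def non_constant_metrics(samples, tol=0.0):
    """Indices of metrics whose samples vary by more than `tol` within the epoch."""
    return {p for p, row in enumerate(samples) if max(row) - min(row) > tol}

def metrics_to_keep(per_vm_samples, tol=0.0):
    """Union over VMs of the metrics that are non-constant for at least one VM."""
    keep = set()
    for samples in per_vm_samples:
        keep |= non_constant_metrics(samples, tol)
    return sorted(keep)

# Two VMs, three metrics each (values made up for illustration).
vm0 = [[1, 1, 1], [2, 5, 3], [7, 7, 7]]   # only metric 1 varies
vm1 = [[1, 1, 1], [4, 4, 4], [7, 8, 7]]   # only metric 2 varies

# Metric 0 is constant on both VMs and is dropped; metrics 1 and 2 survive
# because each varies on at least one VM.
kept = metrics_to_keep([vm0, vm1])
```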
For each metric, the median of the associated sample values collected in an epoch is calculated as follows:
$$m(t_j)_{Lp} = \operatorname{median}\{s_{p0}, s_{p1}, \ldots, s_{pM}\} \qquad (1)$$
where $m(t_j)_{Lp}$ is the median value of the M+1 sample values $s_{p0}, s_{p1}, \ldots, s_{pM}$ of the metric p collected in the epoch $t_j$ for the VM $VM_L$. At each epoch, the median of each metric is calculated for each VM.
The d+1 order statistics obtained for each epoch and each VM are arranged into feature vectors that are, in turn, arranged into data matrices associated with each epoch:
$$\overrightarrow{VM}(t_j)_L = [\, m(t_j)_{L0} \;\; m(t_j)_{L1} \;\; \cdots \;\; m(t_j)_{Ld} \,] \qquad (2)$$
For the epoch $t_j$, feature vectors $\overrightarrow{VM}(t_j)_0$ 304, $\overrightarrow{VM}(t_j)_1$ 305, $\overrightarrow{VM}(t_j)_2$ 306, . . . , $\overrightarrow{VM}(t_j)_N$ 307 are formed for each of the VMs $VM_0, \ldots, VM_N$, respectively, according to Equation (2). Next, for each epoch $t_j$, the associated feature vectors are arranged as rows in a data matrix denoted by $M(t_j)$.
Next, data matrices may be compacted to eliminate data redundancies and to obtain a compact metric data representation. One computational technique for compacting the data matrices is principal component analysis (“PCA”). PCA has the effect of reducing the dimensions of feature vectors $\overrightarrow{VM}(t_j)_L$ to a lower dimensional projection of the original feature vector. For example, feature vectors described above are (d+1)-dimensional. PCA may be used to reduce a (d+1)-dimensional feature vector to a two- or three-dimensional feature vector. By reducing to a two- or three-dimensional data set, clusters of VMs can be represented graphically, enabling visual inspection of the results. PCA is a data summarization technique that characterizes the variance of a set of data and can be used to identify principal components and eliminate components that are redundant. PCA is applied to each of the feature vectors of the data matrices $M(t_j)$. For the matrix $M(t_j)$, the mean is calculated for each of the d+1 features as follows:
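$$\bar{m}_p = \frac{1}{N+1}\sum_{L=0}^{N} m(t_j)_{Lp} \qquad (3)$$
where p is the metric index that ranges from 0 to d. The mean for each feature is subtracted from the corresponding element of each feature vector to obtain mean-centered feature vectors:
$$\widetilde{\overrightarrow{VM}}(t_j)_L = [\, \tilde{m}(t_j)_{L0} \;\; \tilde{m}(t_j)_{L1} \;\; \cdots \;\; \tilde{m}(t_j)_{Ld} \,] \qquad (4)$$
where each element in the mean-centered feature vector is given by:
$$\tilde{m}(t_j)_{Lp} = m(t_j)_{Lp} - \bar{m}_p$$
The mean-centered feature vectors can be arranged in rows to give a mean-centered data matrix given by:
$$\tilde{M}(t_j) = \begin{bmatrix} \widetilde{\overrightarrow{VM}}(t_j)_0 \\ \widetilde{\overrightarrow{VM}}(t_j)_1 \\ \vdots \\ \widetilde{\overrightarrow{VM}}(t_j)_N \end{bmatrix} \qquad (5)$$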
The covariance matrix is calculated from the mean-centered data matrix as follows to give a (d+1)×(d+1) matrix:
$$\Sigma = \frac{1}{N+1}\,\tilde{M}(t_j)^{T}\,\tilde{M}(t_j) \qquad (6)$$
where T represents matrix transpose.
Eigenvalues and eigenvectors of the covariance matrix $\Sigma$ are calculated. A user selects the dimension u (where u<d+1) to which the data is to be reduced. The reduction may be accomplished by first ordering the eigenvalues from highest to lowest and then forming a (d+1)×u matrix of eigenvectors F composed of the u eigenvectors associated with the u highest eigenvalues. The PCA data matrix is constructed by multiplying the matrix $\tilde{M}(t_j)$ by the matrix F:
$$M_{PCA}(t_j) = \tilde{M}(t_j) \cdot F \qquad (7)$$
In the resulting PCA data matrix $M_{PCA}(t_j)$, the rows are reduced feature vectors that correspond to the feature vectors in the data matrix $\tilde{M}(t_j)$.
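The PCA steps described above (mean-centering, covariance, leading eigenvectors, projection) can be sketched with the Python standard library alone. The sketch makes several assumptions that are not part of the description: the covariance is scaled by the number of rows (any positive scaling gives the same eigenvectors), the eigenvectors are approximated by power iteration with deflation rather than by a full eigen-decomposition, and all function names and data values are illustrative.

```python
import math

def mean_center(M):
    """Subtract each column (feature) mean from every row of M."""
    n = len(M)
    means = [sum(col) / n for col in zip(*M)]
    return [[x - mu for x, mu in zip(row, means)] for row in M]

def covariance(Mc):
    """Covariance of mean-centered rows (1/n scaling is an assumption)."""
    n = len(Mc)
    cols = list(zip(*Mc))
    return [[sum(a * b for a, b in zip(ca, cb)) / n for cb in cols] for ca in cols]

def matvec(S, v):
    return [sum(s * x for s, x in zip(row, v)) for row in S]

def top_eigenvectors(S, u, iters=500):
    """Approximate the u leading eigenvectors by power iteration with deflation."""
    d = len(S)
    S = [row[:] for row in S]
    vectors = []
    for _ in range(u):
        # Fixed start vector; it must not be orthogonal to the eigenvector sought.
        v = [1.0 / math.sqrt(d)] * d
        for _ in range(iters):
            w = matvec(S, v)
            norm = math.sqrt(sum(x * x for x in w))
            if norm == 0.0:
                break
            v = [x / norm for x in w]
        lam = sum(x * y for x, y in zip(v, matvec(S, v)))  # Rayleigh quotient
        vectors.append(v)
        # Deflation: remove the component just found before the next pass.
        S = [[S[a][b] - lam * v[a] * v[b] for b in range(d)] for a in range(d)]
    return vectors

def pca_project(M, u):
    """Rows of the result are the u-dimensional reduced feature vectors."""
    Mc = mean_center(M)
    F = top_eigenvectors(covariance(Mc), u)
    return [[sum(x * f for x, f in zip(row, vec)) for vec in F] for row in Mc]

# Six made-up 3-dimensional feature vectors whose variance is largest along
# the first axis, smaller along the second, and negligible along the third.
M = [[2.0, 0.0, 0.0], [-2.0, 0.0, 0.0],
     [0.0, 1.0, 0.0], [0.0, -1.0, 0.0],
     [0.0, 0.0, 0.1], [0.0, 0.0, -0.1]]
reduced = pca_project(M, 2)
```

In practice a numerical library routine for symmetric eigen-decomposition would normally replace `top_eigenvectors`; power iteration is used here only to keep the sketch dependency-free.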
K-means clustering is then used to identify and group together the VMs that are similar based on their corresponding metric values. K-means clustering is an unsupervised machine learning technique used to identify structure in a data set. K-means clustering can be applied to the data in the data matrices $\tilde{M}(t_j)$ or applied to the PCA data matrix $M_{PCA}(t_j)$. K-means clustering treats each feature vector $\overrightarrow{VM}_L$, which corresponds to a VM, as a point in a (d+1)-dimensional space based on the vector's metric values. The feature vectors that are close in space correspond to VMs that have similar metric values. K-means clustering receives the (d+1)-dimensional feature vectors $\overrightarrow{VM}_L$ and a set of clusters $C=\{C_1, C_2, \ldots, C_S\}$ among which the feature vectors are to be partitioned. K-means clustering minimizes the within-cluster sum of squares given by:
$$\underset{C}{\arg\min}\; \sum_{i=1}^{S} \sum_{\overrightarrow{VM}_L \in C_i} \left\| \overrightarrow{VM}_L - \vec{Z}_i \right\|^2 \qquad (8)$$
where $\vec{Z}_i$ is the centroid of $C_i$.
Given randomly generated initial values $\vec{Z}_1^{0}, \vec{Z}_2^{0}, \ldots, \vec{Z}_S^{0}$ for the cluster centroids, K-means clustering iteratively proceeds through assignment and update steps until convergence. At each iteration t, each feature vector $\overrightarrow{VM}_L$ is assigned to the cluster $C_j^{t}$ with the closest centroid $\vec{Z}_j^{t}$, and the centroid of each cluster is updated according to
$$\vec{Z}_j^{t+1} = \frac{1}{|C_j^{t}|} \sum_{\overrightarrow{VM}_L \in C_j^{t}} \overrightarrow{VM}_L \qquad (9)$$
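The assignment and update steps can be illustrated with a short, dependency-free sketch of Lloyd's algorithm. Two choices here are assumptions for the sake of a reproducible example: the initial centroids are drawn from the data points using a seeded generator, and an empty cluster simply keeps its previous centroid. The point coordinates are made up.

```python
import random

def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(points, s, seed=0, max_iter=100):
    """Lloyd's algorithm; returns (labels, centroids) for s clusters."""
    rng = random.Random(seed)
    # Assumption: initial centroids are s distinct data points chosen at random.
    centroids = [list(p) for p in rng.sample(points, s)]
    labels = None
    for _ in range(max_iter):
        # Assignment step: each point joins the cluster with the closest centroid.
        new_labels = [min(range(s), key=lambda j: sq_dist(p, centroids[j]))
                      for p in points]
        if new_labels == labels:
            break  # assignments are stable, so the centroids are too
        labels = new_labels
        # Update step: each centroid becomes the mean of its assigned points.
        for j in range(s):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                centroids[j] = [sum(c) / len(members) for c in zip(*members)]
    return labels, centroids

# Two well-separated groups of made-up 2-dimensional reduced feature vectors.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
          (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
labels, centroids = kmeans(points, 2)
```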
The K-means method requires the number of clusters s to be provided as an input, which implies that an optimal number of clusters for a given data configuration has to be determined. A poor choice for the number of clusters can lead to a poor result. Two different methods may be used to select the optimal number of clusters: the elbow method and the Bayesian information criterion (“BIC”). For the elbow method, a marginal loss distortion $D_s$ for a given partition of the data into s clusters is defined as
The elbow criterion runs K-means for s=1, 2, 3, . . . and in each case computes the associated distortion given in Equation (10). Note that as the number of clusters increases, the distortion decreases, reaching the value “0” as s approaches the number of data points. Beyond some point, additional clusters do not produce a better cluster model for the data. As a result, a best model may correspond to a sudden drop or “elbow” in the marginal loss distortion.
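Locating the elbow can be automated in several ways. The sketch below uses one common heuristic, which is an assumption and not necessarily the criterion intended here: it picks the cluster count at which the curve bends most sharply, measured by the discrete second difference of the distortion values. The distortion values are made up.

```python
def elbow(distortions):
    """Return the cluster count s (1-based) at the sharpest bend of the curve.

    distortions[k] holds the distortion obtained with s = k + 1 clusters.
    """
    if len(distortions) < 3:
        raise ValueError("need distortions for at least three cluster counts")
    best_s, best_bend = None, float("-inf")
    for k in range(1, len(distortions) - 1):
        # Drop gained by reaching s clusters minus the drop gained by one more.
        bend = (distortions[k - 1] - distortions[k]) - (distortions[k] - distortions[k + 1])
        if bend > best_bend:
            best_s, best_bend = k + 1, bend
    return best_s

# Made-up distortions for s = 1..5: a large drop up to s = 3, then a plateau.
distortions = [100.0, 60.0, 10.0, 8.0, 7.0]
chosen = elbow(distortions)
```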
Alternatively, BIC provides a quantitative method for choosing the number of clusters. If L(θ) is the log-likelihood function and m is the number of clusters, then the BIC is given by:
where
f is the number of free parameters, and
n is the number of observations.
If s is the number of clusters and (d+1) is the number of dimensions, then the number of free parameters is the sum of s−1 class probabilities, s(d+1) centroids and sd(d+1)/2 free parameters in the covariance matrix. The log-likelihood function of the ith cluster and the BIC are given by:
However, it is not yet determined whether to select the number of clusters that correspond to troughs 610 and 611. An angle-based method can be used to select the optimal number of clusters. First, the local minima among the successive differences are found and sorted in decreasing order of their absolute values. Pointers to the corresponding number of clusters are maintained. The angle associated with each local minimum is computed as follows. When i is the corresponding number of clusters, the angle can be computed according to
When the first local maximum is found among the angles, the method stops.
Using PCA described above, the original N+1 (d+1)-dimensional feature vectors $\overrightarrow{VM}_L$ are projected into u-dimensional vectors. For example, the (d+1)-dimensional feature vectors $\overrightarrow{VM}_L$ may be projected into 2-dimensional vectors that lie in the Euclidean plane. Furthermore, K-means clustering has been used to group the u-dimensional vectors into clusters. The VMs in the same cluster are similar, but which of the original d+1 metrics are responsible for bringing the VMs together in the same cluster remains to be determined. One-vs-all logistic regression is used to determine which metrics best characterize the cluster. Logistic regression works by providing a separator between a cluster of interest and the remaining clusters (e.g., in two dimensions the separator is a line and in three dimensions the separator is a plane). The analytical representation of the separator gives a weight for each of the dimensions. The higher the weight, the more important the dimension and, implicitly, the corresponding metric.
One-vs-all logistic regression (“OVA LR”) is used to extract a subset of features that best describe each cluster of VMs. OVA LR is a supervised statistical machine learning technique for classification. Given a dataset of features and labeled points (i.e., feature vectors) that represent positive and negative examples, OVA LR identifies the subset of features and their associated coefficients that can be used to distinguish the positive examples from the negative examples. OVA LR uses the points in one cluster as the set of positive examples while considering all the points in the remaining clusters as negative examples, reducing the problem to a 2-class classification. The subset of features and coefficients that describe a cluster is the cluster's “fingerprint.”
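Logistic regression characterizes the structure of the statistical clusters obtained from K-means clustering described above by identifying the relevant features of each cluster. OVA LR produces a fingerprint for each group of VMs in the form of a summarized/compressed representation of raw metrics. OVA LR is a classification technique for identifying the aspects that describe a labeled set of data points. OVA LR is based on a sigmoid function classifier that takes values between 0 and 1:
$$h_{\theta}(\overrightarrow{VM}_L) = \frac{1}{1 + e^{-\vec{\theta}^{T}\overrightarrow{VM}_L}} \qquad (14)$$
where $\vec{\theta}$ is a vector of weights and has the same dimensions as the feature vector $\overrightarrow{VM}_L$. OVA LR assigns a label (y=1 or y=0) to each new data point $\overrightarrow{VM}_L$ based on a training set for which the labels are already known. The hypothesis output $h_{\theta}(\overrightarrow{VM}_L)$ is interpreted as the estimated probability that y=1 on $\overrightarrow{VM}_L$. The rule that assigns labels given the parameters $\vec{\theta}$ is intuitive:
$$y = \begin{cases} 1 & \text{if } h_{\theta}(\overrightarrow{VM}_L) \ge 0.5 \\ 0 & \text{if } h_{\theta}(\overrightarrow{VM}_L) < 0.5 \end{cases} \qquad (15)$$
Because the sigmoid function is at least 0.5 exactly when its argument is non-negative, the classification rule in Equation (15) can be simplified to
$$y = \begin{cases} 1 & \text{if } \vec{\theta}^{T}\overrightarrow{VM}_L \ge 0 \\ 0 & \text{if } \vec{\theta}^{T}\overrightarrow{VM}_L < 0 \end{cases} \qquad (16)$$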
The equation $\vec{\theta}^{T}\overrightarrow{VM}_L = 0$ describes the decision boundary for the hypothesis. The points on one side of the boundary receive the label y=1 while points on the other side receive y=0. The components of the vector $\vec{\theta}$ are determined by minimizing a cost function given by:
The output from OVA LR is the vector $\vec{\theta}$, whose components, called coefficients, are the weights that describe the class of positive examples. Examples of vector $\vec{\theta}$ coefficients are presented in tables of the Example subsection below. The coefficients of the vector $\vec{\theta}$ are the fingerprint used to identify each of the virtual machines and can be used to compare the performance of one VM to the performance of other VMs.
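The one-vs-all scheme can be sketched as follows. This is a minimal illustration under stated assumptions: the weights are fitted by plain batch gradient descent on the standard unregularized logistic (cross-entropy) cost, no intercept term is used so that the weight vector has the same dimension as the feature vector, and the learning rate, iteration count, function names, and data values are all illustrative. A production fingerprint would more likely come from a regularized solver that drives unimportant coefficients toward zero.

```python
import math

def sigmoid(z):
    """Numerically stable logistic function with values between 0 and 1."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def fit_logistic(X, y, lr=0.1, epochs=2000):
    """Batch gradient descent on the logistic cost; returns the weight vector."""
    theta = [0.0] * len(X[0])
    n = len(X)
    for _ in range(epochs):
        grad = [0.0] * len(theta)
        for row, target in zip(X, y):
            err = sigmoid(dot(theta, row)) - target
            for k, value in enumerate(row):
                grad[k] += err * value
        theta = [t - lr * g / n for t, g in zip(theta, grad)]
    return theta

def ova_fingerprints(X, labels):
    """One weight vector per cluster: that cluster is y=1, all others are y=0."""
    return {c: fit_logistic(X, [1 if lab == c else 0 for lab in labels])
            for c in sorted(set(labels))}

def predict(theta, x):
    """Label 1 when the point lies on the positive side of the decision boundary."""
    return 1 if dot(theta, x) >= 0 else 0

# Made-up mean-centered features: feature 0 separates the clusters, feature 1 is noise.
X = [[-2.0, 0.1], [-1.5, -0.2], [-1.0, 0.3],
     [1.0, -0.1], [1.5, 0.2], [2.0, -0.3]]
labels = [0, 0, 0, 1, 1, 1]
fingerprints = ova_fingerprints(X, labels)
```

In this toy data set the coefficient for feature 0 dominates each fitted weight vector, which is the sense in which the larger weights point to the metrics that characterize a cluster.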
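The quality of the classification is analyzed by examining certain measures, such as precision, recall, and an F-measure given respectively by:
$$\text{precision} = \frac{tp}{tp + fp}, \qquad \text{recall} = \frac{tp}{tp + fn}, \qquad F = \frac{2 \cdot \text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}}$$
where
tp is the number of true positives;
fp is the number of false positives;
tn is the number of true negatives; and
fn is the number of false negatives.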
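As a worked example of these measures, the sketch below computes precision, recall, and the balanced F-measure from raw counts. Treating the F-measure as the harmonic mean of precision and recall (often called F1) is an assumption, as are the function name, the zero-division convention, and the counts used.

```python
def classification_quality(tp, fp, fn):
    """Return (precision, recall, F) from true-positive, false-positive,
    and false-negative counts; ratios with a zero denominator are reported as 0."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall:
        f_measure = 2 * precision * recall / (precision + recall)
    else:
        f_measure = 0.0
    return precision, recall, f_measure

# Made-up counts: 8 VMs correctly assigned to the cluster, 2 wrongly included,
# and 2 wrongly excluded.
precision, recall, f_measure = classification_quality(tp=8, fp=2, fn=2)
```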
The data-mining methods described above were applied to debug performance for a tool used to emulate and evaluate large-scale deployments of virtual desktops. The tool was configured to generate workloads that are representative of user-initiated operations (e.g., interacting with documents, media and email) that take place in virtualized desktop infrastructure (“VDI”). The VDI is the practice of hosting a desktop operating system in a virtual machine running on a centralized server.
A tool deployment consists of three groups of VMs: 1) desktops that generate loads, such as launching applications and executing tasks, 2) clients that are connected to the desktops via a remote display protocol, such as PCoIP, and display the results of actions being executed on the desktop, and 3) infrastructure VMs that host the components of the tool (e.g., the controller that launches the experiments), and VMs concerned with monitoring the virtual and physical environment, such as a virtualized data center. During a tool run, the desktop VMs run a mix of applications. Applications perform a randomized mix of tasks including: open, close, save, save as, minimize, maximize, start/stop presentation, modify document, play/stop video, as appropriate for the specific application being run.
The tool experiment cluster included a total of 175 VMs: 84 desktop VMs, 84 client VMs that use PCoIP to connect to desktops, and 7 infrastructure VMs (3 vCOPS VMs and 4 tool infrastructure VMs). The tool run lasted for approximately 5 hours and the 175 VMs generated approximately 360 MB of metric data in a concise CSV-based storage format. The results below show that: (1) The VMs can be automatically grouped/clustered based on their telemetry patterns. The clustering results are robust, remaining stable over time, and they are not sensitive to the choice of order statistic used on raw telemetry data to create the features used for clustering. (2) An accurate fingerprint that contains the subset of metrics that best describe the behavior of the VMs in the group can be generated. (3) Techniques like PCA described above can be used to compress the raw metric feature vectors while maintaining accurate and stable VM groupings. (4) Techniques from signal processing can be used to filter and select fingerprint metrics useful for explaining/diagnosing differences within groups of ostensibly similar VMs. Finally, it was demonstrated that conditional probability distributions can be used to effect an explanation/diagnosis.
Tables 1, 2 and 3 show the respective metric fingerprints that best describe (and partition) the clusters of VMs. Table 1 displays the fingerprints for the cluster of 84 clients. Note the prominent contribution of CPU and network metrics.
Table 2 displays the fingerprints for the cluster of 51 desktops.
Table 3 displays the fingerprints for the cluster of 22 desktops.
The original expectation was to have 3 groups of VMs instead of 4. A technique for debugging the difference is now described. Specifically why and how the two adjacent groups of desktop VMs shown in
Spread metrics were used to identify the metrics that differentiate two clusters or cause a cluster to split or diffuse. Spread metrics characterize how much the expected value of the order statistic (e.g., the median) of a metric E[m] differs between two clusters (i.e., the expected value is conditioned on the cluster). Note that the expected value summarizes the behavior of the distribution of the order statistics of a metric, and by definition is the weighted average of all the possible values a random variable can take. The expectation over the distribution of an order statistic captures the aggregate behavior over a population of VMs. In this case, the population of VMs is the neighborhood of similar VMs determined by K-means clustering. Concisely, a spread metric is given by
where
E[m|cluster i] is the expected value of a metric m conditioned on the VM being in cluster i;
mmax is the maximum value of that metric over the clusters considered and serves as a normalization factor; and
θ is a tuning parameter that is used to identify metrics to filter/remove based on the magnitude of the differences in expected values.
Using too small of a value for θ filters or removes a larger number of metrics, potentially to the point of removing metrics that distinguish previously disparate (non-adjacent) clusters of VMs. Experiments revealed that values for θ between 0.1 and 0.2 work well. In this example, θ=0.1 was used.
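A plausible reading of the spread computation, consistent with the definitions of E[m|cluster i], mmax, and θ above, is sketched below. It is an assumption rather than the stated formula: the spread of a metric is taken to be the absolute difference of the two conditional expected values divided by mmax, each expected value is estimated by the sample mean of the per-VM medians in the cluster, metrics are assumed to be non-negative, and a metric is flagged when its spread exceeds θ. The metric names and values are made up.

```python
def expected_value(values):
    """Sample mean, used as the estimate of E[m | cluster]."""
    return sum(values) / len(values)

def spread(values_i, values_j):
    """Normalized difference between the expected values of one metric in two clusters."""
    m_max = max(max(values_i), max(values_j))   # normalization factor
    if m_max == 0:
        return 0.0
    return abs(expected_value(values_i) - expected_value(values_j)) / m_max

def spread_metrics(cluster_i, cluster_j, theta=0.1):
    """Names of metrics whose spread exceeds theta, largest spread first.

    Each cluster maps a metric name to the list of per-VM median values.
    """
    scores = {name: spread(cluster_i[name], cluster_j[name])
              for name in cluster_i if name in cluster_j}
    flagged = [name for name, score in scores.items() if score > theta]
    return sorted(flagged, key=lambda name: scores[name], reverse=True)

# Hypothetical per-VM medians for two adjacent clusters of three VMs each.
cluster_a = {"cpu": [10.0, 12.0, 11.0], "mem": [50.0, 50.0, 51.0]}
cluster_b = {"cpu": [20.0, 22.0, 21.0], "mem": [50.0, 51.0, 50.0]}
flagged = spread_metrics(cluster_a, cluster_b, theta=0.1)
```

Here only the CPU metric is flagged: its expected values differ by 10 against a maximum of 22, while the memory metric has the same expected value in both clusters.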
Given that there are possibly hundreds of metrics that can be considered as candidates for spread metrics, and considering the computational expense of constructing conditional probability distributions, Silverman's test was used to identify multi-modal metrics, which were processed first when looking for candidate metrics over which to construct the conditional probability distributions. Entropy-based measures, e.g., mutual information, were also used to determine which metrics to condition a metric m on.
Table 4 shows the top ten spread metrics that separate the two clusters of desktops in
Next, the data-mining methods were applied to a cluster of computers for 5 days, generating 19 GB of data stored in concise CSV format and 1.2 GB of indexes used for querying.
The spread metrics were computed within the cluster of 207 VMs to identify the metrics that explain the dispersion. Thirty metrics were identified for which the spread is greater than θ=0.1; the top 10 of these metrics are shown in Table 5.
Embodiments described above are not intended to be limited to the descriptions above. For example, any number of different computational-processing-method implementations that carry out mining of telemetry data may be designed and developed using various different programming languages and computer platforms and by varying different implementation parameters, including control structures, variables, data structures, modular organization, and other such parameters.
It is appreciated that the previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present disclosure. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the disclosure. Thus, the present disclosure is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.
Number | Date | Country | |
---|---|---|---|
20150007173 A1 | Jan 2015 | US |