This disclosure relates to computational systems and methods for reducing the number of metrics used to monitor computer resources.
Computer resources are typically monitored to evaluate performance and assess how certain resources perform with respect to different operations. A computer resource can be monitored by generating one or more metrics that indicate how often or how much particular components of the resource are used over time. For example, the metrics typically collected for a server over time may be the average number of times a buffer is accessed, the number of times certain connections are used or idle, electrical power consumption, network throughput, hard disk space, and processor time. After multiple metrics have been collected, the metrics can be evaluated to assess the performance of individual components of a resource or the metrics can be used to track the performance of the resource. For example, the same metrics collected for various servers can be used to compare how different servers perform when executing the same set of instructions.
However, many resource monitoring applications use a large number of metrics which, in turn, creates problems for resource users. For example, a resource user trying to determine which of a large number of metrics to select in order to assess performance of the resource may be overwhelmed and have to guess as to which metrics to collect; the large number of metrics collected increases storage requirements and, therefore, increases the cost of evaluating a resource's performance; and, when monitoring multiple resources, the large number of metrics may reduce the scale of monitored resources and/or monitoring time when one of the resources is dedicated to monitoring the other resources. As a result, those working in the computer industry seek tools that can be used to reduce the number of metrics without sacrificing useful information that may be used to evaluate the performance of a resource.
This disclosure presents computational methods and systems for identifying a subset of a set of metrics that can be used to monitor a resource in which the subset is representative of the information represented by the set of metrics. The methods and systems receive metric values associated with a set of metrics used to monitor a resource over a sample period of time and calculate the correlation magnitude for each pair of metrics. The correlation magnitudes are compared to one another in order to identify correlated metrics. At least one of the correlated metrics is deleted from the set of metrics, resulting in a subset of metrics that produces information representative of the full set of metrics. Deletion of metrics from the set of metrics may be optimized for accuracy by determining a subset of metrics that gives a minimum accuracy, or the set of metrics may be optimized for cost by determining a subset of metrics that gives a best accuracy for a maximum number of metrics allowed. The representative subset of metrics can then be used to monitor the resource.
This disclosure presents computational systems and methods for identifying a subset of a set of metrics that is representative of the information provided by the full set of metrics regarding resource performance. As used herein, a resource may be any physical or virtual component of a computer system that has limited availability and is able to be evaluated using one or more metrics. Examples of a resource include a server, a storage array, a network, and a sensor. A resource may also be any external device connected to a computer system or any internal component of the computer system. Resources also include virtual resources such as files, network connections, and memory areas. A metric may be any system of measurement that produces a numerical value that represents a measurable aspect, feature, or component of a resource over time. For example, if a resource to be monitored is a server, an example set of metrics that can be collected to evaluate the server's performance over a period of time includes percentage of buffer hits, buffer reads per second, processing time, and electrical power consumption.
It should be noted at the outset that data relating to a resource, such as metric data and correlation data, are not, in any sense, abstract or intangible. Instead, the data is necessarily digitally encoded and stored in a physical data-storage computer-readable medium, such as an electronic memory, mass-storage device, or other physical, tangible, data-storage device and medium. It should also be noted that the currently described data-processing and data-storage methods cannot be carried out manually by a human analyst, because of the complexity and vast numbers of intermediate results generated for processing and analysis of even quite modest amounts of data. Instead, the methods described herein are necessarily carried out by electronic computing systems on electronically or magnetically stored data, with the results of the data processing and data analysis digitally encoded and stored in one or more tangible, physical, data-storage devices and media.
$$\tilde{M} = \{ m_i \mid i = 1, \ldots, n \} \qquad (1)$$
where n is a positive integer; and
the “˜” symbol indicates a set.
are used to monitor the resource. The set of metrics can be encoded and stored in a computer-readable medium. The selection of metrics may be determined by the type of resource being monitored. For example, when the resource is a server, m1 can be the CPU usage, m2 can be the number of buffer accesses, m3 can be the electrical power consumption, and so on. Metric values associated with each of the n metrics are calculated and stored in a computer-readable medium for each of the p time intervals in the time line 202. In the time interval 204 between time “0” and time t1, n metric values are calculated for the resource and stored in a computer-readable medium, and in the time interval 205 between time t1 and time t2, the same n metrics are again calculated for the resource and stored in the computer-readable medium. The values of metrics collected for each of the intervals of the time line 202 shown in
The values of the metrics may depend on the resource usage pattern over the time line. As a result, the metric values may be calculated and collected while running a mix of tests against the resource during the time line. For example, one test can include positive tests in which the demand on the resource is low as well as negative tests in which the demand on the resource is high. The test may also include stress tests. The test selected can be designed to cover cases in which the metrics have strong correlation under positive test conditions and have potentially weak correlation when the resource is under negative test conditions.
After the n metric values have been collected for each of the p time intervals, as represented by the set {m1(t), m2(t), . . . , mn(t)|t=1, . . . , p}, the correlation magnitude is calculated for each pair of metrics in the set {tilde over (M)}:
$$c_{i,j} = \left| \operatorname{corr}(m_i, m_j) \right| \qquad (2)$$
where mi ∈ {tilde over (M)} and mj ∈ {tilde over (M)}; and
corr(mi, mj) is the correlation between the metrics mi and mj.
The correlation magnitude associated with each pair of metrics is encoded and stored in the computer-readable medium. In Equation (2), the correlation may be computed using:
for the entire population of metrics in the set {tilde over (M)}. The numerator in Equation (3) is the expectation value given by:
and the denominator of Equation (3) is a product of standard deviations of the pair of metrics mi and mj, which can be computed according to:
where s=i and j.
The expectation value in Equation (4) and standard deviation in Equation (5) are functions of the average value of each of the metrics mi and mj:
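For reference, Equations (3) through (6) can be written in a form consistent with the description above. The following is a reconstruction, not a verbatim copy: it assumes the standard population (Pearson) correlation, normalized by the number of time intervals p, with μs and σs denoting the average and the standard deviation of the metric ms, as those symbols are used later in the text.

```latex
\begin{align}
\operatorname{corr}(m_i, m_j) &= \frac{E\left[(m_i - \mu_i)(m_j - \mu_j)\right]}{\sigma_i \, \sigma_j} \tag{3} \\
E\left[(m_i - \mu_i)(m_j - \mu_j)\right] &= \frac{1}{p} \sum_{t=1}^{p} \bigl(m_i(t) - \mu_i\bigr)\bigl(m_j(t) - \mu_j\bigr) \tag{4} \\
\sigma_s &= \sqrt{\frac{1}{p} \sum_{t=1}^{p} \bigl(m_s(t) - \mu_s\bigr)^2} \tag{5} \\
\mu_s &= \frac{1}{p} \sum_{t=1}^{p} m_s(t) \tag{6}
\end{align}
```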
The value resulting from the correlation corr(mi, mj) calculated according to Equation (3) lies in the interval [−1,1] and indicates the strength and direction of a linear relationship between the metrics mi and mj. The closer corr(mi, mj) is to “0,” the lower the correlation between the pair of metrics mi and mj, with corr(mi, mj)=0 indicating that the metrics mi and mj are not correlated. The closer corr(mi, mj) is to “−1” or “1,” the higher the correlation between the pair of metrics mi and mj, with corr(mi, mj)=−1 or 1 indicating that the metrics mi and mj are highly correlated. The sign associated with the correlation corr(mi, mj) indicates whether the data has a positively sloped or negatively sloped relationship. For example, if linear regression is applied to the set of metric value pairs {(mi(t),mj(t))|t=1, . . . , p}, a positive correlation would correspond to a positively sloped regression line and a negative correlation would correspond to a negatively sloped regression line. In contrast, the correlation magnitude ci,j calculated according to Equation (2) lies in the interval [0,1] and provides the degree of correlation but not the positive or negative slope relationship of the metrics mi and mj. In particular, the closer ci,j is to “0,” the lower the correlation between the pair of metrics mi and mj, with ci,j=0 indicating that the metrics mi and mj are not correlated. The closer ci,j is to “1,” the higher the correlation between the pair of metrics mi and mj, with ci,j=1 indicating that the metrics mi and mj are highly correlated.
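The calculation of the correlation and the correlation magnitude for one pair of metrics can be sketched as follows. This is a minimal illustration that assumes the standard population Pearson correlation; the function names `population_corr` and `corr_magnitude` and the sample values are hypothetical and do not come from the disclosure.

```python
import math


def population_corr(mi, mj):
    """Population correlation of two equally long metric series.

    Returns NaN when either series has zero standard deviation.
    """
    if len(mi) != len(mj) or not mi:
        raise ValueError("series must be non-empty and of equal length")
    p = len(mi)
    mu_i = sum(mi) / p
    mu_j = sum(mj) / p
    # Expectation value of the product of deviations from the averages.
    expectation = sum((a - mu_i) * (b - mu_j) for a, b in zip(mi, mj)) / p
    sigma_i = math.sqrt(sum((a - mu_i) ** 2 for a in mi) / p)
    sigma_j = math.sqrt(sum((b - mu_j) ** 2 for b in mj) / p)
    if sigma_i == 0 or sigma_j == 0:
        return math.nan
    return expectation / (sigma_i * sigma_j)


def corr_magnitude(mi, mj):
    """Correlation magnitude c_ij = |corr(m_i, m_j)|, in [0, 1]."""
    return abs(population_corr(mi, mj))


# Hypothetical metric values over p = 4 time intervals.
cpu_usage = [1.0, 2.0, 3.0, 4.0]
idle_connections = [8.0, 6.0, 4.0, 2.0]
print(population_corr(cpu_usage, idle_connections))  # negative: opposite slope
print(corr_magnitude(cpu_usage, idle_connections))   # sign is discarded
```

The magnitude discards the sign, so a pair that moves in opposite directions is treated as just as redundant as a pair that moves together.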
Alternatively, rather than calculating the correlations and/or correlation magnitudes for the entire population of p time intervals in a time line, as described above with reference to Equations (2)-(7), sample correlations and/or sample correlation magnitudes may be calculated and stored in the computer-readable medium. For example, a sample correlation for each pair of metrics can be calculated using the sample correlation
where q is the number of samples with q ≤ p;
r is an integer time interval index selected from the range 1 to p; and
with s=i and j.
The sample correlation magnitude is given by
$$c_{i,j}^{sample} = \left| \operatorname{corr}_{sample}(m_i, m_j) \right| \qquad (8)$$
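A sample correlation magnitude over q of the p time intervals can be sketched as below. The sketch assumes the q intervals are drawn at random without replacement, which the description does not specify; the function name and the data are hypothetical. Because the same normalization factor appears in the numerator and the denominator, it cancels, so the code does not need to choose between dividing by q and dividing by q−1.

```python
import math
import random


def sample_corr_magnitude(mi, mj, indices):
    """|corr_sample(m_i, m_j)| over the chosen time-interval indices.

    `indices` holds 0-based positions; the text indexes intervals from 1.
    """
    xi = [mi[r] for r in indices]
    xj = [mj[r] for r in indices]
    q = len(indices)
    mu_i = sum(xi) / q
    mu_j = sum(xj) / q
    numerator = sum((a - mu_i) * (b - mu_j) for a, b in zip(xi, xj))
    denominator = math.sqrt(
        sum((a - mu_i) ** 2 for a in xi) * sum((b - mu_j) ** 2 for b in xj)
    )
    # A sampled series that happens to be constant gives an undefined result.
    return math.nan if denominator == 0 else abs(numerator / denominator)


p, q = 12, 5
cpu_usage = [float(t) for t in range(p)]                  # hypothetical values
power_draw = [3.0 * t + 10.0 for t in range(p)]           # hypothetical values
sampled = random.Random(0).sample(range(p), q)            # q distinct intervals
print(sample_corr_magnitude(cpu_usage, power_draw, sampled))
```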
In other words, every metric is highly correlated with itself. Because the correlations form a symmetric matrix and the diagonal elements are equal to “1,” the correlations corr(mi, mj) are only calculated for i=1, . . . , n and j=i+1, . . . , n, which corresponds to the off-diagonal upper-triangular portion, or equivalently the lower-triangular portion, of the correlation matrix 400.
$$\tilde{C} = \{ c_{i,j} \mid i = 1, \ldots, n;\ j = i+1, \ldots, n \} \qquad (9)$$
that are calculated and stored in a computer-readable medium. The diagonal matrix elements are not included in {tilde over (C)} because each diagonal matrix element represents the correlation magnitude of a metric with itself, which is irrelevant when eliminating metrics that are correlated with other metrics in the set {tilde over (M)}.
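Extracting the off-diagonal upper-triangular magnitudes from a full correlation matrix can be sketched as follows. The function name `upper_triangle_magnitudes` and the 3-by-3 matrix are hypothetical; keys are 1-based (i, j) pairs with i < j, matching the indexing in Equation (9).

```python
def upper_triangle_magnitudes(corr_matrix):
    """Return {(i, j): |corr_ij|} for i < j, skipping the diagonal."""
    n = len(corr_matrix)
    return {
        (i + 1, j + 1): abs(corr_matrix[i][j])
        for i in range(n)
        for j in range(i + 1, n)
    }


# Hypothetical symmetric correlation matrix for three metrics.
corr = [
    [1.0, -0.8, 0.1],
    [-0.8, 1.0, 0.3],
    [0.1, 0.3, 1.0],
]
magnitudes = upper_triangle_magnitudes(corr)
print(magnitudes)
```

For n metrics the resulting set holds n(n−1)/2 magnitudes, roughly half the work of filling the full matrix.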
It should be noted that correlations and correlation magnitudes associated with certain metrics may be undefined and identified as not a number (“NaN”). Undefined correlations result from metrics that are constant throughout the time intervals of the time line. For example, when a metric ms(t)=constant for all t=1, . . . , p, according to Equation (6) μs=constant, which gives a standard deviation σs=0 according to Equation (5). As a result, the correlation and correlation magnitude of ms(t) paired with any other metric in {tilde over (M)} is undefined.
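Metrics whose correlations would be undefined can be detected before any correlation is computed, as in the sketch below. The helper name `constant_metrics` and the metric names are hypothetical. Comparing the minimum and maximum values is used here instead of testing a computed standard deviation against zero, because floating-point round-off in the average can leave a tiny nonzero deviation for a series that is in fact constant.

```python
def constant_metrics(metric_values):
    """Names of metrics that never change over the time line.

    Any correlation involving one of these metrics is undefined (NaN),
    because its standard deviation is zero.
    """
    return [
        name
        for name, values in metric_values.items()
        if max(values) == min(values)
    ]


# Hypothetical metric values over p = 3 time intervals.
metric_values = {
    "cpu_usage": [0.2, 0.5, 0.9],
    "disk_capacity": [0.1, 0.1, 0.1],  # constant: correlations are NaN
}
print(constant_metrics(metric_values))
```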
The set of metrics {tilde over (M)} in Equation (1) and the set of correlation magnitudes {tilde over (C)} in Equation (9) are mathematically related such that each pair of elements in {tilde over (M)}, (mi, mj), is related to one of the elements ci,j of {tilde over (C)}. The mathematical relationship may be represented by a graph that consists of the two sets {tilde over (M)} and {tilde over (C)} denoted by ({tilde over (M)}, {tilde over (C)}). The elements of {tilde over (M)} may be called vertices or nodes, and the elements of {tilde over (C)} may be called edges that connect two vertices (i.e., a pair of metrics). A subset {tilde over (M)}′ of the set {tilde over (M)} that is representative of the information provided by the entire set of metrics in {tilde over (M)} may be obtained by identifying a number of metrics with the highest correlation to other metrics in the set {tilde over (M)} and eliminating those metrics from the set {tilde over (M)} to give the representative subset {tilde over (M)}′.
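The graph of metrics and correlation magnitudes can be sketched with a set of vertices and a dictionary of weighted edges. This is only one possible representation; the names `build_graph` and `incident_edges` are hypothetical, and the caller-supplied `magnitude` function stands in for whatever correlation magnitudes have been computed.

```python
from itertools import combinations


def build_graph(n, magnitude):
    """Return (vertices, edges) for metrics m_1 .. m_n.

    vertices: set of metric indices 1..n
    edges:    {(i, j): c_ij} for every pair with i < j
    """
    vertices = set(range(1, n + 1))
    edges = {
        (i, j): magnitude(i, j)
        for i, j in combinations(sorted(vertices), 2)
    }
    return vertices, edges


def incident_edges(edges, k):
    """Edges in which the metric m_k is one of the pair of metrics."""
    return {pair: c for pair, c in edges.items() if k in pair}


# Hypothetical placeholder magnitude: every pair gets the same value.
vertices, edges = build_graph(8, lambda i, j: 0.5)
print(sorted(incident_edges(edges, 2)))
```

With eight metrics, each vertex has seven incident edges, which is the size of the per-metric subsets of correlation magnitudes used when comparing two metrics.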
$$\tilde{M}_{ex} = \{ m_i \mid i = 1, \ldots, 8 \}$$
and a corresponding set of correlation magnitudes given by:
$$\tilde{C}_{ex} = \{ c_{i,j} \mid i = 1, \ldots, 8;\ j = i+1, \ldots, 8 \}$$
$$\{ c_{1,2}, c_{2,3}, c_{2,4}, c_{2,5}, c_{2,6}, c_{2,7}, c_{2,8} \}$$
For the metric m5, the subset of correlation magnitudes of {tilde over (C)}ex in which the metric m5 is one of the pair of metrics is
$$\{ c_{1,5}, c_{2,5}, c_{3,5}, c_{4,5}, c_{5,6}, c_{5,7}, c_{5,8} \}$$
The overall correlation magnitude of each of the metrics m2 and m5 can be determined by summing the correlation magnitudes in the corresponding subset as follows:
$$\mathrm{sum}_2 = c_{1,2} + c_{2,3} + c_{2,4} + c_{2,5} + c_{2,6} + c_{2,7} + c_{2,8}$$
$$\mathrm{sum}_5 = c_{1,5} + c_{2,5} + c_{3,5} + c_{4,5} + c_{5,6} + c_{5,7} + c_{5,8}$$
Assuming, for example, that sum5>sum2, the metric m5 is deleted from the set {tilde over (M)}ex (i.e., {tilde over (M)}ex={tilde over (M)}ex−m5) and the associated correlation magnitudes are deleted from the set {tilde over (C)}ex to give the graph represented in
$$\{ c_{1,3}, c_{2,3}, c_{3,4}, c_{3,6}, c_{3,7}, c_{3,8} \}$$
For the metric m8, the subset of correlation magnitudes of {tilde over (C)}ex in which the metric m8 is one of the pair of metrics is
$$\{ c_{1,8}, c_{2,8}, c_{3,8}, c_{4,8}, c_{6,8}, c_{7,8} \}$$
Summing correlation magnitudes associated with the metrics m3 and m8 gives:
$$\mathrm{sum}_3 = c_{1,3} + c_{2,3} + c_{3,4} + c_{3,6} + c_{3,7} + c_{3,8}$$
$$\mathrm{sum}_8 = c_{1,8} + c_{2,8} + c_{3,8} + c_{4,8} + c_{6,8} + c_{7,8}$$
Assuming, for example, that sum3>sum8, the metric m3 is deleted from the set {tilde over (M)}ex (i.e., {tilde over (M)}ex={tilde over (M)}ex−m3) and the associated correlation magnitudes are deleted from the remaining set of correlation magnitudes to give the graph represented in
$$\{ c_{1,4}, c_{2,4}, c_{4,6}, c_{4,7}, c_{4,8} \}$$
For the metric m7, the subset of correlation magnitudes of {tilde over (C)}ex in which the metric m7 is one of the pair of metrics is
$$\{ c_{1,7}, c_{2,7}, c_{4,7}, c_{6,7}, c_{7,8} \}$$
Summing correlation magnitudes associated with the metrics m4 and m7 gives:
$$\mathrm{sum}_4 = c_{1,4} + c_{2,4} + c_{4,6} + c_{4,7} + c_{4,8}$$
$$\mathrm{sum}_7 = c_{1,7} + c_{2,7} + c_{4,7} + c_{6,7} + c_{7,8}$$
Assuming, for example, that sum7>sum4, the metric m7 is deleted from the set {tilde over (M)}ex (i.e., {tilde over (M)}ex={tilde over (M)}ex−m7) and the associated correlation magnitudes are deleted from the remaining set of correlation magnitudes to give the graph represented in
$$\tilde{M}_{ex}' = \{ m_1, m_2, m_4, m_6, m_8 \}$$
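A single deletion step of the kind walked through above can be sketched as follows. The sketch assumes that the pair of metrics being compared is the pair with the current largest correlation magnitude, and that no magnitude is NaN; the function name `eliminate_one` and the four-metric magnitudes are hypothetical values chosen only for illustration.

```python
def eliminate_one(metrics, edges):
    """Delete one metric from `metrics` and its edges from `edges`.

    metrics: set of metric indices (mutated in place)
    edges:   {(i, j): c_ij} with i < j (mutated in place)
    Returns the index of the deleted metric.
    """
    # Pair of metrics with the largest correlation magnitude.
    (i, j), _ = max(edges.items(), key=lambda item: item[1])
    # Overall correlation of each metric with all remaining metrics.
    sum_i = sum(c for pair, c in edges.items() if i in pair)
    sum_j = sum(c for pair, c in edges.items() if j in pair)
    deleted = i if sum_i > sum_j else j
    metrics.discard(deleted)
    for pair in [p for p in edges if deleted in p]:
        del edges[pair]
    return deleted


# Hypothetical correlation magnitudes for four metrics.
metrics = {1, 2, 3, 4}
edges = {
    (1, 2): 0.2, (1, 3): 0.1, (1, 4): 0.3,
    (2, 3): 0.9, (2, 4): 0.4, (3, 4): 0.7,
}
deleted = eliminate_one(metrics, edges)
print(deleted, sorted(metrics), sorted(edges))
```

In this example the largest magnitude belongs to the pair (m2, m3), and m3 has the larger sum, so m3 is the metric removed along with its three edges.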
By deleting the highest correlated metrics from the set {tilde over (M)}, the representative subset {tilde over (M)}′ is largely composed of lesser correlated metrics and/or metrics that may not be correlated at all. Using the representative subset {tilde over (M)}′ to assess the performance of a resource instead of using all of the metrics in the set {tilde over (M)} avoids computing metrics that would otherwise give potentially redundant information. For example, in the example described above with reference to
The size of the representative subset {tilde over (M)}′ may be determined by (1) optimizing for cost or (2) optimizing for accuracy. When optimizing for cost, a user may select a maximum number, Nmax, of metrics allowed in the representative subset {tilde over (M)}′. Alternatively, when optimizing for accuracy, a user may select a minimum accuracy, accmin, for the representative subset {tilde over (M)}′. Beginning with the highest correlated metric from the set {tilde over (M)}, metrics are iteratively deleted from the set {tilde over (M)} until the minimum accuracy is reached. The accuracy may be calculated at each iteration using:
where r is a positive integer index with acc1=1;
max_ci,j represents a current maximum correlation magnitude in the set {tilde over (C)}; and
n′ is the current number of elements in the set {tilde over (M)}.
As long as acc ≥ accmin, either mi or mj is deleted from the set {tilde over (M)} based on which of these metrics has the highest overall correlation magnitude with the other metrics remaining in the set {tilde over (M)}, as described above with reference to
where l is the index of metrics in {tilde over (M)} that have correlation magnitudes ci,l in {tilde over (C)}.
In block 705, correlation magnitudes that correspond to mj correlated with other metrics in the set {tilde over (M)} are identified and summed to give
where k is the index of metrics in {tilde over (M)} that have correlation magnitudes ck,j in {tilde over (C)}.
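The two sums described above can be written compactly as follows. This is a reconstruction from the surrounding definitions of the indices l and k; the labels (11a) and (11b) are assigned on the assumption that these are the equations cited by those numbers later in the description.

```latex
\begin{align}
\mathrm{sum}_i &= \sum_{l} c_{i,l} \tag{11a} \\
\mathrm{sum}_j &= \sum_{k} c_{k,j} \tag{11b}
\end{align}
```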
When sumi is greater than sumj in block 706, control flows to block 707; otherwise, control flows to block 709. In block 707, the metric mi is deleted from the set {tilde over (M)}:
$$\tilde{M} = \tilde{M} - m_i$$
and for each l, the correlation magnitudes ci,l may be deleted from the set {tilde over (C)} in block 708. In block 709, the metric mj is deleted from the set {tilde over (M)}:
$$\tilde{M} = \tilde{M} - m_j$$
and for each k, the correlation magnitudes ck,j may be deleted from the set {tilde over (C)} in block 710. In block 711, the number N is decremented to match the number of metrics remaining in the set {tilde over (M)}. As long as the number N of metrics remaining in the set {tilde over (M)} is determined in block 712 to be greater than the user-defined Nmax, the operations in blocks 702-711 are repeated. Otherwise, the remaining set of metrics is returned and is composed of metrics that are representative of the information in the original set of metrics.
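The cost-optimized loop described above can be sketched end to end as follows. This is an illustrative sketch rather than the claimed implementation: the function name `reduce_metrics` and the input values are hypothetical, the pair (mi, mj) is assumed to be the pair with the largest remaining correlation magnitude, and the size check is performed before each deletion, so a set that already satisfies Nmax is returned unchanged.

```python
def reduce_metrics(metrics, edges, n_max):
    """Delete metrics until at most n_max remain.

    metrics: iterable of metric indices
    edges:   {(i, j): c_ij} with i < j and no NaN values
    Returns the representative subset as a set of indices.
    """
    remaining = set(metrics)
    c = dict(edges)  # work on copies; inputs are left untouched
    while len(remaining) > n_max and c:
        # Pair with the current maximum correlation magnitude.
        i, j = max(c, key=c.get)
        # Sum the magnitudes that involve m_i and those that involve m_j.
        sum_i = sum(v for pair, v in c.items() if i in pair)
        sum_j = sum(v for pair, v in c.items() if j in pair)
        # Compare the sums and delete m_i if sum_i is greater, else m_j.
        deleted = i if sum_i > sum_j else j
        remaining.remove(deleted)
        # Remove every correlation magnitude involving the deleted metric.
        c = {pair: v for pair, v in c.items() if deleted not in pair}
    return remaining


# Hypothetical correlation magnitudes for four metrics.
edges = {
    (1, 2): 0.2, (1, 3): 0.1, (1, 4): 0.3,
    (2, 3): 0.9, (2, 4): 0.4, (3, 4): 0.7,
}
print(sorted(reduce_metrics({1, 2, 3, 4}, edges, n_max=2)))
```

With these values the first pass removes m3 and the second removes m4, leaving m1 and m2, the pair with the weakest mutual correlation in the example.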
Embodiments described above are not intended to be limited to the descriptions above. For example, any number of different computational-processing-method implementations that identify a subset of a larger set of metrics used to evaluate the performance of a resource may be designed and developed using various different programming languages and computer platforms and by varying different implementation parameters, including control structures, variables, data structures, modular organization, and other such parameters. The systems and methods are not limited to using sumi and sumj described above in Equations (11a) and (11b). Alternatively, Equation (11a) can be replaced by an average:
where l is the index of metrics in {tilde over (M)} that have correlation magnitudes ci,l in {tilde over (C)}; and
L is the current number of metrics with correlation magnitudes ci,l.
And Equation (11b) can be replaced by an average:
where k is the index of metrics in {tilde over (M)} that have correlation magnitudes ck,j in {tilde over (C)}; and
K is the current number of metrics with correlation magnitudes ck,j.
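The two averages can be written as follows. This is a reconstruction from the definitions of l, L, k, and K given above; the symbols avg_i and avg_j are names assumed here for illustration, and no equation numbers are assigned because the original numbering is not recoverable from the text.

```latex
\begin{align}
\mathrm{avg}_i &= \frac{1}{L} \sum_{l} c_{i,l} \\
\mathrm{avg}_j &= \frac{1}{K} \sum_{k} c_{k,j}
\end{align}
```

When every remaining metric has a defined correlation magnitude with every other remaining metric, L equals K and the averages rank mi and mj the same way the sums do. The two approaches can differ when L and K differ, for example if undefined magnitudes are excluded from {tilde over (C)}.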
Alternatively, rather than reducing a set of metrics based on correlation magnitudes, correlations alone can be used to reduce the set of metrics.
It is appreciated that the previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present disclosure. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the disclosure. Thus, the present disclosure is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.