IBM® is a registered trademark of International Business Machines Corporation, Armonk, N.Y., U.S.A. Other names used herein may be registered trademarks, trademarks or product names of International Business Machines Corporation or other companies.
1. Field of Invention
This invention relates in general to systems, and more particularly to monitoring multiple time series as a measurement of system behavior.
2. Description of Background
Large volumes of time series data are routinely generated in scientific, engineering, financial, and medical domains. A wide spectrum of applications monitor time series data for process and quality control, pattern discovery, and abnormality forecasting. Much study has focused on revealing the internal structure (e.g., autocorrelation, trends and seasonal variation) of time series, and recently, mining time series for knowledge discovery has received a lot of attention from data mining, information retrieval, and bioinformatics communities.
Examples of monitoring a large number of time series streams include data collected by distributed sensors, real-time quotes of thousands of securities, system events generated by a large number of networked hosts, and DNA expression levels gathered by microarray technology for thousands of genes.
One of the common tasks in monitoring multiple time series simultaneously is to find correlations among them. Discovering correlations is important to many applications for at least two reasons. First, fluctuation of values in one time series often depends on many factors. Separate analyses of single series are not sufficient to understand the underlying mechanism that produces the multiple interrelated time series. Second, monitoring tens of thousands of time series is a resource-intensive task. Knowing the interrelationship among the time series may enable us to concentrate limited resources on as few time series as possible, as the behavior of the other time series can be derived from these time series.
Thus, there is a need for a method that measures the relevance of multiple time series by leveraging state transition points and mutual information maximization.
The shortcomings of the prior art are overcome and additional advantages are provided through the provision of a method for measuring time series relevance using state transition points including inputting time series data and relevance threshold data. The method further includes converting all time series values to ranks within the [0,1] interval. Afterwards the method proceeds by calculating the valid range of the transition point in [0,1]. Then the method includes verifying that a time series Z exists for each pair of time series X and Y, such that the relevances between X and Z, and between Y and Z, are known. The method further includes deducing the relevance of X and Y. The relevance of X and Y must be at least one of (i) higher than the given threshold, and (ii) lower than the given threshold. The method proceeds by terminating all remaining calculations for X and Y provided Z is found. Next, the method proceeds by segmenting the time series if no Z time series exists and using the segmented time series to estimate the relevance. Then the method includes applying a hill climbing algorithm in the valid range to find the true relevance.
Additional features and advantages are realized through the techniques of the present invention. Other embodiments and aspects of the invention are described in detail herein and are considered a part of the claimed invention. For a better understanding of the invention with advantages and features, refer to the description and to the drawings.
As a result of the summarized invention, technically we have achieved a solution for a method for measuring time series relevance using state transition points.
The subject regarded as the invention is particularly pointed out and distinctly claimed in the claims at the conclusion of the specification. The foregoing and other objects, features, and advantages of the invention are apparent from the following detailed description taken in conjunction with the accompanying drawings in which:
FIGS. 2(a)-2(b) illustrate an example of two different plots of time series;
a)-(b) illustrate an example of a relationship between bifurcating points, entropy and mutual information;
The detailed description explains an exemplary embodiment of the invention, together with advantages and features, by way of example with reference to the drawings.
The typical configuration of a medium-to-large enterprise production system usually comprises a few thousand hosts and provides hundreds of different services such as cataloging, website search, shopping cart management, credit card authentication, check out, etc.
Systems are monitored internally as well as externally. Externally, on the client side, end-to-end probing measures the availability and the response time of each service by sending dummy requests to the server.
Internally, the entire information infrastructure can be closely monitored in every aspect. For instance, we monitor available system resources, including CPU, physical memory, free disk space, network bandwidth, etc. A system resource is often monitored using many types of metrics. For example, CPU usages are usually measured in system time, user time, process time and idle time. Important service providers, such as web servers, database servers, e-mail servers, directory servers, DHCP servers, storage servers, and multimedia servers, are intensively monitored as well.
The large amount of information generated during monitoring often exceeds our processing capability. A common approach is to aggregate data by time-based windows and use aggregated data for analysis. The aggregated data is still huge. Case studies show that typically, there are hundreds of end-to-end probing metrics and around a million resource monitoring metrics.
The sheer number and size of time series can make even the simplest analysis nontrivial to realize. Effective monitoring of tens of thousands of time series depends on whether the following critical tasks are satisfactorily performed.
Clearly, discovering meaningful interrelationships of multiple metrics is essential to each of these critical tasks. However, traditional correlation measurements such as Pearson's coefficients are not equal to this task. The disclosed invention is directed at accomplishing these tasks by using a new relevance model and an efficient algorithm.
Finding highly correlated time series is important to many applications. Currently, there is neither an intuitive model nor a robust computation method for this task. The disclosed invention proposes:
A new model is introduced to measure correlation between a pair of time series. Through real life examples in system management, it is shown that traditional methods are not effective in capturing important correlations. Then a new measure is proposed based on state transition points, which are an essential concept in describing a complicated system.
Pearson's correlation is often used to measure association between a pair of time series. However, it is only meaningful for quantifying the strength of linear correlations, and in situations where data contains outliers or missing values, Pearson's correlation coefficient may cease to be meaningful.
For example, in
Even in data free of outliers and missing values, Pearson's coefficient may not be able to capture every meaningful correlation. In
The phenomena in
From the phenomena discussed the following observation arises. A certain threshold may determine correlation between two time series. In regions away from the threshold, correlation is weak and insignificant. However, when one time series crosses the threshold, the other time series responds dramatically.
Intuitively, to capture the correlation, the time series can be discretized into binary sequences at the threshold point. That is, any point above this threshold will be encoded by 1, and any point below it will be encoded by 0. Note that threshold points are determined by the intrinsic mechanism of the system, and are different in different time series.
A discretization method shall be introduced based on the above intuition, and used to develop a correlation model for discretized time series. The notations utilized in this application are summarized in the table shown in
One objective is to find the best threshold to discretize a time series into a binary sequence. Clearly, exact values in the time series are not important; what matters are the values relative to the threshold.
The original time series is preprocessed to convert exact values into percentiles (or ranks). This makes further computation and discussion easier, and it also removes the influence of outliers in the time series.
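As a minimal sketch of this preprocessing step, the following Python function maps raw values to ranks scaled into [0,1]. The function name `to_percentile_ranks`, the average-rank treatment of ties, and the division by n−1 are illustrative assumptions; the description does not prescribe a particular ranking formula.

```python
def to_percentile_ranks(values):
    """Map each value to its rank scaled into [0, 1]; tied values share the average rank."""
    n = len(values)
    if n == 0:
        return []
    if n == 1:
        return [0.5]  # assumption: a single point is placed mid-range
    # Indices of the values in ascending order of value.
    order = sorted(range(n), key=lambda i: values[i])
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        # Extend j over the run of values tied with values[order[i]].
        while j + 1 < n and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2.0
        for k in range(i, j + 1):
            ranks[order[k]] = average_rank / (n - 1)
        i = j + 1
    return ranks


# An extreme value only affects the result through its rank.
ranks = to_percentile_ranks([10, 1000000, 20, 30])
```

In this usage the value 1000000 is simply mapped to the largest rank, 1.0, which illustrates why the rank transform limits the effect of outliers.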
The resulting percentile time series are composed of points in the range of [0,1], and the next task is to find a threshold θ∈[0,1] to bifurcate time series into binary strings. Throughout the remainder of this application, time series is used to mean percentile time series.
Intuitively, think of a system as having a set of states, which form a state space. At any point of time, the system is in one of the states in the state space. Imagine there are points (or hyper-planes) in the state space that divide the space into regions where the system behaves very differently. Then, once the state passes these state transition points into different regions, there will be non-continuous change in system behavior.
When the system is experiencing non-continuous changes, such changes will reflect in the time series that monitors the system. In other words, there may exist a threshold within the [0,1] range of time series that corresponds to such changes.
Another objective is to discretize time series at such thresholds and study their correlation on discretized time series. In the following, bifurcation functions are defined to discretize real valued time series into binary sequences.
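Definition 1 Given a threshold θ∈[0,1] and a time series T=t1, . . . , tn, the bifurcated time series is Tθ=Bθ(t1), . . . , Bθ(tn), where
$$B_\theta(x) = \begin{cases} \top & \text{if } x \ge \theta \\ \bot & \text{if } x < \theta \end{cases}$$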
The resulting time series is called a bifurcated time series because it is discretized at point θ into a binary representation. The threshold θ is referred to as the bifurcating point.
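The bifurcation function can be sketched in a few lines of Python. The name `bifurcate` and the encoding of the two symbols as 1 (high) and 0 (low) are assumptions made for illustration, following the 1/0 encoding mentioned earlier in the description.

```python
def bifurcate(series, theta):
    """Discretize a percentile time series at threshold theta.

    Values greater than or equal to theta map to 1 (the high symbol),
    values below theta map to 0 (the low symbol).
    """
    if not 0.0 <= theta <= 1.0:
        raise ValueError("theta must lie in [0, 1]")
    return [1 if x >= theta else 0 for x in series]


bits = bifurcate([0.1, 0.5, 0.9, 0.5], 0.5)
```

Note that a value exactly equal to the threshold falls in the high region, consistent with the definition above.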
If the right bifurcating points are chosen, two bifurcated time series may exhibit high correlation. Let θx and θy be the thresholds chosen for time series X and Y. The bifurcating point θx divides X's range of [0,1] into a low region [0,θx) and a high region [θx,1], and θy does the same thing for Y. The two time series X and Y are relevant if the points (Xi, Yi) largely fall into diagonal regions. That is, i) either Xi and Yi tend to be in the same high or low region, ii) or they tend to be in opposite regions, that is, when Xi is in the high region of X, Yi is in the low region of Y, and vice versa.
b) is an example that corresponds to the second case above. The two time series plotted in
Given two time series S and T, the correlation between S and T is defined as:
Where Sθ
Based on the previous description, it may be concluded that two time series S and T are more relevant if they share larger diagonal regions. However, the regions are defined by the thresholds θs and θt. If the threshold is set at the high or the low extreme (1 or 0), then all of the points will be in the same region, and consequently, any two time series will be highly correlated according to the above criteria.
In other words, in order to make the relevance measure meaningful, at the same time when trying to maximize diagonal regions shared by the two time series, thresholds that divide the value range into more even regions are favored.
Mutual information is a statistic that can be used for this purpose.
Note that all logarithms used here are base 2, so the maximum value of entropy is 1, as shown in
Problem 1 (Relevance Discovery) Let S and T be two time series. Find bifurcating points θs and θt that maximize the mutual information $I(S_{\theta_s}; T_{\theta_t})$.
The above definition requires that the optimal thresholds for both of the time series be found at the same time. In some applications, the bifurcating point, or threshold, of one of the time series is given, and then it is only required to find the optimal threshold of the other time series. For example, the breach points (thresholds) of service metrics are obtained by breach point sensitivity analysis, so only the thresholds of resource metrics are changeable. This variation is a strictly simpler variant of the problem stated above.
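To make the relevance discovery problem concrete, the following Python sketch computes the mutual information of two binary sequences and searches exhaustively over candidate threshold pairs. The function names, the use of observed values as candidate thresholds, and the from-scratch recomputation for every pair are assumptions made for illustration; this is a brute-force baseline, not the accelerated procedure described in this application.

```python
from math import log2


def entropy(probabilities):
    """Shannon entropy in bits; zero-probability outcomes contribute nothing."""
    return -sum(p * log2(p) for p in probabilities if p > 0)


def mutual_information(a, b):
    """I(A;B) = H(A) + H(B) - H(A,B) for two equal-length 0/1 sequences."""
    n = len(a)
    if n == 0 or n != len(b):
        raise ValueError("sequences must be non-empty and of equal length")
    joint_counts = {}
    for pair in zip(a, b):
        joint_counts[pair] = joint_counts.get(pair, 0) + 1
    p_a = sum(a) / n
    p_b = sum(b) / n
    h_joint = entropy([c / n for c in joint_counts.values()])
    return entropy([p_a, 1 - p_a]) + entropy([p_b, 1 - p_b]) - h_joint


def best_thresholds(s, t):
    """Try every observed value of s and t as a threshold pair.

    Returns (max mutual information, theta_s, theta_t). The thresholds are
    None if no pair yields positive mutual information.
    """
    best = (0.0, None, None)
    for theta_s in sorted(set(s)):
        bits_s = [1 if x >= theta_s else 0 for x in s]
        for theta_t in sorted(set(t)):
            bits_t = [1 if y >= theta_t else 0 for y in t]
            mi = mutual_information(bits_s, bits_t)
            if mi > best[0] + 1e-12:  # keep the first pair on ties
                best = (mi, theta_s, theta_t)
    return best


# Two perfectly anti-aligned percentile series.
s = [0.0, 0.2, 0.4, 0.6, 0.8, 1.0]
t = [1.0, 0.8, 0.6, 0.4, 0.2, 0.0]
mi, theta_s, theta_t = best_thresholds(s, t)
```

For the anti-aligned pair in the usage lines, the search selects the balanced split at 0.6 on both series and reaches the maximum of 1 bit. Because each candidate pair is rescored from scratch, this sketch is cubic in the series length; an incremental update of the joint counts would be needed to reach a quadratic cost.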
Problem 2 (Highly Correlated Pairs) Given multiple time series and a threshold μ, find any pair S and T whose correlation, or maximum mutual information $I(S_{\theta_s}; T_{\theta_t})$, is above μ.
To compute the correlation of two time series, we must first identify the bifurcating points that maximize the mutual information. If the time series are of length m, this process has complexity O(m^2). Then, finding pair-wise relevance for n time series of length m using a naïve, brute-force algorithm requires O(n^2 m^2) time. This section proposes methods to improve performance by identifying lowly correlated pairs and filtering them out as early as possible.
In summary, the algorithm adopts the following strategies:
Algorithm 1 is the main procedure for finding highly correlated pairs (Ti, Tj) in a set of time series T. The algorithm returns R(Ti,Tj), the correlation between any pair Ti and Tj.
Algorithm 1 first determines the valid bifurcating range, which is represented by ρ (line 2). Then, for each pair Ti and Tj, it checks whether their correlation is below μ using two methods: the triangular inequality bound and the segmentation method. If neither method succeeds in eliminating the pair, the Nelder-Mead method is invoked to approximate the correlation.
The preliminary focus is on finding pairs of time series whose mutual information is above a minimal threshold μ. For two random variables X and Y, their mutual information is defined as:
I(X;Y)=H(X)−H(X|Y)
where H(X) is the entropy of X. Since the conditional entropy H(X|Y) is non-negative, we have I(X;Y)≤H(X). Thus, in order for a bifurcated time series X to form a highly correlated (correlation≥μ) pair with another time series, its entropy H(X) must be at least μ. This result can be used to prune time series.
In other words, the choice of the bifurcating points must yield enough entropy so that the mutual information can be possibly above the minimal threshold. It is not difficult to derive, given the minimal threshold μ, the bifurcating points must be inside the following range.
$$[H^{-1}(\mu),\ 1 - H^{-1}(\mu)] \qquad (1)$$
For example, if μ is 0.7, bifurcating points must reside in the range of [0.189,0.811].
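The range in Equation 1 can be computed numerically. The sketch below assumes that H is the binary entropy function and that its inverse is taken on [0, 0.5], where the function is monotonically increasing; the function names and the use of bisection are illustrative choices.

```python
from math import log2


def binary_entropy(p):
    """Entropy in bits of a binary variable that is 1 with probability p."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -p * log2(p) - (1 - p) * log2(1 - p)


def valid_bifurcating_range(mu, tolerance=1e-9):
    """Return (low, high) such that binary_entropy(theta) >= mu for theta in [low, high].

    Uses bisection on [0, 0.5], where binary entropy increases monotonically.
    """
    if not 0.0 <= mu <= 1.0:
        raise ValueError("mu must lie in [0, 1]")
    lo, hi = 0.0, 0.5
    while hi - lo > tolerance:
        mid = (lo + hi) / 2
        if binary_entropy(mid) < mu:
            lo = mid
        else:
            hi = mid
    return hi, 1.0 - hi


low, high = valid_bifurcating_range(0.7)
```

For μ = 0.7 the computed endpoints agree with the range [0.189, 0.811] quoted above to three decimal places. Because the series are percentile ranks, a threshold θ leaves roughly a fraction θ of the points in the low region, which is why the entropy constraint translates directly into a constraint on θ.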
It is expensive to compute the exact correlation of two time series as it requires multiple scans of the data. Instead, we use a feature vector to represent each time series and estimate pair-wise mutual information from the feature vectors. If the estimate falls below μ with large probability, then there is no need to compute the precise mutual information.
The intuition behind the segmentation and estimation method comes from the locality of time series values. A time series is cut into a few segments. Because of locality, values in a segment may fall inside a small range. The ranges can then be used to approximate the original time series.
According to Equation 1, valid bifurcating points are inside a range determined by the user threshold μ. Denote the range by [x,y], where 0≤x<y≤1. Given a time series T, no matter what bifurcating point is chosen inside [x,y], values less than x will always be discretized as ⊥, and values above y will always be discretized as ⊤. We call values inside range [0,x) surely negative, and values inside range (y,1] surely positive.
Partition a time series T into m segments, T1, . . . , Tm. The segments do not need to be equal in size. T is represented by two feature vectors:
C⊤(T)=sp(T1), . . . , sp(Tm)
C⊥(T)=sn(T1), . . . , sn(Tm)
where sp(Ti) and sn(Ti) are the number of points in Ti that are surely positive and surely negative respectively.
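The two feature vectors can be sketched as follows. The function name `segment_features`, the near-equal segment boundaries, and the strict comparisons (values below x counted as surely negative, values above y as surely positive) are assumptions made for illustration; the description allows segments of unequal size.

```python
def segment_features(series, x, y, num_segments):
    """Return (c_top, c_bot) for a percentile time series.

    c_top[k] counts the surely positive points (value > y) in segment k and
    c_bot[k] counts the surely negative points (value < x) in segment k,
    where [x, y] is the valid bifurcating range.
    """
    if num_segments < 1:
        raise ValueError("num_segments must be at least 1")
    n = len(series)
    c_top, c_bot = [], []
    for k in range(num_segments):
        # Integer arithmetic splits the series into near-equal contiguous segments.
        start = k * n // num_segments
        end = (k + 1) * n // num_segments
        segment = series[start:end]
        c_top.append(sum(1 for v in segment if v > y))
        c_bot.append(sum(1 for v in segment if v < x))
    return c_top, c_bot


c_top, c_bot = segment_features(
    [0.05, 0.10, 0.50, 0.90, 0.95, 0.30], x=0.189, y=0.811, num_segments=2
)
```

In this usage the first segment holds two surely negative points and the second holds two surely positive points; the remaining points (0.50 and 0.30) fall inside the valid range, so their symbol depends on the bifurcating point eventually chosen.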
Each segment may contain points that are neither surely positive nor surely negative. The number of such points is |Ti|−sp(Ti)−sn(Ti). Whether such a point is discretized to ⊤ or ⊥ will be determined by the bifurcating point chosen. Let L⊤(Tk) denote the number of positive points in segment Tk, and let L⊥(Tk) denote the number of negative points in segment Tk. Estimate L⊤(Tk) and L⊥(Tk) as follows:
Intuitively, the estimation assumes that the number of positive points in a segment is proportional to the number of surely positive points in that segment, and same for the number of negative points. This is a reasonable assumption as points in a time series often exhibit locality.
Based on the estimation of the number of positive and negative points in each segment, to derive the probability distribution in the entire time series use:
However, to compute entropy and mutual information, the joint distribution of two time series must be known. For instance, we need to know P(S⊤,T⊤) and P(S⊥,T⊥), that is, whether the points in S and T are aligned in such a way that positive points in S always appear together with positive points in T.
Because computing joint distribution is expensive, we use the distribution of single time series to estimate the joint distribution. As previously mentioned, when the entropy of each time series is fixed, the maximum correlation occurs when the diagonal region is maximized. That is, either positive points of S tend to appear together with positive points of T, or tend to appear with negative points of T.
Since C⊤(T) and C⊥(T) have been computed for each time series T, they can be used to estimate the joint probabilities of S and T:
Clearly, the value of P(S⊤,T⊤) computed above is a maximum estimation, as it occurs only if all positive points are aligned. The same is true for P(S⊥,T⊥).
To maximize the diagonal region, alternatively maximize P(S⊤,T⊥) and P(S⊥,T⊤) in a similar way:
Use Equations (2) and (3) and Equations (4) and (5) to compute the mutual information, and choose the larger of the two as the final estimation.
Empirical studies were conducted on the estimation method.
If the estimated relevance of two time series is above μ, then an expensive search must be performed to find the optimum points. A brute-force method is to use grid-point survey of the I(S,T) surface.
Observe the following:
Finding the maxima in functions of several variables is a classical optimization problem. Two methods were compared: the gradient method and the Nelder-Mead Simplex method.
The Nelder-Mead Simplex Method is a widely used classical function optimization algorithm. While gradient methods have to compute function values in a very small region during each iteration in order to simulate the first and second derivatives, this method starts from a set of points distributed over a rather larger area. There are three operations that can be performed in each iteration: reflection, contraction and expansion.
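The operations named above can be illustrated with a compact, generic Nelder-Mead maximizer. This is a textbook-style sketch with conventional coefficients (reflection 1, expansion 2, contraction 0.5) plus the usual shrink fallback; it is not the adapted version used in this application, and the function name, default parameters, and the smooth stand-in objective in the usage lines are assumptions made for illustration.

```python
def nelder_mead_maximize(f, start, step=0.1, max_iter=500, tol=1e-10):
    """Maximize f over R^n with a basic Nelder-Mead simplex search."""
    n = len(start)

    def g(p):
        return -f(p)  # minimize the negated objective internally

    # Initial simplex: the start point plus one point offset along each axis.
    simplex = [list(start)]
    for i in range(n):
        p = list(start)
        p[i] += step
        simplex.append(p)
    values = [g(p) for p in simplex]

    for _ in range(max_iter):
        order = sorted(range(n + 1), key=lambda i: values[i])
        simplex = [simplex[i] for i in order]
        values = [values[i] for i in order]
        if abs(values[-1] - values[0]) < tol:
            break
        worst = simplex[-1]
        # Centroid of every vertex except the worst one.
        centroid = [sum(p[d] for p in simplex[:-1]) / n for d in range(n)]
        reflected = [c + (c - w) for c, w in zip(centroid, worst)]
        fr = g(reflected)
        if fr < values[0]:
            # Reflection beat the best vertex: try expanding further.
            expanded = [c + 2.0 * (c - w) for c, w in zip(centroid, worst)]
            fe = g(expanded)
            if fe < fr:
                simplex[-1], values[-1] = expanded, fe
            else:
                simplex[-1], values[-1] = reflected, fr
        elif fr < values[-2]:
            simplex[-1], values[-1] = reflected, fr
        else:
            # Contract the worst vertex toward the centroid.
            contracted = [c + 0.5 * (w - c) for c, w in zip(centroid, worst)]
            fc = g(contracted)
            if fc < values[-1]:
                simplex[-1], values[-1] = contracted, fc
            else:
                # Shrink all vertices toward the current best vertex.
                best = simplex[0]
                simplex = [best] + [
                    [b + 0.5 * (q - b) for b, q in zip(best, p)] for p in simplex[1:]
                ]
                values = [values[0]] + [g(p) for p in simplex[1:]]

    best_index = min(range(n + 1), key=lambda i: values[i])
    return simplex[best_index], -values[best_index]


# Smooth stand-in objective with its maximum at (0.3, 0.6).
point, value = nelder_mead_maximize(
    lambda p: -((p[0] - 0.3) ** 2 + (p[1] - 0.6) ** 2), [0.5, 0.5]
)
```

A mutual information surface computed from finite data is piecewise constant in the two thresholds, so a practical variant would likely need a different stopping rule and would have to keep the simplex inside the valid bifurcating range; those details are outside the scope of this sketch.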
The original algorithm was adapted on the following aspects:
The experiments conducted show this algorithm converges in our problem. Finding all highly correlated pairs is an expensive process. Various methods are proposed to reduce computation cost by using previously computed pairwise results.
The key is to develop triangular inequalities to estimate the pair-wise relevance between X and Y by studying their relationships with a third variable Z. The problem is addressed in two steps. First, a general triangular information inequality is derived. Then, the results are extended to the case of bifurcated time series.
Using the correlation we have computed between X and Z and between Y and Z, we would like to infer the relevance between X and Y. We start with a lemma that is the foundation of the following theorems.
Lemma 1 The following inequalities hold:
1. H(X|Y)≤H(X|Z)+H(Z|Y).
2. H(X|Y)≥I(X;Z)−H(Z|Y).
Applying the above inequalities to the mutual information I(X;Y)=H(X)−H(X|Y), we obtain the upper bound and the lower bound for I(X;Y) as shown below.
Theorem 1 The mutual information I(X;Y) has the following bounds.
Theorem 1 enables the estimation of the range of I(X;Y) through two pairwise information relationships, between X and Z and between Y and Z. As a result, without computing the exact value of I(X;Y), if the lower bound is above the relevance threshold, it can be concluded that X and Y are significantly relevant; likewise, if the upper bound is below the relevance threshold, it can be concluded that X and Y are not significantly relevant.
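As a hedged illustration of how such bounds behave, the following Python sketch checks the first inequality of Lemma 1 numerically, together with one lower bound that follows from it by substitution: I(X;Y) = H(X) − H(X|Y) ≥ H(X) − H(X|Z) − H(Z|Y) = I(X;Z) − H(Z|Y). This lower bound is derived here only for illustration and is not claimed to be the exact statement of Theorem 1; the helper names and the random joint distributions over three binary variables are likewise assumptions.

```python
import random
from math import log2


def H(dist):
    """Entropy in bits of a distribution given as {outcome: probability}."""
    return -sum(p * log2(p) for p in dist.values() if p > 0)


def marginal(joint, axes):
    """Marginalize a joint distribution over (x, y, z) onto the given axes."""
    out = {}
    for outcome, p in joint.items():
        key = tuple(outcome[a] for a in axes)
        out[key] = out.get(key, 0.0) + p
    return out


def cond_entropy(joint, target, given):
    """H(target | given) = H(target, given) - H(given)."""
    return H(marginal(joint, (target, given))) - H(marginal(joint, (given,)))


def random_joint(rng):
    """Random joint distribution over three binary variables."""
    weights = {(x, y, z): rng.random() for x in (0, 1) for y in (0, 1) for z in (0, 1)}
    total = sum(weights.values())
    return {k: w / total for k, w in weights.items()}


def check_bounds(joint, eps=1e-9):
    """Return (triangle_ok, lower_bound_ok) for one joint distribution."""
    X, Y, Z = 0, 1, 2
    h_x_given_y = cond_entropy(joint, X, Y)
    h_x_given_z = cond_entropy(joint, X, Z)
    h_z_given_y = cond_entropy(joint, Z, Y)
    h_x = H(marginal(joint, (X,)))
    i_xy = h_x - h_x_given_y
    i_xz = h_x - h_x_given_z
    triangle_ok = h_x_given_y <= h_x_given_z + h_z_given_y + eps
    lower_bound_ok = i_xy >= i_xz - h_z_given_y - eps
    return triangle_ok, lower_bound_ok


rng = random.Random(0)
results = [check_bounds(random_joint(rng)) for _ in range(200)]
```

In a pruning loop, a lower bound of this kind that already exceeds μ would let the pair (X, Y) be reported as relevant without searching for its bifurcating points.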
The triangular inequality was extended to handle bifurcated time series produced by different bifurcating points. Notice that for any two given bifurcating points θ1 and θ2 on a time series X, we can easily compute the information relationship between Xθ
Theorem 2 Assume θ1≦θ2, then the following equalities hold:
To compute the triangle inequality on time series generated by different bifurcating points, we introduce some notation. Let θxz, θzy, and θyx represent the optimal threshold sets that maximize the mutual information of (X,Z), (Z,Y) and (Y,X), respectively. Note that θ is overloaded here to refer to a pair of optimal bifurcating points. For example, θxz refers to the two thresholds on X and Z that together optimize the mutual information of X and Z. With the three sets of thresholds, we obtain six binary random variables:
Xθ
Theorem 3 The mutual information of I(Xθ
I(Xθ
where
I′θ
−(H(Zθ
I′θ
−(H(Zθ
Here, I′θ
The proof is through recursively applying the inequality 6. Using the same strategy, the upper bound of I(Xθ
Theorem 4 The mutual information of I(Xθ
I(Xθ
where
I″θ
−(H(Zθ
I″θ
−(H(Zθ
Relevance discovery in time series provides a way to understand the relationship among monitored entities. The state-of-the-art method for this task is to use Pearson's correlation and the relevance measure. As we have explained, there are situations where Pearson's correlation cannot reveal the true relevance and the measure is not robust enough for very noisy data.
It is very common for a computer resource metric to exhibit significant behavior change once its metric values exceed or fall below a threshold. However, the absolute metric value may have relatively little observable effect. Inspired by this phenomenon, we propose a measure based on the state transition point model. The measure seeks to find a trade-off between association and effectiveness. The measure is essentially the mutual information of bifurcated time series. The information theoretical measure requires no artificial parameters.
The proposed relevance measure, although it fits some problem domains better, is more computationally expensive. We therefore also propose methods that speed up the computation. We propose an estimation method based on feature vectors obtained by segmenting and aggregating the time series. We also prove that a special type of triangular inequality exists for relevance, which can be used to avoid pair-wise relevance calculations. The experiments show that our algorithm is significantly faster than the brute-force method.
Referring to
Then, at step 120, the valid range of the transition point in [0,1] is calculated. At step 130, a verification occurs to determine whether a time series Z exists for each pair of time series X and Y, such that the relevances between X and Z, and between Y and Z, are known. Subsequently, at step 140, the relevance of X and Y is deduced. The relevance of X and Y must be either higher than the given threshold, or lower than the given threshold. At step 150, confirmation of Z takes place. Provided Z is found, at step 160 all remaining calculations for X and Y are terminated.
At step 170, the time series is segmented if no Z time series exists, and the segmented time series is then used to estimate the relevance. Then at step 180, a hill climbing algorithm is applied in the valid range to find the true relevance.
While the preferred embodiment to the invention has been described, it will be understood that those skilled in the art, both now and in the future, may make various improvements and enhancements which fall within the scope of the claims which follow. These claims should be construed to maintain the proper protection for the invention first described.
Number | Date | Country | |
---|---|---|---|
20080177813 A1 | Jul 2008 | US |