In information processing environments, a vast variety of performance data is available. Performance data is collected by system performance monitors at the hardware level, operating system level, database level, middleware level, and application level. Collecting and using the large amount of performance data available is an onerous task requiring significant resources. In some cases, collecting and using performance data negatively impacts performance, and hence the performance data itself. Efficient collection and use of performance data is desirable.
For a detailed description of the embodiments of the invention, reference will now be made to the accompanying drawings in which:
Certain terms are used throughout the following claims and description to refer to particular components. As one having ordinary skill in the art will appreciate, different entities may refer to a component by different names. This document does not intend to distinguish between components that differ in name but not function. In the following discussion and in the claims, the terms “including” and “comprising” are used in an open-ended fashion, and thus should be interpreted to mean “including, but not limited to . . . ” Also, the term “couple” or “couples” is intended to mean an optical, wireless, indirect electrical, or direct electrical connection. Thus, if a first device couples to a second device, that connection may be through an indirect electrical connection via other devices and connections, through a direct optical connection, etc. Additionally, the term “system” refers to a collection of two or more hardware components, and may be used to refer to an electronic device.
The following discussion is directed to various embodiments of the invention. Although one or more of these embodiments may be preferred, the embodiments disclosed should not be interpreted, or otherwise used, as limiting the scope of the disclosure, including the claims, unless otherwise specified. In addition, one having ordinary skill in the art will understand that the following description has broad application, and the discussion of any embodiment is meant only to be exemplary of that embodiment, and not intended to intimate that the scope of the disclosure, including the claims, is limited to that embodiment.
Trend determination and identification is disclosed. Self-tuning predictive performance models, based on machine learning, utilize performance data to monitor system performance levels, control the monitoring levels at various layers so that the variety and the detail of the performance data collected are decided dynamically, and determine potential service level objective violations. As such, the models capture performance data in different deployment scenarios, configurations, and workloads. The models tune and refine themselves to increase predictive performance. Furthermore, although every piece of the multitude of performance data remains available for collection, excessive and unnecessary monitoring is avoided, saving time and resources. Consequently, implementation of the models results in fewer violations as well as a time and resource advantage over competitors.
Referring to
The processor 102 preferably monitors performance data.
The processor 102 preferably constructs a model of SLO compliance based on the monitored performance data. Let S={SLO compliance, SLO violation} be the set of possible states for a given SLO. At any time t, the state of the SLO, St, may be in one of these two states. Let Mt denote a vector of values, [m0, m1, m2, . . . , mn]t, collected by the processor 102 using the performance indicators being monitored. The processor 102 preferably constructs a model F(M,k,Δ) that maps the input vector [Mt−k, Mt−k+1, . . . , Mt] to St+Δ, the state of the SLO at time t+Δ. In at least one embodiment, k and Δ are configurable parameters. In at least one other embodiment, the parameter k is infinite and the processor 102 uses all the available history of the performance indicator values to construct the model F(M,k,Δ). There are a variety of machine learning techniques that the processor 102 may use to construct the model F(M,k,Δ). For example, machine learning techniques used by the processor 102 include, but are not limited to, naïve Bayes classifiers, support vector machines, decision trees, Bayesian networks, and neural networks. For the details of these techniques, refer to T. Hastie, R. Tibshirani, and J. Friedman, The Elements of Statistical Learning, Springer, 2001. In at least one embodiment, the processor 102 preferably constructs the model F(M,k,Δ) as a classifier C, approximating the function F(M,k,Δ), based on a given training set containing the past observations of the performance indicators and the observed states of the SLO metrics.
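By way of illustration only, the following sketch shows one way such a classifier C might be constructed from windowed observations. The data shapes, the use of the scikit-learn library, and the choice of a naïve Bayes classifier (one of the technique families listed above) are assumptions of the example, not features of any particular embodiment.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

def build_training_set(metrics, slo_states, k, delta):
    """Window the history: input [M_{t-k}, ..., M_t] maps to label S_{t+delta}."""
    X, y = [], []
    for t in range(k, len(metrics) - delta):
        X.append(np.concatenate(metrics[t - k:t + 1]))  # flatten the k+1 metric vectors
        y.append(slo_states[t + delta])
    return np.array(X), np.array(y)

# Hypothetical data: n = 4 performance indicators observed over 200 time steps.
rng = np.random.default_rng(0)
metrics = [rng.random(4) for _ in range(200)]   # the vectors M_t
slo_states = rng.integers(0, 2, size=200)       # observed states S_t (0 = compliance, 1 = violation)

X, y = build_training_set(metrics, slo_states, k=3, delta=2)
classifier = GaussianNB().fit(X, y)             # classifier C approximating F(M,k,Delta)
print(classifier.predict(X[-1:]))               # predicted state S_{t+delta} for the latest window
```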
In at least one embodiment, the processor 102 combines values of the performance indicators with the directionality of these values over time. Let Dt=[{+,=,−}1, {+,=,−}2, {+,=,−}3, . . . , {+,=,−}n]t be a directionality vector indicating the directional difference between Mt and Mt−1. Each element ej in Dt indicates whether the corresponding metric j in Mt has increased (a {+} value), decreased (a {−} value), or stayed the same (an {=} value). In at least one embodiment, the processor 102 constructs a model F(M,k,Δ) that maps the input vector [Mt, Dt−k, Dt−k+1, . . . , Dt] to St+Δ, the state of the SLO at time t+Δ.
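The directionality vector Dt described above can be computed directly from two consecutive metric vectors, as in the following illustrative sketch (the function name and sample values are hypothetical):

```python
def directionality(m_prev, m_curr):
    """Return D_t for consecutive metric vectors M_{t-1} and M_t."""
    out = []
    for prev, curr in zip(m_prev, m_curr):
        if curr > prev:
            out.append('+')   # metric increased
        elif curr < prev:
            out.append('-')   # metric decreased
        else:
            out.append('=')   # metric stayed the same
    return out

print(directionality([1.0, 5.0, 2.0], [1.5, 5.0, 1.0]))  # ['+', '=', '-']
```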
While monitoring each piece of performance data is possible, the cost of monitoring would be prohibitive as the amount of performance data increases. As such, the processor 102 determines a subset of the performance data correlated with a measure of underperformance. In at least one embodiment, the measure of underperformance is based on a service level objective (“SLO”). An SLO is preferably a portion of a service level agreement (“SLA”) between a service provider and a customer. SLOs are agreed-upon means of measuring the performance of the service provider and are helpful in managing expectations and avoiding disputes between the two parties. In at least one embodiment, the SLA is the entire agreement that specifies the SLOs, what service is to be provided, and how the service is supported, as well as the times, locations, costs, performance, and responsibilities of the parties involved. The SLOs are specific measurable characteristics of the SLA, e.g., availability, throughput, frequency, response time, and quality. For example, an SLO between a website hosting service and the owner of a website may be that 99% of submitted transactions complete in under one second, and the measure of underperformance tracks the SLO exactly. Expressed in words, the subset of performance data correlated with the measure of underperformance may be, for example, a tripling of website traffic in less than ten minutes.
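As an illustrative sketch of the website-hosting example above, the following code evaluates one SLO, namely that at least 99% of submitted transactions complete in under one second. The function name and sample latencies are hypothetical.

```python
def slo_state(latencies_seconds, target_fraction=0.99, limit_seconds=1.0):
    """Return 'compliance' or 'violation' for a batch of transaction latencies."""
    within = sum(1 for s in latencies_seconds if s < limit_seconds)
    fraction = within / len(latencies_seconds)
    return 'compliance' if fraction >= target_fraction else 'violation'

print(slo_state([0.2, 0.4, 0.9, 1.3]))  # 3 of 4 (75%) under one second -> 'violation'
```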
In at least one embodiment, the processor 102 selects the subsets of the performance indicators using a feature selection technique. The processor 102 selects M*, a subset of M, such that the difference between the corresponding models F*(M*) and F(M) is minimal with respect to the training set. The processor 102 preferably uses a greedy algorithm that eliminates a single metric m at each step, such that |F(M−m)−F(M)| is minimal.
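The greedy elimination step may be sketched as follows, under the assumptions (for illustration only) that the difference |F(M−m)−F(M)| is measured by the change in training-set accuracy and that a naïve Bayes classifier stands in for F:

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

def greedy_select(X, y, target_size):
    """Return column indices of the selected subset M* of the metrics in X."""
    kept = list(range(X.shape[1]))
    while len(kept) > target_size:
        # Accuracy of F(M) on the current metric set.
        base = GaussianNB().fit(X[:, kept], y).score(X[:, kept], y)
        # Score F(M - m) for each remaining metric m; drop the metric whose
        # removal changes the score the least, i.e. minimal |F(M-m) - F(M)|.
        diffs = []
        for m in kept:
            cols = [c for c in kept if c != m]
            score = GaussianNB().fit(X[:, cols], y).score(X[:, cols], y)
            diffs.append((abs(score - base), m))
        kept.remove(min(diffs)[1])
    return kept

# Hypothetical training set: 100 observations of 6 performance indicators.
rng = np.random.default_rng(1)
X, y = rng.random((100, 6)), rng.integers(0, 2, size=100)
print(greedy_select(X, y, target_size=3))
```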
In at least one embodiment, the subset corresponds to one SLO. However, in at least one other embodiment, the SLO is composed of one or more performance indicators that are combined to produce an SLO achievement value. As such, an SLO may depend on multiple components, each of which has a performance indicator measurement. The weights applied to the performance indicator measurements when calculating the SLO achievement value depend on the nature of the service and on which components are given priority by the service provider and the customer. Preferably, in such an embodiment, each of the multiple components corresponds to its own subset of performance data. In this way, the measure of underperformance is a combination of sub-measures of underperformance. In at least one embodiment, the correlation value between the subset and the measure of underperformance must be above a programmable threshold, so that the selection of elements of performance data to include in the subset is neither over-inclusive nor under-inclusive.
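A minimal sketch of producing an SLO achievement value from weighted performance indicator measurements follows; the indicator names and weights are illustrative assumptions, not values taken from any actual SLA:

```python
def slo_achievement(indicators, weights):
    """Weighted combination of normalized indicator measurements."""
    assert abs(sum(weights.values()) - 1.0) < 1e-9  # weights reflect agreed priorities
    return sum(indicators[name] * weights[name] for name in weights)

# Hypothetical components of one SLO and provider/customer-chosen weights.
indicators = {'availability': 0.999, 'throughput': 0.95, 'response_time': 0.90}
weights = {'availability': 0.5, 'throughput': 0.2, 'response_time': 0.3}
print(slo_achievement(indicators, weights))  # about 0.9595
```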
If the subset is appropriately correlated with the measure of underperformance, the subset may be monitored to anticipate the measure. If the measure corresponds with an SLO violation, then a breach of the SLA can be anticipated.
The processor 102 determines a trend of the subset of performance data, the trend also correlated with the measure of underperformance. Preferably, the processor 102 determines a trend correlated with an SLO violation itself. Determining a trend of the subset of performance data comprises determining that one element of the subset is behaving in a certain fashion, another element is behaving in a certain fashion, etc., where each behavior may be independent of each other behavior and the behaviors need not occur simultaneously. The behaviors comprise increases and decreases (linear, exponential, arithmetic, geometric, etc.), oscillations, random movements, and the like. The behaviors also include directionality. For example, the two behaviors {n1=1, n2=2, n3=3} and {n1=3, n2=2, n3=1}, where nx is the xth value of the element, are different behaviors even though each behavior contains the same values. The former behavior is a tripling of website traffic while the latter behavior is a reduction of website traffic to a third. In at least one embodiment, the behaviors can also be expressed as thresholds, e.g., {1<n1<2, 2<n2<3, 3<n3<4}. Specifically, the first value for the element is between 1 and 2, the second value is between 2 and 3, etc. As an example, a trend can be determined by determining that one element is increasing while another element is decreasing over the same particular period of time. Note, however, that the behaviors of the elements need not always occur simultaneously. A number of adjustable parameters can be used to increase the correlation between a trend and a measure of underperformance, which allows for a more accurate prediction of the measure of underperformance. Such parameters comprise any or all of: the number of elements of performance data used for the subset, the number of samples collected for each element, the rate of recording of each element, the rate of change of an element, the rate of change of the entire trend, and correlations between different elements of the performance data themselves, e.g., whether change in one element causes change in another element. Many adjustable parameters and combinations of parameters are possible. In at least one embodiment, the trend is a combination of sub-trends of the subset. For example, the processor determines different subsets of performance data that, when each subset is behaving in its own particular way, will result in an SLO violation, but when fewer than all of the subsets exhibit their behaviors, will not result in an SLO violation.
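One of the trend tests discussed above, namely one element increasing while another decreases over the same window, with a third element held within per-sample threshold bands, may be sketched as follows (the window contents and bands are illustrative assumptions):

```python
def increasing(samples):
    return all(a < b for a, b in zip(samples, samples[1:]))

def decreasing(samples):
    return all(a > b for a, b in zip(samples, samples[1:]))

def within_bands(samples, bands):
    """bands is a list of (low, high) pairs, one per sample."""
    return all(lo < s < hi for s, (lo, hi) in zip(samples, bands))

traffic  = [1, 2, 3]        # tripling of website traffic
capacity = [9, 6, 4]        # simultaneously shrinking headroom
latency  = [1.5, 2.5, 3.5]  # e.g. the thresholds {1<n1<2, 2<n2<3, 3<n3<4}

trend_present = (increasing(traffic)
                 and decreasing(capacity)
                 and within_bands(latency, [(1, 2), (2, 3), (3, 4)]))
print(trend_present)  # True
```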
In at least one embodiment, the processor 102 ceases to monitor the performance data, except for the subset, after determining the trend. Because monitoring itself is an added overhead that uses system resources, it is advantageous to keep the amount of system resources dedicated to monitoring at a minimum. As such, ceasing to monitor performance data that has little or no correlation to the measure of underperformance is preferable. By monitoring the subset, the processor 102 is still able to identify an occurrence of the trend. After such identification, in at least one embodiment, the processor 102 monitors a second subset of the performance data. Preferably, the second subset comprises at least one element not in the subset. System administrators prefer to study various data sources to determine the root cause of SLO violations after the fact, and this dynamic control of the collection of diagnostic information (i.e., when, and what kinds of, more detailed monitoring and instrumentation are turned on as the second subset) assists system administrators in the event that an SLO violation occurs. However, it is an inefficient use of resources to collect the same level of diagnostic information during normal operation. If a violation does occur, the processor 102 preferably refines the subset of performance data automatically. Many methods of refinement are possible.
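The dynamic control of monitoring described above may be sketched as follows; the controller class, metric names, and trend test are hypothetical stand-ins rather than any actual monitoring API:

```python
class MonitoringController:
    def __init__(self, subset, diagnostic_subset):
        self.active = set(subset)               # normally monitored metrics (the subset)
        self.diagnostic = set(diagnostic_subset)  # second subset, off during normal operation

    def on_sample(self, sample, trend_test):
        """sample maps metric name -> value; trend_test flags the trend."""
        if trend_test(sample):
            # Trend identified: widen collection for later root-cause analysis.
            self.active |= self.diagnostic
        return {m: sample.get(m) for m in self.active}

ctrl = MonitoringController(subset={'traffic'},
                            diagnostic_subset={'gc_pauses', 'queue_depth'})
sample = {'traffic': 3.2, 'gc_pauses': 12, 'queue_depth': 40}
print(ctrl.on_sample(sample, trend_test=lambda s: s['traffic'] > 3.0))
```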
Machine learning techniques determine and refine the trends that establish correlation between performance data and measures of underperformance. Because the machine learning techniques create succinct representations of correlations from a diverse set of data, the techniques are well suited to determining which performance metrics lead to underperformance and which performance metrics can be safely ignored. As such, the system 100 is self-refining. Specifically, instances of SLO violations provide positive examples for the training of the machine learning models, while normal operating conditions, without SLO violations, provide the negative examples for training. As such, the subset of performance data correlated with the underperformance can be adjusted automatically, and if a highly correlated subset suddenly or gradually becomes uncorrelated for any reason, the subset can be adjusted to maintain a high correlation. In this way, a steady supply of positive and negative examples allows for self-refining. Manual refining is also possible.
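A minimal sketch of this self-refining loop follows, under the assumptions (for illustration only) that each observation window becomes one training example labeled by whether an SLO violation followed it, and that the classifier is refit on a fixed cadence:

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

X_history, y_history = [], []
model = None

def observe(window_features, violation_followed):
    """Record one example: positive if an SLO violation followed the window."""
    global model
    X_history.append(window_features)
    y_history.append(1 if violation_followed else 0)
    # Refit every 50 examples (an assumed cadence), once both classes are seen.
    if len(y_history) % 50 == 0 and len(set(y_history)) == 2:
        model = GaussianNB().fit(np.array(X_history), np.array(y_history))

# Hypothetical stream of 200 labeled observation windows of 8 features each.
rng = np.random.default_rng(2)
for _ in range(200):
    observe(rng.random(8), bool(rng.integers(0, 2)))
print(model.predict(rng.random((1, 8))))
```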
The alert module 104 preferably outputs an alert based on the identification of a trend. In at least one embodiment, the processor 102 sends a signal to the alert module 104 to output the alert. In at least one embodiment, the alert is a combination of alerts comprising a visual alert, an audio alert, an email alert, etc. Many alerting methods are possible. Preferably, the measure of underperformance is a future measure of underperformance and the alert is output prior to occurrence of the future measure of underperformance. In at least one embodiment, the future measure of underperformance is based on an SLO.
Referring to
The above discussion is meant to be illustrative of the principles and various embodiments of the present invention. Numerous variations and modifications will become apparent to those having ordinary skill in the art once the above disclosure is fully appreciated. It is intended that the following claims be interpreted to embrace all such variations and modifications.