The invention generally relates to identifying correlated operation management events.
An information technology (IT) business service typically includes applications, middleware, systems and a storage infrastructure that are all closely connected. A given problem occurring in one of these domains may result in problems in other domains, leading to the logging of multiple operation management events. Multiple teams typically coordinate actions to gather cross-domain knowledge and perform a root cause analysis to solve related inter-domain problems.
Problems occurring in multiple domains of a given computer system may be logged as operation management events in an operation management event log, which contains time-stamped event descriptions that correspond to inter-domain problems. Some of the operation management events may be related and, as such, arise from the same root cause. Other events are not related and occur due to independently occurring problems. Due to at least the volume of logged operation management events, sorting through the logged events and attempting to determine which events are correlated may be a formidable task, especially if performed manually. Systems and techniques are disclosed herein, which automatically process logged operation management events to identify events that are related, or correlated, to each other for purposes of developing correlation rules that set forth relationships between events. For example, a particular correlation rule may be that when event A happens, events B and C occur. Such rules facilitate the recognition of specific problems and the development and application of solutions to these problems.
As an example, in some implementations, it is generally assumed that operation management events that are correlated occur in the vicinity of each other in terms of time. In particular, as an example, correlation rules may be determined pursuant to a technique that includes grouping the events into episodes based on how close the events are together in time and then identifying the correlated events of each episode.
Referring to
As shown in
In accordance with a specific example described herein, one of the physical machines 100a contains machine executable program instructions and hardware that executes these instructions for purposes of automatically identifying, or determining, event correlation rules based on logged operation management events, such as events that are logged in an exemplary operation management event log 115 that is depicted in
The processing by the physical machine 100a results in data indicative of correlation rules that identify whether, for example, a particular event A is correlated to event B. Whether event A is deemed to be correlated to event B is regulated by such measures as support and confidence. The support measure specifies how often the rule must occur (i.e., how frequently events A and B appear together, or |A∪B|) for a correlation to be declared, and the confidence measure specifies a minimum for the probability P(B|A), meaning that the confidence measure specifies the percentage of times that event B occurred, given that event A occurred. Genuine correlations may be identified by setting the thresholds corresponding to the support and confidence measures particularly high.
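As a purely illustrative sketch (the function name and the representation of episodes as sets of surrogate event types are assumptions made here, not requirements of the techniques disclosed herein), the two measures may be computed for a candidate rule A→B as follows:

```python
def support_and_confidence(episodes, a, b):
    # episodes: list of sets of surrogate event types (illustrative representation)
    total = len(episodes)
    count_a = sum(1 for ep in episodes if a in ep)               # episodes containing event A
    count_ab = sum(1 for ep in episodes if a in ep and b in ep)  # episodes containing A and B together
    support = count_ab / total if total else 0.0                 # how often the rule occurs
    confidence = count_ab / count_a if count_a else 0.0          # estimate of P(B | A)
    return support, confidence
```

A candidate rule would then be retained only if both values exceed their respective thresholds.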
Therefore, by identifying the correlation rules, a correlation rule database 116 may be updated and maintained (such as in local, external storage or on remote storage) for purposes of quickly finding the root causes of present and future inter-domain problems that are indicated by the time-stamped event descriptions that are stored in the operation management log 115.
It is noted that in other implementations, all or part of the above-described correlation rule identification may be implemented on one, two, three or more physical machines 100. Therefore, many variations are contemplated and are within the scope of the appended claims.
The architecture that is depicted in
As depicted in
In general, the physical machine 100a, for this example, includes a set of machine executable instructions, which when executed by the CPU(s) 124 form an “event pre-processing application 110”, which is responsible for mapping the operation management events contained in the log 115 to a set of surrogate event types, which are further processed to group the events into episodes. In this manner, the physical machine 100a also includes a set of machine executable instructions, which when executed by the CPU(s) 124 form an episode creator, or “episode creation application 112,” which is responsible for processing the surrogate event types to organize the events into episodes. In general, a given episode contains events that occur within a certain time interval (called “t”) of each other. Additionally, the physical machine 100a, for this example, includes a set of machine executable instructions, which when executed by the CPU(s) 124 form a “data mining application 114,” which is responsible for processing each episode to identify correlation rules (if any) within the episode. The functionality of the applications 110, 112 and 114 may be consolidated into a single application or into two applications; or the functionality of the applications 110, 112 and 114 may be performed by more than three applications, as many implementations are contemplated and are within the scope of the appended claims.
In general, the other physical machines of
As a more specific example, in accordance with some embodiments of the invention, the physical machine 100a performs a technique 200 that is depicted in
Referring to
In accordance with an example, the event pre-processing application 110 determines the surrogate event type for a given event description by decomposing the event description and comparing this decomposed event description with one or more decomposed reference event descriptions. More specifically, in general, the event description, which may take on numerous forms, may contain a fixed part as well as one or more variable parts. For example, an exemplary generic event description for a logging error may be as follows:
DBSPI10-82: Data logging failed for <Object Name>. Make sure Performance Agent is running.
In the above example, the values in the angle brackets are variables, and the other text is fixed. As a more specific example, the following are two specific event description instances:
DBSPI10-82: Data logging failed for DBSPI_MSS_GRAPH. Make sure Performance Agent is installed and running.
BlackBerry Dispatcher WBCXOEB021 [0×2710] 8304: (#50099) BlackBerry Dispatcher Shutdown complete
In accordance with an example implementation, for purposes of classifying an event as a particular surrogate event type, the event pre-processing application 110 subdivides the event description into words, or tokens; discards single character tokens; and thereafter performs other measures to determine whether a given event description is the same or nearly the same as another event description.
For example, in accordance with an exemplary implementation, the event pre-processing application 110 may evaluate a given event description to determine whether the given event description corresponds to a certain predetermined surrogate event classifier in the following manner. For this example, the event pre-processing application 110 compares the given event description to a reference event description, which is associated with the predetermined surrogate event classifier. This comparison may involve determining whether at least two of the tokens are identical at the same positions and, if so, whether at least two thirds of the tokens at the same positions are identical. If the given event description passes these comparison measures, then the event pre-processing application 110 assigns the predetermined surrogate event classifier to the given event description. Otherwise, the event pre-processing application 110 searches for another appropriate surrogate event classifier and may (if all comparisons fail) assign a new surrogate event classifier. Other token similarity measures may be used, in accordance with other exemplary implementations. Moreover, in accordance with some implementations, the event pre-processing application 110 examines only a first predetermined number (fifteen, for example) of tokens of each event description for purposes of increasing processing speed.
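The following is a minimal sketch of one possible reading of this comparison, assuming a simple whitespace tokenizer; the function names, the fifteen-token limit applied in tokenize, and the exact form of the two-thirds test are illustrative assumptions rather than a definitive implementation:

```python
def tokenize(description, max_tokens=15):
    # Split into tokens, discard single-character tokens, and examine only
    # the first tokens (fifteen, in this example) for processing speed.
    tokens = [tok for tok in description.split() if len(tok) > 1]
    return tokens[:max_tokens]

def matches_reference(description, reference):
    # Compare a given event description against a reference description that
    # is associated with a predetermined surrogate event classifier.
    given = tokenize(description)
    ref = tokenize(reference)
    positions = min(len(given), len(ref))
    same = sum(1 for i in range(positions) if given[i] == ref[i])  # identical tokens at the same position
    # At least two identical same-position tokens, and at least two thirds
    # of the compared positions must be identical.
    return same >= 2 and 3 * same >= 2 * positions
```

Other token similarity measures, for example measures that tolerate inserted or reordered tokens, may of course be substituted within the same overall flow.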
As another example of a measure used to process the event description, in accordance with some implementations, the event pre-processing application 110 uses an additional vector, or field, of the event description, which identifies a particular application type. In this manner, the event pre-processing application 110 presumes that all event descriptions that are associated with the same surrogate event type are also associated with the same type of application. Therefore, by excluding event descriptions whose application attributes do not match, the event pre-processing application 110 avoids comparing a given event description against all of the event descriptions that are contained in the operation management log 115.
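A brief sketch of this filtering, assuming for illustration that each logged event carries an application-type attribute alongside its description (the tuple layout is an assumption made here, not a format prescribed above):

```python
from collections import defaultdict

def group_by_application_type(events):
    # events: iterable of (application_type, event_description) pairs.
    # Descriptions are only ever compared against other descriptions in the
    # same group, so comparisons against the entire log 115 are avoided.
    groups = defaultdict(list)
    for app_type, description in events:
        groups[app_type].append(description)
    return groups
```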
As a non-limiting example, one way for the episode creation application 112 to organize the surrogate event types into episodes is based on the timestamps of the surrogate event types. This is based on the observation that if event A is correlated to event B, then there is an expectation that the two events A and B occur within a time t of each other. Therefore, for purposes of creating episodes, in accordance with some implementations, the episode creation application 112 groups together events that occur within time t of each other. In other words, the episode creation application 112 receives a dataset (called "D") from the event pre-processing application 110, which indicates a set of surrogate event types and the associated timestamps of these surrogate event types; and the episode creation application 112 maps the D dataset to another dataset of episodes (called "D′"). Each episode has an associated episode identification (ID) and, in general, is a set of events, which occurred within some time t of each other.
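For illustration only, the D and D′ datasets might take the following shapes; the concrete representation is an assumption, as the description above requires only surrogate event types with timestamps and episode IDs with associated event sets:

```python
# D: surrogate event types and their associated timestamps.
D = [
    (100.0, "EVENT_TYPE_A"),   # (timestamp, surrogate event type)
    (100.7, "EVENT_TYPE_B"),
    (250.3, "EVENT_TYPE_A"),
]

# D': episodes, each with an episode ID and the set of events that
# occurred within some time t of each other.
D_prime = {
    0: {"EVENT_TYPE_A", "EVENT_TYPE_B"},
    1: {"EVENT_TYPE_A"},
}
```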
In accordance with some implementations, the creation of the episodes may be performed in a manner that is depicted in a technique 250 of
After the frequent event types have been removed, pursuant to block 254, the technique 250 includes initializing (block 258) a window of time. In this regard, the times at which the events occur span a certain range of time, and the episode creation application 112 slides the time window across this range to identify events (namely, events that fall within the confines of the window) to be grouped into the same episode.
More specifically, the entire time range is divided into time intervals of size t+Δ, and the window is moved by Δ until the entire time range is covered. For Δ=t/2, for any event i that occurs at time Ti, there exists an episode E that contains all of the events occurring between Ti−t/2 and Ti+t/2. The choice of Δ is a tradeoff: a relatively small Δ results in a large number of positions for the sliding window, making the computation prohibitively expensive; and a relatively large Δ introduces a larger inaccuracy, because events that belong to other episodes are considered along with the events that occur in the time range of interest. In accordance with some implementations, the assumption is made that the cost of introducing inaccuracy is the same as the computational cost, which means that Δ is set equal to the time t. Thus, in accordance with an example implementation, the sliding window has a size of 2t and is moved by time t for each episode identification.
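The following is a minimal sketch of the sliding-window grouping for the Δ=t case described above, assuming the events arrive as (timestamp, surrogate event type) pairs in the D dataset shape shown earlier and that overly frequent event types have already been removed pursuant to block 254; it is illustrative rather than a definitive implementation:

```python
def create_episodes(events, t):
    # events: iterable of (timestamp, surrogate_event_type) pairs; t is assumed > 0.
    # A window of size 2*t (i.e., t + delta with delta = t) slides across the
    # full time range in steps of t; the events falling inside one window
    # position form one episode.
    events = sorted(events)
    if not events:
        return {}
    first, last = events[0][0], events[-1][0]
    episodes = {}
    episode_id = 0
    window_start = first
    while window_start <= last:
        in_window = {etype for ts, etype in events
                     if window_start <= ts < window_start + 2 * t}
        if in_window:
            episodes[episode_id] = in_window   # episode ID -> events in this window position
            episode_id += 1
        window_start += t                      # slide the window by delta = t
    return episodes
```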
Thus, still referring to
After the episode creation application 112 identifies the episodes and generates the corresponding D′ dataset, the episodes are processed by the data mining application 114, which identifies whether given events are correlated based at least in part on an examination of all of the episodes to determine whether the given events occur together across a significant number of episodes. In general, the generation of correlation rules (whether event A is correlated to event B, for example) is governed by thresholds that are supplied as input parameters to the data mining application 114 and which specify the support and the confidence. The support measures how often the rule occurs, and the confidence measures the probability of event B occurring given event A. In general, the thresholds are set so that the data mining application 114 obtains rules with relatively high confidence and relatively high support.
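A minimal sketch of this rule generation over the D′ episodes follows; restricting the search to pairwise rules A→B and enumerating all pairs exhaustively are simplifications made here for illustration, and the data mining application 114 is not limited to this approach:

```python
from itertools import permutations

def mine_rules(episodes, min_support, min_confidence):
    # episodes: dict mapping episode IDs to sets of surrogate event types (the D' dataset).
    episode_sets = list(episodes.values())
    total = len(episode_sets)
    if total == 0:
        return []
    event_types = set().union(*episode_sets)
    rules = []
    for a, b in permutations(event_types, 2):
        count_a = sum(1 for ep in episode_sets if a in ep)
        count_ab = sum(1 for ep in episode_sets if a in ep and b in ep)
        support = count_ab / total                            # how often A and B occur together
        confidence = count_ab / count_a if count_a else 0.0   # estimate of P(B | A)
        if support >= min_support and confidence >= min_confidence:
            rules.append((a, b, support, confidence))         # candidate rule: A -> B
    return rules
```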
As a non-limiting example, the data mining application 114 may be the Enterprise Miner application, which is available from SAS. The data mining application 114 processes the D′ episode dataset that is provided by the episode creation application 112 to generate a set of rules and a link graph showing how various rules are related to each other. Furthermore, the application 114, in accordance with some implementations, provides a visual presentation of the confidence and support.
Referring back to
847 Configuration distribution pending: Template . . .
5 Can't read template file . . .
6 Distribution problem occurred . . .
These events may otherwise be identified as distinct and independent events that are scattered among other events. Therefore, the systems and techniques that are disclosed herein provide guidance for creating event correlation rules according to newly found association rules.
While the present invention has been described with respect to a limited number of embodiments, those skilled in the art, having the benefit of this disclosure, will appreciate numerous modifications and variations therefrom. It is intended that the appended claims cover all such modifications and variations as fall within the true spirit and scope of this present invention.