This invention was originally disclosed in Provisional Application No. 60/631,905 filed on Nov. 30, 2004. The inventor claims all rights and priorities associated with the provisional application.
Not applicable
Not applicable
In today's enterprise computing environment, there are many applications that need constant monitoring and managing. One such application is the SAP database. There are many products in the marketplace that can monitor SAP, including a monitoring tool from SAP called CCMS, which will report various types of monitoring data, e.g., alerts, status, performance metrics.
There are various products available to monitor the data, but none has the ability to capture and process data asynchronously, consolidate data from multiple sources, correlate the data, identify root causes, report correlated alerts, events and performance data, and make recommendations to the system operator. A few examples of prior art products include: Quest: Foglight, BMC Software: Patrol for SAP, Veritas, HP: OpenView, CA: Unicenter, Tivoli, and SAP CCMS.
There are several problems facing application monitoring today. First, too much monitoring information is sent to the operator. Additionally, too many applications are sending information at one time and there are too many consoles to monitor at the same time. Also, there are not enough experienced operators/administrators to review all the data generated by the various applications. Application monitoring does not correlate data from multiple sources and applications. Finally, application monitoring cannot determine root causes of problems from all the information.
The invention provides a way to consolidate the data from multiple sources; analyze and correlate data using existing expert knowledge, know-how and experience, i.e., create an “expert-in-a-box” approach; filter out unnecessary data points; provide meaningful alerts and performance information to the operator; and provide recommendations based on correlated alerts, events, and performance data.
The invention monitors and manages performance and availability data from multiple data providers. A set of executable hierarchical decision trees is used. Each tree has an anchor data node that, if matched to an incoming data point, will trigger the execution of the decision tree. Each tree has lower level data nodes that may request data when the data nodes are traversed during the execution of the tree. Each data node requests a particular type of data to be received within a certain time window. Depending on the availability and analysis of the data, the node will return a result, causing the decision tree to proceed and branch the hierarchical decision tree according to the result, if necessary. At the end of each tree branch is an action node, which represents the correlation of an alert, event, or performance metric. The path of the anchor node, data nodes, and action node followed in the executable hierarchical decision tree is used to generate a correlation event.
At startup time, all the correlation trees are loaded into the system and the attributes of the data nodes are known. As data from the data providers come in, a preliminary match of data to data nodes may be made. If there is a match, the data will be held in a data holding bin awaiting a request from an executing correlation tree. Data points that match a correlation tree are tagged with a lifespan, which is used to determine how long the data points will be maintained in the data holding bin. Once the lifespan has expired and no executing correlation tree is matched with the data point, the data point will be discarded.
When an anchor node matches a particular event, a correlation tree is activated and the tree begins execution. As the system proceeds down the tree and traverses a data node, the data node will request data and wait for data. If the requested data is available, the data node will analyze the data and output a result. If the data is not available, the data node will output a different result indicating the absence of data. Depending on the result of the analysis or the availability of the data, the tree will continue execution and perform a branch, if necessary.
When an action node is reached at the end of a tree branch, a correlation of data points has occurred, and a correlation event is issued. A diagnostic report is also generated and provided to the system operator. The decision reached on the trees represents knowledge and expertise on how to analyze data points from the various data sources. Each tree is customized to represent certain types of alerts, events, or performance metrics, and the data nodes on the tree are used to analyze particular data associated with such alerts, events, or performance metrics.
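The traversal described above can be illustrated with a minimal Python sketch. It is not the inventor's implementation: the class names, the `NO_DATA` label, the thresholds, and the example tree are all hypothetical, and waiting for data within a time window is omitted (the requested data is simply looked up in a dictionary). The sketch starts at the first data node, i.e., after the anchor match has already activated the tree.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple, Union

NO_DATA = "no_data"  # hypothetical result label for "requested data was absent"


@dataclass
class ActionNode:
    """End of a tree branch: names the correlation event to report."""
    event: str


@dataclass
class DataNode:
    """Requests one type of data and branches on the analysis result."""
    name: str
    data_type: str
    analyze: Callable[[float], str]  # maps a value to a result label
    branches: Dict[str, Union["DataNode", ActionNode]]


def execute_tree(root: DataNode, available: Dict[str, float]) -> Tuple[str, List[str]]:
    """Walk the tree until an action node is reached.

    Returns the correlation event and the path of nodes that was followed.
    """
    path: List[str] = []
    node: Union[DataNode, ActionNode] = root
    while isinstance(node, DataNode):
        path.append(node.name)
        value = available.get(node.data_type)
        # Absence of data is itself a result that selects a branch.
        result = NO_DATA if value is None else node.analyze(value)
        node = node.branches[result]
    path.append(node.event)
    return node.event, path


# Hypothetical two-level tree; thresholds are illustrative only.
n2 = DataNode(
    "N2", "users",
    lambda v: "many" if v > 100 else "few",
    {"many": ActionNode("too_many_users"),
     "few": ActionNode("cpu_bound_process"),
     NO_DATA: ActionNode("cpu_high_unexplained")},
)
root = DataNode(
    "N1", "cpu",
    lambda v: "high" if v > 90 else "normal",
    {"high": n2,
     "normal": ActionNode("cpu_normal"),
     NO_DATA: ActionNode("cpu_unknown")},
)
```

In this sketch the returned path plays the role of the record from which a correlation event and diagnostic report could be generated.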
In addition, the data points corresponding to a correlated alert, event or performance metric may occur out of chronological order or asynchronously, unlike the prior art. In other words, the relevant data points do not have to occur in any particular chronological order so long as they occur during a pre-defined time window. This allows for the capturing of relevant data even before an event occurs that would trigger the capturing of such data. This is also referred to as “Fuzzy Time” processing of data.
The invention consolidates data points from multiple data sources to analyze the data and correlates the data from multiple sources. It handles the data “asynchronously,” reports only relevant events, and provides recommended courses of action and diagnostic reports. The invention improves over the prior art by allowing monitoring at the operating system level, application and database level, and network performance and connectivity level. The system provides a consolidated view of data and reduces data traffic to the operator, i.e., reduces “noise” at the console.
The system performs data correlation and root cause analysis, and provides proactive analysis of data instead of merely reacting to incoming data. It enables execution of daily system/application checklists; provides 24 hour and 7 day a week support; and minimizes outages and Service Level Agreement exceptions.
The above objects and advantages of the present invention will become more apparent by describing in detail a preferred embodiment thereof with reference to the attached drawings in which:
Glossary
“Asynchronous Time” (or “Fuzzy Time”) refers to the concept that data points associated with an event may occur out of order with respect to chronological time. For example, an event A may have three data points associated with it: X, Y, and Z. However, the data points may occur in any order, such as X, Z, and Y or Z, X, and Y. Under the “Fuzzy Time” approach, the order of the data point occurrence is not important, so long as they occur within a specified time window, and once the three data points have occurred, event A is reported.
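The “Fuzzy Time” idea can be sketched in Python as an order-independent check: the event is reported once every required data point has been seen within a single time window, whatever the arrival order. The function name, the tuple layout, and the choice of a sliding window over sorted timestamps are assumptions made for illustration, not details taken from the invention.

```python
from collections import Counter
from typing import Iterable, List, Tuple


def event_occurred(points: Iterable[Tuple[float, str]],
                   required: Iterable[str],
                   window: float) -> bool:
    """Return True if all required data types occur within `window` seconds.

    `points` holds (timestamp, data_type) pairs in any order.
    """
    needed = set(required)
    relevant: List[Tuple[float, str]] = sorted(
        (t, kind) for t, kind in points if kind in needed
    )
    counts: Counter = Counter()
    left = 0
    for t, kind in relevant:
        counts[kind] += 1
        # Drop points that have fallen out of the window ending at time t.
        while t - relevant[left][0] > window:
            old = relevant[left][1]
            counts[old] -= 1
            if counts[old] == 0:
                del counts[old]
            left += 1
        if len(counts) == len(needed):
            return True
    return False


# Data points for event A arriving as Z, X, Y rather than X, Y, Z.
arrivals = [(10.0, "Z"), (12.0, "X"), (40.0, "Y")]
```

With these arrivals, a 60-second window contains all three data points, while a 20-second window does not.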
C# (“C sharp”) is the programming language used to implement the invention. C# is part of the Dot NET (.NET) programming package provided by the Microsoft Corporation.
CCMS is a monitoring system provided with a SAP database. CCMS provides the following types of data: alerts, performance values, and status attributes.
A correlation event refers to a set of data points that has been identified and associated with a specific alert, event, or performance metric. In other words, the data has been correlated, which might be (1) a correlated alert (also referred to as a Correlex Alert), (2) a correlated event (also referred to as a Correlex Event), or (3) correlated performance data (also referred to as a Correlex Performance Data or Metric).
Correlation tree refers to the executable hierarchical decision tree as implemented in the present invention.
“Correlex” is a trademark of Tidal and is used to refer to the innovative technology of using a plurality of executable decision trees to analyze data.
Data provider (also referred to as a data source) can be any application, system, or program that provides data that may generate alerts, events, performance metrics or any other information. One example of a data provider is CCMS.
Decision tree refers to the well-known hierarchical decision tree having multiple levels of nodes. Each level has data nodes and branches to lower level nodes.
Microsoft Operations Manager (MOM) refers to a system framework offered by Microsoft Corp.
SAP, as used herein, refers to a database marketed by the well-known database solution company, SAP AG.
Tree instance refers to an active decision tree, i.e., a tree that has been started and is currently executing.
In the computing enterprise environment, there are multiple applications and operating systems running and sharing resources with each other. The applications and systems are sending status messages, alerts, and performance data to multiple consoles, often flooding and overrunning such consoles with excessive information and making it very difficult for systems operators to respond. Moreover, with excessive information, the operator has difficulty distinguishing minor alerts from critical problems and events.
In
As shown in
The present invention can monitor data points from multiple data sources as shown in
In
Step 2: Capture data points from the data sources S42. For example, if SAP is being monitored, the data from CCMS will be captured by the invention. All the data points from the data sources being monitored are captured and processed together.
Step 3: Match data points to the data nodes in the correlation trees S43. As data points are captured, they are matched to the correlation trees loaded in the system. If any of the data points match any of the data nodes of the correlation trees, the data points will be tagged as “of interest” and held in waiting until requested by a correlation tree.
Step 4: Start execution of certain correlation trees S44. Each correlation tree has an anchor data node. If an incoming data point matches the anchor data node of a correlation tree, then the tree becomes a “tree instance” and the correlation tree is started. Once started, the tree begins executing by traversing the data nodes as it moves down the tree. Each traversed data node will request specific data and wait for the data to become available. Depending on the availability and analysis of the data, a data node will output a particular result, which will determine how the tree will branch and continue down the tree. Once an action node is reached at the end of a tree branch, a correlation of data will occur and a diagnostic report will be generated. The diagnostic report may also include additional data.
Step 5: Report correlated data and recommend a course of action S45. When an action node is reached, then all the data associated with an alert, event or performance metric has occurred. At this point, a correlation event is reported, along with a diagnostic report to provide additional information and recommendations to the system operator.
Step 6: Clean up “old” data S46. Data points that are not used by a correlation tree or have expired are deleted on a routine basis. “Old” data is not reported in order to reduce the amount of unnecessary information to the system operator. However, if desired, certain defaults can be changed so that “old” data is reported to the operator.
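The capture, match, and start steps above can be sketched in Python as follows. This is only an illustration under stated assumptions: the `TreeDefinition` and `Monitor` names are hypothetical, a data point is matched on its data type alone, and at most one instance of each tree is allowed to run at a time (the description does not say whether several instances may coexist).

```python
from dataclasses import dataclass, field
from typing import FrozenSet, List, Tuple


@dataclass(frozen=True)
class TreeDefinition:
    name: str
    anchor_type: str            # data type that starts the tree
    node_types: FrozenSet[str]  # data types requested by lower level data nodes


@dataclass
class Monitor:
    trees: List[TreeDefinition]
    holding_bin: List[Tuple[str, float]] = field(default_factory=list)
    running: List[str] = field(default_factory=list)

    def capture(self, data_type: str, value: float) -> str:
        """Process one incoming data point and say what happened to it."""
        # Step 4: an anchor match turns a tree into a running tree instance.
        started = [t.name for t in self.trees
                   if t.anchor_type == data_type and t.name not in self.running]
        self.running.extend(started)
        # Step 3: a match on any data node tags the point as "of interest".
        of_interest = any(data_type in t.node_types for t in self.trees)
        if of_interest:
            self.holding_bin.append((data_type, value))
        if started:
            return "started"
        return "held" if of_interest else "discarded"


monitor = Monitor([
    TreeDefinition("T1", "A", frozenset({"B", "C"})),
    TreeDefinition("T2", "B", frozenset({"D"})),
])
```

Note that a single data point may both start one tree and be held for another, as data type "B" does in this example.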
An example correlation tree is shown in
For example, in
Not all incoming data points will result in a correlation. Some data will not match any data nodes, and other data, which match data nodes of interest, will not be used because the interested tree may not execute at all or the particular branch of the matched tree instance did not execute. Some matched data points will not be used because the lifespan associated with the data points will expire.
Every correlation tree definition contains one or more data node definitions. Each data node definition contains, among other things: (1) data attributes of the requested data, (2) the source of the data, and (3) the time window and the time window reference node. A data node executes only if its correlation tree is executing and the data node has been traversed. In
In an ideal world, data points associated with an event would appear more or less in order after the start of the monitoring of an event. For example, in
In
For example, in Node 2 N2, the requested data type is D1, and it has to occur within 300 seconds of the time window reference node, Node 1 N1. In Node 3 N3, the requested data type is D2, and it has to occur within 500 seconds of Node 1 N1. In Node 4 N4, the requested data is D3, and it must occur within 300 seconds of N2. In Node 5 N5, the requested data is D4, and it has to occur within 300 seconds of N4. As shown, each lower level data node has a time window that is relative to the time of an ancestor node along the same branch of the tree.
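The relative time windows in this example can be worked through with a short Python sketch. Two assumptions are made that the text does not settle: the "time" of a node is taken to be the time at which its matching data point occurred, and the window is treated as two-sided, so that a data point slightly earlier than its reference node still qualifies (consistent with the “Fuzzy Time” approach). The `WINDOWS` table and function name are illustrative.

```python
from typing import Dict, Tuple

# node -> (time window reference node, window in seconds), as in the example
WINDOWS: Dict[str, Tuple[str, int]] = {
    "N2": ("N1", 300),
    "N3": ("N1", 500),
    "N4": ("N2", 300),
    "N5": ("N4", 300),
}


def within_window(node: str, data_time: float, node_times: Dict[str, float]) -> bool:
    """Check whether data arriving at `data_time` satisfies `node`'s window."""
    reference, seconds = WINDOWS[node]
    # Assumption: two-sided window around the reference node's time.
    return abs(data_time - node_times[reference]) <= seconds


# N1 (the reference for N2 and N3) occurred at t = 1000 s; D1 for N2 at t = 1250 s.
node_times = {"N1": 1000.0, "N2": 1250.0}
```

Under these assumptions, D1 at 1250 s satisfies N2 (250 s from N1), D3 at 1500 s satisfies N4 (250 s from N2), and D2 at 1600 s fails N3 (600 s from N1 exceeds 500 s).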
As shown in
In the
If a data point matches a data node of a correlation tree that is not currently executing S706, the data is tagged as “of interest” to the correlation tree, and a lifespan is determined S707 based on the time window specified in the data node. The tagged data point is held in a data holding bin waiting for a data request S708 from the correlation tree. If a request is made, the data will be presented to the requesting data node for processing.
Periodically a clean-up program will execute to check the lifespan of the data points that are tagged to trees that are not executing. If the lifespan has been exceeded, then the data point is deleted S709, unless it is also tagged to a currently executing tree.
If a data point does not match any of the data nodes of the correlation trees then the data point is discarded S710. In one implementation of the invention, prior to discarding the data point, the invention will report the data to the system operator.
In
If a data point is matched to a correlation tree that is executing, e.g., Tree 2 808, then the data point will be held in the data holding bin until it is requested by the executing tree. The data point will not be deleted even if the lifespan has expired. If no executing trees match the data point, then the data point will be marked for deletion once the lifespan has expired.
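The retention rule for the data holding bin can be sketched in Python as below. The `HeldPoint` fields and the `clean_up` function are hypothetical names; the rule itself follows the description: a data point is deleted only when its lifespan has expired and no currently executing tree is tagged to it.

```python
from dataclasses import dataclass
from typing import FrozenSet, List, Set


@dataclass(frozen=True)
class HeldPoint:
    data_type: str
    arrived_at: float             # seconds
    lifespan: float               # seconds, derived from the data node's time window
    tagged_trees: FrozenSet[str]  # trees whose data nodes matched this point


def clean_up(holding_bin: List[HeldPoint],
             executing_trees: Set[str],
             now: float) -> List[HeldPoint]:
    """Return the data points that survive a periodic clean-up pass."""
    kept = []
    for point in holding_bin:
        expired = now - point.arrived_at > point.lifespan
        wanted = bool(point.tagged_trees & executing_trees)
        # Expired points are kept only while an executing tree may request them.
        if wanted or not expired:
            kept.append(point)
    return kept


a = HeldPoint("D1", 0.0, 300.0, frozenset({"Tree 1"}))
b = HeldPoint("D2", 0.0, 300.0, frozenset({"Tree 2"}))
c = HeldPoint("D3", 200.0, 300.0, frozenset({"Tree 1"}))
```

With only Tree 2 executing at t = 400 s, point `a` is deleted (expired, no executing tree), `b` is kept despite having expired, and `c` is kept because its lifespan has not yet run out.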
In
In another embodiment of the invention shown in
The correlation engine 1010 matches the data points to the correlation forest 1011, and the dispatcher 1012 executes the correlation trees. The results from the execution of the correlation trees are reported by a Tidal Enterprise Framework 1013, MOM transporter 1014, OpenView transporter 1015, AM transporter 1016, or Remedy transporter 1018 to multiple and different management frameworks such as: Horizon database 1018, MOM 1019, OpenView from HP 1020, AppManager from NetIQ 1021, and Remedy from BMC Software 1022. The different management frameworks may have a Horizon extension 1023, 1024, and 1025.
Associated with the Correlex engine is a knowledge database 1027 that provides further information and recommendations, in the form of diagnostic reports 1026, to the system operator. Based on the types of alerts, events, or performance data identified by the correlation tree, a corresponding diagnostic report is generated.
In the present invention, correlation trees may be displayed visually to the system operator. Each data node is displayed and shows the data attributes associated with it. The action nodes at the end of a tree branch show the type of correlation event that will be reported to the operator, such as a Correlex Alert, Correlex Event, or Correlex Performance Metric.
If such an alert is not available within a certain time window (as specified in data node 2), then a branch to data node 5 occurs, whereby a request for CCMS Performance attribute “Page In” 1104 is initiated. Next, in data node 6, a request for CCMS Performance Attribute: “Page Out” 1105 is issued. Finally, a Correlex Alert is issued for “Low Physical Memory” 1106.
If the CCMS alert for “CPU Utilization” 1103 does occur within a specified time window, then the tree will branch to data node 3, wherein a request for CCMS Performance Attribute: “Users Logged On” 1107 is initiated, followed by “Total Work Process” 1108 as requested by data node 4. Finally, a Correlex Alert of “Too Many Work Processes Alive” 1109 is reported, along with a diagnostic report, as shown in
Correlation trees are defined in XML (Extensible Markup Language).
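Because the actual XML schema is not reproduced here, the following Python sketch uses a purely hypothetical schema to show how such a definition might be loaded at startup. The element names, attribute names, and the 300-second windows are invented for illustration; only the data attributes “Page In” and “Page Out” and the “Low Physical Memory” alert are borrowed from the example branch described above. Parsing uses the standard library module `xml.etree.ElementTree`.

```python
import xml.etree.ElementTree as ET

# Hypothetical tree definition; not the schema used by the invention.
TREE_XML = """
<correlationTree name="low-memory-example">
  <dataNode id="5" source="CCMS" attribute="Page In"
            windowSeconds="300" referenceNode="2"/>
  <dataNode id="6" source="CCMS" attribute="Page Out"
            windowSeconds="300" referenceNode="5"/>
  <actionNode event="Correlex Alert" text="Low Physical Memory"/>
</correlationTree>
"""


def load_tree(xml_text: str):
    """Parse a tree definition into its name, data nodes, and action labels."""
    root = ET.fromstring(xml_text.strip())
    data_nodes = [
        {
            "id": node.get("id"),
            "source": node.get("source"),
            "attribute": node.get("attribute"),
            "window_seconds": int(node.get("windowSeconds")),
            "reference_node": node.get("referenceNode"),
        }
        for node in root.iter("dataNode")
    ]
    actions = [
        f'{action.get("event")}: {action.get("text")}'
        for action in root.iter("actionNode")
    ]
    return root.get("name"), data_nodes, actions


name, data_nodes, actions = load_tree(TREE_XML)
```

Each parsed data node carries the three items a data node definition is said to contain: the data attributes, the source of the data, and the time window with its reference node.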
As shown in
Number | Date | Country
---|---|---
60631905 | Nov 2004 | US