Some embodiments described herein relate to root cause analysis, and in particular, to recommending potential root causes of failure using a visual timeline.
When a failure of an operation, such as a failure of a component or an input connection, occurs in a computing system, a configuration management database (CMDB) is typically used to find the root cause and to perform impact analysis of the infrastructure affected by the operation failure. The CMDB is used to store configuration items (i.e., components, connections, services, etc.) and their relationships, such as connections between the configuration items.
Often, change orders are used to document changes to the configuration items and connections between the configuration items. These changes are typically versioned for purposes such as auditing and tracking. Over a period of time, there can be numerous changes to a configuration item with multiple change orders directly and indirectly associated with the configuration item.
When finding the root cause of failure of an operation in a CMDB system, the current state of the configuration items is used. Correlating the configuration item's versioning (i.e., change history) with the current state is tedious and error-prone, especially when there are multiple change orders directly and indirectly associated with the configuration item.
Some embodiments are directed to a method by a computer of a computing system for providing one or more root cause analysis (RCA) graphs associated with a root cause of failure. The method includes receiving an indication of an operation that failed in a computer system associated with the operation. Starting at a baseline state of the operation and ending at a current state of the operation, a number of change orders that change one or more configuration items that are associated with the operation are determined. A baseline state RCA graph is displayed. The baseline state RCA graph represents a last known state of the operation where the operation passed testing. The baseline state RCA graph has a plurality of configuration items and connections between the plurality of configuration items in a first area of a display and a change order timeline in a second area of the display. For each of the number of change orders and responsive to receiving a user selection of a change order in the change order timeline, an RCA graph of a state of the computer system associated with the operation is displayed. The RCA graph of the state of the computer system associated with the operation has a number of the plurality of configuration items that remain associated with the operation and connections between the number of the plurality of configuration items displayed in the first area. Any of the remaining configuration items that were changed by the change order are highlighted. The RCA graph also has an indication of the change order displayed in the change order timeline, and a potential cause listing in a third area of the display. The potential cause listing displays a list of configuration items that are highlighted and a percentage indication by each configuration item in the list of configuration items. Each percentage indication represents a calculated percentage that the configuration item is a potential cause of the failure.
The method may further include displaying a number of new configuration items added by the change order in the first area and connections between the number of new configuration items and the number of the plurality of configuration items displayed in the first area and highlighting any of the number of new configuration items that were changed by the change order.
The RCA graph of the state of the computer system associated with the operation may further include a suggestion action displayed in a fourth area of the display responsive to a percentage indication of a configuration item listed in the third area of the display being above a first threshold level.
The method may further include, responsive to a highest calculated percentage being below a second threshold level, displaying on a currently displayed RCA graph a suggestion that a higher number of configuration item levels should be displayed. Responsive to receiving an indication of a number of configuration item levels to display, the method displays that number of configuration item levels in each RCA graph when the RCA graph is displayed.
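The two threshold checks described above can be illustrated with a minimal sketch. The function name `suggestions`, its return shape, and the threshold values used in the example are hypothetical and are not taken from the embodiments; the sketch only assumes that the calculated percentages are available as a mapping from configuration item name to percentage.

```python
def suggestions(percentages, first_threshold, second_threshold):
    """Return (items_warranting_action, suggest_more_levels).

    percentages: dict mapping a configuration item name to its calculated
    percentage of being a potential cause of the failure.
    Both thresholds are hypothetical, user-supplied values.
    """
    # A suggestion action is shown for any item above the first threshold.
    act_on = [name for name, pct in percentages.items() if pct > first_threshold]

    # When even the highest percentage is below the second threshold,
    # suggest displaying a higher number of configuration item levels.
    highest = max(percentages.values(), default=0.0)
    suggest_more_levels = highest < second_threshold

    return act_on, suggest_more_levels


# Example with assumed thresholds of 70 and 30.
act_on, more_levels = suggestions({"db": 80.0, "web": 20.0}, 70.0, 30.0)
```

In this sketch the two checks are independent, so a display may show a suggestion action, a suggestion to show more levels, both, or neither, depending on how the thresholds are configured.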
Corresponding configuration management systems configured to recommend potential root causes of failure of an operation of a computer system are disclosed. In some embodiments, the configuration management system includes an incident problem management engine configured to receive a message indicating an operation that failed in the computer system. The configuration management system further includes a configuration management database (CMDB) configured to receive and store change orders; store associations of configuration items with change orders and operations of computer systems; and store information regarding a baseline state for a plurality of operations performed by the computer system, each baseline state representing a last known state of an operation of the plurality of operations in which the operation passed testing, the information for each baseline state comprising a listing of a plurality of configuration items and connections between the plurality of configuration items. The configuration management system further includes a change management engine configured to, responsive to receiving an indication of the message indicating the operation that failed, fetch information from the CMDB regarding a baseline state for the operation that failed. The change management engine is further configured to identify, from the CMDB and starting at a baseline state of the operation and ending at a current state of the operation, a number of change orders that change one or more configuration items that are associated with the operation. The change management engine is further configured to, for each of the number of change orders, fetch information from the CMDB regarding configuration items changed by the change order. The change management engine is further configured to display a baseline state root cause analysis (RCA) graph in a first area of a display and a change order timeline in a second area of the display. The change management engine is further configured to, responsive to receiving a user selection of a change order in the timeline, display an RCA graph of a state of the computer system associated with the operation, wherein the RCA graph of the state of the computer system associated with the operation includes: a number of the plurality of configuration items that remain associated with the operation and connections between the number of the plurality of configuration items displayed in the first area, wherein any of the remaining configuration items that were changed by the change order are highlighted; a number of new configuration items when the change order indicates new configuration items have been added and connections between the number of new configuration items and the number of the plurality of configuration items displayed in the first area, wherein any of the number of new configuration items that were added by the change order are highlighted; an indication of the change order displayed in the change order timeline; and a potential cause listing in a third area of the display, wherein the potential cause listing displays a list of configuration items that are highlighted and a percentage indication by each configuration item in the list of configuration items, each percentage indication representing a calculated percentage that the configuration item is a potential cause of the failure.
The CMDB may further be configured to receive a new change order, parse the new change order to determine configuration items changed by the new change order and to determine configuration items added to or deleted from the computer system, and store information regarding the configuration items changed by the new change order and information regarding the configuration items added to or deleted from the computer system.
The change management engine may further be configured to: responsive to a highest calculated percentage being below a second threshold level, display on a currently displayed RCA graph a suggestion that a higher number of configuration item levels should be displayed; and responsive to receiving an indication of a number of configuration item levels to display, display, for each RCA graph, the number of configuration item levels in the RCA graph being displayed.
It is noted that aspects of the inventive concepts described with respect to one embodiment may be incorporated in different embodiments although not specifically described relative thereto. That is, all embodiments or features of any embodiments can be combined in any way and/or combination. These and other objects or aspects of the present inventive concepts are explained in detail in the specification set forth below.
Advantages that may be provided by various of the concepts disclosed herein include reducing occurrence of errors in determining a root cause of failure of an operation and reducing load on the networks used by: displaying changes made by a change order and a percentage indication that the configuration items changed are the root cause of failure; and providing an indication that a higher percentage possible root cause of failure is in another RCA graph of another change order.
Other methods, devices, and computer program products, and advantages will be or become apparent to one with skill in the art upon review of the following drawings and detailed description. It is intended that all such additional systems, methods, or computer program products and advantages be included within this description, be within the scope of the present inventive concepts, and be protected by the accompanying claims.
The accompanying drawings are included to provide a further understanding of the disclosure and are incorporated in and constitute a part of this application. In the drawings:
Embodiments of the present inventive concepts now will be described more fully hereinafter with reference to the accompanying drawings. Throughout the drawings, the same reference numbers are used for similar or corresponding elements. The inventive concepts may, however, be embodied in many different forms and should not be construed as limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the inventive concepts to those skilled in the art. Like numbers refer to like elements throughout.
It will be understood that, although the terms first, second, etc. may be used herein to describe various elements, these elements should not be limited by these terms. These terms are only used to distinguish one element from another. For example, a first element could be termed a second element, and, similarly, a second element could be termed a first element, without departing from the scope of the present inventive concepts.
As used herein, the term “or” is used nonexclusively to include any and all combinations of one or more of the associated listed items.
The terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting. As used herein, the singular forms “a”, “an” and “the” are intended to include the plural forms as well, unless the context clearly indicates otherwise. It will be further understood that the terms “comprises,” “comprising,” “includes” or “including” when used herein, specify the presence of stated features, integers, steps, operations, elements, or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, or groups thereof.
Unless otherwise defined, all terms (including technical and scientific terms) used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this disclosure belongs. It will be further understood that terms used herein should be interpreted as having a meaning that is consistent with their meaning in the context of this specification and the relevant art and will not be interpreted in an idealized or overly formal sense unless expressly so defined herein.
Some embodiments described herein provide methods or change management systems for recommending a potential root cause of failure of an operation of a computer system. According to some embodiments, the configuration management system includes an incident problem management engine configured to receive a message indicating an operation that failed in the computer system. The configuration management system further includes a configuration management database (CMDB) configured to receive and store change orders; store associations of configuration items with change orders and operations of computer systems; and store information regarding a baseline state for a plurality of operations performed by the computer system, each baseline state representing a last known state of an operation of the plurality of operations in which the operation passed testing, the information for each baseline state comprising a listing of a plurality of configuration items and connections between the plurality of configuration items. The configuration management system further includes a change management engine configured to, responsive to receiving an indication of the message indicating the operation that failed, fetch information from the CMDB regarding a baseline state for the operation that failed. The change management engine is further configured to identify, from the CMDB and starting at a baseline state of the operation and ending at a current state of the operation, a number of change orders that change one or more configuration items that are associated with the operation. The change management engine is further configured to, for each of the number of change orders, fetch information from the CMDB regarding configuration items changed by the change order. The change management engine is further configured to display a baseline state root cause analysis (RCA) graph in a first area of a display and a change order timeline in a second area of the display. The change management engine is further configured to, responsive to receiving a user selection of a change order in the timeline, display an RCA graph of a state of the computer system associated with the operation, wherein the RCA graph of the state of the computer system associated with the operation includes: a number of the plurality of configuration items that remain associated with the operation and connections between the number of the plurality of configuration items displayed in the first area, wherein any of the remaining configuration items that were changed by the change order are highlighted; a number of new configuration items when the change order indicates new configuration items have been added and connections between the number of new configuration items and the number of the plurality of configuration items displayed in the first area, wherein any of the number of new configuration items that were added by the change order are highlighted; an indication of the change order displayed in the change order timeline; and a potential cause listing in a third area of the display, wherein the potential cause listing displays a list of configuration items that are highlighted and a percentage indication by each configuration item in the list of configuration items, each percentage indication representing a calculated percentage that the configuration item is a potential cause of the failure.
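As a non-limiting illustration of how the contents of an RCA graph might be assembled for a selected change order, the following sketch derives the displayed configuration items, their connections, and the highlighted items. The function name `rca_view`, the dictionary keys, and the simplification of applying a single change order to a given prior state are all assumptions made for illustration and are not part of the described embodiments.

```python
def rca_view(prior_items, prior_links, order):
    """Build the node, link, and highlight sets for one selected change order.

    prior_items: set of configuration item names before the change order.
    prior_links: set of (item, item) pairs connecting those items.
    order: dict with hypothetical keys 'changed', 'added', 'removed'
           (sets of item names) and 'new_links' (set of pairs).
    """
    # Items that remain associated with the operation after the change order.
    remaining = prior_items - order["removed"]

    # The graph shows the remaining items plus any newly added items.
    nodes = remaining | order["added"]

    # Keep only connections whose two endpoints are both still displayed.
    links = {
        (a, b)
        for (a, b) in prior_links | order["new_links"]
        if a in nodes and b in nodes
    }

    # Highlight remaining items that were changed, and all newly added items.
    highlighted = (order["changed"] & remaining) | order["added"]

    return {"nodes": nodes, "links": links, "highlighted": highlighted}


# Hypothetical example: the operating system is patched, a cache is added,
# and the database is removed.
view = rca_view(
    {"server", "os", "db"},
    {("server", "os"), ("server", "db")},
    {
        "changed": {"os"},
        "added": {"cache"},
        "removed": {"db"},
        "new_links": {("server", "cache")},
    },
)
```

The highlighted set produced here corresponds to the configuration items that would appear in the potential cause listing; the percentage indication for each item is calculated separately.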
As further described in
Initially, at operation 200, the CMDB 102 receives and stores change orders for operations of computer systems that the CMDB 102 services. Each change order documents changes in one or more operations of a computer system serviced by the CMDB 102. A change order may document which users or groups of users are affected by the change, a classification of the change order, a type of the change order, an indication of who approved the change order, and what configuration items and relationships between configuration items are changed by the change order.
A configuration item may be a service, a device, a device component, software, a software update, a software patch, and the like. A relationship is the logical relation between two configuration items. For example, a computer server that contains a Windows operating system has a relationship between the server and the operating system. Whenever a configuration item is changed, a change order documents the change. Each change order may be associated with multiple configuration items.
In an embodiment, a classification of a change order specifies whether the change order is a major incident, an unauthorized change order, an emergency order, or none of the preceding classifications (i.e., is not a major incident, an unauthorized change order, or an emergency order). A major incident is defined by the entity controlling the CMDB 102. Each change order requires specified conditions to be met. If these conditions are not met, the change order is an unauthorized change order. When a business decides a change is urgent, the change order is classified as an emergency order.
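The classification rules above may be sketched as follows. The enumeration `Classification`, the function `classify`, and its three boolean inputs are hypothetical names introduced for illustration. The sketch also assumes that the classifications may co-occur on one change order, which the description does not state explicitly; an empty result stands for "none of the preceding classifications".

```python
from enum import Enum


class Classification(Enum):
    MAJOR_INCIDENT = "major incident"
    UNAUTHORIZED = "unauthorized change order"
    EMERGENCY = "emergency order"


def classify(conditions_met, urgent, major_incident):
    """Return the set of classifications that apply to a change order.

    conditions_met: True if the specified conditions for the change order
                    were met (otherwise the order is unauthorized).
    urgent:         True if the business decided the change is urgent.
    major_incident: True if the controlling entity's definition of a
                    major incident is satisfied (definition not modeled).
    """
    labels = set()
    if major_incident:
        labels.add(Classification.MAJOR_INCIDENT)
    if not conditions_met:
        labels.add(Classification.UNAUTHORIZED)
    if urgent:
        labels.add(Classification.EMERGENCY)
    return labels


# An urgent change pushed through without meeting the required conditions.
labels = classify(conditions_met=False, urgent=True, major_incident=False)
```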
The CMDB 102 performs operations on change orders. Turning to
Returning to
At operation 204, the CMDB 102 stores information regarding a baseline state for operations performed by the computer systems serviced by the CMDB 102. A baseline state is the last known state where the operation was working properly.
At operation 206, the incident/problem management engine 100 receives an indication of an operation that failed in a computer system associated with the operation. The failed operation may be a failure of a service, a failure of a device, or a failure of a component of a device. The indication may come from a monitoring system that monitors components and services in the computer system and issues alarms when failures occur, from a built-in test routine, from a user, from a help desk, etc. At operation 208, the incident/problem management engine 100 notifies the change management engine 104 of the failed operation.
At operation 210, the change management engine 104 transmits a request to the CMDB 102 for information regarding the failed operation. At operation 212, the CMDB 102 transmits the information requested to the change management engine 104. The information requested includes configuration items associated with the failed operation, change orders associated with the operation or the configuration items associated with the failed operation, and a baseline state of the operation.
At operation 214, the change management engine 104 determines, from the information from the CMDB 102, starting at the baseline state of the failed operation and ending at a current state of the failed operation, a number of change orders that changed one or more of the configuration items that are associated with the operation.
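The determination at operation 214 can be sketched as a filter over the change orders returned by the CMDB. The `ChangeOrder` class, the function `relevant_change_orders`, and the use of monotonically increasing change order numbers to mark the baseline and current states are assumptions made for illustration only. The sketch also treats the set of configuration items associated with the operation as fixed, whereas in practice it may grow as change orders add items.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ChangeOrder:
    number: int               # assumed to increase with time
    changed_items: frozenset  # names of configuration items it changed


def relevant_change_orders(orders, operation_items, baseline_number, current_number):
    """Return change orders made after the baseline, up to the current state,
    that changed at least one configuration item associated with the
    operation, ordered oldest first (as they would appear on a timeline)."""
    items = set(operation_items)
    selected = [
        order
        for order in orders
        if baseline_number < order.number <= current_number
        and items & order.changed_items
    ]
    return sorted(selected, key=lambda order: order.number)


# Hypothetical data: only orders 1 and 3 touch items of the failed operation
# within the baseline-to-current window.
orders = [
    ChangeOrder(1, frozenset({"db"})),
    ChangeOrder(3, frozenset({"web"})),
    ChangeOrder(2, frozenset({"mail"})),
    ChangeOrder(5, frozenset({"db"})),
]
timeline = relevant_change_orders(orders, {"db", "web"}, baseline_number=0, current_number=4)
```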
Turning to
Returning to
Turning to
Turning to
W1-W7 are user-configurable weights; MI is a number of major incidents reported for the change order between the change order number displayed and the current state, and is 0 if there are no major incidents; UC is 1 if the change order is classified as an unauthorized change order and 0 if the change order is not classified as an unauthorized change order; EM is 1 if the change order is classified as an emergency change and 0 if the change order is not classified as an emergency change; AC is 1 if any attribute of the configuration item has been changed and 0 if no attribute of the configuration item has been changed; AD is 1 if a configuration item has been added and 0 if a configuration item has not been added; DE is 1 if any configuration item was removed and 0 if no configuration item was removed; and DI is a focal distance of the configuration item from a focal configuration item. For example, in one embodiment, the weights W1 to W7 may be W1=30, W2=10, W3=30, W4=2, W5=10, W6=10, and W7=9. The Pcalc calculation with these weights is
Other weightings can be used. In operation 902, the Pcalc for each configuration item in the list is calculated and normalized using the sum of all Pcalc calculated for the configuration items in the list.
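The normalization step of operation 902 can be illustrated with the following sketch. The exact Pcalc formula is not reproduced in this text, so the sketch assumes, purely for illustration, a plain weighted sum in which W1 through W7 pair with MI, UC, EM, AC, AD, DE, and DI in the order the terms are listed; the actual calculation may combine the terms differently, particularly the focal distance DI. The names `WEIGHTS`, `pcalc`, and `normalized_percentages` are hypothetical.

```python
# Example weights from the description, paired with the factors in listed
# order. ASSUMPTION: this pairing is inferred, not stated in the text.
WEIGHTS = {"MI": 30, "UC": 10, "EM": 30, "AC": 2, "AD": 10, "DE": 10, "DI": 9}


def pcalc(factors, weights=WEIGHTS):
    """Raw score for one configuration item.

    ASSUMPTION: a plain weighted sum stands in for the Pcalc formula,
    which is not reproduced here. Missing factors default to 0.
    """
    return sum(weights[name] * factors.get(name, 0) for name in weights)


def normalized_percentages(items, weights=WEIGHTS):
    """Normalize each item's raw score by the sum of all raw scores,
    expressed as a percentage. Returns 0.0 for every item when the
    total is zero, to avoid dividing by zero."""
    raw = {name: pcalc(factors, weights) for name, factors in items.items()}
    total = sum(raw.values())
    if total == 0:
        return {name: 0.0 for name in raw}
    return {name: 100.0 * score / total for name, score in raw.items()}


# Hypothetical list: "db" was changed by an emergency change order,
# "web" only had an attribute changed.
percentages = normalized_percentages({
    "db": {"EM": 1, "AC": 1},
    "web": {"AC": 1},
})
```

Because each raw score is divided by the sum over the listed configuration items, the resulting percentages sum to 100 whenever at least one item has a nonzero score, regardless of the weights chosen.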
A user may want to change the weightings. Turning to
Returning to
Returning to
The suggestion action 700 has a user-selectable item. Turning to
Turning now to
Turning now to
Turning now to
In the embodiment shown in
In the embodiment shown in
In the embodiment shown in
In the embodiment shown in
Thus, example systems, methods, and non-transitory machine readable media for reducing occurrences of errors in determining a root cause of failure have been described. The advantages provided include reducing occurrence of errors in determining a root cause of failure of an operation; reducing load on the networks used by displaying changes made by a change order and a percentage indication that the configuration items changed are the root cause of failure; and providing an indication that a higher percentage possible root cause of failure is in another RCA graph of another change order, together with a link to that RCA graph. These advantages result in faster identification of a root cause of a failure of a failed operation of a computer system.
As will be appreciated by one of skill in the art, the present inventive concepts may be embodied as a method, data processing system, or computer program product. Furthermore, the present inventive concepts may take the form of a computer program product on a tangible computer usable storage medium having computer program code embodied in the medium that can be executed by a computer. Any suitable tangible computer readable medium may be utilized including hard disks, CD ROMs, optical storage devices, or magnetic storage devices.
Some embodiments are described herein with reference to flowchart illustrations or block diagrams of methods, systems and computer program products. It will be understood that each block of the flowchart illustrations or block diagrams, and combinations of blocks in the flowchart illustrations or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general-purpose computer, special purpose computer, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions/acts specified in the flowchart or block diagram block or blocks.
These computer program instructions may also be stored in a computer readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer readable memory produce an article of manufacture including instruction means which implement the function/act specified in the flowchart or block diagram block or blocks.
The computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions/acts specified in the flowchart or block diagram block or blocks.
It is to be understood that the functions/acts noted in the blocks may occur out of the order noted in the operational illustrations. For example, two blocks shown in succession may in fact be executed substantially concurrently or the blocks may sometimes be executed in the reverse order, depending upon the functionality/acts involved. Although some of the diagrams include arrows on communication paths to show a primary direction of communication, it is to be understood that communication may occur in the opposite direction to the depicted arrows.
Computer program code for carrying out operations described herein may be written in an object-oriented programming language such as Java® or C++. However, the computer program code for carrying out operations described herein may also be written in conventional procedural programming languages, such as the “C” programming language. The program code may execute entirely on the user's computer, partly on the user's computer, as a standalone software package, partly on the user's computer and partly on a remote computer or entirely on the remote computer. In the latter scenario, the remote computer may be connected to the user's computer through a local area network (LAN) or a wide area network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet Service Provider).
Many different embodiments have been disclosed herein, in connection with the above description and the drawings. It will be understood that it would be unduly repetitious and obfuscating to literally describe and illustrate every combination and subcombination of these embodiments. Accordingly, all embodiments can be combined in any way or combination, and the present specification, including the drawings, shall be construed to constitute a complete written description of all combinations and subcombinations of the embodiments described herein, and of the manner and process of making and using them, and shall support claims to any such combination or subcombination.
In the drawings and specification, there have been disclosed typical embodiments and, although specific terms are employed, they are used in a generic and descriptive sense only and not for purposes of limitation, the scope of the inventive concepts being set forth in the following claims.