Some embodiments described herein relate to root cause analysis, and in particular to controlling monitoring based on root cause analysis.
Monitoring systems monitor a set of entities at a defined frequency and publish metrics, topology of the set of entities being monitored, and events and alarms associated with the set of entities. These monitoring systems monitor their respective entities at configured intervals without knowledge of an actual situation occurring or a root cause of the actual situation, which can be outside the purview of the monitoring systems.
When a component or an input connection fails, the monitoring systems that monitor entities affected by the failure continue to poll those entities at the configured intervals. Each poll fails, and the monitoring systems raise events/alarms at the configured intervals for the same situation until the root cause of the failure is resolved.
The continual raising of events/alarms places unnecessary load on the networks used by the monitoring systems. When these monitoring systems provide data to another monitoring system, manager, or analytics platform, all of the events/alarms sent at the configured intervals are forwarded to the other monitoring systems/managers and/or analytics platform, unnecessarily loading the networks used by those systems as well.
Some embodiments are directed to a method in a root cause analysis (RCA) engine of a networked hardware device for instructing registered monitoring systems to stop monitoring symptoms associated with a root cause of a failure. The method includes receiving an alarm on an entity. Correlation domains are fetched based on the correlation domains each having been registered as being associated with the entity and in which the alarm is part of a policy applied to the correlation domains. A determination is made of whether the alarm is for a root cause of failure for one of the entities associated with one of the correlation domains. Responsive to the alarm being for a root cause of failure for the one of the entities associated with the one of the correlation domains, a message is transmitted, via a network interface, to registered monitoring systems for the one of the correlation domains, the message comprising an instruction for the registered monitoring systems to stop monitoring symptom conditions associated with the root cause of failure for entities in the one of the correlation domains.
The method may further include obtaining root causes of failures of the entities of the plurality of entities and indicated connections existing between entities of the plurality of entities. The method determines symptom conditions for each of the root causes of failures that are obtained. The method determines which one of the symptom conditions is a symptom condition of the entity of the plurality of entities having a failure that is the root cause. Rules are further based on the symptom conditions and the symptom condition of the entity having the failure.
The method may further include receiving a clear indication for a second alarm on a second entity. The method fetches second correlation domains based on the second correlation domains each having been associated with the second entity and which the second alarm is part of a second policy applied in each of the second correlation domains. The method determines if the second alarm is for a second root cause for an entity in one of the second correlation domains. Responsive to the second alarm being for the second root cause for the entity in the one of the second correlation domains, the method determines if the second root cause has been cleared and responsive to determining that the second root cause is cleared, transmits, through the network interface, a second message to registered monitoring systems for the one of the second correlation domains. The second message contains an instruction for the registered monitoring systems to restart monitoring symptom conditions associated with the second root cause for entities in the one of the second correlation domains.
Corresponding RCA engines of a hardware computer are disclosed. In some embodiments, the RCA engine includes a processor and a memory coupled to the processor, wherein the memory stores computer program instructions that are executed by the processor to perform operations that include receiving an alarm on an entity. The operations further include fetching correlation domains based on the correlation domains each having been associated with the entity and in which the alarm is part of a policy applied to the correlation domains. The operations further include determining if the alarm is for a root cause of failure for an entity in one of the correlation domains. The operations further include, responsive to the alarm being for the root cause of failure for the entity in the one of the correlation domains, transmitting, via a network interface used by the RCA engine, a message to registered monitoring systems for the one of the correlation domains. The message contains instructions for the registered monitoring systems to stop monitoring symptom conditions associated with the root cause of failure for entities in the one of the correlation domains. The operations further include transmitting, through the network, an indication of a failure of the one of the entities associated with the one of the correlation domains that is the root cause of failure.
The operations of the RCA engine may further include, for each correlation domain of the correlation domains, configuring the correlation domain based on accessing a topology data structure that defines a plurality of entities including the entity and an indication of connections existing between entities of the plurality of entities, wherein configuration of the correlation domain generates a correlation data structure identifying entities in the plurality of entities with indicated correlations and a policy applied to the correlation domain.
A monitoring system is also described. In one embodiment, the monitoring system includes a processor and a memory coupled to the processor, wherein the memory stores computer program instructions that are executed by the processor to perform operations including transmitting a registration request to a root cause analysis (RCA) engine for monitoring instructions for entities monitored by the monitoring system. The operations further include monitoring the entities. The operations further include receiving, from the RCA engine, a message having an instruction to stop monitoring entities listed in the message. The operations further include stopping monitoring of the listed entities responsive to receiving the message.
The monitoring system may perform further operations including receiving, from the RCA engine, a second message containing instructions to resume monitoring the listed entities. The operations further include, responsive to receiving the second message, resuming monitoring of the listed entities.
It is noted that aspects of the inventive concepts described with respect to one embodiment may be incorporated in different embodiments although not specifically described relative thereto. That is, all embodiments or features of any embodiments can be combined in any way and/or combination. These and other objects or aspects of the present inventive concepts are explained in detail in the specification set forth below.
Advantages that may be provided by various of the concepts disclosed herein include reducing the occurrence of events and alarms reported by monitoring systems, reducing load on the networks used by the monitoring systems, and reducing unnecessary loading of the networks used by the systems/managers and/or analytic platforms to which the events and alarms are sent.
Other methods, devices, and computer program products, and advantages will be or become apparent to one with skill in the art upon review of the following drawings and detailed description. It is intended that all such additional systems, methods, or computer program products and advantages be included within this description, be within the scope of the present inventive concepts, and be protected by the accompanying claims.
The accompanying drawings are included to provide a further understanding of the disclosure and are incorporated in and constitute a part of this application. In the drawings:
Embodiments of the present inventive concepts now will be described more fully hereinafter with reference to the accompanying drawings. Throughout the drawings, the same reference numbers are used for similar or corresponding elements. The inventive concepts may, however, be embodied in many different forms and should not be construed as limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the inventive concepts to those skilled in the art. Like numbers refer to like elements throughout.
It will be understood that, although the terms first, second, etc. may be used herein to describe various elements, these elements should not be limited by these terms. These terms are only used to distinguish one element from another. For example, a first element could be termed a second element, and, similarly, a second element could be termed a first element, without departing from the scope of the present inventive concepts. As used herein, the term “or” is used nonexclusively to include any and all combinations of one or more of the associated listed items.
The terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting. As used herein, the singular forms “a”, “an” and “the” are intended to include the plural forms as well, unless the context clearly indicates otherwise. It will be further understood that the terms “comprises,” “comprising,” “includes” or “including” when used herein, specify the presence of stated features, integers, steps, operations, elements, or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, or groups thereof.
Unless otherwise defined, all terms (including technical and scientific terms) used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this disclosure belongs. It will be further understood that terms used herein should be interpreted as having a meaning that is consistent with their meaning in the context of this specification and the relevant art and will not be interpreted in an idealized or overly formal sense unless expressly so defined herein.
Some embodiments described herein provide methods or RCA engines for controlling a monitoring system to stop and resume monitoring. According to some embodiments, an RCA engine receives an alarm on an entity. Correlation domains are fetched based on the correlation domains each having been associated with the entity and in which the alarm is part of a policy applied to the correlation domains. A determination is made of whether the alarm is for a root cause of failure for one of the entities associated with one of the correlation domains. Responsive to the alarm being for a root cause of failure for the one of the entities associated with the one of the correlation domains, a message is transmitted, via a network interface, to registered monitoring systems for the one of the entities associated with the one of the correlation domains, the message comprising an instruction for the registered monitoring systems to stop monitoring symptom conditions associated with the root cause of failure for the one of the entities associated with the one of the correlation domains.
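The flow summarized above can be illustrated with a minimal Python sketch. It is not the disclosed implementation: the `Domain` class, the `is_root_cause` predicate, the message layout, and the entity and monitoring system names are all hypothetical, and the sketch simply returns the messages that would be transmitted rather than using a network interface.

```python
from dataclasses import dataclass
from typing import Callable, FrozenSet, Set


@dataclass(frozen=True)
class Domain:
    name: str
    entities: FrozenSet[str]          # inter-related entities in the domain
    policy_alarms: FrozenSet[str]     # alarm types covered by the domain's policy
    # Hypothetical predicate: True when the alarm on `entity` is a root cause,
    # given the set of entities that currently have alarms.
    is_root_cause: Callable[[str, Set[str]], bool]


def handle_alarm(domains, subscribers, alarmed, entity, alarm_type):
    """Return the (monitoring system, message) pairs that would be transmitted."""
    outbox = []
    for domain in domains:
        # Fetch step: the domain must be associated with the entity and the
        # alarm must be part of the policy applied to the domain.
        if entity not in domain.entities or alarm_type not in domain.policy_alarms:
            continue
        # Determination step: only a root cause alarm triggers a stop message.
        if domain.is_root_cause(entity, alarmed):
            symptoms = sorted(domain.entities - {entity})
            for system in sorted(subscribers.get(domain.name, ())):
                outbox.append((system, {"instruction": "stop", "entities": symptoms}))
    return outbox


rack = Domain(
    name="rack",
    entities=frozenset({"port", "server", "vm"}),
    policy_alarms=frozenset({"unreachable"}),
    is_root_cause=lambda entity, alarmed: entity == "port" and {"server", "vm"} <= alarmed,
)
subscribers = {"rack": {"mon-2"}}
alarmed = {"port", "server", "vm"}

stop_messages = handle_alarm([rack], subscribers, alarmed, "port", "unreachable")
symptom_messages = handle_alarm([rack], subscribers, alarmed, "server", "unreachable")
```

In this sketch an alarm on the hypothetical `port` entity produces one stop message for `mon-2`, while an alarm on `server` produces none because the predicate does not classify it as a root cause.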
The RCA engine 100 also communicates with the topology service 106. In various embodiments, the RCA engine 100 receives a topology data structure from the topology service 106, determines rules and policies based on the topology data structure, and stores them in rule and policy database 110. The RCA engine 100 also configures correlation domains and stores the correlation domains in correlation domains database 112.
As further described in
Initially at step 200, one of the monitoring systems 102 transmits an alarm on an entity 104. The RCA engine 100 receives the alarm on the entity 104 at step 202. At step 204, the RCA engine 100 fetches all corresponding correlation domains based on the correlation domains each having been associated with the entity 104 and in which the alarm is part of a policy applied to the correlation domains. Each correlation domain is a set of entities 104 which are inter-related. An entity 104 may be affected by another entity 104 based on a condition.
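One possible way to make the fetch at step 204 inexpensive is an inverted index from (entity, alarm type) pairs to correlation domain names. The following Python sketch assumes that approach; the index, the dictionary layout, and the domain and entity names are illustrative assumptions and are not described in the text.

```python
from collections import defaultdict


def build_index(domains):
    """Map each (entity, alarm type) pair to the names of matching domains.

    `domains` maps a domain name to a pair: (entities in the domain,
    alarm types that are part of the policy applied to the domain).
    """
    index = defaultdict(list)
    for name, (entities, policy_alarms) in domains.items():
        for entity in sorted(entities):
            for alarm_type in sorted(policy_alarms):
                index[(entity, alarm_type)].append(name)
    return index


# Hypothetical correlation domains; the same entity may belong to several.
domains = {
    "network": ({"port", "switch"}, {"unreachable"}),
    "storage": ({"port", "disk"}, {"latency"}),
}
index = build_index(domains)

# Fetch all domains associated with the entity whose policy covers the alarm.
fetched = index.get(("port", "unreachable"), [])
not_covered = index.get(("disk", "unreachable"), [])
```

Here `port` belongs to two domains, but only the `network` domain is fetched for an `unreachable` alarm because that alarm is not part of the `storage` policy.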
In an embodiment, the RCA engine 100 configures the correlation domain. Turning to
The topology data structure in one embodiment is created by the topology service 106. Turning to
Returning to
Turning to
In the embodiment described in
At step 604, a policy is determined based on a combination of the rules and is applied to the topology data structure of the correlation domain. For example, the policy can be a combination of rules numbering anywhere from 2 rules to n rules. The rules and policy are stored in rule/policy database 110. The correlation domain is stored in correlation domain database 112 at step 606.
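As an illustration of a policy built from a combination of rules, the Python sketch below treats each rule as a predicate over the set of entities that currently have alarms and combines the rules by conjunction. Combining by conjunction, the rule names, and the entity names are assumptions made for the example; the text does not specify how rules are combined.

```python
from typing import Callable, FrozenSet, Sequence

# A rule inspects the set of entities that currently have alarms.
Rule = Callable[[FrozenSet[str]], bool]


def combine(rules: Sequence[Rule]) -> Rule:
    """Build a policy that holds only when every rule holds (an assumption)."""
    def policy(alarmed: FrozenSet[str]) -> bool:
        return all(rule(alarmed) for rule in rules)
    return policy


def downstream_alarmed(alarmed: FrozenSet[str]) -> bool:
    # Hypothetical rule: every downstream entity has an alarm.
    return {"switch", "server"} <= alarmed


def upstream_quiet(alarmed: FrozenSet[str]) -> bool:
    # Hypothetical rule: the upstream entity has no alarm.
    return "router" not in alarmed


port_policy = combine([downstream_alarmed, upstream_quiet])

port_is_root_cause = port_policy(frozenset({"port", "switch", "server"}))
port_is_symptom = port_policy(frozenset({"router", "port", "switch", "server"}))
```

With two rules combined, the policy classifies the port alarm as a root cause only when the downstream entities are alarmed and the upstream entity is not.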
In the embodiment described in
The topology data structure is a dynamic structure. For example, a monitoring system 102 may add or remove an entity 104 the monitoring system 102 is monitoring. When this occurs, the monitoring system 102 provides an updated topology to the topology service 106. The topology service 106 updates the topology data structure based on changes of topologies provided by the monitoring systems 102. Turning now to
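A minimal sketch of how a topology service might keep such a dynamic structure current is shown below. The `TopologyService` class, its `apply_update` method, and the choice to prune connections that reference removed entities are assumptions for illustration only.

```python
class TopologyService:
    """Holds entities and the connections existing between them."""

    def __init__(self):
        self.entities = set()
        self.connections = set()  # each connection is a frozenset of two entity ids

    def apply_update(self, added=(), removed=(), connections=()):
        """Apply a topology change reported by a monitoring system."""
        self.entities |= set(added)
        self.entities -= set(removed)
        for a, b in connections:
            self.connections.add(frozenset((a, b)))
        # Drop connections that touch an entity that is no longer present.
        self.connections = {c for c in self.connections if c <= self.entities}


topology = TopologyService()
topology.apply_update(
    added=["port", "switch", "server"],
    connections=[("port", "switch"), ("switch", "server")],
)
# A monitoring system stops monitoring the switch and reports the change.
topology.apply_update(removed=["switch"])
```

After the second update only `port` and `server` remain, and both connections are pruned because each referenced the removed entity.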
Returning to
Monitoring systems 102 become registered by sending a registration request to the RCA engine 100. The registration request contains an identification of the entities 104 the monitoring system 102 is monitoring. Turning now to
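Registration can be sketched as maintaining a subscription list per correlation domain. In the Python example below, a monitoring system is added to the list of every domain that shares at least one entity with it; that matching criterion, the function name, and the identifiers are assumptions rather than details taken from the text.

```python
from collections import defaultdict


def register(subscriptions, domains, system_id, monitored_entities):
    """Subscribe a monitoring system to each domain containing one of its entities."""
    monitored = set(monitored_entities)
    for name, entities in domains.items():
        if entities & monitored:
            subscriptions[name].add(system_id)


# Hypothetical correlation domains and registration requests.
domains = {"network": {"port", "switch"}, "compute": {"server", "vm"}}
subscriptions = defaultdict(set)

register(subscriptions, domains, "mon-1", ["port"])
register(subscriptions, domains, "mon-2", ["server", "vm", "switch"])
```

The resulting subscription lists determine which monitoring systems receive the stop and restart messages for a given correlation domain.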
Turning to
Returning to
Turning now to
At step 222, if the root cause is not cleared, the RCA engine 100 returns to step 214 and waits for another clear for an alarm on an entity 104. If the root cause is cleared, the RCA engine 100 at step 224 transmits a message to registered monitoring systems 102 in the subscription list(s) for the corresponding correlation domain(s). The message contains instructions to restart monitoring symptom conditions for entities 104 in the corresponding correlation domains.
The monitoring system 102 receives the message to restart monitoring at step 226. At step 228, the monitoring system 102 restarts monitoring symptom conditions for corresponding entities 104 in which monitoring was stopped.
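The clear handling of steps 222 and 224 might look like the following Python sketch. The bookkeeping of active root causes as (domain, entity) pairs, the `send` callback, and the message fields are hypothetical choices made for the example.

```python
def handle_clear(active_root_causes, subscriptions, domain, entity, send):
    """Process a clear for an alarm; return True if restart messages were sent."""
    key = (domain, entity)
    if key not in active_root_causes:
        # The clear is not for a root cause: keep waiting for another clear.
        return False
    active_root_causes.discard(key)
    # The root cause is cleared: tell every registered monitoring system in the
    # domain's subscription list to restart monitoring symptom conditions.
    for system_id in sorted(subscriptions.get(domain, ())):
        send(system_id, {"instruction": "restart", "domain": domain})
    return True


sent = []
active = {("rack", "port")}
subscriptions = {"rack": {"mon-2", "mon-3"}}


def record(system_id, message):
    sent.append((system_id, message["instruction"]))


symptom_clear = handle_clear(active, subscriptions, "rack", "vm", record)
root_clear = handle_clear(active, subscriptions, "rack", "port", record)
```

A clear for the hypothetical `vm` symptom sends nothing, while the clear for the `port` root cause sends a restart message to both subscribed monitoring systems.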
Turning now to
Turning to
Turning to
At step 1306, the monitoring system 102 receives, from the RCA engine 100, a fourth message having an instruction to resume monitoring of the monitored entity 104. At step 1308, responsive to receiving the fourth message, the monitoring system 102 resumes monitoring of the monitored entity 104.
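On the monitoring system side, stopping and resuming can be modeled as a set of paused entities that is excluded from each polling cycle. The Python sketch below is one such model; the class, the message format, and the entity names are assumptions.

```python
class MonitoringSystem:
    """Tracks which monitored entities are currently paused."""

    def __init__(self, entities):
        self.entities = set(entities)
        self.paused = set()

    def on_message(self, message):
        """Handle a stop or resume instruction from the RCA engine."""
        # Only entities this system actually monitors are affected.
        listed = set(message["entities"]) & self.entities
        if message["instruction"] == "stop":
            self.paused |= listed
        elif message["instruction"] == "resume":
            self.paused -= listed

    def entities_to_poll(self):
        """Entities to poll at the next configured interval."""
        return sorted(self.entities - self.paused)


monitor = MonitoringSystem(["server", "vm-1", "vm-2"])

monitor.on_message({"instruction": "stop", "entities": ["vm-1", "vm-2", "unknown"]})
while_stopped = monitor.entities_to_poll()

monitor.on_message({"instruction": "resume", "entities": ["vm-1", "vm-2"]})
after_resume = monitor.entities_to_poll()
```

While stopped, the system polls only `server`, so no repeated alarms are raised for the paused entities; after the resume message all three entities are polled again.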
An example of how the RCA engine 100 receives an alarm and provides the messages to the monitoring system 102 to stop monitoring and resume monitoring shall now be described. Turning now to
Monitoring system 102₁ will transmit an alarm on the port 1402. Monitoring system 102₂ will transmit an alarm on each of ESX server 1406 and VMs 1408₁, 1408₂, and 1408₃ as the monitoring system 102₂ is unable to contact them. Monitoring system 102₃ will transmit an alarm on each of the applications 1410₁, 1410₂, and 1410₃ as monitoring system 102₃ is unable to contact them. The RCA engine 100 receives the alarms from the three monitoring systems 102₁, 102₂, and 102₃. For each of the alarms from monitoring systems 102₂ and 102₃, the RCA engine 100 fetches all corresponding correlation domains based on the correlation domains each having been associated with the entities 1406, 1408₁, 1408₂, and 1408₃, or entities 1410₁, 1410₂, and 1410₃, respectively, and in which the alarms are part of a policy applied to the corresponding correlation domains. The RCA engine 100 determines that these alarms are not for a root cause of failure of any of the entities 1406, 1408₁, 1408₂, and 1408₃, or entities 1410₁, 1410₂, and 1410₃, respectively. For example, the policy for correlation domains associated with entities 1406, 1408₁, 1408₂, and 1408₃ may have a rule that indicates that if there are alarms for all of the entities 1406, 1408₁, 1408₂, and 1408₃, then the alarms are for a symptom condition and are not alarms for a root cause of failure. The policy for correlation domains associated with entities 1410₁, 1410₂, and 1410₃ may have a similar rule that indicates that if there are alarms for all of the entities 1410₁, 1410₂, and 1410₃, then the alarms are for a symptom condition and are not alarms for a root cause of failure. The RCA engine 100 transmits a message to monitoring system 102₂ having instructions for the monitoring system 102₂ to stop monitoring symptom conditions for entities 1406, 1408₁, 1408₂, and 1408₃.
The RCA engine 100 transmits a message to monitoring system 102₃ having instructions for the monitoring system 102₃ to stop monitoring symptom conditions for entities 1410₁, 1410₂, and 1410₃. The message to the monitoring system 102₂ and the message to the monitoring system 102₃ may be the same message or different messages. Responsive to receiving the message, the monitoring systems 102₂, 102₃ stop monitoring symptom conditions for entities 1406, 1408₁, 1408₂, 1408₃, 1410₁, 1410₂, and 1410₃.
For the alarm from monitoring system 102₁, the RCA engine 100 fetches all corresponding correlation domains based on the correlation domains each having been associated with the entity (i.e., port 1402) and in which the alarm is part of a policy applied to the corresponding correlation domain. The RCA engine 100 determines the alarm is for a root cause of failure of the port 1402. For example, the policy for correlation domains associated with entity 1402 may have a rule that if there are alarms for entities 1404, 1406, 1408₁, 1408₂, 1408₃, 1410₁, 1410₂, and 1410₃ and there are no alarms for entity 1400, then the alarm for port 1402 is an alarm for a root cause of failure of the port 1402. The RCA engine 100 transmits an indication of a failure of port 1402 to terminals of users, such as technicians, that are responsible for the port 1402. The RCA engine 100 also transmits a message to monitoring system 102₁ having instructions to stop monitoring symptom conditions of port 1402.
Once the port 1402 has been repaired, the RCA engine 100 will receive a clear alarm for the port 1402 from monitoring system 102₁. The RCA engine 100 will transmit a message to monitoring system 102₂ having instructions for the monitoring system 102₂ to resume monitoring symptom conditions for entities 1406, 1408₁, 1408₂, and 1408₃. The RCA engine 100 transmits a message to monitoring system 102₃ having instructions for the monitoring system 102₃ to resume monitoring symptom conditions for entities 1410₁, 1410₂, and 1410₃. The message to the monitoring system 102₂ and the message to the monitoring system 102₃ may be the same message or different messages. Responsive to receiving the message, the monitoring systems 102₂, 102₃ resume monitoring symptom conditions for entities 1406, 1408₁, 1408₂, 1408₃, 1410₁, 1410₂, and 1410₃.
The root cause of failure may be for an entity 104 that is not being monitored by monitoring systems 102₁, 102₂, or 102₃. For example, the power for entities 1400-1410 may be provided by the same power supply, which is monitored by a different monitoring system 102. When there is a failure in the power supply, the monitoring systems 102₁, 102₂, and 102₃ will be transmitting alarms for each of the entities 1400-1410. The RCA engine 100 will receive the alarms from monitoring systems 102₁, 102₂, and 102₃ and determine that none of the alarms are for a root cause of failure. For example, the policy associated with the entities 1400-1410 may have a rule that indicates that if every one of the entities 1400-1410 has an alarm, then the alarms are for symptom conditions and not a root cause of failure. The RCA engine 100 will transmit one or more messages to the monitoring systems 102₁, 102₂, and 102₃ having instructions to stop monitoring the symptom conditions for the entities 1400-1410. The RCA engine 100 will receive the alarm for the power supply and determine the alarm is for a root cause of failure. After the power supply is repaired or replaced, the RCA engine 100 will receive a clear alarm for the power supply. The RCA engine 100 will then send one or more messages to the monitoring systems 102₁, 102₂, and 102₃ having instructions to resume monitoring the symptom conditions for the entities 1400-1410. The monitoring systems 102₁, 102₂, and 102₃ will then resume monitoring the symptom conditions for the entities 1400-1410.
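The example rule for port 1402 can be written out as a short Python sketch. The string identifiers (with `_1`, `_2`, `_3` suffixes standing in for the indexed reference numerals) and the function name are shorthand introduced here; the rule itself follows the example above, in which the port alarm is a root cause only if the listed downstream entities have alarms and entity 1400 does not.

```python
# Entities named in the example rule for port 1402.
DOWNSTREAM = {
    "1404", "1406",
    "1408_1", "1408_2", "1408_3",
    "1410_1", "1410_2", "1410_3",
}


def classify_port_alarm(alarmed):
    """Classify the alarm on port 1402 given the set of alarmed entities."""
    if "1402" not in alarmed:
        return "no alarm"
    if DOWNSTREAM <= alarmed and "1400" not in alarmed:
        return "root cause"
    return "symptom"


# Port failure: the port and everything listed in the rule are alarmed, 1400 is not.
port_failure = DOWNSTREAM | {"1402"}

# Power supply failure: every one of the entities 1400-1410 is alarmed.
power_failure = port_failure | {"1400"}

port_case = classify_port_alarm(port_failure)
power_case = classify_port_alarm(power_failure)
```

In the port failure case the alarm on 1402 is classified as a root cause, whereas in the power supply case the alarm on 1400 causes the same port alarm to be treated as a symptom, consistent with the two scenarios described above.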
In the embodiment shown in
In the embodiment shown in
In the embodiment shown in
In the embodiment shown in
In the embodiment shown in
Thus, example systems, methods, and tangible non-transitory machine readable media for controlling monitoring systems to stop and start monitoring have been described. The advantages provided include reduction in network load of the monitoring systems, reduction in network load of systems using events/alarms provided by the monitoring systems, and the like.
As will be appreciated by one of skill in the art, the present inventive concepts may be embodied as a method, data processing system, or computer program product. Furthermore, the present inventive concepts may take the form of a computer program product on a tangible computer usable storage medium having computer program code embodied in the medium that can be executed by a computer. Any suitable tangible computer readable medium may be utilized including hard disks, CD ROMs, optical storage devices, or magnetic storage devices.
Some embodiments are described herein with reference to flowchart illustrations or block diagrams of methods, systems and computer program products. It will be understood that each block of the flowchart illustrations or block diagrams, and combinations of blocks in the flowchart illustrations or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general-purpose computer, special purpose computer, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions/acts specified in the flowchart or block diagram block or blocks.
These computer program instructions may also be stored in a computer readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer readable memory produce an article of manufacture including instruction means which implement the function/act specified in the flowchart or block diagram block or blocks.
The computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions/acts specified in the flowchart or block diagram block or blocks.
It is to be understood that the functions/acts noted in the blocks may occur out of the order noted in the operational illustrations. For example, two blocks shown in succession may in fact be executed substantially concurrently or the blocks may sometimes be executed in the reverse order, depending upon the functionality/acts involved. Although some of the diagrams include arrows on communication paths to show a primary direction of communication, it is to be understood that communication may occur in the opposite direction to the depicted arrows.
Computer program code for carrying out operations described herein may be written in an object-oriented programming language such as Java® or C++. However, the computer program code for carrying out operations described herein may also be written in conventional procedural programming languages, such as the “C” programming language. The program code may execute entirely on the user's computer, partly on the user's computer, as a standalone software package, partly on the user's computer and partly on a remote computer or entirely on the remote computer. In the latter scenario, the remote computer may be connected to the user's computer through a local area network (LAN) or a wide area network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet Service Provider).
Many different embodiments have been disclosed herein, in connection with the above description and the drawings. It will be understood that it would be unduly repetitious and obfuscating to literally describe and illustrate every combination and subcombination of these embodiments. Accordingly, all embodiments can be combined in any way or combination, and the present specification, including the drawings, shall be construed to constitute a complete written description of all combinations and subcombinations of the embodiments described herein, and of the manner and process of making and using them, and shall support claims to any such combination or subcombination.
In the drawings and specification, there have been disclosed typical embodiments and, although specific terms are employed, they are used in a generic and descriptive sense only and not for purposes of limitation, the scope of the inventive concepts being set forth in the following claims.
Number | Date | Country | |
---|---|---|---|
20200065173 A1 | Feb 2020 | US |