The present invention relates to a management system and a management method for managing a monitoring target apparatus included in a computer system and relates to, for example, a management system and a management method for providing a root cause analysis result.
When a computer system is managed, for example, as disclosed in Patent Literature 1, an event as a cause is detected out of plural failures or signs of the failures detected in the system. This is called root cause analysis (RCA). More specifically, in Patent Literature 1, excess of a threshold of a performance value in a managed apparatus is converted into an event using management software and information is accumulated in an event DB.
This management software includes an analysis engine for analyzing the causality of plural failure events that have occurred in the managed apparatus. This analysis engine accesses a configuration DB including inventory information of the managed apparatus, recognizes apparatus internal components present on an I/O path, and recognizes the components which could affect the performance of a logical volume on a host as a group called “topology”. When an event occurs, the analysis engine applies an analysis rule including a predetermined conditional sentence and an analysis result to topologies and establishes an expansion rule. This expansion rule includes a cause event as a cause of performance deterioration in other apparatuses and a related event group caused by the cause event. Specifically, an event described as a cause of a failure in a THEN part of the rule is a cause event and events other than the cause event among condition events described in an IF part are related events.
In the root cause analysis technique explained above, cause analysis processing with a high certainty factor cannot be realized unless a management computer detects events or states concerning a large number of conditions described in the expansion rule. In other words, an event described in a conclusion part of the expansion rule cannot be determined as a root cause one hundred percent unless all events or conditions of a condition part described in the expansion rule are detected. Therefore, even if there is a monitoring target apparatus determined as a failure cause from state information of connected peripheral monitoring target apparatuses, a root cause cannot be specified with a high certainty factor unless a state which should be received from the monitoring target apparatus determined as the root cause is detected.
From such a viewpoint, it is desirable that the management computer detects, as much as possible, events or states which the monitoring target apparatuses could cause.
However, in the management target apparatuses, when the management computer attempts to detect events and states, which should be set as monitoring targets, as many as possible, a processing load, detection time, memory usage, and the like increase. As a result, a problem occurs in that management cost for the monitoring target apparatuses increases.
The present invention has been devised in view of such circumstances and realizes root cause analysis processing with a high certainty factor while holding down management cost.
In order to solve the problems, in the present invention, besides one or more condition events which could occur in a node apparatus, an additional event different from the condition events is introduced into an analysis rule for a root cause analysis. This analysis rule indicates a relation between the condition events and additional event and a conclusion event recognized as a failure factor according to satisfaction of the condition events and additional event. The additional event is a command for instructing execution of an action for acquiring additional information from the node apparatus according to a satisfaction state of the one or more condition events. A detected event or a state is applied to the analysis rule, a certainty factor as information indicating possibility of occurrence of a failure in the node apparatus is calculated on the basis of satisfaction or non-satisfaction of the condition events and an execution result of the action, and a root cause analysis result is generated. The obtained root cause analysis result is output according to necessity.
According to the present invention, it is possible to realize root cause analysis processing with a high certainty factor while holding down management cost.
Further problems, configurations, and operational effects other than those explained above are made clear by the modes for carrying out the present invention explained below and the attached drawings.
An embodiment of the present invention is explained below with reference to the accompanying drawings. However, it should be noted that this embodiment is only an example for realizing the present invention and does not limit the technical scope of the present invention. In the figures, common components are denoted by the same reference numbers.
In this specification, information used in the present invention is explained by an expression “aaa table”. However, the information may be represented by an expression such as “aaa table,” “aaa list,” “aaa DB” or “aaa queue” or may be represented by an expression other than data structures such as a table, a list, a DB, and a queue. Therefore, in order to indicate that information used in the present invention does not depend on a data structure, in some cases, “aaa table,” “aaa chart,” “aaa list,” “aaa DB,” “aaa queue” and the like are called “aaa information.”
When contents of information are explained, in some cases, expressions “identification information,” “identifier,” “name,” “appellation” and “ID” are used. However, these expressions can be interchanged.
Further, in the following explanation of processing operations of the present invention, in some cases, the explanation is made with a “program” or a “module” as the operation entity (the subject). However, the program or the module is executed by a processor to perform decided processing using a memory and a communication port (a communication control apparatus). Therefore, the processing operations may be read as processing with the processor assumed as the operation entity (the subject). Processing disclosed with the program or the module assumed as the subject may be processing performed by a computer (a management computer, etc.) such as a management server or an information processing apparatus. A part of the program or the entire program may be realized by dedicated hardware. Various programs may be installed in computers by a program distribution server or a storage medium.
In the embodiment described in this specification, the size of a system set as a management target is not referred to. However, as the system becomes larger, it is more highly likely that failures occur in plural places simultaneously and frequently. Therefore, when the present invention is applied to a large system, the effects of the present invention can be further enjoyed.
<Concept of a Root Cause Analysis>
(1) General Method
However, since an error in the storage apparatus 1_102 is easily detected, the condition events 1032 and 1034 are satisfied. On the other hand, it is not easy to detect an error from iSCSI_Disk1 in the server 1_101. This is because, even when LU1 of the storage apparatus 1_102 allocated to U:¥ of the server 1_101 is inaccessible, an OS of the server 1_101 attempts to perform write processing in a cache and carry out detailed writing in an actual Disk, so that an error of the iSCSI_Disk1 cannot be detected in some cases. Therefore, if an error of the iSCSI_Disk1 cannot be detected, an analysis result by the expansion rule 103 cannot be obtained with a certainty factor of 100%.
However, even if such an error can be detected from the application 1, the error is not always a root cause of a failure. In other words, an information amount is insufficient for specifying such a root cause of a failure and a correct root cause analysis cannot be performed.
(2) Method of the Present Invention
(i) Therefore, in the present invention, a new expansion rule based on a new idea is introduced.
By applying such a new rule, even an error not easily detected can be checked. Therefore, it is possible to realize a root cause analysis with reliability higher than that of the general method.
The same applies in
In
In this way, a new action execution result is set as an additional event in the conventional condition events. Therefore, it is possible to improve a situation in which an information amount is insufficient and an analysis result cannot be trusted.
(ii) In the monitoring target apparatuses, depending on types of events or states (e.g., metric such as a processing amount processed by a component per unit time and an error or a failure occurred in the component), contents of the events or the states are stored in different apparatus internal information (tables or logs in the apparatuses).
However, costs in management (e.g., a load, time, and a memory capacity) for a management computer to detect event or state content are different depending on apparatus internal information or a protocol for transmitting information necessary for detecting an event or a state from a monitoring target apparatus to the management computer. These costs in management depend on types of the monitoring target apparatus and a component. Therefore, in some cases, event or state content can be easily acquired in a certain apparatus but cannot be easily acquired in another apparatus.
In the embodiment of the present invention, even when the processing for detecting an event or a state varies in this way, an action is executed separately from the check of whether or not a condition event of an expansion rule is satisfied, as explained above, and whether or not an error has occurred is checked on the basis of a result of the action.
(iii) Other Characteristics
The management computer is configured to detect, after detecting event or state content defined in advance, additional event or state content for executing the action. In order to detect the additional event or state content, the management computer executes additional information collection processing on the monitoring target apparatus (hereinafter referred to as “executes an action”). As explained above, this action is carried out when the management computer detects the event or state content defined in advance from the monitoring target apparatus. An event or state content group as conditions for executing the action (condition events in an RCA expansion rule), content of the action to be executed, and the number of satisfied condition events necessary for the execution of the action are defined in advance as an action rule. The action rule is expanded according to an actual environment (an action expansion rule) in the same manner as the RCA expansion rule and the action is executed according to a condition event detection situation. A condition part of the RCA expansion rule includes anew, in addition to the conventional event or state content detected from the monitoring target apparatus, an execution result of the action executed according to the action expansion rule.
“Under processing” of action execution may be controlled. It is likely that the same action is requested by plural action rules. For example, a server A is requested by a certain action rule to perform system log investigation and, while executing the system log investigation, is requested by another action rule to perform the same system log investigation. When the same action is requested by another action rule while an action by a certain action rule is being processed, the processing is not executed redundantly. Instead, the result of the action already under processing is diverted (reused) once it is obtained.
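The “under processing” control can be sketched as follows. This is only an illustrative sketch in Python, assuming a single-threaded management program; the class name ActionCoordinator, the rule IDs, and the (node, action) key are hypothetical and do not appear in the embodiment.

```python
from collections import defaultdict


class ActionCoordinator:
    """Shares one in-progress action among every rule that requests it."""

    def __init__(self):
        # (node, action) -> IDs of the rules waiting for the result (hypothetical layout)
        self.waiting = defaultdict(list)
        # Actions that were actually started, in order of start
        self.started = []

    def request(self, rule_id, node, action):
        """Register a request; return True only if the action must be started now."""
        key = (node, action)
        first = key not in self.waiting
        self.waiting[key].append(rule_id)
        if first:
            self.started.append(key)
        return first

    def complete(self, node, action, result):
        """Deliver one result to every rule that requested the action."""
        rules = self.waiting.pop((node, action), [])
        return {rule_id: result for rule_id in rules}


coordinator = ActionCoordinator()
first = coordinator.request("ExpAct-1", "ServerA", "system log investigation")
second = coordinator.request("ExpAct-2", "ServerA", "system log investigation")
delivered = coordinator.complete("ServerA", "system log investigation", True)
```

In this sketch the second request does not start a new investigation; both rules receive the single result when the action already under processing completes.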
Further, if an execution result of the same action has already been obtained, the execution result is diverted without the same action being executed many times. It is likely that the same action is requested plural times at a short interval. Within the time in which an investigation result can be regarded as the same (this is called an action effective period), a result of the last execution is diverted. The action effective period may be set differently according to a type of an action. For example, it is assumed that content of an action is “investigate whether a ‘DB lockout error’ has occurred in a monitoring target apparatus within the most recent one hour”. In this case, for example, even when the same action is requested within one hour from the time when the action is executed, a result of the action executed first is diverted.
<System Configuration>
Host computers 10000 to 10010 receive, for example, an I/O request for a file from a not-shown client computer connected to the host computers 10000 to 10010 and realize accesses to storage apparatuses 20000 to 20010 on the basis of the I/O request. The management server (a management computer) 30000 manages the operation of the entire computer system 1.
The WEB browser starting server 35000 communicates with a GUI display processing module 32400 of the management server 30000 via the network 45000 and displays various kinds of information on a WEB browser. A user refers to the information displayed on the WEB browser on the WEB browser starting server to manage apparatuses in the computer system. However, the management server 30000 and the WEB browser starting server 35000 may be configured by one server.
<Internal Configuration of the Host Computer>
A job application 13100 and an operating system 13200 are stored in the storage resource 13000.
The job application 13100 uses a storage area provided from the operating system 13200 and performs data input and output (hereinafter referred to as I/O) to and from the storage area.
The operating system 13200 executes processing for causing the job application 13100 to recognize, as storage areas, logical volumes on the storage apparatuses 20000 to 20010 connected to the host computer 10000 via the network 45000.
In
<Internal Configuration of the Network Apparatus>
The network apparatus 40000 includes I/O ports 41000 to 41020 for connection to the host computer 10000 or the storage apparatus 20000 via the network 45000, a management port 41100 for connection to the management server 30000 via the network 45000, a storage resource (a management memory) 42000 for storing various kinds of management information, and a processor 43000 for controlling data and management information in the management memory, which are connected to one another via a circuit such as an internal bus.
The network apparatuses 40000 and 40010 are, for example, IP switches and realize connection among the host computer 10000, the storage apparatus 20000, and the management server 30000.
<Internal Configuration of the Storage Apparatus>
The storage apparatus 20000 includes I/O ports 21000 and 21010 for connection to the host computer 10000 via the network 45000, a management port 21100 for connection to the management server 30000 via the network 45000, a storage resource (a management memory) 23000 for storing various kinds of management information, RAID groups 24000 and 24010 for storing data, and a controller 25000 for controlling data and management information in the management memory, which are connected to one another via a circuit such as an internal bus. The connection of the RAID groups 24000 and 24010 more accurately indicates that storage devices included in the RAID groups 24000 and 24010 are connected to other structures.
A management program 23100 for the storage apparatus and a volume management table 23200 for managing volumes of a magnetic disk are stored in the storage resource 23000. The management program 23100 communicates with the management server 30000 through the management port 21100 and provides the management server 30000 with configuration information of the storage apparatus 20000. The volume management table 23200 is a table for managing information indicating how the volumes are configured.
Each of the RAID groups 24000 and 24010 includes one or plural magnetic disks 24200, 24210, 24220, and 24230. When the RAID group includes plural magnetic disks, the magnetic disks may be formed in a RAID configuration. The RAID groups 24000 and 24010 are logically divided into plural volumes 24100 and 24110.
The logical volumes 24100 and 24110 do not have to be formed in a RAID configuration as long as the logical volumes 24100 and 24110 are configured by using storage areas of one or more magnetic disks. The logical volumes 24100 and 24110 may be storage devices including other storage media such as flash memories instead of the magnetic disks as long as storage areas corresponding to the logical volumes are provided.
The controller 25000 includes, on the inside thereof, a processor which performs control in the storage apparatus 20000 and a cache memory which temporarily stores data exchanged between the controller 25000 and the host computer 10000. The controller 25000 is interposed between an I/O port and a RAID group and performs exchange of data between the I/O port and the RAID group.
The storage apparatus 20000 may have a configuration other than the configuration shown in
<Internal Configuration of the Management Server (the Management Computer)>
In the storage resource 32000, an operating system 32010, a various setting values definition table 32020, an action definition table 32030, an RCA universal rule repository 32040, an RCA expansion rule repository 32050, an action universal rule repository 32060, an action expansion rule repository 32070, a configuration-information-for-node management table 32080, a configuration-information-for-component management table 32090, an event table 32100, an action expansion rule table 32110, an action expansion rule ID-event ID relation table 32120, an RCA expansion rule ID-event ID/action ID relation table 32130, an action execution management table 32140, an event/action expiration management table 32150, an RCA expansion rule table 32160, a conclusion table 32170, a conclusion ID-event ID relation table 32180, a conclusion ID-action ID relation table 32190, a management program 32200, a detected event queue 32210, and an action queue 32220 are stored. Details of the various components stored in the storage resource 32000 are explained later with reference to the drawings. Contents of the components are briefly explained below.
The various setting values definition table 32020 is a table for managing setting values of information necessary for executing root cause analysis processing such as a monitoring interval for a monitoring target apparatus.
The action definition table 32030 is a table for defining content of an action for determining whether or not a condition event introduced anew in the present invention is satisfied.
The RCA universal rule repository 32040 stores a universal rule for a root cause analysis. The RCA expansion rule repository 32050 stores an expansion rule for the root cause analysis obtained by applying configuration information of monitoring target apparatuses to an RCA universal rule.
The action universal rule repository 32060 stores universal rules for actions. The action expansion rule repository 32070 stores an action expansion rule obtained by applying the configuration information of the monitoring target apparatuses to an action universal rule.
The configuration-information-for-node management table 32080 is a table for managing the configuration information of the monitoring target apparatuses (node apparatuses). The configuration-information-for-component management table 32090 is a table for managing configuration information of components of the node apparatuses.
The event table 32100 is a table for managing events occurring in the monitoring target apparatuses and components of the monitoring target apparatuses or states of the monitoring target apparatuses and the components.
The action expansion rule table 32110 is a table for managing a correspondence relation between a used action expansion rule and an executed action.
The action expansion rule ID-event ID relation table 32120 is a table for managing a relation concerning which action is executed when which event occurs (a relation between an action expansion rule and an event related thereto).
The RCA expansion rule ID-event ID/action ID relation table 32130 is a table for managing a relation concerning which RCA expansion rule is applied when which event and action occur (a relation between an RCA expansion rule and an event and an action related thereto).
The action execution management table 32140 is a table for managing execution states of respective actions and last execution results of the actions.
The event/action expiration management table 32150 is a table for managing states (valid or not) of a detected event and an executed action.
The RCA expansion rule table 32160 is a table for managing results of respective kinds of root cause analysis processing.
The conclusion table 32170 is a table for managing a root cause analysis result and a conclusion message corresponding thereto.
The conclusion ID-event ID relation table 32180 is a table for associating a conclusion ID and an event ID and managing conclusions and a detection state of an event.
The conclusion ID-action ID relation table 32190 is a table for associating a conclusion ID and an action ID and managing a relation between conclusions and an action execution result.
The management program 32200 is a program for executing the root cause analysis processing in this embodiment and realizing processing until presentation of an analysis result to an administrator.
The detected event queue 32210 is a queue for accumulating a detected (collected) event. The event table 32100 is updated on the basis of the detected event.
The action queue 32220 is a queue for accumulating actions determined to be executed according to the action expansion rule. For example, the actions are executed in order of input to the action queue 32220.
For example, the management server (the management computer) 30000 includes a keyboard, a pointer device, and the like as input devices and includes a display, a printer, and the like as output devices. However, the management server 30000 may include other apparatuses. A serial interface or an Ethernet interface may also be used as a substitute for the input and output devices. In that case, a computer for display including a display, a keyboard, or a pointer device is connected to the interface, information for display is transmitted to the computer for display, and information for input is received from the computer for display, so that display and input on the computer for display substitute for display and input in the input and output devices.
In this specification, in some cases, a set of one or more computers which manage the computer system (an information processing system) 1 and display the information for display is referred to as management system. When the management server 30000 displays the information for display, the management server 30000 is the management system. A combination of the management server 30000 and the computer for display (e.g., the WEB browser starting server 35000 shown in
<Various Setting Values Definition Table (TBL_PROPERTY)>
In the example shown in
<Configuration-Information-for-Node Management Table (TBL_NODE)>
The node ID 32081 is identification information for specifying a monitoring target apparatus. The node type 32082 is information for specifying a type of the monitoring target apparatus. The node name 32083 is information indicating a name of the monitoring target apparatus. The IP address 32084 indicates an IP address used in accessing the monitoring target apparatus. The authentication information 32085 is information including, for example, an administrator ID and a password, and is used for authentication processing executed when the management server 30000 accesses the monitoring target apparatus.
<Configuration-Information-for-Component Management Table (TBL_COMPO)>
The component ID 32091 is identification information for specifying a component included in the monitoring target apparatus. The component type 32092 is information indicating a type of the component included in the monitoring target apparatus. The component name 32093 is information indicating a name of the component included in the monitoring target apparatus. The parent node ID is information indicating the monitoring target apparatus including the component.
<RCA Universal Rule>
The RCA universal rule and the action universal rule explained later are rules indicating a relation between a combination of one or more condition events which could occur in a monitoring target apparatus included in the computer system 1 and a conclusion event set as a failure cause with respect to the combination of the condition events. In other words, the RCA/action universal rules are rules indicating that, when an event occurs in the condition part, content described in the conclusion part could be a root cause of a failure.
In general, an event propagation model for specifying a cause in a failure analysis describes, in an “IN-IF-THEN” format, a combination of events predicted to occur as a result of a certain failure and a cause of the events. RCA/action universal rules are not limited to those shown in
The IN clause 320411 is information for specifying a type of a pattern for specifying, when the RCA universal rule is expanded, how the RCA universal rule is expanded.
The IF clause 320412 includes, as information concerning a condition event, a relation among nodes or components (nodes or components arranged according to conditions have a relation with one another), events or states detected in the respective nodes or components as conditions, or results obtained by executing an action defined in the action universal rule in the respective nodes.
The THEN clause 320413 indicates an event or a state of a node or a component as a conclusion (a root cause) when the events or the states indicated by the IF clause 320412 are detected or the execution results of the action are true.
For example, an RCA universal rule Rule-3 32043 indicates that “DiskDrive of Storage is an error” indicated by the THEN clause 320413 is a root cause at a certainty factor of the number of detected events/the number of condition events if, in a topology expanded in Pattern 7 designated by the IN clause 320411, any one of events “a result obtained by executing Action A in Server is true”, “an error in LU of Storage”, “an error in Volume of Storage”, and “an error in DiskDrive of Storage” indicated by the IF clause 320412 can be detected.
There is a relation that, if an event to the IF clause (the condition part) 320412 is detected, an event of the THEN clause (the conclusion part) 320413 is a root cause of a failure and, if a status of the THEN clause 320413 is normal, a problem of the IF clause 320412 is solved.
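The certainty factor described for Rule-3 (the number of detected events divided by the number of condition events) can be illustrated with the following sketch in Python. The function name and the condition labels are hypothetical, and it is assumed here that a true action execution result counts as one satisfied condition in the same way as a detected event.

```python
def certainty_factor(condition_ids, satisfied_ids):
    """Return (number of satisfied conditions) / (number of conditions)."""
    conditions = set(condition_ids)
    if not conditions:
        raise ValueError("a rule needs at least one condition")
    # Only conditions that belong to the rule are counted.
    return len(conditions & set(satisfied_ids)) / len(conditions)


# Hypothetical labels for the four conditions of the IF clause of Rule-3.
rule_3_conditions = [
    "Action A result true @ Server",
    "Error @ LU of Storage",
    "Error @ Volume of Storage",
    "Error @ DiskDrive of Storage",
]

# Three storage-side errors are detected, but the action result is not yet true.
cf_partial = certainty_factor(rule_3_conditions, rule_3_conditions[1:])
# All four conditions are satisfied.
cf_full = certainty_factor(rule_3_conditions, rule_3_conditions)
```

With three of the four conditions satisfied the sketch yields 0.75, and it yields 1.0 only when the action execution result is also true.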
<Action Universal Rule>
The IN clause 320611 is information for specifying a type of a pattern for specifying, when the action universal rule is expanded, how the action universal rule is expanded. An expansion pattern name separately defined is shown.
The IF clause 320612 includes, as information concerning a condition event for action execution, a relation among nodes or components (nodes or components arranged according to conditions have a relation with one another), events or states detected in the respective nodes or components as conditions, or the number of detected events or the number of detected states necessary for execution of an action.
The THEN clause 320613 indicates an action executed when the events or the states included in the IF clause 320612 are detected by a number necessary for execution of the action.
For example, an action universal rule Rule-1_32061 indicates that “execute Action A in Server” indicated by the THEN clause 320613 is carried out if, in a topology expanded in Pattern 5 designated in the IN clause 320611, two or more of the events or states of “an error in LU of Storage”, “an error in Volume of Storage”, and “an error in DiskDrive of Storage” indicated by the IF clause 320612 are detected.
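The execution condition of an action rule (a required number of detected condition events) can be sketched as follows in Python. The function name and the event labels are hypothetical; the threshold of two out of three follows the Rule-1 example above.

```python
def should_execute_action(condition_events, detected_events, required_count):
    """Return True when enough of the rule's condition events have been detected."""
    detected = set(condition_events) & set(detected_events)
    return len(detected) >= required_count


# Hypothetical labels for the three condition events of the IF clause of Rule-1.
rule_1_conditions = [
    "Error @ LU of Storage",
    "Error @ Volume of Storage",
    "Error @ DiskDrive of Storage",
]

one_detected = should_execute_action(
    rule_1_conditions, ["Error @ LU of Storage"], required_count=2)
two_detected = should_execute_action(
    rule_1_conditions,
    ["Error @ LU of Storage", "Error @ DiskDrive of Storage"],
    required_count=2)
```

In this sketch Action A is not requested while only one error is detected and is requested once a second error is detected.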
A result of the action executed according to the action rule is included in the condition part of the RCA rule. The management server 30000 creates an RCA expansion rule and an action expansion rule from a configuration information management table and universal topology information (e.g., Server(LAN_ADAPTER)-Server(iSCSI_DISK) and Server(ScsiDiskDrive)-Storage(STORAGE_LU)-Storage(STORAGE_VOLUME)-Storage(STORAGE_DISK)) included in the RCA universal rule and the action universal rule.
<Creation of an Expansion Rule>
A pattern 2_1510 indicates a procedure for obtaining LU, Volume, and DiskDrive of a Storage related to a Drive of a Server and is used for expansion of the RCA universal rule 2_32042. In
It is seen from the Server Connection table 1520 shown in
In this way, an expansion rule indicating whether or not an execution result of the Action A is satisfied and in which Disk of which Server, an error in which LU of which Storage, an error in which Volume in which Storage, and an error in which DiskDrive in which Storage are condition events and an error in which DiskDrive of which Storage is set as a root cause is derived.
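The derivation of expansion rules from a universal rule can be sketched as follows in Python. This is only an illustration under assumptions: the row layout, the node and component names, and the function expand_rule are hypothetical, and the actual pattern and connection tables of the embodiment are not reproduced here.

```python
def expand_rule(rule_id, condition_templates, conclusion_template, topology_rows):
    """Create one expansion rule per topology row by filling in concrete names."""
    expanded = []
    for number, row in enumerate(topology_rows, start=1):
        expanded.append({
            "id": f"{rule_id}-{number}",
            "if": [template.format(**row) for template in condition_templates],
            "then": conclusion_template.format(**row),
        })
    return expanded


# Hypothetical rows: one per Server Disk and the Storage resources behind it.
topology_rows = [
    {"server": "ServerA", "disk": "Disk1", "storage": "StorageA",
     "lu": "LU1", "volume": "Volume1", "drive": "DiskDrive1"},
    {"server": "ServerA", "disk": "Disk2", "storage": "StorageA",
     "lu": "LU2", "volume": "Volume2", "drive": "DiskDrive2"},
]

# Universal-rule conditions written as templates over the row fields (assumed form).
conditions = [
    "result of Action A on {disk} of {server} is true",
    "error in {lu} of {storage}",
    "error in {volume} of {storage}",
    "error in {drive} of {storage}",
]

rules = expand_rule("Exp 2", conditions, "error in {drive} of {storage}", topology_rows)
```

Each topology row yields one expansion rule, so two rows yield two rules from a single universal rule.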
Therefore, the RCA expansion rules Exp 2-1 and Exp 2-2 are generated from the RCA universal rule 2_32042 (see
<RCA Expansion Rule>
As explained with reference to
For example, the RCA expansion rule of EXP 2-1 shown in
<Action Expansion Rule>
In the same manner as the RCA expansion rule, the action expansion rule is generated by applying, according to the pattern information used in the IN clause 320611 and separately defined, collected configuration information of monitoring target apparatuses and components thereof to an action universal rule. Like EXP-Act 2-1 and EXP-Act 2-2, in some cases, plural action expansion rules are generated from one action universal rule.
For example, an action expansion rule of Exp-Act 1-1 shown in
<Event Table (TBL_EVT)>
The event table 32100 includes an event ID 32101, a node ID 32102, a component ID 32103, an event or state 32104, a detection state 32105, and a last detection time 32106 as structure items. In the event table 32100, the event ID 32101, the node ID 32102, the component ID 32103, and the event or state 32104 are fixed information input from the beginning. The detection state 32105 indicating whether events (E1 to E10; all assumed events are listed up) are detected by the event collection processing and the detection time 32106 of the detection state 32105 are input. If the events are detected, the detection state 32105 is changed from undetected to detected.
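The update of the event table from detected events can be sketched as follows in Python. The dictionary layout, the node and component IDs, and the function name are hypothetical stand-ins for the structure items listed above, and a queue of (event ID, detection time) pairs is assumed as the form of the detected event queue.

```python
from collections import deque
from datetime import datetime

# Fixed rows registered in advance; only the detection columns change (assumed layout).
event_table = {
    "E1": {"node": "N1", "component": "C1", "event": "Error",
           "detected": False, "last_detected": None},
    "E2": {"node": "N2", "component": "C5", "event": "Error",
           "detected": False, "last_detected": None},
}


def apply_detected_events(queue, table):
    """Move queued detections into the table and return the updated event IDs."""
    updated = []
    while queue:
        event_id, detected_at = queue.popleft()
        row = table.get(event_id)
        if row is None:
            continue  # not one of the events listed in advance
        row["detected"] = True
        row["last_detected"] = detected_at
        updated.append(event_id)
    return updated


detected_event_queue = deque([
    ("E2", datetime(2010, 6, 8, 17, 50)),
    ("E99", datetime(2010, 6, 8, 17, 51)),  # unknown ID, ignored in this sketch
])
updated_ids = apply_detected_events(detected_event_queue, event_table)
```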
In this embodiment, an example of types of events or states of the monitoring target apparatus is as explained below. The events mean specific events concerning what occurred in which component and when, as explained below. The states of the monitoring target apparatus may be states of the monitoring target apparatus or may be states of a component.
(i) Events
(a) States of the monitoring target apparatus (which may be referred to as node), a component (a hardware component) included in the monitoring target apparatus, a software component such as a program to be executed, or a component (a logical component) logically generated by processing of the hardware component or/and the software component have changed. In the following explanation, when the hardware component, the software component, and the logical component are not distinguished, the components are simply referred to as component.
(b) Processing/state different from normal processing/state has occurred in the node or the component.
(ii) States
An example of the states of the monitoring target apparatus is as explained below.
(a) Possibility of normal operation of the component. In other words, this may be a state indicating presence or absence of occurrence of a failure in the component.
(b) A measurement value (metric) concerning the component. For example, the temperature of the component or a processing amount (IOPS, the number of transactions of a database, a transfer data amount per unit time, etc.) processed by the component per unit time. When the state is considered as an event, in some cases, occurrence of an event is considered on the basis of the metric and a threshold.
<Action Definition Table (TBL_ACT_DEF)>
The action type 32031 is an item for specifying a type of an action (an action name). The action range 32032 is an item for specifying in which range of time in the past from a point of action execution detection a system log is retrieved. The valid period 32033 is an item for specifying in which length of time after a relevant action execution result is obtained the same execution result is used as an action execution result (without actually executing an action). The action content 32034 defines content which should be executed concerning an action corresponding to the action type 32031. {%1} and {%2} are arguments and are replaced with parameter values designated during action execution.
For example, in
<Action Execution Management Table (TBL_ACT)>
The action ID 32141 is identification information for specifying an action. As in the action definition table 32030 (
The management server 30000 stores a relation between an action to be executed and an execution target of the action in an action table in advance on the basis of an action expansion rule. The management server 30000 manages an execution state of the action in the action execution management table 32140.
The action is executed when events equivalent to the number of detected events defined in the action expansion rule are detected as shown in an action expansion rule table (
As explained above, the action execution management table 32140 stores the last time execution result (a recent execution result) 32145 of an action. When an execution request for a certain action is made, if the action was executed within the valid period defined for the action, the result of the executed action is reused. For example, in the action execution management table 32140, it is assumed that an execution request for A3 in the action ID 32141 is made at 2010/6/8 18:10. Since the valid period of the action C executed in A3 is 20 minutes and the result decision time of the last execution is 2010/6/8 17:57 (within 20 minutes), A3 is not executed again and the last execution result is used.
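The reuse decision in the A3 example can be sketched as follows. This is only an illustration: the function name `can_reuse_last_result` and its parameters are hypothetical, and treating a result as reusable exactly at the end of the valid period is an assumption.

```python
from datetime import datetime, timedelta

def can_reuse_last_result(requested_at, last_decided_at, valid_period):
    """Return True when the last execution result is still within its valid period."""
    if last_decided_at is None:  # the action has never produced a result
        return False
    # Assumption: a result is still usable exactly at the end of the valid period.
    return requested_at - last_decided_at <= valid_period

# Values from the A3 example: request at 18:10, last result decided at 17:57,
# valid period of 20 minutes.
requested = datetime(2010, 6, 8, 18, 10)
decided = datetime(2010, 6, 8, 17, 57)
reuse = can_reuse_last_result(requested, decided, timedelta(minutes=20))
print(reuse)  # True: 13 minutes have elapsed, which is within 20 minutes
```

If the same request arrived at 18:30 instead, 33 minutes would have elapsed and the action would be executed again.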
<Action Expansion Rule Table (TBL_EXP_ACT)>
The action expansion rule table 32110 includes an action expansion rule ID 32111, an execution action ID 32112, a number of detected events or states necessary for action execution 32113, and a number of detected events or states 32114 as structure items.
The action expansion rule ID 32111 is identification information for specifying action expansion rules. The execution action ID 32112 is identification information for specifying actions which should be executed according to the action expansion rules. The number of detected events or states necessary for action execution 32113 is information indicating how many events or states indicated by IF clauses of the action expansion rules should be present in the event table 32100 when the action is executed. The number of detected events or states 32114 is information indicating the number of detected events or states indicated by the IF clauses of the action expansion rules in the event table 32100.
For example, in Exp-Act 1-1, it is indicated that the number of events necessary for action execution is two and two events are currently detected. In Exp-Act 2-2, it is indicated that the number of events necessary for action execution is one but no event is currently detected. Therefore, it is seen that an action A1 is executed concerning Exp-Act 1-1 and an action A2 is not executed concerning Exp-Act 2-2.
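The comparison described for Exp-Act 1-1 and Exp-Act 2-2 can be sketched as below. The class name `ActionExpansionRule` and its field names are hypothetical stand-ins for the items 32111 to 32114; the counts are those given in the example above.

```python
from dataclasses import dataclass

@dataclass
class ActionExpansionRule:
    rule_id: str    # action expansion rule ID (32111)
    action_id: str  # execution action ID (32112)
    required: int   # number of detected events or states necessary for action execution (32113)
    detected: int   # number of detected events or states (32114)

    def should_execute(self) -> bool:
        # The action is executed once the detected count reaches the required count.
        return self.detected >= self.required

rules = [
    ActionExpansionRule("Exp-Act 1-1", "A1", required=2, detected=2),
    ActionExpansionRule("Exp-Act 2-2", "A2", required=1, detected=0),
]
to_execute = [rule.action_id for rule in rules if rule.should_execute()]
print(to_execute)  # ['A1']
```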
<Action Expansion Rule ID-Event ID Relation Table (TBL_ACT_EVT)>
The action expansion rule ID 32121 is identification information for specifying the action expansion rules. The event ID 32122 is identification information for specifying events related to the action expansion rules. The action expansion rule ID corresponds to the identification information (Exp-Act 1-1, Exp-Act 2-1, etc.) of the action expansion rules stored in the action expansion rule repository 32070. The event ID 32122 corresponds to the event ID 32101 of the event table 32100.
It is seen from the action expansion rule ID-event ID relation table 32120 that, for example, events E5, E6, and E7 are related to the action expansion rule Exp-Act 1-1 as condition events.
<RCA Expansion Rule ID-Event ID/Action ID Relation Table (TBL_RCA_EVT_ACT)>
The RCA expansion rule ID 32131 is identification information for specifying RCA expansion rules. The event ID/action ID 32132 is identification information for specifying events and actions related to the RCA expansion rules. The RCA expansion rule ID corresponds to the identification information (Exp 1-1, Exp 2-1, etc.) of the RCA expansion rules stored in the RCA expansion rule repository 32050. The event ID/action ID 32132 corresponds to the event ID 32101 of the event table 32100 and the action ID 32141 of the action execution management table 32140.
It is seen from the RCA expansion rule ID-event ID/action ID relation table 32130 that, for example, the action A1 and the events E5, E6, and E7 are related to the RCA expansion rule Exp 2-1 as condition events.
<Event/Action Expiration Management Table (TBL_EVT_ACT_EXPIRATION)>
The event ID/action ID 32151 is identification information for specifying an event and an action included in an RCA expansion rule and an action expansion rule. All events and actions are stored in the item.
The state 32152 indicates whether detected events and actions are valid or invalid. Besides expired events and actions, states of undetected events and actions are managed as invalid. When there is a change in this state 32152, the management server 30000 (the management program 32200) increases or reduces the number of detected events/number of satisfied actions 32164 in the RCA expansion rule table 32160 explained later (see
The expiration 32153 is information indicating expiration of an event and an action. Concerning an event, this expiration 32153 is time obtained by adding an event valid period (e.g., 30 minutes) of the various setting values definition table 32020 to time when event satisfaction is detected. Concerning an action, the expiration 32153 is time obtained by adding each valid period 32033 (e.g., 5 minutes or 20 minutes: the valid period is different depending on a type of an action) defined in the action definition table 32030 to time when action satisfaction is detected.
Unless a root cause is dealt with and solved, every time event monitoring is performed, event or action satisfaction corresponding to the event monitoring is detected. In this case, in the event/action expiration management table 32150, the state 32152 is kept valid and the expiration is managed to be extended.
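How the expiration 32153 is set and then extended on repeated detection can be sketched as follows. The function `record_detection` and the dictionary layout are hypothetical; the 30-minute and 20-minute valid periods are the example values mentioned above.

```python
from datetime import datetime, timedelta

EVENT_VALID_PERIOD = timedelta(minutes=30)  # example event valid period from the text

def record_detection(expiration_table, item_id, detected_at, valid_period):
    """Mark an event or action as valid and (re)compute its expiration."""
    expiration_table[item_id] = {
        "state": "valid",
        "expiration": detected_at + valid_period,
    }

table = {}
# An event detected at 18:00 expires at 18:30.
record_detection(table, "E1", datetime(2010, 6, 8, 18, 0), EVENT_VALID_PERIOD)
first = table["E1"]["expiration"]
# The same event detected again at 18:05 stays valid and its expiration moves to 18:35.
record_detection(table, "E1", datetime(2010, 6, 8, 18, 5), EVENT_VALID_PERIOD)
second = table["E1"]["expiration"]
# An action uses the valid period of its own action type (20 minutes here).
record_detection(table, "A3", datetime(2010, 6, 8, 18, 5), timedelta(minutes=20))
print(first, second, table["A3"]["expiration"])
```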
<RCA Expansion Rule Table (TBL_RCA)>
The RCA expansion rule ID 32161 is identification information for specifying an RCA expansion rule. In the item, all the RCA expansion rules included in the RCA expansion rule repository 32050 are stored.
The conclusion ID 32162 is identification information for specifying THEN clauses (conclusion parts) of RCA expansion rules. Content of a conclusion (a root cause) corresponding to the conclusion ID 32162 is shown in the conclusion table 32170 explained later (
The total number of events/actions 32163 is information indicating a total number of condition events and actions included in IF clauses (condition parts) of the RCA expansion rules.
The number of detected events/number of satisfied actions 32164 is information indicating a sum of the number of detected events and the number of satisfied actions among condition events and actions included in the IF clauses of the expansion rules.
The certainty factor 32165 is information indicating accuracy of an RCA analysis result, in other words, a degree of a problem (a trouble) and is obtained by dividing the number of detected events/number of satisfied actions 32164 by the total number of events/actions 32163. This certainty factor 32165 indicates at which level of accuracy a relevant trouble cause could be a root cause.
In
Item values of the RCA expansion rule ID 32161, the conclusion ID 32162, and the total number of events/actions 32163 are fixed. Item values of the number of detected events/number of satisfied actions 32164 and the certainty factor 32165 fluctuate.
<Conclusion Table (TBL_ROOT_CAUSE)>
The conclusion ID 32171 is identification information for specifying a conclusion used in an RCA result. Identification information of all conclusions to be used (which are present in a number equal to the number of types of THEN parts of the expansion rules) is stored in the item.
The conclusion message 32172 is information obtained by converting content of a THEN clause (a conclusion part) of an RCA expansion rule into a message and is present in a number equal to the number of types of THEN clauses of the expansion rules.
The node ID 32173 is identification information for specifying a monitoring target apparatus in which a root cause of a failure is present included in a conclusion corresponding to the node ID 32173.
The component ID 32174 is identification information for specifying a component in which a root cause of a failure is present included in a conclusion corresponding to the component ID 32174.
The present rank 32175 is information indicating priority of a failure which should be dealt with. For example, a rank is determined in order from a failure with a highest certainty factor.
The certainty factor 32176 is information indicating a certainty factor calculated by root cause analysis (RCA) processing. For example, after the certainty factor 32165 of the RCA expansion rule table 32160 is generated, the value is inserted into the conclusion table 32170 as the certainty factor 32176.
The expansion rule ID 32177 used for a certainty factor calculation is information for specifying all RCA expansion rules used in calculating a certainty factor leading to a conclusion corresponding to the expansion rule ID 32177. An RCA expansion rule stored in the item space is not limited to one. Identification information of all RCA expansion rules leading to the same conclusion is stored in the item space. However, when plural certainty factors are obtained from plural RCA expansion rules, a value of a maximum certainty factor is stored.
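The way several RCA expansion rules feed one conclusion entry can be sketched as below. The function `summarize_conclusions`, the conclusion IDs, and the certainty values are made up for illustration. Ranking ties by input order is also an assumption, since the text does not say how ties are handled.

```python
def summarize_conclusions(rca_rules):
    """rca_rules: iterable of (rca_rule_id, conclusion_id, certainty_factor)."""
    by_conclusion = {}
    for rule_id, conclusion_id, certainty in rca_rules:
        entry = by_conclusion.setdefault(
            conclusion_id, {"certainty": 0.0, "rule_ids": []}
        )
        entry["rule_ids"].append(rule_id)                          # all rules leading to the conclusion
        entry["certainty"] = max(entry["certainty"], certainty)    # keep the maximum value
    # Rank conclusions in descending order of certainty factor.
    ranked = sorted(by_conclusion.items(), key=lambda kv: kv[1]["certainty"], reverse=True)
    return [
        {"conclusion_id": cid, "rank": position, **info}
        for position, (cid, info) in enumerate(ranked, start=1)
    ]

# Hypothetical sample: two rules lead to conclusion C1 and one rule leads to C2.
summary = summarize_conclusions([
    ("Exp 1-1", "C1", 0.5),
    ("Exp 1-2", "C1", 0.75),
    ("Exp 2-1", "C2", 1.0),
])
for row in summary:
    print(row)
```

With this sample, C2 is ranked first with a certainty factor of 1.0, and C1 is ranked second with 0.75, listing both Exp 1-1 and Exp 1-2 as the rules used.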
<Conclusion ID-Event ID Relation Table (TBL_ROOT_CAUSE_EVT)>
The conclusion ID-event ID relation table 32180 is a table for managing a relation between a conclusion and a detection state of an event and includes a conclusion ID 32181, an event ID 32182, a detection state 32183, and a detection time 32184 as structure items.
The conclusion ID 32181 is identification information for specifying conclusions corresponding to all condition events (excluding actions) included in an RCA expansion rule. When plural condition events are included in the same RCA expansion rule, the same conclusion ID is stored in the item space once for each of those condition events.
The event ID 32182 is information indicating all events included in the RCA expansion rule.
The detection state 32183 is information indicating detection states of events. A state is determined on the basis of the detection state 32105 of the event table 32100. In an initial state, in this detection state 32183, all states are set to undetected. When an event is detected, the setting is changed to “detected.”
The detection time 32184 indicates time when events are detected.
<Conclusion ID-Action ID Relation Table (TBL_ROOT_CAUSE_ACT)>
The conclusion ID 32191 is identification information for specifying conclusions corresponding to all actions as condition events included in an RCA expansion rule. Since actions are not always included in all RCA expansion rules as condition events, not all conclusions are stored in the item space.
The action ID 32192 is information indicating all the actions included in the RCA expansion rule.
The execution result 32193 is information indicating execution results of the actions. Content of the execution results is satisfied, not satisfied, or unexecuted (−).
The execution result decision time 32194 indicates time when an execution result of an action is decided.
<Management Program>
The management program 32200 is a program for executing management of a management target apparatus including, for example, configuration information management processing, event collection processing, collected (detected) event processing, action execution processing, expiration management processing, and GUI processing.
The configuration information management processing is processing for transmitting a configuration information acquisition request to monitoring target apparatuses and storing configuration information (node configuration information and component configuration information) returned from the monitoring target apparatuses respectively in the configuration-information-for-node management table 32080 and the configuration-information-for-component management table 32090.
The event collection processing is processing for collecting information concerning events or states detected from the monitoring target apparatuses. Specifically, events that occurred within a predetermined period (e.g., within an event monitoring time interval) are collected from the monitoring target apparatuses.
The collected (detected) event processing is processing for determining an RCA expansion rule to which the events collected by the event collection processing are applied and calculating a certainty factor on the basis of the number of condition events, the number of detected events, and the number of satisfied actions in the RCA expansion rule. The collected event processing is also processing for determining, on the basis of the detected events, whether an action specified in the RCA expansion rule is executed.
The action execution processing is processing for applying, when execution of an action is determined by the collected event processing, the detected events to an action expansion rule to execute the action and determining whether or not the action is satisfied.
The expiration management processing is processing for determining whether a collected (detected) event or a satisfied action has expired and invalidating any expired event or action.
The GUI processing is processing for generating an RCA result output screen (a system monitoring console (
As shown in
Although not explained above and not shown in
<Outline of Periodic Execution Processing by the Management Program>
First, the management program 32200 executes management program initialization processing (S301). Details of this processing are explained with reference to
The management program 32200 performs schedule check (S302). Specifically, the management program 32200 checks setting items and setting values defined in the various setting values definition table 32020, monitors monitoring target apparatuses, and checks whether it is timing for executing processing for collecting events from the monitoring target apparatuses (e.g., every 5 minutes) or timing for performing processing for determining whether there are expired event and action among collected events and satisfied actions (e.g., every 30 minutes).
Subsequently, the management program 32200 determines whether it is execution timing for the event collection processing or timing for checking expiration (S303). When it is neither the execution timing nor the timing for checking expiration, the management program 32200 returns to S302.
In the case of the timing for the event collection processing, the management program 32200 executes the event collection processing (S304). Details of the event collection processing are explained with reference to
On the other hand, in the case of the timing for checking expiration, the management program 32200 executes the expiration management processing (S305). Details of the expiration management processing are explained with reference to
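The schedule check of S302 and S303 can be sketched as a function that reports which processing is due. The name `due_tasks` and the returned labels are hypothetical; the 5-minute and 30-minute intervals are the example values mentioned for the various setting values definition table 32020.

```python
from datetime import datetime, timedelta

COLLECT_INTERVAL = timedelta(minutes=5)        # example event collection interval
EXPIRY_CHECK_INTERVAL = timedelta(minutes=30)  # example expiration check interval

def due_tasks(now, last_collection, last_expiry_check):
    """Return the processing that should run at `now` (an empty list means return to S302)."""
    tasks = []
    if now - last_collection >= COLLECT_INTERVAL:
        tasks.append("event_collection")        # S304
    if now - last_expiry_check >= EXPIRY_CHECK_INTERVAL:
        tasks.append("expiration_management")   # S305
    return tasks

start = datetime(2010, 6, 8, 18, 0)
print(due_tasks(start + timedelta(minutes=3), start, start))   # []
print(due_tasks(start + timedelta(minutes=5), start, start))   # ['event_collection']
print(due_tasks(start + timedelta(minutes=30), start, start))  # ['event_collection', 'expiration_management']
```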
<Management Program Initialization Processing>
First, the management program 32200 reads the various setting values definition file 32020 and the action definition file 32030 (S3010) and reads an RCA universal rule and an action universal rule (S3011).
The management program 32200 accesses monitoring target apparatuses included in the computer system 1, acquires configuration information of the apparatuses and components thereof, and stores the configuration information respectively in the configuration-information-for-node management table 32080 and the configuration-information-for-component management table 32090 (S3012).
The management program 32200 applies the configuration information acquired in S3012 to the RCA universal rule and the action universal rule read in S3011 to generate an RCA expansion rule and an action expansion rule, and generates the conclusion table 32170 anew (S3013). As explained with reference to
Subsequently, the management program 32200 initializes the event table 32100, the action execution management table 32140, and the event/action expiration management table 32150 (S3014). Specifically, in the event table 32100, fixed relevant information is inserted into the event ID 32101, the node ID 32102, the component ID 32103, and the event or state 32104. All states are undetected in the detection state 32105. The last detection time 32106 is set to blank. In the action execution management table 32140, fixed relevant information is inserted into the action ID 32141, the action type 32142, and the action target 32143. The execution state 32144 is set to on standby or Null (−). The last time execution result 32145 and the last execution result decision time 32146 are set to blank or Null (−).
The management program 32200 generates the action expansion rule ID-event ID relation table 32120 on the basis of the generated action expansion rule (S3015) and further generates the RCA expansion rule ID-event ID/action ID relation table 32130 on the basis of the generated RCA expansion rule (S3016).
Further, the management program 32200 initializes the action expansion rule table 32110 and the RCA expansion rule table 32160 (S3017). Specifically, in the action expansion rule table 32110, fixed relevant information is inserted into the spaces of the action expansion rule ID 32111, the execution action ID 32112, and the number of detected events or states necessary for action execution 32113 and the space of the number of detected events or states 32114 is set to blank. In the RCA expansion rule table 32160, on the basis of the RCA expansion rules, fixed relevant information is inserted into the spaces of the RCA expansion rule ID 32161, the conclusion ID 32162, and the total number of events/actions 32163 and the spaces of the number of detected events/number of satisfied actions 32164 and the certainty factor 32165 are set to blank.
The management program 32200 generates the conclusion ID-event ID relation table 32180 and the conclusion ID-action ID relation table 32190 (S3018). The conclusion ID-event ID relation table 32180 is a table for managing conclusions and detection states of related events necessary for determination leading to the conclusions. Therefore, at a stage of initialization, events (excluding actions) related to THEN clauses (conclusion parts) are extracted from the RCA expansion rules, fixed relevant information is inserted into the spaces of the conclusion ID 32181 and the event ID 32182, the space of the detection state 32183 is set to undetected, and the space of the detection time 32184 is set to blank or Null (−). The conclusion ID-action ID relation table 32190 is a table for managing conclusions and execution results of related actions necessary for determination leading to the conclusions. Therefore, at a stage of initialization, actions related to THEN clauses (conclusion parts) are extracted from the RCA expansion rules including actions, fixed relevant information is inserted into the spaces of the conclusion ID 32191 and the action ID 32192, the space of the execution result 32193 is set to unexecuted (−), and the space of the execution result decision time 32194 is set to blank or Null (−).
Finally, the management program 32200 initializes the detected event queue 32210 and the action queue 32220 (S3019).
<Event Collection Processing>
First, the management program 32200 transmits an event collection request to monitoring target apparatuses and checks events or states of a monitoring target apparatus (a node) and a component set as collection targets this time (S3040).
The management program 32200 determines whether information acquired in S3040 corresponding to the node ID 32102 and the component ID 32103 of the event table (TBL_EVT) 32100 coincides with information indicated by the event or state 32104 (S3041).
When it is determined that the information acquired in S3040 does not coincide with the information indicated by the event or state 32104 (in the case of No in S3041), the management program 32200 ends the processing concerning the combination of the node ID and the component ID and shifts the processing to the event collection processing for the next node ID and component ID.
On the other hand, when it is determined that the information acquired in S3040 coincides with the information indicated by the event or state 32104 (in the case of Yes in S3041), the management program 32200 adds the event ID 32101 corresponding to the information to the detected event queue 32210 (S3042).
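The matching of S3041 and the queueing of S3042 can be sketched as follows. The function `collect_events`, the node and component names, and the shape of the `observed` mapping are hypothetical; how states are actually acquired in S3040 is not modeled.

```python
from collections import deque

def collect_events(event_table, observed, detected_event_queue):
    """Queue the event ID of every registered event or state that was actually observed."""
    for row in event_table:
        key = (row["node_id"], row["component_id"])
        # S3041: does the acquired information coincide with the registered event or state?
        if row["event_or_state"] in observed.get(key, set()):
            detected_event_queue.append(row["event_id"])  # S3042

# Hypothetical rows of the event table and hypothetical observations from S3040.
event_table = [
    {"event_id": "E1", "node_id": "node-1", "component_id": "comp-1",
     "event_or_state": "threshold exceeded"},
    {"event_id": "E2", "node_id": "node-1", "component_id": "comp-2",
     "event_or_state": "failure"},
]
observed = {("node-1", "comp-1"): {"threshold exceeded"}}
queue = deque()
collect_events(event_table, observed, queue)
print(list(queue))  # ['E1']
```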
At this stage, generated events are simply collected. Reflection of information on the event table 32100 is executed in the detected event processing shown in
<Collected Event Processing>
First, the management program 32200 extracts one event ID from the detected event queue 32210 (S331).
The management program 32200 sets, concerning the extracted event, the last detection time 32106 of the event table (TBL_EVT) 32100 to the present time (S332). As the present time, any of the following is conceivable, for example: the time when the management program 32200 inputs the corresponding event ID to the detected event queue 32210; the time when the management program 32200 extracts the corresponding event ID from the detected event queue 32210; the time when the management program 32200 requests, during the event collection processing, monitoring target apparatuses to return detected events or states; the time when the management program 32200 checks the returned events or states in S3040; or the time when events or states are detected in the monitoring target apparatuses (the time when the events or the states are described in a log).
The management program 32200 inputs, concerning the event, a value obtained by adding a setting value (e.g., 30 minutes) of the event valid period of the various setting values definition table (TBL_PROPERTY) 32020 to the last detection time (the present time) set in S332 to the space of the expiration 32153 of the event/action expiration management table (TBL_EVT_ACT_EXPIRATION) 32150 (S333). When the expiration 32153 is set, time measurement for determining validity of the event is started.
Subsequently, the management program 32200 determines whether the detection state 32105 of the event table (TBL_EVT) 32100 corresponding to the event is undetected (S334).
When the detection state 32105 is detected (in the case of No in S334), if the next event is present in the detected event queue, the management program 32200 subsequently executes processing concerning the event and, if the next event is not present, the management program 32200 ends the processing. In other words, when the detection state 32105 is detected, information in the spaces of the last detection time 32106 and the expiration 32153 concerning the event is simply updated.
On the other hand, when the detection state 32105 is undetected (in the case of Yes in S334), the management program 32200 changes, concerning the event, the detection state 32105 of the event table (TBL_EVT) 32100 from undetected to detected (S335).
Further, the management program 32200 executes addition processing for the number of detected events or states 32114 of a relevant action expansion rule in the action expansion rule table (TBL_EXP_ACT) 32110 (S336) and addition processing for the number of detected events/number of satisfied actions 32164 of a relevant RCA expansion rule in the RCA expansion rule table (TBL_RCA) 32160 (S337). Details of S336 and S337 are respectively explained with reference to
If an event ID is still present in the detected event queue 32210, the management program 32200 repeats the processing of S331 to S337. If no event ID is present in the detected event queue 32210, the management program 32200 ends the collected event processing.
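The loop of S331 to S337 can be sketched as below. The function `process_detected_events` and the callback `on_newly_detected` (standing in for the two addition processings S336 and S337) are hypothetical, and the present time is simply passed in as `now`.

```python
from collections import deque
from datetime import datetime, timedelta

def process_detected_events(queue, event_table, expirations, now, valid_period,
                            on_newly_detected):
    while queue:
        event_id = queue.popleft()                   # S331
        row = event_table[event_id]
        row["last_detection_time"] = now             # S332
        expirations[event_id] = now + valid_period   # S333
        if row["detection_state"] == "undetected":   # S334
            row["detection_state"] = "detected"      # S335
            on_newly_detected(event_id)              # S336 and S337
        # Already detected: only the time and the expiration were refreshed.

event_table = {"E1": {"detection_state": "undetected", "last_detection_time": None}}
expirations = {}
newly_detected = []
now = datetime(2010, 6, 8, 18, 0)
# E1 is queued twice, so the counters are updated only for the first occurrence.
process_detected_events(deque(["E1", "E1"]), event_table, expirations, now,
                        timedelta(minutes=30), newly_detected.append)
print(newly_detected)  # ['E1']
```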
<Details of the Addition Processing for the Number of Detected Events of an Action Expansion Rule>
First, the management program 32200 refers to the action expansion rule ID-event ID relation table (TBL_ACT_EVT) 32120 and searches for an action expansion rule ID to which a relevant event (an event as a processing target) is related (S3360). Concerning all action expansion rule IDs acquired in this processing, processing in S3361 to S3363 explained below is executed.
The management program 32200 selects one action expansion rule ID and adds 1 to the number of detected events or states 32114 corresponding to the action expansion rule ID in the action expansion rule table (TBL_EXP_ACT) 32110 (S3361).
Subsequently, the management program 32200 refers to the action expansion rule table (TBL_EXP_ACT) 32110 and determines whether the number of detected events or states 32114 corresponding to the action expansion rule ID reaches (is equal to or larger than) the number of detected events or states necessary for action execution 32113 (S3362).
When the number of detected events or states 32114 does not reach the number of detected events or states necessary for action execution 32113 (in the case of No in S3362), the processing shifts to processing concerning the next acquired action expansion rule ID. If all the action expansion rule IDs acquired in S3360 are already processed, the management program 32200 ends the addition processing. If there is an unprocessed action expansion rule ID, the management program 32200 repeats the processing of S3361 to S3363.
When the number of detected events or states 32114 reaches the number of detected events or states necessary for action execution 32113 (in the case of Yes in S3362), the management program 32200 adds the relevant execution action ID 32112 of the action expansion rule table (TBL_EXP_ACT) 32110 to the action queue 32220 (S3363).
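The steps S3360 to S3363 can be sketched as follows. The function `add_detected_event` and the dictionary layout are hypothetical. The relation of E5, E6, and E7 to Exp-Act 1-1 and the required count of two follow the examples given for the tables above.

```python
from collections import deque

def add_detected_event(event_id, rule_events, rule_table, action_queue):
    """Increment the counters of every action expansion rule related to event_id."""
    for rule_id, event_ids in rule_events.items():       # S3360: find related rules
        if event_id not in event_ids:
            continue
        rule = rule_table[rule_id]
        rule["detected"] += 1                             # S3361
        if rule["detected"] >= rule["required"]:          # S3362
            action_queue.append(rule["action_id"])        # S3363

rule_events = {"Exp-Act 1-1": {"E5", "E6", "E7"}}
rule_table = {"Exp-Act 1-1": {"action_id": "A1", "required": 2, "detected": 0}}
action_queue = deque()

add_detected_event("E5", rule_events, rule_table, action_queue)
after_first = list(action_queue)    # one event detected, threshold not reached
add_detected_event("E6", rule_events, rule_table, action_queue)
after_second = list(action_queue)   # two events detected, A1 is queued
print(after_first, after_second)    # [] ['A1']
```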
<Details of the Addition Processing for the Number of Detected Events/Number of Satisfied Actions of an RCA Expansion Rule>
First, the management program 32200 refers to the RCA expansion rule ID-event ID/action ID relation table (TBL_RCA_EVT_ACT) 32130 and searches for an RCA expansion rule ID to which a relevant event (an event as a processing target) is related (S3370). Processing in S3371 and S3372 is executed concerning all RCA expansion rule IDs acquired in this processing.
The management program 32200 selects one RCA expansion rule ID and adds 1 to the number of detected events/number of satisfied actions 32164 corresponding to the RCA expansion rule ID 32161 in the RCA expansion rule table (TBL_RCA) 32160 (S3371).
The management program 32200 divides the number of detected events/number of satisfied actions 32164 of the RCA expansion rule by a total number of events/actions and sets a value obtained by the division as the certainty factor 32165 (S3372).
If all the RCA expansion rule IDs acquired in S3370 are already processed, the management program 32200 ends the addition processing. If an unprocessed RCA expansion rule ID is present, the management program 32200 repeats the processing in S3371 and S3372.
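The counter update of S3371 and the certainty factor calculation of S3372 can be sketched as follows. The function `add_to_rca_rules` and the dictionary layout are hypothetical. The relation of A1, E5, E6, and E7 to Exp 2-1 follows the example given for the relation table 32130, which gives a total of four.

```python
def add_to_rca_rules(item_id, rca_relations, rca_table):
    """Increment the counter and recompute the certainty factor of each related RCA rule."""
    for rule_id, item_ids in rca_relations.items():             # S3370: find related rules
        if item_id not in item_ids:
            continue
        rule = rca_table[rule_id]
        rule["detected"] += 1                                    # S3371
        rule["certainty"] = rule["detected"] / rule["total"]     # S3372

rca_relations = {"Exp 2-1": {"A1", "E5", "E6", "E7"}}
rca_table = {"Exp 2-1": {"total": 4, "detected": 0, "certainty": 0.0}}

add_to_rca_rules("E5", rca_relations, rca_table)
first = rca_table["Exp 2-1"]["certainty"]   # 1 / 4
add_to_rca_rules("E6", rca_relations, rca_table)
second = rca_table["Exp 2-1"]["certainty"]  # 2 / 4
print(first, second)  # 0.25 0.5
```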
<Expiration Management Processing>
First, the management program 32200 refers to the event/action expiration management table (TBL_EVT_ACT_EXPIRATION) 32150 and determines, concerning one event ID/action ID, whether the relevant expiration 32153 is before the present time (an event or an action is expired) (S3050). The present time means time when the expiration management processing is started concerning the event ID/action ID.
When the event or action is not expired (in the case of No in S3050), the processing shifts to processing of the next event ID/action ID or ends.
When the event or action is expired (in the case of Yes in S3050), the management program 32200 sets the space of the expiration 32153 of the event ID/action ID to blank or Null (−) (S3051) and further sets the space of the state 32152 from valid to invalid (S3052).
Subsequently, the management program 32200 executes subtraction processing for the number of detected events or states 32114 of a relevant action expansion rule in the action expansion rule table (TBL_EXP_ACT) 32110 (S3053) and subtraction processing for the number of detected events/number of satisfied actions 32164 of a relevant RCA expansion rule in the RCA expansion rule table (TBL_RCA) 32160 (S3054). Details of S3053 and S3054 are respectively explained with reference to
If an unprocessed event ID/action ID is still present, the management program 32200 repeats the processing in S3050 to S3054. If no unprocessed event ID/action ID is present, the management program 32200 ends the expiration management processing.
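The loop of S3050 to S3054 can be sketched as below. The function `expire_items` and the callback `on_expired` (standing in for the two subtraction processings S3053 and S3054) are hypothetical. Entries with no expiration set are treated as not expired, which is an assumption.

```python
from datetime import datetime

def expire_items(expiration_table, now, on_expired):
    for item_id, row in expiration_table.items():
        expiration = row["expiration"]
        # S3050: an item is expired only when its expiration is before the present time.
        if expiration is None or expiration >= now:
            continue
        row["expiration"] = None    # S3051
        row["state"] = "invalid"    # S3052
        on_expired(item_id)         # S3053 and S3054

table = {
    "E1": {"state": "valid", "expiration": datetime(2010, 6, 8, 17, 30)},
    "A3": {"state": "valid", "expiration": datetime(2010, 6, 8, 18, 17)},
}
expired = []
expire_items(table, datetime(2010, 6, 8, 18, 10), expired.append)
print(expired)  # ['E1']
```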
<Details of the Subtraction Processing for the Number of Detected Events or States 32114 of an Action Expansion Rule>
First, the management program 32200 refers to the action expansion rule ID-event ID relation table (TBL_ACT_EVT) 32120 and retrieves an action expansion rule ID to which a relevant event (an event as a processing target) is related (S30530). Processing in S30531 explained below is executed concerning all action expansion rule IDs acquired in this processing.
The management program 32200 selects one action expansion rule ID and subtracts 1 from the number of detected events or states 32114 corresponding to the action expansion rule ID in the action expansion rule table (TBL_EXP_ACT) 32110 (S30531).
If all the action expansion rule IDs acquired in S30530 are already processed, the management program 32200 ends the subtraction processing. If an unprocessed action expansion rule ID is present, the management program 32200 repeats the processing in S30531.
<Details of the Subtraction Processing for the Number of Detected Events/Number of Satisfied Actions of an RCA Expansion Rule>
First, the management program 32200 refers to the RCA expansion rule ID-event ID/action ID relation table (TBL_RCA_EVT_ACT) 32130 and retrieves an RCA expansion rule ID to which a relevant event (an event as a processing target) is related (S30540). Processing in S30541 and S30542 explained below is executed concerning all RCA expansion rule IDs acquired in this processing.
The management program 32200 selects one RCA expansion rule ID and subtracts 1 from the number of detected events/number of satisfied actions 32164 corresponding to the RCA expansion rule ID 32161 of the RCA expansion rule table (TBL_RCA) 32160 (S30541).
The management program 32200 divides the number of detected events/number of satisfied actions 32164 of a relevant RCA expansion rule by a total number of events/actions and sets a value obtained by the division as the certainty factor 32165 (S30542).
If all the RCA expansion rule IDs acquired in S30540 are already processed, the management program 32200 ends the subtraction processing. If an unprocessed RCA expansion rule ID is present, the management program 32200 repeats the processing in S30541 and S30542.
<Action Execution Processing>
First, the management program 32200 extracts one execution action ID from the action queue 32220 (S390), refers to the action execution management table (TBL_ACT) 32140, and determines whether the execution state 32144 of the execution action ID is under execution (S391). When the same action is executed by another event, if execution of the same action is processed from the present event, the management program 32200 ends the same action executed earlier. The present action is not executed and an action execution result of the last time is diverted as an action execution result of this time.
When the execution state is under execution (in the case of Yes in S391), the management program 32200 ends the processing concerning the action ID.
When the execution state is not under execution (in the case of No in S391), the management program 32200 refers to the event/action expiration management table (TBL_EVT_ACT_EXPIRATION) 32150 and determines whether the state 32152 is valid (S392).
When the state is valid (in the case of Yes in S392), the management program 32200 ends the processing concerning the action ID. In this case, since an execution result of the same action has not expired yet, the same execution result is diverted to the execution action ID. Consequently, processing is made efficient without executing the same action many times even if the same action execution command is issued within a predetermined time.
When the state is invalid (in the case of No in S392), the management program 32200 executes an action corresponding to the action ID (S393). Details of this action execution processing are explained with reference to
<Details of the Action Execution Processing>
The management program 32200 sets, in the action execution management table (TBL_ACT) 32140, the space of the execution state 32144 corresponding to the action ID 32141 of the processing target to under execution (S39300).
The management program 32200 sets the space of the last time execution result 32145 to blank (S39301) and further sets the space of the last execution result decision time 32146 to blank (S39302). This is because, once an execution result of this time is obtained, the execution result of the last time and the information concerning the last execution result decision time are unnecessary.
The management program 32200 executes an action specified by an action ID set as a processing target (S39303). Content of the action to be executed is specified by the action type 32031 and the action content 32034 of the action definition table (TBL_ACT_DEF) 32030.
When an execution result is obtained, the management program 32200 sets, according to content of the execution result, satisfaction or not in the space of the last time execution result 32145 of the action execution management table (TBL_ACT) 32140 (S39304) and sets the present time in the space of the last execution result decision time 32146 (S39305). The present time means a time specified within the series of processing related to action execution, such as the time when an action ID is extracted from the action queue, the time when execution of an action is actually started, the time when a check request for a system log is transmitted to a relevant monitoring target apparatus, the time when a reply to the request is received from the monitoring target apparatus, or the time when whether or not an action is satisfied is decided from the received reply.
Further, the management program 32200 sets, in the event/action expiration management table (TBL_EVT_ACT_EXPIRATION) 32150, the space of the state 32152 corresponding to the action ID as the processing target to valid (S39306) and adds the valid period 32033 of the relevant action type 32031 defined by the action definition table 32030 to the last execution result decision time set in S39305 and sets the expiration 32153 (S39307).
Subsequently, the management program 32200 determines whether the action execution result obtained in S39303 is satisfaction (S39308). When the action execution result is not satisfied (in the case of No in S39308), the processing shifts to S39310.
When the action execution result is satisfaction (in the case of Yes in S39308), the management program 32200 adds 1 to the number of detected events/number of satisfied actions 32164 of the RCA expansion rule table (TBL_RCA) 32160 (S39309). Details of S39309 are the same as the processing explained with reference to
Finally, the management program 32200 sets the space of the execution state 32144 of the action execution management table (TBL_ACT) 32140 to on standby and ends the action execution processing (S39310).
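The flow from S391 to S39310 can be sketched as follows. This is a hedged illustration rather than the actual implementation: `ActionState` is a hypothetical structure that merges one row of the action execution management table (TBL_ACT) 32140 with the corresponding row of the event/action expiration management table 32150, the callable `run` stands in for the action itself, and the use of `try`/`finally` to return to the standby state is an assumption. The update of the RCA expansion rule table in S39308 and S39309 is left to the caller.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ActionState:
    """Hypothetical merge of a TBL_ACT row and its TBL_EVT_ACT_EXPIRATION row."""
    running: bool = False               # execution state 32144
    last_result: Optional[bool] = None  # last time execution result 32145
    decided_at: Optional[float] = None  # last execution result decision time 32146
    valid: bool = False                 # state 32152
    expires_at: Optional[float] = None  # expiration 32153


def process_action(state, run, valid_period, now):
    """Return True if the action was executed, False if it was skipped."""
    if state.running:   # S391: the same action is already under execution
        return False
    if state.valid:     # S392: the earlier result has not expired and is reused
        return False
    state.running = True                        # S39300
    state.last_result = None                    # S39301
    state.decided_at = None                     # S39302
    try:
        result = run()                          # S39303
        state.last_result = result              # S39304
        state.decided_at = now                  # S39305
        state.valid = True                      # S39306
        state.expires_at = now + valid_period   # S39307
    finally:
        state.running = False                   # S39310: back to standby
    return True


calls = []


def check_log():
    """Stand-in for an action such as checking a system log."""
    calls.append(1)
    return True


state = ActionState()
first = process_action(state, check_log, valid_period=60.0, now=1000.0)
second = process_action(state, check_log, valid_period=60.0, now=1010.0)
```

The second call is skipped because the result of the first call is still marked valid, so the action runs only once. Switching the state back to invalid when the expiration passes is assumed to be handled by separate processing.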
<RCA Result Output Processing>
The management program 32200 refers to the conclusion table 32170, acquires the information 32172 to 32177 corresponding to conclusion IDs set as targets, and subjects the information 32172 to 32177 to GUI processing and displays the information on a display screen (S410). Examples of the GUI screen include system monitoring consoles shown in
<Conclusion Table Update Processing>
First, the management program 32200 acquires, concerning one conclusion ID, at least one RCA expansion rule ID 32161 having the same conclusion ID 32161 in the RCA expansion rule table (TBL_RCA) 32160 (S420) and acquires values of the certainty factor 32165 corresponding to the acquired RCA expansion rule ID 32161 (S421).
The management program 32200 sets a maximum among the certainty factors acquired in S421 as a value of the present certainty factor 32176 in the conclusion table (TBL_ROOT_CAUSE) 32170 (S422). In some cases, plural RCA expansion rules leading to the same conclusion are present. However, a result with the highest certainty factor (accuracy of a root cause analysis result) among the RCA expansion rules is selected.
The management program 32200 sets the RCA expansion rule ID 32161 having the certainty factor 32165 selected in S422 in the space of the expansion rule ID 32171 used for the certainty factor calculation of the conclusion table (TBL_ROOT_CAUSE) 32170 (S423). In the conclusion table (TBL_ROOT_CAUSE) 32170, only one RCA expansion rule corresponds to one conclusion ID. However, an RCA expansion rule other than the RCA expansion rule that provides the present certainty factor may also be input. In that case, it is necessary to clearly indicate which RCA expansion rule provides the present certainty factor.
The processing in S420 to S423 is executed on all the conclusion IDs 32171 of the conclusion table (TBL_ROOT_CAUSE) 32170. Information of the present certainty factor 32176 equivalent to the number of the conclusion IDs 32171 is obtained.
The management program 32200 sets the present rank 32175 in order from the conclusion ID 32171 having the largest present certainty factor among the obtained plural present certainty factors (S424).
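The update in S420 to S424 can be sketched as follows, assuming that the RCA expansion rule table is available as a list of tuples. The function name `update_conclusions`, the tuple layout, and the handling of ties (the earlier entry keeps the higher rank) are assumptions made for illustration.

```python
def update_conclusions(rules):
    """rules: list of (rule_id, conclusion_id, certainty) tuples.

    Returns {conclusion_id: {"rule_id": ..., "certainty": ..., "rank": ...}}.
    """
    best = {}
    for rule_id, conclusion_id, certainty in rules:            # S420, S421
        current = best.get(conclusion_id)
        if current is None or certainty > current["certainty"]:
            # S422, S423: keep the maximum and the rule that produced it
            best[conclusion_id] = {"rule_id": rule_id, "certainty": certainty}
    # S424: rank conclusions from the largest present certainty factor
    ordered = sorted(best, key=lambda cid: best[cid]["certainty"], reverse=True)
    for rank, conclusion_id in enumerate(ordered, start=1):
        best[conclusion_id]["rank"] = rank
    return best


table = update_conclusions([
    ("RCA1", "C1", 0.50),
    ("RCA2", "C1", 0.75),
    ("RCA3", "C2", 1.00),
])
```

Here conclusion C1 is reached by two RCA expansion rules, and the one with the larger certainty factor (RCA2) is recorded as the rule used for the certainty factor calculation.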
<Conclusion ID-Event ID Relation Table Update Processing>
First, the management program 32200 selects one set of the conclusion ID 32181 and the event ID 32182 from the conclusion ID-event ID relation table (TBL_ROOT_CAUSE_EVT) 32180 and sets the detection state 32183 of the conclusion ID-event ID relation table (TBL_ROOT_CAUSE_EVT) 32180 to a value (detected or undetected) same as the detection state 32105 corresponding to the event ID 32101 in the event table (TBL_EVT) 32100 (S430).
Further, the management program 32200 sets a value of the detection time 32184 of the conclusion ID-event ID relation table 32180 to a value same as the last detection time 32106 corresponding to the event ID 32101 in the event table (TBL_EVT) 32100 (S431).
As explained above, all kinds of information of the detection state 32183 and the detection time 32184 in the conclusion ID-event ID relation table (TBL_ROOT_CAUSE_EVT) 32180 are updated.
<Conclusion ID-Action ID Relation Table Update Processing>
First, the management program 32200 selects one set of the conclusion ID 32191 and the action ID 32192 from the conclusion ID-action ID relation table 32190 and sets the execution result 32193 of the conclusion ID-action ID relation table (TBL_ROOT_CAUSE_ACT) 32190 to a value (satisfied or not satisfied) same as the last time execution result 32145 corresponding to the action ID 32141 in the action execution management table (TBL_ACT) 32140 (S440).
Further, the management program 32200 sets a value of the execution result decision time 32194 of the conclusion ID-action ID relation table (TBL_ROOT_CAUSE_ACT) 32190 to a value same as the last execution result decision time 32146 corresponding to the action ID 32141 in the action execution management table (TBL_ACT) 32140 (S441).
As explained above, all kinds of information of the execution result 32193 and the execution result decision time 32194 in the conclusion ID-action ID relation table (TBL_ROOT_CAUSE_ACT) 32190 are updated.
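Both relation table updates (S430 and S431, and S440 and S441) copy fields from a source table into the matching rows of a relation table. A minimal sketch of that copy, assuming the tables are held as plain dictionaries and lists, is shown below for the event case. The function `sync_relations` and all field names are hypothetical.

```python
def sync_relations(relations, source, key, fields):
    """Copy fields (target name -> source name) from source[row[key]] into each row."""
    for row in relations:
        src = source[row[key]]
        for target_field, source_field in fields.items():
            row[target_field] = src[source_field]


# Stand-in for the event table (TBL_EVT), keyed by event ID
events = {"EVT1": {"detection_state": "detected", "last_detection_time": 1000.0}}

# Stand-in for the conclusion ID-event ID relation table (TBL_ROOT_CAUSE_EVT)
root_cause_evt = [{"conclusion_id": "C1", "event_id": "EVT1",
                   "detection_state": "undetected", "detection_time": None}]

sync_relations(root_cause_evt, events, "event_id",
               {"detection_state": "detection_state",       # S430
                "detection_time": "last_detection_time"})   # S431
```

The same helper could be applied to the conclusion ID-action ID relation table by keying on the action ID and mapping the execution result and the execution result decision time.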
<Example of an RCA Result Output Screen>
The RCA result output screen (a present result: list display) 450 includes an RCA result type plane 451 and an RCA result list display plane 452.
In the RCA result type plane 451, only a present analysis result is shown as a type. However, the RCA result type plane 451 is not limited to this and may include a past RCA result as a type. In the RCA result list display plane 452, a list of RCA results sorted according to the present rank 32175 of the conclusion table 32170 is displayed.
The RCA result output screen (a present result: detailed display) 460 includes an RCA result type plane 461 and an RCA result detailed display plane 462.
The RCA result type plane 461 has content in which the RCA result type plane 451 is described in more detail. In
Detailed content of a selected root cause is displayed in the RCA result detailed display plane. For example, when one root cause is selected from the list display of the root causes of the RCA result type plane 461, detailed content of the root cause is displayed in the RCA result detailed display plane 462. In the example shown in
As explained above, in this embodiment, a separately-defined action is set in an expansion rule, a certainty factor is calculated on the basis of whether or not a condition event and an action execution result are satisfied, and an RCA result is generated. This action is processing for checking whether or not a substitute condition event equivalent to a condition event not easily detected in the conventional expansion rule is satisfied. This action is, for example, an action for checking a system log in a monitoring target apparatus and detecting presence or absence of an error or the like. Necessity of action execution is determined, by an action expansion rule, according to whether a predetermined number of condition events other than the action defined in the expansion rule are satisfied (the number differs according to the expansion rule). Consequently, since presence or absence of even an error not easily detected is checked by another kind of means, it is possible to provide an RCA result with a higher certainty factor. Even when an information amount is small and a root cause analysis cannot be performed, since an action execution result can be included in the root cause analysis processing as additional information, it is possible to provide an RCA result with a higher certainty factor. Further, since an action rule is simply introduced anew, it is possible to reduce cost (a processing load, a consumed memory capacity, and processing time) in management of a computer system.
The action expansion rule is configured to include condition events of an RCA expansion rule corresponding to the action expansion rule. Necessity of action execution is determined according to the number of satisfied condition events. However, the action expansion rule is not limited to this. The action expansion rule may include events or states not coinciding with the condition events included in the corresponding RCA expansion rule.
Specifically, when events or states a, b, and c and an action X are included in the RCA expansion rule as condition events, as condition events of an action rule for determining execution of the action X, events or states d, e, and f may be included in addition to or separately from at least a part of the events or states a, b, and c.
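The determination of whether an action needs to be executed can be sketched as follows. This is an illustrative assumption, not the defined behavior: the function `action_required`, the "at least `threshold`" comparison, and the concrete threshold of 3 are hypothetical, and the event names follow the a to f labels used above.

```python
def action_required(condition_events, detected, threshold):
    """True when at least `threshold` condition events of the action rule are detected."""
    satisfied = sum(1 for event in condition_events if event in detected)
    return satisfied >= threshold


# Action rule for action X: a and b are shared with the RCA expansion rule,
# while d and e appear only in the action rule.
rule_events = ["a", "b", "d", "e"]

needs_x = action_required(rule_events, detected={"a", "d", "e"}, threshold=3)
skip_x = action_required(rule_events, detected={"a"}, threshold=3)
```

In this sketch, action X is requested only in the first case, where three of the four condition events of the action rule have been detected.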
In this embodiment, whether detected condition events and execution results of actions are valid or invalid is managed sequentially. A certainty factor is sequentially calculated again according to a change in a state (a change from invalid to valid or a change from valid to invalid) of the condition events and the action execution results. Consequently, it is possible to provide certainty factor information with higher reliability.
Further, when execution of the same action is instructed again during execution of the action or within a set time from a point when an execution result of the action is acquired, the same action is not executed and the execution result of the action already acquired is used as the execution result of the same action. In this way, the same execution result is diverted to the same action execution request within a fixed period. Consequently, it is possible to save useless processing and hold down management cost of a computer system while maintaining accuracy of a certainty factor at or above a fixed level.
In this embodiment, the list display (
The present invention is not limited to the embodiment per se. At an implementation stage, the components can be modified and embodied without departing from the spirit of the present invention. Various inventions can be formed according to appropriate combinations of plural components disclosed in the embodiment. For example, several components may be deleted from all the components described in the embodiment. Further, components of different embodiments may be combined as appropriate.
A part or all of the components, the functions, the processing units, the processing means, and the like described in the embodiment may be realized by hardware by, for example, designing the components, the functions, the processing units, the processing means, and the like with integrated circuits. The components, the functions, and the like may be realized by a processor interpreting and executing programs for realizing the respective functions. Information concerning programs, tables, files, and the like for realizing the functions and the like can be stored in a recording or storing device such as a memory, a hard disk, or an SSD (Solid State Drive) or a recording or storing medium such as an IC card, an SD card, or a DVD.
Further, in the embodiment explained above, the control lines and information lines considered necessary for explanation are shown. Not all of the control lines and information lines of an actual product are necessarily shown. All the components may be connected to one another.
Filing Document | Filing Date | Country | Kind | 371c Date |
---|---|---|---|---|
PCT/JP2010/068717 | 10/22/2010 | WO | 00 | 1/21/2011 |