This application is based upon and claims the benefit of priority of the prior Japanese Patent Application No. 2017-210989, filed on Oct. 31, 2017, the entire contents of which are incorporated herein by reference.
The embodiments discussed herein are related to an event investigation assist method and an event investigation assist device.
In recent years, mechanisms that acquire and visualize various pieces of information, such as logs, configuration information, and performance information, from a system such as an information processing system have come into use in order to shorten the time a system operator needs to deal with an alert. Alerts range from a fatal error of a system to a simple status notification. When an alert occurs, the operator acquires various pieces of information from the system and determines whether there is a fault in the system.
However, when the amount of information provided by the system becomes large, an inexperienced operator may not know the procedure and type of information to be acquired to determine whether there is a fault in the system. In other words, an inexperienced operator may not know a procedure of investigation when an alert occurs.
For example, alerts occur frequently in the operation of a cloud system, and each time an alert occurs an operator determines whether the alert affects a service. In order to make this determination, the operator investigates graphs of various resource usage statuses, including a CPU use rate, a response time of a storage, and the like. However, an inexperienced operator may not know which resource usage graphs to investigate, or in what order, to determine whether the service is affected when an alert occurs. Therefore, an investigation procedure for the graphs to be examined when an alert occurs is prepared in advance, based on the past experiences of various operators, and used.
There is an incident management system that visualizes the situations such as a fault influence range in a target system having a configuration in which cloud environments, fault tolerance, or the like is considered. The incident management system has a first function of generating a screen that visualizes the incident situation including the configuration of the target system and the fault influence range using configuration information and incident information, and providing the screen to a terminal of a person in charge. Further, the incident management system also has a second function of setting a configuration including constituent parts designed in consideration of the fault in the target system in the configuration information as a configuration management model.
There is a technique that indicates a sufficiency of monitoring with respect to the target devices and monitoring items in the system. In the technique, a monitoring server receives operation data from the device, causes an administrator terminal to output the received operation data according to a viewpoint instructed from the administrator terminal, and allows a user to monitor the device and the monitoring items by outputting the operation data. In addition, the monitoring server generates a first evaluation value including first information and a first index indicating a sufficiency of monitoring based on the operation data, output setting, an access log, and a first period. Further, the monitoring server generates a second evaluation value including second information and a second index indicating a sufficiency of monitoring based on the first evaluation value, and generates a third evaluation value including third information and a third index indicating the sufficiency of monitoring based on the second evaluation value. In addition, the monitoring server generates data that displays the first, second, and third evaluation values.
There is a fault investigation information apparatus which investigates the fault against a computer fault. The fault investigation information apparatus includes an operating environment setting unit that sets in advance an operating environment for a fault investigation in a setting table, a log collection unit that collects investigation information in accordance with the contents of the setting table when a fault occurs, and a trace collection unit that collects trace information at the time of designating the operation. Further, the fault investigation information apparatus includes an investigation information recording unit that outputs the investigation information collected by the log collection unit and the trace collection unit to a storage medium according to the contents of the setting table.
Related techniques are disclosed in, for example, Japanese Laid-Open Patent Publication No. 2012-038028, Japanese Laid-Open Patent Publication No. 2012-238213, and Japanese Laid-Open Patent Publication No. 10-260861.
According to an aspect of the present invention, provided is an event investigation assist device including a memory and a processor coupled to the memory. The processor is configured to store a procedure model in the memory. The procedure model associates event information with an investigation procedure and a required time for each of events that occur in a system. The event information is information regarding a relevant event. The investigation procedure is a procedure for investigating whether the relevant event is a fault. The required time is a time spent for performing investigation in accordance with the investigation procedure. The processor is configured to store a learning model in the memory. The learning model associates a first reliability with each of investigation contents for each of the events. The investigation contents are included in the investigation procedure. The processor is configured to accept first event information. The processor is configured to calculate a second reliability for each of first investigation procedures based on the learning model and the procedure model. The first investigation procedures are associated with the first event information in the procedure model. The processor is configured to determine a recommended investigation procedure from among the first investigation procedures based on the second reliability and the required time. The processor is configured to display the recommended investigation procedure.
The object and advantages of the invention will be realized and attained by means of the elements and combinations particularly pointed out in the claims. It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory and are not restrictive of the invention, as claimed.
In a technique that generates and uses an investigation procedure for graphs at the time of an alert based on the past experiences of various operators, an investigation procedure with a low reliability may be generated, which is problematic. For example, when an inexperienced operator generates an investigation procedure, the investigation procedure may be inefficient and its reliability may be low.
Hereinafter, embodiments will be described in detail with reference to the accompanying drawings. The embodiments do not limit the disclosed technology.
First, an outline of an event investigation assist device according to a first embodiment will be described.
In
The event investigation assist device according to the first embodiment makes a rule for the alert information in association with three graphs and the investigation time. The event investigation assist device according to the first embodiment learns a plurality of rules generated by a plurality of operators regarding the same alert.
Then, when a similar alert occurs, the event investigation assist device according to the first embodiment uses the learning result to display the graphs to be investigated and the procedure for investigating them. Operator B performs an investigation by referring to the investigation procedure displayed by the event investigation assist device according to the first embodiment and determines whether the service is affected. Operator B may also perform the investigation according to a procedure different from the displayed investigation procedure. In addition, the event investigation assist device according to the first embodiment receives, as feedback, the investigation procedure and the investigation time of operator B and learns them.
As described above, the event investigation assist device according to the first embodiment learns and uses the investigation procedure and the investigation time by the operator when the alert occurs. Further, the graph is displayed by searching a database that stores performance data of a resource of the cloud system. Accordingly, a search expression of the database corresponds to the graph.
In the meantime, as illustrated in
Next, a functional configuration of the event investigation assist device according to the first embodiment will be described.
The investigation history reception unit 11 receives past alert information. The investigation history storing unit 12 stores the alert investigation history received by the investigation history reception unit 11 in the investigation history storage unit 13. The investigation history storage unit 13 stores the alert investigation history.
The symbol “No.” represents a number which identifies the alert investigation history. The occurrence time represents a date and time when the alert occurs. The occurrence place represents a place where the alert occurs in the cloud system. The alert type represents a type of alert. Here, the occurrence place and the alert type are collectively referred to as an alert class. The operator ID represents an identifier which identifies the operator that performs the investigation for the alert.
The investigation history is information on the investigation item and the investigation time. The investigation history includes the search expression, a search time, and a reference time. The search expression is a search expression which the operator uses to display the graph and is the investigation item. The search time represents a time when the operator inputs the search expression. The reference time represents a time when the operator refers to the graph and represents a time when the graph is displayed on a screen. A unit of the reference time is second (s). One investigation history includes one or more combinations of the search expression, the search time, and the reference time.
For example, the alert identified by “0” occurs at “10/24 10:12”, the occurrence place is “physical host”, the alert type is “communication delay with the storage”, and the alert is investigated by “tetsuya”. “tetsuya” displays a graph at “10:21” using “search expression A” and refers to it for “215” seconds, then displays a graph at “10:25” using “search expression B” and refers to it for “123” seconds.
The application procedure estimation unit 14 estimates, for each alert investigation history, the search expressions, the procedure, and the investigation time which the operator uses for the investigation from the investigation history to generate a procedure model. In this case, the application procedure estimation unit 14 assigns the same application procedure to the search expressions whose graphs the operator is estimated to have viewed simultaneously. The application procedure estimation unit 14 estimates that the operator views graphs simultaneously when the periods during which the graphs referred to by the operator are displayed overlap by more than a predetermined threshold.
The application procedure estimation unit 14 estimates a time required by the operator for the investigation (required time) based on a first search time, a last search time, and the reference time. In addition, the application procedure estimation unit 14 stores the generated procedure model in the procedure model storage unit 15. The procedure model storage unit 15 stores the procedure model.
The symbol “No.” represents a number which identifies the procedure model. The occurrence time represents a date and time when the alert corresponding to the procedure model occurs. The occurrence place represents a place where the alert corresponding to the procedure model occurs in the cloud system. The alert type represents a type of alert corresponding to the procedure model. The operator ID represents an identifier which identifies the operator that performs the investigation according to the investigation procedure of the procedure model.
The investigation procedure is one or more sets of an application procedure and a search expression group. The application procedure is the position of the investigation contents within the investigation procedure, that is, the order of investigation. The search expression group is one or more search expressions which the operator uses to display one or more graphs and is the investigation contents. The required time is a time which the operator requires for the investigation. A unit of the required time is second (s).
For example, in the procedure model identified by “1”, an alert of the type “path disruption with VM” occurs in “virtual storage” at “10/25 14:02”, and the alert is investigated by “tetsuya”. “tetsuya” first displays a graph by using “search expression B”, then displays two graphs by using “search expression C” and “search expression D”, and then displays a graph by using “search expression E”. The time which “tetsuya” requires for the investigation is “700” seconds.
The search expression reliability calculation unit 16 counts, for each alert class, the number of times each search expression group application procedure is used across the one or more procedure models, and generates the learning model by using the counted number of times as the reliability of the search expression group in the application procedure. Here, a search expression group application procedure is a combination of a search expression group and an application procedure. In addition, the search expression reliability calculation unit 16 stores the generated learning model in the learning model storage unit 17. The learning model storage unit 17 stores the learning model.
For example, when “path disruption with VM” occurs in “virtual storage”, the reliability of using “search expression C” and “search expression D” first is “364”, and the reliability of using “search expression C” alone first is “46”.
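The counting described above can be sketched as follows. This is a minimal illustration, not the implementation of the embodiment: the dictionary layout, the function name build_learning_model, and the sample procedure models are assumptions introduced here, and the application procedure is represented as a 1-based position in the list of steps.

```python
from collections import Counter


def build_learning_model(procedure_models):
    """Count uses of each (application procedure, search expression group)
    pair per alert class. The count serves as the reliability."""
    learning_model = {}
    for pm in procedure_models:
        counts = learning_model.setdefault(pm["alert_class"], Counter())
        # Application procedures are numbered from 1 (assumed representation).
        for order, group in enumerate(pm["steps"], start=1):
            counts[(order, frozenset(group))] += 1
    return learning_model


# Hypothetical procedure models for a single alert class.
ALERT_CLASS = ("virtual storage", "path disruption with VM")
procedure_models = [
    {"alert_class": ALERT_CLASS, "steps": [{"C", "D"}, {"E"}]},
    {"alert_class": ALERT_CLASS, "steps": [{"C", "D"}]},
    {"alert_class": ALERT_CLASS, "steps": [{"C"}, {"E"}]},
]
learning_model = build_learning_model(procedure_models)
```

With this sample data, the pair of search expressions C and D used first receives a reliability of 2, and search expression C used alone first receives a reliability of 1.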
The procedure extraction unit 18 accepts alert information from the operator, acquires the learning model and the procedure model associated with the alert class, and applies the learning model to the procedure model to calculate the reliability of each procedure model. In addition, the procedure extraction unit 18 calculates a score obtained by dividing the reliability by the required time for each procedure model and displays the procedure model having a highest score on a display device as a recommended investigation procedure. The score becomes higher as more operators use the procedure model and as the investigation time becomes shorter. Further, the procedure model and the learning model correspond to the rule illustrated in
The feedback unit 19 accepts the investigation procedure and the investigation time with which the operator actually performs the investigation based on the recommended investigation procedure, updates the procedure model storage unit 15 with the accepted investigation procedure and investigation time, and updates the learning model storage unit 17 by using the updated procedure model storage unit 15. Further, the feedback unit 19 may accept the alert investigation history from the operator.
When the accepted investigation procedure is the same as the recommended investigation procedure, the feedback unit 19 increases the reliability of the learning model. When the accepted investigation procedure is different from the recommended investigation procedure, the feedback unit 19 reduces the reliability of the learning model corresponding to each application procedure whose search expression group in the recommended investigation procedure differs. The feedback unit 19 accepts a weight used when increasing or decreasing the reliability, and increases or decreases the reliability by a value multiplied by the weight. The details of updating the learning model will be described below using an example.
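One possible reading of this update rule is sketched below. The function name apply_feedback, the dictionary layout, and the unit change of 1 multiplied by the weight are assumptions made for illustration. The sketch also assumes that steps that match are left unchanged when the performed procedure differs from the recommended one.

```python
def apply_feedback(learned, recommended, performed, weight=1.0):
    """Update reliabilities in place from operator feedback.

    learned maps (application procedure, search expression group) to a
    reliability. recommended and performed are lists of search expression
    groups, one per application procedure.
    """
    rec = [frozenset(g) for g in recommended]
    act = [frozenset(g) for g in performed]
    if rec == act:
        # Same procedure: raise the reliability of every step.
        for order, group in enumerate(rec, start=1):
            learned[(order, group)] = learned.get((order, group), 0) + weight
        return learned
    # Different procedure: lower only the recommended steps that were changed.
    for order, group in enumerate(rec, start=1):
        actual = act[order - 1] if order <= len(act) else None
        if actual != group:
            learned[(order, group)] = learned.get((order, group), 0) - weight
    return learned


# Hypothetical reliabilities before feedback.
CD, E = frozenset({"C", "D"}), frozenset({"E"})
learned = {(1, CD): 364, (2, E): 100}
apply_feedback(learned, [{"C", "D"}, {"E"}], [{"C", "D"}, {"E"}], weight=2)
apply_feedback(learned, [{"C", "D"}, {"E"}], [{"C", "D"}, {"F"}], weight=3)
```

After the two calls, the first step has gained 2 and the second step has gained 2 and then lost 3, because the operator replaced search expression E in the second investigation.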
The investigation history reception unit 11, the investigation history storing unit 12, and the application procedure estimation unit 14 generate the procedure model by performing preprocessing. The search expression reliability calculation unit 16 learns the procedure model and generates the learning model. The procedure extraction unit 18 specifies the recommended investigation procedure using the learning model and displays the specified recommended investigation procedure. The feedback unit 19 updates the procedure model and the learning model based on the evaluation by the operator of the recommended investigation procedure.
Next, a flow of processing by the event investigation assist device 1 will be described.
In addition, the application procedure estimation unit 14 generates the procedure model from the stored alert investigation history (step S2). Further, the search expression reliability calculation unit 16 calculates the sum of the number of times of use of the procedure model for each search expression group application procedure and generates the learning model representing the reliability for each application procedure of the search expression group (step S3).
As described above, the application procedure estimation unit 14 generates the procedure model and the search expression reliability calculation unit 16 generates a learning model representing the reliability for each application procedure of the search expression group, so that the event investigation assist device 1 may specify and display the recommended investigation procedure.
The feedback unit 19 accepts the investigation procedure performed based on the recommended investigation procedure and the weight for increasing/decreasing the reliability from the operator (step S14). Moreover, the feedback unit 19 updates the procedure model storage unit 15 and the learning model storage unit 17 based on the received investigation procedure and weight for increasing/decreasing the reliability (step S15).
As described above, the procedure extraction unit 18 calculates the reliability of each procedure model and calculates the score by dividing the calculated reliability by the required time, and as a result, the event investigation assist device 1 may specify and display the recommended investigation procedure.
The application procedure estimation unit 14 calculates an overlapping time from the search time and the reference time of a search expression Xk (k≠m, k=1 to M) different from the search expression Xm (step S22). When the search time and the reference time of the search expression Xm are Tm and Am, and the search time and the reference time of the search expression Xk are Tk and Ak, the overlapping time=min(Tk+Ak, Tm+Am)−max(Tk, Tm).
The application procedure estimation unit 14 determines whether the overlapping time is greater than a threshold value (step S23) and when the overlapping time is greater than the threshold value, the application procedure estimation unit 14 adds the search expression Xk to the search expression set Z (step S24). Further, the application procedure estimation unit 14 adds 1 to k and determines whether k is larger than M (step S25). When k is not greater than M, the process returns to step S22.
Meanwhile, when k is greater than M, the application procedure estimation unit 14 determines whether the search expression set Z is different from an immediately preceding (n-th) application procedure (step S26). Here, an initial value of n is 0. In addition, when the search expression set Z is different from the immediately preceding (n-th) application procedure, 1 is added to n and the search expression set Z is set as the n-th application procedure (step S27).
The application procedure estimation unit 14 empties the search expression set Z (step S28), adds 1 to m, and adds the search expression Xm having the next earliest search time to the search expression set Z (step S29). Moreover, the application procedure estimation unit 14 determines whether m is larger than M (step S30). When m is not larger than M, the process returns to step S22, and when m is larger than M, a total required time is calculated and the process ends (step S31). Here, the required time=max(Tm+Am)−T1.
As described above, the application procedure estimation unit 14 may generate the procedure model by specifying the search expression having the same application procedure and calculating the required time.
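The grouping of search expressions and the calculation of the required time can be sketched as follows. The class name Search, the function names, the sample times, and the threshold of 30 seconds are assumptions made for illustration. Times are expressed in seconds from an arbitrary origin.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Search:
    expression: str  # search expression used to display a graph
    start: float     # search time T, in seconds
    duration: float  # reference time A, in seconds


def overlap(a, b):
    """Overlapping time: min(Ta+Aa, Tb+Ab) - max(Ta, Tb)."""
    return min(a.start + a.duration, b.start + b.duration) - max(a.start, b.start)


def estimate_procedure(searches, threshold):
    """Return (list of search expression groups, required time)."""
    ordered = sorted(searches, key=lambda s: s.start)
    if not ordered:
        return [], 0.0
    steps = []
    for xm in ordered:
        # Z holds Xm and every Xk whose display period overlaps with Xm.
        z = {xm.expression}
        for xk in ordered:
            if xk is not xm and overlap(xk, xm) > threshold:
                z.add(xk.expression)
        # Add Z only when it differs from the preceding application procedure.
        if not steps or steps[-1] != z:
            steps.append(z)
    required = max(s.start + s.duration for s in ordered) - ordered[0].start
    return steps, required


# Hypothetical history: C and D are displayed together, B and E alone.
history = [
    Search("B", 0, 100),
    Search("C", 120, 300),
    Search("D", 150, 250),
    Search("E", 450, 250),
]
steps, required_time = estimate_procedure(history, threshold=30)
```

For this sample history, the display periods of C and D overlap by 250 seconds, so the two search expressions share one application procedure, and the required time is 700 seconds.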
Next, an example by the event investigation assist device 1 will be described with reference to
The application procedure estimation unit 14 extracts data related to each alert class from the investigation history storage unit 13.
The application procedure estimation unit 14 estimates the application procedure of the search expression group based on the search time and the reference time. When the time for which a plurality of graphs are displayed overlapping one another is larger than a predetermined threshold value, the application procedure estimation unit 14 estimates that the operator refers to the plurality of graphs at the same time and determines that the application procedures of the corresponding search expressions are the same. Further, the application procedure estimation unit 14 estimates the required time based on a first search time, a last search time, and the reference time. In addition, the application procedure estimation unit 14 generates the procedure model and stores the procedure model in the procedure model storage unit 15.
The search expression reliability calculation unit 16 calculates the sum of the number of times of use for each search expression group to calculate the reliability for each application procedure of the procedure model. In addition, the search expression reliability calculation unit 16 generates the learning model and stores the generated learning model in the learning model storage unit 17.
The procedure extraction unit 18 extracts the learning model and the procedure model relating to the alert class of the occurring alert from the learning model storage unit 17 and the procedure model storage unit 15, respectively.
The procedure extraction unit 18 applies the learning model to each procedure model to calculate the reliability of each procedure model and calculates the score of each procedure model by dividing the calculated reliability by the required time. Here, applying the learning model to each procedure model indicates setting the reliability for each search expression group application procedure of the learning model as the reliability for each search expression group application procedure of the procedure model.
When the procedure model has only the application procedure “1”, the reliability of the application procedure “1” is the reliability of the procedure model. When the procedure model has a plurality of application procedures, the sum of the reliabilities of the application procedures is the reliability of the procedure model.
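The reliability and score calculation can be sketched as follows. The function names, the dictionary layout, the second-step reliability of 100, and the required time of 500 seconds are assumptions made for illustration. The reliabilities 364 and 46 and the required time of 700 seconds are taken from the examples given earlier.

```python
def procedure_reliability(steps, learned):
    """Sum the learned reliability of each application procedure."""
    return sum(
        learned.get((order, frozenset(group)), 0)
        for order, group in enumerate(steps, start=1)
    )


def score(pm, learned):
    """Reliability of the procedure model divided by its required time."""
    return procedure_reliability(pm["steps"], learned) / pm["required_time"]


def recommend(procedure_models, learned):
    """Return the procedure model having the highest score."""
    return max(procedure_models, key=lambda pm: score(pm, learned))


learned = {
    (1, frozenset({"C", "D"})): 364,
    (1, frozenset({"C"})): 46,
    (2, frozenset({"E"})): 100,  # hypothetical value
}
candidates = [
    {"no": 1, "steps": [{"C", "D"}, {"E"}], "required_time": 700},
    {"no": 2, "steps": [{"C"}, {"E"}], "required_time": 500},
]
best = recommend(candidates, learned)
```

Here the first candidate has a reliability of 464 and a score of about 0.66, while the second has a reliability of 146 and a score of about 0.29, so the first candidate is recommended even though it takes longer.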
The procedure extraction unit 18 displays the procedure having the highest score as the recommended investigation procedure. In
When the operator refers to the recommended investigation procedure to input information relating to the actually performed investigation in the event investigation assist device 1, the feedback unit 19 updates the procedure model storage unit 15 and the learning model storage unit 17 based on the input information.
In
As illustrated in
In
As illustrated in
As described above, when a search is inserted or replaced, the feedback unit 19 updates the learning model by adding a negative evaluation to the modified portion of the recommended investigation procedure, so that the feedback unit 19 may appropriately reflect the modification of the recommended investigation procedure in the learning model.
As described above, in the first embodiment, the investigation history storing unit 12 stores the alert investigation history extracted from the alert information in the investigation history storage unit 13. In addition, the application procedure estimation unit 14 generates the procedure model using the alert investigation history stored by the investigation history storage unit 13 and stores the generated procedure model in the procedure model storage unit 15. In addition, the search expression reliability calculation unit 16 generates the learning model using the procedure model stored in the procedure model storage unit 15 and stores the generated learning model in the learning model storage unit 17.
The procedure extraction unit 18 acquires the procedure model and the learning model corresponding to the alert class of the occurring alert from the procedure model storage unit 15 and the learning model storage unit 17, respectively, and applies the learning model to each procedure model to calculate the reliability of each procedure model. In addition, the procedure extraction unit 18 calculates the score of each procedure model based on the reliability of each procedure model and the required time, and displays the procedure model with the highest score as the recommended investigation procedure.
Therefore, when the alert occurs, the event investigation assist device 1 may present to the operator an investigation procedure with a higher reliability for acquiring information required for determining whether there is a fault.
In the first embodiment, since the search expression reliability calculation unit 16 generates the learning model based on the number of investigations performed according to the procedure model, the number of investigations by the operator may be reflected in the learning model.
In the first embodiment, since the feedback unit 19 updates the procedure model and the learning model based on the investigation procedure actually performed by the operator, the feedback unit 19 may keep the reliabilities of the procedure model and the learning model high.
In the first embodiment, when the search expression group of an application procedure in the investigation procedure actually performed by the operator differs from the search expression group of the same application procedure in the recommended investigation procedure, the feedback unit 19 reduces the reliability of the corresponding search expression group application procedure in the learning model. Therefore, the feedback unit 19 may keep the reliability of the learning model high.
In the first embodiment, when the time for which a plurality of graphs are displayed overlapping one another is larger than a predetermined threshold value, the application procedure estimation unit 14 estimates that the operator refers to the plurality of graphs at the same time and determines that the application procedures of the corresponding search expressions are the same as each other. Therefore, the application procedure estimation unit 14 may reflect the investigation which the operator performs by referring to the plurality of graphs in the procedure model.
However, although the first embodiment describes a case where all operators are trusted equally, the reliability of an operator differs depending on, for example, the experience of the operator. Therefore, in the second embodiment, an event investigation assist device that reflects the reliability of the operator in the learning model will be described.
The operator reliability calculation unit 20 calculates an operator reliability based on an operation engagement history 3 and transfers the calculated reliability to the search expression reliability calculation unit 26. The operation engagement history 3 is an index representing how much the operator is involved with each component of a server, a storage, a network, or the like of the cloud system.
The update time is a year and a month when the operation engagement history 3 is updated. The operator ID is an identifier which identifies the operator. The physical server is a length of time for which the operator has engaged in tasks regarding the physical server. The physical storage is a length of time for which the operator has engaged in tasks regarding the physical storage. The physical network is a length of time for which the operator has engaged in tasks regarding the physical network.
The virtual server is a length of time for which the operator has engaged in tasks regarding the virtual server. The virtual storage is a length of time for which the operator has engaged in tasks regarding the virtual storage. The virtual network is a length of time for which the operator has engaged in tasks regarding the virtual network. The physical server, the physical storage, the physical network, the virtual server, the virtual storage, and the virtual network are each expressed in units of time.
For example, the operator reliability calculation unit 20 sums the engagement times of the components related to the alert class and sets the sum as the operator reliability.
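The summation can be sketched as follows. The function name, the key names, the engagement times, and in particular the choice of which components relate to an alert class are assumptions made for illustration, since the text here does not specify that mapping.

```python
def operator_reliability(engagement, related_components):
    """Sum the engagement times of the components related to an alert class."""
    return sum(engagement.get(component, 0) for component in related_components)


# Hypothetical engagement history of one operator, in units of time.
engagement_tetsuya = {
    "physical_server": 120,
    "virtual_server": 80,
    "virtual_storage": 300,
}
# Hypothetical: components assumed to relate to an alert in the virtual storage.
related = ["virtual_server", "virtual_storage"]
reliability = operator_reliability(engagement_tetsuya, related)
```

With these values, the operator reliability for the alert class is 380. A component for which the operator has no engagement history contributes 0.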
The search expression reliability calculation unit 26 calculates the reliability for each search expression group application procedure with respect to each procedure model for each alert class based on the reliability which the operator reliability calculation unit 20 calculates with respect to each operator, and generates the learning model by using the reliability for each calculated search expression group application procedure.
As illustrated in
Next, a flow of processing by the event investigation assist device 2 will be described.
The application procedure estimation unit 14 generates the procedure model from the stored alert investigation history (step S42). Further, the operator reliability calculation unit 20 calculates the operator reliability for each alert class (step S43). The order of steps S42 and S43 may be reversed. Moreover, the search expression reliability calculation unit 26 calculates the reliability for each search expression group application procedure of the procedure model based on the operator reliability and calculates the sum of the reliability for each search expression group application procedure to generate the learning model indicating the reliability for each search expression group application procedure (step S44).
As described above, the search expression reliability calculation unit 26 calculates the reliability for each search expression group application procedure of the procedure model based on the operator reliability and calculates the sum of the reliability for each search expression group application procedure to generate the learning model. Therefore, the event investigation assist device 2 may generate the learning model based on the reliability of the operator.
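The difference from the first embodiment can be sketched as follows: each use of a search expression group contributes the reliability of the operator who performed it instead of a count of 1. The function name, the dictionary layout, the operator "operator_b", and all numeric values are assumptions made for illustration.

```python
from collections import defaultdict


def build_weighted_learning_model(procedure_models, operator_reliability):
    """Sum operator reliabilities per (application procedure, group) pair."""
    learned = defaultdict(float)
    for pm in procedure_models:
        # An operator without a known reliability contributes nothing.
        weight = operator_reliability.get(pm["operator"], 0.0)
        for order, group in enumerate(pm["steps"], start=1):
            learned[(order, frozenset(group))] += weight
    return dict(learned)


# Hypothetical operator reliabilities for one alert class.
reliabilities = {"tetsuya": 380.0, "operator_b": 40.0}
# Hypothetical procedure models of the same alert class.
models = [
    {"operator": "tetsuya", "steps": [{"C", "D"}, {"E"}]},
    {"operator": "operator_b", "steps": [{"C"}, {"E"}]},
]
weighted = build_weighted_learning_model(models, reliabilities)
```

In this sample, both operators use search expression E second, so its reliability is the sum 420, while the first step chosen by the more experienced operator carries 380 against 40.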
Next, an example by the event investigation assist device 2 will be described with reference to
The procedure extraction unit 18 applies the learning model to each procedure model to calculate the reliability of each procedure model and calculates the score of each procedure model by dividing the calculated reliability by the required time.
For example, as illustrated in
The procedure extraction unit 18 displays the procedure having the highest score as the recommended investigation procedure. In
When the operator refers to the recommended investigation procedure to input information relating to the actually performed investigation in the event investigation assist device 2, the feedback unit 19 updates the procedure model storage unit 15 and the learning model storage unit 17 based on the input information.
In
As illustrated in
As illustrated in
As described above, in the second embodiment, the operator reliability calculation unit 20 calculates the operator reliability for each alert class by using the operation engagement history 3. The search expression reliability calculation unit 26 calculates the reliability for each search expression group application procedure with respect to each procedure model for each alert class, based on the reliability that the operator reliability calculation unit 20 calculates for each operator, and generates the learning model by using the calculated reliability for each search expression group application procedure. Therefore, the event investigation assist device 2 may display the recommended investigation procedure based on the reliability of the operator.
In the first and second embodiments, the event investigation assist device has been described. However, by implementing the configuration of the event investigation assist device with software, it is possible to obtain an event investigation assist program having the same function. Accordingly, a computer executing the event investigation assist program will be described.
The main memory 51 is a memory that stores a program or an intermediate result during the execution of the program. The CPU 52 is a central processing unit that reads and executes the program from the main memory 51. The CPU 52 includes a chip set having a memory controller.
The LAN interface 53 is an interface which connects the computer 50 to another computer via a LAN. The HDD 54 is a disk device that stores the program or data, and the super IO 55 is an interface that connects an input device such as a mouse or a keyboard. The DVI 56 is an interface that connects a liquid crystal display device, and the ODD 57 is a device that performs reading and writing of a DVD.
The LAN interface 53 is connected to the CPU 52 by PCI express (PCIe), and the HDD 54 and the ODD 57 are connected to the CPU 52 by serial advanced technology attachment (SATA). The super IO 55 is connected to the CPU 52 by low pin count (LPC).
The event investigation assist program executed in the computer 50 is stored in the DVD as an example of a storage medium readable by the computer 50, read from the DVD by the ODD 57, and installed in the computer 50. Alternatively, the event investigation assist program is stored in the database of another computer system connected via the LAN interface 53, read from the database, and installed in the computer 50. The installed event investigation assist program is stored in the HDD 54, read into the main memory 51, and executed by the CPU 52.
In the embodiment, the case where it is determined whether there is a fault in the cloud system has been described, but the present disclosure is not limited thereto and the embodiment may be similarly applied to the case where it is determined whether there is a fault in another system.
In the embodiment, the case where the search expression group is used as investigation contents has been described, but the present disclosure is not limited thereto and the embodiment may be similarly applied even to the case where one or more investigation items are investigated as the investigation contents.
All examples and conditional language recited herein are intended for pedagogical purposes to aid the reader in understanding the invention and the concepts contributed by the inventor to furthering the art, and are to be construed as being without limitation to such specifically recited examples and conditions, nor does the organization of such examples in the specification relate to an illustrating of the superiority and inferiority of the invention. Although the embodiments of the present invention have been described in detail, it should be understood that the various changes, substitutions, and alterations could be made hereto without departing from the spirit and scope of the invention.
Number | Date | Country | Kind |
---|---|---|---|
2017-210989 | Oct 2017 | JP | national |