The present invention relates to an analysis device, an analysis method and a program.
In recent years, microservice architectures have become widespread, in which an application providing a service such as a web or ICT service is divided into components by feature, and the components communicate with each other to operate in a chain. Managing microservices requires not only metric and log monitoring at the resource level but also monitoring at the application level. For example, by aggregating and monitoring event logs generated while an application runs and in-application metrics (such as the number of HTTP requests, the number of transactions, and the waiting time for each request), it is possible to support anomaly detection and root cause analysis in a complicated microservice.
As an example of an application-level monitoring scheme, visualization of the component traces for one request to the application, called tracing, has been proposed. Non Patent Literatures 1 and 2 disclose black-box-based tracing software that acquires operation history data without modifying the application itself. Non Patent Literatures 3 and 4 disclose annotation-based tracing software that acquires operation history data by modifying the application. By visualizing various microservice traces as a series of flows and displaying them to a maintenance engineer or a developer, it is possible to help discover unusual traces and the root causes of anomalies.
Application-level monitoring data keeps accumulating every time an application runs, and thus it is not practicable for a person to check each piece of data in real time.
The inventors have proposed a method of estimating an inter-component dependency and creating a service graph representing dependencies between all components across the service by a Petri net in “Proposal of Service Graph Buildup based on Trace Data of Multiple Services” (IEICE Journal, Vol. 119, No. 438). Accordingly, it is possible to construct the service graph representing the inter-component dependency using the monitoring data.
Abnormal behaviors can be discovered by detecting monitoring data that does not follow the constructed service graph. However, it is impractical to manually check countless pieces of monitoring data one by one to find anomalies.
The present invention is intended to deal with the problems stated above, and an object thereof is to extract abnormal monitoring data.
According to one aspect of the present invention, provided is an analysis device for detecting anomalies in a service that implements specific features by means of a chained operation of multiple components, the analysis device including: an extraction unit configured to extract a processing start event and a processing end event from monitoring data and generate a firing sequence in which the events are arranged in chronological order, the monitoring data including information on a series of processing in the service; and a detection unit configured to determine whether each event arranged in the firing sequence can be fired in a service graph illustrating a dependency between components constituting the service, and detect an anomaly in a case where there is a non-fired event.
According to the present invention, it is possible to extract abnormal monitoring data.
Hereinbelow, the present embodiment will be described with reference to drawings.
Referring to
A monitored service 100 includes a plurality of components and implements specific features by a chained operation of the components. A component is a program that has an interface capable of exchanging requests and responses with other components, and may be implemented in various programming languages.
The service monitoring device 20 is a device for monitoring the monitored service 100 at an application level, and for visualizing traces of the components for one request. The service monitoring device 20 can adopt technologies described in Non Patent Literatures 1 to 4. For example, the service monitoring device 20 records processing in each component of the monitored service 100 as a span element, and visualizes a flow of operations in the monitored service 100 for one request as trace data (hereinafter sometimes also referred to as “monitoring data”). A code for carrying a label is embedded in each component of the monitored service 100 to acquire the span element. The service monitoring device 20 displays the visualized trace data to a maintenance engineer. The maintenance engineer can check application-level behaviors of the monitored service 100 with the visualized trace data.
The monitoring data distribution device 30 receives the monitoring data from the service monitoring device 20, and distributes the monitoring data to the service graph generation device 40 or to the service graph analysis device 10 according to the operation phase of the maintenance control system. More specifically, the monitoring data distribution device 30 distributes the monitoring data to the service graph generation device 40 in a learning phase, and to the service graph analysis device 10 in a detection phase. In the learning phase, the service graph generation device 40 updates the service graph based on the monitoring data. In the detection phase, the service graph analysis device 10 checks the monitoring data against the service graph. A service graph is a graph structure representing dependencies between the components constituting the monitored service 100, and can be used to represent state transitions of flows of operations in the monitored service 100. The monitoring data distribution device 30 switches the distribution destination of the monitoring data based on an instruction from the control device 60.
The service graph generation device 40 receives the monitoring data in the learning phase, estimates inter-component dependencies from the monitoring data, updates the service graph based on the estimated dependencies, and stores the service graph in the service graph retention device 50.
The service graph retention device 50 retains the service graph. The service graph retained by the service graph retention device 50 is displayed to the maintenance engineer, or used by the service graph analysis device 10 to analyze the monitoring data. A normal label is given to the retained service graph in the detection phase and removed from it in the learning phase. The service graph to which the normal label is given corresponds to a normal model in which the graph updates have converged and the graph is finalized.
The developer develops and updates the monitored service 100 in the development environment 110. When the monitored service 100 is updated, the development environment 110 sends an update timing notification to the control device 60.
The control device 60 switches between the learning phase and the detection phase on the basis of update information received from the development environment 110 and the convergence determination of the service graph. Specifically, when receiving a notification indicating that the monitored service 100 has been updated from the development environment 110 during the detection phase, the control device 60 shifts to the learning phase and issues an instruction to switch a distribution destination of the monitoring data to the service graph generation device 40. The control device 60 determines the update convergence of the service graph retained by the service graph retention device 50 during the learning phase, shifts to the detection phase when determining that the service graph update has converged, and issues an instruction to switch the distribution destination of the monitoring data to the service graph analysis device 10.
The service graph analysis device 10 receives the monitoring data in the detection phase, and determines whether a behavior is abnormal by checking whether the state transitions of the monitoring data are executable in the service graph. When an abnormal behavior is detected, the service graph analysis device 10 presents the analysis result to the maintenance engineer.
A configuration of the service graph analysis device 10 will be described with reference to
The extraction unit 11 extracts all the processing start and processing end events from the monitoring data, and sorts the extracted events in chronological order to create a firing sequence to be checked.
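As a minimal illustrative sketch (the span field names `component`, `start_time`, and `end_time` and the event labels are assumptions for illustration, not taken from the embodiment), the firing-sequence creation by the extraction unit can be written as follows:

```python
def build_firing_sequence(spans):
    """Extract processing-start and processing-end events from span
    elements and sort them chronologically into a firing sequence."""
    events = []
    for span in spans:
        # Each span element records the component name and its
        # processing start/end times (assumed field names).
        events.append((span["start_time"], (span["component"], "start")))
        events.append((span["end_time"], (span["component"], "end")))
    events.sort(key=lambda e: e[0])          # chronological order
    return [event for _, event in events]

trace = [
    {"component": "A", "start_time": 0.0, "end_time": 3.0},
    {"component": "B", "start_time": 1.0, "end_time": 2.0},
]
print(build_firing_sequence(trace))
# [('A', 'start'), ('B', 'start'), ('B', 'end'), ('A', 'end')]
```

In this example, component B runs nested inside component A, so A's start event precedes B's events and A's end event comes last.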
When a suspicious event, in which the anomaly was detected, is received from the detection unit 12, the extraction unit 11 lists the resources used by the suspicious event from the monitoring data as suspicious resources.
The detection unit 12 checks whether each event in the firing sequence created from the monitoring data can be fired in the service graph retained by the service graph retention device 50, determines that an abnormal behavior has occurred in a case where the firing sequence contains a non-fired event, and extracts the suspicious event leading to a failure cause state.
When the detection unit 12 detects the abnormal behavior, the display unit 13 presents the analysis result obtained by visualizing the suspicious event and the suspicious resources to the maintenance engineer.
The service graph generated from the trace data (monitoring data) will be described below. The service graph analysis device 10 checks the firing sequence generated from the monitoring data using the service graph.
The trace data is a set of span elements constituting a series of processing from a request to the monitored service 100 to its response. For example, one piece of trace data is obtained for the series of processing from a single end-user request to the monitored service 100 to the corresponding response. A span element is data in which time data of the processing of each component and a parent-progeny relationship are recorded.
Referring to
The service graph generation device 40 estimates inter-component dependencies from the time information of each span element of the trace data, and represents a component-level service graph of the entire monitored service 100 by a Petri net on the basis of the estimated dependencies. A Petri net is a bipartite directed graph having two types of nodes, places and transitions, connected by arcs. A variable called a token is held by a place. The state of the entire Petri net, represented by the number of tokens held by each place, is referred to as a marking. In particular, the marking in the initial state of the Petri net is referred to as the initial marking. When a transition fires, it removes a token from each of its input places and adds a token to each of its output places. The firing of a transition causes the Petri net to move from the current marking to the next marking.
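The Petri net semantics described above can be sketched as follows (a minimal illustration; the class layout and place/transition names are assumptions, not part of the embodiment):

```python
class PetriNet:
    """Minimal Petri net: a marking maps each place to its token count;
    a transition fires only when every input place holds a token."""

    def __init__(self):
        self.transitions = {}        # name -> (input places, output places)
        self.marking = {}            # place -> token count

    def add_transition(self, name, inputs, outputs):
        self.transitions[name] = (inputs, outputs)
        for p in inputs + outputs:
            self.marking.setdefault(p, 0)

    def can_fire(self, name):
        inputs, _ = self.transitions[name]
        return all(self.marking[p] > 0 for p in inputs)

    def fire(self, name):
        if not self.can_fire(name):
            return False
        inputs, outputs = self.transitions[name]
        for p in inputs:
            self.marking[p] -= 1     # consume one token per input place
        for p in outputs:
            self.marking[p] += 1     # produce one token per output place
        return True

net = PetriNet()
net.add_transition("t", ["p1"], ["p2"])
net.marking["p1"] = 1                # initial marking
assert net.fire("t") and net.marking == {"p1": 0, "p2": 1}
```

Firing `t` moves the marking from one token in `p1` to one token in `p2`, which is the state transition described above.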
In the present embodiment, a Petri net of one component is defined as illustrated in
Specifically, three types of states taken by the component include “unprocessed”, “in-process”, and “processed”, which are associated with places. A state transition of the component is represented by moving a token by firing (processing start or processing end) of the inter-place transition. The token is a black circle arranged at the unprocessed place in
The inter-component dependency can be represented by adding an arc and a place to the Petri net of the components illustrated in
A parent-progeny relationship between components A and B can be represented as illustrated in
An order relation between the components A and B can be represented as illustrated in
An exclusive relationship between the components A and B can be represented as illustrated in
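As an illustrative sketch of the parent-progeny pattern above (the exact arc layout and the place names `A_called_B` and `B_done` are assumptions for illustration), a dependency can be encoded as an extra place shared between the two component subgraphs: parent A calls child B, so B may start only after A starts, and A may end only after B ends.

```python
def component(name):
    # Three places per component: unprocessed --start--> in-process --end--> processed.
    return {
        (name, "start"): ([f"{name}_unprocessed"], [f"{name}_in_process"]),
        (name, "end"):   ([f"{name}_in_process"], [f"{name}_processed"]),
    }

transitions = {**component("A"), **component("B")}

def add_dependency(transitions, producer, consumer, place):
    """Firing `producer` puts a token in `place`; `consumer` requires it."""
    ins, outs = transitions[producer]
    transitions[producer] = (ins, outs + [place])
    ins, outs = transitions[consumer]
    transitions[consumer] = (ins + [place], outs)

add_dependency(transitions, ("A", "start"), ("B", "start"), "A_called_B")
add_dependency(transitions, ("B", "end"), ("A", "end"), "B_done")

# B_start now also requires the token produced by A_start:
assert transitions[("B", "start")][0] == ["B_unprocessed", "A_called_B"]
```

An order or exclusive relationship could be encoded in the same style by choosing different producer/consumer pairs for the shared place.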
The service graph analysis device 10 extracts processing start and processing end events from the trace data to create a firing sequence, sets the initial marking of the service graph, and checks whether the events in the firing sequence can be fired sequentially. If any event cannot be fired, the behavior is determined to be abnormal.
A flow of processing of the maintenance control system will be described with reference to a sequence diagram shown in
When receiving the monitoring data from the monitoring data distribution device 30 in step S1, the extraction unit 11 extracts processing start and processing end events from the monitoring data, creates a firing sequence sorted in chronological order, and transmits the firing sequence to the detection unit 12 in step S2.
The detection unit 12 acquires a service graph from the service graph retention device 50 in step S3, and the detection unit detects an anomaly by sequentially shifting the service graph from an initial marking according to the firing sequence in step S4.
The detection unit 12 transmits the check result of the firing sequence to the extraction unit 11 in step S5. In a case where the anomaly is detected, the detection unit 12 notifies the extraction unit 11 of a suspicious event.
In a case where the detection unit 12 detects the anomaly, in step S6, the extraction unit 11 extracts suspicious resources corresponding to the suspicious event from the monitoring data, and transmits anomaly occurrence information including the suspicious event and the suspicious resources to the display unit 13.
The display unit 13 presents the analysis result including the suspicious event and the suspicious resources to a maintenance engineer in step S7.
In a case where the detection unit 12 detects no anomaly, the processing of steps S6 and S7 is not performed.
A processing flow of the service graph analysis device 10 will be described below with reference to flowcharts shown in
When the extraction unit 11 receives the monitoring data in step S11 of the flowchart shown in
In step S13, the detection unit 12 checks the type of a root span and sets the initial marking of the service graph. The root span is a span element at which processing is initiated first. The initial marking is, for example, a state in which one token is placed at an unprocessed place in a subgraph corresponding to the root span.
The detection unit 12 processes all the events in the firing sequence in chronological order; in step S14, for each event, it searches the service graph for the corresponding transition and checks whether the transition can fire. The event can be fired when all the input places of its transition hold tokens.
In a case where the processed event can be fired, the detection unit 12 updates the marking of the service graph in step S15.
If all the events in the firing sequence can be fired, the detection unit 12 determines that only normal operations are observed in the monitoring data, and notifies the extraction unit 11 accordingly in step S16.
In a case where the firing sequence includes a non-fired event, the detection unit 12 determines that the monitoring data contains an abnormal operation and advances the processing to the flowchart shown in
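The detection loop of steps S13 to S16 can be sketched as follows (a hedged illustration; the transition table and event tuples are assumptions, and a real service graph would hold many interdependent component subgraphs):

```python
def check_sequence(transitions, marking, firing_sequence):
    """Return (True, None) if every event fires, otherwise
    (False, event) with the first non-fired (suspicious) event."""
    for event in firing_sequence:
        inputs, outputs = transitions[event]
        if not all(marking.get(p, 0) > 0 for p in inputs):
            return False, event          # non-fired event -> anomaly
        for p in inputs:
            marking[p] -= 1              # consume input tokens
        for p in outputs:
            marking[p] = marking.get(p, 0) + 1   # produce output tokens
    return True, None

# One component "A": unprocessed --start--> in-process --end--> processed.
transitions = {
    ("A", "start"): (["A_unprocessed"], ["A_in_process"]),
    ("A", "end"):   (["A_in_process"], ["A_processed"]),
}

ok, suspicious = check_sequence(transitions, {"A_unprocessed": 1},
                                [("A", "start"), ("A", "end")])
assert ok and suspicious is None

# An "end" without a preceding "start" cannot fire and is flagged.
ok, suspicious = check_sequence(transitions, {"A_unprocessed": 1},
                                [("A", "end")])
assert not ok and suspicious == ("A", "end")
```

The second call illustrates the anomalous case: the marking has no token in `A_in_process`, so the end event cannot fire and is returned as the suspicious event.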
In step S21 of the flowchart shown in
In step S23, the extraction unit 11 refers to the monitoring data corresponding to the suspicious events and extracts suspicious resources. The monitoring data may include resource information such as the IP address of the virtual machine executing the processing. The extraction unit 11 lists the union of the resources used by the suspicious events as the suspicious resources. In a simple case, the cause event and the cause resource can be identified. However, in a case where there are multiple waiting processes and many suspicious events that can be causes, the cause resources may not be identifiable.
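The union of suspicious resources can be sketched as follows (an illustrative assumption: each span element is taken to carry a `component` name and a `resources` list such as virtual-machine IP addresses):

```python
def suspicious_resources(spans, suspicious_components):
    """Union of resource identifiers (e.g. VM IP addresses) used by
    the components involved in suspicious events."""
    resources = set()
    for span in spans:
        if span["component"] in suspicious_components:
            resources |= set(span["resources"])
    return resources

spans = [
    {"component": "A", "resources": ["10.0.0.1"]},
    {"component": "B", "resources": ["10.0.0.2", "10.0.0.1"]},
]
print(sorted(suspicious_resources(spans, {"A", "B"})))
# ['10.0.0.1', '10.0.0.2']
```

Because a set union is taken, a resource shared by several suspicious components is listed only once.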
The display unit 13 visualizes and presents the suspicious event and the suspicious resources to a maintenance engineer in step S24. The display unit 13 may visualize and present the monitoring data determined to be abnormal to the maintenance engineer.
As described above, the service graph analysis device 10 according to the present embodiment includes the extraction unit 11 configured to extract the processing start event and the processing end event from the monitoring data and generate the firing sequence in which the events are arranged in chronological order, the monitoring data including information on a series of processing in the monitored service 100; and the detection unit 12 configured to determine whether each event arranged in the firing sequence can be fired in the service graph illustrating the dependencies between the components constituting the monitored service 100, and detect an anomaly in a case where there is a non-fired event. In the service graph, the states before, during, and after processing of a component are represented as places in a Petri net, the processing start and processing end of a component are represented as transitions in the Petri net, and inter-component dependencies are represented by arranging new nodes and arcs between the Petri nets of the components. When a firing sequence contains a non-fired event, the detection unit 12 detects, as the component in which the anomaly has occurred, the component corresponding to a subgraph including a place in which a token is arranged. Accordingly, the abnormal monitoring data can be extracted using the service graph.
As the service graph analysis device 10 described above, a general-purpose computer system can be used, for example, including a central processing unit (CPU) 901, a memory 902, a storage 903, a communication device 904, an input device 905, and an output device 906 as illustrated in
Filing Document | Filing Date | Country | Kind |
---|---|---|---|
PCT/JP2021/000481 | 1/8/2021 | WO |