This application is based upon and claims the benefit of priority of the prior Japanese Patent Application No. 2016-196731, filed on Oct. 4, 2016, the entire contents of which are incorporated herein by reference.
The present invention relates to an incident analysis program, an incident analysis method, an information processing device, a service identification program, a service identification method, and a service identification device.
A service system is constructed by combining a server (or a physical machine), storage, an operating system (an OS), and an application program. A conventional service system is constructed using in-house hardware resources, and therefore, when an incident occurs in the service system, the cause of the incident is identified by analyzing all messages and error content generated in the service system.
WO 2014/033894 and WO 2014/020908 describe incident detection.
One aspect of the present embodiment is a non-transitory computer-readable storage medium that stores therein an incident analysis program for causing a computer to execute a process comprising:
generating a new incident-related request database by extracting, from a request management database, new incident-related request data
that are issued at an occurrence time of a new incident that occurred in one of a plurality of first service systems and
that are issued from an issuing source first service system to an issuing destination second service system,
the issuing source first service system and the issuing destination second service system being related to a first service system serving as an occurrence source of the new incident;
extracting a past incident-related request database of a past incident having a transition tendency of response times similar to that of the new incident,
the transition tendency of response times being calculated for the new incident-related request data in the new incident-related request database and for past incident-related request data in the past incident-related request database, both of which have the same issuing source and issuing destination; and
identifying and outputting information indicating a second service system estimated to be responsible for the past incident in the extracted past incident-related request database, as a second service system estimated to be responsible for the new incident.
According to the first aspect described above, it is possible to estimate a service system that is responsible for an incident occurring in the service systems constructed in a server center of a different cloud vendor.
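By way of a non-limiting illustration only, the process of this aspect may be sketched as follows; the record types, the time window, and the similarity function are hypothetical placeholders for the databases and programs described in the embodiments.

```python
# Non-limiting sketch of the described process; all names are hypothetical.
from dataclasses import dataclass, field
from typing import Callable, List

@dataclass
class RequestData:
    time: float           # issuing time of the dummy request
    source: str           # issuing source first service system
    dest: str             # issuing destination second service system
    response_time: float  # measured response time

@dataclass
class Incident:
    occurred_at: float    # occurrence time of the incident
    origin: str           # first service system serving as the occurrence source
    requests: List[RequestData] = field(default_factory=list)
    responsible: str = "" # second service system estimated to be responsible

def generate_incident_related_db(request_db: List[RequestData],
                                 incident: Incident, related: set,
                                 window: float = 1800.0) -> List[RequestData]:
    """Step 1: extract the request data issued around the occurrence time
    of the new incident between the related service systems."""
    return [r for r in request_db
            if abs(r.time - incident.occurred_at) <= window
            and r.source in related and r.dest in related]

def identify_responsible(new: Incident, past: List[Incident],
                         similarity: Callable[[Incident, Incident], float]) -> str:
    """Steps 2 and 3: extract the past incident whose transition tendency of
    response times is most similar, and output its responsible service system."""
    best = max(past, key=lambda p: similarity(new, p))
    return best.responsible
```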
The object and advantages of the invention will be realized and attained by means of the elements and combinations particularly pointed out in the claims.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory and are not restrictive of the invention.
As cloud computing services (referred to hereafter as cloud services) become more widespread, service systems are being constructed using hardware, an OS, middleware, and so on provided by a server center of a cloud service vendor (referred to hereafter as a cloud vendor) that provides a cloud service. When a service system is constructed by connecting a plurality of service systems constructed in the server centers of a plurality of cloud vendors to each other by a network, it is particularly difficult to analyze the cause of an incident.
For example, when an incident occurs in a service system in a first server center of a first cloud vendor, an operator of the first cloud vendor can ascertain error content in the first server center, but it is difficult for the operator to ascertain error content in a second server center of a second cloud vendor that is different from the first cloud vendor. As a result, it is impossible to determine which of the service systems in the second server center is responsible for the incident.
For example, the service system S_A on the side of the server center of the cloud service CS_1 provides a web service of an electronic commerce site, the service system S_B provides a web service of a customer management site of the electronic commerce site, and the service system S_C provides a web service of a business management site of the electronic commerce site. Meanwhile, the service system S_1 on the side of the server center of the cloud service CS_2 provides a database service for the electronic commerce site, the service system S_2 provides a load balancer service, and the service system S_3 provides a monitoring service for monitoring the service systems S_B, S_C.
In this service system, a user terminal device 34 of the service system accesses the first service systems S_A, S_B, S_C via the network NW and uses the services provided respectively thereby. In response to access from the user terminal device 34 of the service system, the first service systems S_A, S_B, S_C respectively issue requests as appropriate to the database service S_1, the load balancer service S_2, the monitoring service S_3, and so on serving as the second service systems, and execute processing needed by the respective services on the basis of responses to the requests. Therefore, when the responses to the requests issued by the first service systems S_A, S_B, S_C are delayed, the responses from the first service systems S_A, S_B, S_C to the user terminal device 34 may also be delayed.
Meanwhile, a user terminal 32 of the first cloud service accesses the cloud portal site 12 via the network NW, and initiates construction and activation of the first service systems S_A, S_B, S_C by asking the cloud service management device 11 to generate and activate the virtual machines VM_0 to VM_5. Further, an operator terminal 30 of the first cloud service CS_1 accesses the cloud service management device 11 via the network NW in order to perform operation management on the first service systems S_A, S_B, S_C. Operation management includes analyzing incidents occurring in the service systems S_A, S_B, S_C and so on.
The centers of two cloud services provided by different vendors typically do not reveal error information generated in the centers of the respective cloud services to each other so that the error information remains confidential. Therefore, when the operator of the first cloud service CS_1 analyzes an incident occurring in the first service systems S_A, S_B, S_C, the operator can ascertain all of the error information generated by the first service systems S_A, S_B, S_C in the server center of the first cloud service CS_1, but is unable to ascertain error information generated by the second service systems S_1, S_2, S_3 in the server center of the second cloud service CS_2. As a result, identification of the service system that is responsible for the incident either involves a large number of steps, or is difficult or impossible.
The auxiliary storage device group stores a cloud service management program 20, an incident analysis program 22, a request management database 24, an incident-related request database 25, and an incident database 26. The cloud service management program 20 and the incident analysis program 22 are expanded in the main memory 15 and executed by the processor 14.
The processor 14 executes the cloud service management program 20 to cause the hypervisors HV_1 and HV_2 of the physical machines PM_0 to PM_2 to activate the virtual machines VM_0 to VM_5 constituting the respective service systems in response to a request for activating the first service systems S_A to S_C from the user terminal device 32 of the cloud service, for example. Further, in response to a request for monitoring the first service systems S_A to S_C from the operator terminal device 30 of the cloud service, the operator terminal device 30 is enabled to monitor error messages from the respective service systems.
The processor 14 executes the incident analysis program 22 to add data relating to a newly occurring incident to the incident database 26 in response to an incident report from the user terminal device 34 of the service system, for example. Moreover, the processor 14 issues a plurality of dummy requests, each having a first service system as an issuing source and a second service system as an issuing destination, at predetermined time intervals, and adds request data, including the response times and response messages to the dummy requests, to the request management database 24. Furthermore, the processor 14 generates the incident-related request database 25 in response to an incident cause analysis request from the operator terminal device 30 of the cloud service, and identifies the second service system estimated to be responsible for the incident on the basis of the behavior (the response times and messages) of the requests relating to the new incident.
The incident analysis program 22 includes a request issuing program 221. The processor executes the request issuing program 221 to issue dummy requests having the first service systems S_A to S_C as issuing sources and the second service systems S_1 to S_3 as issuing destinations successively at predetermined time intervals. The request issuing program 221 measures the response times to the successively issued dummy requests, and obtains corresponding response messages.
The incident analysis program 22 includes a request data collection program 222. The processor executes the request data collection program 222 to collect request data associating each dummy request with the issuing source service system (S_S) and issuing destination service system (D_S) of the request, the request issuing time (time), and the response time (RT), and add the collected request data to the request management database 24. A response message (MES) to the request may be included in the request data.
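As a non-limiting sketch, a request data record of this kind might take the following form; the field names follow the abbreviations used herein (time, S_S, D_S, RT, MES), but the types and structure are assumptions.

```python
# One possible shape for a request data record (field names follow the
# abbreviations in this description; the structure itself is an assumption).
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class RequestRecord:
    time: str                  # request issuing (measurement) time, e.g. "10:00"
    s_s: str                   # issuing source service system (S_S)
    d_s: str                   # issuing destination service system (D_S)
    rt: float                  # response time in seconds (RT)
    mes: Optional[str] = None  # response message (MES), optionally included

# The request management database 24 can then be modeled as a growing list.
request_management_db: List[RequestRecord] = []

def collect(record: RequestRecord) -> None:
    """Add collected request data to the request management database."""
    request_management_db.append(record)
```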
The processor executes the incident cause estimation program 223 to generate the incident-related request database 25 including the request data relating to the new incident by extracting the request data generated at the occurrence time of the new incident from the request management database 24. The extracted request data may affect the operation or running of the service system that is the occurrence source of the new incident. The request data that may affect the operation or running will be described below. Accordingly, the incident-related request database 25 includes a group of incident-related requests generated at the occurrence time of the new incident, and a group of incident-related requests generated at the occurrence times of past incidents.
Further, when a new incident occurs, the processor executes the incident cause estimation program to add, to the incident database 26, incident data in which the occurrence time (time) of the incident, the first service system (S_S) serving as the occurrence source of the incident, and a phenomenon (PH) caused by the incident are associated with each other. The incident-related request database 25 is associated with each of the incidents in the incident database 26. Further, with respect to a past incident, the incident database 26 includes information indicating the service system (CoI) estimated to be responsible for the incident, and with respect to the new incident, information indicating the responsible service system estimated by the incident cause estimation program is added to the incident database 26.
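Similarly, an entry in the incident database 26 might be sketched as follows; the field names follow the abbreviations time, S_S, PH, and CoI, while the structure itself is an assumption.

```python
# Hypothetical shape for an entry in the incident database 26.
from dataclasses import dataclass
from typing import Optional

@dataclass
class IncidentRecord:
    time: str                  # occurrence time of the incident
    s_s: str                   # first service system serving as the occurrence source
    ph: str                    # phenomenon caused by the incident (PH)
    coi: Optional[str] = None  # responsible service system (CoI); present for
                               # past incidents, added later for the new incident
```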
[Request Data Collection]
First, the processor of the management server 10 executes the request issuing program 221 and the request data collection program 222 at all times to issue the dummy requests at predetermined time intervals (S1) and output logs including the response times and response messages to the dummy requests (S2). Then, the processor collects request data which associate each dummy request with the first service system and second service system serving respectively as the issuing source and issuing destination thereof, the issuing time thereof, and the response time and response message thereto, and adds the request data to the request management database 24 (S3).
Further, in the example illustrated in the drawings, dummy requests are issued from the first service systems S_A, S_B, S_C to the second service systems to which they respectively issue requests, resulting in six combinations of issuing source service system and issuing destination service system.
The incident analysis program 22 of the incident analysis device 13 provided in the first cloud service includes the request issuing program 221 and the request data collection program 222. When the request issuing program 221 is executed, six dummy requests DR as described above are issued to the second service systems at certain time intervals of five minutes or the like, whereupon logs of responses (response messages and response times) to the respective requests are output. Further, when the request data collection program 222 is executed, request data in which the times (issuing times or measurement times), issuing source service systems, issuing destination service systems, response messages, and response times of the six dummy requests are associated with each other are collected and added to the request management database 24.
Hence, the request management database accumulates the request data relating to the dummy requests in accordance with combinations of the issuing source service system and the issuing destination service system. Further, by issuing the dummy requests periodically, through execution of the request issuing program 221 and the request data collection program 222, separately from the requests normally issued by the user systems of the cloud services, the conditions of the service systems in the server center of the second cloud service can be gathered from the information in the responses to the dummy requests while minimizing the effect on the operation or running of the user systems.
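A minimal sketch of this periodic issuance is given below; the issue_dummy_request helper and the six issuing source/destination pairs are assumptions for illustration, since the description does not enumerate the combinations.

```python
# Sketch of the periodic dummy-request issuance and collection (S1 to S3).
# issue_dummy_request is a hypothetical stand-in for the actual call, and
# the six source/destination pairs are assumed for illustration.
import time

PAIRS = [("S_A", "S_1"), ("S_A", "S_2"),
         ("S_B", "S_1"), ("S_B", "S_2"),
         ("S_B", "S_3"), ("S_C", "S_2")]
INTERVAL_SEC = 5 * 60  # e.g. five-minute intervals

def issue_dummy_request(source: str, dest: str) -> str:
    """Placeholder: issue one dummy request from source to dest and
    return the response message."""
    return "OK"

def run_once(db: list) -> None:
    for source, dest in PAIRS:
        start = time.monotonic()
        message = issue_dummy_request(source, dest)
        rt = time.monotonic() - start  # measured response time (S2)
        db.append({"time": time.time(), "s_s": source,
                   "d_s": dest, "rt": rt, "mes": message})  # (S3)

# Executed at all times, e.g.:
#   while True:
#       run_once(request_management_db)
#       time.sleep(INTERVAL_SEC)
```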
[Incident Data Collection]
Returning to the flowchart, when the occurrence of a new incident is reported by the user terminal device 34 of the service system (S4), the processor adds data relating to the new incident to the incident database 26 (S5).
[Incident Cause Estimation]
Meanwhile, a list of past incidents and the newly occurring incident is displayed on the operator terminal device 30 of the cloud service (S6). When the operator specifies the newly occurring incident and issues an analysis request in relation thereto on the operator terminal device (S7), the processor executes the incident cause estimation program to implement the following processing. The following processing does not have to be implemented as soon as a new incident occurs, and may instead be implemented at a predetermined timing following the occurrence of the new incident. Note, however, that when the processing is implemented immediately after the occurrence of a new incident, the result is useful for estimating the cause of any incident occurring subsequently, and therefore the processing is preferably implemented immediately after occurrence.
[Extraction of Incident-Related Request Data]
First, upon execution of the incident cause estimation program, the processor generates the incident-related request database 25 including the request data generated at the new incident by extracting, from the request management database 24, the request data that is generated at the occurrence time (an occurrence time block) of the new incident and is related to the first service system serving as the occurrence source of the new incident (S8).
First, the processor extracts, from the request management database 24, the IaaS side service systems serving as issuing destinations of the requests issued by the PaaS side service system (S_A) serving as the occurrence source of the new incident (S8_1). In the example illustrated in the drawings, the IaaS side service systems S_1 and S_2 are extracted.
Furthermore, the processor extracts the PaaS side service systems relating to the extracted IaaS side service systems (S_1 and S_2) from the request management database (S8_2). More specifically, the issuing source service systems of the requests having the extracted service systems S_1 and S_2 as issuing destinations are extracted. In the example illustrated in the drawings, the PaaS side service systems S_A and S_B are extracted.
The processor then generates a new incident-related request database 25 by extracting the request data that relates to the PaaS side service (S_A) serving as the occurrence source of the incident and the extracted PaaS side services (S_A, S_B) and IaaS side services (S_1, S_2) from the request management database 24 (S8_3).
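Steps S8_1 to S8_3 may be sketched as follows, assuming request records shaped as simple dictionaries (as in the earlier sketches) and a caller-supplied predicate for the incident time block:

```python
# Sketch of steps S8_1 to S8_3, assuming request records shaped as
# dictionaries with the keys used in the earlier sketches.
from typing import Callable, Dict, List

def build_new_incident_related_db(request_db: List[Dict], origin: str,
                                  in_time_block: Callable[[str], bool]) -> List[Dict]:
    # S8_1: IaaS side services to which the occurrence source issues requests.
    iaas = {r["d_s"] for r in request_db if r["s_s"] == origin}
    # S8_2: PaaS side services that issue requests to those IaaS side services.
    paas = {r["s_s"] for r in request_db if r["d_s"] in iaas}
    # S8_3: request data in the incident time block between the related services.
    return [r for r in request_db
            if in_time_block(r["time"])
            and r["s_s"] in paas and r["d_s"] in iaas]
```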
To describe the extracted request data, in the illustrated example, the PaaS side service S_A issues the request R_A1 to the IaaS side service S_1 and the request R_A2 to the IaaS side service S_2, while the PaaS side service S_B issues the request R_B1 to the IaaS side service S_1 and the request R_B2 to the IaaS side service S_2.
When a problem occurs in the responses to the requests R_A1 and R_B1 but no problems occur in the responses to the requests R_A2 and R_B2, the IaaS side service system S_1 may be estimated as the cause. Further, when a problem occurs in the responses to the requests R_A1 and R_A2 but no problems occur in the responses to the requests R_B1 and R_B2, the PaaS side service system S_A may be estimated as the cause. Hence, the processor generates the new incident-related request database 25 by extracting and adding the request data of the dummy requests needed to pinpoint the service system estimated to be responsible for the new incident from the request management database 24 in accordance with the time block of the new incident.
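The pinpointing reasoning of the preceding paragraph may be sketched as a simple decision over which source/destination pairs exhibit problems (a non-limiting illustration; the helper name and rules are assumptions):

```python
# Non-limiting sketch of the pinpointing reasoning. `problems` maps each
# (issuing source, issuing destination) pair to whether its responses
# exhibit a problem.
from typing import Dict, Optional, Tuple

def estimate_cause(problems: Dict[Tuple[str, str], bool]) -> Optional[str]:
    bad_sources = {s for (s, d), bad in problems.items() if bad}
    bad_dests = {d for (s, d), bad in problems.items() if bad}
    ok_sources = {s for (s, d), bad in problems.items() if not bad}
    ok_dests = {d for (s, d), bad in problems.items() if not bad}
    # Every problem points at one destination, and the sources involved are
    # otherwise healthy -> the IaaS side destination is suspected.
    if len(bad_dests) == 1 and not (bad_sources - ok_sources):
        return bad_dests.pop()
    # Every problem originates from one source, and the destinations involved
    # are otherwise healthy -> the PaaS side source is suspected.
    if len(bad_sources) == 1 and not (bad_dests - ok_dests):
        return bad_sources.pop()
    return None

# First case above: problems on R_A1 (S_A->S_1) and R_B1 (S_B->S_1) only.
print(estimate_cause({("S_A", "S_1"): True, ("S_B", "S_1"): True,
                      ("S_A", "S_2"): False, ("S_B", "S_2"): False}))  # S_1
```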
The new incident-related request database 25 generated in this manner thus includes the request data of the dummy requests R_A1, R_A2, R_B1, and R_B2 extracted in accordance with the time block of the new incident.
[Request Data Normality Determination]
Returning to the flowchart, the processor determines that a request in the new incident-related request database is abnormal when the response time thereof exceeds a threshold set in relation to an average value of the response times, determines that the request is normal when the response time thereof is within the threshold, and adds the determination results to the database (S9).
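A minimal sketch of this determination is given below; the description states only that the threshold is set in relation to the average value, so the three-sigma policy shown here is an assumption.

```python
# Sketch of the normal/abnormal determination (S9). The description sets a
# threshold in relation to the average value; the three-sigma rule here is
# an assumed concrete policy.
from statistics import mean, pstdev
from typing import Dict, List

def add_determinations(records: List[Dict], baseline_rts: List[float],
                       n_sigma: float = 3.0) -> None:
    threshold = mean(baseline_rts) + n_sigma * pstdev(baseline_rts)
    for r in records:
        r["normal"] = r["rt"] <= threshold  # determination result added to DB
```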
[Calculation of Response Variation Rate]
Returning to the flowchart, the processor first detects, from the new incident-related request database, pairs of adjacently issued requests having the same issuing source and the same issuing destination (S10_1).
The processor then calculates the variation rate of the response times of the detected pair of requests (S10_2). The variation rate of the response times is determined by dividing a difference between the respective response times of the pair of requests by the issued time difference therebetween. In the case of two response times indicated by circles in the drawing, namely a response time of 2 seconds measured at 10:00 and a response time of 10 seconds measured at 10:05, the variation rate is calculated as follows.
Variation rate = (10 − 2) s ÷ (10:05 − 10:00) = 8 s ÷ 5 min = 1.6 s/min.
Accordingly, the calculation result 1.6 is recorded in the response variation rate column of the request data obtained after the variation (S10_3). The values in the response variation rate column of the new incident-related request database are calculated in a similar manner for each detected pair of adjacently issued requests.
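The variation rate calculation may be sketched as follows; the record shape is an assumption, and the printed value reproduces the worked example above.

```python
# Sketch of the variation rate calculation (S10_2, S10_3). Issuing times
# are converted to minutes so the result is in seconds per minute.
from typing import Dict, List

def minutes(hhmm: str) -> int:
    h, m = hhmm.split(":")
    return int(h) * 60 + int(m)

def add_variation_rates(pair_series: List[Dict]) -> None:
    """pair_series holds adjacently issued requests with the same issuing
    source and issuing destination, ordered by issuing time."""
    for earlier, later in zip(pair_series, pair_series[1:]):
        dt = minutes(later["time"]) - minutes(earlier["time"])
        later["variation_rate"] = (later["rt"] - earlier["rt"]) / dt

series = [{"time": "10:00", "rt": 2.0}, {"time": "10:05", "rt": 10.0}]
add_variation_rates(series)
print(series[1]["variation_rate"])  # 1.6, as in the worked example above
```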
[Detection of Similar Past Incidents]
Returning to the flowchart, the processor detects a past incident that is similar to the new incident. First, past incidents having the same occurrence source service system and the same phenomenon as the new incident are extracted from the incident database 26 as candidates.
Next, correlations between the response variation rates (response time variation rates) of the pairs of requests of the new incident and the past incidents are calculated using a correlation coefficient such as that illustrated below, for example.
Correlation coefficient = [{Σ(F(k)−F′)(G(k)−G′)}/n] ÷ [√{Σ(F(k)−F′)²/n}·√{Σ(G(k)−G′)²/n}]
Here, the two roots (√) in the divisor are respectively the square roots of {Σ(F(k)−F′)²/n} and {Σ(G(k)−G′)²/n}. Further, n denotes the number of samples, Σ denotes accumulation over the n samples, F(k) is the response variation rate waveform of the new incident, G(k) is the response variation rate waveform of the past incident, and F′ and G′ denote the respective average values thereof.
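A direct transcription of the above formula into code might read as follows (a sketch; it assumes waveforms of equal length that are not constant, so that the divisor is nonzero):

```python
# Direct transcription of the correlation coefficient formula above.
# Assumes non-constant waveforms of equal length n.
import math
from typing import Sequence

def correlation_coefficient(f: Sequence[float], g: Sequence[float]) -> float:
    n = len(f)
    f_avg = sum(f) / n   # F'
    g_avg = sum(g) / n   # G'
    covariance = sum((f[k] - f_avg) * (g[k] - g_avg) for k in range(n)) / n
    f_dev = math.sqrt(sum((f[k] - f_avg) ** 2 for k in range(n)) / n)
    g_dev = math.sqrt(sum((g[k] - g_avg) ** 2 for k in range(n)) / n)
    return covariance / (f_dev * g_dev)

# Identical variation-rate waveforms yield a coefficient of 1.0:
print(correlation_coefficient([0.0, 0.1, 1.6, 0.2], [0.0, 0.1, 1.6, 0.2]))
```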
More specifically, the response variation rates of the adjacently issued pairs of requests having the issuing source S_B and the issuing destination S_1 in the new incident-related request database are used as the response variation rate waveform F(k) of the new incident, the response variation rates of the requests having the same issuing source and issuing destination in the past incident-related request database are used as the response variation rate waveform G(k) of the past incident, and the correlation coefficient between the two waveforms is calculated.
The above correlation coefficient is calculated in relation to values at respective sample points on two waveforms in order to determine a correlation between the two waveforms. Typically, a correlation coefficient between 0.4 and 0.7 is considered to indicate a close correlation, and a correlation coefficient between 0.7 and 1.0 is considered to indicate a very close correlation.
Taking an interval between a final sample point SPL1 at which a normal determination is made and a first sample point SPL2 at which an abnormal determination is made in determination step S9, described above, as the estimated occurrence time of the incident, the correlation between the two waveforms is determined from the above correlation coefficient formula in relation to variation rates at a plurality of sample points before and after the estimated occurrence time, which are associated with each other using the estimated occurrence time as a reference.
For example, when the response time variation rates at the sample points SPL1 and SPL2, which are considered to have the greatest effect on the similarity between the incidents, are identical or similar, the waveforms of the two incidents are determined to be closely correlated. By determining whether or not the response time variation rate waveforms during the normal period at or before the sample point SPL1 and the response time variation rate waveforms during the abnormal period at or after the sample point SPL2 are identical or similar, the precision with which a similar past incident is extracted can be improved.
By determining whether or not the correlation value is high in this manner, a determination can be made as to whether or not the temporal waveforms (patterns) of the response time variation rates of the corresponding requests of two incidents are similar using the incident occurrence time as a reference.
The response times and response time variation rates in the new incident-related request DB are then compared with those in the past incident-related request DB for the requests having the same issuing source and issuing destination. In a comparison based on the response times alone, the two incidents might not be determined to be similar.
However, the two incidents have identical occurrence source service systems and phenomena but different response times, while the waveforms of the response time variation rates are similar. In this case, therefore, according to this embodiment, the two incidents are determined to be similar incidents caused by the same responsible service system. A feature of the service systems constructed in the cloud services is that the load balancer shortens the response times by implementing scale-out processing (i.e., increasing the number of virtual machines) and lengthens the response times by implementing scale-in processing (i.e., reducing the number of virtual machines) as appropriate. Therefore, when the correlation between the two incidents is checked, the correlation between the response time variation rates, rather than the correlation between the response times themselves, is preferably checked in order to reduce the effect of the control executed by the load balancer.
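The following non-limiting illustration, with made-up numbers, shows the effect: the past incident is assumed to have been recorded while scale-out was gradually shortening the response times, so the raw response times correlate less strongly than the variation rates, for which the slow trend becomes a constant offset that the correlation coefficient ignores.

```python
# Illustration with made-up numbers. In the past incident, scale-out was
# gradually shortening the response times (a downward trend); the incident
# spike itself is the same in both cases.
from statistics import correlation  # Pearson correlation, Python 3.10+

new_rt  = [2.0, 2.0, 2.0, 10.0, 10.0]  # stable baseline plus incident spike
past_rt = [9.0, 8.0, 7.0, 14.0, 13.0]  # same spike on a falling baseline

def rates(rt, interval_min=5.0):
    # Differencing turns the slow trend into a constant offset, which the
    # correlation coefficient then ignores.
    return [(b - a) / interval_min for a, b in zip(rt, rt[1:])]

print(correlation(new_rt, past_rt))                # ~0.97: dragged down by the trend
print(correlation(rates(new_rt), rates(past_rt)))  # 1.0: variation patterns match
```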
Further, when calculating the correlation, a similar past incident may be detected by calculating four correlations between the response time variation rates of the four dummy requests relating to the new incident and the past incident.
Returning to the flowchart, the processor identifies the second service system that was responsible for the detected similar past incident as the second service system estimated to be responsible for the new incident, and displays the estimated service system on the incident management interface of the operator terminal device (S13).
According to the first embodiment, as described above, the incident analysis device of the management server issues dummy requests addressed to the second service systems of the second cloud service from the first service systems of the first cloud service at predetermined time intervals, and adds request data associating the dummy requests with information indicating the issuing sources and issuing destinations thereof, the response messages and response times thereto, and the issuing times thereof to the request management DB. Then, when an incident occurs, a new incident-related request DB is generated by extracting from the request management DB the request data generated during the incident occurrence time block in relation to the service system serving as the occurrence source of the incident. A past incident having a correlation with the new incident in terms of the respective response time variation rates of the dummy requests thereof is then detected from among past incidents, whereupon the service system that was responsible for the past incident is estimated to be responsible for the new incident.
A feature of the cloud service is that the configurations of the respective service systems vary over time. In this embodiment, variation in the response times due to these changes in the configurations of the service systems is taken into account such that the correlation between the incidents is checked using the correlation between the response time variation rates.
In a modified example of the first embodiment, during the processing for identifying a similar past incident to the new incident, as well as determining the correlation between the response time variation rates of requests having the same issuing source and issuing destination among the incident-related request data, whether or not the incidents have identical response messages and identical normal/abnormal determination results may also be used as determination references. Furthermore, the presence of a correlation may be determined using these determination references in relation to each of a plurality of requests having a plurality of combinations of issuing sources and issuing destinations.
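A sketch of this modified determination is given below; the record shapes, the alignment of samples by position, and the 0.7 threshold (taken from the very close correlation range noted above) are assumptions.

```python
# Sketch of the modified determination: for each issuing source/destination
# pair, require a high variation-rate correlation AND matching response
# messages AND matching normal/abnormal determinations. Samples are assumed
# to be aligned by position relative to the incident occurrence time.
from statistics import correlation  # Python 3.10+
from typing import Dict, List, Tuple

def pair_similar(new: List[Dict], past: List[Dict],
                 threshold: float = 0.7) -> bool:
    rates_ok = correlation([r["variation_rate"] for r in new],
                           [r["variation_rate"] for r in past]) >= threshold
    messages_ok = all(a["mes"] == b["mes"] for a, b in zip(new, past))
    flags_ok = all(a["normal"] == b["normal"] for a, b in zip(new, past))
    return rates_ok and messages_ok and flags_ok

def incidents_similar(new_db: Dict[Tuple[str, str], List[Dict]],
                      past_db: Dict[Tuple[str, str], List[Dict]]) -> bool:
    """Apply the determination to every source/destination combination."""
    return all(pair_similar(new_db[k], past_db[k])
               for k in new_db.keys() & past_db.keys())
```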
When the occurrence of a new incident is reported (S4), the processor adds data relating to the new incident to the incident database 26 (S5). Further, when the operator terminal device 30 of the cloud service specifies the new incident from the incident list display screen (S6) such that an analysis request is received in relation to the incident (S7), the processor executes the incident cause estimation program in order to generate the incident-related request database 25 in relation to the new incident (S8). The new incident-related request database is generated in an identical manner to the first embodiment.
Further, the processor determines that the respective requests in the new incident-related request database are abnormal when the response times thereof exceed the threshold set in relation to the average value, determines that the requests are normal when the response times thereof are within the threshold, and then adds the determination results to the database (S9). This processing is also identical to the processing of the first embodiment.
Finally, the processor estimates the second service system that is responsible for the incident on the basis of the response times, response messages, and normal/abnormal determination results of the requests in the new incident-related request database (S20). The service estimated to be responsible is then displayed by the incident management interface (S13).
In this case, the processor, upon execution of the incident cause estimation program, estimates that a problem has occurred in the service system S_1, and that the service system S_1 is responsible for the new incident.
The service system that is responsible for the new incident may be estimated by performing a similar analysis to that described above on the basis of the issuing sources and issuing destinations of the requests having “Bad Request” as the response message thereto.
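This message-based analysis may be sketched as follows (a non-limiting illustration; the record shape and grouping rules are assumptions):

```python
# Sketch of the second embodiment's message-based analysis: group the
# requests whose response message indicates an error (e.g. "Bad Request")
# by issuing source and destination, then apply the same reasoning as for
# response times. The record shape is an assumption.
from typing import Dict, List, Optional

def estimate_from_messages(records: List[Dict]) -> Optional[str]:
    bad = [r for r in records if r["mes"] == "Bad Request"]
    bad_dests = {r["d_s"] for r in bad}
    bad_sources = {r["s_s"] for r in bad}
    if len(bad_dests) == 1 and len(bad_sources) > 1:
        return bad_dests.pop()     # one destination fails for many sources
    if len(bad_sources) == 1 and len(bad_dests) > 1:
        return bad_sources.pop()   # one source fails toward many destinations
    return None

records = [
    {"s_s": "S_A", "d_s": "S_1", "mes": "Bad Request"},
    {"s_s": "S_B", "d_s": "S_1", "mes": "Bad Request"},
    {"s_s": "S_A", "d_s": "S_2", "mes": "OK"},
    {"s_s": "S_B", "d_s": "S_2", "mes": "OK"},
]
print(estimate_from_messages(records))  # S_1
```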
According to the second embodiment, as described above, even though it is not possible to obtain error information and operation information relating to the second service systems constructed in the second cloud service operated by another party, the incident analysis device of one's own first cloud service issues dummy requests having the first service systems as issuing sources and the second service systems as issuing destinations at predetermined time intervals, and accumulates request data including response information relating thereto in the request management database. Then, when an incident occurs, the request data that affect the service system serving as the occurrence source of the incident are extracted and analyzed, and as a result, the second service system estimated to be responsible for the incident is identified.
The incident analysis program, incident analysis method, and incident analysis device described above respectively correspond to a program, a method, and a device for identifying a service responsible for an incident.
The “service” in claims 13, 14, and 15 corresponds to a service system, and the “request” corresponds to a request. Further, the “incident relating to a response time to a request issued by the service” corresponds to an incident in which the response time to the request issued by the service system is long. Furthermore, the “service identification information” corresponds to information identifying the service system that issued the request, and the “output service identification information” corresponds to outputting the information identifying the service system that issued the request.
All examples and conditional language provided herein are intended for the pedagogical purposes of aiding the reader in understanding the invention and the concepts contributed by the inventor to further the art, and are not to be construed as limitations to such specifically recited examples and conditions, nor does the organization of such examples in the specification relate to a showing of the superiority and inferiority of the invention. Although one or more embodiments of the present invention have been described in detail, it should be understood that the various changes, substitutions, and alterations could be made hereto without departing from the spirit and scope of the invention.
Number | Date | Country | Kind
---|---|---|---
2016-196731 | Oct. 4, 2016 | JP | national