This application is based upon and claims the benefit of priority of the prior Japanese Patent Application No. 2018-238854, filed on Dec. 20, 2018, the entire contents of which are incorporated herein by reference.
The embodiment discussed herein is related to a recording medium and an analysis apparatus.
In a system in which a plurality of virtual machines run on a physical server, when an abnormality such as a delay in processing occurs, an analysis of the cause is carried out. In this case, the cause is analyzed by using trace data or the like including operation states of the virtual machines, for example.
Related art is disclosed in Japanese Laid-open Patent Publication No. 2015-139699, Japanese Laid-open Patent Publication No. 2017-129931, Japanese Laid-open Patent Publication No. 2014-170482 and Japanese Laid-open Patent Publication No. 2013-171542.
According to an aspect of the embodiments, a non-transitory computer-readable recording medium stores therein an analysis program for causing a computer to execute a process including: acquiring information capable of identifying functions in operation which is obtained by a sampling by a plurality of operating systems at each first time interval with respect to programs in operation; totaling a number of pieces of the acquired information for each function; generating time-series data indicating the number of pieces of the information at each second time interval for the function whose number of pieces of the information satisfies a prescribed condition; analyzing a causal relationship between the functions based on the time-series data; and outputting an analysis result of the causal relationship between the functions.
The object and advantages of the invention will be realized and attained by means of the elements and combinations particularly pointed out in the claims.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory and are not restrictive of the invention.
For example, a delay time at each of a plurality of user terminals is recorded, and a request from a user terminal is executed on condition that the delay time is within a threshold value.
For example, based on trace data obtained on a physical computer, trace data obtained on a virtual computer running on the physical computer is edited.
For example, trace information including an operation state of a process on a virtual machine is associated with a symbol map for identifying processes operating on a plurality of virtual machines.
For example, in a large-scale distributed processing system, a delayed process is extracted, and a location where input/output of the data related to the extracted process was executed is specified.
There is a possibility that the cause of an abnormality generated in a program operating on any one of a plurality of operating systems (OSs) is present in a program operating on another OS. For example, when an abnormality occurs during the operation of a program on a virtual machine, there is a possibility that the cause of occurrence of the abnormality is present in another virtual machine, a host OS, or the like. In this case, even when the cause of occurrence of the abnormality is analyzed by using an analysis tool, it is not easy to analyze the cause thereof in a case where the analysis tool does not support the analysis of the system including the plurality of OSs.
As one aspect of the present disclosure, a cause of occurrence of an abnormality in a program of a system including a plurality of OSs may be easily analyzed.
For example, it is assumed that a user of a virtual machine detects an abnormality of a program operating on the virtual machine, and contacts an administrator of a host apparatus. The abnormality of the program is, for example, a situation that a processing time of a specific function is prolonged, and a processing delay occurs. The administrator examines a performance profile of the virtual machine in which the abnormality has occurred, and consequently specifies a function in which the abnormality has occurred. Investigation methods such as an instruction trace, a function trace, and a memory dump are used in order to specify a cause of occurrence of the abnormality. However, there is a possibility that an abnormality occurs in a function operating on a virtual machine, and the cause of the occurrence of the abnormality is present at the outside of the virtual machine (another virtual machine, a hypervisor, a host OS, or the like). In such a case, the above-mentioned methods do not support the analysis of a system including a plurality of OSs, so that it is difficult to specify the cause of the abnormality.
In addition, each of the above methods has a problem in that the processing overhead is large. When the function trace and the memory dump are used, the amount of work of the user increases because the program is modified in advance to insert a hook point or the like, and is recompiled. When the function trace is used, there is a risk that the object to be examined is limited to a specific type of program (a kernel or a specific application). In order to specify the cause of the abnormality by using the above methods, a user familiar with the program contents is required to analyze the trace data, source code, and the like, whereby the amount of work of the user increases.
An embodiment will be described below with reference to the accompanying drawings.
The host apparatus 1 includes a host OS 11 and virtual machines 12 to run on the host OS 11. In the example illustrated in
Each virtual machine 12 includes an OS 13, and one or a plurality of applications 14 to operate on the OS 13. Each application 14 is implemented by a program including one or a plurality of functions. Each virtual machine 12 is operated by, for example, one or a plurality of virtual central processing units (CPUs) (not illustrated).
In the following description, when the host OS 11 and the OS 13 of the virtual machine 12 are not distinguished from each other, they may be simply referred to as an OS.
The analysis apparatus 2 and the host apparatus 1 are able to communicate with each other via a communication network such as a LAN or WAN. The analysis apparatus 2 analyzes the abnormality of the program operating on the virtual machine 12 by using the information collected from the host apparatus 1.
The measuring unit 21 measures latencies for the plurality of OSs. The measuring unit 21 transmits a prescribed command (for example, a ping command) to the plurality of OSs (the host OS 11 of the host apparatus 1 and the OSs 13 of the virtual machines 12 illustrated in
When the instruction unit 22 has received an abnormality report from a performance monitoring tool, for example, the instruction unit 22 transmits an instruction to execute sampling to the analysis target OSs (the host OS 11 of the host apparatus 1 and the OSs 13 of the virtual machines 12). The performance monitoring tool is mounted in the analysis apparatus 2 in advance to monitor the performance of each OS, and sends a report when an abnormality is detected.
The acquisition unit 23 acquires, from each of the plurality of OSs, information capable of identifying functions in operation that the OSs have obtained by sampling at each first time interval with respect to the programs in operation. The first time interval is, for example, 1 ms. Hereinafter, information capable of identifying a function in operation may be referred to as a sample. The acquisition unit 23 stores the acquired sample in the storage unit 28 while associating the acquired sample with the time of sampling executed by each OS.
The totaling unit 24 totals the number of samples having been acquired within a predetermined time (for example, 30 seconds) for each function. The totaling unit 24 excludes a period of time based on the latency of each OS measured by the measuring unit 21, from the totaling target time of the number of samples.
For example, the totaling unit 24 selects a predetermined number (for example, three) of functions in a descending order of the number of samples from among the results of totaling in each OS, as generation targets of time-series data. Alternatively, the totaling unit 24 may select, for example, a predetermined number (for example, 10) of functions in a descending order of the number of samples from among the results of totaling in all the OSs, as generation targets of time-series data.
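The totaling and selection described above can be expressed as a brief sketch. This is an illustrative, non-limiting sketch in Python; the sample representation (a list of already-resolved function names, one entry per sampling hit) and the helper name `total_samples` are assumptions for illustration, not part of the embodiment.

```python
from collections import Counter

def total_samples(samples, top_n=3):
    """Count the number of samples per function and select the top-N
    functions in a descending order of the number of samples as the
    generation targets of time-series data, as the totaling unit 24 does."""
    counts = Counter(samples)
    ranked = counts.most_common()          # descending order of samples
    targets = [name for name, _ in ranked[:top_n]]
    return counts, targets

# Hypothetical resolved function names, one entry per sampling hit.
counts, targets = total_samples(
    ["f1", "f2", "f1", "f3", "f1", "f2", "f4"], top_n=3)
```

Here "f1" (three samples) and "f2" (two samples) are selected first; the choice of N is a tuning parameter, matching the "predetermined number" in the text.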
The generation unit 25 generates time-series data indicating the number of samples at each second time interval for the function whose number of samples satisfies a prescribed condition. The function whose number of samples satisfies the prescribed condition is a function selected by the totaling unit 24 based on the number of samples, as described above, for example. The second time interval is, for example, one second. The generation unit 25 excludes the period of time based on the latency of each OS measured by the measuring unit 21, from the generation target of the time-series data.
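A minimal sketch of generating such time-series data follows, assuming that each sample is a (timestamp, function-name) pair with timestamps in seconds from the start of sampling; this pair format and the helper name are hypothetical illustrations.

```python
from collections import defaultdict
import math

def build_time_series(samples, interval=1.0):
    """Bucket (timestamp, function) samples into windows of length
    `interval` (the second time interval, for example one second) and
    count the number of samples per function per window."""
    series = defaultdict(lambda: defaultdict(int))
    for t, func in samples:
        series[func][math.floor(t / interval)] += 1  # window index
    return {f: dict(buckets) for f, buckets in series.items()}

# Timestamps are seconds elapsed from the start of sampling.
ts = build_time_series([(0.1, "f1"), (0.5, "f1"), (1.2, "f1"), (1.3, "f2")])
```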
The analysis unit 26 carries out a causal relationship analysis of the plurality of functions based on the time-series data generated by the generation unit 25. In the causal relationship analysis, the analysis unit 26 performs Bayesian estimation by using, for example, the ratio (frequency) of the number of samples indicating a function within the second time interval to the total number of samples within the second time interval as an operation probability of the function. The analysis unit 26 performs Bayesian estimation to calculate, when any one of the functions is operated, a probability that another function has been operated.
The output unit 27 outputs a result of the causal relationship analysis having been carried out by the analysis unit 26. The output unit 27 may be, for example, a display device to display the result of the causal relationship analysis. The output unit 27 may transmit, for example, the causal relationship analysis result to another information processing apparatus or the like.
The storage unit 28 stores various kinds of data related to the process carried out by the analysis apparatus 2. The storage unit 28 stores, for example, the measurement results of the latencies measured by the measuring unit 21, the samples acquired by the acquisition unit 23, the result of totaling obtained by the totaling unit 24, the time-series data generated by the generation unit 25, and the causal relationship analysis result obtained by the analysis unit 26.
The sampling driver 34 acquires (samples) information from the application 14 at an interval (the first time interval) corresponding to the interrupt from the PMC 31. The information acquired by the sampling driver 34 is information (a sample) capable of identifying an operating program, function, or the like, and is, for example, a process identifier (PID), an instruction address, or the like.
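One way to model such a sample is sketched below; the `Sample` record, its field names, and the `(start, end, name)` symbol-map format are hypothetical illustrations of resolving an instruction address to a function name, not the driver's actual data layout.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Sample:
    """One sampling hit recorded by the driver: a process identifier
    and an instruction address (field names are illustrative)."""
    pid: int
    instruction_address: int
    timestamp_ms: float

def resolve_function(sample, symbol_map):
    """Resolve an instruction address to a function name by a list of
    (start, end, name) address ranges -- a hypothetical map format."""
    for start, end, name in symbol_map:
        if start <= sample.instruction_address < end:
            return name
    return "unknown"

symbols = [(0x1000, 0x2000, "do_work"), (0x2000, 0x3000, "idle_loop")]
name = resolve_function(
    Sample(pid=1234, instruction_address=0x1500, timestamp_ms=3.0), symbols)
```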
When an abnormality report indicating a performance degradation is received from the performance monitoring tool, the analysis apparatus 2 transmits an instruction to execute sampling to the OS 13 from the instruction unit 22. After receiving the sampling execution notification, the OS 13 executes sampling, as illustrated in
In the case where the virtual machine 12 has a performance monitoring tool, the OS 13 executes sampling for a predetermined time after receiving a report indicating an abnormality from the performance monitoring tool, and then transmits the acquired samples to the analysis apparatus 2. Alternatively, the OS 13 may continuously execute sampling all the time, and, when a report indicating an abnormality is received from the performance monitoring tool, transmit to the analysis apparatus 2 the samples collected during a predetermined time before the point in time of receiving the report.
The sampling executed by the OS 13 of the virtual machine 12 has been described above; the same applies to the sampling executed by the host OS 11.
After the instruction unit 22 transmits the sampling execution instruction, there is a possibility that the time (latency) until each of the OSs starts sampling is different from each other for each OS. Therefore, the timings at which the OSs start sampling may be different from each other. Accordingly, the measuring unit 21 transmits a ping command to each OS, and stores, as a latency, a half of the period of time until the response is received, in the storage unit 28 in advance. It is considered that the latency measured by the measuring unit 21 corresponds to a period of time from the time when the analysis apparatus 2 transmits a sampling execution instruction to the host apparatus 1 to the time when the OS starts the sampling.
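The latency handling described in this paragraph can be sketched as follows; `send_request` and `exclude_latency_period` are hypothetical stand-ins (the callable substitutes for transmitting the prescribed command over the network and waiting for its response).

```python
import time

def measure_latency(send_request):
    """Estimate the one-way latency to an OS as half of the round-trip
    time of a ping-like request, as the measuring unit 21 does."""
    start = time.monotonic()
    send_request()
    round_trip = time.monotonic() - start
    return round_trip / 2.0

def exclude_latency_period(samples, latency):
    """Drop (timestamp, function) samples falling within the latency
    period after the sampling instruction, so that periods in which an
    OS may not yet have started sampling are excluded from totaling."""
    return [(t, f) for t, f in samples if t >= latency]

# The sleep stands in for a network round trip of about 10 ms.
latency = measure_latency(lambda: time.sleep(0.01))
kept = exclude_latency_period([(0.001, "f1"), (0.5, "f2")], 0.005)
```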
In the example illustrated in
As described above, the analysis apparatus 2 excludes the period of time based on the latency of each OS from the totaling target time of the number of samples and from the time-series data generation target time. This makes it possible for the analysis apparatus 2 to suppress a drop in accuracy of the analysis result due to the influence of the latency, without setting for time start-point adjustment, time synchronization, or the like of the plurality of OSs.
The totaling unit 24 sorts the functions in a descending order of the number of samples, for example, and selects N (for example, three) functions from the top as a time-series data generation target (for example, the functions inside a broken line frame in
The totaling unit 24 carries out a similar totaling process using the samples acquired from each of the OSs, and generates a result of totaling for each OS similar to the result of totaling as represented in
Note that the generation unit 25 may generate time-series data for all the functions in the same manner as in
In
Note that the time-series data represented in
In the data represented in
It is to be noted that there is a possibility that even functions that operate on the same virtual machine 12 use different physical CPUs. Therefore, the generation unit 25 may generate time-series data in which physical CPUs are distinguished.
As in the example represented in
Next, an example of calculations carried out by the analysis unit 26 will be described in detail. The analysis unit 26 calculates a probability that a certain function is a cause of operation of another function by using Bayesian estimation, for example. For example, P(A) is a probability that A occurs, P(B) is a probability that B occurs, P(A|B) is a probability that A occurs after B occurs, and P(B|A) is a probability that B occurs after A occurs. In this case, the following Formula (1) is established.
P(A|B)=P(B|A)×P(A)/P(B) (1)
In the Formula (1), P(A) is referred to as a priori probability, and P(A|B) is referred to as a posterior probability. In a case where an event B occurs after an event A, P(A|B) represents a probability that the event A has occurred when the event B occurs.
Assuming that a frequency (%) of a function F at a certain time t is taken as P(F(t)), P(F(t)) is expressed by the following Formula (2). Here, a period of time from t−1 to t is an example of the second time interval. That is, the frequency of the function F is the ratio of the number of samples indicating the function F within the second time interval to the total number of samples within the second time interval.
P(F(t))=(the number of samples of the function F from t−1 to t)/(the total number of samples of all the functions from t−1 to t) (2)
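Formula (2) translates directly into code. In this sketch the window from t−1 to t is represented as a mapping from function name to its number of samples, an assumed format.

```python
def frequency(counts_in_window, func):
    """Formula (2): the number of samples of the function F within the
    window from t-1 to t, divided by the total number of samples of
    all the functions within that window."""
    total = sum(counts_in_window.values())
    return counts_in_window.get(func, 0) / total if total else 0.0

# Hypothetical window counts: 5 samples of A, 3 of B, 2 of C.
p = frequency({"A": 5, "B": 3, "C": 2}, "A")  # 5 / 10
```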
When Bayesian estimation illustrated in Formula (1) is used while A being F(t) and B being F(t+1), the following Formula (3) is obtained.
P(F(t)|F(t+1))=P(F(t+1)|F(t))×P(F(t))/P(F(t+1)) (3)
In a case where there exists a plurality of functions, when each function is represented as Fi (i=1, 2, . . . ) or Fj (j=1, 2, . . . ), the following Formula (4) is obtained.
P(Fi(t)|Fj(t+1))=P(Fj(t+1)|Fi(t))×P(Fi(t))/P(Fj(t+1)) (4)
For example, when i=2 and j=1, Formula (4) gives the following Formula (5).
P(F2(t)|F1(t+1))=P(F1(t+1)|F2(t))×P(F2(t))/P(F1(t+1)) (5)
Since, of the three terms on the right side in Formula (5), P(F2(t)) and P(F1(t+1)) are frequencies with respect to the respective functions, the analysis unit 26 is able to calculate them by using a formula similar to Formula (2). Since P(F1(t+1)|F2(t)) is a probability that F1(t+1) occurs after F2(t) occurs, the analysis unit 26 is able to calculate it from the samples held in time series from the time t to the time t+1, as in the following Formula (6).
P(F1(t+1)|F2(t))=(the number of samples of F1 present next to the samples of F2)/(the total number of samples of F2) (6)
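Formula (6) can likewise be sketched; the time-ordered list of sampled function names is an assumed representation of the samples held in time series.

```python
def transition_probability(sequence, f_from, f_to):
    """Formula (6): the number of samples of f_to appearing immediately
    after a sample of f_from, divided by the total number of samples
    of f_from, over a time-ordered sequence of function names."""
    follows = sum(1 for a, b in zip(sequence, sequence[1:])
                  if a == f_from and b == f_to)
    occurrences = sequence.count(f_from)
    # The simple denominator of Formula (6) is kept, even though the
    # final f_from sample has no successor.
    return follows / occurrences if occurrences else 0.0

seq = ["F2", "F1", "F2", "F2", "F1", "F3"]
p = transition_probability(seq, "F2", "F1")  # 2 of 3 F2 samples are followed by F1
```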
For the route 1, P(F2(t−1)|F1(t+1)) is expanded as the following Formula (7).
P(F2(t−1)|(F2(t)|F1(t+1)))=P((F2(t)|F1(t+1))|F2(t−1))×P(F2(t−1))/P(F2(t)|F1(t+1)) (7)
Note that P((F2(t)|F1(t+1))|F2(t−1)) in Formula (7) indicates a path from F2(t−1) to F2(t) in the route 1 as illustrated in
Similarly, the analysis unit 26 calculates a probability indicated by the route 2 illustrated in
P(B(t+2)|A(t+1))=0.8 (8-1)
P(B(t+2)|B(t+1))=0.1 (8-2)
P(B(t+2)|C(t+1))=0.2 (8-3)
P(A(t+1)|A(t))=0.03 (8-4)
P(B(t+1)|A(t))=0.04 (8-5)
P(C(t+1)|A(t))=0.01 (8-6)
A calculation example of a probability P(A(t)|B(t+2)) indicating that the cause of the operation of the function B at the time t+2 is the operation of the function A at the time t, will be described by using the data illustrated in
P(A(t)|B(t+2))=P(A(t)|(A(t+1)|B(t+2)))+P(A(t)|(B(t+1)|B(t+2)))+P(A(t)|(C(t+1)|B(t+2))) (9)
The analysis unit 26 converts each term of Formula (9) as illustrated in the following Formulae (10-1) to (10-3) using Bayesian estimation.
P(A(t)|(A(t+1)|B(t+2)))=P((A(t+1)|B(t+2))|A(t))×P(A(t))/P(A(t+1)|B(t+2)) (10-1)
P(A(t)|(B(t+1)|B(t+2)))=P((B(t+1)|B(t+2))|A(t))×P(A(t))/P(B(t+1)|B(t+2)) (10-2)
P(A(t)|(C(t+1)|B(t+2)))=P((C(t+1)|B(t+2))|A(t))×P(A(t))/P(C(t+1)|B(t+2)) (10-3)
In the example illustrated in
P((A(t+1)|B(t+2))|A(t))=P(A(t+1)|A(t))=0.03 (11-1)
P((B(t+1)|B(t+2))|A(t))=P(B(t+1)|A(t))=0.04 (11-2)
P((C(t+1)|B(t+2))|A(t))=P(C(t+1)|A(t))=0.01 (11-3)
As illustrated in
P(A(t+1)|B(t+2))=P(B(t+2)|A(t+1))×P(A(t+1))/P(B(t+2))=0.8×0.1/0.6=0.13 (12-1)
P(B(t+1)|B(t+2))=P(B(t+2)|B(t+1))×P(B(t+1))/P(B(t+2))=0.1×0.7/0.6=0.12 (12-2)
P(C(t+1)|B(t+2))=P(B(t+2)|C(t+1))×P(C(t+1))/P(B(t+2))=0.2×0.2/0.6=0.07 (12-3)
As described above, the analysis unit 26 is able to calculate Formulae (10-1) to (10-3) as in the following Formulae (13-1) to (13-3).
P(A(t)|(A(t+1)|B(t+2)))=P((A(t+1)|B(t+2))|A(t))×P(A(t))/P(A(t+1)|B(t+2))=0.03×0.5/0.13=0.12 (13-1)
P(A(t)|(B(t+1)|B(t+2)))=P((B(t+1)|B(t+2))|A(t))×P(A(t))/P(B(t+1)|B(t+2))=0.04×0.5/0.12=0.17 (13-2)
P(A(t)|(C(t+1)|B(t+2)))=P((C(t+1)|B(t+2))|A(t))×P(A(t))/P(C(t+1)|B(t+2))=0.01×0.5/0.07=0.07 (13-3)
The analysis unit 26 applies, to Formula (9), the calculation results of Formulae (13-1) to (13-3), thereby obtaining a calculation result as indicated by Formula (14) for P(A(t)|B(t+2)).
P(A(t)|B(t+2))=P(A(t)|(A(t+1)|B(t+2)))+P(A(t)|(B(t+1)|B(t+2)))+P(A(t)|(C(t+1)|B(t+2)))=0.12+0.17+0.07=0.36 (14)
Therefore, the probability that the cause of the operation of the function B at the time t+2 is the operation of the function A at the time t is 0.36 (36%).
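The arithmetic in Formulae (12-1) through (14) can be reproduced in a short sketch (Python is used purely as illustration; variable names are hypothetical, and the values are those given in Formulae (8-1) to (8-6) and the stated frequencies). Carried out without rounding the intermediate results, the sum is approximately 0.359, which still rounds to the 0.36 above.

```python
def bayes(p_b_given_a, p_a, p_b):
    """Formula (1): P(A|B) = P(B|A) x P(A) / P(B)."""
    return p_b_given_a * p_a / p_b

# Frequencies and transition probabilities given in the example.
p_A_t, p_A_t1, p_B_t1, p_C_t1, p_B_t2 = 0.5, 0.1, 0.7, 0.2, 0.6
p_B2_A1, p_B2_B1, p_B2_C1 = 0.8, 0.1, 0.2       # Formulae (8-1)-(8-3)
p_A1_A0, p_B1_A0, p_C1_A0 = 0.03, 0.04, 0.01    # Formulae (8-4)-(8-6)

# Formulae (12-1) to (12-3): intermediate posteriors.
p_A1_B2 = bayes(p_B2_A1, p_A_t1, p_B_t2)   # about 0.13
p_B1_B2 = bayes(p_B2_B1, p_B_t1, p_B_t2)   # about 0.12
p_C1_B2 = bayes(p_B2_C1, p_C_t1, p_B_t2)   # about 0.07

# Formulae (13-1) to (13-3), summed as in Formula (14).
p_A0_B2 = (bayes(p_A1_A0, p_A_t, p_A1_B2)
           + bayes(p_B1_A0, p_A_t, p_B1_B2)
           + bayes(p_C1_A0, p_A_t, p_C1_B2))   # about 0.36
```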
Although the example in which the analysis unit 26 carries out a causal relationship analysis of a plurality of functions by using Bayesian estimation is described above, the analysis unit 26 may carry out a causal relationship analysis by using a method other than Bayesian estimation. The analysis unit 26 may carry out a causal relationship analysis of a plurality of functions by using, for example, the randomized controlled trial (RCT).
In a case where an abnormality occurs in any function, a user may estimate which function has caused the occurrence of the abnormality by referring to the causal relationship analysis result. For example, in the case where an abnormality occurs in a function 5-VM3, the user may recognize that the function with the highest cause probability is a function 2-VM1 by referring to the row of the function 5-VM3 in
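Reading off the most probable cause from such an analysis result can be sketched as follows; the mapping format and all numeric values here are hypothetical illustrations of one row of the output table, not data from the embodiment.

```python
def most_probable_cause(cause_probabilities):
    """Given the causal-analysis row for an abnormal function -- a
    mapping from candidate cause function to its cause probability
    (an assumed representation) -- return the candidate with the
    highest probability."""
    return max(cause_probabilities, key=cause_probabilities.get)

# Hypothetical row for an abnormal function.
row = {"function 2-VM1": 0.36, "function 7-host": 0.21,
       "function 3-VM2": 0.09}
cause = most_probable_cause(row)
```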
An output form of the causal relationship analysis result is not limited to the example in
The acquisition unit 23 acquires, from each of the plurality of OSs, information capable of identifying functions in operation obtained by the sampling executed by the OSs at each first time interval (for example, 1 ms) with respect to programs in operation (step S104). The totaling unit 24 totals the number of samples acquired within a predetermined time (for example, 30 seconds) for each function (step S105). The totaling unit 24 excludes a period of time based on the latency of each OS measured by the measuring unit 21, from the totaling target time of the number of samples. The totaling result of the totaling unit 24 is data including the number of samples and frequencies as represented in
The generation unit 25 generates time-series data indicating the number of samples at each second time interval for the function whose number of samples satisfies a prescribed condition (step S106). The generation unit 25 excludes the period of time based on the latency of each OS measured by the measuring unit 21, from the generation target time of the time-series data. The time-series data generated by the generation unit 25 is, for example, the data represented in
The analysis unit 26 carries out a causal relationship analysis of the plurality of functions based on the time-series data generated by the generation unit 25 (step S107). For example, the analysis unit 26 carries out a causal relationship analysis by Bayesian estimation, in which the ratio (frequency) of the number of samples indicating the function within the second time interval to the total number of samples within the second time interval is used as an operation probability of the above function.
The output unit 27 outputs a result of the causal relationship analysis carried out by the analysis unit 26 (step S108). The output unit 27 is, for example, a display device to display the causal relationship analysis result. The output unit 27 may transmit, for example, the causal relationship analysis result to another information processing apparatus or the like.
As described above, the analysis apparatus 2 is able to easily analyze the causes of abnormalities that occurred in the functions operating on the plurality of OSs. For example, when an abnormality occurs in a function operating on the virtual machine 12 as in the example illustrated in
In the case of functions using the same physical CPU, it is readily assumed that, when one of the functions operates, the other function is affected thereby. However, in a case where a plurality of functions use different physical CPUs, it is not easy to analyze a causal relationship between the functions. For example, in the example illustrated in
The analysis apparatus 2 of the present embodiment analyzes a causal relationship of the functions based on the number of samples acquired from the plurality of OSs. Therefore, it is possible to analyze a causal relationship even between the functions (the functions 1-1 and 1-2) configured to operate on the different virtual machines and to use the different physical CPUs, as in the example illustrated in
Next, an example of a hardware configuration of the analysis apparatus 2 will be described.
The processor 111 executes a program loaded to the memory 112. An analysis program configured to carry out the process in the embodiment may be applied to the program to be executed.
The memory 112 is, for example, a random-access memory (RAM). The auxiliary storage device 113 is a storage device configured to store various kinds of information. For example, a hard disk drive, a semiconductor memory, or the like may be used as the auxiliary storage device 113. The analysis program configured to carry out the process of the embodiment may be stored in the auxiliary storage device 113.
The communication interface 114 is coupled to a communication network such as a local area network (LAN) or a wide area network (WAN). The communication interface 114 performs data conversion or the like involved in communication. The communication interface 114 illustrated in
The medium coupling unit 115 is an interface to which a portable recording medium 118 is able to be coupled. An optical disc (for example, a compact disc (CD) or a digital versatile disc (DVD)), a semiconductor memory, or the like may be used as the portable recording medium 118. The analysis program configured to carry out the process of the embodiment may be stored on the portable recording medium 118.
The memory 112, the auxiliary storage device 113, and the portable recording medium 118 are non-transitory computer-readable physical storage media and are not temporary media such as signal carriers.
The input device 116 is, for example, a keyboard, a pointing device, or the like. The input device 116 accepts input of an instruction, information, and so forth from a user.
The output device 117 is, for example, a display device, a printer, a speaker, or the like. The output device 117 outputs an inquiry or an instruction to a user, a processing result, and so forth. The output device 117 illustrated in
The storage unit 28 illustrated in
It is to be noted that the analysis apparatus 2 may not include all of the constituent elements illustrated in
The present embodiment is not limited to the above-described embodiment, and various modifications, additions, and omissions are applicable without departing from the gist of the present embodiment.
All examples and conditional language provided herein are intended for the pedagogical purposes of aiding the reader in understanding the invention and the concepts contributed by the inventor to further the art, and are not to be construed as limitations to such specifically recited examples and conditions, nor does the organization of such examples in the specification relate to a showing of the superiority and inferiority of the invention. Although one or more embodiments of the present invention have been described in detail, it should be understood that the various changes, substitutions, and alterations could be made hereto without departing from the spirit and scope of the invention.
Number | Date | Country | Kind |
---|---|---|---
2018-238854 | Dec 2018 | JP | national |