a is a conceptual diagram of processing of computing a busy time.
b shows a specific example of the processing of computing the busy time.
a shows how an average processing time for each service changed as time elapsed.
b shows how a residual of estimated values for the average processing time per service changed as time elapsed.
Although the present invention will be described below by way of the best mode for carrying out the invention (hereinafter referred to as the embodiment), the following embodiment does not limit the invention according to the scope of claims, and not all of the combinations of characteristics described in the embodiment are necessarily essential to the solving means of the invention.
The detection apparatus 20 according to this embodiment is intended to detect, from among the plurality of information processing apparatuses 100 included in the information processing system 10, an information processing apparatus 100 in which an abnormality has occurred. Thereby, even in a case where it is difficult to search for the cause of the abnormality because the internal configuration of the information processing system 10 is complicated, the location where the abnormality has occurred can be identified, and problem solution can be expedited.
The acquisition unit 200 acquires a plurality of communication packets mutually transmitted and received among the respective information processing apparatuses 100 in a predetermined trial period preceding a period subject to detection of an abnormality. As one example, by acquiring replicated data of communication packets, which are transferred through a communication line within the information processing system 10, from a communication apparatus connected to the communication line, and additionally by executing, for example, a tcpdump command of a UNIX® based operating system, the acquisition unit 200 may generate dump data of the replicated data. Note that it is desirable that this trial period be a period in which no abnormality is occurring in the information processing system 10.
The analysis unit 210 analyzes contents of the communication packets in order to compute an average processing time per service under a normal condition. Specifically, the analysis unit 210 includes a number-of-times computing unit 215 and a busy time computing unit 218. For each of the divided periods obtained by dividing the trial period, by using the communication packets acquired during that divided period, the number-of-times computing unit 215 computes, for each of the information processing apparatuses 100 and for each service, the number of calling times, that is, the number of times the service of that information processing apparatus 100 has been called from other information processing apparatuses 100. For example, the number-of-times computing unit 215 judges whether or not each of the communication packets acquired during each of the divided periods is a communication packet for calling a service, based on any one of a destination address URL or identification information of the service contained in the communication packet, and computes the number of the communication packets calling each of the services as the number of calling times for that service.
Additionally, in each of the divided periods, based on the communication packets acquired during that divided period, the busy time computing unit 218 computes a busy time, which is a total amount of time during which each of the information processing apparatuses 100 executes transactions. Specifically, the busy time computing unit 218 judges, as an in-processing time period during which an information processing apparatus 100 is processing transactions, a period from when a communication packet for calling any service provided by that information processing apparatus 100 is acquired to when the communication packets returning the processing results for that service have been acquired, and computes a length of the in-processing time period as a busy time. In order to compute the busy time more accurately, the busy time computing unit 218 may exclude a predetermined processing wait time period from the in-processing time period. This point will be described later in detail.
For each of the information processing apparatuses 100, the service demand computing unit 220 computes an average processing time per service which minimizes an index indicating a difference between the busy time in each of the divided periods and a sum of products obtained by multiplying the number of calling times for each service in that divided period by the average processing time of transactions for processing that service. Specifically, this index may be a sum of squares of the differences in the respective divided periods. To be more precise, the service demand computing unit 220 generates a normal equation for finding an average processing time per service that minimizes the sum of squares of the differences in the respective divided periods.
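The counting performed by the number-of-times computing unit 215 can be sketched as follows. This is only a minimal illustration under stated assumptions: the packet fields `time`, `dst`, `service` and `kind`, and the function name `count_calls`, are hypothetical and are not defined in this embodiment. Each packet is assumed to have already been classified as a call or a reply.

```python
from collections import Counter


def count_calls(packets, period_start, period_length, num_periods):
    """Return one Counter per divided period, keyed by (apparatus, service)."""
    counts = [Counter() for _ in range(num_periods)]
    for pkt in packets:
        # Only packets that call a service contribute to the number of calling times.
        if pkt["kind"] != "call":
            continue
        # Index j of the divided period in which the packet was acquired.
        j = int((pkt["time"] - period_start) // period_length)
        if 0 <= j < num_periods:
            counts[j][(pkt["dst"], pkt["service"])] += 1
    return counts


# Hypothetical packet records (field names are assumptions for illustration).
packets = [
    {"time": 0.5, "dst": "app1", "service": "/order", "kind": "call"},
    {"time": 0.9, "dst": "app1", "service": "/order", "kind": "reply"},
    {"time": 1.2, "dst": "app1", "service": "/order", "kind": "call"},
    {"time": 1.7, "dst": "db1", "service": "query", "kind": "call"},
    {"time": 2.1, "dst": "app1", "service": "/search", "kind": "call"},
]
calls = count_calls(packets, period_start=0.0, period_length=1.0, num_periods=3)
```

Keying the counters by the pair of apparatus and service keeps the counts for each information processing apparatus separate, which is what the later per-apparatus estimation requires.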
Furthermore, with respect to each of the information processing apparatuses 100, the service demand computing unit 220 may compute, in each of the divided periods, a difference between the busy time and a sum of products obtained by multiplying the number of calling times for services respectively by average processing times of transactions processing the services, and compute a variance of the differences in the respective divided periods. For each of the information processing apparatuses 100, the storage unit 230 stores therein the thus computed average processing time per service as previously estimated average processing time per service, and, in addition, stores therein the thus computed variance.
After the trial period has elapsed, in the subject period for which an abnormality is to be detected, the acquisition unit 200 acquires a plurality of communication packets mutually transmitted and received among the information processing apparatuses 100. Based on the acquired communication packets, for each of the information processing apparatuses 100, the number-of-times computing unit 215 computes, for each service, the number of times the service provided by that information processing apparatus 100 has been called from other information processing apparatuses 100. The busy time computing unit 218 computes a busy time, which is a total amount of time during which each of the information processing apparatuses 100 executes transactions, that is, the processing of services. Specific examples of the respective processing are the same as in the case of the divided periods.
Here, consider a multidimensional space formed by coordinate axes each indicating the number of calling times for a service and a coordinate axis indicating the busy time, a point whose coordinate values are the numbers of calling times and the busy time computed in a subject period, and a hyperplane indicated by the average processing times per service previously estimated in the trial period. With respect to each of the information processing apparatuses 100, the deviation judging unit 240 judges whether or not the point indicated by the coordinate values deviates from the hyperplane beyond a predetermined criterion. Then, the output unit 250 regards an information processing apparatus 100 whose point has been judged as deviating from the hyperplane beyond the predetermined criterion as an information processing apparatus in which an abnormality has occurred, and outputs information indicating that information processing apparatus. Thereby, a user can identify an information processing apparatus which is providing a service taking a particularly longer time than under a normal condition.
In this processing example, detection of an abnormality is started without providing the trial period. First of all, the acquisition unit 200 acquires a plurality of communication packets mutually transmitted and received among the information processing apparatuses 100 in each of plural subject periods which elapse sequentially. Every time one of the subject periods elapses, based on the communication packets acquired during that subject period, the number-of-times computing unit 215 computes, for each of the information processing apparatuses 100 and for each service, the number of calling times for that service. Furthermore, every time one of the subject periods elapses, based on the communication packets acquired during that subject period, the busy time computing unit 218 computes the busy time for each of the information processing apparatuses 100. Every time one of the subject periods elapses, based on the communication packets acquired in all of the elapsed subject periods, the service demand computing unit 220 computes the average processing time per service in each of the information processing apparatuses 100, and stores it in the storage unit 230 as an estimated value of the average processing time per service. The average processing time per service can be computed by applying the above described process of minimizing the sum of squares of the differences, with the plural subject periods being treated as the plural divided periods.
Additionally, when one of the subject periods has elapsed, the number-of-times computing unit 215 computes, based on the communication packets acquired during this current subject period, the number of calling times for each service and for each of the information processing apparatuses 100. Moreover, based on the communication packets acquired during the current subject period, the busy time computing unit 218 computes the busy time for each of the information processing apparatuses 100. Then, the deviation judging unit 240 judges whether, in a multidimensional space formed by coordinate axes indicating the numbers of calling times for the respective services and a coordinate axis indicating the busy time, a point corresponding to coordinate values indicated by the numbers of calling times and the busy time computed in the current subject period is deviating, beyond a predetermined criterion, from a hyperplane indicated by the previously estimated average processing time per service stored in the storage unit 230. The output unit 250 regards any information processing apparatus 100 for which the point corresponding to the coordinate values has been judged as deviating from the hyperplane beyond the predetermined criterion as an information processing apparatus 100 in which an abnormality has occurred, and outputs information indicating that information processing apparatus.
Furthermore, in this second processing example, every time the average processing time per service is computed by the service demand computing unit 220, the difference judging unit 260 judges, for each of the information processing apparatuses 100, whether the average processing time per service computed immediately before differs from the currently computed average processing time per service beyond a predetermined criterion. Then, even for an information processing apparatus 100 for which the point corresponding to the coordinate values has been judged as not deviating from the hyperplane, on condition that these average processing times differ from each other beyond the predetermined criterion, the output unit 250 outputs information indicating that information processing apparatus 100, regarding it as the information processing apparatus 100 in which an abnormality has occurred in the current subject period. This is performed for the purpose of adequately detecting occurrence of an abnormality even in a case where, after the average processing time per service has changed, an estimated value thereof is computed immediately in accordance with the change. More specifically, in that case, the hyperplane in the multidimensional space is immediately changed by the estimated value. Although some abnormality is then suspected because of the change of the average processing time per service, the point corresponding to the coordinate values indicated by the observed number of calling times and busy time does not diverge from the hyperplane, and the abnormality cannot be detected by the deviation judging unit 240.
In this embodiment, an abnormality of this kind can be detected in a manner allowing the difference judging unit 260 to detect a change in the average processing time per service itself.
Each of the information processing apparatuses 100 will be indicated by an index k, and each of the services will be indicated by an index i. Based on these definitions, the busy time of the information processing apparatus k in the divided period j will be denoted as $b_{jk}$. Additionally, the number of calling times for the service i provided by the information processing apparatus k will be denoted as $a_{jik}$. Additionally, the average processing time for the service i provided by the information processing apparatus k will be denoted as $d_{ik}$. A relation expressed by the following equation (2) holds among them.
Note that $\varepsilon_{jk}$ indicates an observation error of the busy time and the number of calling times for the information processing apparatus k in the divided period j. The service demand computing unit 220 computes, for each of the information processing apparatuses, the average processing time per service which minimizes a sum of squares of these observation errors. That is, for each of the information processing apparatuses, the service demand computing unit 220 computes $d_{ik}$, i.e., the estimated value of the average processing time per service, by generating and solving a normal equation with respect to m simultaneous linear equations having $d_{ik}$ and $\varepsilon_{jk}$ as unknowns, the normal equation yielding the $d_{ik}$ that minimizes the sum of squares of $\varepsilon_{jk}$.
Furthermore, the service demand computing unit 220 may compute, for each of the information processing apparatuses 100, a difference between the busy time and a sum of products obtained by multiplying the average processing time for each service by the number of calling times for that service, and compute a variance of the differences. Processing of this computation can be expressed as the following equation (3). Note that the average processing time per service estimated in the trial period will be indicated as $\hat{d}_{ik}$.
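The estimation described above can be sketched, for a single information processing apparatus, as an ordinary least-squares fit solved through its normal equation. This is a sketch under stated assumptions, not a definitive implementation: the function names `solve_linear` and `estimate_service_demand` are hypothetical, the busy time in each divided period is modeled as the sum of products of the numbers of calling times and the average processing times plus an observation error, and the variance is taken here as the mean of the squared differences.

```python
def solve_linear(A, y):
    """Solve A x = y by Gaussian elimination with partial pivoting."""
    n = len(A)
    M = [row[:] + [y[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        if abs(M[pivot][col]) < 1e-12:
            raise ValueError("singular normal equation: call ratios vary too little")
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(M[i][c] * x[c] for c in range(i + 1, n))
        x[i] = (M[i][n] - s) / M[i][i]
    return x


def estimate_service_demand(calls, busy):
    """calls[j][i]: calls to service i in period j; busy[j]: busy time in period j."""
    n = len(calls[0])
    # Normal equation (A^T A) d = A^T b for the least-squares problem.
    ata = [[sum(row[p] * row[q] for row in calls) for q in range(n)] for p in range(n)]
    atb = [sum(row[p] * b for row, b in zip(calls, busy)) for p in range(n)]
    d_hat = solve_linear(ata, atb)
    # Differences between observed busy time and the fitted sum of products.
    residuals = [b - sum(a * d for a, d in zip(row, d_hat)) for row, b in zip(calls, busy)]
    # Assumption: variance taken as the mean squared difference.
    variance = sum(r * r for r in residuals) / len(residuals)
    return d_hat, variance


# Four divided periods, two services; busy times generated with d = (1.0, 2.0).
calls = [[10, 2], [4, 8], [6, 6], [1, 9]]
busy = [14.0, 20.0, 18.0, 19.0]
d_hat, variance = estimate_service_demand(calls, busy)
```

The explicit singularity check reflects a practical point: when the ratios among the numbers of calling times barely differ between periods, the normal equation is ill-conditioned and the estimate is unreliable.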
Next, the acquisition unit 200 acquires, for each of the predetermined subject periods, the communication packets transferred within the information processing system 10 during that subject period (S310). It is desirable that the communication packets be acquired through such means as a mirror port of a switching hub provided in the information processing system 10, so that actual communications within the information processing system 10 are not affected by the acquisition. Subsequently, based on the acquired communication packets, for each of the information processing apparatuses 100, the number-of-times computing unit 215 computes, for each service, the number of times the service provided by that information processing apparatus 100 has been called by other information processing apparatuses 100 (S320).
Next, based on the communication packets acquired during each of the subject periods, for each of the information processing apparatuses 100, the busy time computing unit 218 computes the busy time, which is a total amount of time during which transactions, that is, the processing of services, are executed (S330). A specific example of the computation is shown in
a is a conceptual diagram of the processing of computing the busy time. First of all, for each combination of a transmission source and a destination of the communication packets, the busy time computing unit 218 selects the finally transmitted communication packet from among a plurality of communication packets continuously transmitted in the same direction. This is because, when a large piece of data is transmitted after being divided into a plurality of communication packets, these communication packets are considered as a single communication. In
Suppose that only one service is provided by a certain one (referred to as a server) of the information processing apparatuses 100. When that one of the information processing apparatuses 100 receives from another one (referred to as a requester) of the information processing apparatuses a communication packet requesting the service, the busy time computing unit 218 judges a clock time when the communication packet has been transferred to be a starting clock time of the busy time. Furthermore, when a result of processing of the service is returned by the server to the requester in response to the request, the busy time computing unit 218 judges a clock time at that time to be an ending clock time of the busy time.
However, there is a case where, during processing of a transaction thereof, the server returns a confirmation-purpose communication packet to the requester. In this case, the server suspends the transaction for a period thereafter until confirmation responding to the confirmation-purpose communication packet is returned. This period for which the transaction is suspended is a period which occurs because a transmission waiting state of communication packets has occurred or because communication delay has occurred in a communication path. For this reason, this period should not be included in the busy time because the server is not performing the processing of the service during this period. More specifically, if this period is included in the busy time in the server, the busy time in the server becomes longer than usual even when the processing is delayed because of occurrence of an abnormality in the information processing apparatus 100 working as the requester. To be more specific, there is a case where, even when an abnormality has occurred in the information processing apparatus working as the requester, the deviation judging unit 240 judges that an abnormality has occurred in the server. Other than the confirmation-purpose communication packet, there is also a case where a packet for handshake of SSL, or the like, is sent out to the requester.
For this reason, even if a certain period is within a period from when any one of the services has been called to when the result of processing for that service has been returned, the busy time computing unit 218 excludes the certain period from the busy time if, during the certain period, a communication packet corresponding to the service currently being processed has been transmitted to another information processing apparatus 100 and the communication packet responding thereto has not yet been returned (the requester in the case of
b shows a specific example of the processing of computing the busy time. In the example of
During execution of the transaction 1, the server returns a confirmation-purpose communication packet to the requester 1. At this point, while the number of transactions being executed in the server remains two, the transaction 1 out of these transactions goes into a processing wait state. Such a confirmation-purpose communication packet should be transmitted, for example, in compliance with specifications of a communication protocol, and is not needed in processing an application program providing a service. Accordingly, the number of transactions including those in the processing wait state will be referred to as the number of transactions at the application level, and the number of transactions excluding those in the processing wait state will be referred to as the number of transactions at the protocol level. That is, the number of transactions at the application level is two, and the number of transactions at the protocol level is one.
Subsequently, during execution of the transaction 2, the server returns a confirmation-purpose communication packet to the requester 2. At this point, while the number of transactions being executed in the server remains two, all of these transactions go into the processing wait state. Accordingly, the number of transactions at the application level is two, and the number of transactions at the protocol level is zero. Subsequently, a reply responding to the confirmation-purpose communication packet is transmitted to the server from the requester 1. As a result, the transaction 1 is restarted in the server. Thereby, the number of transactions at the protocol level returns to one. Furthermore, a reply responding to the confirmation-purpose communication packet is transmitted to the server from the requester 2. As a result, the transaction 2 is restarted in the server, and the number of transactions at the protocol level returns to two.
In order to detect such a change in a communication state, the busy time computing unit 218 includes, for each of the information processing apparatuses 100, a counter for storing therein the number of transactions at the protocol level. In addition, the busy time computing unit 218 performs the following processing for each of the information processing apparatuses 100. First of all, when the busy time computing unit 218 acquires a communication packet for calling any one of the services provided by the information processing apparatuses 100, it increments the counter corresponding to that information processing apparatus 100. Additionally, when the busy time computing unit 218 acquires a communication packet through which a result of processing of any one of the services provided by that information processing apparatus 100 is returned by that information processing apparatus 100, it decrements the counter. Thereby, the number of transactions at the application level is managed as a counter value.
Furthermore, on condition that the counter value is at least 1, the busy time computing unit 218 decrements the counter value when a confirmation-purpose communication packet is transmitted from the information processing apparatus 100 to other information processing apparatuses 100. Additionally, the busy time computing unit 218 increments the counter value when a reply responding to a confirmation-purpose communication packet is transmitted to that information processing apparatus 100 from another one of the information processing apparatuses 100. Thereby, the number of transactions at the protocol level is managed as the counter value. The busy time computing unit 218 determines, as a busy time at the application level, a period between a clock time when the counter value has changed from 0 to 1, and a clock time when the counter value has changed from 1 to 0. Then, the busy time computing unit 218 excludes, from the busy time at the application level, a time period when the counter value has been 0. A busy time computed as a result of this computation becomes a busy time at the protocol level.
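The counter-based computation of the busy time at the protocol level can be sketched as follows for one information processing apparatus. This is a minimal sketch, assuming a well-formed and time-ordered trace: the event labels `call`, `result`, `confirm_sent` and `confirm_reply` and the function name `protocol_level_busy_time` are hypothetical names introduced only for illustration.

```python
def protocol_level_busy_time(events):
    """Sum the time during which the protocol-level counter is at least 1.

    events: iterable of (time, kind) for one apparatus, where kind is one of
    "call", "result", "confirm_sent", "confirm_reply" (hypothetical labels).
    """
    delta = {"call": +1, "result": -1, "confirm_sent": -1, "confirm_reply": +1}
    counter = 0
    busy = 0.0
    busy_since = None
    for time, kind in sorted(events):
        # A confirmation-purpose packet decrements only while the counter is at least 1.
        if kind == "confirm_sent" and counter < 1:
            continue
        before = counter
        counter += delta[kind]
        if before == 0 and counter == 1:
            busy_since = time              # counter changed from 0 to 1
        elif before == 1 and counter == 0:
            busy += time - busy_since      # counter changed from 1 to 0
            busy_since = None
    return busy


# Two overlapping transactions, each suspended once while awaiting a confirmation reply.
events = [
    (0.0, "call"),           # transaction 1 starts
    (1.0, "call"),           # transaction 2 starts
    (2.0, "confirm_sent"),   # transaction 1 waits
    (3.0, "confirm_sent"),   # transaction 2 waits; nothing is being processed
    (4.0, "confirm_reply"),  # transaction 1 resumes
    (5.0, "confirm_reply"),  # transaction 2 resumes
    (6.0, "result"),         # transaction 1 ends
    (7.0, "result"),         # transaction 2 ends
]
busy_time = protocol_level_busy_time(events)
```

In this trace the period from the first call to the last result spans 7 time units, while the interval from 3.0 to 4.0, in which both transactions are waiting, is excluded, so the value computed at the protocol level is 6.0.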
$b = a_1 + 2a_2$ (4)
Note that, when equation (4) is generalized to a case where n services, from a service $a_1$ to a service $a_n$, exist, observation values for the number of calling times and the busy time are expressed as coordinate values indicated by the following expression (5). Here, points corresponding to these coordinate values in the (n+1)-dimensional space come to be distributed in the neighborhood of a hyperplane indicated by the average processing time for each service.
$\exists k\, \forall\, (a_{j1k}, a_{j2k}, \ldots, a_{jnk}, b_{jk})$ (5)
The deviation judging unit 240 judges whether a point corresponding to coordinate values indicated by the number of calling times and busy time which have been newly computed in the subject period is deviating from this plane beyond a predetermined criterion. For example, five points of coordinate values in an upper part of
$r_{jk} = b_{jk} - \sum_{i} a_{jik}\,\hat{d}_{ik}$ (6)
$|r_{jk}| > 3 \times \hat{\sigma}_{k}$ (7)
Alternatively, the deviation judging unit 240 may compute the residual indicated in equation (6) plural times in the subject period, and judge, based on whether or not these residuals follow a predetermined distribution, whether the point corresponding to the coordinate values is deviating from the plane. The predetermined distribution is, for example, a normal distribution, and follows equation (8).
$\langle r_{pq} \rangle = 0,\quad \langle r_{pq}\, r_{rq} \rangle = \hat{\sigma}_{q}^{2}\,\delta_{pr},\quad N(0, \hat{\sigma}_{q}^{2})$ (8)
Note that < > denotes an ensemble average, $\delta_{pr}$ denotes a Kronecker delta, and $\hat{\sigma}_{q}$ denotes a standard deviation of estimated errors in the information processing apparatus q. The deviation judging unit 240 may judge, for example, by use of a statistical method such as hypothesis testing, to what degree the plural residuals computed by equation (6) in the subject period follow the distribution of r indicated by equation (8). Thereby, how widely the newly computed coordinate values of the busy time and the like are distributed about the hyperplane shown in
Subsequently, the output unit 250 judges whether or not an abnormality has occurred in each of the information processing apparatuses 100 (S350). Specifically, the output unit 250 outputs information indicating an information processing apparatus 100 (S360) on condition that, for that information processing apparatus 100, the point corresponding to the coordinate values expressed by the number of calling times and the busy time computed by the analysis unit 210 is deviating, beyond the predetermined criterion, from the hyperplane indicated by the previously estimated average processing time per service (YES in S350). Note that, if the point corresponding to the coordinate values has diverged from the hyperplane beyond the predetermined criterion only once, the output unit 250 may judge that an abnormality has not occurred. For example, the output unit 250 outputs information indicating the information processing apparatus 100 (S360) on condition that the number of times the point corresponding to the coordinate values has diverged from the hyperplane beyond the predetermined criterion has reached a predetermined number (for example, three). Thereby, accuracy of abnormality detection can be enhanced by excluding, from cases subjected to the detection, a case where an abnormal busy time has been observed due to an observation error or a loss of a communication packet. On condition that the point corresponding to the coordinate values is not deviating beyond the predetermined criterion (NO in S350), the detection apparatus 20 returns the processing to S310 and makes the judgment in the succeeding subject periods.
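The judgment in S350 can be sketched for one information processing apparatus as follows. This is a hedged sketch under stated assumptions: the function names are hypothetical, the residual and the threshold follow equations (6) and (7), and deviations are counted cumulatively over the subject periods, which is one possible reading of the criterion of three described above.

```python
def residual(calls, busy, d_hat):
    """Equation (6): observed busy time minus the sum of products a_i * d_hat_i."""
    return busy - sum(a * d for a, d in zip(calls, d_hat))


def is_deviating(calls, busy, d_hat, sigma_hat, k=3.0):
    """Equation (7): the point deviates when |r| exceeds k times sigma_hat."""
    return abs(residual(calls, busy, d_hat)) > k * sigma_hat


def first_alarm_period(observations, d_hat, sigma_hat, required=3):
    """Return the index of the subject period in which the number of
    deviations reaches `required`, or None if it never does.

    Assumption: deviations are counted cumulatively, not consecutively.
    """
    hits = 0
    for j, (calls, busy) in enumerate(observations):
        if is_deviating(calls, busy, d_hat, sigma_hat):
            hits += 1
            if hits >= required:
                return j
    return None


# Hypothetical previously estimated values and observations for one apparatus.
d_hat = [1.0, 2.0]
sigma_hat = 0.5
observations = [
    ([10, 2], 14.2),  # residual about 0.2: within 3 * sigma_hat
    ([4, 8], 23.0),   # residual 3.0: first deviation
    ([6, 6], 18.1),   # residual about 0.1: within the criterion
    ([5, 5], 19.0),   # residual 4.0: second deviation
    ([2, 3], 11.0),   # residual 3.0: third deviation
]
alarm = first_alarm_period(observations, d_hat, sigma_hat)
```

Requiring several deviations before reporting trades detection latency for robustness against a single busy time distorted by an observation error or a lost communication packet.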
Next, with reference to
In
a shows how the average processing time for each service changed as time elapsed. A horizontal axis thereof indicates an elapsed time (in units of minutes), and a vertical axis thereof indicates estimated values for the average processing time for each service. When a simulated abnormality was caused in the database server after 16 minutes had elapsed since the start of the experiment, the estimated values for the average processing time for each service changed gradually. The reason why the estimated values change gradually and do not immediately follow the true value is that a number of transactions sufficient to enhance accuracy of the estimation cannot be processed in a short time period. To be more specific, finding the average processing time requires solving a normal equation for simultaneous linear equations obtained by assigning a certain number of combinations of the busy time b and the number $a_i$ of calling times into equation (2). Finding an accurate solution of the normal equation requires a plurality of simultaneous linear equations in which the ratios among the numbers $a_i$ of calling times differ widely from one another, corresponding to cases where transactions of the services are processed with various combination ratios. However, because it is rare that the number of calling times changes widely in a short time period, it inevitably takes time for the estimated values to follow the true value.
On the other hand,
As has been described above, with reference to
Next, for each of the information processing apparatuses 100, the deviation judging unit 240 computes an index value indicating to what degree, in a multidimensional space formed by the coordinate axes indicating the numbers of calling times for the respective services and the coordinate axis indicating the busy time, the point corresponding to coordinate values indicated by the numbers of calling times and the busy time computed in the current subject period is deviating from the hyperplane indicated by the average processing time per service stored in the storage unit 230 (S830). This index value is, for example, the above described residual.
On condition that the point corresponding to the coordinate values is deviating from the hyperplane (YES in S840), the output unit 250 outputs information indicating that information processing apparatus 100 (S880). On the other hand, if the point corresponding to the coordinate values is not deviating from the hyperplane (NO in S840), the service demand computing unit 220 updates the average processing time per service stored in the storage unit 230 (S860). To be more specific, based on the communication packets acquired in the already elapsed subject periods, the service demand computing unit 220 computes the average processing time per service in each of the information processing apparatuses 100, and stores it in the storage unit 230.
Next, the difference judging unit 260 judges, for each of the information processing apparatuses 100, whether the average processing time per service computed immediately before differs from the currently computed average processing time per service beyond the predetermined criterion (S870). In order to detect a change in the average processing time, a conventional method called change point analysis can be applied. For example, the difference judging unit 260 may detect a change in the average processing time by using a method such as a Shewhart control chart, a cumulative sum control chart or a geometric moving average. If the difference is equal to or greater than the predetermined criterion (YES in S870), the output unit 250 outputs information indicating that information processing apparatus 100 (S880). On the other hand, if the difference is less than the predetermined criterion (NO in S870), the detection apparatus 20 returns the processing to S800, and repeats the judgment with respect to the succeeding subject periods.
The host controller 1082 connects the RAM 1020 with the CPU 1000 and the graphic controller 1075 which access the RAM 1020 at a high transfer rate. The CPU 1000 operates based on programs stored in the ROM 1010 and the RAM 1020, and controls the respective sections. The graphic controller 1075 obtains image data generated by the CPU 1000 and the like on a frame buffer provided within the RAM 1020, and displays the image data on a display device 1080. Instead of this, the graphic controller 1075 may contain therein a frame buffer for storing image data generated by the CPU 1000 and the like.
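One of the change point methods named above, the cumulative sum control chart, can be sketched as follows for the sequence of estimated average processing times of a single service. This is only an illustrative sketch of a standard two-sided tabular CUSUM: the function name `cusum_change_index` and the parameters `target`, `slack` and `threshold` are hypothetical and would have to be chosen for the system being monitored.

```python
def cusum_change_index(values, target, slack, threshold):
    """Return the first index at which a two-sided CUSUM exceeds the
    threshold, or None if no change is detected.

    values:    successive estimates of the average processing time
    target:    level regarded as normal (hypothetical parameter)
    slack:     deviation from target that is tolerated without accumulating
    threshold: accumulated deviation at which a change is reported
    """
    high = 0.0  # accumulates upward drift
    low = 0.0   # accumulates downward drift
    for idx, x in enumerate(values):
        high = max(0.0, high + (x - target - slack))
        low = max(0.0, low + (target - x - slack))
        if high > threshold or low > threshold:
            return idx
    return None


# Hypothetical estimates: stable near 2.0, then drifting upward.
estimates = [2.0, 2.1, 1.9, 2.0, 2.6, 2.7, 2.8]
changed_at = cusum_change_index(estimates, target=2.0, slack=0.2, threshold=0.8)
```

Because the chart accumulates small deviations, it can report a gradual drift in the estimated value that a comparison of only two successive estimates might miss.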
The input/output controller 1084 connects the host controller 1082 with the communication interface 1030, the hard disk drive 1040 and the CD-ROM drive 1060 which are relatively high-speed input/output devices. The communication interface 1030 communicates with an external apparatus via a network. The hard disk drive 1040 stores programs and data used by the computer 500. The CD-ROM drive 1060 reads out a program or data from a CD-ROM 1095 and supplies it to the RAM 1020 or the hard disk drive 1040.
Additionally, relatively low-speed input/output devices including the ROM 1010, the flexible disk drive 1050 and the input/output chip 1070 are connected with the input/output controller 1084. The ROM 1010 stores a boot program executed by the CPU 1000 at the startup of the computer 500, programs dependent on the hardware of the computer 500, and the like. The flexible disk drive 1050 reads out a program or data from the flexible disk 1090 and supplies it to the RAM 1020 or the hard disk drive 1040 via the input/output chip 1070. The input/output chip 1070 connects the flexible disk drive 1050 and various other input/output devices via, for example, a parallel port, a serial port, a keyboard port and a mouse port.
A program provided to the computer 500 is stored in the flexible disk 1090, the CD-ROM 1095 or a recording medium such as an IC card, and is provided by the user. The program is read from the recording medium through at least one of the input/output chip 1070 and the input/output controller 1084, and is installed in the computer 500 to be executed. Operations which the program causes the computer 500 and the like to execute are the same as those in the detection apparatus 20 which have been described in connection with
The program described above may be stored in an external recording medium. In addition to the flexible disk 1090 and the CD-ROM 1095, any one of an optical recording medium such as a DVD or a PD, a magneto-optic recording medium such as an MD, a tape medium, a semiconductor memory such as an IC card, and the like may be used as the recording medium. Additionally, the program may be supplied to the computer 500 via a network by using, as the recording medium, a storage device such as a hard disk or a RAM provided in a server system connected with a dedicated communication network or the Internet.
As has been described above, the detection apparatus 20 observes the average processing time for each service, which is invariable in that it depends neither on the degree of concentration of transactions nor on their mixture ratio. Thereby, even in the complicated information processing system 10 where a large number of the information processing apparatuses 100 operate cooperatively with one another, the location where an abnormality has occurred can be detected quickly and accurately, which supports trouble handling. Additionally, when data under a normal condition has been collected in advance by conducting the training run, an abnormality can be detected during the abnormality detection operation with minimal computation, namely computation of the residual, and can therefore be detected quickly through an on-line operation. Furthermore, even in a case where the training run is not conducted, abnormalities of various natures can be adequately detected by monitoring both the residual and the processing time as appropriate. Additionally, accuracy of the abnormality detection can be further enhanced by taking into consideration, in the computation, not only the start and end of each transaction but also the waiting time that occurs in compliance with specifications of a communication protocol.
While the present invention has been described by using the embodiment, the technical scope of the present invention is not limited to the scope described in the abovementioned embodiment. It is apparent to those skilled in the art that various modifications or improvements can be made to the abovementioned embodiment. It is apparent from the scope of claims that embodiments to which such modifications or improvements have been made can also be included in the technical scope of the present invention.
Number | Date | Country | Kind |
---|---|---|---|
2006-197177 | Jul 2006 | JP | national |