The embodiment discussed herein is directed to a technology for supporting operations management in various types of systems, such as information systems.
In operations management for an information system, the statuses of the various devices under management are monitored, and, for example, the occurrence of a write error on a recording medium is detected as an event. An event that does not directly lead to a system failure can be recovered from by retrying the processing, and is therefore called a “failure sign” to differentiate it from a system failure. If a failure sign is neglected, a serious system failure may eventually occur, and therefore an operation called “preventive maintenance”, such as backing up or exchanging the recording medium, may be performed when the failure sign appears. Furthermore, in a fully duplicated system, services are not suspended even if a system failure occurs, and therefore an operation called “troubleshooting”, which recovers from the failure, may instead be performed when the system failure occurs.
Preventive maintenance has the advantage of improving system availability but the disadvantage of increasing the operational cost. Troubleshooting, on the other hand, has the disadvantage of lowering system availability but the advantage of suppressing the operational cost. A system operations manager therefore needs to judge, based on the operational cost and the system failure occurrence probability, whether to adopt preventive maintenance or troubleshooting. To support the choice of an equipment maintenance method, Japanese Laid-open (Kokai) Patent Application Publication No. 2004-152017 (Patent Document 1) proposes a technology that calculates a failure risk and a recovery cost for each site, based on a failure rate indicating the failure occurrence frequency, the probability that a failure sign is overlooked and leads to a failure, and the cost of the damage incurred when the failure occurs.
An information system is characterized in that the causes of failure sign appearance vary with service burdens, usage environments and the like. In this case, from the viewpoint of the operational cost, it is desirable, during the interval from the failure sign appearance to the system failure occurrence, to consider the preventive maintenance cost and the system failure occurrence probability at the time of the failure sign appearance, and thereby judge whether or not to perform the preventive maintenance. However, the conventionally proposed technology regards aging deterioration of parts as the cause of failure and accordingly fixes the failure rate at each site, so that the preventive maintenance cost and the system failure occurrence probability at the time of the failure sign appearance are not calculated with sufficient precision. Therefore, even if the preventive maintenance cost and the system failure occurrence probability are offered when the failure sign appears, it has been difficult to judge objectively whether or not to perform the preventive maintenance at that time.
Therefore, in view of the conventional problems as described above, the present invention has as an object to provide a technology for offering system failure occurrence probability at failure sign appearance and a maintenance cost at the failure sign appearance, both of which are calculated based on past failure sign appearance situations, past failure occurrence situations, and maintenance cost information, to thereby support system operations management.
In the present system operations management supporting technology, past failure sign appearance situations and past failure occurrence situations are referred to, to thereby calculate, for each failure sign that appeared between the performance of preventive maintenance or troubleshooting and the subsequent failure occurrence, a failure occurrence probability which varies as time elapses after that failure sign. Furthermore, maintenance cost information for an operations management objective system is referred to, to thereby calculate a short-term troubleshooting cost required for responding to a failure which is associated with the failure sign that appeared in the operations management objective system. Furthermore, the past failure sign appearance situations, the past failure occurrence situations, and the maintenance cost information are referred to, to thereby calculate a short-term first preventive maintenance cost required for the preventive maintenance of the failure associated with the failure sign. At the same time, the failure occurrence probability for the case in which the preventive maintenance is postponed until the next failure sign appearance is calculated from the failure occurrence probabilities of the respective failure signs, using the probability of the failure sign whose ordinal number, counted from the last preventive maintenance or troubleshooting, corresponds to that of the failure sign that appeared. In addition, a short-term second preventive maintenance cost, for the preventive maintenance to be performed at the moment of the next failure sign appearance, is calculated based on the short-term troubleshooting cost, the short-term first preventive maintenance cost, and the failure occurrence probability for the case in which the preventive maintenance is postponed until the next failure sign appearance.
Then, options in which the short-term first preventive maintenance cost, the short-term second preventive maintenance cost and the short-term troubleshooting cost are each associated with the failure occurrence probability are prepared and offered via an output device.
Thus, the failure occurrence probability and the short-term cost are dynamically calculated taking the past failure sign appearance situations and the past failure occurrence situations into consideration. Then, the options indicating the failure occurrence probability and the short-term cost in the case in which the preventive maintenance is performed at each of the moments of the current failure sign appearance, the next failure sign appearance and the failure occurrence, are prepared to be offered via the output device.
According to the above-described system operations management supporting technology, even in the case in which the present invention is applied to an information system in which the causes of failure sign appearance vary with service burdens, usage environments and the like, it is possible to calculate the failure occurrence probability and the short-term cost with high precision.
Furthermore, an operations manager can refer to the options offered via the output device, to thereby grasp a risk and a cost in the case in which the preventive maintenance or the response is performed at each of the moments of the current failure sign appearance, the next failure sign appearance and the failure occurrence. Therefore, in the case in which operation policies are determined as “minimization of cost”, “minimization of out-of-service risk” and the like, the operations manager can refer to the offered information, to thereby objectively judge which of the preventive maintenance and the response is the best by eliminating subjective judgment. Furthermore, irrespective of the knowledge/experience of the operations manager, the response available to the failure sign appearance can be determined.
Hereinafter, the present invention will be described in detail, referring to the appended drawings.
A system operations management supporting apparatus 10 is connected to each operations management objective system 30, such as an application server providing various types of services, via a network 20, such as the Internet, a LAN (Local Area Network) or a WAN (Wide Area Network). In the operations management objective system 30, monitoring functions are installed in advance, for example, S.M.A.R.T. (Self-Monitoring, Analysis and Reporting Technology) standardized in the hard disk industry, software which monitors a CPU (Central Processing Unit) utilization ratio, and the like. Then, when a sign of system failure is detected in the operations management objective system 30, failure sign information 40 containing the failure sign content and the failure sign appearance site is notified to the system operations management supporting apparatus 10.
The system operations management supporting apparatus 10 is constructed by a computer which executes a system operations management supporting program. As illustrated in
The correspondence table 10A is for associating a failure and a responding method with an event being a failure sign, and as illustrated in
In the CMDB 10B, an event log, an incident log and system configuration information are stored for each operations management objective system 30. As illustrated in
Furthermore, in the system operations management supporting apparatus 10, by executing the system operations management supporting program, a troubleshooting information preparing section 10C, a failure occurrence probability calculating section 10D, a troubleshooting cost calculating section 10E, a preventive maintenance cost calculating section 10F, an amount of loss calculating section 10G, an options preparing section 10H and an information offering section 10I are respectively realized.
In the troubleshooting information preparing section 10C, the correspondence table 10A is referred to, so that information representing the failure which is associated with the failure sign and the responding method thereof is prepared. In the failure occurrence probability calculating section 10D, the correspondence table 10A, the event log in the CMDB 10B and the incident log therein are referred to, so that a failure occurrence probability which increases as time elapses is calculated based on past failure occurrence situations. In the troubleshooting cost calculating section 10E, the correspondence table 10A, the incident log in the CMDB 10B and the system configuration information therein are referred to, so that a troubleshooting cost required for failure recovery is calculated. In the preventive maintenance cost calculating section 10F, the correspondence table 10A, the event log in the CMDB 10B, the incident log therein and the system configuration information therein are referred to, so that a preventive maintenance cost required for preventive maintenance is calculated. In the amount of loss calculating section 10G, the correspondence table 10A, the incident log in the CMDB 10B and the system configuration information therein are referred to, so that an amount of loss due to out-of-service is calculated. The options preparing section 10H receives the respective outputs from the troubleshooting information preparing section 10C, the failure occurrence probability calculating section 10D, the troubleshooting cost calculating section 10E, the preventive maintenance cost calculating section 10F and the amount of loss calculating section 10G, and prepares the options to be offered to an operations manager. In the information offering section 10I, the options prepared in the options preparing section 10H are offered via various types of output devices, such as a monitor, a printer and the like.
Here, failure information preparing means, failure occurrence probability calculating means, troubleshooting cost calculating means, preventive maintenance cost calculating means and amount of loss calculating means are realized, respectively, by the troubleshooting information preparing section 10C, the failure occurrence probability calculating section 10D, the troubleshooting cost calculating section 10E, the preventive maintenance cost calculating section 10F and the amount of loss calculating section 10G. Furthermore, options preparing means and information offering means are realized, respectively, by the options preparing section 10H and the information offering section 10I.
In step 1 (in the drawing, to be abbreviated as “S1”, and the same rule will be applied to the subsequent steps), the correspondence table 10A is referred to, to thereby acquire the failure content which is associated with the failure sign specified by the failure sign information 40 and the responding method thereof. Explaining this process using a specific example, if the failure sign specified by the failure sign information 40 is “I/O error”, the correspondence table 10A is referred to, using “I/O error” as a key, to thereby acquire “HDD failure” and “HDD exchange” as the failure content which is associated with “I/O error” and the responding method thereof.
According to the above-described troubleshooting information preparation processing, when the failure sign information 40 is notified, the system failure which may occur in the future due to the failure sign specified by the failure sign information 40, and the responding method of the system failure, can be specified.
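Incidentally, the lookup in step 1 can be sketched as follows. This is a minimal illustration only: the dictionary `CORRESPONDENCE_TABLE`, the `Correspondence` record and the function name `prepare_troubleshooting_info` are hypothetical stand-ins for the correspondence table 10A and the troubleshooting information preparing section 10C, and the single entry reuses the “I/O error” example given above.

```python
from typing import NamedTuple, Optional, Tuple


class Correspondence(NamedTuple):
    sign_content: str       # condition that defines the failure sign
    failure_content: str    # failure expected to follow the failure sign
    responding_method: str  # response associated with that failure


# Hypothetical in-memory stand-in for the correspondence table 10A.
CORRESPONDENCE_TABLE = {
    "I/O error": Correspondence(
        sign_content="I/O error>10 times/day",
        failure_content="HDD failure",
        responding_method="HDD exchange",
    ),
}


def prepare_troubleshooting_info(failure_sign: str) -> Optional[Tuple[str, str]]:
    """Return (failure content, responding method) for a failure sign.

    Returns None when the failure sign is not registered in the table.
    """
    entry = CORRESPONDENCE_TABLE.get(failure_sign)
    if entry is None:
        return None
    return entry.failure_content, entry.responding_method
```

Returning `None` for an unregistered failure sign is a design choice of this sketch; the handling of such a case is not specified in the present description.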
In step 11, the correspondence table 10A is referred to, to thereby acquire the failure sign content and the failure content which are associated with the failure sign specified by the failure sign information 40. Explaining this process using a specific example, if the failure sign specified by the failure sign information 40 is “I/O error”, the correspondence table 10A is referred to, using “I/O error” as a key, to thereby acquire “I/O error>10 times/day” and “HDD failure” as the failure sign content and the failure content associated with “I/O error”.
In step 12, the incident log in the CMDB 10B is referred to, to thereby acquire all of past failure occurrence records corresponding to the failure content, as illustrated in
In step 13, the event log in the CMDB 10B is referred to, to thereby acquire all of the failure signs which correspond to the failure sign content, and also, appeared during a predetermined period of time (for example, six months) from failure occurrence date and time, for each failure specified by each failure occurrence record, as illustrated in
In step 14, the failure signs are sorted using the appearance dates and times as a key, to be lined up in time series. Incidentally, instead of sorting the failure signs, the failure signs may be numbered in time series. Then, as illustrated in
In step 15, 1 is substituted into loop variable n.
In step 16, as illustrated in
In step 17, the periods of time from the failure sign appearance to the failure occurrence are sorted in ascending sequence, and as illustrated in
In step 18, it is judged whether or not all of the failure signs associated with each failure have been processed. Then, if all of the failure signs have been processed (Yes), the failure occurrence probability calculation processing is ended, whereas if any failure sign remains unprocessed (No), the routine proceeds to step 19.
In step 19, after 1 is added to the loop variable n, the routine returns to step 16.
According to the failure occurrence probability calculation processing as described above, the past failure sign appearance situations and the past failure occurrence situations are referred to, so that the failure occurrence probability which varies with the time elapse after the nth failure sign appearance is calculated as illustrated in
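Incidentally, one possible reading of steps 12 to 17 is sketched below. It is an assumption of this sketch, not a statement of the embodiment, that the probability for the nth failure sign is the fraction of past failures whose period from the nth failure sign to the failure is no longer than the elapsed time. The function names, the 182-day default window (standing in for the six-month example) and the use of plain lists of timestamps in place of the event log and the incident log are all hypothetical.

```python
from bisect import bisect_right
from datetime import datetime, timedelta
from typing import List


def periods_to_failure(
    failures: List[datetime],
    signs: List[datetime],
    n: int,
    window: timedelta = timedelta(days=182),
) -> List[float]:
    """Days from the nth failure sign to each past failure, in ascending order.

    For each failure, only the failure signs that appeared within `window`
    before the failure are considered, sorted in time series (steps 13, 14).
    """
    periods = []
    for failure_time in failures:
        preceding = sorted(
            s for s in signs if failure_time - window <= s <= failure_time
        )
        if len(preceding) >= n:
            delta = failure_time - preceding[n - 1]
            periods.append(delta.total_seconds() / 86400.0)
    return sorted(periods)  # ascending sequence, as in step 17


def failure_probability(sorted_periods: List[float], elapsed_days: float) -> float:
    """Fraction of past failures that occurred within `elapsed_days` of the sign."""
    if not sorted_periods:
        return 0.0
    return bisect_right(sorted_periods, elapsed_days) / len(sorted_periods)
```

Calling `periods_to_failure` once for each value of the loop variable n yields one probability curve per failure sign number, each of which rises toward 1 as the elapsed time increases.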
In step 21, the correspondence table 10A is referred to, to thereby acquire the failure sign content associated with the failure sign specified by the failure sign information 40.
In step 22, the incident log in the CMDB 10B is referred to, to thereby acquire, respectively, a failure occurrence interval and a response content (troubleshooting result) until the failure recovery, as illustrated in
In step 23, the equipment cost information and the preventive maintenance information which are contained in the system configuration information of the CMDB 10B are referred to, to thereby calculate a short-term troubleshooting cost for when a response the same as the past response content is performed, as illustrated in
In step 24, the short-term troubleshooting cost is multiplied by the failure occurrence period, to thereby calculate a long-term troubleshooting cost. Here, “long-term” means “over one year” (the same rule will be applied hereinafter).
According to the troubleshooting cost calculation processing as described above, the past failure occurrence situations and the system configuration information are referred to, and when the failure associated with the failure sign occurs, the short-term troubleshooting cost and the long-term troubleshooting cost required for the failure recovery are calculated. Therefore, a cost per one time and a cost per one year required for recovering from the failure occurrence can be obtained.
In step 31, the correspondence table 10A is referred to, to thereby acquire, respectively, the failure sign content which is associated with the failure sign specified by the failure sign information 40 and the responding method thereof.
In step 32, the incident log in the CMDB 10B is referred to, to thereby acquire the date and time of the preventive maintenance before the previous preventive maintenance, which corresponds to the responding method, as illustrated in
In step 33, the event log in the CMDB 10B is referred to, to acquire a period of time from the preventive maintenance before the previous preventive maintenance to the second failure sign appearance, in the failure signs corresponding to the failure sign contents, as illustrated in
In step 34, the period of time from the preventive maintenance before the previous preventive maintenance to the second failure sign appearance is divided by the number of days in one year (365 days), to thereby calculate the failure sign appearance frequency per one year.
In step 35, the equipment cost information in the system configuration information of the CMDB 10B is referred to, to thereby calculate the short-term preventive maintenance cost for when the preventive maintenance the same as the previous preventive maintenance is performed.
In step 36, the short-term preventive maintenance cost is multiplied by the failure sign appearance frequency per one year, to thereby calculate the long-term preventive maintenance cost.
In step 37, the event log in the CMDB 10B is referred to, to thereby acquire a period of time from the second failure sign appearance to the third failure sign appearance after the preventive maintenance before the previous preventive maintenance is performed, as illustrated in
In step 38, the second failure occurrence probability calculated in the failure occurrence probability calculating section 10D is referred to, to thereby acquire the failure occurrence probability corresponding to the period of time from the second failure sign appearance to the third failure sign appearance, as illustrated in
In step 39, the short-term preventive maintenance cost and the long-term preventive maintenance cost for when the preventive maintenance is performed at the moment of the next failure sign appearance, are calculated using the following formulas. Incidentally, in the following formulas, the failure occurrence probability acquired in step 38 is to be called “probability” and the period of time from the second failure sign appearance to the third failure sign appearance is to be called “period of time”.
Short-term preventive maintenance cost at the moment of the next failure sign appearance=probability×troubleshooting cost+(1−probability)×short-term preventive maintenance cost at the moment of the current failure sign appearance
Long-term preventive maintenance cost at the moment of the next failure sign appearance=period of time/12×short-term preventive maintenance cost at the moment of the next failure sign appearance+(12−period of time)/12×long-term preventive maintenance cost at the moment of the current failure sign appearance
According to the preventive maintenance cost calculation processing as described above, the past failure sign appearance situations, the past failure occurrence situations and the system configuration information are referred to, so that the short-term preventive maintenance cost and the long-term preventive maintenance cost at the moment of the current and next failure signs appearance, are calculated. At this time, the long-term preventive maintenance cost at the moment of the next failure sign appearance is calculated, taking the failure occurrence probability of future system failure into consideration, and therefore, it is possible to improve calculation precision.
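Incidentally, the two formulas of step 39 can be written out as the following sketch. The function and parameter names are hypothetical, and it is assumed here, from the divisor 12 in the second formula, that the “period of time” is expressed in months.

```python
from typing import Tuple


def next_sign_costs(
    probability: float,
    period_months: float,
    troubleshooting_cost: float,
    short_term_pm_now: float,
    long_term_pm_now: float,
) -> Tuple[float, float]:
    """Preventive maintenance costs at the moment of the next failure sign.

    probability          failure occurrence probability acquired in step 38
    period_months        period from the second to the third failure sign
                         (assumed to be in months)
    troubleshooting_cost short-term troubleshooting cost
    short_term_pm_now    short-term preventive maintenance cost at the
                         moment of the current failure sign appearance
    long_term_pm_now     long-term preventive maintenance cost at the
                         moment of the current failure sign appearance
    """
    # Expected cost: troubleshooting if the failure occurs before the next
    # failure sign, preventive maintenance otherwise.
    short_next = (
        probability * troubleshooting_cost
        + (1.0 - probability) * short_term_pm_now
    )
    # Weight the two costs by the share of the year each one covers.
    long_next = (
        period_months / 12.0 * short_next
        + (12.0 - period_months) / 12.0 * long_term_pm_now
    )
    return short_next, long_next
```

For example, with a probability of 0.25, a period of 3 months, a troubleshooting cost of 400, and current short-term and long-term preventive maintenance costs of 100 and 200, the sketch returns 175 and 193.75, respectively; these input values are illustrative only.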
In step 41, the correspondence table 10A is referred to, to thereby acquire a failure sign appearance site associated with the failure sign specified by the failure sign information 40.
In step 42, the incident log in the CMDB 10B is referred to, to thereby predict a time (down time) required for the failure recovery based on the past troubleshooting results, as illustrated in
In step 43, the server information, the service configuration information, the service price information and the SLA information which are contained in the system configuration information of the CMDB 10B, are referred to, to thereby acquire the services, the service prices and the SLA which are affected by the system failure occurred at the failure sign appearance site. Explaining this process by a specific example, referring to
In step 44, an amount of loss is calculated, based on the services, the service prices and the SLA which are affected by the down time and the system failure. Explaining this process by a specific example, referring to
According to the amount of loss calculation processing as described above, the past troubleshooting results and the system configuration information are referred to, so that the opportunity loss and the SLA compensation due to the out-of-service are calculated. Then, by adding the opportunity loss and the SLA compensation, a final amount of loss is obtained. Incidentally, needless to say, the SLA compensation is unnecessary in the case in which the services are provided by one's own system.
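Incidentally, the addition of the opportunity loss and the SLA compensation can be sketched as follows. The manner in which each term is obtained is an assumption of this sketch: the opportunity loss is taken as the service price per hour multiplied by the down time, and the SLA compensation is taken as a flat amount owed when the down time exceeds a permitted down time. The `AffectedService` record, its fields and the function name are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class AffectedService:
    name: str
    price_per_hour: float            # assumed unit of the service price
    sla_allowed_down_time_h: float   # assumed SLA term: permitted down time
    sla_compensation: float          # assumed flat compensation on breach


def amount_of_loss(
    down_time_h: float,
    services: List[AffectedService],
    own_system: bool = False,
) -> float:
    """Opportunity loss plus SLA compensation for the affected services."""
    opportunity_loss = sum(s.price_per_hour * down_time_h for s in services)
    if own_system:
        # No SLA compensation when the services run on one's own system.
        sla_compensation = 0.0
    else:
        sla_compensation = sum(
            s.sla_compensation
            for s in services
            if down_time_h > s.sla_allowed_down_time_h
        )
    return opportunity_loss + sla_compensation
```

The `own_system` flag reflects the remark above that the SLA compensation is unnecessary when the services are provided by one's own system.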
In step 51, as information to be offered to the manager, as illustrated in
According to the options preparation processing as described above, as the information to be offered to the operations manager, the options comprising the expected failure, the amount of loss which may result from the expected failure, and risks and costs in the responses 1 to 3, can be prepared.
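Incidentally, the options can be held in a structure such as the one sketched below, which gathers the expected failure, the amount of loss, and the risk and costs of the responses 1 to 3. The class names, field names and text layout are hypothetical, and the sketch assumes that the “risk” of each response is expressed by its failure occurrence probability.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ResponseOption:
    label: str                  # e.g. when the response is performed
    failure_probability: float  # risk, as a probability between 0 and 1
    short_term_cost: float
    long_term_cost: float


@dataclass
class Options:
    expected_failure: str
    amount_of_loss: float
    responses: List[ResponseOption]


def render(options: Options) -> str:
    """Format the options as plain text for a monitor or a printer."""
    lines = [
        f"Expected failure: {options.expected_failure}",
        f"Amount of loss: {options.amount_of_loss:.2f}",
    ]
    for number, r in enumerate(options.responses, start=1):
        lines.append(
            f"Response {number} ({r.label}): "
            f"risk {r.failure_probability:.0%}, "
            f"short-term cost {r.short_term_cost:.2f}, "
            f"long-term cost {r.long_term_cost:.2f}"
        )
    return "\n".join(lines)
```

Keeping the three responses in one list lets the operations manager compare them side by side under a policy such as “minimization of cost”.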
In step 61, the options are output to an output device, such as a monitor, a printer or the like.
According to the information offering processing as described above, the options can be offered to the operations manager in a visually recognizable form.
According to this system operations management supporting apparatus, the failure occurrence probability and the short-term and long-term costs are dynamically calculated, taking the past failure sign appearance situations and the past failure occurrence situations into consideration. Therefore, even in the case in which the present invention is applied to an information system in which the causes of failure sign appearance vary with service burdens, usage environments and the like, it is possible to calculate with high precision the failure occurrence probability and the short-term and long-term costs.
Then, the options representing the failure occurrence probability and the short-term and long-term costs for the cases in which the responses are performed at the moments of the current failure sign appearance, the next failure sign appearance and the failure occurrence are offered to the operations manager. These options also contain the failure which may occur in the future due to the failure sign appearance and the amount of loss which may be incurred as a result of this failure occurrence. Therefore, by referring to the options offered via the output device, the operations manager can grasp the failure expected to follow the failure sign appearance, the amount of loss resulting from this failure, and the risks and costs for the cases in which the responses are performed at the moments of the current failure sign appearance, the next failure sign appearance and the failure occurrence. Consequently, in the case in which operation policies are determined as “minimization of cost”, “minimization of out-of-service risk” and the like, the operations manager can refer to the offered information, to thereby objectively judge which response is the best by eliminating subjective judgment. Furthermore, irrespective of the knowledge or experience of the operations manager, the response available to the failure sign appearance can be determined.
At this time, the amount of loss as a result of the out-of-service is additionally offered, and therefore, it is possible to determine the responses available at the moment of the failure sign appearance, considering whether or not the amount of loss is permissible. Furthermore, the amount of loss contains the compensation due to the out-of-service, and therefore, it is possible to additionally grasp a risk in the case in which an application server is rented to a service provider company. Furthermore, for each of the responses at the moments of the current failure sign appearance, the next failure sign appearance and the failure occurrence, the short-term cost and the long-term cost are offered, and therefore, it is possible to determine the response available at the moment of the failure sign appearance, considering not only the short-term cost but also the long-term cost. In addition, since the content of the failure which may occur in the future due to the failure sign appearance is additionally offered, it is possible to judge whether or not this failure content is fatal.
Incidentally, in the present embodiment, the failure occurrence probability is calculated at the moment of notification of the failure sign information. However, the failure occurrence probability may be previously calculated at appropriate timing. Furthermore, the present invention is applicable not only to the information system but also to various systems as operations management objectives.
Relation | Number | Date | Country
---|---|---|---
Parent | PCT/JP2008/059728 | May 2008 | US
Child | 12954325 | | US