This application is based upon and claims the benefit of priority of the prior Japanese Patent Application No. 2008-084594, filed on Mar. 27, 2008, the entire contents of which are incorporated herein by reference.
The embodiments discussed herein are related to an apparatus and method for coping with a fault in an information technology system, and a program therefor.
Problems in information technology (hereinafter IT) systems can be solved on the basis of information such as the phenomenon (symptom) of the problem.
In such a problem solving process, if there are multiple coping method work candidates, it is difficult to determine which work candidate to begin with, that is, the priority order of the options. For example, it is necessary to choose which candidate to try first among coping method work candidates such as “reboot” and “restore from backup”. However, the options are prioritized on the basis of the experience of a maintenance/administration person (hereinafter simply referred to as an administrator) of an IT system, so whether the optimal selection is made depends on the administrator's experience and skill.
The following is disclosed in Japanese Laid-open Patent Publication No. 2000-076071. The disclosed system is provided with a case aligning means 20 that hierarchically classifies past cases through a case preparing part, an editing part, a classification preparing part and an editing part, and adds attributes such as explanatory descriptions to the respective classifications. A question display part receives the contents declared by a user, retrieves them from a case database through a retrieval part, collates them with the attributes of the respective classifications and with the cases, finds the similarity of the respective classifications and cases, and, based on the similarity, displays a question asking the user which classification the declared case belongs to. Cases are specified by successively applying the display of questions to the subordinate classifications of the classification answered by the user.
According to an aspect of an embodiment, a trouble coping apparatus includes an incident registration section which registers information about an incident which has solved a problem, a solution knowledge generation section which generates trouble solution knowledge from the incident information, a risk registration section which registers risk items which are materials for judging appropriateness of selection of a work candidate, with the trouble solution knowledge, a risk evaluation section which generates navigation information showing a trouble solution procedure from the trouble solution knowledge, and a solution procedure display section which displays the navigation information.
The object and advantages of the invention will be realized and attained by means of the elements and combinations particularly pointed out in the claims. It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory and are not restrictive of the invention, as claimed.
The coping apparatus 1 is installed within the IT system or installed outside the IT system without being communication-connected to the IT system. The coping apparatus 1 has an incident registration section 11, a solution knowledge generation section 12, a risk registration section 13, a solution procedure display section 14, a risk evaluation section 15, a first storage section 16 and a second storage section 17.
The incident registration section 11 registers an incident which has solved a trouble in the IT system, and stores the registered incident information into the first storage section 16 of the auxiliary storage. Here, an incident means a history of trouble solution knowledge in the past applied to troubles in the IT system.
The solution knowledge generation section 12 generates trouble solution knowledge (hereinafter simply referred to as solution knowledge) from the incident stored and registered with the first storage section 16 by the incident registration section 11.
The risk registration section 13 registers risk items, which are judgment criteria for appropriateness of a work candidate to be selected, with the solution knowledge generated by the solution knowledge generation section 12, and stores the registered solution knowledge in the second storage section 17 of the auxiliary storage. Here, a risk is material for judging the appropriateness of selection of a work candidate. If selection of a work candidate fails, a user of the IT system suffers disadvantages such as that a lot of time is required for solution of a trouble. Therefore, such material is called a risk. Risk items include time required, cost, probability and occurrence probability. A risk value indicates the degree of risk. If a risk item is time required, the risk value indicates time required, and the degree is indicated by whether it is long or short. If a risk item is cost, the risk value indicates cost, and the degree is indicated by whether high or low.
The risk evaluation section 15 generates navigation information which shows a trouble solution procedure from the solution knowledge.
The solution procedure display section 14 displays the navigation information generated by the risk evaluation section 15 on a display not shown.
The flowchart shown in
At step S1, work candidates for the next work and risk values are read from second storage section 17.
At step S2, a work candidate selection screen for displaying the work candidates and the risk values is generated.
At step S3, the selection screen generated by execution of step S2 is displayed on the display.
At step S4, input of selection of a work candidate is received from an IT system administrator who has looked at the selection screen displayed at step S3.
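Steps S1 to S4 above can be sketched as follows. This is only a minimal illustration: the in-memory list standing in for the second storage section 17, the class and function names, the text layout of the screen and the risk values used are all assumptions and are not taken from the embodiment.

```python
from dataclasses import dataclass


@dataclass
class WorkCandidate:
    name: str            # e.g. "extraction of syslog"
    time_required: int   # risk value in minutes (illustrative unit)


def build_selection_screen(candidates):
    """Step S2: format the work candidates and risk values as numbered rows."""
    return [f"{i}. {c.name} ({c.time_required} min)"
            for i, c in enumerate(candidates, start=1)]


def receive_selection(candidates, choice):
    """Step S4: map the administrator's 1-based input to a work candidate."""
    if not 1 <= choice <= len(candidates):
        raise ValueError("no such work candidate")
    return candidates[choice - 1]


# Step S1: stand-in for reading from the second storage section 17.
stored = [WorkCandidate("extraction of syslog", 10),
          WorkCandidate("reboot of process", 5)]
screen = build_selection_screen(stored)   # step S2
print("\n".join(screen))                  # step S3
selected = receive_selection(stored, 2)   # step S4 (input "2" is hypothetical)
```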
Specific examples of the extracted data include “syslog” and “core” as described in an incident shown at the upper part of
“Core” is a file created by recording the state of a program when the running program ends abnormally. The execution history of the program is stored in this file. The data in the created file is read by core analysis software and displayed by a viewer. The core is examined to locate an evaluation position in the executed program.
At step S1, trouble solution knowledge is read from the second storage section 17.
At step S2, risk values inputted by the administrator for the items of the trouble solution knowledge, that is, for the extracted data and measures for solving a trouble are received.
At step S3, the risk values received at step S2 are written in the trouble solution knowledge read at step S1.
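The registration of risk values in steps S1 to S3 might look like the sketch below. The dictionary layout of a trouble solution knowledge record, the key names and the function name are hypothetical, and the cost value is an illustrative number only.

```python
def register_risk_values(knowledge, risk_values):
    """Step S3: write received risk values into one solution-knowledge record."""
    allowed = {"time_required", "cost", "probability", "occurrence_probability"}
    unknown = set(risk_values) - allowed
    if unknown:
        raise ValueError(f"unknown risk items: {sorted(unknown)}")
    updated = dict(knowledge)  # leave the record that was read untouched
    updated["risk"] = {**knowledge.get("risk", {}), **risk_values}
    return updated


# Step S1: a record as it might be read from the second storage section 17.
knowledge = {"id": "Sym-0001-2",
             "conditions": {"syslog": "erroneous deletion of necessary file"},
             "measures": "restore from backup"}
# Step S2: risk values entered by the administrator (illustrative numbers).
entered = {"time_required": 240, "cost": 3}
registered = register_risk_values(knowledge, entered)
```

Returning a copy rather than modifying the record in place is a design choice of this sketch; it keeps the value read at step S1 available for comparison.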
The flowchart shown in
At step S1, all the trouble solution knowledge is read from the second storage section 17.
At step S2, a work candidate tree display screen for displaying work candidates and risk values is generated.
At step S3, the tree display screen generated by execution of step S2 is displayed on the display.
At step S4, the flow returns to the work candidate selection screen shown in
At step S1, a notification to the effect that a trouble has been solved is received from the administrator.
At step S2, the incident shown at the upper part of
At step S3, the incident is divided into groups of extracted data and measures. Specifically, for example, the incident of the incident ID “INC007-0723-0314” is divided into groups as shown below.
Group 1: “phenomenon”, “layer”, “product”, “extracted data [1]”, “extracted data [1] acquisition start date” and “extracted data [1] acquisition end date”
Group 2: “evaluation position in extracted data [1]”, “measure”, “measure start date”, “measure end date” and “measure completion date”
At step S4, “conditions” in trouble solution knowledge are created on the basis of the “phenomenon”, “layer”, “product” and “evaluation position in extracted data [1]” of the incident, and “extracted data” and “measures” in the trouble solution knowledge are created on the basis of the “extracted data” and “measures” of the incident. In this case, an entry is created for each of the groups divided at step S3. Specifically, with reference to the incident ID “INC007-0723-0314”, a trouble solution knowledge ID “Sym-0001-1” is created, “phenomenon=hungup”, “layer=application” and “product=Interstage” are written under “conditions”, and “syslog” is written under “extracted data”. Similarly, a trouble solution knowledge ID “Sym-0001-2” is created, “syslog=erroneous deletion of necessary file” is written under “conditions”, and “restore from backup” is written under “measures”.
At step S5, time required is calculated from the incident shown at the upper part of
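The division into groups at step S3 and the creation of trouble solution knowledge at step S4 can be illustrated roughly as follows. The dictionary representation of an incident, the function name and the way the base ID “Sym-0001” is supplied are assumptions; the embodiment does not state how knowledge IDs are allocated.

```python
def generate_solution_knowledge(incident, base_id):
    """Steps S3 and S4: split one incident into two groups and build knowledge."""
    data_name = incident["extracted data [1]"]
    # Group 1: symptom attributes lead to the data that should be extracted.
    first = {
        "id": f"{base_id}-1",
        "conditions": {key: incident[key]
                       for key in ("phenomenon", "layer", "product")},
        "extracted data": data_name,
    }
    # Group 2: what was found in that data leads to the measure that was taken.
    second = {
        "id": f"{base_id}-2",
        "conditions": {
            data_name: incident["evaluation position in extracted data [1]"]},
        "measures": incident["measure"],
    }
    return [first, second]


incident = {
    "incident ID": "INC007-0723-0314",
    "phenomenon": "hungup",
    "layer": "application",
    "product": "Interstage",
    "extracted data [1]": "syslog",
    "evaluation position in extracted data [1]":
        "erroneous deletion of necessary file",
    "measure": "restore from backup",
}
knowledge = generate_solution_knowledge(incident, "Sym-0001")
for record in knowledge:
    print(record)
```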
The risk registration section 13 sets time required, cost, occurrence probability or probability as risk items in the solution knowledge, and stores the solution knowledge into the second storage section 17 of the auxiliary storage.
The risk registration section 13 may also set any multiple risk items, among time required, cost, probability and occurrence probability, as risk items in the solution knowledge and store the solution knowledge into the second storage section 17 of the auxiliary storage.
At step S5 in the flowchart illustrated in
As illustrated in
As illustrated at the upper part of
As illustrated at the lower left and the lower right of
Looking at this information, the IT system administrator can select the measure “access and confirm operation” if he wants to make the time required for trouble solution the shortest.
First method: Only the time required for the next work is read from the second storage section 17 and displayed.
Extraction of syslog: 10 min
Reboot of process: 5 min
Second method: The longest time (Max) and the shortest time (Min) required for work required until a trouble is solved are displayed.
Extraction of syslog: Max 250 min; Min 30 min
Reboot of process: Max 35 min; Min 15 min
Third method: Average time required for work required until a trouble is solved is displayed.
Extraction of syslog:
10+20*(0.2/0.8)+240*(0.6/0.8)=195 min formula (1)
Reboot of process:
5+30*(0.2/0.2)+10*(0/0.2)=35 min formula (2)
In formula (1), the first term is the 10 minutes required for the work candidate “extraction of syslog”. The second term is the 20 minutes required for the work candidate “reboot” multiplied by 0.2/0.8, which is the ratio of the occurrence probability of the work candidate “reboot” to that of the work candidate “extraction of syslog”. The third term is the 240 minutes required for the work candidate “restore from backup” multiplied by 0.6/0.8, which is the ratio of the occurrence probability of the work candidate “restore from backup” to that of the work candidate “extraction of syslog”. Formula (2) is calculated in the same manner as formula (1).
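The three display methods can be compared with a short sketch. It assumes a work candidate with a single level of follow-up work, represented as a dictionary; the field names, the function names and the labels of the two follow-up works of “reboot of process” are hypothetical, while the times and occurrence probabilities are those appearing in the examples above.

```python
def next_work_time(candidate):
    """First method: time required for the next work only."""
    return candidate["time"]


def max_min_time(candidate):
    """Second method: longest and shortest total time until the trouble is solved."""
    totals = [candidate["time"] + f["time"] for f in candidate["followups"]]
    return max(totals), min(totals)


def average_time(candidate):
    """Third method: own time plus follow-up times weighted by probability ratio."""
    p = candidate["occurrence_probability"]
    return candidate["time"] + sum(
        f["time"] * (f["occurrence_probability"] / p)
        for f in candidate["followups"])


syslog = {
    "name": "extraction of syslog", "time": 10, "occurrence_probability": 0.8,
    "followups": [
        {"name": "reboot", "time": 20, "occurrence_probability": 0.2},
        {"name": "restore from backup", "time": 240,
         "occurrence_probability": 0.6},
    ],
}
reboot = {
    "name": "reboot of process", "time": 5, "occurrence_probability": 0.2,
    "followups": [  # follow-up names below are placeholders
        {"name": "follow-up 1", "time": 30, "occurrence_probability": 0.2},
        {"name": "follow-up 2", "time": 10, "occurrence_probability": 0.0},
    ],
}

for c in (syslog, reboot):
    longest, shortest = max_min_time(c)
    print(c["name"], next_work_time(c), longest, shortest, average_time(c))
```

With these inputs the second method reproduces the Max and Min values listed above for both work candidates.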
The probability will be described. Probability is defined as the probability that an extraction of data or execution of a measure performed in the past led to successful solution. The probability is calculated by dividing the number of successes by the number of executions. For example, referring to the execution history, A was executed under all the five incident numbers 1 to 5. Referring to the shortest success routes, A was effective under the four incident numbers 1, 2, 3 and 4. Therefore, the probability is calculated by 4÷5=0.8. The results of the probabilities calculated similarly are illustrated in
Next, the shortest success route will be described. Any extraction of data, measure, or result of a measure in the solution process without which the solution could still have been reached is excluded from the shortest success route. That is, A under the incident number 5, for example, is excluded from the shortest success route. The shortest success route is obtained by selecting the shortest route that reaches the last item in the trouble solution knowledge execution history.
Occurrence probability will be described. Occurrence probability is defined to be the frequency that solution was successful by executing the measure concerned among all the flows. The occurrence probability is calculated by the following formula (3).
Occurrence probability=the number of successes÷the number of all incidents formula (3)
For example, referring to the shortest success routes, A was effective under the four incident numbers 1, 2, 3 and 4. Therefore, by dividing the number 4 by 5, the number of all the incidents, the occurrence probability is calculated by 4÷5=0.8. The results of the occurrence probabilities calculated similarly are illustrated in
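The two quantities can be computed from an execution history and the corresponding shortest success routes, as in the sketch below. The history data is hypothetical except for item A, which follows the example in the text (executed under incident numbers 1 to 5, effective under 1 to 4); the items B, C and D, the dictionary layout and the function names are assumptions.

```python
def probability(item, executed, shortest):
    """Number of successes divided by the number of executions of `item`."""
    executions = sum(item in steps for steps in executed.values())
    successes = sum(item in steps for steps in shortest.values())
    return successes / executions if executions else 0.0


def occurrence_probability(item, shortest):
    """Formula (3): number of successes divided by the number of all incidents."""
    successes = sum(item in steps for steps in shortest.values())
    return successes / len(shortest)


# Incident number -> items executed while solving that incident.
executed = {1: ["A", "B"], 2: ["A", "B"], 3: ["A", "C"],
            4: ["A", "C"], 5: ["A", "D"]}
# Incident number -> items on the shortest success route (A dropped for 5).
shortest = {1: ["A", "B"], 2: ["A", "B"], 3: ["A", "C"],
            4: ["A", "C"], 5: ["D"]}

print(probability("A", executed, shortest))    # 4 successes / 5 executions
print(occurrence_probability("A", shortest))   # 4 successes / 5 incidents
```

The two values coincide for A only because A was executed in every incident; for an item executed in fewer incidents, such as the hypothetical B, the denominators differ and so do the results.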
The coping apparatus 2 has an incident registration section 11, a solution knowledge generation section 12, a risk registration section 13, a solution procedure display section 14, a risk evaluation section 15, a first storage section 16, a second storage section 17, a policy registration section 18 and a third storage section 19. Since the incident registration section 11, the solution knowledge generation section 12, the risk registration section 13, the solution procedure display section 14, the risk evaluation section 15, the first storage section 16 and the second storage section 17 are similar to those of the trouble coping apparatus of the first embodiment shown in
The policy registration section 18 registers which risk items among multiple risk items are to be adopted, and the priority rankings of the adopted risk items in order to determine a solution procedure. Furthermore, the policy registration section 18 weights the registered risk items among the multiple risk items.
In the third storage section 19, there is stored policy information about which risk item is to be adopted as an evaluation policy among the risk items of time required, success probability (probability), occurrence probability (frequency of taking the measure) and cost.
The flowchart shown in
At step S1, by referring to the policy information, it is selected whether success probability or time required is to be prioritized as an evaluation policy. Specifically, an administrator determines that, for example, time required is to be prioritized as an evaluation policy, on the display illustrated in
At step S2, the priority degrees of the multiple risk items determined as evaluation policies are calculated. Here, it is assumed that time required is selected as a risk item. As illustrated in
Here, a method for normalizing multiple risk items and calculating priorities from them will be described.
Weights to be given to cost, time required, success probability (probability) and occurrence probability (frequency of taking the measure) as risk items are inputted as evaluation policies in advance. The weights are assumed to be α, β, γ and δ, respectively.
The average μ and the standard deviation σ of each of the risk items (cost, time required, success probability (probability) and occurrence probability (frequency of taking the measure)) relative to all the trouble solution knowledge are calculated. They are denoted by μc, σc, μt, σt, μo, σo, μp and σp.
A normalization function is assumed to be: f(x)=(x−μ)/σ.
The normalization function of the average μc and the standard deviation σc is assumed to be fc.
The normalization function of the average μt and the standard deviation σt is assumed to be ft.
The normalization function of the average μo and the standard deviation σo is assumed to be fo.
The normalization function of the average μp and the standard deviation σp is assumed to be fp.
The cost, time required, probability (success probability) and occurrence probability (frequency of taking the measure) of a symptom concerned are denoted by c, t, o and p, respectively.
In this case, a higher priority is better. The priority is calculated by the following formula (4).
Priority = −α·fc(c) − β·ft(t) + γ·fo(o) + δ·fp(p) formula (4)
At step S2, the policy registration section 18 can set weights α, β, γ and δ for the registered risk items among the multiple risk items.
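The priority calculation of formula (4) may be sketched as below. The use of the population standard deviation, the handling of a zero standard deviation, the record layout, the function names and the sample risk values are assumptions made for illustration; the embodiment specifies only the normalization f(x)=(x−μ)/σ and the weighted sum.

```python
from statistics import mean, pstdev


def make_normalizer(values):
    """Return f(x) = (x - mu) / sigma for the given population of risk values."""
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        return lambda x: 0.0  # all values equal: the item cannot discriminate
    return lambda x: (x - mu) / sigma


def priorities(knowledge, alpha, beta, gamma, delta):
    """Formula (4) for every record; a higher value means a higher priority."""
    fc = make_normalizer([k["cost"] for k in knowledge])
    ft = make_normalizer([k["time"] for k in knowledge])
    fo = make_normalizer([k["probability"] for k in knowledge])
    fp = make_normalizer([k["occurrence_probability"] for k in knowledge])
    return {
        k["name"]: (-alpha * fc(k["cost"])
                    - beta * ft(k["time"])
                    + gamma * fo(k["probability"])
                    + delta * fp(k["occurrence_probability"]))
        for k in knowledge
    }


records = [  # cost and probability values are illustrative only
    {"name": "extraction of syslog", "cost": 1, "time": 10,
     "probability": 0.8, "occurrence_probability": 0.8},
    {"name": "reboot of process", "cost": 1, "time": 5,
     "probability": 0.2, "occurrence_probability": 0.2},
]

# Evaluation policy that prioritizes time required only (beta = 1).
result = priorities(records, alpha=0, beta=1, gamma=0, delta=0)
ranking = sorted(result, key=result.get, reverse=True)
print(ranking)
```

Cost and time required enter with a negative sign because a smaller value is preferable, whereas the two probabilities enter with a positive sign.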
At step S3, priority rankings and a prioritization flow are displayed to the administrator on the display. Specifically, the visualized flow shown in
At step S4, work candidates determined by the administrator are received. Specifically, the administrator determines “extraction of core” and “extraction of syslog” shown in
According to the embodiments described above, trouble solution knowledge for identifying the cause of a trouble is generated from the phenomenon and the like of the trouble in an IT system, materials for judging the appropriateness of selection of a work candidate, such as time required, cost, occurrence probability and probability, are registered with the trouble solution knowledge, and navigation information is generated from this trouble solution knowledge. Therefore, it is possible to judge the appropriateness of a measure for a trouble and solve the trouble quickly.
Furthermore, when a work candidate is selected from multiple work candidates for solving a problem in an IT system, risk items for preventing disadvantages from being caused by failure in selection of a work candidate, and the priority rankings of the selected risk items, are presented by the policy registration section. Thereby, even an administrator with little experience can appropriately judge selection of a work candidate, solve a trouble in an appropriate order, and shorten the time required for solving the trouble.
Furthermore, by visualizing the flow, it is possible to make more appropriate work candidate selection judgment.
Furthermore, by the policy registration section registering the priority rankings of the risk items and setting weights for the risk items, it is possible to select an evaluation policy and make more appropriate work candidate selection judgment.
A program implementing the embodiments may be recorded on computer-readable media comprising computer-readable recording media. Examples of the computer-readable recording media include a magnetic recording apparatus, an optical disk, a magneto-optical disk, and/or a semiconductor memory (for example, RAM, ROM, etc.). Examples of the magnetic recording apparatus include a hard disk device (HDD), a flexible disk (FD) and a magnetic tape (MT). Examples of the optical disk include a DVD (Digital Versatile Disc), a DVD-RAM, a CD-ROM (Compact Disc-Read Only Memory), and a CD-R (Recordable)/RW.
All examples and conditional language recited herein are intended for pedagogical purposes to aid the reader in understanding the invention and the concepts contributed by the inventor to furthering the art, and are to be construed as being without limitation to such specifically recited examples and conditions, nor does the organization of such examples in the specification relate to a showing of the superiority and inferiority of the invention. Although the embodiment(s) of the present inventions have been described in detail, it should be understood that the various changes, substitutions, and alterations could be made hereto without departing from the spirit and scope of the invention.
Number | Date | Country | Kind |
---|---|---|---|
2008-084594 | Mar 2008 | JP | national |