This technique relates to a technique for managing a computer system.
With the development of cloud computing and the like, computer systems have become larger, and a failure of some devices within the system and/or an operation mistake such as a setting mistake has a broad influence.
Conventionally, as a countermeasure against troubles, a scenario-based test is performed in advance. More specifically, a scenario is created based on past experience, expected utilization, assumed occurrences of troubles and the like, and a test is performed along the scenario. However, because the scenario is created based on the initial assumption, there is a problem that a case that has a large risk and is beyond expectation cannot be covered. In particular, troubles have various kinds of causes, so situations beyond expectation cannot be avoided. Furthermore, the system often falls into a situation beyond expectation when a large-scale trouble occurs. In other words, a latent risk that was not considered at design time becomes an issue when some condition is satisfied by another trouble, and troubles then occur in sequence and become large-scale. On the other hand, for a situation within expectation, it is possible to prepare a countermeasure and settle the trouble before its influence extends.
It is preferable to eliminate situations beyond expectation in order to avoid the aforementioned large-scale trouble; however, it is difficult to assume them manually. Therefore, a method for predicting the range of influence by simulation is often employed. More specifically, by simulating the state of the system step by step while changing the failure pattern, the range of influence of the trouble is predicted for each failure pattern. However, the number of failure patterns for which the simulation should be performed becomes enormous for a large-scale system.
It is assumed that a failure pattern represents which component item within the system breaks and how it breaks, “i” represents the number of component items, and “j” represents the average number of kinds of failures in each component item. Then, the number of failure patterns P, counting the cases in which one portion or two portions break, is represented as follows:
$$P = i \cdot j + {}_{i}C_{2} \cdot j \cdot j$$
For example, it is assumed that a cloud center includes 8 zones, and that each zone includes several hundred physical machines and several thousand virtual machines. In such a case, assuming j=5, there are about 0.2 million patterns for the case in which only one portion breaks, and more than 10 billion patterns for the case in which two portions break. Thus, it is not realistic to simulate all patterns.
Thus, there is no conventional technique for efficiently identifying failure patterns that have a large influence.
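The order of magnitude of these figures can be checked with a short calculation. The following is only a sketch: the text gives the size of each zone as an order of magnitude, so the value of 5,000 component items per zone used here is an assumption chosen for illustration.

```python
from math import comb


def count_failure_patterns(i: int, j: int) -> tuple[int, int]:
    """Return (single-failure patterns, double-failure patterns).

    Their sum is P = i*j + iC2*j*j from the formula above.
    """
    single = i * j
    double = comb(i, 2) * j * j
    return single, double


# Assumption: 8 zones x 5,000 component items per zone (not stated in the text).
ITEMS = 8 * 5000
single, double = count_failure_patterns(i=ITEMS, j=5)
print(f"one portion breaks:  {single:,}")
print(f"two portions break:  {double:,}")
```

Under this assumed size, the single-failure count is 200,000 and the double-failure count is roughly 20 billion, which is consistent with the figures stated in the text.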
An information processing method relating to this technique includes: (A) identifying a component item that satisfies a predetermined condition concerning an indicator value for an influenced range within a system, from among plural component items included in the system, by using data regarding the plural component items and relationships among the plural component items; (B) extracting component items included in a predetermined range from the identified component item, based on the data; and (C) generating one or plural failure patterns, each of which includes one or plural sets of one component item of the extracted component items and a failure type corresponding to the one component item, by using data including, for each component item type, one or plural failure types.
The object and advantages of the embodiment will be realized and attained by means of the elements and combinations particularly pointed out in the claims.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory and are not restrictive of the embodiment, as claimed.
The operation management system 200 is a system that has already been constructed for operation management of a system in which occurrences of the troubles are assumed, and includes a system configuration data storage unit 210 that stores data of component items for the system in which the occurrences of the troubles are assumed.
The system configuration data storage unit 210 stores data of component items within the system, data of connection relationships between component items, and calling relationships between component items. For example, when a switch Switch001 is connected with a server Server001 as illustrated in
The examples in
The information processing apparatus 100 has an aggregation point identifying unit 101, an aggregation point storage unit 102, a failure part candidate extractor 103, a failure part candidate list storage unit 104, a failure pattern generator 105, a failure type list storage unit 106, an exclusion list storage unit 107, a failure pattern list storage unit 108, a simulator 109, a state transition model storage unit 110, a simulation result storage unit 111 and an output processing unit 112.
The aggregation point identifying unit 101 uses data stored in the system configuration data storage unit 210 to identify an aggregation point in the system in which the occurrences of the troubles are assumed, and stores data of the identified aggregation point into the aggregation point storage unit 102. The failure part candidate extractor 103 extracts a failure part candidate from the system configuration data storage unit 210 based on data stored in the aggregation point storage unit 102, and stores extracted results into the failure part candidate list storage unit 104. The failure pattern generator 105 generates a failure pattern by using data stored in the failure part candidate list storage unit 104 and the failure type list storage unit 106, and stores data of the generated failure pattern into the failure pattern list storage unit 108. At this time, the failure pattern generator 105 deletes a failure pattern to be deleted from the failure pattern list storage unit 108 based on data stored in the exclusion list storage unit 107.
The simulator 109 performs, for each failure pattern stored in the failure pattern list storage unit 108, simulation for state transitions of the component items stored in the system configuration data storage unit 210, according to the state transition model stored in the state transition model storage unit 110, while assuming that the failure pattern occurs, and stores the simulation results into the simulation result storage unit 111. The output processing unit 112 generates output data from data stored in the simulation result storage unit 111 in response to a request from the user terminal 300, for example, and outputs the generated output data to the user terminal 300.
For example, the user terminal 300 is a personal computer operated by an operation administrator. The user terminal 300 instructs the aggregation point identifying unit 101 or the like of the information processing apparatus 100 to start the processing, requests the output processing unit 112 to output the processing result, receives the processing result from the output processing unit 112, and displays the processing result on a display apparatus.
Next, processing contents of the information processing apparatus 100 will be explained by using
Firstly, the aggregation point identifying unit 101 performs a processing for identifying an aggregation point (
In this embodiment, an explanation will be made using a system, for example, illustrated in
In this system, the virtual machines ci11 to ci15 are masters, and the virtual machines ci16 to ci20 are their copies. Each of the master virtual machines ci11 to ci15 confirms the existence of its copy, for example, periodically. This is defined in the system configuration data storage unit 210 as a calling relationship (Call) from the virtual machine ci11 to the virtual machine ci16. The same kind of data is defined for the virtual machines ci12 to ci15. Moreover, when the existence of its own copy becomes unknown, each of the master virtual machines ci11 to ci15 sends a request (Call), in other words, a copy generation request, to the manager Mgr in order to generate a new copy of itself. This is defined as a calling relationship from each of the master virtual machines ci11 to ci15 to the manager Mgr.
Firstly, the aggregation point identifying unit 101 identifies one unprocessed Component Item (CI) in the system configuration data storage unit 210 (
In this embodiment, an item type of the identified component item is identified, and the number of subordinate items is calculated according to the item type. The item types of the component items include a router, a switch (core), a switch (edge), a physical machine and a virtual machine. Typically, the physical configuration of the system is as illustrated in
Then, in case of the core switch, the number of subordinate items of the core switch is calculated by a total sum of the number of edge switches just under itself and the number of items under the edge switches just under itself. As illustrated in
Moreover, in case of an edge switch, the number of subordinate items is calculated as the total of the number of physical machines just under the switch and the number of items under those physical machines. The switch ci01 is connected to two physical machines ci05 and ci06, and is determined to be an edge switch. Its number of subordinate items is therefore “12”, the total of the number of physical machines ci05 and ci06, “2”, and the sum of the numbers of items under these physical machines, “10 (=5+5)”. The switch ci03 is connected to two physical machines ci07 and ci08, and is determined to be an edge switch. Its number of subordinate items is “2”, the total of the number of physical machines ci07 and ci08, “2”, and the sum of the numbers of items under these physical machines, “0”. The switch ci04 is connected to the physical machine ci09, and is determined to be an edge switch. Its number of subordinate items is “2”, the total of the number of physical machines, “1”, and the number of items under the physical machine ci09, “1”.
Furthermore, in case of the physical machine, the number of virtual machines just under itself is the number of subordinate items of the physical machine. In case of the physical machines ci05 and ci06, the number of virtual machines just under itself is “5”, so the number of subordinate items is “5”. In case of the physical machines ci07 and ci08, the number of virtual machines just under itself is “0”, so the number of subordinate items is “0”. In case of the physical machine ci09, the number of the virtual machines just under itself is “1”, so the number of subordinate items is “1”. In case of the virtual machines, the number of subordinate items is identified as being “0”.
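The per-type rules above all reduce to one recursion: the number of subordinate items is the number of items just under the component item plus the numbers of subordinate items of those items. The following is a minimal sketch of that calculation. The entries for ci01, ci03, ci04 and the physical machines follow the counts given in the text; the wiring of the core switch ci02, the placement of the individual virtual machines on ci05 and ci06, and the identity of the single virtual machine on ci09 are assumptions made for illustration.

```python
# Connection relationships: item -> items just under it.
CHILDREN = {
    "ci02": ["ci01", "ci03", "ci04"],                   # assumed core switch wiring
    "ci01": ["ci05", "ci06"],                           # edge switch, from the text
    "ci03": ["ci07", "ci08"],                           # edge switch, from the text
    "ci04": ["ci09"],                                   # edge switch, from the text
    "ci05": ["ci11", "ci12", "ci13", "ci14", "ci15"],   # assumed VM placement
    "ci06": ["ci16", "ci17", "ci18", "ci19", "ci20"],   # assumed VM placement
    "ci07": [],
    "ci08": [],
    "ci09": ["ci10"],                                   # assumed: the one VM on ci09
}


def subordinate_count(item: str) -> int:
    """Items just under `item` plus all items under those items."""
    children = CHILDREN.get(item, [])  # virtual machines have no entry, so 0
    return len(children) + sum(subordinate_count(child) for child in children)


for name in ("ci01", "ci03", "ci04", "ci05", "ci07", "ci09", "ci11"):
    print(name, subordinate_count(name))
```

For the items described in the text, this reproduces the stated values: 12 for ci01, 2 for ci03 and ci04, 5 for ci05, 0 for ci07, 1 for ci09 and 0 for a virtual machine.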
Moreover, the aggregation point identifying unit 101 calculates the number of items that directly or indirectly call the identified component item, and stores the calculated number of items into the storage device such as the main memory (step S25). The number of items that directly or indirectly call the identified component item is calculated as a total sum of the number of calling relationships whose target is the identified component item and the number of items that directly or indirectly call the source of that calling relationship. In other words, the source of the calling relationship is reversely traced and the total sum of the numbers of calling relationships until the trace cannot be performed is the number of items that directly or indirectly call the identified component item. In the example of
On the other hand, as another example, it is assumed that, in a system as illustrated in
Then, for example, the calculation results as illustrated in
Then, the aggregation point identifying unit 101 determines whether or not the identified component item satisfies a condition of an aggregation point (also noted as “aggregation P”) (step S27). For example, it is determined whether or not the number of subordinate items is equal to or greater than “16” or the number of items that directly or indirectly call the identified component item is equal to or greater than “6”. Alternatively, whether or not the identified component item is the aggregation point may be determined based on whether or not an evaluation value, which is calculated as a weighted sum of the number of subordinate items and the number of items that directly or indirectly call the identified component item, is equal to or greater than a threshold. In the example of
When the condition of the aggregation point is not satisfied, the processing shifts to step S31. On the other hand, when the condition of the aggregation point is satisfied, the aggregation point identifying unit 101 adds the identified component item to the aggregation point list, and stores its data into the aggregation point storage unit 102 (step S29). The aggregation point storage unit 102 stores data as illustrated in
As described above, the aggregation point is a component item that is associated with many other component items in the system. There are two kinds: a structural aggregation point, which is identified because the component item has many subordinate items as described above, and a behavioral aggregation point, which is identified because the component item is directly or indirectly called from many component items. Aggregation points are used because, when an aggregation point is influenced by a failure, the possibility is high that the influence range expands in a short time, and discovering the failures that influence the aggregation point is therefore important for preparing countermeasures. In particular, a failure that influences the aggregation point at an early stage is a failure of high exigency, and it is very effective to be able to treat such a failure. Therefore, in this embodiment, failures that influence the aggregation point at an early stage are searched for.
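The behavioral indicator and the condition of the step S27 can be sketched as follows. This is an illustrative sketch, not the apparatus itself: the identifier “ci10” standing for the manager Mgr is an assumption, the calling relationships are assumed to contain no cycles, and the weights of the evaluation value are hypothetical. The thresholds “16” and “6” are the example values given in the text.

```python
# Calling relationships as (source, target) pairs from the master/copy example.
# Assumption: the manager Mgr is represented here by the identifier "ci10".
CALLS = [(f"ci{n}", f"ci{n + 5}") for n in range(11, 16)]   # master -> its copy
CALLS += [(f"ci{n}", "ci10") for n in range(11, 16)]        # master -> manager


def caller_count(item, calls=CALLS):
    """Sum of calling relationships found by tracing sources backwards.

    Each relationship targeting `item` counts 1, plus the count for its source.
    Assumes the calling relationships contain no cycles.
    """
    return sum(1 + caller_count(src, calls) for src, dst in calls if dst == item)


def is_aggregation_point(subordinates, callers, sub_threshold=16, call_threshold=6):
    """Per-indicator thresholds; 16 and 6 are the example values in the text."""
    return subordinates >= sub_threshold or callers >= call_threshold


def evaluation_value(subordinates, callers, w_sub=1.0, w_call=1.0):
    """Weighted-sum variant; the weights are hypothetical."""
    return w_sub * subordinates + w_call * callers


print(caller_count("ci16"))   # called only by its master
print(caller_count("ci10"))   # called by the five masters
```

With these example thresholds, a component item that has 12 subordinate items and 5 callers is not an aggregation point, whereas one that has 19 subordinate items is.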
The processing shifts to the step S31, and the aggregation point identifying unit 101 determines whether or not there is an unprocessed component item in the system configuration data storage unit 210 (step S31). When there is an unprocessed component item, the processing returns to the step S21. On the other hand, when there is no unprocessed component item, the processing returns to the calling-source processing.
When this processing is performed, the list of the aggregation points is stored in the aggregation point storage unit 102.
Returning to the explanation of the processing in
On the other hand, when the aggregation point for the behavior is identified based on the number of items that directly or indirectly call the item to be processed in the system as illustrated in
When the aggregation point is extracted after comprehensively evaluating the number of subordinate items and the number of items that directly or indirectly call the item to be processed, or when there is an aggregation point that satisfies both the criterion for the number of subordinate items and the criterion for the number of items that directly or indirectly call the item to be processed, both the component items that are connected within the predetermined number of hops through the connection relationship and the component items that are connected within the predetermined number of hops through the calling relationship are extracted.
After that, the failure part candidate extractor 103 stores the component items detected at the search of the step S43 as the failure part candidate into the failure part candidate list storage unit 104 (step S45). In the example of
Then, the failure part candidate extractor 103 determines whether or not there is an unprocessed aggregation point in the aggregation point storage unit 102 (step S47). When there is an unprocessed aggregation point, the processing returns to the step S41. On the other hand, when there is no unprocessed aggregation point, the processing returns to the calling-source processing.
By performing such a processing, the component items whose failures are highly likely to influence the aggregation point are extracted as the failure part candidates.
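The search within a predetermined number of hops can be sketched as a breadth-first search over the relationship data. The following sketch uses the connection relationships; the links from ci02 to the edge switches are an assumption, and whether the aggregation point itself is treated as a candidate is left as a parameter because the text above does not settle it.

```python
from collections import deque

# Undirected connection relationships. The ci02 links are assumed for illustration.
EDGES = [
    ("ci02", "ci01"), ("ci02", "ci03"), ("ci02", "ci04"),
    ("ci01", "ci05"), ("ci01", "ci06"),
    ("ci03", "ci07"), ("ci03", "ci08"),
    ("ci04", "ci09"),
]


def build_adjacency(edges):
    adjacency = {}
    for a, b in edges:
        adjacency.setdefault(a, []).append(b)
        adjacency.setdefault(b, []).append(a)
    return adjacency


def within_hops(adjacency, origin, max_hops, include_origin=False):
    """Return the items reachable from `origin` in at most `max_hops` hops."""
    distance = {origin: 0}
    queue = deque([origin])
    while queue:
        node = queue.popleft()
        if distance[node] == max_hops:
            continue  # do not expand beyond the predetermined range
        for neighbor in adjacency.get(node, ()):
            if neighbor not in distance:
                distance[neighbor] = distance[node] + 1
                queue.append(neighbor)
    return sorted(n for n in distance if include_origin or n != origin)


ADJACENCY = build_adjacency(EDGES)
print(within_hops(ADJACENCY, "ci02", 1))
print(within_hops(ADJACENCY, "ci02", 2))
```

The same function could be applied to an adjacency built from the calling relationships when the aggregation point is a behavioral one.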
Returning to the explanation of the processing in
Then, the failure pattern generator 105 initializes a counter i to “1” (step S53). After that, the failure pattern generator 105 generates all patterns, each of which includes “i” sets of a failure part candidate and a failure type, and stores the generated patterns into the failure pattern list storage unit 108 (step S55).
When the failure part candidates as illustrated in
Moreover, it may be assumed that the failures occur at plural failure part candidates at once. For example, in case of i=2, a failure pattern including two sets of the aforementioned sets is generated for all combinations of the aforementioned sets. For example, a combination of a set (ci01, failure) and a set (ci03, failure), a combination of the set (ci01, failure) and a set (ci06, Disk failure) and the like are generated.
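The generation of patterns that include “i” sets can be sketched with combinations over the sets of a failure part candidate and a failure type. In this sketch the item types of the candidates and the failure type “NIC failure” are hypothetical; “failure” and “Disk failure” are the types that appear in the examples above.

```python
from itertools import combinations


def candidate_sets(candidates, failure_types):
    """Pair each failure part candidate with every failure type of its item type.

    candidates: {item: item_type}; failure_types: {item_type: [failure type, ...]}
    """
    return [
        (item, ftype)
        for item, item_type in sorted(candidates.items())
        for ftype in failure_types.get(item_type, [])
    ]


def patterns_of_size(sets, i):
    """All failure patterns made of exactly `i` (item, failure type) sets."""
    return list(combinations(sets, i))


# Hypothetical inputs for illustration.
CANDIDATES = {"ci01": "sw", "ci03": "sw", "ci06": "pm"}
FAILURE_TYPES = {"sw": ["failure"], "pm": ["Disk failure", "NIC failure"]}

SETS = candidate_sets(CANDIDATES, FAILURE_TYPES)
for pattern in patterns_of_size(SETS, 2):
    print(pattern)
```

Because the combinations are taken over all sets, a pattern that pairs two failure types of the same component item can also appear; whether such a pattern is kept is a matter for the exclusion list.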
Then, the failure patterns generated at the step S55 are stored in the failure pattern list storage unit 108. Data as illustrated in
After that, the failure pattern generator 105 deletes the failure patterns stored in the exclusion list storage unit 107 from the failure pattern list storage unit 108 (step S57). A failure pattern that does not need to be considered in the case of only one failure, and a combination that does not occur and/or does not need to be considered in the case where failures occur at plural parts, are registered in advance in the exclusion list. This registration may be performed in advance by the operation administrator by using his or her knowledge. Moreover, when a physical machine fails, the virtual machines under the physical machine also fail. Therefore, when a set (pm1, failure) is registered, a rule that a combination of (pm1, failure) and (vm11, failure) is deleted may be registered and applied.
By using a technique described, for example, in Japanese Laid-open Patent Publication No. 2011-145773 (US 2011/0173500 A1), the failure patterns (or rules) to be registered in the exclusion list may be automatically generated from the system configuration data storage unit 210, and may be stored in the exclusion list storage unit 107.
After that, the failure pattern generator 105 determines whether or not “i” exceeds an upper limit value (step S59). The upper limit value is an upper limit of the failures that occur at once, and is preset. Then, when “i” does not exceed the upper limit value, the failure pattern generator 105 increments “i” by 1 (step S61), and the processing returns to the step S55. On the other hand, when “i” exceeds the upper limit value, the processing returns to the calling-source processing.
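The loop of the steps S53 to S61 together with the exclusion list can be sketched as follows. Several points are assumptions of this sketch rather than statements of the text: the loop is taken to cover pattern sizes from 1 through the upper limit, an exclusion entry is taken to remove every pattern that contains it, and excluded patterns are skipped during generation instead of being deleted afterwards, which yields the same list. The item names pm1, vm11 and sw1 are illustrative.

```python
from itertools import combinations


def generate_patterns(sets, upper_limit, exclusions=()):
    """Generate failure patterns of size 1..upper_limit, minus excluded ones.

    sets: list of (item, failure type) tuples.
    exclusions: iterable of collections of (item, failure type) tuples.
    """
    excluded = [frozenset(entry) for entry in exclusions]
    result = []
    for i in range(1, upper_limit + 1):
        for combo in combinations(sets, i):
            pattern = frozenset(combo)
            # Assumption: a pattern is dropped when it contains an excluded combination.
            if any(entry <= pattern for entry in excluded):
                continue
            result.append(pattern)
    return result


# Illustrative data: vm11 runs on pm1, so their combined failure is redundant.
SETS = [("pm1", "failure"), ("vm11", "failure"), ("sw1", "failure")]
EXCLUSIONS = [[("pm1", "failure"), ("vm11", "failure")]]

PATTERNS = generate_patterns(SETS, upper_limit=2, exclusions=EXCLUSIONS)
for pattern in PATTERNS:
    print(sorted(pattern))
```

In this illustration, three single-failure patterns and two of the three possible double-failure patterns remain.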
By performing such a processing, the failure patterns that influence the aggregation point and are to be assumed are generated.
Returning to the explanation of the processing in
The state transition model is stored in advance for each item type in the state transition model storage unit 110. Typically, the state transition model is described in a format as illustrated in
More specifically, an example of the state transition model for the component item that has the item type “sw” and is used in the system illustrated in
Moreover, an example of the state transition model for the component item that has the item type “pm” and is used in the system illustrated in
Moreover, an example of the state transition model in case of a main virtual machine that has the item type “vm” and is used in the system illustrated in
Furthermore, an example of the state transition model in case of the copy virtual machine that has the item type “vm” and is used in the system illustrated in
Moreover, an example of the state transition model for the component item that is used in the system illustrated in
The simulator 109 performs the simulation by using those state transition models. The simulation is performed on the assumption that the specific failure defined in the failure pattern occurs in the specific component item.
For example, as for the system in
In the initial state, as illustrated in
After that, at the third step, as illustrated in
Then, at the fourth step, as illustrated in
After that, at the fifth step, as illustrated in
As described above, it is understood that some trouble occurs in the component items ci10 to ci20 in addition to the component item ci06 that is included in the failure pattern. Here, the number of damaged items, including the component item included in the failure pattern, is counted. In this example, the number of damaged items is “12”.
When the aforementioned processing is performed for each failure pattern, the simulator 109 stores data as illustrated in
A conventional method can be used as the specific processing method of this simulation. Because the simulation method itself is not the main portion of this invention, its detailed explanation is omitted.
Returning to the explanation of the processing in
For example, data as illustrated in
Because the failure patterns whose number of damaged items is large, in other words, the failure patterns whose range of influence is broad, can be identified, it becomes possible to prepare countermeasures against these failure patterns.
In the first embodiment, an example was explained in which the component items included in a fixed range of n hops from the aggregation point are extracted as the failure part candidates. However, “n” cannot always be set appropriately from the start. Moreover, the influence range of a component item that is relatively far from the aggregation point may be broad. Therefore, by performing a processing that will be explained later, the range from which the failure part candidates are extracted is dynamically changed so that proper failure part candidates are extracted. Accordingly, the failure patterns to be treated are appropriately extracted.
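The ranking performed on the simulation results can be sketched as a sort in descending order of the number of damaged items followed by a cut at a predetermined number of entries. In the following sketch, only the first entry (12 damaged items for a disk failure of ci06) comes from the example above; the other entries and the function name are hypothetical.

```python
def top_failure_patterns(results, top_n):
    """Sort (failure pattern, damaged item count) pairs and keep the top entries."""
    ranked = sorted(results, key=lambda entry: entry[1], reverse=True)
    return ranked[:top_n]


RESULTS = [
    ((("ci06", "Disk failure"),), 12),   # value from the simulation example
    ((("ci01", "failure"),), 3),         # hypothetical value
    ((("ci03", "failure"),), 1),         # hypothetical value
]

for pattern, damaged in top_failure_patterns(RESULTS, top_n=2):
    print(damaged, pattern)
```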
For example, a processing as illustrated in
Then, the simulator 109 performs, for each failure pattern stored in the failure pattern list storage unit 108, the simulation of the state transition of each component item, which is stored in the system configuration data storage unit 210, according to the state transition model stored in the state transition model storage unit 110, while assuming that the failure of the failure pattern occurs (step S209). The processing contents of this step are similar to those of the step S7; therefore, the detailed explanation is omitted.
After that, the output processing unit 112 sorts the failure patterns in descending order of the number of damaged items, which is included in the simulation result (step S211). This step is also similar to the step S9, therefore, the further explanation is omitted. Then, the output processing unit 112 identifies the maximum number of damaged items and the corresponding failure pattern at that time, and stores the identified data, for example, into the simulation result storage unit 111 (step S213).
Furthermore, the output processing unit 112 determines whether or not n has reached the preset maximum value or the fluctuation has converged (step S215). As for the convergence of the fluctuation, it is determined whether or not a condition is satisfied, such as a condition that the maximum number of damaged items does not change two times in succession.
When n has not reached the maximum value and the fluctuation has not converged, the output processing unit 112 increments n by 1 (step S217). Then, the processing returns to the step S205.
As schematically illustrated in
On the other hand, when n reached the maximum value or the fluctuation converged, the output processing unit 112 generates data representing the change of the maximum number of damaged items, and outputs the generated data to the user terminal 300, for example (step S219). The user terminal 300 displays data as illustrated in
By carrying out such a processing, it is possible to estimate how broad a range from the aggregation point the user should consider. Furthermore, similarly to the first embodiment, it is possible to identify the failure pattern to which attention should be paid, and therefore it is also possible to prepare a countermeasure for it.
As described above, by limiting the failure patterns to those that are highly likely to have a large influence range, it is possible to efficiently grasp the failure patterns that have a high risk. In particular, the method in the embodiments is very effective even when there are many component items, because the number of failure patterns does not depend on the total number of component items and is determined by the number of items included in a predetermined range from the aggregation point.
Furthermore, although an example was explained above in which the operation administrator uses this information processing apparatus, when the aforementioned processing is performed, for example, at system design time, it is possible to design a system that does not cause a large-scale trouble. Furthermore, as described above, when the operation administrator uses this information processing apparatus, it is possible to anticipate the occurrence of a large-scale trouble, prepare countermeasures, and take action to prevent the trouble in advance. Moreover, when the aforementioned processing is performed before a system change, it becomes possible to take action to avoid a change that may cause a large-scale trouble.
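The repetition over n in this embodiment can be sketched as a loop that records the maximum number of damaged items for each n and stops at the preset maximum or on convergence. This is a sketch under stated assumptions: the callable `max_damaged_for`, which stands for the extraction, generation and simulation performed for a given n, is hypothetical; the starting value of n is assumed to be 1; and "does not change two times in succession" is read here as three consecutive equal maxima.

```python
def explore_range(max_damaged_for, n_max, start=1):
    """Increase n until n_max is reached or the maximum damage stops changing.

    Returns the history as a list of (n, maximum number of damaged items).
    """
    history = []
    n = start
    while True:
        history.append((n, max_damaged_for(n)))
        values = [value for _, value in history]
        # Assumed reading of convergence: unchanged over two successive steps.
        converged = len(values) >= 3 and values[-1] == values[-2] == values[-3]
        if n >= n_max or converged:
            return history
        n += 1


# Stub standing in for the per-n processing; the numbers are illustrative only.
STUB = {1: 3, 2: 12, 3: 12, 4: 12, 5: 20}
HISTORY = explore_range(lambda n: STUB[n], n_max=5)
print(HISTORY)
```

The stub also shows the limitation of such a convergence test: the loop stops at n=4 and never observes the larger value at n=5, so the preset maximum and the convergence condition should be chosen with that trade-off in mind.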
Although the embodiments of this technique were explained, this technique is not limited to the embodiments. For example, the aforementioned functional block diagram is a mere example, and may not correspond to any actual program module configuration. The data storage mode is also a mere example, and may not always correspond to an actual file configuration.
Furthermore, as for the processing flows, as long as the processing results do not change, the order of the processing steps may be changed and the steps may be executed in parallel.
Furthermore, although an example was described in which the operation management system 200 and the information processing apparatus 100 are different apparatuses, they may be integrated. Moreover, the information processing apparatus 100 may be implemented by plural computers. For example, the simulator 109 may be implemented on another computer.
Furthermore, the number of failures that occur at once may be changed.
In addition, the aforementioned information processing apparatus 100 and operation management system 200 are computer devices as illustrated in
The aforementioned embodiments are outlined as follows:
An information processing method relating to the embodiments includes: (A) identifying a component item that satisfies a predetermined condition concerning an indicator value for an influenced range within a system, from among plural component items included in the system, by using data regarding the plural component items and relationships among the plural component items, wherein the data is stored in a first data storage unit; (B) extracting component items included in a predetermined range from the identified component item, based on the data stored in the first data storage unit; and (C) generating one or plural failure patterns, each of which includes one or plural sets of one component item of the extracted component items and a failure type corresponding to the one component item, by using data that includes, for each component item type, one or plural failure types and is stored in a second data storage unit, and storing the one or plural failure patterns into a third data storage unit.
Thus, failure patterns are not generated for all component items within the system; instead, by limiting the component items from which failure patterns are generated as described above, it becomes possible to efficiently identify failure patterns that have a large influence. When any trouble occurs in a component item on which communication may be concentrated within the system and/or a component item on which messages may be concentrated, a large-scale influence is exerted on the entire system. Therefore, attention is paid not only to a component item that influences a broad range, but also to the component items whose failures and troubles influence that component item. Thus, it is possible to generate failure pattern candidates that impact the entire system by influencing the component item that influences a broad range, even if the influence range of the failed component item itself is small.
The aforementioned information processing method may further include: (D) performing simulation for a state of the system for each of the one or plural failure patterns, which are stored in the third data storage unit, to identify, for each of the one or plural failure patterns, the number of component items that are influenced by a failure defined in the failure pattern. By performing the simulation as described above, it is possible to further narrow down the failure patterns.
Moreover, the aforementioned information processing method may further include: (E) sorting the one or plural failure patterns in descending order of the identified number of component items; and outputting the top predetermined number of failure patterns among the one or plural failure patterns. Thus, it becomes possible for the user to easily identify the failure patterns for which action should be taken.
Furthermore, the aforementioned information processing method may further include: repeating the extracting, the generating and the performing by changing the predetermined range; and generating data that represents a relationship between the predetermined range and a maximum value of the numbers of component items, which are identified in the performing. Thus, it becomes possible to determine how to set the predetermined range. In other words, it becomes possible to understand how broad a range of component items, among those that influence the component item influencing a broad range, the user should consider.
Furthermore, the aforementioned relationships among the plural component items may include connection relationships among the plural component items and calling relationships among the plural component items. In such a case, the aforementioned identifying may include: calculating, for each of the plural component items, the number of subordinate items of the component item based on the connection relationships; calculating, for each of the plural component items, the number of items that directly or indirectly call the component item based on the calling relationships; and identifying a component item that satisfies the predetermined condition based on the number of subordinate items and the number of items, which are calculated for each of the plural component items. A threshold may be set for each of the number of subordinate items and the number of items that directly or indirectly call the component item, or an evaluation function may be prepared to comprehensively evaluate the component item.
Incidentally, it is possible to create a program causing a computer to execute the aforementioned processing, and such a program is stored in a computer readable storage medium or storage device such as a flexible disk, a CD-ROM, a DVD-ROM, a magneto-optic disk, a semiconductor memory, or a hard disk. In addition, the intermediate processing result is temporarily stored in a storage device such as a main memory.
All examples and conditional language recited herein are intended for pedagogical purposes to aid the reader in understanding the invention and the concepts contributed by the inventor to furthering the art, and are to be construed as being without limitation to such specifically recited examples and conditions, nor does the organization of such examples in the specification relate to a showing of the superiority and inferiority of the invention. Although the embodiments of the present inventions have been described in detail, it should be understood that the various changes, substitutions, and alterations could be made hereto without departing from the spirit and scope of the invention.
This application is a continuing application, filed under 35 U.S.C. section 111(a), of International Application PCT/JP2012/051796, filed on Jan. 27, 2012, the entire contents of which are incorporated herein by reference.
Relation | Number | Date | Country |
---|---|---|---|
Parent | PCT/JP2012/051796 | Jan 2012 | US |
Child | 14325068 | | US |