The present invention relates to a fault recovery support apparatus, a fault recovery support method, and a program.
With increases in the scale and complexity of information and communication technology (ICT) systems, the number and types of failures (faults) arising in the systems have increased. In order to perform handling such as recovery from failures, monitoring, analysis, and the like of observation data (for example, system logs, metrics, and the like) are necessary. However, the amount of data and the relevancy between pieces of data become complicated, and failure recovery tasks become very difficult. Accordingly, in recent years, fault recovery operations using machine learning technology have been developed.
For example, NPL 1 discloses a scheme of executing commands for recovery of a system by trial and error in a verification environment into which a failure is inserted, and learning, by deep reinforcement learning, a recovery policy function that calculates a command to be executed subsequently based on observation data obtained as feedback. For example, NPL 2 discloses a scheme of calculating an optimum recovery method by formulating a fault recovery process in accordance with a probability model such as a Markov decision process (MDP) or a partially observable Markov decision process (POMDP) and then using a scheme such as Bayesian estimation. Both of the schemes described in NPL 1 and NPL 2 can be said to be schemes aimed at automation of a fault recovery operation.
On the other hand, there is a technology for presenting to a maintenance person what behaviors to perform to recover a system. For example, NPL 3 discloses a technology for visualizing, in a workflow format, which behavior to perform next by a machine learning technology such as a hidden Markov model, using trouble tickets from past failure handling. The technology described in NPL 3 can be said to be aimed at reducing dependence on individual expertise, at standardization, and at an improvement in efficiency of a fault recovery operation by expressing the fault recovery process in a form easy for a maintenance person to understand.
However, in the scheme disclosed in the foregoing NPL 1, since all learning and execution of a recovery policy function are automatically performed in a black box, a maintenance person cannot ascertain a state and a behavior of a system during a fault recovery. Therefore, for example, when an irregular event occurs during fault recovery and the maintenance person needs to take over and handle the event, it is difficult to handle the failure thereafter in some cases.
On the other hand, in the scheme disclosed in the foregoing NPL 2, since a fault recovery process is expressed with a probability model, a state and a behavior of a system can be ascertained to some degree. However, it is necessary for a maintenance person or the like to construct the probability model in advance. Therefore, in construction of the probability model, operation cost for the maintenance person is increased, and high-level knowledge about the system is also required. As the system increases in size and becomes complicated, the operation cost and knowledge required for constructing the probability model also increase.
According to the technology described in the foregoing NPL 3, a state, a behavior, and the like of a system can be read from a workflow because the fault recovery process is visualized as the workflow. However, trouble tickets from past failure handling are necessary for generation of the workflow. Therefore, in some cases, a workflow cannot be generated, for example, for a system with little operational history or for a failure whose occurrence frequency is low.
An embodiment of the present invention has been contrived in view of the foregoing circumstances, and an objective of the embodiment is to support fault recovery in a target system.
In order to achieve the foregoing object, a fault recovery support apparatus according to an embodiment includes: a fault insertion unit configured to insert a fault into a target system; a behavior execution unit configured to execute a behavior related to recovery from the fault for the target system; a first construction unit configured to construct an automaton representing a recovery process from the fault by using observation data acquired from the target system as a result of the behavior by the behavior execution unit; and a second construction unit configured to construct a workflow representing a behavior for separating each fault included in a plurality of faults and a recovery process of each fault by using a plurality of the automatons related to the plurality of faults.
It is possible to support fault recovery in a target system.
Hereinafter, one embodiment of the present invention will be described. In the present embodiment, a failure recovery support device 10 (fault recovery support apparatus) capable of presenting a failure (fault) recovery process in a form that a maintenance person can understand without manually constructing a probability model or the like expressing the failure recovery process and without requiring presence of a trouble ticket or the like from handling of a past failure (fault) will be described.
Hereinafter, a theoretical configuration of the present embodiment will be described.
The present embodiment is broadly divided into two phases: (1) an observation-labeled deterministic finite automaton (OLDFA) construction phase and (2) a recovery process workflow (RPW) construction phase. First, the problem definition and the input/output of each phase will be described. Then, an algorithm for implementing each phase will be described. Note that the present embodiment is not limited to these problem settings, and can be applied to similar problem settings in which construction of an automaton is required.
OLDFA Construction Phase
In this phase, observation data is acquired by inserting a failure (fault) to be recovered from into the target system and then executing arbitrary behaviors. The objective of this phase is to construct an automaton for each failure by analyzing the combinations of behaviors and observation data. The target system is, for example, a verification environment for the system which is the actual support target of failure recovery, an emulator simulating that system, or the like. A behavior is an action by which the state of the target system can be changed and is, for example, an input of an arbitrary command.
In the present embodiment, it is assumed that a recovery process for the target system is sparse, discrete, and deterministic, and an automaton expressing a behavior of the target system is constructed by using these properties. The fact that the recovery process of the target system is sparse means that only a small number of kinds of behaviors change the state of the target system. The fact that the recovery process is discrete means that the number of possible behaviors is finite and transition of the state of the target system can be discretely expressed. Further, the fact that the recovery process is deterministic means that a transition destination of the state is also determined when the state of the target system and a behavior to be performed are determined.
Here, a state referred to in the present embodiment abstractly represents the state of the target system, and the foregoing assumption of the discreteness and the like is convenient for developing a theory. Therefore, the foregoing assumption does not limit applications of the present embodiment, and the present embodiment can be applied to, for example, a target system that continuously changes.
Hereinafter, a formal problem setting will be described. First, the target system is denoted by X, and the behavior of the target system when a failure f∈F (where F is the set of all failures to be targeted) is inserted into X is written as X(f). X(f) is modeled with an OLDFA, and constructing this OLDFA is the objective of this phase.
The OLDFA is a kind of deterministic finite automaton defined in the present embodiment and is expressed as a tuple <Q, A, δ, q0, {Fl}l∈L, {Ol}l∈L>. Here, Q is a set of states, A is a set of behaviors, δ: Q×A→Q is a state transition function, q0∈Q is an initial state, and Fl⊂Q is a set of states having a label l∈L={0, 1, . . . , |L|−1}. Ol is an n-dimensional real-valued random variable (here, n is a natural number arbitrarily determined in advance) corresponding to label l and represents observation data.
A* is defined as the set of all behavior sequences, and ε∈A* is defined as the behavior sequence having a length of 0. At this time, the domain of δ is extended to Q×A* as follows.
δ(q,ε)=q
δ(q,u·a)=δ(δ(q,u),a)
Here, q∈Q, a∈A, and u∈A*, and “·” represents concatenation of a behavior sequence and a behavior.
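The two defining equations above extend δ to behavior sequences by simple left-to-right folding, which can be sketched as follows (assuming, for illustration only, that states and behaviors are hashable values and the one-step transition function is a Python dict; none of these names appear in the embodiment itself):

```python
# Extends delta from Q x A to Q x A*, exactly as in the two equations above:
#   delta(q, eps)       = q
#   delta(q, u . a)     = delta(delta(q, u), a)
def delta_star(delta, q, u):
    """Apply a behavior sequence u (an iterable of behaviors) from state q."""
    for a in u:          # the empty sequence leaves q unchanged
        q = delta[(q, a)]
    return q

# Toy two-state example (hypothetical transition function).
toy_delta = {(0, 'a'): 1, (1, 'a'): 1, (0, 'b'): 0, (1, 'b'): 0}
assert delta_star(toy_delta, 0, "ab") == 0   # a: 0 -> 1, then b: 1 -> 0
```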
Each q∈Q belongs to exactly one of {Fl}l∈L. That is, Q is covered by {Fl}l∈L, and different Fl have no common portion (the Fl form a partition of Q).
The initial state q0∈F0 is the state of the target system immediately after insertion of the failure f and has label 0. It is assumed that a state having label 1 is a state in which recovery is completed (that is, a normal state). In the present embodiment, for simplicity, it is assumed that |F1|=1 (that is, there is only one normal state) and that a mechanism for determining whether the state is normal is separately provided.
A state that has label l∈L outputs observation data according to the random variable Ol. Note that different states output observation data according to the same probability distribution (random variable) when their labels are the same. That is, the model also covers target systems in which a state cannot be uniquely identified from the observation data alone, and in which the Markov property does not hold when only the observation data is viewed.
The OLDFA defined above is assumed to lie behind X(f), and Q, δ, and {Fl}l∈L cannot be directly observed. Instead, by executing a behavior sequence u=a1a2 . . . a|u|∈A*, a realized value of the observation data in the state of the transition destination,

O(u)≡Olabel(δ(q0,u))  [Math. 1]

can be acquired. Here, label(q) indicates the label of a state q.
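The observation interface above — only a realized value of O(u) is visible, never the state itself — can be illustrated with a toy sketch (the transition dict, label map, and Gaussian samplers below are hypothetical stand-ins; two distinct states deliberately share label 0 so that they cannot be told apart from observation data alone):

```python
import random

# Sketch: the hidden OLDFA is represented by a transition dict, a
# state -> label map, and one random sampler per label (all hypothetical).
# Executing u from q0 returns a realized value of O(u) = O_label(delta(q0, u)).
def observe(delta, label, samplers, q0, u):
    q = q0
    for a in u:
        q = delta[(q, a)]
    return samplers[label[q]]()  # draw from the distribution of the final label

# States 0 and 1 share label 0 (same distribution); state 2 has label 1.
delta = {(0, 'a'): 1, (1, 'a'): 2, (2, 'a'): 2}
label = {0: 0, 1: 0, 2: 1}
samplers = {0: lambda: random.gauss(0.0, 1.0),
            1: lambda: random.gauss(10.0, 1.0)}
```

States 0 and 1 output observations from the same distribution, so a single observation cannot identify the state, as noted above.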
At this time, in this step, an OLDFA for the failure f is constructed by executing the behavior sequence in an appropriate order after insertion of each failure f into the target system and analyzing the observation data obtained at that time.
The example illustrated in
In the OLDFA illustrated in
In the OLDFA illustrated in
On the other hand, in the OLDFA illustrated in
RPW Construction Phase
Referring to the OLDFAs obtained in the OLDFA construction phase, a maintenance person can basically grasp the recovery process of the relevant failure. However, for example, when the initial states of the failure f1 and the failure f2 are similar and cannot be distinguished from each other (that is, when the initial state q1 of the failure f1 and the initial state q2 of the failure f2 output observation data according to the same probability distribution), it is not known which failure's OLDFA should be referred to when one of the failures actually occurs. In order to avoid such a situation, a plurality of OLDFAs having similar initial states are merged into one workflow in this phase. In the present embodiment, this workflow is referred to as a recovery process workflow (RPW).
The RPW is formally defined as follows. First, N OLDFAs corresponding to the failures fi (where i=1, 2, . . . , N), that is,

Mi=<Qi, A, δi, q0i, {Fli}l∈Li, {Oli}l∈Li>  [Math. 2]

are assumed to be given. Here, N is the number of failures.
The RPW is expressed as a directed graph in which a label and a candidate for a state which can be present with respect to each failure are described in a node (vertex), and a behavior and a label of observation data obtained as a result of the behavior are described in a directed side.
That is, a vertex of the RPW is represented as a tuple <l, Θ>. Here,

l∈∪i=1N Li  [Math. 3]

is a label, and Θ=(θ1, . . . , θN)∈(Q1∪{N/A})× . . . ×(QN∪{N/A}) indicates a candidate state for each failure fi. θi=N/A indicates that the failure fi cannot have occurred. θ1, . . . , θN all have label l as long as they are not N/A.
When the label of the directed side from a vertex u=<lu, (θui)i∈{1, . . . , N}> to a vertex v=<lv, (θvi)i∈{1, . . . , N}> of the RPW is written as

(a, lv) (a∈A, lv∈∪i=1N Li)  [Math. 4]

the directed side is present if and only if δi(θui, a)=θvi holds for every i satisfying θui≠N/A, and θvi=N/A holds for every i satisfying θui=N/A. Here, (θi)i∈{1, . . . , N}=(θ1, . . . , θN).
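The edge-existence condition can be expressed compactly. The following sketch assumes, hypothetically, that each per-failure automaton is given as a transition dict δi and that N/A is encoded as None:

```python
# The directed side u -> v with behavior a exists iff, componentwise:
#   delta_i(theta_u_i, a) == theta_v_i  whenever theta_u_i is not N/A, and
#   theta_v_i == N/A                    whenever theta_u_i is N/A.
def has_edge(deltas, theta_u, theta_v, a):
    for d, tu, tv in zip(deltas, theta_u, theta_v):
        if tu is None:           # N/A propagates to N/A
            if tv is not None:
                return False
        elif d.get((tu, a)) != tv:
            return False
    return True
```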
An example of the RPW is illustrated in
For example, the topmost vertex in
On the other hand, for example, when it is assumed that a behavior a is executed at the topmost vertex in
Observation Table
Before the algorithms for the foregoing OLDFA construction phase and RPW construction phase are described, the observation table, which is one of the important concepts in the present embodiment, will be described. The algorithm of the OLDFA construction phase according to the present embodiment is similar to the algorithm called the L* algorithm. The L* algorithm is an algorithm that obtains a deterministic finite automaton (DFA) by expanding an observation table based on binary observation values (0: unaccepted, 1: accepted) obtained by systematically executing behaviors. For details of the L* algorithm, refer to, for example, reference literature 1 "D. Angluin, "Learning regular sets from queries and counterexamples," Information and Computation, vol. 75, No. 2, pp. 87-106, 1987." or the like.
Hereinafter, an observation table that extends that of the L* algorithm for the construction of the OLDFA will be described. When A and L are given, the observation table is expressed as a tuple T=<P, S, h>. Here, P⊂A* is a prefix-closed set, S⊂A* is a suffix-closed set, and h:(P∪P·A)·S→L is a function.
A set E⊂A* is called prefix-closed if and only if u∈E whenever u·a∈E, for u∈A* and a∈A. Similarly, E⊂A* is called suffix-closed if and only if u∈E whenever a·u∈E, for u∈A* and a∈A.
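These two closure properties can be illustrated with a small sketch (assuming, for illustration, that behavior sequences are represented as strings over single-character behaviors):

```python
# prefix-closed: u in E whenever u.a in E; suffix-closed: u in E whenever a.u in E.
def is_prefix_closed(E):
    return all(w[:-1] in E for w in E if w)   # drop the last behavior

def is_suffix_closed(E):
    return all(w[1:] in E for w in E if w)    # drop the first behavior
```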
The observation table is a 2-dimensional matrix in which the elements of P∪P·A are the row indexes and the elements of S are the column indexes. Each element is defined by

{tilde over (h)}:(P∪P·A)×S∋(p,s)→h(p·s)∈L [Math. 5]
In the text of the present specification, the foregoing Math. 5 is written as "˜h" below.
Further, a row vector

{tilde over (h)}:(P∪P·A)→L|S| [Math. 6]

is defined as

{tilde over (h)}(p)=({tilde over (h)}(p,s1),{tilde over (h)}(p,s2), . . . ,{tilde over (h)}(p,s|S|)) [Math. 7]

Here, p∈(P∪P·A) and sj∈S (where j=1, . . . , |S|). In the text of the present specification, the foregoing Math. 6 is written as "vector ˜h" below.
At this time, the observation table formed with the vector ˜h (hereinafter written as the "observation table ˜h") is closed when, for any p′∈P·A, there is some p∈P such that vector ˜h(p′)=vector ˜h(p). The observation table is consistent when, for any p1, p2∈P with vector ˜h(p1)=vector ˜h(p2), vector ˜h(p1·a)=vector ˜h(p2·a) holds for every a∈A.
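The two conditions can be checked mechanically. The following sketch assumes, hypothetically, that behavior sequences are strings over single-character behaviors and that row(p) returns the row vector ˜h(p) as a tuple:

```python
# closed: every row of P.A already appears as a row of P.
def is_closed(row, P, A):
    rows_P = {row(p) for p in P}
    return all(row(p + a) in rows_P for p in P for a in A)

# consistent: rows that agree on P keep agreeing after appending any behavior.
def is_consistent(row, P, A):
    for p1 in P:
        for p2 in P:
            if row(p1) == row(p2):
                if any(row(p1 + a) != row(p2 + a) for a in A):
                    return False
    return True
```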
For a closed and consistent observation table ˜h, an OLDFA with the minimum number of states satisfying label(δ(q0, p·s))=˜h(p, s) can be constructed as follows.
Q≡{{tilde over (h)}(p)|p∈P}

q0≡{tilde over (h)}(ε)

δ({tilde over (h)}(p),a)≡{tilde over (h)}(p·a)

Fl≡{{tilde over (h)}(p)|p∈P,h(p)=l} [Math. 8]
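Under the assumption that the table is held as functions row(p) (the row vector ˜h(p)) and h(p) (its label) — hypothetical names for this sketch — the construction of Math. 8 can be written directly:

```python
# For a closed and consistent table: states are the distinct row vectors,
# q0 = ~h(eps), delta(~h(p), a) = ~h(p.a), and F_l collects rows with label l.
def build_oldfa(row, h, P, A):
    Q = {row(p) for p in P}
    q0 = row("")
    delta = {(row(p), a): row(p + a) for p in P for a in A}
    F = {}
    for p in P:
        F.setdefault(h(p), set()).add(row(p))
    return Q, q0, delta, F
```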
An example of the observation table is illustrated in
The upper six rows (from ε∈P to abb∈P) of the observation table illustrated in
The vector ˜h(ε) corresponds to state 0 of the OLDFA illustrated in
OLDFA Construction Algorithm
Next, an algorithm for implementing the OLDFA construction phase (an OLDFA construction algorithm) will be described. An example of the OLDFA construction algorithm is illustrated in
As illustrated in
Closedness and consistency of the observation table T are tested in the 8th to 16th rows. When the closedness is violated, a row is added to the observation table T (9th to 12th rows); when the consistency is violated, a column is added to the observation table T (13th to 16th rows). The addition of the row and the addition of the column are executed in an EXPAND-TABLE procedure to be described below.
When the observation table T is closed and consistent, an OLDFA is constructed from the observation table T (18th row), True is set in learned (19th row), and then an equivalence test is executed (20th to 27th rows). The equivalence test is a test for confirming whether the target system conforms to the OLDFA.
In the equivalence test, sampling of appropriate behavior sequences is executed (21st row), each behavior sequence is executed on the target system while the state of the OLDFA is caused to transition accordingly, and it is determined whether the observation data obtained from the target system matches the label obtained by the OLDFA (22nd row). In the OLDFA construction algorithm illustrated in
When the observation data obtained by the target system matches the label obtained by the OLDFA in all the sampled behavior sequences, the OLDFA is output (28th row). Conversely, when there is an unmatched behavior sequence, the observation table T is expanded using that behavior sequence as a counterexample (23rd to 25th rows), False is set in learned (26th row), and the equivalence test loop is exited (27th row). Thus, the closedness and the consistency are tested again.
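The sampling-based equivalence test can be sketched as follows (assuming, for illustration only, that the target system is abstracted as a function returning the label of the observation obtained after executing a behavior sequence; a mismatch is returned as a counterexample for expanding the observation table):

```python
import random

# Compare the system's observation label with the label predicted by the
# constructed OLDFA (transition dict, state -> label map) on random sequences.
def equivalence_test(system_label, delta, label, q0, A, n_samples=100, max_len=5):
    for _ in range(n_samples):
        u = [random.choice(A) for _ in range(random.randint(1, max_len))]
        q = q0
        for a in u:
            q = delta[(q, a)]
        if system_label(u) != label[q]:
            return u        # counterexample: OLDFA disagrees with the system
    return None             # passed: no mismatch found among the samples
```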
Next, an EXPAND-TABLE procedure will be described. An example of an algorithm of the EXPAND-TABLE procedure is illustrated in
Specifically, in the 2nd to 5th rows, p·s is executed on the target system for all pairs of p∈P′ and s∈S′ (4th row) to acquire observation data. The observation data are all stored in a data area Data(p·s) (5th row). It is assumed that observation data from executions of the EXPAND-TABLE procedure before the present time are not erased and remain in the data area Data. The data area Data is realized by, for example, a database or the like.
Thereafter, unsupervised clustering is executed using all the observation data accumulated in the data area Data, and a cluster number obtained as a result is used as the label given to ˜h(p·s)=h(p·s) (6th and 7th rows). That is, the label given to h(p·s) is registered as the element whose row index is p and whose column index is s. Thus, the observation table T is expanded.
The foregoing clustering processing can be said to be processing for converting the observation data, which is a multidimensional real value, into a label, which is a finite discrete value. Thus, the observation table can be handled in the same manner as in the L* algorithm. The main reasons why all the observation data are accumulated in the data area Data and clustered together are as follows: clustering reliability is higher as the number of pieces of data is larger, and a clustering result is meaningful only as a classification (the number allocated to each cluster is not meaningful in itself), so it is simplest to execute clustering on all the observation data again and replace all the values of the elements of the observation table.
Several variations of the clustering such as variations exemplified in the following (a) and (b) are considered, and the present invention is not necessarily limited to those described above.
(a) When a sufficient number of pieces of observation data has already been accumulated, a supervised classifier associating the observation data with labels may be learned, and the labeling may be executed by the classifier subsequently.
(b) When a sufficient number of pieces of observation data has already been accumulated, only additional observation data may be clustered by online clustering or the like subsequently.
Any clustering scheme can be used, and a scheme that has the following Properties 1 and 2 is preferable. As a clustering scheme that has Properties 1 and 2, for example, DBSCAN can be exemplified. For DBSCAN, refer to, for example, reference literature 2 "M. Ester, H.-P. Kriegel, J. Sander, X. Xu et al., "A density-based algorithm for discovering clusters in large spatial databases with noise," in KDD, vol. 96, No. 34, 1996, pp. 226-231." or the like.
Property 1: The number of clusters can also be estimated at the same time. This is because information regarding the number of states and the number of labels is not held in advance.
Property 2: No specific assumption, such as a Gaussian distribution, is imposed on the distribution of the observation data. This is because the data characteristics of the observation data are not necessarily known in advance.
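As a concrete illustration of a scheme having Properties 1 and 2, the following is a pure-Python, 1-dimensional toy version in the spirit of DBSCAN (a sketch only; in practice an existing DBSCAN implementation from a machine learning library would be used). It discovers the number of clusters by itself and imposes no distributional assumption, labeling sparse points as noise (-1):

```python
# Simplified density-based clustering (1-D, toy version in the spirit of DBSCAN).
def dbscan_1d(xs, eps, min_pts):
    labels = [None] * len(xs)          # None = unvisited, -1 = noise

    def neighbors(i):
        return [j for j, x in enumerate(xs) if abs(x - xs[i]) <= eps]

    cluster = 0
    for i in range(len(xs)):
        if labels[i] is not None:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_pts:
            labels[i] = -1             # noise (may be claimed by a cluster later)
            continue
        labels[i] = cluster            # i is a core point: start a new cluster
        queue = list(seeds)
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster    # former noise becomes a border point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            nj = neighbors(j)
            if len(nj) >= min_pts:     # expand only from core points
                queue.extend(nj)
        cluster += 1
    return labels
```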
RPW Construction Algorithm
Next, an algorithm for implementing the RPW construction phase (an RPW construction algorithm) will be described. An example of the RPW construction algorithm is illustrated in
Since the RPW construction algorithm integrates (merges) a plurality of OLDFAs, the meanings of the label numbers must be unified across the OLDFAs before the merging. For example, label 2 in the OLDFA of the failure f1 and label 2 in the OLDFA of the failure f2 must be allocated to the random variable of the same observation data. Accordingly, before this algorithm is executed, it is necessary, for example, to execute the clustering processing collectively again to unify the label values of the OLDFAs, or to unify the label values of the respective OLDFAs by using the same classifier. Alternatively, for example, rather than constructing the OLDFA individually for each failure, a plurality of initial states may be prepared, one per failure, and one OLDFA may be constructed in accordance with the OLDFA construction algorithm; by subsequently extracting the maximal component reachable from each initial state, an OLDFA for each failure in which the label numbers are naturally unified can be acquired.
As illustrated in
The RPW construction algorithm constructs the RPW by defining vertices and directed sides while confirming, for each state and behavior in the OLDFA for each failure, where the state transitions next. As an order in which the vertices and the directed sides are constructed, for example, as illustrated in
Specifically, after a set Edges, a queue nodeQueue, and a set seen are initialized (2nd to 5th rows), a graph is constructed while nodeQueue≠φ (6th to 19th rows). When nodeQueue=φ is satisfied, Edges is output (20th row). Edges is a set that has (u, v, (a, l)) as an element when the start point of a directed side is u, the end point of the directed side is v, and the behavior and label described in the directed side are (a, l). Thus, an RPW in which the N OLDFAs are merged is constructed.
In the construction of the graph, a vertex obtained by get from nodeQueue is set as u=<l, Θ> (7th row), and the 9th to 19th rows are repeated for each a∈A (8th row). Note that get is an operation of extracting an element from nodeQueue, which is a first-in first-out list structure.
In the 9th row, ˜Θl=(N/A, . . . , N/A) is set for all l∈∪iLi. In the 10th to 13th rows, for each i∈{1, . . . , N} satisfying Θ[i]≠N/A, the label l of δi(Θ[i], a) is obtained, and the state ˜Θl[i] corresponding to that label is updated to δi(Θ[i], a). Here, Θ[i] represents the ith element of Θ.
In the 14th to 19th rows, for each l∈∪iLi, v=<l, ˜Θl> is set, and then (u, v, (a, l)) is added to Edges. When v is not included in seen, v is added to nodeQueue and v is also added to seen. seen is a set which stores nodes already constructed in the RPW.
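The breadth-first construction described above can be sketched end to end as follows (a minimal illustration; the representation of each OLDFA as a transition dict plus a state-to-label map, and the encoding of N/A as None, are hypothetical choices for this sketch):

```python
from collections import deque

# Merge N per-failure automata into RPW edges.  Vertices are (label, Theta)
# pairs; each edge carries (behavior, resulting label), as described above.
def build_rpw(deltas, labels, q0s, A):
    start = (0, tuple(q0s))            # initial states all share label 0
    edges = set()
    queue = deque([start])
    seen = {start}
    while queue:
        l, theta = queue.popleft()     # first-in first-out, as in nodeQueue
        for a in A:
            succ = {}                  # resulting label -> candidate state per failure
            for i, (d, lab) in enumerate(zip(deltas, labels)):
                if theta[i] is None:   # N/A stays N/A
                    continue
                nxt = d[(theta[i], a)]
                succ.setdefault(lab[nxt], [None] * len(deltas))[i] = nxt
            for lv, th in succ.items():
                v = (lv, tuple(th))
                edges.add(((l, theta), v, (a, lv)))
                if v not in seen:
                    seen.add(v)
                    queue.append(v)
    return edges
```

Starting from the common initial vertex, successors with different observation labels split into different vertices, which is how the behavior a separates the merged failures.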
Next, a hardware configuration of the failure recovery support device 10 according to the present embodiment will be described with reference to
As illustrated in
The input device 11 is, for example, a keyboard, a mouse, a touch panel, or the like. The display device 12 is, for example, a display or the like. The failure recovery support device 10 may not include at least one of the input device 11 and the display device 12.
The external I/F 13 is an interface with an external device such as a recording medium 13a. The failure recovery support device 10 can perform reading, writing, and the like of the recording medium 13a via the external I/F 13. Examples of the recording medium 13a include a compact disc (CD), a digital versatile disk (DVD), a secure digital (SD) memory card, and a Universal Serial Bus (USB) memory card.
The communication I/F 14 is an interface connecting the failure recovery support device 10 to a communication network.
The processor 15 is any of various arithmetic devices such as a central processing unit (CPU) and a graphics processing unit (GPU). The memory device 16 is, for example, any of various storage devices such as a hard disk drive (HDD), a solid-state drive (SSD), a random access memory (RAM), a read only memory (ROM), and a flash memory.
The failure recovery support device 10 which has the hardware configuration illustrated in
Next, a functional configuration of the failure recovery support device 10 according to the present embodiment will be described with reference to
As illustrated in
The failure recovery support device 10 according to the present embodiment includes an observation data DB 108. The observation data DB 108 can be implemented by, for example, the memory device 16 (in particular, a nonvolatile memory device such as an HDD or a flash memory). The observation data DB 108 may also be realized by using, for example, a storage device connected to the failure recovery support device 10 via the communication network or the like.
The origin recovery unit 101 performs origin recovery for returning the target system 20 to a normal state. Here, the target system 20 is, as described above, a verification environment for the system which is the actual support target of failure recovery, an emulator simulating that system, or the like. The origin recovery unit 101 is not necessarily provided in the failure recovery support device 10; for example, the target system 20 may have the origin recovery unit 101.
The failure insertion unit 102 inserts a failure into the target system 20. The failure insertion unit 102 can be implemented by a tool (generally called a failure insertion tool or the like) that artificially inserts various kinds of failure data into the target system 20.
The observation table construction unit 103 expands the observation table, performs the closedness/consistency test, and determines the subsequent behavior.
The behavior execution unit 104 executes a behavior determined by the observation table construction unit 103. The behavior execution unit 104 is implemented by program processing called an agent capable of executing various behaviors or the like.
The clustering unit 105 clusters the observation data accumulated in the observation data DB 108.
The equivalence testing unit 106 constructs an OLDFA from the observation table and performs an equivalence test.
The RPW construction unit 107 constructs an RPW from a plurality of OLDFAs.
The observation data DB 108 accumulates observation data acquired from the target system 20.
Next, a flow of processing in the OLDFA construction phase and the RPW construction phase will be described with reference to
Here, the processing of the OLDFA construction phase (steps S101 to S113) is repeatedly executed for each failure to construct the OLDFA for each failure. That is, steps S101 to S113 are repeatedly executed for each failure fi (where i=1, . . . , N). Hereinafter, a flow of processing when the OLDFA corresponding to a certain failure fi is constructed will be described as an example, for the case where the observation table is expanded after a certain amount of observation data is accumulated.
However, when a plurality of initial states are prepared for each failure and one OLDFA is constructed by the OLDFA construction algorithm, it is not necessary to repeatedly execute steps S101 to S113. In this case, not only one failure but also various failures are inserted in step S102 to be described below.
First, the origin recovery unit 101 performs origin recovery for returning the target system 20 to a normal state (S101).
Next, the failure insertion unit 102 inserts the failure fi into the target system 20 (step S102).
Next, the observation table construction unit 103 determines the next behavior (step S103). The observation table construction unit 103 determines the p·s of the fourth row of the EXPAND-TABLE procedure illustrated in
Next, the behavior execution unit 104 executes the behavior determined in the foregoing step S103 (step S104). Thus, the observation data is acquired from the target system 20, and the observation data are accumulated in the observation data DB 108.
Next, the clustering unit 105 determines whether a clustering command has been issued from the observation table construction unit 103 (step S105). When the clustering command has not been issued, the processing returns to step S103. When the clustering command has been issued, the processing proceeds to step S106. The observation table construction unit 103 may issue the clustering command, for example, when a fixed number or more of pieces of observation data are accumulated in the observation data DB 108. Once the clustering command has been issued, the determination of the present step may be omitted subsequently.
The clustering unit 105 clusters the observation data accumulated in the observation data DB 108 (step S106). Then, the observation table construction unit 103 expands the observation table by using the clustering result of the foregoing step S106 (step S107). Steps S106 and S107 correspond to the sixth and seventh rows of the EXPAND-TABLE procedure illustrated in
Next, the observation table construction unit 103 executes a closedness/consistency test (that is, a test of whether the determination of the 7th row and the determination of the 13th and 14th rows of the OLDFA construction algorithm illustrated in
When it is determined that either the closedness or the consistency is not satisfied, the processing returns to step S103. Thus, when it is determined that the closedness is not satisfied, a behavior of p* in the ninth and tenth row of the OLDFA construction algorithm illustrated in
On the other hand, when it is determined that the closedness and the consistency are satisfied, the processing proceeds to step S110. The equivalence testing unit 106 constructs an OLDFA from the observation table and then performs the equivalence test (step S110). That is, the equivalence testing unit 106 executes the 18th to 22nd rows of the OLDFA construction algorithm illustrated in
Next, the equivalence testing unit 106 determines whether the OLDFA has passed the equivalence test (step S111). The equivalence testing unit 106 may determine, for example, that the equivalence test has passed when the labels match in all the behavior sequences U sampled in the 21st row of the OLDFA construction algorithm illustrated in
When it is determined in step S111 that the equivalence test has not passed, the observation table construction unit 103 executes the 23rd to 27th rows of the OLDFA construction algorithm illustrated in
Conversely, when it is determined in step S111 that the equivalence test has passed, the equivalence testing unit 106 outputs the OLDFA on which the equivalence test has passed (step S113).
Subsequently, in the RPW construction phase, the RPW construction unit 107 receives an arbitrary plurality of OLDFAs as input and constructs an RPW from the OLDFAs in accordance with the RPW construction algorithm illustrated in
Then, the RPW construction unit 107 outputs the RPW constructed in the foregoing step S114 (step S115).
As described above, the failure recovery support device 10 according to the present embodiment can construct an automaton (OLDFA) indicating the states and behaviors of the target system in the recovery process by artificially inserting various failures into the target system 20 and acquiring observation data while taking various behaviors through the agent. Further, the failure recovery support device 10 according to the present embodiment constructs a recovery process workflow in which the plurality of OLDFAs are integrated (merged). Thus, for example, even when the type of failure cannot be identified, the state transitions and behaviors of the target system can be ascertained.
As described above, since the failure recovery support device 10 according to the present embodiment executes all of these processes automatically, it is not necessary to design a probability model manually. Since failures are inserted artificially, an automaton and a workflow can be constructed even for a low-frequency failure or an unknown failure. In this case, the presence of trouble tickets from past failure handling is not assumed either.
Then, a maintenance person can easily ascertain the states and behaviors of the target system by referring to the OLDFA and the RPW. Thus, the failure recovery operation in an actual target system can be accelerated and standardized.
The OLDFA and the RPW can be utilized not only when the maintenance person himself or herself manually recovers from a failure but also, for example, when an agent implemented by an artificial intelligence (AI) technology or the like automatically recovers from a failure, in which case the maintenance person can confirm the recovery process performed by the agent.
Hereinafter, a result obtained by carrying out an actual experiment using the failure recovery support device 10 according to the present embodiment will be described.
In this experiment, a Kubernetes cluster was generated using Kubernetes, an orchestrator for container-based virtual environments, and a three-tier web environment was constructed therein. This three-tier web environment implemented by containers is regarded as the target system 20 in the present embodiment. The three-tier web environment consists of three containers: Nginx, Rails, and MySQL.
At this time, HTTP requests were generated at random by a load test tool as background traffic. As a failure, a delay of 2,000±100 ms was inserted into two of the three containers.
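The text does not state how the delay failure was injected; one common mechanism for adding a mean ± jitter delay to a container's traffic is the Linux tc/netem queueing discipline, sketched below. The interface name is a placeholder, and the command is only built, not executed.

```python
def delay_injection_command(iface, mean_ms=2000, jitter_ms=100):
    """Build (but do not run) a tc/netem command that adds a
    mean_ms ± jitter_ms delay on the given network interface
    inside the target container. tc/netem is an assumed mechanism;
    the experiment's actual injection method is not specified."""
    return (f'tc qdisc add dev {iface} root netem '
            f'delay {mean_ms}ms {jitter_ms}ms')

print(delay_injection_command('eth0'))
# → tc qdisc add dev eth0 root netem delay 2000ms 100ms
```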
Inflow and outflow traffic (on a byte basis and a number-of-packets basis) was collected in each container as observation data. The observation data is thus a 12-dimensional vector (3 containers × 2 directions × 2 units). After normalization, the dimensions were reduced by UMAP and clustering was performed by DBSCAN. For UMAP, refer to, for example, reference literature 3: L. McInnes, J. Healy, and J. Melville, "Umap: Uniform manifold approximation and projection for dimension reduction," arXiv preprint arXiv:1802.03426, 2018.
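The normalize-then-cluster step can be sketched as follows. To keep the sketch self-contained, the UMAP dimension reduction is omitted and DBSCAN is implemented minimally in pure Python rather than via the umap-learn and scikit-learn libraries presumably used in practice; the toy data and the eps/min_pts values are assumptions for illustration.

```python
import math

def normalize(X):
    """Per-dimension z-score normalization of a list of vectors."""
    d, n = len(X[0]), len(X)
    mu = [sum(x[j] for x in X) / n for j in range(d)]
    # Guard against zero variance with a fallback scale of 1.0.
    sd = [math.sqrt(sum((x[j] - mu[j]) ** 2 for x in X) / n) or 1.0
          for j in range(d)]
    return [[(x[j] - mu[j]) / sd[j] for j in range(d)] for x in X]

def dbscan(X, eps=0.5, min_pts=3):
    """Minimal DBSCAN; returns one cluster id per point (-1 = noise)."""
    def neighbors(i):
        return [j for j in range(len(X)) if math.dist(X[i], X[j]) <= eps]
    labels = [None] * len(X)
    cid = 0
    for i in range(len(X)):
        if labels[i] is not None:
            continue
        nb = neighbors(i)
        if len(nb) < min_pts:
            labels[i] = -1          # noise (may later become a border point)
            continue
        labels[i] = cid             # new core point starts a new cluster
        queue = [j for j in nb if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cid     # noise reached from a core -> border point
            if labels[j] is not None:
                continue
            labels[j] = cid
            nb_j = neighbors(j)
            if len(nb_j) >= min_pts:
                queue.extend(nb_j)  # j is also core: keep expanding
        cid += 1
    return labels

# Toy 12-dimensional observation vectors: two well-separated groups.
data = [[0.0] * 12 for _ in range(5)] + [[1.0] * 12 for _ in range(5)]
labels = dbscan(normalize(data), eps=0.5, min_pts=3)
print(labels)  # → [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
```

Each cluster id plays the role of the discrete observation label (the 0/1 state labels of the OLDFA) attached to the raw 12-dimensional traffic vectors.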
A behavior that the agent can take is execution of a command for regenerating each container. Since there are three containers in total, the total number of behaviors is three: a: regenerate Nginx, b: regenerate Rails, and c: regenerate MySQL. A container into which a failure has been inserted is recovered by issuing the regeneration command.
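The regeneration command itself is not specified in the text. In a Kubernetes cluster, one common realization is deleting the pod so that its controller recreates it; the mapping below builds (but does not execute) such a command, and the label selectors and namespace are illustrative assumptions.

```python
# Hypothetical mapping from the agent's behaviors a/b/c to the container
# regenerated by each behavior, per the experimental setting.
ACTIONS = {'a': 'nginx', 'b': 'rails', 'c': 'mysql'}

def regenerate_command(action, namespace='web-three-tier'):
    """Build (but do not execute) a kubectl command that deletes the pod for
    the given behavior; a controlling Deployment would then recreate it.
    The app labels and namespace are placeholders, not the experiment's
    actual resource names."""
    app = ACTIONS[action]
    return ['kubectl', 'delete', 'pod', '-n', namespace,
            '-l', f'app={app}', '--wait=false']

print(regenerate_command('b'))
# → ['kubectl', 'delete', 'pod', '-n', 'web-three-tier', '-l', 'app=rails', '--wait=false']
```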
An OLDFA constructed by the failure recovery support device 10 according to the present embodiment under the above-described experimental setting is illustrated in
In f1 and f2, the labels of the initial states are both 0, and the failures cannot be distinguished from the observation data alone. Therefore, the RPWs were constructed. The results are illustrated in
The present invention is not limited to the foregoing specifically disclosed embodiment, and various modifications and changes, combinations with known technologies, and the like can be made without departing from the description of the claims.
Filing Document | Filing Date | Country | Kind |
---|---|---|---|
PCT/JP2020/047094 | 12/17/2020 | WO |