The present disclosure relates to a learning data generation apparatus, a learning data generation method, and a program.
One of the important tasks of a service provider operating an information and communication technology (ICT) system is to ascertain the state of an abnormality occurring in the ICT system and to quickly cope with the abnormality. Therefore, schemes for detecting an abnormality occurring in an ICT system early and schemes for estimating an abnormal portion have been studied (for example, NPL 1 and NPL 2). As schemes for estimating an abnormal portion, for example, a scheme described in NPL 3, a scheme described in NPL 4, and the like have been proposed. NPL 3 proposes a scheme for modeling, as a causal model using a Bayesian network, a relationship between an abnormal portion and a change in data in an ICT system caused by the abnormal portion, and estimating the abnormal portion from data observed during an abnormality. NPL 4 proposes a scheme for identifying abnormality factors using fault data generated by chaos engineering.
Here, when an abnormal portion is estimated by the causal model, there are two methods of constructing the causal model. The first method is a method of defining and modeling an abnormal portion and a rule of a change in data in an ICT system caused by the abnormal portion based on knowledge or the like of an expert operator (for example, NPL 3). The second method is a method of constructing a causal model from an abnormal portion during past abnormality and data at that time. In the studies of the related art, a causal model is constructed by one of the two methods and an abnormal portion is estimated.
In general, only a small amount of data during a fault can be obtained from an ICT system. In chaos engineering, however, a fault is intentionally inserted into the ICT system, and the abnormal portion and the data at that time are collected. Accordingly, the collected data can be used for modeling a Bayesian network or as learning data for a model such as a support-vector machine (SVM), so that an abnormal portion and a factor can be estimated.
The two methods of constructing the causal model in the studies of the related art have problems. The first method has a problem that an abnormal portion cannot be correctly estimated when an abnormality not covered by the defined rules occurs. In particular, it is difficult to construct a causal model that covers in advance every abnormality which can occur in the ICT system and, as a result, it may not be possible to estimate the abnormal portion correctly in some cases.
Next, the second method has a problem that it is difficult to sufficiently collect data during an abnormality necessary to construct the causal model. This is because an ICT system generally rarely generates an abnormality, and even if an abnormality occurs, a recurrence prevention measure is taken so that the same abnormality does not occur again. In the second method, a causal model is constructed based only on past abnormalities, so that the causal model cannot cope with unknown abnormalities, and an abnormal portion cannot be estimated.
Chaos engineering is likely to partially solve the problem that it is difficult to sufficiently collect the data during an abnormality necessary to construct a causal model, but it cannot be said to be sufficient. This is because a wide variety of abnormalities occur in an ICT system, whereas chaos engineering is a method of intentionally inserting a fault, and thus only data related to abnormalities that humans can conceive of can be obtained.
The present disclosure has been devised in view of the foregoing circumstances and provides a technique for generating data used to construct a model for estimating an abnormal portion.
According to an aspect of the present disclosure, a learning data generation apparatus generating learning data used to learn a model for estimating an abnormal portion of an ICT system includes: a learning unit configured to learn parameters of a generator and a discriminator forming a conditional generative adversarial network by using observation data during abnormality of the ICT system; and a generation unit configured to generate the learning data using the generator in which the learned parameters are set.
A technique for generating data used to construct a model for estimating an abnormal portion is provided.
Hereinafter, an embodiment of the present invention will be described. Specifically, a learning data generation apparatus 10 that generates learning data used to construct a model for estimating an abnormal portion of the ICT system (for example, a causal model modeled by a Bayesian network or the like, or a machine learning model such as a support-vector machine (SVM)) will be described.
First, a theoretical configuration of a scheme in which the learning data generation apparatus 10 according to the embodiment generates learning data (hereinafter, also referred to as a proposal scheme) will be described.
In the present proposal scheme, learning data is generated using a conditional generative adversarial network (CGAN: Reference Literature 1) that generates abnormal data. Accordingly, abnormal data in an amount sufficient for learning a model for estimating an abnormal portion of the ICT system can be obtained as learning data. Since the CGAN generates abnormal data by inputting random data to a generator, various types of abnormal data can be generated. Therefore, for example, abnormal data which is difficult to obtain in chaos engineering can also be generated.
A case where the learning data is generated by the CGAN will be described below, but the present invention is not limited to the CGAN. Another generative model can be used as long as it is capable of designating the position at which the abnormality occurs in the abnormal data.
First, a data set of past abnormalities that occurred in the ICT system is denoted by X={x1, . . . , xN}. Here, xi is a k-dimensional vector representing past abnormal data. k is the number of types of data collected from the ICT system, such as a traffic amount and a central processing unit (CPU) usage rate. That is, each xi represents various states, such as the traffic amount and the CPU usage rate, when the ICT system is abnormal. N is the number of pieces of abnormal data. Each xi may have a data value at a certain time as an element, or may have a statistical value such as an average of data values in a certain time duration as an element.
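As an illustration of the last point, the following minimal sketch forms one k-dimensional vector xi by averaging each type of data over a time window. The function name `summarize_window` and the sample values are hypothetical and are used only for this example; the disclosure does not prescribe a particular statistic or implementation.

```python
from statistics import fmean


def summarize_window(samples):
    """Collapse T observations of k data types into one k-dimensional vector.

    `samples` is a list of T rows, each row holding k values
    (for example, a traffic amount and a CPU usage rate).
    The per-type average over the window is used as the statistic.
    """
    if not samples:
        raise ValueError("window must contain at least one observation")
    k = len(samples[0])
    if any(len(row) != k for row in samples):
        raise ValueError("every observation must hold k values")
    # zip(*samples) iterates over the k columns (one column per data type).
    return [fmean(column) for column in zip(*samples)]


# Hypothetical window: T = 3 time steps, k = 2 data types.
window = [
    [100.0, 80.0],
    [120.0, 90.0],
    [140.0, 100.0],
]
x_i = summarize_window(window)
```

Using a value observed at a single time instead corresponds to passing a window that contains one row.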
Data representing the abnormal portion when the abnormality occurs with regard to the abnormal data xi is represented by yi, and a data set formed by the abnormal portion data yi is represented by Y={y1, . . . , yN}. Here, yi is an l-dimensional (where l is the lowercase letter of L) vector. l denotes the number of apparatuses in the ICT system. It is assumed that each element of yi corresponds to an apparatus in the ICT system. However, the present invention is not limited thereto. For example, each element of yi may correspond to an I/F of an apparatus or a device built into the apparatus. When each element of yi corresponds to an I/F of the apparatus, it is possible to estimate an abnormal portion in units of I/Fs. When each element corresponds to a device built into the apparatus, it is possible to estimate an abnormal portion in units of devices.
It is assumed that yi is a one-hot vector in which only the j-th element (j∈{1, . . . , l}) corresponding to the abnormal portion is 1, and the other elements are 0.
Hereinafter, it is assumed that the data sets X and Y are formed by data observed when an abnormality occurs in an actual ICT system, but the present invention is not limited thereto. For example, the data sets X and Y may be formed by data generated by chaos engineering, or data observed when an abnormality occurs in an actual ICT system and data generated by chaos engineering may be mixed.
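The one-hot encoding of the abnormal portion can be sketched as follows. The helper name `one_hot` is an assumption introduced for this example; j is counted from 1, as in the text.

```python
def one_hot(j, l):
    """Return an l-dimensional vector whose j-th element (1-indexed) is 1."""
    if not 1 <= j <= l:
        raise ValueError("j must satisfy 1 <= j <= l")
    return [1 if index == j else 0 for index in range(1, l + 1)]


# Hypothetical ICT system with l = 5 apparatuses; the third one is abnormal.
y_i = one_hot(j=3, l=5)
```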
In this proposal scheme, the CGAN illustrated in
The generator G (⋅; θG) accepts, as an input, an (m+l)-dimensional vector in which an m-dimensional vector generated at random and an l-dimensional vector are combined, and outputs a k-dimensional vector.
Hereinafter, in the text of the present specification, the character xi with a circumflex “{circumflex over ( )}” as an accent is written as “{circumflex over ( )}xi”.
Although there are various methods for generating random m-dimensional vectors, one example is a method of sampling the value of each element from a normal distribution with a mean of 0 and a variance of 1. In the learning of the generator G (⋅; θG), the parameter θG is learned so that the k-dimensional vector {circumflex over ( )}xi, which is output when an (m+l)-dimensional vector obtained by combining the m-dimensional vector generated at random and the l-dimensional vector yi having only the j-th element of 1 is input, is similar to xi. That is, the generator G (⋅; θG) learns the parameter θG so that data similar to the abnormal data actually collected from the ICT system can be generated. In other words, the parameter θG is learned so that the discriminator D (⋅; θD) to be described below makes an erroneous determination.
The discriminator D (⋅; θD) accepts the k-dimensional vector as an input and outputs a scalar value of 0 or 1. Either the abnormal data xi actually collected from the ICT system or the data {circumflex over ( )}xi generated by the generator G is input to the discriminator D (⋅; θD), and it is determined whether xi or {circumflex over ( )}xi is input. The discriminator D (⋅; θD) outputs 1 when it is determined that xi is input, and outputs 0 when it is determined that {circumflex over ( )}xi is input. In the learning of the discriminator D (⋅; θD), the parameter θD is learned so that discrimination performance is enhanced.
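The input and output dimensions of the generator can be illustrated with the following minimal sketch. It assumes, purely for illustration, a small two-layer network with a tanh hidden layer; the disclosure does not specify the network architecture, and the names `concat`, `sample_noise`, `affine`, and `make_generator` are hypothetical. The parameters here are randomly initialized rather than learned.

```python
import math
import random


def concat(z, y):
    """Combine noise z (m values) and one-hot y (l values) into an (m+l)-vector."""
    return list(z) + list(y)


def sample_noise(m, rng):
    """Sample each of the m elements from a normal distribution (mean 0, variance 1)."""
    return [rng.gauss(0.0, 1.0) for _ in range(m)]


def affine(weights, bias, v):
    """Compute W v + b for a weight matrix given as a list of rows."""
    return [sum(w * x for w, x in zip(row, v)) + b for row, b in zip(weights, bias)]


def make_generator(m, l, k, hidden, rng):
    """Build a stand-in for G(.; theta_G) with unlearned, random parameters."""
    w1 = [[rng.gauss(0.0, 0.1) for _ in range(m + l)] for _ in range(hidden)]
    b1 = [0.0] * hidden
    w2 = [[rng.gauss(0.0, 0.1) for _ in range(hidden)] for _ in range(k)]
    b2 = [0.0] * k

    def generator(v):
        if len(v) != m + l:
            raise ValueError("generator input must be (m+l)-dimensional")
        h = [math.tanh(a) for a in affine(w1, b1, v)]
        return affine(w2, b2, h)  # k-dimensional output

    return generator


rng = random.Random(0)
m, l, k = 4, 3, 2                 # hypothetical dimensions
G = make_generator(m, l, k, hidden=8, rng=rng)
z = sample_noise(m, rng)
y_i = [0, 1, 0]                   # abnormal portion: second apparatus
g_input = concat(z, y_i)
x_hat_i = G(g_input)
```

In actual learning, the parameters held inside `generator` would be updated so that the output becomes similar to the collected abnormal data.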
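The 0-or-1 output of the discriminator can be sketched as follows. This example assumes a single-layer logistic model whose probability is thresholded at 0.5; this architecture, the threshold, the name `discriminate`, and the parameter values are assumptions made for illustration and are not taken from the disclosure.

```python
import math


def sigmoid(a):
    """Numerically stable logistic function."""
    if a >= 0:
        return 1.0 / (1.0 + math.exp(-a))
    e = math.exp(a)
    return e / (1.0 + e)


def discriminate(x, weights, bias, threshold=0.5):
    """Return 1 if x is judged to be collected data, 0 if judged to be generated."""
    if len(x) != len(weights):
        raise ValueError("input must be k-dimensional")
    score = sum(w * v for w, v in zip(weights, x)) + bias
    return 1 if sigmoid(score) >= threshold else 0


# Hypothetical parameters theta_D for k = 2.
theta_w, theta_b = [1.0, -1.0], 0.0
judged_real = discriminate([2.0, 0.5], theta_w, theta_b)       # score = 1.5
judged_generated = discriminate([0.5, 2.0], theta_w, theta_b)  # score = -1.5
```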
By learning the generator G (⋅; θG) and the discriminator D (⋅; θD) as described above, the generator G (⋅; θG) can generate data close to the abnormal data actually collected by the ICT system.
A loss function L of the CGAN including the generator G (⋅; θG) and the discriminator D (⋅; θD) is shown in the following Formula (1).
Here, E(⋅) is an expected value and z is an m-dimensional vector generated at random. z is also called noise. x∈X is abnormal data, and y∈Y is the abnormal portion data when the abnormality occurs with regard to the abnormal data x∈X. Further, cat (z, y) is an operation of combining z and y to generate an (m+l)-dimensional vector.
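For reference, assuming that the loss function follows the standard CGAN objective and that the discriminator receives only the k-dimensional vector as described above, Formula (1) could take a form such as the following. This is a sketch under those assumptions and is not necessarily the exact formula of the disclosure; cat(z, y) here denotes the operation of combining z and y into an (m+l)-dimensional vector.

```latex
L = \mathbb{E}_{x \in X}\!\left[\log D(x;\theta_D)\right]
  + \mathbb{E}_{z,\; y \in Y}\!\left[\log\!\left(1 - D\!\left(G(\mathrm{cat}(z, y);\theta_G);\theta_D\right)\right)\right]
```

Under this form, the first term becomes large when the discriminator outputs values close to 1 for collected data, and the second term becomes large when it outputs values close to 0 for generated data.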
Then, the parameters θG and θD are learned so as to minimize the loss function shown in the above Formula (1).
Specifically, the parameters θG and θD are learned by the following Formula (2).
Various schemes for updating the parameters are conceivable, and an appropriate scheme may be selected from among known updating schemes.
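For reference, assuming the usual adversarial formulation, in which the generator parameter is chosen to reduce the loss L while the discriminator parameter is chosen to increase it, Formula (2) could be written as the following min-max problem. This form is an assumption based on the standard CGAN setting, and the superscript * (denoting the learned parameters) is notation introduced only for this sketch.

```latex
\theta_G^{*},\ \theta_D^{*} = \arg\min_{\theta_G}\,\max_{\theta_D}\ L
```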
After learning is performed by the above Formula (2), learning data is generated by the generator G (⋅; θG) having the learned parameter θG. Specifically, the (m+l)-dimensional vector obtained by combining an m-dimensional vector z generated at random and an l-dimensional vector y generated at random is input to the learned generator G (⋅; θG), and the k-dimensional vector {circumflex over ( )}x is obtained as an output. Accordingly, learning data ({circumflex over ( )}x, y) for constructing a model for estimating an abnormal portion of the ICT system (for example, a causal model modeled by a Bayesian network or the like, or a machine learning model such as an SVM) can be obtained. The l-dimensional vector y is, for example, a one-hot vector in which only the j-th element is set to 1, where j is selected at random according to a uniform distribution or the like.
The input device 101 is, for example, a keyboard, a mouse, a touch panel, various physical buttons, or the like. The display device 102 is, for example, a display or a display panel. The learning data generation apparatus 10 need not include at least one of the input device 101 and the display device 102.
The external I/F 103 is an interface with an external device such as a recording medium 103a. The learning data generation apparatus 10 can perform reading and writing from and on the recording medium 103a via the external I/F 103. Examples of the recording medium 103a include a flexible disk, a compact disc (CD), a digital versatile disk (DVD), a secure digital (SD) memory card, and a Universal Serial Bus (USB) memory card.
The communication I/F 104 is an interface for connecting the learning data generation apparatus 10 to a communication network. The RAM 105 is a volatile semiconductor memory (storage device) that temporarily stores programs and data. The ROM 106 is a nonvolatile semiconductor memory (storage device) that can hold programs and data even when a power source is turned off. The auxiliary storage device 107 is a storage device such as a hard disk drive (HDD) or a solid state drive (SSD). Examples of the processor 108 include various arithmetic devices such as a CPU and a graphics processing unit (GPU).
The learning data generation apparatus 10 according to the embodiment can implement various types of processing which will be described below with the hardware configuration illustrated in
The observation data collection unit 201 collects abnormal data x of the ICT system and abnormal portion data y when abnormality occurs. The abnormal data x and the abnormal portion data y are stored in the observation data DB 206. Accordingly, the data set X formed by the abnormal data x and the data set Y formed by the abnormal portion data y are stored in the observation data DB 206.
The generation unit 202 is realized by the generator G (⋅; θG), and accepts an (m+l)-dimensional vector as an input and outputs a k-dimensional vector.
The discrimination unit 203 is realized by the discriminator D (⋅; θD), and accepts the k-dimensional vector as an input and outputs a scalar value of 0 or 1.
The learning unit 204 learns the parameters θG and θD by the above Formula (2).
The output unit 205 outputs various pieces of information to an output destination. For example, the output unit 205 outputs the k-dimensional vector output by the generation unit 202 and the scalar value output by the discrimination unit 203 to the display device 102 or the auxiliary storage device 107. For example, the output unit 205 outputs a set ({circumflex over ( )}x, y) of the k-dimensional vector {circumflex over ( )}x output by the generation unit 202 realized by the learned generator G (⋅; θG) and the l-dimensional vector y used at that time to the auxiliary storage device 107 or the like as learning data.
Hereinafter, a flow of processing performed by the learning data generation apparatus 10 will be described with reference to
Step S101: the learning unit 204 learns the parameters θG and θD by the above Formula (2) using the data sets X and Y.
Step S102: the generation unit 202 generates an m-dimensional vector z at random and generates an l-dimensional vector y (where y is a one-hot vector in which only the j-th element is 1) at random, inputs an (m+l)-dimensional vector obtained by combining z and y to the learned generator G (⋅; θG), and generates a k-dimensional vector {circumflex over ( )}x as an output. Accordingly, the learning data ({circumflex over ( )}x, y) is obtained.
Step S103: the output unit 205 outputs the learning data ({circumflex over ( )}x, y) obtained in the foregoing step S102 to a predetermined output destination (for example, the auxiliary storage device 107 or the like).
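Steps S102 and S103 can be sketched as follows, under the assumption that the learned generator is available as a callable. The names `generate_learning_data` and `placeholder_generator` are hypothetical, and the placeholder is only a stand-in with no learned parameters; in practice it would be replaced by the generator obtained in step S101. The index j is drawn from a uniform distribution, and the result is returned as a list instead of being written to a storage device.

```python
import random


def generate_learning_data(generator, m, l, n_samples, rng):
    """Produce n_samples pairs (x_hat, y) following steps S102 and S103."""
    data = []
    for _ in range(n_samples):
        # m-dimensional noise, each element from a normal distribution N(0, 1).
        z = [rng.gauss(0.0, 1.0) for _ in range(m)]
        # Abnormal portion chosen uniformly at random (0-based index here).
        j = rng.randrange(l)
        y = [1 if index == j else 0 for index in range(l)]
        # Input the combined (m+l)-dimensional vector to the generator.
        x_hat = generator(z + y)
        data.append((x_hat, y))
    return data


def placeholder_generator(v):
    """Stand-in for the learned generator; returns a k = 2 dimensional vector."""
    return [sum(v), max(v)]


learning_data = generate_learning_data(
    placeholder_generator, m=4, l=3, n_samples=5, rng=random.Random(0)
)
```

Each pair in `learning_data` can then be used as one labeled sample when constructing the model for estimating an abnormal portion.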
As described above, the learning data generation apparatus 10 according to the embodiment can learn the CGAN using the observation data (x, y) during abnormality of the ICT system and can generate the learning data ({circumflex over ( )}x, y) for constructing a model for estimating an abnormal portion of the ICT system by the generator G included in the CGAN. Accordingly, a sufficient amount of learning data necessary to construct the model can be obtained.
Further, the generator G accepts, as an input, a vector in which a vector z generated at random and a one-hot vector y generated at random are combined, and generates abnormal data {circumflex over ( )}x. Therefore, for example, abnormal data which is difficult to obtain in chaos engineering can also be generated. Accordingly, by using the learning data generated by the learning data generation apparatus 10 according to the embodiment, it is possible to construct a model capable of estimating an abnormal portion with high accuracy.
The present invention is not limited to the specifically disclosed embodiments, and various modifications, changes, combinations with known techniques, and the like can be made without departing from the scope of the claims.
| Filing Document | Filing Date | Country | Kind |
|---|---|---|---|
| PCT/JP2022/015591 | 3/29/2022 | WO | |