The present invention relates to a technique for detecting anomalies that occur as breakdown of correlations across data types, in a technique for monitoring a variety of data collected from a system to detect system anomalies.
A technique is proposed, in a system with a function of observing various data in real time, in which correlations between metrics at normal time are learned using normal data by projecting the correlations into a space of dimensions the number of which is smaller than that of a normal data space, and “degree of anomaly” of test data is output when correlations between metrics at normal time are disrupted in the test data (Non-Patent Documents 1 to 4).
In this technique, there is a problem that, as the number of observed pieces of data increases, the relationship between the normal data space and the projection data space to be learned increases in combination, resulting in an increase in the required normal data (Non-Patent Document 3).
An object of the present invention is to provide a technique for solving the problem in that, as the number of observed pieces of data increases, the space in which normal data is distributed increases in combination, in an anomaly detection technique in which multiple types of pieces of data are input.
According to the disclosed technique, there is provided an anomaly detection apparatus having a function of an autoencoder that includes an input layer, hidden layers, and an output layer, and that learns parameters such that data of the input layer is reproduced in the output layer, the anomaly detection apparatus including:
input means that input normal data of a plurality of types;
learning means that learn parameters such that normal data of the input layer is reproduced in the output layer by learning a feature across data types using data of a dimension which is smaller than a dimension of the normal data; and
anomaly degree calculation means that input test data to the autoencoder using the parameters learned by the learning means, and calculate anomaly degree of the test data based on output data of the autoencoder and the test data.
According to the disclosed technique, there is provided a technique for solving the problem in that, as the number of observed pieces of data increases, the space in which normal data is distributed increases in combination, in an anomaly detection technique in which multiple types of pieces of data are input.
Hereinafter, embodiments of the present invention (the present embodiment) will be described with reference to the drawings. The embodiments described below are only one example, and the embodiments to which the present invention is applied are not limited to the following embodiments.
(Apparatus Configuration)
The calculation unit 101 performs parameter learning processing in a learning phase and anomaly calculation processing in a test phase. Details of the processing performed by the calculation unit 101 will be described later. The storage unit 102 is a storage which stores various data, parameters, and the like. The input unit 103 inputs various types of data, and the output unit 104 outputs an anomaly degree.
(Example of Hardware Configuration)
The anomaly detection apparatus 100 can be implemented by causing a computer to execute a program describing the processing contents described in this embodiment. That is, the anomaly detection apparatus 100 can be realized by executing a program corresponding to the processing performed by the anomaly detection apparatus 100 using hardware resources such as a CPU or a memory embedded in a computer. The program can be recorded on a computer-readable recording medium (portable memory, etc.), stored or distributed. It is also possible to provide the program via a network, such as the Internet or e-mail.
A program for implementing processing by the computer is provided, for example, by a recording medium 151 such as a CD-ROM or a memory card. When the recording medium 151 on which the program is stored is set in the drive device 150, the program is installed in the auxiliary storage device 152 from the recording medium 151 through the drive device 150. However, the installation of the program need not necessarily be performed by the recording medium 151, and the program may be downloaded from another computer via a network. The auxiliary storage device 152 stores the installed program and stores necessary files, data, and the like.
The memory device 153 reads out the program from the auxiliary storage device 152 and stores the program when an instruction to start the program is issued. The CPU 154 performs the function related to the anomaly detection apparatus 100 according to a program stored in the memory device 153. The interface device 155 is used as an interface for connecting to a network. The display device 156 displays a GUI (Graphical User Interface) by a program or the like. The input device 157 may comprise a keyboard and a mouse, a button, touch panel, or the like, and may be used to input various operating instructions. The display device 156 may not be provided.
Hereinafter, examples 1 to 8 will be described as operation examples of the anomaly detection apparatus 100. The example 1 is the base processing, and the descriptions of the examples 2 to 8 basically cover the differences from the example 1 and the features added to the example 1. Further, any of the examples 1 to 8 may be implemented in combination unless there is a conflict.
Before describing each example in detail, an outline of each example will be described.
First, an outline of the operation performed by the anomaly detection apparatus 100 will be described. The anomaly detection method that is the basis of the anomaly detection method performed by the anomaly detection apparatus 100 is as follows.
First, data x is represented by a multidimensional numerical vector. Let f be a mapping from the original data space X to a different space Z, and let g be a mapping from Z back to the original data space X. When normal data is provided as x, f and g are learned such that the reconstruction error is as small as possible, where the reconstruction error is the distance between the original data and the reconstructed data obtained by projecting x to Z with f and then back to the original data space with g. Then, for test data subject to anomaly detection, the reconstruction error obtained when the test data is projected from X to Z by the mapping f and from Z to X by the mapping g is regarded as the anomaly degree of the data.
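The basic scheme above can be illustrated with a minimal sketch. It assumes, purely for illustration, that f and g are already fixed linear maps built from one orthonormal direction (in the described technique they are learned from normal data); the names `f`, `g`, `basis`, and `anomaly_degree` are hypothetical.

```python
import math


def f(x, basis):
    """Project x in X onto the lower-dimensional space Z spanned by `basis`."""
    return [sum(b_i * x_i for b_i, x_i in zip(b, x)) for b in basis]


def g(z, basis):
    """Map z in Z back to X (transpose of the projection used in f)."""
    dim = len(basis[0])
    return [sum(z_j * basis[j][i] for j, z_j in enumerate(z)) for i in range(dim)]


def anomaly_degree(x, basis):
    """Reconstruction error: squared distance between x and g(f(x))."""
    x_hat = g(f(x, basis), basis)
    return sum((a - b) ** 2 for a, b in zip(x, x_hat))


# Assumed toy setting: normal data lies near the line x2 = x1.
s = 1 / math.sqrt(2)
basis = [[s, s]]              # one orthonormal direction, so Z is one-dimensional
normal_point = [2.0, 2.0]     # respects the correlation between the two metrics
broken_point = [2.0, -2.0]    # the correlation is broken

print(anomaly_degree(normal_point, basis))
print(anomaly_degree(broken_point, basis))
```

The point that respects the correlation is reconstructed almost exactly, while the point that breaks it loses its whole component orthogonal to the learned direction and therefore receives a large anomaly degree.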
The outline of the anomaly detection method performed by the anomaly detection apparatus 100 according to the example 1 is as follows.
Normal data is input to the anomaly detection apparatus 100. Here, there are multiple types of data (K types in total). On the original data space X, data of type k exists on the subspace X_k of X, and the anomaly detection apparatus 100 learns a mapping f1_k (k=1, . . . , K) from X_k to the subspace Y_k of another space Y; a mapping f2 that maps the data on the space Y into the new space Z; a mapping g1 from Z to Y; and a mapping g2_k (k=1, . . . , K) from the subspace Y_k of Y to the subspace X_k of the original data space X.
Then, by using each learned mapping, the anomaly detection apparatus 100 regards the reconstruction error as the anomaly degree, the reconstruction error being obtained by projecting test data from X to Y by the mappings f1_k (k=1, . . . , K), from Y to Z by the mapping f2, from Z to Y by the mapping g1, and from Y to X by the mappings g2_k (k=1, . . . , K). Accordingly, anomaly detection is performed by extracting features that span data types.
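The chain of mappings in the example 1 can be sketched as follows. This is only a structural illustration: the mappings are passed in as arbitrary callables rather than learned, Y is assumed to be the concatenation of the subspaces Y_k, and the names `reconstruct`, `anomaly_degree`, and the toy mappings are hypothetical.

```python
from typing import Callable, Dict, List

Vector = List[float]
Mapping = Callable[[Vector], Vector]


def reconstruct(x: Dict[str, Vector], f1: Dict[str, Mapping], f2: Mapping,
                g1: Mapping, g2: Dict[str, Mapping]) -> Dict[str, Vector]:
    kinds = sorted(x)
    # X_k -> Y_k for every data type k, then concatenate the Y_k into Y.
    y_parts = {k: f1[k](x[k]) for k in kinds}
    y = [v for k in kinds for v in y_parts[k]]
    z = f2(y)        # Y -> Z: features spanning data types
    y_hat = g1(z)    # Z -> Y
    # Split Y back into the Y_k and map each Y_k -> X_k.
    out, pos = {}, 0
    for k in kinds:
        n = len(y_parts[k])
        out[k] = g2[k](y_hat[pos:pos + n])
        pos += n
    return out


def anomaly_degree(x: Dict[str, Vector], x_hat: Dict[str, Vector]) -> float:
    """Mean squared reconstruction error over all elements (assumed normalization)."""
    errs = [(a - b) ** 2 for k in x for a, b in zip(x[k], x_hat[k])]
    return sum(errs) / len(errs)


# Toy mappings (assumptions): "flow" is compressed to its mean, "mib" is kept as is.
identity = lambda v: list(v)
f1 = {"flow": lambda v: [sum(v) / len(v)], "mib": identity}
g2 = {"flow": lambda v: [v[0], v[0]], "mib": identity}

x_ok = {"flow": [1.0, 1.0], "mib": [2.0]}
x_bad = {"flow": [1.0, 3.0], "mib": [2.0]}
score_ok = anomaly_degree(x_ok, reconstruct(x_ok, f1, identity, identity, g2))
score_bad = anomaly_degree(x_bad, reconstruct(x_bad, f1, identity, identity, g2))
```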
In the example 2, in the example 1, data is classified from viewpoints other than data type, and each is considered to be present in the subspace X_k of X.
In the example 3, for each mapping in the example 1, the mapping f from space A to space B is learned as a composite mapping of n mappings, such as f=f_1∘f_2∘ . . . ∘f_n.
In the example 4, the space to which normal data is ultimately mapped in the example 1 is a space P instead of X. For a predetermined probability distribution F, the values obtained when normal data is mapped onto the space P are given as the parameters of F, and the anomaly degree of data is given as the value obtained by inverting the positive/negative of the likelihood function in which the normal data is observed under that probability distribution.
In the example 5, for a predetermined probability distribution G in the example 1, the values obtained when normal data is mapped onto the space Z are given as the parameters of G, random numbers are generated on a space Z′ according to the probability distribution, and a mapping from the space Z′ to Y is learned.
In the example 6, in the examples 1 to 5, the mapping parameters obtained by learning each mapping individually are used as initial values, and the mapping parameters are then learned such that the distance between the original data and the reconstructed data becomes small, so that the distance between the reconstructed data and the original data becomes small for each data type.
In the example 7, in the examples 1 to 6, a weight w_k considering the ease of reconstruction for each data type is used to give the anomaly degree of data as a weighted average of the reconstruction error for each data type.
In the example 8, in the examples 1 to 6, parameters are learned so as to minimize the weighted average of reconstruction errors for each data type when learning each mapping parameter, using the weight w_k considering the ease of reconstruction for each data type.
Hereinafter, examples 1 to 8 will be described in more detail.
First, the example 1 will be described. In the example 1, when performing anomaly detection by learning a correlation of various types of data, anomaly detection is performed such that feature extraction based on correlation of each data type and feature extraction based on learning of correlation across data types are performed. The various types of data may be referred to as multiple types of data. Various types of data include, for example, MIB data, flow data, syslog, CPU information, and the like. In the examples 1 to 8, anomaly detection is performed by unsupervised learning.
Here, an example of an algorithm in which an autoencoder (Non-Patent Document 2) is applied is shown as an anomaly detection algorithm for performing feature extraction performed by the anomaly detection apparatus 100. The autoencoder has an input layer, hidden layers, and an output layer, and is a function using a neural network to learn parameters for reproducing data of the input layer at the output layer. The anomaly detection apparatus 100 includes a function of the autoencoder. Specifically, the function of the autoencoder corresponds to calculation processing by the calculation unit 101.
Referring to the flowchart of
First, in the learning phase S101, learning data x_t (t=1, . . . , T) is input from the input unit 103. Learning data is normal data. The input learning data is stored in the storage unit 102. The input learning data is data which has K types of data, and is expressed as x_t={x_t^1, . . . , x_t^K}. Hereinafter, in some cases, t is omitted and the data of the k-th type is denoted by x^k. Note that data x is a vector of a certain dimension. Also, t represents time, for example.
Next, in S102, the calculation unit 101 performs learning of parameters using the learning data as follows. Here, the model to be learned is an anomaly detection model that applies a multimodal autoencoder consisting of five layers. For multimodal autoencoders, for example, Zhang, Hanwang, et al. “Start from scratch: Towards automatically identifying, modeling, and naming visual attributes.” Proceedings of the 22nd ACM international conference on Multimedia. ACM, 2014., etc. may be referenced.
An image diagram of a model of a multimodal autoencoder consisting of five layers is shown in
More specifically, for example, each of the second layer and the fourth layer includes nodes that extract features of the MIB data using data in which dimension is reduced from the MIB data, nodes that extract features of the flow data using data in which dimension is reduced from the flow data, nodes that extract features of syslog using data in which dimension is reduced from the syslog, and nodes that extract features of the CPU information using data in which dimension is reduced from the CPU information. The third layer node extracts features that span these data types by weighting and adding the output data from the second layer.
In
Specifically, the calculation unit 101 learns the parameters W^{k,(l)}, b^{k,(l)}, l=2, . . . , 5, for all k by solving the following optimization problem. That is, the calculation unit 101 obtains the parameters W^{k,(l)}, b^{k,(l)}, l=2, . . . , 5, that minimize the reconstruction error between the input data and the output layer data (using the following equation (1), here MSE), and stores the obtained parameters in the storage unit 102.
N_k is the dimension of the data type k, x^{k,(l)}, l=2, 4, and 5, denotes the output of the l-th layer for the k-th data type, and x^{(3)} denotes the output of the third layer, respectively as follows.
In each of the above equations, W^{k,(l)} is the connection weight from the (l-1)-th layer to the l-th layer for the k-th data type, b^{k,(l)} is the bias term of the l-th layer for the k-th data type, and φ^{(l)} is the activation function of the l-th layer. Here, the number of dimensions of the second and fourth layers is less than the number of dimensions of each data type. This reduces the dimensions for each data type, and the correlation across data types is learned using the dimension-reduced data at the third layer. Thus, it is possible to prevent the increase in the normal space in combination due to the increase in dimensions having no correlation across data types. The parameters learned by the calculation unit 101 are stored in the storage unit 102.
Learning of the aforementioned mappings f1_k (k=1, . . . , K), f2, g1, and g2_k (k=1, . . . , K) corresponds to learning of the parameters W^{k,(l)}, b^{k,(l)}, l=2, . . . , 5.
The test data x_test is input from the input unit 103 in S103 of the test phase. The calculation unit 101 calculates the vector x_test^{(5)} of the output layer using the above-described equations (2) to (5) based on the parameters W^{k,(l)}, b^{k,(l)}, l=2, . . . , 5, for all k stored in the storage unit 102, and calculates the mean squared error (MSE) between x_test and x_test^{(5)} as the anomaly degree (S104). The anomaly degree is output from the output unit 104 (S105). That is, in the test phase, test data is input to an autoencoder (here, a multimodal autoencoder) using the parameters learned in the learning phase, and the degree of anomaly of the test data is calculated based on the output data of the autoencoder and the test data.
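The test-phase computation of the example 1 can be sketched in plain Python as follows. The sketch rests on several assumptions that the text does not fix: each layer is an affine map followed by an activation, the third layer sums the per-type contributions W^{k,(3)} x^{k,(2)} + b^{k,(3)}, a sigmoid is used for the layers 2 to 4, the layer 5 is linear, and the MSE is averaged over all elements. The weights are random rather than learned, and all function names are hypothetical.

```python
import math
import random


def affine(W, b, x):
    """Compute W x + b for a weight matrix W (list of rows) and bias b."""
    return [sum(w * v for w, v in zip(row, x)) + b_i for row, b_i in zip(W, b)]


def sigmoid_vec(v):
    return [1.0 / (1.0 + math.exp(-a)) for a in v]


def init(rows, cols, rng):
    W = [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]
    return W, [0.0] * rows


def build_params(dims, hidden, shared, rng):
    """dims[k] = N_k, hidden[k] = size of layers 2 and 4 for type k, shared = size of layer 3."""
    return {k: {2: init(hidden[k], n, rng), 3: init(shared, hidden[k], rng),
                4: init(hidden[k], shared, rng), 5: init(n, hidden[k], rng)}
            for k, n in dims.items()}


def forward(params, x):
    # Layer 2: per-type feature extraction with reduced dimension.
    h2 = {k: sigmoid_vec(affine(*params[k][2], x[k])) for k in x}
    # Layer 3: features across data types, as a sum of per-type contributions.
    shared = len(next(iter(params.values()))[3][1])
    pre3 = [0.0] * shared
    for k in x:
        pre3 = [a + c for a, c in zip(pre3, affine(*params[k][3], h2[k]))]
    h3 = sigmoid_vec(pre3)
    # Layer 4: back to per-type features; layer 5: per-type reconstruction.
    h4 = {k: sigmoid_vec(affine(*params[k][4], h3)) for k in x}
    return {k: affine(*params[k][5], h4[k]) for k in x}


def mse(x, x_hat):
    sq = [(a - b) ** 2 for k in x for a, b in zip(x[k], x_hat[k])]
    return sum(sq) / len(sq)


rng = random.Random(0)
dims = {"flow": 4, "mib": 3}      # N_k for two assumed data types
hidden = {"flow": 2, "mib": 2}    # smaller than N_k, as required for layers 2 and 4
params = build_params(dims, hidden, 2, rng)
x_test = {"flow": [0.1, 0.4, 0.3, 0.9], "mib": [0.5, 0.2, 0.7]}
x_out = forward(params, x_test)
score = mse(x_test, x_out)        # anomaly degree of x_test under untrained weights
```

With learned parameters in place of the random ones, `score` would play the role of the anomaly degree output in S105.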
Next, the example 2 will be described. In the example 2, the data in the example 1 is not classified by data type, but is classified from other perspectives. Classification methods include classifying data according to attributes, for example, by data collection devices or by collection points. In addition, the data may be clustered in advance and categorized according to the cluster.
The processing contents of the anomaly detection apparatus 100 according to the example 2 are the same as those according to the example 1. In the example 2, the K types of data x_t={x_t^1, . . . , x_t^K} of the example 1 may be regarded as the data of K groups classified from the viewpoints of the example 2.
The term “data type” may be interpreted as including both the classification of data in the example 1 and the classification of data in the example 2.
Also in the example 2, it is possible to prevent the increase of normal space in combination due to the increase of dimensions having no correlation across data types.
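As an illustration of the grouping in the example 2, the following sketch splits one observation into K groups by collection device. The key format `device/metric`, the sample values, and the function name `split_into_groups` are assumptions made for this example only.

```python
from collections import defaultdict
from typing import Callable, Dict, List


def split_into_groups(record: Dict[str, float],
                      group_of: Callable[[str], str]) -> Dict[str, List[float]]:
    """Split one observation into per-group vectors x^1, ..., x^K."""
    groups = defaultdict(list)
    for key in sorted(record):    # fixed key order keeps vector positions stable
        groups[group_of(key)].append(record[key])
    return dict(groups)


record = {
    "router1/cpu": 0.31, "router1/if_in": 120.0,
    "router2/cpu": 0.12, "router2/if_in": 80.0,
}
# Classify by collection device instead of by data type.
by_device = split_into_groups(record, lambda key: key.split("/")[0])
print(by_device)
```

Each resulting group can then be fed to the model of the example 1 in place of one data type.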
Next, the example 3 will be described. In the example 3, more complicated feature extraction is performed by further increasing the number of layers of the autoencoder in the example 1. For example, by increasing the number of layers between the first and the second layers and between the fourth and the fifth layers, it is expected that more complicated features can be extracted in feature extraction for each data type. Also, by increasing the number of layers between the second and the third layers and/or the number of layers between the third and the fourth layers, it is expected that more complicated features can be extracted in feature extraction of the entire data.
According to the example 3, it is also possible to prevent the increase in normal space in combination due to the increase of dimensions having no correlation across data types. Also, in the example 3, more accurate dimensional reduction can be achieved by increasing the number of layers, particularly when data are complex and dimensional reduction is difficult.
Next, the example 4 will be described. In the example 4, instead of representing the degree of anomaly of data by the MSE between the input and output layers as in the example 1, the value obtained by inverting the positive and negative of a likelihood function in which data of the input layer is observed under a predetermined probability distribution, using a value at the output layer as a parameter, is regarded as the anomaly degree. The value obtained by inverting the positive and negative of a likelihood function may be rephrased as “a value obtained by multiplying the likelihood function by minus one”. Such a definition of the anomaly degree is also used in anomaly detection using a Variational Autoencoder. For example, An, Jinwon, and Sungzoon Cho. Variational Autoencoder based Anomaly Detection using Reconstruction Probability. Technical Report, 2015., etc. may be referenced for anomaly detection using a Variational Autoencoder.
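The composite mapping used in the example 3 can be sketched as a simple function composition. The text does not state the order in which the component mappings are applied, so the sketch assumes they are applied in the order listed; `compose` and the toy layers are hypothetical names.

```python
from functools import reduce
from typing import Callable, List

Vector = List[float]
Layer = Callable[[Vector], Vector]


def compose(*layers: Layer) -> Layer:
    """Return the mapping that applies `layers` one after another, in the order given."""
    return lambda x: reduce(lambda acc, layer: layer(acc), layers, x)


# Two toy layers standing in for learned mappings between the first and second layers.
scale = lambda v: [2.0 * a for a in v]
shift = lambda v: [a + 1.0 for a in v]

deep_f1 = compose(scale, shift)   # one mapping realized as two stacked layers
print(deep_f1([1.0, 3.0]))
```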
In S401, learning data x_t (t=1, . . . , T) and a probability distribution F (x, θ) are input from the input unit 103. The input learning data is stored in the storage unit 102.
In S402, the calculation unit 101 obtains the parameters W^{k,(l)}, b^{k,(l)}, l=2, . . . , 5 so as to minimize the value obtained by inverting the positive and negative of the likelihood function in which the data of the input layer is observed under F(x, θ) with the value in the output layer as θ, and stores the obtained parameters in the storage unit 102. That is, in the example 4, the calculation unit 101 solves the optimization problem of the equations (1) to (5) of the example 1 in which the objective function of equation (1) is replaced by the value obtained by inverting the positive and negative of the likelihood function in which the data of the input layer is observed under F(x, θ) with the value in the output layer as θ.
The test data x_test is input from the input unit 103 in S403 of the test phase.
In S404, the calculation unit 101 calculates a value, as the anomaly degree, obtained by inverting the positive and negative of the likelihood function in which the data of the input layer is observed under F(x, θ) where the value in the output layer is θ, using the parameters W^{k,(l)}, b^{k,(l)}, l=2, . . . , 5 read from the storage unit 102. In S405, the output unit 104 outputs the anomaly degree of the test data.
As noted above, in Example 4, instead of calculating the MSE in Example 1, the positive/negative inverted value of the likelihood function is calculated and the parameters are learned to minimize the value. When defining the anomaly degree, the value is defined as the anomaly degree in the test data.
The example 4 also prevents increase in normal space in combination due to increase of dimensions having no correlation across data types.
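A minimal sketch of the anomaly degree of the example 4 follows. It assumes, for illustration only, that F is a normal distribution with independent dimensions whose mean and variance are taken from the output layer, and it uses the negative log-likelihood (a common monotone variant) in place of the sign-inverted likelihood itself; the name `gaussian_nll` is hypothetical.

```python
import math
from typing import List


def gaussian_nll(x: List[float], mean: List[float], var: List[float]) -> float:
    """Negative log-likelihood of x under independent normal distributions."""
    return sum(
        0.5 * (math.log(2.0 * math.pi * v) + (a - m) ** 2 / v)
        for a, m, v in zip(x, mean, var)
    )


# theta = (mean, var) stands in for the values produced at the output layer.
mean, var = [0.0, 1.0], [1.0, 1.0]
near = gaussian_nll([0.1, 0.9], mean, var)   # input close to what the model expects
far = gaussian_nll([3.0, -2.0], mean, var)   # input far from what the model expects
```

An input that the model explains poorly has a low likelihood and therefore a larger value of `gaussian_nll`, which is the direction required of an anomaly degree.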
Next, the example 5 will be described. In the example 5, a new layer is defined between the third layer and the fourth layer of the example 1, random numbers generated according to a predetermined probability distribution in which the values of the third layer are used as parameters are used as the values of the new layer, and these values are mapped to the fourth layer of the example 1. The mapping that introduces such random numbers is also performed in the Variational Autoencoders (e.g., An, Jinwon, and Sungzoon Cho. Variational Autoencoder based Anomaly Detection using Reconstruction Probability. Technical Report, 2015), where a normal distribution is given as the probability distribution, random numbers are generated by regarding the values in the third layer as the mean and variance of the normal distribution, and the random numbers are then used as the values of the fourth layer.
The processing flow of the anomaly detection apparatus 100 according to the example 5 is basically the same as the processing flow shown in
That is, in the example 5, a predetermined half of the dimensions of x^{(3)} calculated by equation (3) is regarded as the mean and the remaining half as the variance, the covariance is set to 0, and a random number x^{(3)′} whose number of dimensions is half that of x^{(3)} is generated according to the normal distribution. The equation for mapping x^{(3)′} to the fourth layer is the equation (4) in which the input x^{(3)} is replaced by x^{(3)′}, and W^{k,(4)} and b^{k,(4)} are a matrix and a vector matched to the dimensions of x^{(3)′}. The same is true for the test phase.
The example 5 also prevents increase in normal space in combination due to increase of dimensions having no correlation across data types.
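The sampling step of the example 5 can be sketched as follows. The sketch assumes that the first half of the third-layer output holds the means and the second half holds the variances (the text only says that the halves are predetermined), and that the variance half is already non-negative; `sample_new_layer` is a hypothetical name.

```python
import math
import random
from typing import List


def sample_new_layer(x3: List[float], rng: random.Random) -> List[float]:
    """Draw the values of the new layer from a normal distribution with zero covariance."""
    if len(x3) % 2 != 0:
        raise ValueError("the third layer must have an even number of dimensions")
    half = len(x3) // 2
    means, variances = x3[:half], x3[half:]
    if any(v < 0.0 for v in variances):
        raise ValueError("variances must be non-negative")
    # Each dimension is sampled independently because the covariance is set to 0.
    return [rng.gauss(m, math.sqrt(v)) for m, v in zip(means, variances)]


rng = random.Random(0)
x3 = [0.2, -0.4, 0.05, 0.10]          # assumed layout: two means, then two variances
x3_prime = sample_new_layer(x3, rng)  # two-dimensional input to the fourth layer
```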
Next, the example 6 will be described. The example 6 is based on the example 1 or the example 4. In the case in which the example 1 is used as the base, the learning of the mappings in the example 1 is first performed individually for each data type, and the learning of the parameters is performed so that the reconstruction is sufficiently performed for each data type. Likewise, learning the mappings individually for each data type is also applicable to the examples 2 to 5.
For example, when a model based on the autoencoder of the example 1 is used, parameters are learned by optimization of equation (1). At this time, when data types that are easy to reconstruct and data types that are difficult to reconstruct are mixed, the square error in the equation (1) is likely to be larger for the latter, and all parameters are updated so as to reduce the square error of the latter. Therefore, the learning needed to reconstruct the former data types may be insufficient.
Accordingly, in the example 6, parameters by which reconstruction is sufficiently performed for each data type are first learned, and learning is performed such that the overall reconstruction error is reduced by setting the parameters as the initial values.
In S511, the learning data x_t (t=1, . . . , T) is input from the input unit 103. The input learning data is stored in the storage unit 102.
Subsequently, in S512 and S513, as prior learning, the calculation unit 101 performs learning of the parameters W^{k,(l)}, b^{k,(l)}, l=2,5, for all k so as to sufficiently perform the reconstruction by feature extraction for each data type, and learning of the parameters W^{k,(l)}, b^{k,(l)}, l=3,4, for all k so as to sufficiently perform the reconstruction by feature extraction for all data types, respectively.
That is, as shown in
Next, as shown in
Then, as shown in
More specifically, in S514 of
The processing contents of the test phases S515 to S517 are the same as those of S103 to S105 of
When the example 4 is used as the base, the calculation unit 101 learns the parameters so as to minimize the value obtained by inverting the positive and negative of the likelihood function in the learning of W^{k,(l)}, b^{k,(l)}, l=2,5, for all k in the same manner as in the example 4, and learns the parameters such that the reconstruction error becomes small in the learning of W^{k,(l)}, b^{k,(l)}, l=3,4, for all k.
In S541, the learning data x_t (t=1, . . . , T) and the probability distribution F (x, θ) are input from the input unit 103.
In S542, the calculation unit 101 determines parameters that minimize the value obtained by inverting the positive and negative of the likelihood function in which the data of the input layer is observed under F(x, θ) where the value in the output layer of the autoencoder using W^{k,(l)}, b^{k,(l)}, l=2,5, for all k is θ, using x_t as learning data.
In S543, the calculation unit 101 determines parameters for minimizing the reconstruction error of the autoencoder using W^{k,(l)}, b^{k,(l)}, l=3,4, for all k by using data converted from x_t according to equation (2) using W^{k,(l)}, b^{k,(l)}, l=2, for all k, as learning data.
Then, in S544, the calculation unit 101 determines the parameters W^{k,(l)}, b^{k,(l)}, l=2, . . . , 5, for all k to minimize the value obtained by inverting the positive and negative of the likelihood function in which the data of the input layer is observed under F(x, θ) with the value in the output layer as θ, using the already obtained parameters as the initial values, and stores the obtained parameters in the storage unit 102.
The processing contents in the test phases S545-S547 are the same as those of S403-S405 of
According to the example 6, in addition to the effect of the examples 1 and 4, there is an effect of solving the problem that the ease of learning for each data type affects the learning of the correlation of the whole data and the calculation of the anomaly degree of test data.
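The order of the learning stages in the example 6 can be sketched as below. Only the schedule is shown: the optimizer is left as a caller-supplied callable `fit(params, trainable)`, which is a hypothetical interface, and the demonstration uses a stand-in that merely records which (data type, layer) pairs would be trained.

```python
from typing import Callable, Dict, List, Tuple

LayerKey = Tuple[str, int]                      # (data type k, layer index l)
Params = Dict[LayerKey, object]
Fit = Callable[[Params, List[LayerKey]], Params]


def staged_learning(types: List[str], params: Params, fit: Fit) -> Params:
    # Prior learning 1: layers 2 and 5, individually for each data type.
    for k in types:
        params = fit(params, [(k, 2), (k, 5)])
    # Prior learning 2: layers 3 and 4, using the layer-2 outputs of all data types.
    params = fit(params, [(k, l) for k in types for l in (3, 4)])
    # Final learning: all layers jointly, starting from the values obtained above.
    return fit(params, [(k, l) for k in types for l in (2, 3, 4, 5)])


log: List[List[LayerKey]] = []


def recording_fit(params: Params, trainable: List[LayerKey]) -> Params:
    """Stand-in for a real optimizer: records the trainable parameters and returns params."""
    log.append(list(trainable))
    return params


staged_learning(["flow", "mib"], {}, recording_fit)
for stage in log:
    print(stage)
```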
Next, the example 7 will be described. In the example 7, in the calculation of the anomaly degree in the examples 1-6, weighting is performed in consideration of the ease of reconstruction for each data type or the magnitude of the likelihood. For example, in the case of the example 1, when the ease of reconstruction varies by data type, the change in reconstruction error caused by an anomaly occurring in data that is easy to reconstruct may be smaller than the reconstruction error at normal time of data that is difficult to reconstruct, so that such an anomaly cannot be detected.
Accordingly, in the example 7, the calculation unit 101 performs calculation of the MSE as a weighted average squared error as follows.
In the equation (6) above, w_k is a coefficient representing easiness of reconstruction of data type k. The easier the data type k is to reconstruct, the larger the coefficient w_k is, and the more difficult the data type k is to reconstruct, the smaller the coefficient w_k is. This is because data that is easily reconstructed tends to have a smaller reconstruction error, and data that is difficult to reconstruct tends to have a larger reconstruction error, thus the coefficient is used for offsetting the difference. For example, w_k may be given as a reciprocal of the mean of the distribution of reconstruction errors when normal data is input into a learned model.
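The effect of the weights w_k can be illustrated with the following sketch. It follows the suggestion above of using the reciprocal of the mean reconstruction error on normal data as w_k; the normalization by the sum of the weights, the toy error values, and all function names are assumptions, and the exact form of equation (6) is not reproduced here.

```python
from typing import Dict, List


def mean(values: List[float]) -> float:
    return sum(values) / len(values)


def fit_weights(normal_errors: Dict[str, List[float]]) -> Dict[str, float]:
    """w_k = reciprocal of the mean reconstruction error of type k on normal data."""
    return {k: 1.0 / mean(errs) for k, errs in normal_errors.items()}


def weighted_anomaly_degree(errors: Dict[str, float], weights: Dict[str, float]) -> float:
    total = sum(weights[k] for k in errors)
    return sum(weights[k] * errors[k] for k in errors) / total


def plain_anomaly_degree(errors: Dict[str, float]) -> float:
    return mean(list(errors.values()))


# Assumed per-type reconstruction errors of a learned model on normal data.
normal_errors = {"syslog": [0.01, 0.03], "flow": [1.0, 3.0]}
w = fit_weights(normal_errors)

normal_level = {"syslog": 0.02, "flow": 2.0}
anomalous = {"syslog": 0.20, "flow": 2.0}     # tenfold error in the easy data type

plain_ratio = plain_anomaly_degree(anomalous) / plain_anomaly_degree(normal_level)
weighted_ratio = weighted_anomaly_degree(anomalous, w) / weighted_anomaly_degree(normal_level, w)
```

In this toy setting the unweighted score rises by less than ten percent, whereas the weighted score rises by a factor of about 5.5, so the anomaly in the easily reconstructed data type is no longer masked.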
In the case of the example 4, for data where the likelihood function is likely to be large, the value obtained by inverting the positive/negative of the likelihood function is likely to be small and, as in the case of the example 1, an anomaly in such data may go undetected. Therefore, when the weighting of the example 7 is applied to the example 4, the value obtained by weighting, for each data type, the value obtained by inverting the positive and negative of the likelihood function is regarded as the anomaly degree in the same way. That is, when the weighting of the example 7 is applied to the example 4, w_k becomes larger when the value obtained by inverting the positive and negative of the likelihood function in data type k is smaller, and w_k becomes smaller when the value obtained by inverting the positive and negative of the likelihood function in data type k is larger.
Also according to the example 7, there is an effect of solving the problem that the ease of learning for each data type affects the learning of the correlation of the whole data and the calculation of the anomaly degree of test data.
Next, the example 8 will be described. In the example 8, at the time of learning in the examples 1-7, weighting is performed in consideration of the ease of reconstruction for each data type or the magnitude of the likelihood function, like the example 7. This is to prevent the learning from being dominated by minimization of the MSE of data that is difficult to reconstruct, or by minimization of the value obtained by inverting the positive and negative of the likelihood function for data for which the likelihood function is likely to be small, as described in the example 6. In the example 8, the calculation unit 101 determines parameters that minimize the following equation in the learning phase.
As a method of giving w_k, for example, in the case of the example 1, the reciprocal of the mean of the distribution of the reconstruction error when normal data is input to the model using the parameters at that time may be used.
Also according to the example 8, there is an effect of solving the problem that the ease of learning for each data type affects the learning of the correlation of the whole data and the calculation of the anomaly degree of test data.
As described above, the technique according to the present embodiment can solve the increase in combination of normal states to be learned and the influence of differences of ease of learning. Here, using the example 6 and the example 7 based on the example 1, the effect of this technique is shown by the result of anomaly detection by the anomaly detection apparatus 100 in a test bed network.
Three types of data, flow data, MIB data, and syslog, were collected from the testbed network, and feature vectors were generated by a combination of key and value as shown in
As shown in
In addition, in order to show that the MAE learns the correlation between data after performing dimensional reduction of each data type so as to solve the problem that the space where the normal data is distributed increases in combination,
As described above, according to the present embodiment, there is provided an anomaly detection apparatus having a function of an autoencoder that includes an input layer, hidden layers, and an output layer, and that learns parameters such that data of the input layer is reproduced in the output layer, the anomaly detection apparatus including:
input means that input normal data of a plurality of types;
learning means that learn parameters such that normal data of the input layer is reproduced in the output layer by learning a feature across data types using data of a dimension which is smaller than a dimension of the normal data; and
anomaly degree calculation means that input test data to the autoencoder using the parameters learned by the learning means, and calculate anomaly degree of the test data based on output data of the autoencoder and the test data.
The input unit 103 described in the embodiment is an example of the input means, and the calculation unit 101 is an example of the learning means and the anomaly degree calculation means.
The autoencoder may include a first layer as the input layer, a second layer, a third layer and a fourth layer which are three layers as the hidden layers, and a fifth layer as the output layer, and the learning means may extract, in the second layer and the fourth layer, a feature for each data type with a dimension smaller than a dimension of the normal data, and extract, in the third layer, a feature across data types.
The learning means may learn the parameters so as to minimize an MSE between data of the input layer and data of the output layer, or to minimize a value obtained by inverting positive and negative of a likelihood function in which data of the input layer is observed under a predetermined probability distribution, using a value at the output layer as a parameter.
The learning means may learn parameters in the autoencoder by using parameters, as initial values, obtained by individually executing learning for each data type.
The anomaly degree calculation means may calculate an anomaly degree of the test data as a weighted average of reconstruction error for each data type, by using a weight for each data type.
The learning means may perform learning of parameters so that a weighted average of reconstruction error for each data type is minimized using a weight for each data type.
In addition, according to the present embodiment, there is provided an anomaly detection method performed by an anomaly detection apparatus having a function of an autoencoder that includes an input layer, hidden layers, and an output layer, and that learns parameters such that data of the input layer is reproduced in the output layer, the anomaly detection method including:
an input step of inputting normal data of a plurality of types;
a learning step of learning parameters such that normal data of the input layer is reproduced in the output layer by learning a feature across data types using data of a dimension which is smaller than a dimension of the normal data; and
an anomaly degree calculation step of inputting test data to the autoencoder using the parameters learned in the learning step, and calculating anomaly degree of the test data based on output data of the autoencoder and the test data.
In addition, according to the present embodiment, there is provided a program for causing a computer to function as each means in the anomaly detection apparatus.
Although the present embodiment has been described above, the present invention is not limited to such specific embodiments, and various modifications and changes are possible within the scope of the present invention as claimed.
The present patent application claims priority based on Japanese patent application No. 2017-212801, filed in the JPO on Nov. 2, 2017, and the entire contents of the Japanese patent application No. 2017-212801 are incorporated herein by reference.
Number | Date | Country | Kind |
---|---|---|---|
2017-212801 | Nov 2017 | JP | national |
Filing Document | Filing Date | Country | Kind |
---|---|---|---|
PCT/JP2018/039987 | 10/26/2018 | WO | 00 |