The present invention relates to an anomaly detection apparatus, an anomaly detection method, and a program.
There is a growing need in various networking domains for functions to process, analyze, and evaluate network data (such as sys-log, FLOW, and mib) and to automatically detect and identify anomalous patterns in the data. In recent years, studies on network anomaly detection using deep neural networks have increased rapidly, and they have obtained higher accuracy and more robust results than known rule-based detection methods (NPLs 1 and 2).
There are primarily two types of network anomaly detection: supervised and unsupervised anomaly detection. In supervised anomaly detection, labels of both normal data instances and anomalous data instances are used to train a supervised binary or multi-class classifier. While supervised learning methods have improved in performance, labeled training data for these methods is scarce in actual use scenes (NPL 1).
The unsupervised deep anomaly detection technology is a technique for reconfiguring data without labels; an autoencoder, a representative technique thereof (NPL 1), is a neural network that learns to copy its input to its output.
Furthermore, long short-term memory (LSTM) networks used to detect time series outliers (NPL 2) have the ability to maintain a long-term memory; thus, stacking recurrent hidden layers also enables the learning of higher-level temporal change features, for faster learning of time series changes with sparser representations. The LSTM-based autoencoder for time series anomaly detection is also a type of unsupervised anomaly detection: an encoder learns a vectorial representation of the input time series, and a decoder reconstructs the time series by using the context representation passed forward from the encoder.
NPL 1: Sakurada, Mayu, and Takehisa Yairi, “Anomaly detection using autoencoders with nonlinear dimensionality reduction.” Proceedings of the MLSDA 2014 2nd Workshop on Machine Learning for Sensory Data Analysis. ACM, 2014.
NPL 2: Malhotra, Pankaj, et al., “Long short term memory networks for anomaly detection in time series.” Proceedings. Presses universitaires de Louvain, 2015.
However, due to mobile networks, social media, and the like in the IoT era, network data has increased explosively. Once the data becomes large due to the increased number of connected devices, it is not possible to label all of the data. Also, as for dynamic concept drift (a temporal change in the concept to be learned from the data, e.g., a continued change in a feature value due to system updates), collecting vast amounts of data over a certain period of time so that batch learning or re-learning (off-line learning) can always be performed, as in the related art, is very inefficient in large-scale ICT systems.
In anomaly detection algorithms in the related art, a static training data set is assumed to be prepared before actual detection is performed, and is used to extract all necessary information. In other words, it is necessary to learn all at once by using the data set, and hence to prepare a data set containing sufficient information. This approach is not suitable for situations where a complete training data set is not available in advance.
That is, while various algorithms that learn patterns only at normal time have been proposed for anomaly detection in network environments of the related art, none of these algorithms is designed for online learning of high-accuracy time series analysis on large-scale data based on unsupervised training algorithms.
The present invention has been made in view of the above and has an object to enable improvement in accuracy for the anomaly detection and efficient learning.
In order to solve the above-described problem, an anomaly detection apparatus includes an anomaly detection unit configured to perform anomaly detection on time series data, wherein the anomaly detection unit includes an encoding unit configured to encode the time series data by using a plurality of LSTM cells, an attention layer configured to calculate a weight of attention on an output from the encoding unit, a context generation unit configured to generate a context vector by applying the weight to the output from the encoding unit, and a decoding unit configured to reconfigure the time series data by using the plurality of LSTM cells in accordance with the context vector.
It is possible to enable improvement in accuracy for anomaly detection and efficient learning.
Hereinafter, an embodiment of the present disclosure will be described with reference to the drawings.
A program that realizes processing in the anomaly detection apparatus 10 is provided via a recording medium 101 such as a CD-ROM. When the recording medium 101 having the program stored therein is set in the drive device 100, the program is installed in the auxiliary storage device 102 from the recording medium 101 via the drive device 100. However, the program does not necessarily have to be installed from the recording medium 101, and may be downloaded from another computer via a network. The auxiliary storage device 102 stores the installed program and also stores necessary files, data, and the like.
The memory device 103 reads and stores the program from the auxiliary storage device 102 when the program is instructed to start. The CPU 104 executes the function for the anomaly detection apparatus 10 according to the program stored in the memory device 103. The interface device 105 is used as an interface for connection to a network.
The preprocessing unit 11 performs preprocessing on stream data (time series data) such as network data (sys-log, FLOW, mib, or the like) that is a target of anomaly detection. The anomaly detection unit 12 is a neural network that detects anomalies in accordance with the preprocessed time series data. The learning unit 13 controls learning processing of a learning parameter group of the neural network as the anomaly detection unit 12.
Such an anomaly detection unit 12 is a model capable of unsupervised learning of time series data and, unlike a known autoencoder, is a model that captures dependencies in the time series data. Furthermore, because the anomaly detection unit 12 is equipped with attention (i.e., includes the attention layer 123), it is possible to improve the accuracy of the anomaly detection by generating dynamic hidden information for each decoding step.
A compact form of forward equations for the LSTM cell is as follows.
[Math. 1]
ft = σ(Wf[ht−1, xt] + bf)  (1)
it = σ(Wi[ht−1, xt] + bi)  (2)
ct = ft*ct−1 + it*tanh(Wc[ht−1, xt] + bc)  (3)
ot = σ(Wo[ht−1, xt] + bo)  (4)
ht = ot*tanh(ct)  (5)
Here, ft, it, and ot represent the respective outputs of the forget gate, the input gate, and the output gate, and ct and ht represent the cell state and the hidden state output by the LSTM cell, respectively. Details related to the LSTM, such as the backward learning method, are described in reference document 1, for example.
Equations (1) to (5) can be combined as follows.
[Math. 2]
ht, ct = LSTM(ht−1, ct−1, xt)  (6)
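For illustration, the following is a minimal NumPy sketch of one forward step of an LSTM cell following Equations (1) to (6); the function name lstm_cell_step, the weight shapes, and the way the parameters are passed are assumptions made for the sketch, not part of the present embodiment.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_cell_step(x_t, h_prev, c_prev, W_f, b_f, W_i, b_i, W_c, b_c, W_o, b_o):
    """One LSTM forward step following Equations (1) to (5).

    x_t: input vector at time t; h_prev/c_prev: previous hidden and cell states.
    Each W_* is assumed to have shape (hidden_dim, hidden_dim + input_dim) and
    acts on the concatenation [h_{t-1}, x_t].
    """
    z = np.concatenate([h_prev, x_t])                    # [h_{t-1}, x_t]
    f_t = sigmoid(W_f @ z + b_f)                         # (1) forget gate
    i_t = sigmoid(W_i @ z + b_i)                         # (2) input gate
    c_t = f_t * c_prev + i_t * np.tanh(W_c @ z + b_c)    # (3) cell state
    o_t = sigmoid(W_o @ z + b_o)                         # (4) output gate
    h_t = o_t * np.tanh(c_t)                             # (5) hidden state
    return h_t, c_t                                      # compact form of Equation (6)
```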
LSTM Autoencoder Model with Attention
The input to the anomaly detection unit 12 (the LSTM autoencoder with attention) is window data X = {x1, x2 . . . xT}, X∈Rm×T, obtained from the preprocessed time series.
The encoding unit 121 (encoder) encodes the window data by using a network connecting T LSTM cells, in accordance with Equation (7) below.
[Math. 3]
hT, cT = LSTM(hT−1, cT−1, xT) = LSTM(LSTM(hT−2, cT−2, xT−1), xT) = LSTM(LSTM( . . . LSTM(h0, c0, x1) . . . , xT−2), xT−1), xT)  (7)
Here, hT and cT represent context vectors generated by LSTM-encoding the window data of size T over T consecutive steps.
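Under the same assumptions, the following short sketch shows how the encoding unit 121 could fold a window of size T into the context (hT, cT) of Equation (7), reusing the lstm_cell_step sketch above; the name encode_window and the zero initialization of (h0, c0) are illustrative choices rather than requirements of the text.

```python
import numpy as np

def encode_window(X, params, hidden_dim):
    """Fold a window X = [x_1, ..., x_T] of shape (T, m) into (h_T, c_T) per Equation (7)."""
    h = np.zeros(hidden_dim)   # h_0
    c = np.zeros(hidden_dim)   # c_0
    hs = []                    # encoder outputs h_1 ... h_T, later consumed by the attention layer
    for x_t in X:
        h, c = lstm_cell_step(x_t, h, c, **params)  # repeated application of Equation (6)
        hs.append(h)
    return np.stack(hs), h, c  # (all hidden states, h_T, c_T)
```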
The decoding unit 122 (decoder) uses the context vectorial representation to generate data Y = {y1, y2 . . . yT}, Y∈Rm×T, which reconfigures the input window X = {x1, x2 . . . xT}, X∈Rm×T. The decoding unit 122, similarly to the encoding unit 121, uses a network connecting T LSTM cells and, in each decoding step, generates each piece of reconfigured data yt∈Y in the reverse order Y′ = {yT, yT−1 . . . y1}. An initial state (h0de, c0de) of the decoding unit 122 is the output (hT, cT) of the encoding unit 121.
[Math. 4]
yt = LSTM(htde, ctde, yt+1),  t∈(T, T−1 . . . 1)  (8)
Here, in order to achieve a highly accurate and highly interpretable time series decoder, the decoding unit 122 uses a dynamic context vector that is calculated by the dynamic context generation unit 124 in accordance with Equation (9) below, unlike reference document 1 in which the fixed ctde used in each decoding step is expressed as ctde ≡ cT.
Here, t∈(T, T−1, . . . , 1) represents each decoding step of the decoding unit 122, and hi denotes the hidden vector output by the encoding unit 121 at encoding step i. In other words, in each decoding step t∈(T, T−1 . . . 1) of the decoding unit 122, the context vector ctde is a weighted sum of the output vectors h1, h2, h3 . . . , hT of the encoding unit 121, and the vectors h1, h2, h3 . . . , hT are scaled by a degree of relevance to the input vector X = {x1, x2 . . . xT} based on an attention weight ati.
The attention layer 123 calculates the attention weight ati in accordance with Equations (10) and (11) below.
Here, Wa and b represent learnable attention weights, and the concatenation [hi, ci−1] means that the attention weight is generated with not only the input information of the decoding unit 122 but also the past hidden state information taken into consideration. Unlike the known fixed context information, generating dynamic hidden information at each step of the decoding unit 122 can improve the accuracy of the anomaly detection.
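Equations (9) to (11) themselves are not reproduced above, so the following is only a hedged sketch of one common formulation consistent with the description: each encoder output hi is scored from the concatenation [hi, ci−1] through learnable parameters Wa and b, the scores are normalized by a softmax into attention weights ati, and the dynamic context ctde is the resulting weighted sum of h1 . . . hT. The scoring function, the parameter shapes, and the names softmax and dynamic_context are assumptions, not the exact formulas of the embodiment.

```python
import numpy as np

def softmax(v):
    e = np.exp(v - np.max(v))
    return e / e.sum()

def dynamic_context(enc_hs, c_prev_dec, W_a, b_a):
    """Assumed form of the dynamic context for one decoding step (cf. Eqs. (9)-(11)).

    enc_hs: encoder outputs h_1..h_T, shape (T, hidden_dim).
    c_prev_dec: previous decoder cell state c_{t-1}, shape (hidden_dim,).
    W_a: assumed learnable weight vector of shape (2 * hidden_dim,); b_a: scalar bias.
    """
    # score each encoder output from the concatenation [h_i, c_{t-1}]
    scores = np.array([W_a @ np.concatenate([h_i, c_prev_dec]) + b_a for h_i in enc_hs])
    a_t = softmax(scores)                          # attention weights a_t1 ... a_tT
    c_t_de = (a_t[:, None] * enc_hs).sum(axis=0)   # weighted sum of h_1 ... h_T
    return c_t_de, a_t
```

In each decoding step, the context returned by such a function would take the place of the fixed cT of reference document 1 before the decoder LSTM cell of Equation (8) is applied.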
Online Learning
In recent years, online learning algorithms have been of great interest in developing intelligent agents that can perform online learning in complex IoT environments (‘Zhou, Guanyu, Kihyuk Sohn, and Honglak Lee. “Online incremental feature learning with denoising autoencoders.” In Artificial intelligence and statistics, pp. 1453-1461, (2012).’).
Online learning is a technique that processes data in real time and learns continually, and is also called incremental learning. Such learning methods can initiate anomaly detection as soon as possible with very little initial knowledge and apply that knowledge when new data becomes available. This learning technique is capable of processing time series data. An online learning model, unlike known batch learning, is trained stepwise as new information becomes available; the model weights are updated at all times, which is also believed to be effective for prediction on stream data and for anomaly detection.
However, most related studies emphasize that test-then-train is performed each time data is provided, but do not address a scenario, likely to occur in IoT networks, in which no labels are available when the data streams arrive in order in real applications. In that case, calculating a loss function or updating training parameters also becomes impossible (‘Mohammadi, Mehdi, Ala Al-Fuqaha, Sameh Sorour, and Mohsen Guizani. “Deep learning for IoT big data and streaming analytics: A survey.” IEEE Communications Surveys & Tutorials 20, no. 4: 2923-2960. (2018)’).
Subsequently, the learning processing in the anomaly detection unit 12 will be described.
In step S101, the preprocessing unit 11 performs preprocessing (aggregation, normalization, etc.) on potentially infinite time series data Xall = {x1, x2, . . . }. Here, the preprocessing such as aggregation or normalization depends on the type of data, and thus details thereof will be omitted. The preprocessing unit 11 also writes the preprocessed data into consecutive windows. For example, if the window size is T, the preprocessed data is written into a window X1 = {x1, x2 . . . , xT}, a window X2 = {xT+1, xT+2 . . . , x2T}, and so on, as sketched below. Note that the chunk size is N, and one data chunk includes N consecutive windows. In a case that N is 1, window-by-window learning is performed, and in a case that N > 1, chunk-by-chunk learning is performed. The larger the chunk size is, the faster the learning converges, but the longer the test interval becomes.
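A rough sketch of the windowing and chunking of step S101 is shown below, assuming the aggregation and normalization have already been applied; the helper names to_windows and to_chunks are placeholders introduced for this sketch.

```python
def to_windows(stream, T):
    """Split the preprocessed sequence into consecutive, non-overlapping windows of size T."""
    return [stream[i:i + T] for i in range(0, len(stream) - T + 1, T)]

def to_chunks(windows, N):
    """Group N consecutive windows into one chunk; N = 1 gives window-by-window learning."""
    return [windows[i:i + N] for i in range(0, len(windows) - N + 1, N)]
```

For example, to_chunks(to_windows(x_all, T), N) yields chunks of N windows each, which are then consumed one chunk at a time in steps S103 and later.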
Next, the anomaly detection unit 12 inputs each piece of preprocessed data Xn∈{X1, X2, . . . , XN} in one chunk into Equation (8) to generate each piece of reconfigured data Yn∈{Y1, Y2, . . . , YN} (S103).
Subsequently, the learning unit 13 calculates a reconfiguration error in accordance with the following loss function (S104).
Here, N represents the chunk size and T represents the window size.
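The loss function itself does not appear in the text above; a natural reading of the surrounding description (a reconfiguration error averaged over the N windows of one chunk and the T steps of each window) is the mean squared error sketched below, which is offered as an assumption rather than the exact formula of the embodiment.

```python
import numpy as np

def reconfiguration_error(X_chunk, Y_chunk):
    """Assumed chunk-level loss: mean squared error between inputs and reconstructions.

    X_chunk, Y_chunk: arrays of shape (N, T, m) holding the N input windows of one chunk
    and their reconfigured counterparts; the error is averaged over N, T, and m.
    """
    diff = np.asarray(X_chunk) - np.asarray(Y_chunk)
    return float((diff ** 2).mean())
```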
Subsequently, the learning unit 13 compares the reconfiguration error with a threshold (S105). In a case that the reconfiguration error is larger than the threshold (No in S105), the learning unit 13 performs learning of the anomaly detection unit 12 (the LSTM autoencoder model) by using the Adam optimization method (S106). The Adam optimization method experimentally demonstrates the best performance by updating the weight of each parameter with an appropriate learning rate, taking into account the mean of the gradients as the first moment and the mean of the squared gradients as the second moment. For the online learning of the LSTM autoencoder, it is believed that the Adam method converges the earliest and is the most stable (the accuracy does not fall even if the learning rate is set over a wide range).
A main difference between the online learning and the known offline (off-line) learning is that the online learning does not strictly separate training data and test data; instead, each instance (the chunk in this case) is first used for model testing and then used for training. However, to improve online learning accuracy, offline learning may be used with a small batch of data when initializing the model (‘Zhou, Guanyu, Kihyuk Sohn, and Honglak Lee. “Online incremental feature learning with denoising autoencoders.” In Artificial intelligence and statistics, pp. 1453-1461, (2012).’).
After step S106, step S103 and subsequent steps are repeated. When the reconfiguration error is less than or equal to the threshold, the processing procedure ends.
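Putting steps S103 to S106 together, a schematic test-then-train loop might look as follows; model.reconstruct and model.adam_step are placeholders for the LSTM autoencoder forward pass and an Adam-based parameter update, neither of which is detailed here, and max_updates is an assumed safeguard on the number of updates per chunk.

```python
def online_learning_loop(chunks, model, threshold, max_updates=10):
    """Test-then-train over a stream of chunks (each chunk: a list of windows)."""
    for X_chunk in chunks:
        for _ in range(max_updates):
            Y_chunk = [model.reconstruct(X) for X in X_chunk]   # S103: reconfigure each window
            err = reconfiguration_error(X_chunk, Y_chunk)       # S104: chunk reconfiguration error
            if err <= threshold:                                # S105: within threshold, stop training
                break
            model.adam_step(X_chunk)                            # S106: update weights with Adam
```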
Note that when the anomaly detection task is performed, steps S101 to S105 are executed on newly arriving data, and an anomaly is determined in a case that the reconfiguration error exceeds the threshold.
As described above, according to the present embodiment, it is possible to enable the improvement in accuracy for the anomaly detection and the efficient learning. Specifically, effects (1) and (2) below can be achieved by constructing the LSTM autoencoder.
In addition, the anomaly detection unit 12 of the present embodiment including the attention layer 123 (attention mechanism) allows the following effect (3) to be obtained.
Furthermore, by performing the online learning for the anomaly detection unit 12, the following effect (4) can be obtained.
Although the embodiments of the present disclosure have been described in detail above, the present disclosure is not limited to such specific embodiments, and various modifications and changes can be made without departing from the gist of the present disclosure described in the aspects.
Filing Document: PCT/JP2019/045661; Filing Date: 11/21/2019; Country: WO