The present invention relates to analytical techniques for time-series data.
As time-series data that includes random noise, there are data items including communication traffic, stock prices, weather data, and the like, and by approximating the behaviors of these data items, analytical techniques for feature understanding, prediction, anomaly detection, and the like are being investigated.
These methods can be classified into two broad categories. The first category includes methods using neural networks, and the second category includes methods in which time-series data is considered to be generated based on mathematical models. As for the second category, although classical methods assume a linear relationship among data items, in recent years, techniques for analyzing time-series data that use a mathematical instrument called the transfer operator, with which a model can be represented even for a nonlinear relationship, have been studied (Non-Patent Documents 1-3).
Non-patent document 1 discloses a technique for understanding a feature of time-series data having randomness by approximating eigenvalues and eigenfunctions of a transfer operator. Non-patent document 3 discloses a technique for calculating similarity between time-series data items not having randomness by using a transfer operator defined on a space called a reproducing kernel Hilbert space (RKHS). Non-patent document 2 discloses a technique for understanding a feature of time-series data having randomness by approximating eigenvalues and eigenfunctions of the transfer operator defined on an RKHS.
[Non-patent document 1] Crnjaric-Zic, N., Macesic, S., and Mezic, I., Koopman Operator Spectrum for Random Dynamical Systems, arXiv:1711.03146, 2019
[Non-patent document 2] Klus, S., Schuster, I., and Muandet, K., Eigendecompositions of Transfer Operators in Reproducing Kernel Hilbert Spaces, arXiv:1712.01572, 2017
[Non-patent document 3] Ishikawa, I., Fujii, K., Ikeda, M., Hashimoto, Y., and Kawahara, Y., Metric on Nonlinear Dynamical Systems with Perron-Frobenius Operators, In Advances in Neural Information Processing Systems 31, pp. 2856-2866, Curran Associates, Inc., 2018
A neural network approximates a relationship among data items without assuming a model; therefore, it is difficult to incorporate information on randomness into the approximation.
By considering a mathematical model, it is expected that a relationship among data items can be approximated while taking the randomness into account. However, any classical method using a mathematical model assumes a linear relationship among data items; therefore, for data items that exhibit nonlinear behavior, the accuracy of analysis falls.
Therefore, techniques to represent and analyze models in which nonlinear behaviors are assumed, by using transfer operators, have been studied. Conventional techniques using a transfer operator are only effective in the case where the transfer operator has good properties such as “having only a discrete spectrum” or “being bounded”.
However, a transfer operator that represents a model generating time-series data in practice does not necessarily have these properties. Also, the conventional techniques aim at approximating eigenvalues of a transfer operator and at calculating the degree of similarity among time-series data items, but do not aim at anomaly detection.
The present invention has been made in view of the above, and has an object to provide techniques with which behaviors of time-series data items including random noise can be approximated, to execute anomaly detection.
According to the disclosed techniques, an anomaly detection apparatus is provided that includes:
According to the disclosed techniques, techniques are provided with which behaviors of time-series data items including random noise can be approximated, to execute anomaly detection. The present techniques are also applicable when the transfer operator does not have properties such as “having only a discrete spectrum” or “being bounded”.
In the following, embodiments of the present invention (the present embodiments) will be described with reference to the drawings. The embodiments described below are merely examples, and the embodiments to which the present invention is applied are not limited to the following embodiments.
In the present embodiment, a method of approximating a transfer operator called a Perron-Frobenius operator on an RKHS will be described, together with, as an application example, a time-series data anomaly detection apparatus, i.e., a system that implements anomaly detection. The present time-series data anomaly detection apparatus can also be applied to cases where the transfer operator does not have properties such as “having only a discrete spectrum” or “being bounded”.
The time-series data anomaly detection apparatus 100 can be implemented by, for example, causing a computer to execute a program.
In other words, the time-series data anomaly detection apparatus 100 can be implemented by executing a program corresponding to processing executed by the time-series data anomaly detection apparatus 100, by using hardware resources such as a CPU, a memory, and the like embedded in the computer. In other words, calculation of approximation of a Perron-Frobenius operator, calculation of prediction, calculation of an index of dispersion level, and the like described later can be implemented by the CPU that executes processing expressed in formulas corresponding to these calculations according to the program. Parameters corresponding to the formulas, data to be calculated, and the like are stored in a storage unit such as the memory, and when the CPU executes the processing, the CPU reads the data and the like from the storage unit to execute the processing.
The program described above can be recorded on a computer-readable recording medium (portable memory, etc.), to be stored and distributed. Also, the program described above can also be provided via a network such as the Internet, e-mail, and the like.
A program that implements processing on the computer is provided by a recording medium 1001 such as a CD-ROM. When the recording medium 1001 storing the program is set into the drive device 1000, the program is installed in the auxiliary storage device 1002 from the recording medium 1001 through the drive device 1000. However, installation of the program does not need to be done from the recording medium 1001; the program may be downloaded from another computer via a network. The auxiliary storage device 1002 stores the installed program and stores necessary files, data, and the like.
The memory device 1003 reads the program from the auxiliary storage device 1002, and stores the program when an activation command of the program is received. The CPU 1004 implements functions related to the time-series data anomaly detection apparatus 100 according to the program stored in the memory device 1003. The interface device 1005 is used as an interface for connecting to a network, and functions as an input unit and an output unit via the network. The display device 1006 displays a GUI (Graphical User Interface) or the like based on a program. The input device 1007 is constituted with a keyboard and a mouse, buttons, a touch panel, or the like, and is used for inputting various operational commands.
An overview of operations of the time-series data anomaly detection apparatus 100 is as follows. The time-series data anomaly detection apparatus 100 executes anomaly detection of time-series data by executing an approximation step and an anomaly detection step, as follows.
Step 0: The observed data obtaining unit 110 obtains observed data in time series up to time T. The observed data is, for example, data of a traffic volume obtained from a router or the like that constitutes a network.
Step 1: the Perron-Frobenius operator approximation unit 121 approximates a Perron-Frobenius operator on an RKHS that represents a mathematical model to generate the data by using the obtained observed data.
Step 2: the dispersion level calculating unit 122 uses the approximated Perron-Frobenius operator to calculate a dispersion level of the predictions with respect to the respective observed data items.
Step 3: the observed data obtaining unit 110 obtains an observed data item at time t and an observed data item at time t+1.
Step 4: the detection unit 130 uses the Perron-Frobenius operator approximated at the approximation step, to predict a data item at time t+1 from the observed data item at time t.
Step 5: the detection unit 130 calculates discrepancy between the observed data at time t+1 and the predicted data at time t+1.
Step 6: the detection unit 130 determines a threshold value of anomaly taking into account the dispersion level of the prediction calculated at Step 2, and if the discrepancy calculated at Step 5 is greater than the threshold value, regards the observed data at time t+1 as anomalous.
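The flow of Steps 0 through 6 can be sketched in simplified form. The sketch below substitutes a least-squares one-step predictor for the Perron-Frobenius machinery described later; the noise model, the injected anomaly, and the threshold factor are all hypothetical, and the code only illustrates the obtain-predict-compare-threshold loop.

```python
import numpy as np

rng = np.random.default_rng(0)

# Step 0 stand-in: a hypothetical observed series (e.g. a traffic
# volume), here an AR(1)-like process with Gaussian noise.
T = 200
x = np.zeros(T + 50)
for t in range(len(x) - 1):
    x[t + 1] = 0.9 * x[t] + rng.normal(scale=0.5)
x[T + 25] += 10.0  # inject an anomaly into the test period

train = x[:T]

# Steps 1-2 stand-in: fit a one-step least-squares predictor and
# measure the dispersion of its prediction errors on the training data.
a = np.dot(train[:-1], train[1:]) / np.dot(train[:-1], train[:-1])
dispersion = np.std(train[1:] - a * train[:-1])

# Step 6's threshold is set larger when the dispersion of predictions
# is larger (the factor 4 is a hypothetical choice).
threshold = 4.0 * dispersion

anomalies = []
for t in range(T, len(x) - 1):                # Step 3: data at t, t+1
    predicted = a * x[t]                      # Step 4: predict at t+1
    discrepancy = abs(x[t + 1] - predicted)   # Step 5: discrepancy
    if discrepancy > threshold:               # Step 6: compare
        anomalies.append(t + 1)
```

Here the stand-in predictor merely plays the role of the approximated operator; the actual apparatus replaces it with the RKHS computation of Sections 1.1 and 1.2.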
Details of the operations of the time-series data anomaly detection apparatus 100 will be described with reference to flow charts in
Compared to Method 1, Method 2 can better reflect latest information; therefore, this is more suitable in the case where a trend changes little by little over a long period of time. However, Method 2 requires a greater calculation amount than Method 1; therefore, in the case where real-time detection is required for time-series data within a short duration, Method 1 is more suitable. In the following, each of Method 1 and Method 2 will be described. Note that observed data described below may be data obtained in real time, or may be observed data in the past obtained from a server or the like. In either case, in the time-series data anomaly detection apparatus 100, the observed data is stored in a storage unit such as the memory, read from the storage unit, and used.
The approximation unit 120 of the time-series data anomaly detection apparatus 100 starts approximation.
In
At Step 102, the Perron-Frobenius operator approximation unit 121 generates an S-dimensional space from the S sets of data sets by an operation called orthogonalization.
At Step 103, the Perron-Frobenius operator approximation unit 121 generates an approximation of a Perron-Frobenius operator in the generated S-dimensional space that represents a mathematical model to generate the obtained observed data, by a function of restricting the behavior of the Perron-Frobenius operator on the RKHS.
At Step 104, the dispersion level calculating unit 122 uses the generated approximation of the operator to calculate an index representing the dispersion level of the data, by a function of calculating the dispersion level of predictions with respect to the observed values, such that the smaller this index value is, the larger the threshold value is set.
The approximation unit 120 outputs the approximation of the Perron-Frobenius operator and the threshold value of anomaly, and ends processing.
In
At Step 201, the observed data obtaining unit 110 obtains an observed data item at time t (t>T) and an observed data item at time t+1.
At Step 202, the detection unit 130 uses the approximation of the Perron-Frobenius operator output at the end of the approximation step, to predict a data item at time t+1, by using a function of predicting the data item at time t+1 from the observed data item at time t.
At Step 203, the detection unit 130 determines the anomaly level at time t+1, by a function of calculating the discrepancy between the predicted data item at time t+1 and the observed data item.
At Step 204, the detection unit 130 determines whether the anomaly level at t+1 is less than the threshold value; if yes, sets t+1 as t and returns to the beginning; if no, determines that the observed data is anomalous and ends anomaly detection. Note that even in the case where the observed data is determined as anomalous, the process may return to the beginning to repeat.
In
At Step 301, the Perron-Frobenius operator approximation unit 121 partitions observed data from time T-U (U>0) to time T obtained by the observed data obtaining unit 110 into S sets of data sets.
At Step 302, the Perron-Frobenius operator approximation unit 121 generates an S-dimensional space from the S sets of data sets by an operation called orthogonalization.
At Step 303, the Perron-Frobenius operator approximation unit 121 generates an approximation of the Perron-Frobenius operator in the generated S-dimensional space that represents a mathematical model to generate the obtained observed data, by a function of restricting the behavior of the Perron-Frobenius operator on the RKHS.
At Step 304, the dispersion level calculating unit 122 uses the generated approximation of the operator, to calculate an index representing the dispersion level of the data, by a function of calculating the dispersion level of predictions with respect to the observed values, so as to set a larger threshold value for a smaller value of the index.
The approximation unit 120 outputs the approximation of the Perron-Frobenius operator and the threshold value of anomaly, and ends learning.
Next, the detection unit 130 starts anomaly detection.
At Step 305, the observed data obtaining unit 110 obtains an observed data item at time t=T+1 and an observed data item at time t+1.
At Step 306, the detection unit 130 uses the approximation of the Perron-Frobenius operator output at the end of the learning step, to predict a data item at time t+1, by using a function of predicting the data item at time t+1 from the observed data item at time t.
At Step 307, the detection unit 130 determines the anomaly level at time t+1, by a function of calculating the discrepancy between the predicted data item at time t+1 and the observed data item.
At Step 308, the detection unit 130 determines whether the anomaly level at t+1 is less than the threshold value, and if yes, sets T+1 as T, and returns to the beginning; or if no, determines it as anomalous, and ends anomaly detection. Note that even in the case where it is determined as anomalous, the process may return to the beginning to repeat.
In the following, calculation methods executed by the time-series data anomaly detection apparatus 100 will be described in detail. In addition, evaluation results will also be described. Note that in the following description, due to limitation on usable characters in the plain text in the specification, ‘˜’ to be attached above a character may be attached as the prefix of the character (e.g., ˜K). Also, ‘{circumflex over ( )}’ to be attached above a character may be attached as the prefix of the character (e.g., {circumflex over ( )}K).
In the description here, it is assumed that time-series data is generated from the following mathematical model.
Xt+1=h(Xt)+ξt (1)
where Xt and ξt are random variables from a probability space (Ω,F) to a state space X (a compact metric space), and h is a nonlinear mapping from X to X. Assume that a probability measure P is defined on Ω. ξt (t=0,1, . . . ) is an independent and identically distributed sequence of random variables representing noise, and is also independent of Xt.
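As a concrete illustration, a series following Equation (1) can be generated as follows; the nonlinear map h and the noise level used here are hypothetical choices, not ones prescribed by the present embodiment.

```python
import numpy as np

def simulate(h, x0, T, sigma, seed=0):
    """Generate {x_0, ..., x_{T-1}} following x_{t+1} = h(x_t) + xi_t,
    where xi_t is i.i.d. Gaussian noise with standard deviation sigma."""
    rng = np.random.default_rng(seed)
    xs = [x0]
    for _ in range(T - 1):
        xs.append(h(xs[-1]) + rng.normal(scale=sigma))
    return np.array(xs)

# h is unknown to the apparatus and only observed through the series;
# this particular map is an illustrative choice.
series = simulate(h=lambda x: 3.0 * np.sin(x), x0=0.5, T=100, sigma=0.3)
```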
Let k be a bivariate function on X, being a measurable, bounded continuous function that satisfies the following two conditions:
Condition 1: for any x,yϵX, k(x,y)=k(y,x)
Condition 2: for any x1, . . . , xn∈X and c1, . . . , cn∈R, Σni,j=1cicjk(xi,xj)≥0
where k is referred to as a kernel. For x∈X, let φ(x) be the function k(x,y) regarded as a function of y. The reproducing kernel Hilbert space (RKHS) with respect to k is an infinite-dimensional function space of all linear combinations of φ(x) and their limits.
Here, the RKHS with respect to k is denoted as Hk. In Hk, the concept of inner product can be applied to elements of Hk, by defining the inner product of φ(x) and φ(y) by k(x,y).
By introducing this concept of inner product, the theory of linear algebra can be used in Hk. Assume that Hk is dense in the space constituted with all bounded continuous functions.
As k that satisfies the conditions described above, a Gaussian kernel k(x,y)=e^(−c||x−y||^2), a Laplacian kernel k(x,y)=e^(−c|x−y|), and the like are available, and these are used in many applications.
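Conditions 1 and 2 can be checked numerically for these kernels; the sample points below are arbitrary.

```python
import numpy as np

def gaussian_kernel(x, y, c=1.0):
    return np.exp(-c * np.abs(x - y) ** 2)  # k(x,y) = e^(-c||x-y||^2)

def laplacian_kernel(x, y, c=1.0):
    return np.exp(-c * np.abs(x - y))       # k(x,y) = e^(-c|x-y|)

pts = np.array([-1.0, 0.0, 0.7, 2.5])       # arbitrary sample points

for kern in (gaussian_kernel, laplacian_kernel):
    G = np.array([[kern(xi, xj) for xj in pts] for xi in pts])
    # Condition 1: symmetry, k(x,y) = k(y,x).
    assert np.allclose(G, G.T)
    # Condition 2: the Gram matrix is positive semidefinite, i.e.
    # sum_{i,j} c_i c_j k(x_i, x_j) >= 0 for every coefficient vector.
    assert np.linalg.eigvalsh(G).min() >= -1e-10
```

The inner product of feature maps is then <φ(x), φ(y)>k=k(x,y), which is how the Gram matrices appearing in the later sections are evaluated from data.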
By converting the random variable into a probability measure, the relationship of Equation (1) is converted into a relationship using the probability measure, and the following equation is obtained:
[Formula 1]
Xt+1*P=Ft*(Xt*P⊗P) (1)
where, for a random variable X, X*P is the probability measure determined by X*P(A)=P(X−1(A)) for a set A, and Ft(x,ω)=h(x)+ξt(ω). Having converted the random variables into probability measures, by a concept called “kernel mean embedding” (Krikamol Muandet, Kenji Fukumizu, Bharath Sriperumbudur, and Bernhard Scholkopf. Kernel mean embedding of distributions: A review and beyond. Foundations and Trends in Machine Learning, 10(1-2), pp. 1-141, 2017), the probability measures can be embedded in Hk.
The kernel mean embedding of a signed measure μ is the mapping Φ from signed measures to Hk defined by Φ(μ)=∫x∈Xφ(x)dμ(x). Φ can be shown to be continuous and linear. A Perron-Frobenius operator K on the RKHS Hk is an operator defined as follows:
[Formula 2]
KΦ(X*P)=Φ(Ft*(X*P⊗P))
It can be shown that K is well-defined as a mapping, that K is independent of t, and that K is linear.
An approximation method of a Perron-Frobenius operator executed by the Perron-Frobenius operator approximation unit 121 will be described.
Let {x0, x1, . . . , xT−1} be observed data. This observed data is partitioned into S sets of data sets {x0, xS, . . . , x(N−1)S}, {x1, x1+S, . . . , x1+(N−1)S}, . . . , {xS−1, xS−1+S, . . . , xS−1+(N−1)S}, and μt,N is introduced as follows:
[Formula 3]
μt,N=(1/N)Σi=0N−1δxt+iS (3)
where δx, for an element x of X, is the probability measure that returns, for a set A, δx(A)=1 if x∈A, and
[Formula 4]
δx(A)=0 if x∉A (4)
μt,N can be calculated based on only the observed data. Here, introducing Ψ0,N as Ψ0,N=[Φ(μ0,N), . . . , Φ(μS−1,N)], the following relationship holds:
[Formula 5]
[Φ(F0*(μ0,N⊗P)), . . . , Φ(F0*(μS−1,N⊗P))]=KΨ0,N (2) (5)
Using Equation (2), an operator obtained by restricting K to the space constituted with Φ(μ0,N), . . . , Φ(μS−1,N) is calculated. However, in practice, the following expression cannot be calculated,
[Formula 6]
Φ(F0*(μ0,N⊗P)), . . . , Φ(F0*(μS−1,N⊗P)) (6)
and hence, it is approximated from a finite number of observed data items. The following condition that the spatial mean is coincident with the time mean is assumed.
where ω0 ∈Ω is a latent state in the observed data.
The left side of Equation (3) is coincident with the following expression:
and the right side is coincident with the following expression:
Since Φ(μt+1,N) can be calculated based on only the observed data, the following expression,
[Formula 10]
Φ(F0,(μt,N⊗P)) (10)
is approximated by Φ(μt+1,N).
In the case where K has the good property of being bounded, taking the limit N→∞ in Equation (2), the following equation holds:
Therefore, the following equation holds:
[Φ(μ1), . . . , Φ(μS)]=K[Φ(μ0), . . . , Φ(μS−1)] (4)
where
In this way, by approximating Φ(μt) by Φ(μt,N) for each t=0, . . . , S, from a finite number of data items, K can be restricted approximately to a space including all the linear combinations of Φ(μ0), . . . , Φ(μS−1). QR decomposition is executed to obtain [Φ(μ0,N), . . . , Φ(μS−1,N)]=QS,NRS,N. The calculation method of QS,N and RS,N will be described in Section 1.1.1. Denoting the restricted operator as ˜KS,NArnoldi, it can be calculated as follows:
[Formula 13]
˜KS,NArnoldi=Q*S,N[Φ(μ1,N), . . . , Φ(μS,N)]RS,N−1 (5)
It can be seen from Equation (4) that the space including all the linear combinations of Φ(μ0), . . . , Φ(μS-1) is the same space as a space called the Krylov subspace used in the most standard Krylov subspace method called the Arnoldi method. Therefore, the present method can be regarded as approximate execution of the Arnoldi method using observed data.
By applying QR decomposition to Ψ0,N to obtain Ψ0,N=QS,NRS,N, a conversion of the space including all the linear combinations of Φ(μ0,N), . . . , Φ(μS−1,N) into a representation using an orthonormal basis, namely QS,N, can be obtained.
Specifically, an orthonormal basis q0,N, . . . , qS−1,N is obtained by making each Φ(μt,N) orthonormal to q0,N, . . . , qt−1,N in turn, and then QS,N, as a conversion from CS to Hk, is defined by the following conversion:
[Formula 14]
[z0, . . . , zS−1]↦z0q0,N+ . . . +zS−1qS−1,N (14)
qt,N is calculated using the following equation:
where <⋅,⋅>k denotes the inner product on the RKHS; its calculation method will be described below. RS,N is an S×S matrix; its (i,t) component, denoted ri,t, is defined for i<t as <Φ(μt,N), qi,N>k; for i=t, defined as
[Formula 16]
rt,t=∥Φ(μt,N)−Σj=0t−1rj,tqj,N∥k (16)
and for i>t, defined as 0. At this time, it can be expressed as follows:
[Formula 17]
qt,N=(Φ(μt,N)−Σj=0t−1rj,tqj,N)/rt,t (17)
Then, for i<t, ri,t can be calculated as follows:
where <Φ(μi,N), Φ(μt,N)>k can be calculated as follows:
Also, ∥⋅∥k is the norm in the RKHS, calculated by the following equation:
[Formula 20]
∥v∥k=<v,v>k1/2 (20)
When i=j, <qi,N,qj,N>k=1, and when i≠j, <qi,N,qj,N>k=0; therefore, rt,t can be calculated as follows:
In Equation (5), [Φ(μ1,N), . . . , Φ(μS,N)] is a conversion from CS to Hk expressed as follows:
[Formula 22]
[z0, . . . , zS−1]↦z0Φ(μ1,N)+ . . . +zS−1Φ(μS,N) (22)
Q*S,N is a conversion from Hk to CS expressed as follows:
[Formula 23]
v↦[<v,q0,N>k, . . . , <v,qS−1,N>k] (23)
Therefore, Q*S,N[Φ(μ1,N), . . . , Φ(μS,N)] corresponds to an S×S matrix whose (i,t) component is <Φ(μt+1,N), qi,N>k, and hence can be calculated in substantially the same way as ri,t.
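All quantities in Sections 1.1 and 1.1.1 reduce to kernel evaluations on the observed data, since <Φ(μi,N), Φ(μt,N)>k=(1/N^2)Σl,m k(xi+lS, xt+mS). A minimal sketch follows, assuming a Laplacian kernel and hypothetical observed data; the QR decomposition of Ψ0,N is realized here through a Cholesky factorization of the Gram matrix (G00=R^T R), one standard way to implement a QR decomposition in an RKHS, so that Equation (5) becomes ˜KS,NArnoldi=R^(−T)G01R^(−1).

```python
import numpy as np

def k(x, y):
    return np.exp(-np.abs(x - y))  # Laplacian kernel, c = 1

def mean_embedding_gram(x, S, N, shift_a, shift_b):
    """G[i, t] = <Phi(mu_{i+shift_a,N}), Phi(mu_{t+shift_b,N})>_k
               = (1/N^2) sum_{l,m} k(x_{i+shift_a+lS}, x_{t+shift_b+mS})."""
    G = np.empty((S, S))
    for i in range(S):
        for t in range(S):
            xi = x[i + shift_a : i + shift_a + N * S : S]
            xt = x[t + shift_b : t + shift_b + N * S : S]
            G[i, t] = np.mean(k(xi[:, None], xt[None, :]))
    return G

# Hypothetical observed data: a noisy nonlinear recursion.
rng = np.random.default_rng(0)
S, N = 5, 40
x = np.zeros(S * N + S)
for t in range(len(x) - 1):
    x[t + 1] = 3.0 * np.sin(x[t]) + rng.normal(scale=0.3)

G00 = mean_embedding_gram(x, S, N, 0, 0)  # <Phi(mu_i,N), Phi(mu_t,N)>_k
G01 = mean_embedding_gram(x, S, N, 0, 1)  # <Phi(mu_i,N), Phi(mu_{t+1,N})>_k

# QR of Psi_{0,N} via Cholesky of the Gram matrix: G00 = R^T R with R
# upper triangular, so Q*_{S,N} Psi_{1,N} = R^{-T} G01.  A small jitter
# keeps the nearly singular Gram matrix numerically positive definite.
R = np.linalg.cholesky(G00 + 1e-8 * np.eye(S)).T
K_arnoldi = np.linalg.solve(R.T, G01) @ np.linalg.inv(R)
```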
In the case where K is not bounded, the limit N→∞ cannot be considered; therefore, the validity of the approximation by observed data cannot be shown. Thereupon, this problem is solved by selecting a complex number γ such that (γI−K)−1 becomes bounded and bijective, and approximating (γI−K)−1 instead. Since (γI−K)−1 is bounded, the following equation holds:
and assuming Equation (3), the following equation holds:
Therefore, for j=0, . . . , S, the following equation holds:
Therefore, the following equation holds:
For each t=0, . . . , S, by approximating Φ(μt) with Φ(μt,N) from a finite number of data items, (γI−K)−1 can be approximately restricted to a space including all of the following linear combinations:
With the above expression, QR decomposition is applied to obtain Ψ0,N=QS,NRS,N. The calculation method of QS,N and RS,N is simply to replace Φ(μj,N) in Section 1.1.1 with the following:
By using this, the behavior of (γI-K)−1 can be restricted to the space that includes all the linear combinations.
With the above expression, as in Section 1.1, (γI−K)−1 is approximated from a finite number of observed data items by ˜KS,N defined as follows:
[Formula 32]
˜KS,N=Q*S,NΨ1,NRS,N−1 (32)
Even in the case where K is not bounded, (γI−K)−1 is bounded; therefore, as in Section 1.1, the present method can be regarded as approximate execution of the Arnoldi method with respect to (γI−K)−1 by observed data. The Arnoldi method with respect to (γI−K)−1 is called the Shift-invert Arnoldi method.
Since K=γI−((γI−K)−1)−1, with the following equation,
[Formula 33]
˜KS,NSIA=γI−˜KS,N−1 (33)
K is approximated by ˜KS,NSIA.
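The recovery of K from the approximation of (γI−K)−1 in Formula 33 can be sketched directly; the toy operator below is hypothetical and only checks that the algebra round-trips.

```python
import numpy as np

def recover_from_shift_invert(K_tilde, gamma):
    """Formula 33: given an approximation K_tilde of (gamma*I - K)^{-1},
    recover the approximation of K as gamma*I - K_tilde^{-1}."""
    S = K_tilde.shape[0]
    return gamma * np.eye(S) - np.linalg.inv(K_tilde)

# Sanity check on a hypothetical toy operator: forming (gamma*I - K)^{-1}
# exactly and applying Formula 33 reproduces K.
gamma = 1.25
K_true = np.array([[0.5, 0.1], [0.0, 0.3]])
K_sia = recover_from_shift_invert(np.linalg.inv(gamma * np.eye(2) - K_true), gamma)
assert np.allclose(K_sia, K_true)
```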
In the following, validity of approximation methods in Section 1.1 and Section 1.2 will be described.
The following proposition holds for QS,N RS,N that appears in the approximation methods in Section 1.1 and Section 1.2.
In Section 1.1, Ψ0 is expressed as Ψ0=[Φ(μ0), . . . , Φ(μS−1)], and in Section 1.2, Ψ0 is expressed as follows,
and let Ψ0=QSRS be the QR decomposition of Ψ0. Let ˜KS=Q*SΨ1RS−1. ˜KS,NArnoldi and ˜KS,NSIA are collectively denoted as ˜KS,N. At this time, with respect to QS,N and ˜KS,N defined in Section 1.1 and in Section 1.2, QS,N→QS (strongly) and ˜KS,N→˜KS hold.
Next, a calculation method of anomaly detection will be described.
By using ˜KS,NArnoldi and ˜KS,NSIA generated in Section 1.1 and in Section 1.2, respectively, anomaly detection is executed by predicting the data item to be observed at time t from the observed data item φ(xt−1) at time t−1, and calculating the discrepancy with the actual observed data item at time t. In the following, ˜KS,NArnoldi and ˜KS,NSIA are collectively denoted as ˜KS,N. The prediction is generated by the following expression:
[Formula 35]
QS,N˜KS,NQ*S,Nφ(xt−1) (35)
Therefore, the anomaly level at, which represents the discrepancy with the actual observation at time t, is defined as follows:
where pS is a polynomial of degree S−1 that satisfies φ(xt−1)=pS((γI−K)−1)μS, and the following expressions are introduced:
[Formula 38]
ûS=γS−1(γΦ(μ0)−Φ(μ1))− . . . +(−1)S−1(γΦ(μS−1)−Φ(μS)), μS=ûS/∥ûS∥ (38)
In the above, Γr is a set that satisfies Γr⊇Γs⊇W((γI−K)−1) for s≤r, where W((γI−K)−1)={v*(γI−K)−1v | v∈Hk, ∥v∥k=1}. With respect to the anomaly level at, the following proposition holds:
In Section 1.2, the following expression is introduced:
[Formula 39]
˜KS=Q*SΨ1RS−1, ˜KSSIA=γI−˜KS−1 (39)
Let RS be a space including all the linear combinations of γΦ(μ0)−Φ(μ1), γ(γΦ(μ0)−Φ(μ1))−(γΦ(μ1)−Φ(μ2)), . . . , γS−1(γΦ(μ0)−Φ(μ1))− . . . +(−1)S−1(γΦ(μS−1)−Φ(μS)). If φ(xt−1) is sufficiently close to RS, there exist C1, C2, C3>0 and 0<θ<1 such that the following equation holds:
The first term on the right-hand side of Equation (6) represents the discrepancy between an expected value of observation and the actual observation, under assumption of xt−1 and xt conforming to the model of Equation (1). The second term takes a value close to zero if φ(xt−1) is sufficiently close to RS. Since 0<θ<1, if S is sufficiently large, the third term takes a value close to zero. Therefore, if xt−1 and xt conform to the model of Equation (1), and φ(xt−1) is sufficiently close to RS, then, at takes a small value. Therefore, if at is large, it can be stated that xt−1 and xt do not conform to the model of Equation (1), or φ(xt−1) is not close to RS, namely, it is anomalous.
However, in practice, GS(r) or QS cannot be calculated; therefore, the following value is used instead.
Here, it can be shown that a certain C exists such that the following inequality is satisfied:
[Formula 42]
CGS(r)≥∥QS,N˜KS,NQ*S,Nφ(xt−1)∥k (42)
As such, if at is large, ˆat,S,N becomes large.
Therefore, if ˆat,S,N is greater than the threshold value, the data is regarded as anomalous, and if smaller, it is regarded as normal.
Upon setting the threshold value for anomaly, the randomness of predictions needs to be taken into account. Thereupon, a value of the following expression is used as the magnitude of the prediction on the RKHS:
[Formula 43]
∥QS,N˜KS,NQ*S,Nφ(xt−1)∥k (43)
Let d(x,y) represent the distance between x,y∈X.
Let the kernel k be a function of the distance, expressed as k(x,y)=f(d(x,y)). Further, assume that f is a monotonically decreasing function. The examples shown in Section 0, the Gaussian kernel k(x,y)=e^(−c||x−y||^2) and the Laplacian kernel k(x,y)=e^(−c|x−y|), satisfy these conditions.
It is possible to show that any probability measure μ can be expressed in the following form:
With respect to μ, the magnitude of Φ(μ) on the RKHS is expressed as follows:
The smaller the weighted sum of f(d(xi,xj)) described above is, the greater the distances between xi and xj are; therefore, the dispersion expressed as follows,
spreads over a wide range.
[Formula 47]
QS,N˜KS,NQ*S,Nφ(xt−1) (47)
This expression is a prediction with respect to information on the probability measure at time t; therefore, in the case where it is predicted correctly,
[Formula 48]
∥QS,N˜KS,NQ*S,Nφ(xt−1)∥k (48)
for a smaller value of the above, it can be considered that the dispersion of predictions is greater. Thereupon, by calculating values of the following expression for normal data,
[Formula 49]
∥QS,N˜KS,NQ*S,Nφ(xt−1)∥k (49)
information on the randomness of the data can be extracted. This can be used for setting the threshold value such that in the case where the randomness is large, the threshold value for anomaly is set to be large; or if the randomness is small, the threshold value for anomaly is set to be small.
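The threshold-setting rule above can be sketched as follows; the inverse scaling and the base constant are a hypothetical concrete choice consistent with the rule that a smaller norm index (i.e., a larger dispersion of predictions) yields a larger threshold.

```python
import numpy as np

def anomaly_threshold(prediction_norms, base=1.0):
    """Derive the anomaly threshold from values of the index
    ||Q_{S,N} K~_{S,N} Q*_{S,N} phi(x_{t-1})||_k computed on normal
    data: a smaller index means a larger dispersion of predictions,
    so the threshold is set larger.  The inverse scaling and `base`
    are hypothetical choices, not prescribed by the document."""
    return base / float(np.mean(prediction_norms))

# A smaller norm index (more random data) yields a larger threshold.
loose = anomaly_threshold([0.20, 0.22, 0.25])   # high dispersion
tight = anomaly_threshold([0.80, 0.85, 0.90])   # low dispersion
assert loose > tight
```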
In the following, evaluation results will be described.
Time-series data {x0, x1, . . . , xT−1} was generated as follows:
where ξt takes values randomly sampled from a normal distribution with a mean of 0 and a standard deviation of σ. In order to confirm the relationship between the dispersion of predictions and the index expressed as
[Formula 51]
∥QS,N˜KS,NQ*S,Nφ(xt−1)∥k (51)
˜KS,N as the approximation of K was calculated for σ=1, 3, and 5, with N=60 and S=30, and for each t and each σ, the following value was calculated:
[Formula 52]
∥QS,N˜KS,NQ*S,Nφ(xt−1)∥k (52)
As the kernel, a Laplacian kernel k(x,y)=e^(−|x−y|) was used. The results were as illustrated in
[Formula 53]
∥QS,N˜KS,NQ*S,Nφ(xt−1)∥k (53)
It can be considered that a greater dispersion of data makes the dispersion of predictions greater; therefore, it can be seen that the magnitude expressed as follows,
[Formula 54]
∥QS,N˜KS,NQ*S,Nφ(xt−1)∥k (54)
can be used as an index of the dispersion of predictions.
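The relationship between data dispersion and the RKHS norm of the embedding can be checked directly on synthetic data: for an empirical measure μN, ∥Φ(μN)∥k^2=(1/N^2)Σi,j k(xi,xj), which shrinks as the samples spread out. The sample sizes and seed below are arbitrary.

```python
import numpy as np

def embedded_norm(samples):
    """RKHS norm of the embedded empirical measure under the Laplacian
    kernel k(x,y) = e^(-|x-y|):
    ||Phi(mu_N)||_k^2 = (1/N^2) sum_{i,j} k(x_i, x_j)."""
    d = np.abs(samples[:, None] - samples[None, :])
    return float(np.sqrt(np.mean(np.exp(-d))))

rng = np.random.default_rng(0)
norms = {s: embedded_norm(rng.normal(scale=s, size=500)) for s in (1, 3, 5)}

# A wider spread of the data (larger sigma) gives a smaller norm,
# matching the use of this magnitude as a dispersion index.
assert norms[1] > norms[3] > norms[5]
```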
With respect to traffic data published in http://totem.info.ucl.ac.be/dataset.html, the anomaly levels obtained by the Arnoldi method, the Shift-invert Arnoldi method, and an existing method were calculated. This data includes measurements of the traffic volume at each router at 15-minute intervals in a network that is constituted with 23 routers, 38 links between the routers, and 53 links with the outside.
Only a traffic volume transmitted from one particular router was extracted for 876 units of time, and the first 780 data items were used as training data, whereas the remaining 96 data items (for one day's worth) were used as normal data for testing.
As anomalous data for testing, {10, 10, . . . , 10} was used. The data used are illustrated in
In the Arnoldi method and the Shift-invert Arnoldi method, ˜KS,N as the approximation of K was calculated using the training data, and using the approximation, the respective anomaly levels of the normal data and the anomalous data were calculated. The setting values were N=60 and S=13. In the Shift-invert Arnoldi method, γ=1.25. As the kernel, the Laplacian kernel k(x,y)=e^(−|x−y|) was used.
Here, for data {z0, z1, . . . , zT−1}, by regarding the sequence of three-dimensional vectors {x0, x1, . . . , xT−1} with xi=[zi, zi+1, zi+2] as observed data, predictions were generated using information up to three units of time before, to calculate the anomaly levels.
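The conversion of a scalar series into three-dimensional observed vectors xi=[zi, zi+1, zi+2] is a delay embedding, which can be sketched as follows.

```python
import numpy as np

def delay_embed(z, dim=3):
    """Turn a scalar series {z_0, ..., z_{T-1}} into observed vectors
    x_i = [z_i, z_{i+1}, z_{i+2}] (for dim = 3), so that each prediction
    can use information up to `dim` units of time before."""
    z = np.asarray(z, dtype=float)
    return np.stack([z[i : len(z) - dim + 1 + i] for i in range(dim)], axis=1)

emb = delay_embed([1.0, 2.0, 3.0, 4.0, 5.0])
# emb[0] = [1, 2, 3], emb[1] = [2, 3, 4], emb[2] = [3, 4, 5]
```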
As an existing method, a method using LSTM proposed in the literature (Pankaj Malhotra, Lovekesh Vig, Gautam Shroff, and Puneet Agarwal. Long short term memory networks for anomaly detection in time series. In European Symposium on Artificial Neural Networks, Computational Intelligence and Machine Learning, pp. 89-94, 2015) was used. The LSTM, which generates predictions using information up to three units of time prior to the current time, was trained by using the training data, and the anomaly levels proposed in the same literature were calculated for the normal data and the anomalous data.
Results for the normal data are illustrated in
The anomalous data takes a constant value at all times, and hence, the anomaly level is also constant. The respective anomaly levels of the anomalous data were 77.2 for the Arnoldi method, 74.7 for the Shift-invert Arnoldi method, and −4.5 for the LSTM.
The Arnoldi method and the Shift-invert Arnoldi method can distinguish the normal data from the anomalous data more clearly than the existing method. Referring to
As described above, according to the techniques described in the present embodiment, by approximating a Perron-Frobenius operator on a reproducing kernel Hilbert space, predictions can be generated in which the randomness of time-series data is captured. In this way, anomaly detection that takes into account the randomness of the data can be achieved.
More specifically, by considering a space called RKHS, the concept of inner product can be used. Also, a Krylov subspace can be generated by approximation from a finite number of data items. In this way, approximation of a Perron-Frobenius operator can be executed by the Krylov subspace method.
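The projection of the operator onto a subspace spanned by a finite number of kernel functions can be sketched as below. This is a simplified Galerkin-style projection under the assumption that the empirical operator maps k(·, x_i) to k(·, x_{i+1}); it is not the partitioned Krylov construction of the embodiment, and the names (`gram`, `projected_operator`) and the regularization constant are illustrative.

```python
import numpy as np

def gram(X, Y):
    """Gram matrix of the Laplacian kernel: G[i, j] = exp(-|X[i] - Y[j]|)."""
    d = np.linalg.norm(X[:, None, :] - Y[None, :, :], axis=-1)
    return np.exp(-d)

def projected_operator(X):
    """Finite-dimensional projection of the transfer operator.

    Basis functions are k(., x_i); empirically the operator sends
    k(., x_i) to k(., x_{i+1}). Orthonormalizing the basis through a
    Cholesky factor of the Gram matrix (G = R^T R) yields an
    (N-1) x (N-1) matrix representation R^{-T} A R^{-1}."""
    X0, X1 = X[:-1], X[1:]
    G = gram(X0, X0) + 1e-8 * np.eye(len(X0))  # small ridge for stability
    A = gram(X0, X1)                           # <k(., x_i), k(., x_{j+1})>
    R = np.linalg.cholesky(G).T                # upper factor, G = R^T R
    Rinv = np.linalg.inv(R)
    return Rinv.T @ A @ Rinv
```

All quantities are computed from kernel evaluations alone, which is what makes the RKHS inner product usable with only a finite number of data items.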
By using the Shift-invert Arnoldi method, a Perron-Frobenius operator that does not have the property of being bounded can be approximated. By using the approximate operator to generate predictions, an anomaly level can be defined by the discrepancy between the predictions and the observations, to execute anomaly detection.
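The shift-invert idea can be illustrated on a finite matrix: instead of working with the operator directly, one works with the resolvent (γI − K)^{−1}, which is bounded, and maps its eigenvalues back. A minimal sketch, with the function name `shift_invert_eigs` chosen for illustration:

```python
import numpy as np

def shift_invert_eigs(A, gamma):
    """Recover the eigenvalues of A from those of the bounded resolvent
    (gamma*I - A)^{-1}: if mu is an eigenvalue of the resolvent, then
    lambda = gamma - 1/mu is an eigenvalue of A."""
    n = A.shape[0]
    resolvent = np.linalg.inv(gamma * np.eye(n) - A)
    mu = np.linalg.eigvals(resolvent)
    return gamma - 1.0 / mu
```

In the embodiment, the Arnoldi iteration is applied to the resolvent rather than to the (possibly unbounded) operator itself, which is what makes the approximation well defined.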
Information on the randomness is incorporated into the Perron-Frobenius operator; therefore, anomaly detection taking the randomness into account can be achieved. The magnitude of the prediction in an RKHS represents the dispersion level of the prediction; therefore, it can be used for setting a threshold value of the anomaly level to determine whether an observed data item is anomalous.
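These two quantities can be written out in closed form once the prediction is expressed as a kernel expansion f = Σ_i c_i k(·, x_i): the anomaly level is the squared RKHS distance to the embedded observation, and the dispersion index is the RKHS norm of f. A minimal sketch, assuming the Laplacian kernel and illustrative names (`anomaly_level`, `dispersion_index`):

```python
import numpy as np

def k(x, y):
    """Laplacian kernel with the Euclidean norm."""
    return np.exp(-np.linalg.norm(x - y))

def anomaly_level(coeffs, centers, y):
    """Squared RKHS distance between the prediction f = sum_i c_i k(., x_i)
    and the embedded observation k(., y):
    ||f - k(., y)||^2 = c^T G c - 2 c^T g_y + k(y, y)."""
    G = np.array([[k(a, b) for b in centers] for a in centers])
    g = np.array([k(a, y) for a in centers])
    return coeffs @ G @ coeffs - 2.0 * coeffs @ g + k(y, y)

def dispersion_index(coeffs, centers):
    """RKHS norm ||f|| of the prediction, usable to scale the threshold."""
    G = np.array([[k(a, b) for b in centers] for a in centers])
    return np.sqrt(coeffs @ G @ coeffs)
```

When the prediction coincides with the observation the anomaly level is zero, and a larger dispersion index indicates a more spread-out prediction, justifying a correspondingly looser threshold.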
The present specification describes at least the following matters related to an anomaly detection apparatus, an anomaly detection method, and a program:
An anomaly detection apparatus comprising:
The anomaly detection apparatus as described in Matter 1, wherein the approximation unit uses the approximation of the Perron-Frobenius operator to calculate an index of a dispersion level of predictions with respect to observed data items, and
wherein the detection unit uses a threshold value according to the index of the dispersion level, to determine whether the observed data item is anomalous.
The anomaly detection apparatus as described in Matter 2, wherein the index of the dispersion level is a magnitude of the predictions in the RKHS obtained by using the approximation of the Perron-Frobenius operator.
The anomaly detection apparatus as described in any one of Matters 1 to 3, wherein the approximation unit partitions the observed data into S data sets, and generates, by an orthogonalization operation applied to the S data sets, the approximation of the Perron-Frobenius operator restricted to an S-dimensional space.
The anomaly detection apparatus as described in Matter 4, wherein the approximation unit generates the approximation of the Perron-Frobenius operator by a Shift-invert Arnoldi method.
An anomaly detection method executed by an anomaly detection apparatus, the method comprising:
A program for causing a computer to function as the respective units of the anomaly detection apparatus as described in any one of Matters 1 to 5.
As above, the embodiments of the present invention have been described; note that the present invention is not limited to such specific embodiments, and various modifications and changes may be made within the scope of the subject matters of the present invention described in the claims.
The present patent application claims the priority of Japanese Patent Application No. 2019-154065 filed on Aug. 26, 2019, and the entire contents of Japanese Patent Application No. 2019-154065 are incorporated herein by reference.
100 time-series data anomaly detection apparatus
110 observed data obtaining unit
120 approximation unit
121 Perron-Frobenius operator approximation unit
122 dispersion level calculating unit
130 detection unit
1000 drive device
1001 recording medium
1002 auxiliary storage device
1003 memory device
1004 CPU
1005 interface device
1006 display device
1007 input device
International filing: PCT/JP2020/031316, filed Aug. 19, 2020 (WO)