The present invention relates to a prediction method, a prediction apparatus, and a program.
Conventionally, techniques of outputting a prediction distribution of future one-dimensional continuous values on the basis of past history data have been known. Assuming that a time axis takes only integer values for time-series prediction (that is, prediction of continuous values at a plurality of future time points), each time is also referred to as a step or a time step, and continuous values to be predicted are also referred to as target values.
Autoregressive integrated moving average (ARIMA) models have long been known as a classical technique of time-series prediction. In recent years, however, prediction techniques based on more flexible models using neural networks, which presuppose a large amount of history data, have become mainstream. The prediction techniques using neural networks can be roughly classified into two types: a discriminative model method and a generative model method.
The discriminative model method is a method in which a length of the prediction period (that is, the period to be predicted) is determined in advance, past history data is taken as input, a probability distribution followed by a target value in a future prediction period is output, and an input and output relationship is constructed on the basis of a neural network. Meanwhile, the generative model method is a method in which history data from the past to the present is taken as input, a probability distribution followed by a target value at the next time step is output, and an input and output relationship is constructed on the basis of a neural network. In the generative model method, a target value one step ahead stochastically generated from a probability distribution that is an output of the neural network, is input again to the neural network as new history data, and a probability distribution one step ahead is obtained as an output thereof. In the prediction technique of the discriminative model method or the generative model method described above, it is common to take, as input, history data including not only past continuous values but also a simultaneously observable value (this value is also called a covariate).
As a prediction technique of the generative model method, for example, techniques disclosed in Non Patent Documents 1 to 3 have been known.
Non Patent Document 1 discloses that a past covariate and a target value predicted one step before are taken as input to a recurrent neural network (RNN), and a prediction distribution of a target value one step ahead is output.
Non Patent Document 2 discloses that, on the assumption that continuous values of a prediction target are temporally developed according to a linear state space model, a past covariate is taken as input of an RNN, and a parameter value on each time step in the state space model is output. In Non Patent Document 2, by inputting the target value predicted one step before to the state space model, a prediction distribution of the target value one step ahead is obtained as an output thereof.
Non Patent Document 3 discloses that, on the assumption that continuous values of a prediction target are temporally developed according to a Gaussian process, a past covariate is taken as input of an RNN, and a kernel function on each time step is output. In Non Patent Document 3, a joint prediction distribution of target values in a prediction period including a plurality of steps is obtained as an output of the Gaussian process.
Non Patent Document 1: D. Salinas, et al., “DeepAR: Probabilistic forecasting with autoregressive recurrent networks”, International Journal of Forecasting, vol. 36, pp. 1181-1191 (2020).
Non Patent Document 2: S. Rangapuram, et al., “Deep state space models for time series forecasting”, Advances in Neural Information Processing Systems, pp. 7785-7794 (2018).
Non Patent Document 3: M. Al-Shedivat, et al., “Learning scalable deep kernels with recurrent structure”, Journal of Machine Learning Research, vol. 18, pp. 1-17 (2017).
However, conventional techniques of the generative model method have a high calculation cost or low prediction accuracy in some cases.
For example, in the technique disclosed in Non Patent Document 1, in order to obtain a target value one step ahead, it is necessary to perform Monte Carlo simulation on the basis of a prediction distribution output from an RNN when a target value predicted one step before is taken as input. Therefore, in order to obtain the target value of the prediction period including a plurality of steps, it is necessary to perform RNN calculation and Monte Carlo simulation the same number of times as the number of steps. In order to obtain the prediction distribution of the prediction period, it is necessary to obtain several hundreds to several thousand target values, and finally, it is necessary to perform RNN calculation and Monte Carlo simulation several hundred times to several thousand times the number of steps. In general, the calculation cost of the RNN calculation and the Monte Carlo simulation is high, and thus the calculation cost becomes enormous as the number of steps in the prediction period increases.
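The cost argument above can be made concrete with a toy sketch. It only counts model evaluations during ancestral sampling; `one_step_model` is a hypothetical stand-in for an RNN and is not the model of Non Patent Document 1.

```python
import random

def one_step_model(history):
    """Hypothetical stand-in for an RNN: returns (mean, std) of the next value."""
    return 0.9 * history[-1], 1.0

def sample_paths(history, n_steps, n_paths, seed=0):
    """Ancestral sampling: every path feeds its own samples back into the model."""
    rng = random.Random(seed)
    model_calls = 0
    paths = []
    for _ in range(n_paths):
        h = list(history)
        for _ in range(n_steps):
            mean, std = one_step_model(h)   # one model evaluation per step
            model_calls += 1
            h.append(rng.gauss(mean, std))  # one random draw per step
        paths.append(h[len(history):])
    return paths, model_calls

paths, model_calls = sample_paths([1.0, 1.2], n_steps=24, n_paths=500)
print(model_calls)  # n_steps * n_paths evaluations
```

The number of model evaluations grows as the product of the number of steps and the number of sampled paths, which is the scaling described above.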
Meanwhile, for example, in the technique disclosed in Non Patent Document 2, the target value of the next time step is obtained from a linear state space model, and thus the calculation cost thereof is relatively small. However, due to a strong constraint that the prediction distribution is a normal distribution, there is a possibility that the prediction accuracy becomes low for complex time-series data. Similarly, for example, even in the technique disclosed in Non Patent Document 3, there is a possibility that the prediction accuracy becomes low for complicated time-series data due to a strong constraint that the prediction distribution is a normal distribution.
An embodiment of the present invention has been made in view of the above points, and has an object to achieve highly accurate time-series prediction even for complicated time-series data at a small calculation cost.
In order to achieve the above object, according to an embodiment, a prediction method executed by a computer includes: an optimization step of optimizing a parameter of a second function that outputs parameters of a first function from covariates, and optimizing a parameter of a kernel function of a Gaussian process, by using a series of observation values observed in a past and a series of the covariates observed simultaneously with the observation values, wherein values obtained by non-linearly transforming the observation values by the first function follow the Gaussian process; and a prediction step of calculating a prediction distribution of observation values in a period in future to be predicted by using the second function and the kernel function having parameters optimized in the optimization step, and a series of covariates in the period.
It is possible to achieve highly accurate time-series prediction with a small calculation cost even for complicated time-series data.
Hereinafter, one embodiment of the present invention will be described. In the present embodiment, a time-series prediction apparatus 10 capable of achieving highly accurate time-series prediction even for complicated time-series data with a small calculation cost for a prediction technique of a generative model method will be described. Here, regarding the time-series prediction apparatus 10 according to the present embodiment, there are a parameter optimization time during which various parameters (specifically, a parameter θ of a kernel function and a parameter v of an RNN, which will be described later) are optimized from time-series data (that is, history data) representing a past history, and a prediction time during which a value of a prediction distribution in a prediction period, a mean thereof, or the like is predicted.
First, a hardware configuration of a time-series prediction apparatus 10 according to the present embodiment will be described with reference to
As illustrated in
The input device 11 is, for example, a keyboard, a mouse, a touch panel, or the like. The display device 12 is, for example, a display or the like. Note that the time-series prediction apparatus 10 may not include at least one of the input device 11 and the display device 12, for example.
The external I/F 13 is an interface with an external device such as a recording medium 13a. The time-series prediction apparatus 10 can execute, for example, reading and writing on the recording medium 13a via the external I/F 13. Note that the recording medium 13a is, for example, a compact disc (CD), a digital versatile disc (DVD), a secure digital memory card (SD memory card), a Universal Serial Bus (USB) memory card, or the like.
The communication I/F 14 is an interface for connecting the time-series prediction apparatus 10 to a communication network. The processor 15 is, for example, an arithmetic/logic device of various types such as a central processing unit (CPU) and a graphics processing unit (GPU). The memory device 16 is, for example, a storage device of various types such as a hard disk drive (HDD), a solid state drive (SSD), a random access memory (RAM), a read-only memory (ROM), and a flash memory.
The time-series prediction apparatus 10 according to the present embodiment having the hardware configuration illustrated in
Hereinafter, the time-series prediction apparatus 10 during the parameter optimization time will be described.
First, a functional configuration of the time-series prediction apparatus 10 during the parameter optimization time will be described with reference to
As illustrated in
The input unit 101 inputs time-series data, a kernel function, and a neural network provided to the time-series prediction apparatus 10. The time-series data, the kernel function, and the neural network are stored in, for example, the memory device 16 or the like.
The time series data is time-series data (that is, history data) representing past history, and includes a target value y1:T={y1, y2, . . . , yT} and a covariate x1:T={x1, x2, . . . , xT} from a time step t=1 to t=T. T is the number of time steps of the time-series data representing the past history. The target values and the covariates are assumed to take one-dimensional and multi-dimensional real values, respectively.
The target values are continuous values to be predicted, and examples thereof include the number of products sold in the marketing field, the blood pressure and blood glucose level of a person in the healthcare field, and power consumption in the infrastructure field. The covariate is a value that can be observed at the same time as the target value, and for example, in a case where the target value is the number of products sold, the day of the week, the month, the presence or absence of a sale, the season, the temperature, and the like may be exemplified.
The kernel function is a function that characterizes a Gaussian process and is denoted as kθ(t, t′). The kernel function kθ(t, t′) is a function that takes as input two time steps t and t′, and outputs a real value, and has a parameter θ. This parameter θ is not given as input, and is determined by the optimization unit 102 (that is, the parameter θ is a parameter to be optimized).
The neural network includes two types of neural networks Ωw,b(•) and Ψv(•).
Ωw,b(•) is a forward propagation neural network configured only with activation functions that are monotonically increasing functions. It is assumed that the parameters of the forward propagation neural network Ωw,b(•) include a weight parameter w and a bias parameter b, and that the dimensionalities of these parameters are Dw and Db, respectively. Examples of the activation function that is a monotonically increasing function include a sigmoid function, a softplus function, a ReLU function, and the like.
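As an illustration of such a network, the following is a minimal sketch assuming a single hidden layer with softplus activations. The layer layout, the function name `omega`, and the parameter packing are assumptions made for this example, not a structure stated in the specification. With non-negative weights and an increasing activation, the output is non-decreasing in y.

```python
import math

def softplus(a):
    # Numerically stable log(1 + exp(a)); monotonically increasing in a.
    return max(a, 0.0) + math.log1p(math.exp(-abs(a)))

def omega(y, w, b):
    """Hypothetical one-hidden-layer monotone map.

    w = (w1, w2): lists of non-negative weights (input and output layer).
    b = (b1, b2): list of hidden biases and a scalar output bias.
    """
    w1, w2 = w
    b1, b2 = b
    if any(v < 0 for v in w1 + w2):
        raise ValueError("weights must be non-negative to keep omega monotone")
    hidden = [softplus(w1k * y + b1k) for w1k, b1k in zip(w1, b1)]
    return sum(w2k * hk for w2k, hk in zip(w2, hidden)) + b2

w = ([0.5, 2.0], [1.0, 0.3])
b = ([0.0, -1.0], 0.1)
values = [omega(y, w, b) for y in (-3.0, -1.0, 0.0, 1.0, 3.0)]
print(values)  # increasing sequence
```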
Ψv(•) is a recurrent neural network (RNN). It is assumed that the recurrent neural network Ψv(•) has a parameter v, takes as input the covariates x1:t up to a time step t, and outputs a two-dimensional real value (μt, φt), a Dw-dimensional non-negative real value wt, and a Db-dimensional real value bt. That is, μt, φt, wt, bt=Ψv(x1:t). This parameter v is not given as input, and is determined by the optimization unit 102 (that is, the parameter v is a parameter to be optimized). There are a plurality of types of recurrent neural networks, such as a long short-term memory (LSTM) and a gated recurrent unit (GRU), and the type of recurrent neural network to be used is specified in advance.
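The input and output shapes described above can be sketched with a plain tanh recurrent cell. This is only an illustration: the cell type, the parameter layout `(W_in, W_h, W_out)`, and the use of softplus to keep wt non-negative are assumptions for this example, and an LSTM or GRU could be substituted.

```python
import numpy as np

def rnn_outputs(x, params, d_w, d_b):
    """Map covariates x of shape (T, D) to mu (T,), phi (T,), w (T, d_w), b (T, d_b)."""
    W_in, W_h, W_out = params
    h = np.zeros(W_h.shape[0])
    rows = []
    for x_t in x:
        h = np.tanh(W_in @ x_t + W_h @ h)     # hidden state summarizes x_{1:t}
        rows.append(W_out @ h)
    out = np.array(rows)
    mu, phi = out[:, 0], out[:, 1]
    w = np.logaddexp(0.0, out[:, 2:2 + d_w])  # softplus keeps the weights non-negative
    b = out[:, 2 + d_w:2 + d_w + d_b]
    return mu, phi, w, b

rng = np.random.default_rng(0)
T, D, H, d_w, d_b = 5, 3, 4, 3, 2
params = (rng.normal(size=(H, D)), rng.normal(size=(H, H)),
          rng.normal(size=(2 + d_w + d_b, H)))
mu, phi, w, b = rnn_outputs(rng.normal(size=(T, D)), params, d_w, d_b)
print(mu.shape, phi.shape, w.shape, b.shape)
```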
The optimization unit 102 uses the time-series data (target value y1:T={y1, y2, . . . , yT} and covariate x1:T={x1, x2, . . . , xT}), the kernel function kθ(t, t′), the forward propagation neural network Ωw,b(•), and the recurrent neural network Ψv(•) to search for a parameter Θ=(θ, v) that minimizes the following negative log marginal likelihood function L(Θ).
In addition, K=(Ktt′) is a T×T matrix, and
$K_{tt'} = k_{\theta}(\phi_t, \phi_{t'}), \quad 1 \le t, t' \le T$ [Math. 3]
Note that $z^{\top}$ [Math. 4] denotes the transpose of the vector $z$.
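For orientation, the following is a minimal sketch of a negative log marginal likelihood of the kind commonly used when non-linearly transformed observations are modeled by a zero-mean Gaussian process. It is not a reproduction of Math. 1. The squared-exponential kernel, the parameter layout `theta = (amplitude, lengthscale)`, and the function names are assumptions for this example. The transformed values z and the log-derivatives of the transform are passed in directly, so no assumption is made about how z is built from yt, μt, wt, and bt.

```python
import numpy as np

def rbf_kernel(phi_a, phi_b, theta):
    """Assumed kernel k_theta(phi, phi'); theta = (amplitude, lengthscale)."""
    amp, ls = theta
    d = phi_a[:, None] - phi_b[None, :]
    return amp ** 2 * np.exp(-0.5 * (d / ls) ** 2)

def warped_gp_nll(z, K, log_dz_dy):
    """Negative log density of y when z = transform(y) follows N(0, K).

    log_dz_dy[t] is the log-derivative of the transform at y_t
    (the change-of-variables term).
    """
    T = len(z)
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L, z)            # L^{-1} z
    quad = 0.5 * alpha @ alpha               # 0.5 * z^T K^{-1} z
    half_logdet = np.sum(np.log(np.diag(L))) # 0.5 * log|K|
    return quad + half_logdet + 0.5 * T * np.log(2 * np.pi) - np.sum(log_dz_dy)

phi = np.array([0.0, 0.5, 1.5])
K = rbf_kernel(phi, phi, (1.0, 1.0)) + 1e-6 * np.eye(3)  # small jitter for stability
z = np.array([0.2, -0.1, 0.4])
nll = warped_gp_nll(z, K, np.zeros(3))  # identity transform: log-derivative is 0
print(nll)
```

Any known optimizer can then be applied to such a function of Θ=(θ, v).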
The output unit 103 outputs the parameter Θ optimized by the optimization unit 102 to any output destination. The optimized parameter Θ is also referred to as an optimum parameter, and represented as,
$\hat{\Theta} = (\hat{\theta}, \hat{v})$ [Math. 5]
In the text of the specification, the hat “^” indicating an optimized value is written immediately before the symbol, not immediately above the symbol. For example, the optimum parameter expressed in the above Math. 5 is written as ^Θ=(^θ, ^v).
Next, parameter optimization processing according to the present embodiment will be described with reference to
Step S101: First, the input unit 101 takes as input the given time-series data (target value y1:T={y1, y2, . . . , yT} and covariate x1:T={x1, x2, . . . , xT}), the kernel function kθ(t, t′), and the neural network (the forward propagation neural network Ωw,b(•) and the recurrent neural network Ψv(•)).
Step S102: Next, the optimization unit 102 searches for the parameter Θ=(θ, v) of the kernel function kθ(t, t′) and the recurrent neural network Ψv(•) that minimizes the negative log marginal likelihood function L(Θ) shown in Math. 1 described above. Any known optimization method may be used for this search.
Step S103: Then, the output unit 103 outputs the optimized parameter ^Θ to any output destination. The output destination of the optimum parameter ^Θ may be, for example, the display device 12, the memory device 16, or the like, or may be another device or the like connected via the communication network.
Hereinafter, the time-series prediction apparatus 10 during the prediction time will be described.
First, a functional configuration of the time-series prediction apparatus 10 during the prediction time will be described with reference to
As illustrated in
The input unit 101 inputs the time-series data, the prediction period and the type of statistic, the covariate in the prediction period, the kernel function, and the neural network provided to the time-series prediction apparatus 10. The time-series data, the covariate in the prediction period, the kernel function, and the neural network are stored in, for example, the memory device 16 or the like. Meanwhile, the prediction period and the type of statistic may be stored in, for example, the memory device 16 or the like, or may be specified by the user via the input device 11 or the like.
As in the parameter optimization time, the time-series data includes a target value y1:T={y1, y2, . . . , yT} and a covariate x1:T={x1, x2, . . . , xT} from a time step t=1 to t=T.
The prediction period is a period during which target values are predicted. Hereinafter, assuming that 1≤τ0≤τ1, t=T+τ0, T+τ0+1, . . . , T+τ1 is set as the prediction period. Meanwhile, the type of statistic is the type of statistic of the target value to be predicted. Examples of the type of statistic include a value of a prediction distribution, a mean, a variance, and a quantile of the prediction distribution.
The covariate in the prediction period is a covariate in the prediction period t=T+τ0, T+τ0+1, . . . , T+τ1, that is,
$x_{T+\tau_0:T+\tau_1} = \{x_{T+\tau_0}, \ldots, x_{T+\tau_1}\}$. [Math. 6]
The kernel function is a kernel function having the optimum parameter ^θ, that is,
$k_{\hat{\theta}}(t, t')$ [Math. 7]
The neural network includes the forward propagation neural network Ωw,b(•) and a recurrent neural network having the optimum parameter ^v, that is,
$\Psi_{\hat{v}}(\cdot)$ [Math. 8]
The prediction unit 104 uses the kernel function k^θ(t, t′), the forward propagation neural network Ωw,b(•), the recurrent neural network Ψ^v(•), and the covariate in the prediction period, to calculate a probability density distribution p(y*) of the target value vector in the prediction period
$y^* = (y_{T+\tau_0}, \ldots, y_{T+\tau_1})$ [Math. 9]
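That is, the prediction unit 104 calculates the probability density distribution p(y*) as follows.
$p(y^*) = \mathcal{N}(z^* \mid E^*, \Sigma^*)\, \Sigma_{t=\tau\ldots} \ldots$ [Math. 10]
$E^* = k^{*\top} K^{-1} z$
$\Sigma^* = K^* - k^{*\top} K^{-1} k^*$ [Math. 11]
$\mu_t, \phi_t, w_t, b_t = \Psi_{\hat{v}}(x_{1:t})$
$z_t^* = \Omega_{w_t, b_t}(y_t) \ldots$ [Math. 12]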
Further, for T+τ0≤t, t′≤T+τ1,
$k^* = (k_{\hat{\theta}}(t, t_1), \ldots, k_{\hat{\theta}}(t, t_T))^{\top}$
$K^*_{tt'} = k_{\hat{\theta}}(\phi_t, \phi_{t'})$ [Math. 13]
Note that $K^* = (K^*_{tt'})$.
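Here, $\mathcal{N}(\cdot \mid E, \Sigma)$ [Math. 14] denotes a multivariate normal distribution with mean $E$ and covariance $\Sigma$.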
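The mean E* and covariance Σ* of Math. 11 are the standard conditional moments of a zero-mean Gaussian process. The following is a minimal numerical sketch under the assumption that K, k*, and K* have already been evaluated as matrices, with k* stacked column-wise over the prediction steps. The function name `gp_predict` and the array shapes are choices made for this example.

```python
import numpy as np

def gp_predict(K, k_star, K_star, z):
    """Compute E* = k*^T K^{-1} z and Sigma* = K* - k*^T K^{-1} k*.

    K: (T, T) training kernel matrix, k_star: (T, M) cross-kernel,
    K_star: (M, M) kernel over the prediction period, z: (T,) transformed history.
    """
    L = np.linalg.cholesky(K)          # K = L L^T, avoids forming K^{-1} explicitly
    A = np.linalg.solve(L, k_star)     # L^{-1} k*
    v = np.linalg.solve(L, z)          # L^{-1} z
    E = A.T @ v
    Sigma = K_star - A.T @ A
    return E, Sigma

K = np.eye(2)
k_star = np.array([[0.5], [0.0]])
K_star = np.array([[1.0]])
z = np.array([2.0, 3.0])
E, Sigma = gp_predict(K, k_star, K_star, z)
print(E, Sigma)
```

The Cholesky factorization is computed once and reused for both quantities, so the cost does not depend on the number of Monte Carlo samples drawn later.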
Then, the prediction unit 104 calculates the statistic of the target value by using the probability density distribution p(y*). A method of calculating the statistic according to its type will be described below.
Value of Prediction Distribution
With the probability density distribution p(y*), a probability corresponding to the target value yt at any time step in the prediction period can be obtained without using Monte Carlo simulation.
Quantile of Prediction Distribution
A quantile Qy of the prediction distribution of the target value yt is obtained by calculating a quantile Qz of zt* following a normal distribution, and then, converting Qz by the following formula.
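$Q_y = \Omega_{w_t, b_t}^{-1}(\ldots)$ [Math. 15]
where $\Omega_{w_t, b_t}^{-1}(\cdot)$ is the inverse function of $\Omega_{w_t, b_t}(\cdot)$.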
The solution of the above Math. 15 can be obtained by a simple root-finding algorithm such as the bisection method, because the function is monotonically increasing, and it is not necessary to use Monte Carlo simulation.
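The root-finding step can be sketched as follows. This is an illustration only: `warp` is a hypothetical increasing function standing in for the forward propagation neural network, the bracket `[lo, hi]` is an assumed search range, and the mean and standard deviation of the normal distribution are made-up numbers.

```python
from statistics import NormalDist

def invert_monotone(f, target, lo=-1e3, hi=1e3, tol=1e-9):
    """Solve f(y) = target by bisection for a monotonically increasing f."""
    if not (f(lo) <= target <= f(hi)):
        raise ValueError("target is not bracketed by [lo, hi]")
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if f(mid) < target:
            lo = mid   # root lies to the right of mid
        else:
            hi = mid   # root lies at or to the left of mid
    return 0.5 * (lo + hi)

def warp(y):
    # Hypothetical increasing transform used only for this example.
    return y ** 3 + y

# 90% quantile of a normally distributed transformed value (assumed mean and std).
q_z = NormalDist(mu=0.3, sigma=1.2).inv_cdf(0.9)
q_y = invert_monotone(warp, q_z)
print(q_z, q_y)
```

Because an increasing transform preserves order, the quantile of the transformed value maps to the quantile of the target value at the same probability level, which is why no sampling is needed.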
Expected Value of Function
The expected value of the function f(y*) generally depending on y*, including the mean or covariance of each element yt(T+τ0≤t≤T+τ1) of the target value vector y* in the prediction period, is calculated by the following formula using Monte Carlo simulation.
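(1) J samples $\{\ldots\}$ are generated from the multivariate normal distribution
$\mathcal{N}(z^* \mid E^*, \Sigma^*)$ [Math. 20]
(2) The samples generated in (1) above are converted by the following formula.
$y_t^{j} = \Omega_{w_t, b_t}^{-1}(\ldots)$
As a result, $y^{j} = (\ldots)$ is obtained.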
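Steps (1) and (2) can be sketched as follows: draw samples of the transformed vector from the multivariate normal distribution, map each sample back through the inverse transform, and average the function of interest. In this example the transform is assumed to be the logarithm, so that its inverse is `np.exp` and the result can be checked against the closed-form mean of a log-normal distribution. The numbers for E* and Σ* are made up.

```python
import numpy as np

def mc_expectation(f, E, Sigma, inverse_warp, n_samples=20000, seed=0):
    """Estimate the expected value of f(y*) by sampling z* ~ N(E, Sigma)."""
    rng = np.random.default_rng(seed)
    z = rng.multivariate_normal(E, Sigma, size=n_samples)  # step (1): shape (J, M)
    y = inverse_warp(z)                                    # step (2): back to target values
    return np.mean(f(y), axis=0)

E = np.array([0.1, -0.2])
Sigma = np.array([[0.04, 0.01],
                  [0.01, 0.09]])
mean_y = mc_expectation(lambda y: y, E, Sigma, np.exp, n_samples=200000)
exact = np.exp(E + 0.5 * np.diag(Sigma))  # log-normal mean, valid for this assumed transform
print(mean_y, exact)
```

Only the sampling from the normal distribution and the inverse transform are repeated per sample. The recurrent neural network is not re-evaluated for each sample.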
The output unit 103 outputs the statistic (hereinafter, also referred to as a prediction statistic) predicted by the prediction unit 104 to any output destination.
Next, prediction processing according to the present embodiment will be described with reference to
Step S201: First, the input unit 101 takes as input the given time-series data (target value y1:T={y1, y2, . . . , yT} and covariate x1:T={x1, x2, . . . , xT}), the prediction period t=T+τ0, T+τ0+1, . . . , T+τ1, the type of statistic to be predicted, the covariate {xt}(t=T+τ0, T+τ0+1, . . . , T+τ1) of the prediction period, the kernel function k^θ(t, t′), and the neural network (forward propagation neural network Ωw,b(•) and recurrent neural network Ψ^v(•)).
Step S202: Next, the prediction unit 104 calculates the probability density distribution p(y*) by the above Math. 10, and then calculates the prediction statistic according to the type of statistic to be predicted.
Step S203: Then, the output unit 103 outputs the prediction statistic to any output destination. The output destination of the prediction statistic may be, for example, the display device 12, the memory device 16, or the like, or may be another device or the like connected via the communication network.
As described above, the time-series prediction apparatus 10 according to the present embodiment converts the target value yt representing the past history (in other words, the observed target value yt) by the nonlinear function Ωw,b(•), and performs prediction on the assumption that the converted value Ωw,b(yt) follows the Gaussian process. In this respect, the present embodiment is a generalization of the technique disclosed in Non Patent Document 3: in the special case where Ωw,b(•) is the identity function, that is, Ωw,b(yt)=yt, the present embodiment coincides with the technique disclosed in Non Patent Document 3.
In the present embodiment, by maintaining the weight parameter w=wt to be a non-negative value, it can be ensured that Ωw,b(•) is a monotonically increasing function. Thanks to this monotonically increasing property, the calculation cost of the prediction processing by the prediction unit 104 can be reduced.
Therefore, the time-series prediction apparatus 10 according to the present embodiment can achieve highly accurate time-series prediction even for more complicated time-series data under the same calculation cost as the technique disclosed in Non Patent Document 3.
In the present embodiment, the time-series prediction apparatus 10 during the parameter optimization time and the time-series prediction apparatus 10 during the prediction time are implemented as the same device, but the present invention is not limited to this, and they may be implemented as separate devices.
The present invention is not limited to the above-mentioned specifically disclosed embodiment, and various modifications and changes, combinations with known techniques, and the like can be made without departing from the scope of the claims.
Filing Document | Filing Date | Country | Kind |
---|---|---|---|
PCT/JP2020/041385 | 11/5/2020 | WO |