The embodiments relate generally to time series forecasting, and more specifically to systems and methods for learning latent causal dynamics in time series forecasting.
Time series forecasting has been widely used in research and industry, for example in economic planning, epidemiological studies, and energy consumption. State space models, together with deep learning, have been used for time series analysis and prediction. However, these methods usually rely on stringent assumptions regarding the nature of causal relationships that may not hold in practice. For example, if the forms of the transition and emission processes under these stringent assumptions cannot approximate the actual data generation process, the results can be sub-optimal.
Therefore, there is a need for improved time series forecasting.
In the figures, elements having the same designations have the same or similar functions.
As used herein, the term “network” may comprise any hardware or software-based framework that includes any artificial intelligence network or system, neural network, or system and/or any training or learning models implemented thereon or therewith.
As used herein, the term “module” may comprise hardware or software-based framework that performs one or more functions. In some embodiments, the module may be implemented on one or more neural networks.
Time series forecasting using deep learning models is often challenging because the deep learning models may not have a priori knowledge of the time series. For example, when modelling pandemic transmission and recovery, assumptions about protocols such as containment strategies in place at different periods may not be apparent in the time-series data itself. For example, protocols such as self-quarantine and social distancing mandates, which were implemented at certain periods during a lookback time window, can often influence the outcome of time series forecasts. However, such knowledge of historical protocols may not be captured by a deep learning model merely by learning time series data such as the number of new cases, the 7-day average, and/or the like within a past time period.
Furthermore, deep learning models are often directly trained to minimize a forecasting loss or reconstruction loss. However, deep learning models trained only to minimize a forecasting or reconstruction loss may not recover the correct latent processes. The lack of a correct latent process may result in models that capture spurious or over-complete dependencies with noise, which eventually impair training performance.
In view of the need for a more accurate and computationally efficient time series forecasting mechanism, embodiments described herein provide a state space model-based framework that adopts a non-parametric transition model and a flexible emission model to learn time-varying properties of the latent processes. The state space model may be built upon a hierarchical variational autoencoder (VAE) to estimate the a priori knowledge of the time series, which may then be used in forecasting.
Memory 120 may be used to store software executed by computing device 100 and/or one or more data structures used during operation of computing device 100. Memory 120 may include one or more types of machine readable media. Some common forms of machine readable media may include floppy disk, flexible disk, hard disk, magnetic tape, any other magnetic medium, CD-ROM, any other optical medium, punch cards, paper tape, any other physical medium with patterns of holes, RAM, PROM, EPROM, FLASH-EPROM, any other memory chip or cartridge, and/or any other medium from which a processor or computer is adapted to read.
Processor 110 and/or memory 120 may be arranged in any suitable physical arrangement. In some embodiments, processor 110 and/or memory 120 may be implemented on a same board, in a same package (e.g., system-in-package), on a same chip (e.g., system-on-chip), and/or the like. In some embodiments, processor 110 and/or memory 120 may include distributed, virtualized, and/or containerized computing resources. Consistent with such embodiments, processor 110 and/or memory 120 may be located in one or more data centers and/or cloud computing facilities.
In some examples, memory 120 may include non-transitory, tangible, machine readable media that includes executable code that when run by one or more processors (e.g., processor 110) may cause the one or more processors to perform the methods described in further detail herein. For example, as shown, memory 120 includes instructions for the time series forecasting module 130 that may be used to implement and/or emulate the systems and models, and/or to implement any of the methods described further herein. A time series forecasting module 130 may receive input 140, such as time series data and/or the like, via the data interface 115. The time series forecasting module 130 may generate an output 150 such as a prediction.
In some embodiments, the time series forecasting module 130 includes the encoder submodule 131, the decoder submodule 132, the transition prior network submodule 133, and the auxiliary predictor submodule 134. In one embodiment, the time series forecasting module 130 and its submodules 131-134 may be implemented by hardware, software and/or a combination thereof.
Some examples of computing devices, such as computing device 100, may include non-transitory, tangible, machine readable media that include executable code that when run by one or more processors (e.g., processor 110) may cause the one or more processors to perform the processes of the methods described herein. Some common forms of machine-readable media that may include the processes of the methods are, for example, floppy disk, flexible disk, hard disk, magnetic tape, any other magnetic medium, CD-ROM, any other optical medium, punch cards, paper tape, any other physical medium with patterns of holes, RAM, PROM, EPROM, FLASH-EPROM, any other memory chip or cartridge, and/or any other medium from which a processor or computer is adapted to read.
The user device 210, data vendor servers 245, 270 and 280, and the server 230 may communicate with each other over a network 260. User device 210 may be utilized by a user 240 (e.g., a driver, a system admin, etc.) to access the various features available for user device 210, which may include processes and/or applications associated with the server 230 to receive an output such as a prediction result.
User device 210, data vendor server 245, and the server 230 may each include one or more processors, memories, and other appropriate components for executing instructions such as program code and/or data stored on one or more computer readable mediums to implement the various applications, data, and steps described herein. For example, such instructions may be stored in one or more computer readable media such as memories or data storage devices internal and/or external to various components of system 200, and/or accessible over network 260.
User device 210 may be implemented as a communication device that may utilize appropriate hardware and software configured for wired and/or wireless communication with data vendor server 245 and/or the server 230. For example, in one embodiment, user device 210 may be implemented as an autonomous driving vehicle, a personal computer (PC), a smart phone, laptop/tablet computer, wristwatch with appropriate computer hardware resources, eyeglasses with appropriate computer hardware (e.g., GOOGLE GLASS®), other type of wearable computing device, implantable communication devices, and/or other types of computing devices capable of transmitting and/or receiving data, such as an IPAD® from APPLE®. Although only one communication device is shown, a plurality of communication devices may function similarly.
User device 210 may include a number of applications and components, examples of which are described below.
In various embodiments, user device 210 includes other applications 216 as may be desired in particular embodiments to provide features to user device 210. For example, other applications 216 may include security applications for implementing client-side security features, programmatic client applications for interfacing with appropriate application programming interfaces (APIs) over network 260, or other types of applications. Other applications 216 may also include communication applications, such as email, texting, voice, social networking, and IM applications that allow a user to send and receive emails, calls, texts, and other notifications through network 260. For example, the other application 216 may be an email or instant messaging application that receives a prediction result message from the server 230. Other applications 216 may include device interfaces and other display modules that may receive input and/or output information. For example, other applications 216 may contain software programs for asset management, executable by a processor, including a graphical user interface (GUI) configured to provide an interface to the user 240 to view the prediction result.
User device 210 may further include database 218 stored in a transitory and/or non-transitory memory of user device 210, which may store various applications and data and be utilized during execution of various modules of user device 210. Database 218 may store user profile relating to the user 240, predictions previously viewed or saved by the user 240, historical data received from the server 230, and/or the like. In some embodiments, database 218 may be local to user device 210. However, in other embodiments, database 218 may be external to user device 210 and accessible by user device 210, including cloud storage systems and/or databases that are accessible over network 260.
User device 210 includes at least one network interface component 219 adapted to communicate with data vendor server 245 and/or the server 230. In various embodiments, network interface component 219 may include a DSL (e.g., Digital Subscriber Line) modem, a PSTN (Public Switched Telephone Network) modem, an Ethernet device, a broadband device, a satellite device and/or various other types of wired and/or wireless network communication devices including microwave, radio frequency, infrared, Bluetooth, and near field communication devices.
Data vendor server 245 may correspond to a server that hosts one or more of the databases 203a-n (or collectively referred to as 203) to provide training datasets, including a time series dataset, to the server 230. The database 203 may be implemented by one or more relational databases, distributed databases, cloud databases, and/or the like.
The data vendor server 245 includes at least one network interface component 226 adapted to communicate with user device 210 and/or the server 230. In various embodiments, network interface component 226 may include a DSL (e.g., Digital Subscriber Line) modem, a PSTN (Public Switched Telephone Network) modem, an Ethernet device, a broadband device, a satellite device and/or various other types of wired and/or wireless network communication devices including microwave, radio frequency, infrared, Bluetooth, and near field communication devices. For example, in one implementation, the data vendor server 245 may send asset information from the database 203, via the network interface 226, to the server 230.
The server 230 may be housed with the time series forecasting module 130 and its submodules described above. The server 230 may also include a database 232.
The database 232 may be stored in a transitory and/or non-transitory memory of the server 230. In one implementation, the database 232 may store data obtained from the data vendor server 245. In one implementation, the database 232 may store parameters of the time series forecasting module 130. In one implementation, the database 232 may store previously generated prediction results and the corresponding input feature vectors.
In some embodiments, database 232 may be local to the server 230. However, in other embodiments, database 232 may be external to the server 230 and accessible by the server 230, including cloud storage systems and/or databases that are accessible over network 260.
The server 230 includes at least one network interface component 233 adapted to communicate with user device 210 and/or data vendor servers 245, 270 or 280 over network 260. In various embodiments, network interface component 233 may comprise a DSL (e.g., Digital Subscriber Line) modem, a PSTN (Public Switched Telephone Network) modem, an Ethernet device, a broadband device, a satellite device and/or various other types of wired and/or wireless network communication devices including microwave, radio frequency (RF), and infrared (IR) communication devices.
Network 260 may be implemented as a single network or a combination of multiple networks. For example, in various embodiments, network 260 may include the Internet or one or more intranets, landline networks, wireless networks, and/or other appropriate types of networks. Thus, network 260 may correspond to small scale communication networks, such as a private or local area network, or a larger scale network, such as a wide area network or the Internet, accessible by the various components of system 200.
Time series forecasting plays a crucial role in economic planning, epidemiological studies, and energy consumption. In various applications, one critical step is to infer the causal structure of the underlying stochastic dynamical system from the time series of measurements. State space models provide a unified methodology to learn and analyze time-lagged causal relations. Formally, given the observed data xt, the dynamical system with latent process zt may be described as follows.
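In one plausible form, consistent with the additive-noise transition and emission described below, Eq. (1) may be written as:
\[
z_t = f(z_{t-1}) + \epsilon_t, \qquad x_t = g(z_t) + \eta_t. \tag{1}
\]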
where ηt and ϵt denote the i.i.d. Gaussian measurement and process noises, respectively. ƒ(⋅) and g(⋅) denote the nonlinear transition model and the nonlinear emission model, respectively. The transition model ƒ(⋅) aims to capture the latent dynamics underlying the observed data, while the emission model g(⋅) is designed for learning the mapping from the latent variables to the observed data. More expressive, scalable deep learning models may be leveraged for modeling complex transition and emission processes effectively. Since the latent processes are typically unobserved, various works have focused on latent variable estimation via designing more efficient optimization methods. However, these works do not ensure that the models learn the correct underlying latent processes. First, they rely on stringent assumptions regarding the nature of causal relationships, which may not hold in practice. In particular, the additive noises in both the transition and emission processes cannot capture nonlinear distortions in the observed/latent values of the variables, and such distortions may well be present in real-world applications, such as sensor distortion and motion capture.
Existing works often do not provide theoretical guarantees for identifiability. Usually, in implementations, state space models are directly trained with a forecasting loss or reconstruction loss. However, these objectives cannot guarantee that the correct latent processes are recovered. These issues become severe when neural networks are used to learn the transition and emission models: because the parameter space is significantly larger, the model tends to capture spurious and over-complete dependencies with noise. The training process then becomes difficult and may result in sub-optimal performance. This situation deteriorates further when data are not abundant or when the data generation process changes across the time period. Furthermore, the transition model ƒ(⋅) is usually assumed to be constant across the measured time period. This implicitly assumes that the underlying dynamical system is in a stationary environment, which may not hold in real-life problems. For example, the unemployment rate tends to rise much faster at the start of a recession than it drops at the beginning of a recovery. In the analysis of the COVID-19 outbreak, the transmission and removal processes may vary over time, governed by various virus containment strategies at different time periods, such as self-quarantine and social distancing mandates. If the forms of the transition and emission processes under stringent assumptions cannot approximate the actual data generation process, the results will be sub-optimal.
As described herein, an improved neural network based framework is provided that remains general when precise prior knowledge is unavailable. A general formulation of state space models, called Non-Parametric State Space Models (NPSSM), is described. NPSSM includes a completely non-parametric transition model and a flexible emission model. Although NPSSM is remarkably flexible, its latent processes are generally identifiable. In addition, a Time-Varying Non-Parametric State Space Model (TV-NPSSM) based on the NPSSM is described to capture the potentially time-varying properties of the latent processes.
Furthermore, a time series forecasting framework may be provided by using a hierarchical VAE framework with the NPSSM. The time series forecasting framework may incorporate the conditions used in the identifiability analysis into the estimation procedure and make it applicable to forecasting tasks. The time series forecasting framework may recover time-lagged latent variables and their causal relations from observed sequential data.
Accordingly, the systems and methods described herein may be viewed as providing a method for causal representation learning for time series data. Forecasting may benefit from this general and identifiable model, since the underlying latent processes may be identified. Some of the features are listed as follows and are described in further detail below.
First, a general formulation of state space models, namely NPSSM, is described. Further, an extension, namely the Time-Varying Non-Parametric State Space Model (TV-NPSSM), is described, which allows nonstationarity of the latent processes over time. These models provide a flexible form for the transition and emission models that is expected to be widely applicable.
Second, the identifiability of the time-lagged latent variables and their influencing strengths for NPSSM is established under relatively mild conditions.
Third, by incorporating the independent noise conditions from the identifiability analysis, a new hierarchical VAE framework for model estimation and its use for forecasting tasks are described.
Fourth, estimation of the models described herein may be viewed as a way to learn the underlying temporal causal processes, which further facilitates forecasting of the time series.
At step 302, a time series dataset is received (e.g., via data interface 215). The time series dataset comprises data points at a plurality of timestamps in the time series.
At step 304, a non-parametric state space model (NPSSM) for a dynamical system underlying the time series dataset is provided. As described in detail below, the non-parametric state space model includes a nonparametric latent transition model and a flexible emission model.
To make the state space model in Eq. (1) flexible, a functional causal model is used to characterize the transition process. Each latent factor zit is represented in the non-parametric form zit = ƒi({zj,t−τ | zj,t−τ ∈ Pa(zit)}, ϵit), where i, j denote variable element indices, Pa(zit) (parents) denotes the set of time-lagged variables that directly determine the latent factor zit, and τ denotes the time lag index. In this way, the noise ϵit together with the parents of zit generates zit via the unknown non-parametric function ƒi(⋅). Formally, NPSSM can be formulated as follows.
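In one plausible form, consistent with the mixing function g2(⋅) and the post-nonlinear distortion g1(⋅) described below, Eq. (2) may be written as:
\[
z_{it} = f_i\big(\{z_{j,t-\tau} \mid z_{j,t-\tau} \in \mathrm{Pa}(z_{it})\},\, \epsilon_{it}\big), \qquad
x_t = g_1\big(g_2(z_t) + \eta_t\big). \tag{2}
\]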
where ϵit are mutually independent (i.e. spatially and temporally independent) random noises sampled from noise distribution p(ϵit). g2(⋅) is the nonlinear mixing function that takes latent factors as input. g1(⋅) denotes invertible post-nonlinear distortion on variable xt and ηt are independent noises. As such, a general form of the state space model is provided.
In NPSSM, the transition function of the latent processes is completely non-parametric: the effect zit is just a smooth function (referring to condition 3 of Theorem 1 below, which is the core condition to guarantee the identifiability of NPSSM) of its parents and noise, and it may contain linear models, nonlinear models with additive noise, and even multiplicative noise models as special cases. The Independent Noise condition and the Conditional Independence condition (see, e.g., J. Pearl, Causality, Cambridge University Press, 2009) are widely satisfied in time series data. Furthermore, in the emission function, the post-nonlinear transformation g1(⋅) may model sensor or measurement distortion that usually happens when the underlying processes are measured with instruments.
Identifiability of NPSSM. Next, the identifiability of NPSSM in the function space is defined. Since the conditional independence relations fully capture time-delayed causal relations in the time-delayed causally sufficient system, NPSSMs are identifiable if the latent variables are identifiable up to permutation and component-wise invertible transformations.
Definition 1 (Identifiability of NPSSM). For a ground-truth model (f, g, p(ϵ)) and a learned model (f̂, ĝ, p̂(ϵ)) as defined in Eq. (2), if the joint distributions of the observed variables, pf,g,p(ϵ)(xt) and pf̂,ĝ,p̂(ϵ)(xt), match almost everywhere, then the non-parametric state space model is identifiable if observational equivalence always leads to identifiability of the latent variables up to a permutation π and a component-wise invertible transformation T:
\[
p_{\hat g,\hat f,\hat p(\epsilon)}(x_t) = p_{g,f,p(\epsilon)}(x_t)
\;\Longrightarrow\;
\hat g^{-1}(x_t) = T\big(\pi\big(g^{-1}(x_t)\big)\big),
\]
where g−1 and ĝ−1 are the invertible functions that map xt to zt and ẑt, respectively.
Next, the identifiability result of the proposed model is discussed. Without loss of generality, assume the maximum time lag L=1 in the analysis. Note that it is trivial to extend the analysis to a longer lag L>1. NPSSM is remarkably flexible, and it is further identifiable up to relatively minor indeterminacies: each latent process can be recovered up to a nonlinear, invertible transformation. In many real time series applications, these indeterminacies may be inconsequential.
Theorem 1. Suppose that we observe data sampled from a generative model (as defined according to P. Becker, H. Pandya, G. Gebhardt, C. Zhao, C. J. Taylor, and G. Neumann, Recurrent Kalman networks: Factorized inference in high-dimensional deep feature spaces, International Conference on Machine Learning, pages 544-552, PMLR, 2019) with parameters (f̂, ĝ, p̂(ϵ)). Assume the following holds:
1. The set {xt∈χ|φηt(xt)=0} has measure zero, where φηt is the characteristic function of the density p(ηt)=pg(xt|zt). The post nonlinear functions g1, ĝ1 are invertible. The mixing functions g2, ĝ2 are injective and differentiable almost everywhere.
2. The noise terms ϵit are mutually independent.
3. Let ηkt≙log p(zkt|zt−1), ηkt is twice differentiable in zkt and is differentiable in zl,t−1, l=1, 2, . . . , n. For each value of zt, v1t, v̊1t, v2t, v̊2t, . . . , vnt, v̊nt as 2n vector functions in z1,t−1, z2,t−1, . . . , zn,t−1, are linearly independent, with vkt and v̊kt defined below:
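In one formulation consistent with the second- and third-order partial derivatives discussed below, vkt and v̊kt may be taken as:
\[
v_{kt} \triangleq \Big(\frac{\partial^2 \eta_{kt}}{\partial z_{kt}\,\partial z_{1,t-1}},\; \frac{\partial^2 \eta_{kt}}{\partial z_{kt}\,\partial z_{2,t-1}},\; \dots,\; \frac{\partial^2 \eta_{kt}}{\partial z_{kt}\,\partial z_{n,t-1}}\Big), \qquad
\mathring{v}_{kt} \triangleq \Big(\frac{\partial^3 \eta_{kt}}{\partial z_{kt}^2\,\partial z_{1,t-1}},\; \dots,\; \frac{\partial^3 \eta_{kt}}{\partial z_{kt}^2\,\partial z_{n,t-1}}\Big),
\]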
then zt must be an invertible, component-wise transformation of a permuted version of ẑt.
Compared to the form of conventional state space models in Eq. (1), NPSSM is remarkably flexible and is applicable to many real problems. Further, Theorem 1 indicates that NPSSM is still generally identifiable. With NPSSM, the underlying causal latent processes may be determined from the observed data. The differentiability and linear independence in condition 3 are the core conditions to ensure the identifiability of the latent factors zt from the observed xt. They indicate that the time-lagged variables must have a sufficiently complex and diverse effect on the transition distributions in terms of the second- and third-order partial derivatives. From this condition, the linear Gaussian SSM can be seen to be unidentifiable, since the second- and third-order partial derivatives would be constant, which violates the linear independence assumption.
Below a proof of the identifiability theory for NPSSM is provided. Lemma 1, which presents the identifiability of latent variables in fixed latent dynamics, is first introduced. This result will be used in the proof of Theorem 1.
Lemma 1. The fixed latent causal dynamics takes on the following form:
\[
x_t = g(z_t), \qquad z_{it} = f_i\big(\{z_{j,t-1} \mid z_{j,t-1} \in \mathrm{Pa}(z_{it})\},\, \epsilon_{it}\big). \tag{4}
\]
Let ηkt ≜ log p(zkt|zt−1); ηkt is twice differentiable in zkt and is differentiable in zl,t−1, l=1, 2, . . . , n. Suppose there exists an invertible function ĝ that maps xt to ẑt, i.e., ẑt = ĝ(xt), such that the components of ẑt are mutually independent conditional on ẑt−1. Let
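vk,t and v̇k,t be defined, in one formulation consistent with condition 3 of Theorem 1, as
\[
v_{k,t} \triangleq \Big(\frac{\partial^2 \eta_{kt}}{\partial z_{kt}\,\partial z_{1,t-1}},\; \dots,\; \frac{\partial^2 \eta_{kt}}{\partial z_{kt}\,\partial z_{n,t-1}}\Big), \qquad
\dot{v}_{k,t} \triangleq \Big(\frac{\partial^3 \eta_{kt}}{\partial z_{kt}^2\,\partial z_{1,t-1}},\; \dots,\; \frac{\partial^3 \eta_{kt}}{\partial z_{kt}^2\,\partial z_{n,t-1}}\Big).
\]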
If, for each value of zt, v1,t, v̇1,t, v2,t, v̇2,t, . . . , vn,t, v̇n,t, as 2n vector functions in z1,t−1, z2,t−1, . . . , zn,t−1, are linearly independent, then zt must be an invertible, component-wise transformation of a permuted version of ẑt.
Second, the additive noise model is considered, in which g1 is the identity mapping. To identify the noise-free distribution g(zt) from noisy data under assumption 1, the convolution theorem is used to decouple the measurement error. The volume of a matrix A, vol A, is defined as the product of the singular values of A; vol A = |det A| when A is invertible. vol A is used in the change of variables formula in place of the absolute determinant of the Jacobian. Suppose the joint distributions for the observed variables pƒ,g,p(ϵ)(xt|zt−1) and pf̂,ĝ,p̂(ϵ)(xt|ẑt−1) are matched almost everywhere. Then:
\[
\int_z p_{f,p(\epsilon)}(z_t \mid z_{t-1})\, p_g(x_t \mid z_t)\, dz_t = \int_z p_{\hat f,\hat p(\epsilon)}(z_t \mid \hat z_{t-1})\, p_{\hat g}(x_t \mid z_t)\, dz_t, \tag{5}
\]
\[
\int_z p_{f,p(\epsilon)}(z_t \mid z_{t-1})\, p_{\eta_t}\big(x_t - g(z_t)\big)\, dz_t = \int_z p_{\hat f,\hat p(\epsilon)}(z_t \mid \hat z_{t-1})\, p_{\eta_t}\big(x_t - \hat g(z_t)\big)\, dz_t.
\]
Using the Jacobian of the change of variables from zt to x̄t = g(zt), the equality above can be rewritten as
\[
\int p_{f,p(\epsilon)}\big(g^{-1}(\bar x_t)\mid z_{t-1}\big)\,\operatorname{vol} J_{g^{-1}}(\bar x_t)\, p_{\eta_t}(x_t-\bar x_t)\, d\bar x_t
= \int p_{\hat f,\hat p(\epsilon)}\big(\hat g^{-1}(\bar x_t)\mid \hat z_{t-1}\big)\,\operatorname{vol} J_{\hat g^{-1}}(\bar x_t)\, p_{\eta_t}(x_t-\bar x_t)\, d\bar x_t.
\]
Assume
\[
\tilde p_{f,p(\epsilon),g,z_{t-1}}(\bar x_t) \triangleq p_{f,p(\epsilon)}\big(g^{-1}(\bar x_t)\mid z_{t-1}\big)\,\operatorname{vol} J_{g^{-1}}(\bar x_t)\,\mathbb{1}_{\bar x_t\in\mathcal{X}},
\]
with the corresponding quantity for the learned model (f̂, ĝ, p̂(ϵ)) defined analogously.
According to the convolution theorem that the convolution in one domain (e.g., time domain) equals point-wise multiplication in the other domain (e.g., frequency domain), the following is obtained,
\[
\big(\tilde p_{f,p(\epsilon),g,z_{t-1}} * p_{\eta_t}\big)(x_t) = \big(\tilde p_{\hat f,\hat p(\epsilon),\hat g,\hat z_{t-1}} * p_{\eta_t}\big)(x_t),
\]
\[
F\big[\tilde p_{f,p(\epsilon),g,z_{t-1}}\big](\omega)\,\varphi_{\eta_t}(\omega) = F\big[\tilde p_{\hat f,\hat p(\epsilon),\hat g,\hat z_{t-1}}\big](\omega)\,\varphi_{\eta_t}(\omega),
\]
where * denotes the convolution operator and F[⋅] denotes the Fourier transform. We can find that φηt(ω) is non-zero almost everywhere by assumption 1, so it can be dropped from both sides, giving
\[
F\big[\tilde p_{f,p(\epsilon),g,z_{t-1}}\big](\omega) = F\big[\tilde p_{\hat f,\hat p(\epsilon),\hat g,\hat z_{t-1}}\big](\omega),
\qquad\text{and hence}\qquad
\tilde p_{f,p(\epsilon),g,z_{t-1}}(x_t) = \tilde p_{\hat f,\hat p(\epsilon),\hat g,\hat z_{t-1}}(x_t).
\]
Thus, we can conclude that if the distributions are the same with additive noise, the noise-free distributions are still the same. Combined with the results from Lemma 1, the latent variables are identifiable up to permutation and component-wise invertible transformation.
Further, the effect of the post-nonlinear function g1(⋅) is considered. Denote x̂t = g2(zt) + ηt; then the learned post-nonlinear function xt = ĝ1(x̂t) can be written as xt = (g1 ∘ (g1)−1 ∘ ĝ1)(x̂t). Assume that ĝ1 = g1 ∘ ((g1)−1 ∘ ĝ1) = g1 ∘ g3, in which g3 represents the indeterminacy on the space of x̃t. Following the proof of Theorem 1, g3 can only be a bijection if both g2 and ĝ1 are injective functions. Thus, it may be treated as adding a component-wise invertible nonlinear function g3−1 on xt, which does not affect the identifiability of zt up to permutation and component-wise invertible transformation. Therefore, NPSSM in (9) is identifiable.
To capture the potentially time-varying nature of the latent processes, NPSSM may be extended to the Time-Varying Non-Parametric State Space Model (TV-NPSSM) by introducing low-dimensional time-varying change factors ct.
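In one formulation consistent with the description of the change factors below, TV-NPSSM may be written as:
\[
z_{it} = f_i\big(\{z_{j,t-\tau} \mid z_{j,t-\tau} \in \mathrm{Pa}(z_{it})\},\, c_t,\, \epsilon_{it}\big), \qquad
c_t = f_c(c_{t-1},\, \zeta_t), \qquad
x_t = g_1\big(g_2(z_t) + \eta_t\big),
\]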
where ζt, similar to ϵit, are mutually independent (i.e., spatially and temporally independent) random noises. ƒc(⋅) is the transition function for the time-varying change factors, which is also formulated in a non-parametric way. ct is a low-dimensional vector that characterizes the time-varying information and is used as an input to the transition model ƒi(⋅). TV-NPSSM includes the conventional state space models in Eq. (1) as a particular case in which the time-varying change factors do not exist. It also includes the time-varying parameter vector autoregressive model as a special case, which allows the coefficients or the noise variances in the linear autoregressive model to vary over time following a specified law of motion.
At step 306, estimated latent variables of a latent space for the time series dataset are determined, using a neural network system, based on the non-parametric state space model. At step 308, a prediction result for the time series dataset is provided, using the neural network system, based on the latent causal representation. Example neural network models providing time series forecasting based on the non-parametric state space model are described in detail below.
VAE. In the example of time series forecasting framework 500, a hierarchical variational autoencoder (VAE) comprising an encoder and a decoder may be used to estimate the latent variables ẑt and the time-varying change factors ĉt from the observed time series and to reconstruct the observations.
Transition Prior Network. In some examples, the latent transition may be determined by leveraging a forward prediction function. However, forward prediction cannot model latent processes in a non-parametric way. In the example of time series forecasting framework 500, for an implementation based on the non-parametric state space model, transition priors are obtained by learning inverse latent transition functions ƒ−1. In particular, they may be implemented by a set of separate multilayer perceptron (MLP) networks ri to satisfy the independent noise condition in Theorem 1, which take the estimated latent causal variables and time-varying change factors as input and output the noise terms, i.e., ϵ̂it = ri(ẑit, ĉt, {ẑt−τ}τ=1L). By applying the change of variables formula to this transformation, the transition probability may be formulated in a non-parametric way.
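For example, under the change of variables formula, one consistent form is:
\[
p\big(\hat z_t \mid \{\hat z_{t-\tau}\}_{\tau=1}^{L},\, \hat c_t\big)
= p\big(\hat\epsilon_t\big)\,\Big|\det \frac{\partial \hat\epsilon_t}{\partial \hat z_t}\Big|.
\]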
Because of the mutually independent noise assumption, the Jacobian is lower-triangular. As such, its determinant may be efficiently calculated as the product of its diagonal elements. By applying the independent noise assumption, the transition probability can be formulated as follows.
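In one plausible factorized form:
\[
\log p\big(\hat z_t \mid \{\hat z_{t-\tau}\}_{\tau=1}^{L},\, \hat c_t\big)
= \sum_{i=1}^{n} \log p\big(\hat\epsilon_{it}\big)
+ \sum_{i=1}^{n} \log \Big|\frac{\partial r_i}{\partial \hat z_{it}}\Big|.
\]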
To fit the estimated noise terms, each noise distribution p(ϵ̂it) is modeled as a transformation of the standard normal noise N(0,1) through a function s(⋅).
Explicit estimation of this transformation may not be necessary, since the inverse causal transition functions {ri} can compensate for it. Similarly, the transition probability of the change factors ct may be defined as follows.
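Mirroring the latent transition prior above, one consistent formulation (with ζ̂it = ui(ĉit, ĉt−1) denoting the estimated change-factor noises) is:
\[
\log p\big(\hat c_t \mid \hat c_{t-1}\big)
= \sum_{i} \log p\big(\hat\zeta_{it}\big)
+ \sum_{i} \log \Big|\frac{\partial u_i}{\partial \hat c_{it}}\Big|,
\]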
where ui denotes the inverse change transition function.
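To make the transition prior computation concrete, the following is a minimal PyTorch sketch of a transition prior network of this kind; it is illustrative only (not the patented implementation), and the class name, layer sizes, and the choice of standard normal noise densities are assumptions.

```python
import torch
import torch.nn as nn

class TransitionPrior(nn.Module):
    """Sketch: one MLP r_i per latent dimension maps (z_hat_it, c_hat_t,
    lagged z_hat) to a noise term; the log transition probability is the sum
    of the noise log-densities plus the log absolute diagonal Jacobian."""

    def __init__(self, z_dim, c_dim, lags, hidden=64):
        super().__init__()
        in_dim = 1 + c_dim + lags * z_dim  # z_hat_it, c_hat_t, lagged latents
        self.nets = nn.ModuleList([
            nn.Sequential(nn.Linear(in_dim, hidden), nn.LeakyReLU(),
                          nn.Linear(hidden, 1))
            for _ in range(z_dim)
        ])

    def forward(self, z_t, c_t, z_lags):
        # z_t: (B, z_dim) current latents; it should require gradients (e.g.,
        # a reparameterized sample) so autograd can provide d r_i / d z_it.
        # c_t: (B, c_dim); z_lags: (B, lags, z_dim).
        hist = z_lags.reshape(z_lags.shape[0], -1)
        std_normal = torch.distributions.Normal(0.0, 1.0)
        log_p = 0.0
        for i, net in enumerate(self.nets):
            z_it = z_t[:, i:i + 1]
            eps_it = net(torch.cat([z_it, c_t, hist], dim=-1))
            # Diagonal Jacobian entry d r_i / d z_it obtained by autograd.
            jac = torch.autograd.grad(eps_it.sum(), z_it, create_graph=True)[0]
            log_p = log_p + (std_normal.log_prob(eps_it).squeeze(-1)
                             + torch.log(jac.abs() + 1e-8).squeeze(-1))
        return log_p  # (B,) log p(z_t | lagged z, c_t)
```

For example, TransitionPrior(z_dim=8, c_dim=4, lags=2) matches the latent sizes and maximum time lag used in the experiments described below; its output can serve as the log-prior term inside the KL component of the ELBO.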
Auxiliary Predictor. Compared with conventional forecasting models p(xt|{xt−τ}), which compute the prediction loss in the input space, in the time series forecasting framework described herein the latent variables {ẑt−τ}τ=1L are recovered and the auxiliary predictor is trained in the latent space, p(ẑt|{ẑt−τ}τ=1L). Note that although the change factor ct is not explicitly involved in the predictor, it can be inferred from the latent variables {ẑt−τ}τ=0L as well, as in the definition of the encoder qc(ĉt|{ẑt−τ}τ=0L). In an example, the auxiliary predictor 510 uses a long short-term memory (LSTM) network to implement the auxiliary predictor ppred(ẑt|ϵ̂t, {ẑt−τ}τ=1L) in the latent space. The noise ϵ̂t is generated from the inverse latent transition function ri(ẑit, ĉt, {ẑt−τ}τ=1L) in the training phase, while it may be sampled from the standard normal distribution N(0,1) in the forecasting phase.
In some embodiments, during the training process 404, at step 408, an estimated noise is generated based on the one or more estimated latent variables and an estimated time-varying change factor.
In some embodiments, during the training process 404, at step 410, a prediction result in the latent space is generated.
In some embodiments, during the training process 404, at step 412, a prediction result in the observed space is generated using a decoder.
In some embodiments, during the training process 404, at step 414, parameters of the neural network models of the time series forecasting framework, including parameters of VAE, transition prior network, and auxiliary predictor, are updated, e.g., using an evidence lower bound objective (ELBO) as follows:
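In one plausible form, with weights corresponding to the hyperparameters [β, γ, σ] discussed later in this description:
\[
\mathrm{ELBO} = \mathbb{E}_{q}\big[\log p_z(x_t \mid z_t)\big]
- \beta\, D_{KL}\Big(q(\hat z_t \mid x_t)\,\big\|\, p\big(\hat z_t \mid \{\hat z_{t-\tau}\}_{\tau=1}^{L}, \hat c_t\big)\Big)
- \gamma\, D_{KL}\Big(q_c\big(\hat c_t \mid \{\hat z_{t-\tau}\}_{\tau=0}^{L}\big)\,\big\|\, p\big(\hat c_t \mid \hat c_{t-1}\big)\Big)
+ \sigma\, \mathbb{E}_{q}\big[\log p_{pred}\big(\hat z_t \mid \hat\epsilon_t, \{\hat z_{t-\tau}\}_{\tau=1}^{L}\big)\big],
\]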
where pz(xt|zt) and ppred(ẑt|ϵ̂t, {ẑt−τ}τ=1L) denote the decoder distribution and the prediction distribution, respectively, in which the mean squared error (MSE) loss is used for the likelihood. In some embodiments, the training is performed in two phases: during the first phase, the VAE parameters and transition prior network parameters are learned, but not the parameters of the auxiliary predictor; during the second phase, after the first phase is performed, all parameters are learned jointly.
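As a minimal sketch of how the two-phase training described above might be organized, assuming module objects vae, transition_prior, and predictor, a data loader, and an elbo_loss callable supplied by the caller (all names, and the use_predictor flag, are illustrative assumptions):

```python
import itertools
import torch

def train_two_phase(vae, transition_prior, predictor, loader, elbo_loss,
                    phase1_epochs=50, phase2_epochs=50, lr=1e-3):
    """Phase 1: fit the VAE and transition prior only.
    Phase 2: fit all modules, including the auxiliary predictor, jointly."""
    opt1 = torch.optim.AdamW(
        itertools.chain(vae.parameters(), transition_prior.parameters()), lr=lr)
    for _ in range(phase1_epochs):
        for x in loader:
            # Maximize the ELBO without the auxiliary predictor term.
            loss = -elbo_loss(vae, transition_prior, predictor, x,
                              use_predictor=False)
            opt1.zero_grad()
            loss.backward()
            opt1.step()

    opt2 = torch.optim.AdamW(
        itertools.chain(vae.parameters(), transition_prior.parameters(),
                        predictor.parameters()), lr=lr)
    for _ in range(phase2_epochs):
        for x in loader:
            # Maximize the full objective, including the predictor term.
            loss = -elbo_loss(vae, transition_prior, predictor, x,
                              use_predictor=True)
            opt2.zero_grad()
            loss.backward()
            opt2.step()
```

The choice of the AdamW optimizer follows the training-stability discussion later in this description; the signature of elbo_loss is an assumption for illustration only.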
Further, learning the latent temporal processes (which are not instantaneously related), as performed by time series forecasting framework 500 based on NPSSM, improves prediction. For example, it provides a compact representation for forecasting.
The systems and methods described herein are evaluated on various synthetic and real-world datasets with next-step forecasting tasks. Experimental results demonstrate that latent causal dynamics could be reliably identified from observed data under various settings, and further verify that identifying and using the latent temporal causal processes consistently improve prediction.
In some embodiments, the hyperparameters of TV-NPSSM include [β, γ, σ], which represent the weights for the transition prior of the latent variables z, the transition prior of the change factors c, and the auxiliary predictor, respectively. Since the objective of the transition prior does not consider the initial time-lagged variables, a conventional VAE prior is used for these variables instead, with the standard normal distribution N(0, 1) as the prior distribution for the initial latent variables. Therefore, the hyperparameters are augmented to [β, βinit, γ, γinit, σ]. The ELBO loss is used to select the best tuple of [β, βinit, γ, γinit, σ], because a low ELBO loss generally leads to a high mean correlation coefficient (MCC). For the various datasets discussed below, the optimal configuration for the COVID-19 dataset is [5e-2, 5e-3, 1e-4, 1e-3, 1e-1], and the optimal configuration for the Economics dataset is [5e-2, 1e-3, 5e-5, 1e-3, 1]. To facilitate comparison, the training parameters (e.g., optimizer and batch size) as well as the encoder and decoder architectures of the compared models are identical to those of TV-NPSSM. Similarly, their hyperparameters are chosen via the objective loss. For all experiments, z∈R8 and c∈R4 are used, and the maximum time lag L=2 is set by rule of thumb. For the initialization of the VAE, the guidance of β-VAE is followed and He initialization is used. For the remaining modules/networks, uniform initialization is used.
Training Stability. Various techniques are used to improve training stability: (1) the AdamW optimizer is used as a regularizer to prevent training from being interrupted by overflow or underflow of the variance terms of the VAE; (2) for the synthetic datasets, in the first phase, the VAE parameters and transition prior network parameters are learned, but not the parameters of the auxiliary predictor. After this phase, all parameters are learned jointly. This allows the model to first find a good VAE embedding, and then learn how to utilize it for the forecasting task. For the real-world datasets, a single phase of training is performed, where all the model parameters are learned jointly. An Nvidia A100 GPU is used to run the experiments.
Synthetic Datasets. To evaluate the identifiability and forecasting capability of the model under different conditions, various synthetic datasets are generated with 1) fixed causal dynamics; 2) fixed causal dynamics with distribution shift; 3) time-varying causal dynamics with changing noise variances; and 4) time-varying causal dynamics with changing causal strengths. The first 80% of the data is used for training and the remaining 20% is used for evaluation.
Fixed Causal Dynamics. For the fixed causal dynamics setting, 100,000 data points are generated based on the data generation process described below.
In this process, ϵk,t is standard Gaussian, and ϵ1,t, ϵ2,t, . . . , ϵn,t are mutually independent and independent of the time-lagged latent variables; the standard deviation of the noise in zk,t depends on zt−1. The latent size is set to n=8 and the lag number of the process to L=2. A 2-layer MLP with LeakyReLU units is used as the state transition function. The emission function is a random three-layer MLP with LeakyReLU units. The process noise is sampled from an i.i.d. Gaussian distribution (σ=0.1). The process noise terms are coupled with the history information through multiplication with the average value of all the time-lagged latent variables.
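For illustration only, a generator of this kind might look as follows; the network shapes, the LeakyReLU units, the noise scale, and the coupling of noise with the lagged-latent average follow the description above, while details such as the LeakyReLU slope and weight normalization are assumptions rather than the exact process used in the experiments.

```python
import numpy as np

def generate_fixed_dynamics(T=100_000, n=8, lags=2, obs_dim=8, sigma=0.1, seed=0):
    """Sketch of the 'fixed causal dynamics' process: a 2-layer LeakyReLU MLP
    transition, a random 3-layer LeakyReLU MLP emission, and i.i.d. Gaussian
    process noise modulated by the mean of the time-lagged latents."""
    rng = np.random.default_rng(seed)

    def leaky(a, slope=0.2):            # LeakyReLU; the slope is an assumption
        return np.where(a > 0, a, slope * a)

    def norm(W):                        # spectral normalization, assumed for stability
        return W / np.linalg.norm(W, 2)

    # Random transition MLP (lags*n -> n) and emission MLP (n -> obs_dim).
    W1, W2 = norm(rng.normal(size=(lags * n, n))), norm(rng.normal(size=(n, n)))
    G1, G2, G3 = (norm(rng.normal(size=(n, n))), norm(rng.normal(size=(n, n))),
                  norm(rng.normal(size=(n, obs_dim))))

    z = rng.normal(size=(lags, n))      # initial lagged latents
    zs, xs = [], []
    for _ in range(T):
        hist = z.reshape(-1)
        eps = rng.normal(scale=sigma, size=n)
        # Noise is coupled with history via the mean of all time-lagged latents.
        z_t = leaky(leaky(hist @ W1) @ W2) + hist.mean() * eps
        x_t = leaky(leaky(leaky(z_t @ G1) @ G2) @ G3)
        zs.append(z_t)
        xs.append(x_t)
        z = np.vstack([z[1:], z_t])
    return np.array(zs), np.array(xs)
```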
Fixed causal dynamics with distribution shift. Similar to the setting of fixed causal dynamics, 80,000 data points are generated for the training set. To add changes, the values of the first layer of the MLP in the test set are varied to generate 20,000 samples. The entries of the kernel matrix of the first layer are uniformly distributed between [−1, 1].
Time-varying causal dynamics with changing noise variances. For the time-varying causal dynamics with changing noise variances, 100,000 data points are generated based on the data generation process described below.
In this process, the noises ζkt are sampled from an i.i.d. Laplace distribution (σ=1). In the latent transition process for zt, the noise terms are coupled with the history information and change factors through multiplication with the average value of all the time-lagged latent variables zt−τ and the current time-varying change factor ct.
Time-varying causal dynamics with changing causal strengths. For the time-varying causal dynamics with changing causal strengths, the process for changing noise variances is used, and 100,000 data points are generated based on the data generation process described below.
In this process, the change factor ct is taken as an input to the latent transition function for zt.
Real-World Datasets. Two real-world datasets, a COVID-19 dataset and an Economics dataset, are used to evaluate the forecasting performance of the systems and methods described herein. The first 80% of the data is used for training and the remaining 20% is used for evaluation.
The COVID-19 incidence data is publicly available at JHU-CSSE (https://github.com/CSSEGISandData/COVID-19) and the COVID-19 tracking project (https://covidtracking.com). The COVID-19 dataset used in the experiments covers reports from Jan. 23, 2020 to Jan. 1, 2022 for the 50 states and DC in the US. Considering the skewed distribution of new cases, a preprocessing step is performed and the data are normalized with the log-transformation log(x+1).
The Economics dataset was downloaded from https://www.theglobaleconomy.com. The time-lagged causal relationships among 10 macro-economic variables, ranging from CPI and inflation to the unemployment rate, are investigated using monthly data from 1965 to 2017 in the USA. A preprocessing step is performed and the data are normalized by subtracting the mean and dividing by the standard deviation.
Evaluation Metric. To evaluate the identifiability of the learned latent variables, the Mean Correlation Coefficient (MCC) is reported, which is a standard metric in the ICA literature for evaluating the recovery of continuous latent factors. MCC first computes the absolute values of the Spearman's rank correlation coefficients between every ground-truth factor and every estimated latent factor. The possible permutation is adjusted by solving a linear sum assignment problem, in polynomial time, on the computed correlation matrix. MCC reaches 1 when the latent variables are identifiable up to component-wise invertible transformation and permutation. To evaluate the forecasting performance, the Mean Absolute Error (MAE) and the ρ-risk, which quantifies the accuracy of a quantile ρ of the predictive distribution, are reported. Formally, they may be defined as follows.
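In one common formulation consistent with the description below (with N denoting the total number of forecasted values; the exact normalization may differ):
\[
\mathrm{MAE} = \frac{1}{N}\sum_{i,t}\big|x_{it} - \hat x_{it}\big|, \qquad
\rho\text{-risk} = \frac{\sum_{i,t} P_\rho\big(x_{it}, \hat x_{it}^{\rho}\big)}{\sum_{i,t} |x_{it}|}, \qquad
P_\rho\big(x, \hat x^{\rho}\big) = 2\big(x - \hat x^{\rho}\big)\Big(\rho\,\mathbb{I}\big[x > \hat x^{\rho}\big] - (1-\rho)\,\mathbb{I}\big[x \le \hat x^{\rho}\big]\Big),
\]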
where x̂itρ is the empirical ρ-quantile of the prediction distribution and I is the indicator function. For the probabilistic forecasting models, the forecast distribution is estimated from 50 sampling trials. The predicted median value is used as x̂it.
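As an illustration of the MCC computation described above (a sketch, not the exact evaluation code used in the experiments), MCC might be computed with SciPy as follows:

```python
import numpy as np
from scipy.stats import spearmanr
from scipy.optimize import linear_sum_assignment

def mean_correlation_coefficient(z_true, z_est):
    """MCC between ground-truth latents z_true and estimated latents z_est,
    both of shape (num_samples, num_factors)."""
    d = z_true.shape[1]
    # Absolute Spearman rank correlation between every true/estimated factor pair.
    corr = np.zeros((d, d))
    for i in range(d):
        for j in range(d):
            corr[i, j] = abs(spearmanr(z_true[:, i], z_est[:, j])[0])
    # Resolve the permutation indeterminacy via a linear sum assignment
    # (maximize the total correlation by minimizing its negation).
    row, col = linear_sum_assignment(-corr)
    return corr[row, col].mean()
```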
Baselines. The proposed method is compared with typical deep forecasting models and deep state space models: (1) LSTM, a baseline deterministic deep forecasting model; (2) DeepAR, an encoder-based probabilistic deep forecasting model; and (3) VRNN and (4) KVAE, which are deep state space models. Note that KVAE implicitly considers time-varying change factors by formulating the transition matrix as a weighted average of a set of base matrices and using an RNN to predict the combination weights at each step. Finally, (5) TV-NPSSM and NPSSM are compared to verify the effectiveness of incorporating time-varying change factors.
Experimental Results. Synthetic datasets that satisfy the identifiability conditions in the theorems are generated as discussed above. Specifically, four representative settings are considered to validate the identifiability and forecasting results under fixed causal dynamics (Synthetic1), fixed causal dynamics with distribution shift (Synthetic2), time-varying causal dynamics with changing noise variances (Synthetic3), and time-varying causal dynamics with changing causal strengths (Synthetic4). For all the synthetic datasets, the latent size is set to n=8 and the maximum latent process lag is set to L=2. The emission function g(⋅) is a random three-layer MLP with LeakyReLU units.
This description and the accompanying drawings that illustrate inventive aspects, embodiments, implementations, or applications should not be taken as limiting. Various mechanical, compositional, structural, electrical, and operational changes may be made without departing from the spirit and scope of this description and the claims. In some instances, well-known circuits, structures, or techniques have not been shown or described in detail in order not to obscure the embodiments of this disclosure. Like numbers in two or more figures represent the same or similar elements.
In this description, specific details are set forth describing some embodiments consistent with the present disclosure. Numerous specific details are set forth in order to provide a thorough understanding of the embodiments. It will be apparent, however, to one skilled in the art that some embodiments may be practiced without some or all of these specific details. The specific embodiments disclosed herein are meant to be illustrative but not limiting. One skilled in the art may realize other elements that, although not specifically described here, are within the scope and the spirit of this disclosure. In addition, to avoid unnecessary repetition, one or more features shown and described in association with one embodiment may be incorporated into other embodiments unless specifically described otherwise or if the one or more features would make an embodiment non-functional.
Although illustrative embodiments have been shown and described, a wide range of modification, change and substitution is contemplated in the foregoing disclosure and in some instances, some features of the embodiments may be employed without a corresponding use of other features. One of ordinary skill in the art would recognize many variations, alternatives, and modifications. Thus, the scope of the invention should be limited only by the following claims, and it is appropriate that the claims be construed broadly and, in a manner, consistent with the scope of the embodiments disclosed herein.
This application claims priority to U.S. Provisional Patent Application No. 63/344,495 filed May 20, 2022 which is incorporated by reference herein in its entirety.