Embodiments of the present disclosure relate to the field of machine learning, and in particular to machine learning architecture for time series data prediction.
Stochastic processes may include a collection of random variables that are indexed by time. An example of a continuous stochastic process may be the Wiener process. Normalizing flows may include operations for transforming a base distribution into a complex target distribution, thereby providing models for data generation or probability density estimation. Expressive models for sequential data can contribute to a statistical basis for data prediction or generation tasks in a wide range of applications, including computer vision, robotics, financial technology, among other examples.
Embodiments of the present disclosure may provide generative models for continuous stochastic processes. In particular, models provided herein may model continuous and irregular time series data based on reversible generative models. In some embodiments, the generative models may include operations for decoding a base continuous stochastic process (e.g., Wiener process) into a complex observable process using a dynamic instance of normalizing flows, such that resulting observable processes may be continuous in time. In addition to maintaining desirable properties of static normalizing flows (e.g., sampling or likelihood determination), embodiments of the present disclosure may include operations for inference tasks, such as interpolation and extrapolation at arbitrary time stamps, which may otherwise not be possible for some example time series data sets having complex or multivariate dynamics.
Embodiments of the present disclosure may be applicable to natural processes such as environmental conditions (e.g., temperature of a room throughout a day, wind speed over a period of time), speed of a travelling vehicle over time, electricity consumption over a period of time, valuation of assets in the capital markets, among other examples. Embodiments of the present disclosure may be applied to other applications such as natural language processing, recommendation systems, traffic pattern prediction, medical data analysis, forecasting, among other examples which may be associated with irregular time series data. The continuous time generative models disclosed herein may be configured for operations associated with weather forecasting, pedestrian behavior prediction by autonomous or self-driving vehicles, or healthcare data interpolation or prediction.
In one aspect, the present disclosure may provide a system for machine learning architecture for time series data prediction. The system may include: a processor and a memory coupled to the processor and storing processor-executable instructions. The processor-executable instructions, when executed, may configure the processor to: obtain time series data associated with a data query; generate a predicted value based on a sampled realization of the time series data and a continuous time generative model, the continuous time generative model trained to define an invertible mapping to maximize a log-likelihood of a set of predicted observation values for a time range associated with the time series data; and generate a signal providing an indication of the predicted value associated with the data query.
In another aspect, the present disclosure may provide a method for machine learning architecture for time series data prediction. The method may include: obtaining time series data associated with a data query; generating a predicted value based on a sampled realization of the time series data and a continuous time generative model, the continuous time generative model trained to define an invertible mapping to maximize a log-likelihood of a set of predicted values for a time range associated with the time series data; and generating a signal providing an indication of the predicted value associated with the data query.
In another aspect, the present disclosure may provide a non-transitory computer-readable medium having stored thereon machine interpretable instructions or data representing a continuous time generative model trained to define an invertible mapping based on maximizing a log-likelihood of observation values of irregular time series data. The continuous time generative model may be configured to generate a predicted value based on a sampled realization of the time series data associated with a data query.
In one aspect, the present disclosure may provide a non-transitory computer-readable medium or media having stored thereon machine interpretable instructions which, when executed by a processor, may cause the processor to perform one or more methods described herein.
In various further aspects, the disclosure provides corresponding systems and devices, and logic structures such as machine-executable coded instruction sets for implementing such systems, devices, and methods.
In this respect, before explaining at least one embodiment in detail, it is to be understood that the embodiments are not limited in application to the details of construction and to the arrangements of the components set forth in the following description or illustrated in the drawings. Also, it is to be understood that the phraseology and terminology employed herein are for the purpose of description and should not be regarded as limiting.
Many further features and combinations thereof concerning embodiments described herein will appear to those skilled in the art following a reading of the present disclosure.
In the figures, embodiments are illustrated by way of example. It is to be expressly understood that the description and figures are only for the purpose of illustration and as an aid to understanding.
Embodiments will now be described, by way of example only, with reference to the attached figures, wherein in the figures:
Embodiments of the present disclosure may be applicable to natural processes such as environmental conditions (e.g., temperature of a room throughout a day, wind speed over a period of time), speed of a travelling vehicle over time, electricity consumption over a period of time, valuation of assets in the capital markets, among other examples.
In practice, such example natural processes may be continuous processes having data sets generated based on discrete data sampling, which may occur at arbitrary points in time (e.g., arbitrarily obtained timestamped data). Modelling such natural processes may include inherent properties based on previous points in time, which may result in a potentially unmanageable matrix of variable or data dependencies. In some scenarios, such natural processes may be modeled with a simple stochastic process such as the Wiener process, which may have the Markov property (e.g., memoryless property of the stochastic process). However, it may be beneficial to provide generative models that may be more expressive than such simple stochastic processes.
It may be beneficial to provide generative models for such example processes to address some of the above-suggested challenges, such that example natural processes associated with a plurality of discrete samples may be represented by a generative model for generating continuous sampled data, likelihood approximation, or inferences (e.g., interpolation/extrapolation) at any point in time. As will be disclosed herein, in addition to extending the example Wiener process to a continuous time generative process, embodiments of generative models may be applicable to other continuous stochastic processes.
The present disclosure provides generative models for continuous stochastic processes. In particular, embodiments of the present disclosure may model continuous and irregular time series data based on reversible generative models for stochastic processes. Embodiments of a generative model (e.g., continuous-time flow process) may include operations for decoding a base continuous stochastic process (e.g., Wiener process) into a complex observable process using a dynamic instance of normalizing flows. Resulting observable processes may be continuous in time. In addition to maintaining desirable properties of static normalizing flow operations (e.g., efficient sampling and likelihood determination), embodiments of the present disclosure may include operations for inference tasks, such as interpolation and extrapolation at arbitrary time stamps, which may otherwise not be possible with some example time series data sets having complex or multivariate dynamics.
Expressive models for sequential data may provide a statistical basis for downstream tasks in a wide range of domains, including computer vision, robotics, finance or the like. Deep generative architectures, for example the concept of reversibility, may address limitations associated with structured decompositions (e.g., state-space models).
In some scenarios, utility of a time series model may be based on one or more of the following properties. First, with respect to resolution, example time series models may be discrete with respect to time. Such models may make an implicit assumption of a uniformly spaced temporal grid, which precludes their application to asynchronous tasks with a separate arrival process. Second, with respect to structural assumptions, expressiveness of a temporal model may be determined by the dependencies and shapes of its variables. In particular, the topological structure may be detailed enough to capture dynamics of the underlying process but sparse enough to allow for robust learning and efficient inference. Third, with respect to generation, a beneficial time series model should be able to generate unbiased samples from the true underlying process in an efficient way. Fourth, with respect to inference, given a trained model, the model may support standard inference tasks, such as interpolation, forecasting, and likelihood calculation.
Deep generative modeling may enable increased flexibility while keeping generation and inference tractable, based on example operations such as amortized variational inference [29, 12], reversible generative models [43, 30], and networks based on differential equations [10, 36].
In some embodiments disclosed herein, operations for modeling of continuous and irregular time series with a reversible generative model for stochastic processes are provided. In some embodiments, operations are based on features of normalizing flows. However, instead of a static base distribution, operations of models disclosed herein may transform a dynamic base process into an observable one. For example, operations of a continuous-time flow process (CTFP) may be a type of generative model that decodes the base continuous Wiener process into a complex observable process using a dynamic instance of normalizing flows. A resulting observable process may be continuous in time. In addition to appealing features of static normalizing flows (e.g., efficient sampling and exact likelihood), operations disclosed herein may also enable a series of inference tasks that may be typically unattainable in time series models with complex dynamics, such as interpolation and extrapolation at arbitrary timestamps. Furthermore, to overcome simple covariance structure of the Wiener process, embodiments disclosed herein may augment a reversible mapping with latent variables and optimize this latent CTFP variant using variational optimization.
Reference is made to
For example, a Wiener process may be a continuous stochastic process. In some embodiments, operations may be configured for learning a complex observed process (generally illustrated and identified as reference numeral 110) through a differential deformation (generally illustrated and identified as reference numeral 120) of the base Wiener process (generally illustrated and identified as reference numeral 130), thereby preserving beneficial features of the base Wiener process.
In some embodiments, a continuous-time flow process (CTFP) may be a generative model for continuous stochastic processes. The continuous-time flow process may include one or more of the following properties: (1) it provides flexible and consistent joint distributions on arbitrary and irregular time grids, with easy-to-compute density and an efficient sampling procedure; (2) the stochastic process generated by CTFP may provide continuous sample paths, promoting a natural fit for data with continuously-changing dynamics; or (3) CTFP may include operations for interpolation and extrapolation conditioned on given observations. As will be disclosed herein, operations of CTFP and embodiments of a latent variant may be evaluated on stochastic processes and real-world data sets, compared against baseline models including the variational recurrent neural network (VRNN) [12] and the latent ordinary differential equation (latent ODE) model [44], and may illustrate beneficial properties.
Among the example traditional time series models are latent variable models following the state-space equations [16], including the variants with discrete and linear state-space [2, 27]. In non-linear examples, exact inference may be intractable and resort to approximate techniques [26, 24, 8, 7, 45] may be considered.
Embodiments of CTFP disclosed herein may be viewed as a form of a continuous-time extended Kalman filter where the nonlinear observation process is noiseless and invertible and the temporal dynamics are a Wiener process. Embodiments disclosed herein may be more expressive than a Wiener process but may retain one or more appealing properties of the Wiener process. Such appealing properties may include closed-form likelihood, interpolation, or extrapolation.
Tree-based variants of non-linear Markov models may be provided [34]. An augmentation with switching states may increase the expressiveness of state-space models; however, there may be challenges for learning [17] and inference [1]. Marginalization over an expansion of the state-space equations in terms of non-linear basis functions extends classical Gaussian processes [42] to Gaussian process dynamical models [25].
Based on application to image data, in some examples, operations may extend the example variational autoencoder (VAE) [29] to sequential data [5, 12, 18, 37]. While RNN-based variational sequence models [12, 5] may model distributions over irregular timestamps, such timestamps have to be discrete. Such models may lack the notion of continuity. Accordingly, such models may not be suitable for modeling sequential data that have continuous underlying dynamics. Furthermore, such models may not be used to provide straightforward interpolation at arbitrary timestamps.
Latent ODEs [44] may utilize an ODE-RNN as encoder and may conduct operations to propagate a latent variable along a time interval using a neural ODE. Such operations may ensure that the latent trajectory is continuous in time. However, decoding of the latent variables to observations may be done at each time step independently. In such examples, there may be no guarantee that sample paths are continuous, which may represent undesirable features similar to those observed with variational sequence models. Neural stochastic differential equations (neural SDEs) [36] may replace the deterministic latent trajectory of a latent ODE with a latent stochastic process; however, such examples may not generate continuous sample paths.
In some examples [41], a recurrent neural process model may be provided. However, examples of the neural process family [28, 19, 20, 47] may simply model the conditional distribution of data given observations and may not provide generic generative models.
In some examples [33, 38, 46], models may include features of reversible generative models to sequential data, and thus may capture complex distributions. In some examples [38] and [46], models may include normalizing flows for modeling the distribution of inter-arrival time between events in temporal point processes. In some examples [33], models may include operations to generate video frames based on conditional normalizing flows. However, these models may use normalizing flows to model probability distributions in real space. In contrast, some embodiments described in the present disclosure may extend the domain of normalizing flows from distributions in real space to continuous-time stochastic processes.
In some embodiments disclosed herein, models may be based on stochastic processes and recent advances in normalizing flow research. A stochastic process may be defined as a collection of random variables indexed by time. An example of a continuous stochastic process may be the Wiener process.
The d-dimensional Wiener process Wτ may be characterized by the following properties: (1) W0=0; (2) Wt−Ws˜N(0, (t−s)Id) for s≤t, and Wt−Ws may be independent of past values Ws′ for all s′≤s. The joint density of (Wτ1, . . . , Wτn) on a time grid 0<τ1< . . . <τn may therefore factorize as

p(wτ1, . . . , wτn)=Πi=1n pWτi|Wτi−1(wτi|wτi−1),

with τ0=0 and wτ0=0. The conditional distribution pWt|Ws may be given by

pWt|Ws(wt|ws)=N(wt; ws, (t−s)Id),  (1)

where Id is a d-dimensional identity matrix. This equation may provide a way to sample from (Wτ1, . . . , Wτn) on an arbitrary time grid. Furthermore, given observed values ws at time s and wu at time u, the conditional distribution of Wt for s<t<u may also be Gaussian:

p(wt|ws, wu)=N(wt; ws+((t−s)/(u−s))(wu−ws), ((u−t)(t−s)/(u−s))Id).

The above may be known as the Brownian bridge. A property of the Wiener process is that the sample paths are continuous in time with probability one. This property may allow some embodiments of models disclosed herein to generate continuous sample paths and perform interpolation and extrapolation tasks.
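To illustrate the above relations, the following is a minimal NumPy sketch (provided for illustration only and not part of the disclosed system) that samples a Wiener process on an irregular time grid using Equation 1 and evaluates the Brownian bridge parameters used for interpolation; the function names and the example time grid are arbitrary.

```python
# Minimal NumPy sketch (illustrative only): sampling a d-dimensional Wiener
# process on an irregular time grid via Equation 1, and evaluating the
# Brownian bridge parameters used for interpolation.
import numpy as np

def sample_wiener(timestamps, d, rng):
    """Sample W at increasing timestamps, with W_0 = 0 at time 0."""
    w, t_prev, path = np.zeros(d), 0.0, []
    for t in timestamps:
        w = w + rng.normal(scale=np.sqrt(t - t_prev), size=d)
        path.append(w.copy())
        t_prev = t
    return np.stack(path)

def brownian_bridge_params(t, t1, w1, t3, w3):
    """Mean and isotropic variance of W_t given W_{t1} = w1, W_{t3} = w3, t1 < t < t3."""
    mean = w1 + (t - t1) / (t3 - t1) * (w3 - w1)
    var = (t3 - t) * (t - t1) / (t3 - t1)
    return mean, var

rng = np.random.default_rng(0)
taus = np.array([0.5, 1.2, 3.0, 3.1, 7.4])     # irregular observation times
path = sample_wiener(taus, d=2, rng=rng)
print(brownian_bridge_params(2.0, taus[1], path[1], taus[2], path[2]))
```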
Normalizing flows [43, 13, 31, 14, 40, 30, 3, 9, 32, 39] may be reversible generative models allowing density estimation and sampling. In some scenarios, it may be beneficial to estimate the density function pX of a random vector X∈ℝd. Normalizing flows assume X=f(Z), where f:ℝd→ℝd is a bijective function, and Z∈ℝd is a random vector with a simple density function pZ. The probability density function may be evaluated using the change of variables relation:

pX(x)=pZ(g(x))|det(∂g/∂x)|,

where we denote the inverse of f by g and ∂g/∂x is the Jacobian matrix of g. Sampling from pX may be conducted by drawing a sample from the simple distribution z˜pZ, and then applying the bijection x=f(z).
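As a concrete illustration of the change of variables relation, the following sketch assumes a simple elementwise bijection f(z)=exp(z), so that g(x)=log(x) and the Jacobian is diagonal; it is a toy example rather than a learned flow.

```python
# Toy illustration of the change of variables relation with f(z) = exp(z),
# g(x) = log(x): p_X(x) = p_Z(g(x)) |det(dg/dx)|, where dg/dx = diag(1/x).
import numpy as np
from scipy.stats import norm

def log_px(x):
    z = np.log(x)                          # g(x)
    log_det_jac = -np.log(x).sum()         # log|det(dg/dx)|
    return norm.logpdf(z).sum() + log_det_jac

x = np.array([0.5, 2.0])
print(np.exp(log_px(x)))                   # density of a 2-d log-normal at x
```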
In some examples [10, 21], models may include the continuous normalizing flow, which includes operations of the neural ordinary differential equation (neural ODE) to model a flexible bijective mapping. Given z=h(t0) sampled from the base distribution pZ, it may be mapped to h(t1) based on the mapping defined by the ODE: dh(t)/dt=f(h(t), t). The change in log-density may be computed by the instantaneous change of variables formula [10]:

∂ log p(h(t))/∂t=−Tr(∂f/∂h(t)).  (4)
In some scenarios, a potential disadvantage of the neural ODE model is that it may preserve the topology of the input space, and there are classes of functions that may not be represented by neural ODEs. Some example models [15] include the augmented neural ODE (ANODE) model to address this limitation. The original formulation of ANODE may not be a generative model, and it may not support the computation of likelihoods pX(x) or sampling from the target distribution x˜pX. In some embodiments disclosed herein, operations may provide a modified version of ANODE that may be used as a conditional generative model.
Reference is made to
The system 200 includes a processor 202 configured to execute processor-readable instructions that, when executed, configure the processor 202 to conduct operations described herein. For example, the system 200 may be configured to conduct operations for time series data prediction based on a continuous time generative model, in accordance with embodiments of the present disclosure.
The processor 202 may be a microprocessor or microcontroller, a digital signal processing (DSP) processor, an integrated circuit, a field programmable gate array (FPGA), a reconfigurable processor, a programmable read-only memory (PROM), or combinations thereof.
The system 200 includes a communication circuit 204 to communicate with other computing devices, to access or connect to network resources, or to perform other computing applications by connecting to a network (or multiple networks) capable of carrying data. In some embodiments, the network 250 may include the Internet, Ethernet, plain old telephone service line, public switch telephone network, integrated services digital network, digital subscriber line, coaxial cable, fiber optics, satellite, mobile, wireless, SS7 signaling network, fixed line, local area network, wide area network, and others, including combination of these. In some examples, the communication circuit 204 may include one or more busses, interconnects, wires, circuits, and/or any other connection and/or control circuit, or combination thereof. The communication circuit 204 may provide an interface for communicating data between components of a single device or circuit.
The system may include memory 206. The memory 206 may include one or a combination of computer memory, such as static random-access memory, random-access memory, read-only memory, electro-optical memory, magneto-optical memory, erasable programmable read-only memory, electrically-erasable programmable read-only memory, Ferroelectric RAM or the like.
The memory 206 may store a machine learning application 212 including processor readable instructions for conducting operations described herein. In some embodiments, the machine learning application 212 may include operations for time series data prediction based on a continuous time generative model. Other example operations may be contemplated and are disclosed herein.
The system 200 may include a data storage 214. In some embodiments, the data storage 214 may be a secure data store. In some embodiments, the data storage 214 may store input data sets, such as image data, training data sets, or the like.
The client device 210 may be a computing device including a processor, memory, and a communication interface. In some embodiments, the client device 210 may be a computing device associated with a local area network. The client device 210 may be connected to the local area network and may transmit one or more data sets, via the network 250, to the system 200. The one or more data sets may be input data, such that the system 200 may conduct one or more operations associated with likelihood determination, data sampling, data interpolation, or data extrapolation. Other operations may be contemplated, as described in the present disclosure.
In some embodiments, the system may include a machine learning architecture having operations of a continuous-time flow process (CTFP). In some embodiments to be disclosed, a generative variant of ANODE may be provided as a component to implement operations of CTFP.
Embodiments may include a continuous-time flow process (CTFP). A generative variant of ANODE will be disclosed as a component to implement CTFP. In some scenarios, as a stochastic process may be continuous in time, embodiments of operations herein may provide interpolation and extrapolation at arbitrary time points. Further, in some embodiments, operations of a latent CTFP model may provide richer covariance structures.
In some embodiments, the machine learning architecture for continuous-time flow processes may provide that {(xτi, τi)}i=1n denotes a sequence of irregularly spaced observations, where each xτi∈ℝd is observed at timestamp τi. The sequence may be regarded as a partial realization of a continuous stochastic process {Xτ}τ∈[0,T], and a generative model may be trained such that the log-likelihood of the observations

L=log pXτ1, . . . , Xτn(xτ1, . . . , xτn)  (5)

is maximized. In some embodiments, the continuous-time flow process (CTFP) {Fθ(Wτ;τ)}τ∈[0,T] may be defined such that

Xτ=Fθ(Wτ;τ), ∀τ∈[0,T],  (6)

where Fθ(⋅; τ):ℝd→ℝd is an invertible mapping parametrized by the learnable parameters θ for every τ∈[0, T], and Wτ is a d-dimensional Wiener process.

In some embodiments, the log-likelihood in Equation 5 may be rewritten using the change of variables formula. For example, let wτi=Fθ−1(xτi; τi) denote the value of the base Wiener process obtained by inverting the mapping at each observation. Then

L=Σi=1n[log pWτi|Wτi−1(wτi|wτi−1)−log|det(∂Fθ(wτi;τi)/∂wτi)|],  (7)

where τ0=0, W0=0, and the conditional density pWτi|Wτi−1 is described above in Equation 1.
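The following sketch illustrates Equation 7 under the simplifying assumption that Fθ is a hand-coded elementwise map F(w; τ)=exp(a(τ)·w) rather than a learned dynamic normalizing flow; the function a(τ) and the example data are arbitrary placeholders.

```python
# Minimal sketch of the CTFP log-likelihood (Equation 7), assuming a simple
# hand-coded invertible map F(w; tau) = exp(a(tau) * w) in place of a learned
# dynamic normalizing flow. All names are illustrative, not the trained model.
import numpy as np
from scipy.stats import norm

def a(tau):                      # toy time-dependent scale, a(tau) > 0
    return 1.0 + 0.1 * tau

def F_inv(x, tau):               # w = log(x) / a(tau)
    return np.log(x) / a(tau)

def log_abs_det_jac_F(w, tau):   # log|det dF/dw| for F = exp(a*w), elementwise
    return np.sum(np.log(a(tau)) + a(tau) * w)

def ctfp_log_likelihood(xs, taus):
    """xs: (n, d) observations at increasing timestamps taus (n,)."""
    ws = np.stack([F_inv(x, t) for x, t in zip(xs, taus)])
    ll = 0.0
    w_prev, t_prev = np.zeros(xs.shape[1]), 0.0
    for w, t in zip(ws, taus):
        # Wiener transition density (Equation 1), elementwise Gaussian
        ll += norm.logpdf(w, loc=w_prev, scale=np.sqrt(t - t_prev)).sum()
        ll -= log_abs_det_jac_F(w, t)          # change of variables term
        w_prev, t_prev = w, t
    return ll

taus = np.array([0.5, 1.0, 2.5])
xs = np.exp(np.random.default_rng(1).normal(size=(3, 2)))
print(ctfp_log_likelihood(xs, taus))
```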
Reference is made to
In
In some embodiments, the normalizing flow Fθ(⋅; τ) may transform a base distribution induced by Wτ on an arbitrary time grid into a more complex shape in the observation space. In some scenarios, given a continuous realization of Wτ, as long as Fθ(⋅; τ) is implemented as a continuous mapping, the resulting trajectory xτ may also be continuous.
In some embodiments, one or more normalizing flow models indexed by time τ may be used as Fθ(⋅; τ) in Equation 6. In some embodiments, operations may include the continuous normalizing flow and ANODE because they may have a free-form Jacobian and an efficient trace estimator [15, 21]. In particular, in some embodiments, the following may be an instantiation of ANODE as a generative model: For any τ∈[0, T] and wτ∈ℝd, a mapping of wτ to xτ may be provided by solving the following initial value problem:

dhτ(t)/dt=fθ(hτ(t), aτ(t), t), daτ(t)/dt=gθ(aτ(t), t), with hτ(t0)=wτ and aτ(t0)=τ,  (8)

where hτ(t)∈ℝd, t∈[t0, t1], fθ:ℝd×ℝ×[t0, t1]→ℝd, and gθ:ℝ×[t0, t1]→ℝ. Then Fθ in Equation 6 may be defined as the solution of hτ(t) at t=t1:

Fθ(wτ;τ):=hτ(t1)=hτ(t0)+∫t0t1 fθ(hτ(t), aτ(t), t)dt.  (9)

In some embodiments, the index t may represent the independent variable in the initial value problem and should not be confused with τ, the timestamp of the observation.

Based on Equation 4, the log-likelihood may be provided as follows:

L=Σi=1n[log pWτi|Wτi−1(hτi(t0)|hτi−1(t0))−∫t0t1 Tr(∂fθ/∂hτi(t))dt],

where hτi(t0)=wτi=Fθ−1(xτi; τi) may be computed by solving the initial value problem backwards in t from hτi(t1)=xτi.
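The following sketch illustrates the structure of Equations 8 and 9 using a generic ODE solver; the functions f and g below are toy stand-ins for the learned networks fθ and gθ, and the integration interval [t0, t1]=[0, 1] is an assumption for illustration.

```python
# Illustrative sketch of the ANODE-style mapping of Equations 8-9 using
# scipy's ODE solver; f and g below are toy stand-ins for learned networks.
import numpy as np
from scipy.integrate import solve_ivp

def f(h, a, t):                  # stand-in for the learned f_theta
    return np.tanh(h) * a

def g(a, t):                     # stand-in for the learned g_theta
    return np.array([0.5 * a[0]])

def F(w, tau, t0=0.0, t1=1.0):
    """Map base-process value w at timestamp tau to observation space."""
    d = len(w)
    def ode(t, state):
        h, a = state[:d], state[d:]
        return np.concatenate([f(h, a, t), g(a, t)])
    sol = solve_ivp(ode, (t0, t1), np.concatenate([w, [tau]]), rtol=1e-6)
    return sol.y[:d, -1]         # h_tau(t1), i.e., F_theta(w; tau)

print(F(np.array([0.3, -0.7]), tau=2.0))
```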
As described, in some embodiments, operations of the CTFP model may provide for interpolation and extrapolation operations. Time-indexed normalizing flows and the Brownian bridge may define conditional distributions on arbitrary timestamps. They may permit the CTFP model to provide operations for interpolation and extrapolation given partial observations, which may be beneficial with operations for time series modeling.
In some embodiments, interpolation means that a system may model the conditional distribution p(xτ|xτ1, . . . , xτn) for all τ∈[τi, τi+1] and i=1, . . . , n−1. Operations may include mapping the values xτ, xτi, and xτi+1 to wτ, wτi, and wτi+1 using the inverse mapping Fθ−1(⋅; τ), evaluating the conditional density of wτ given wτi and wτi+1 using the Brownian bridge, and applying the change of variables formula to obtain the conditional density of xτ in the observation space.
In some embodiments, operations for extrapolation may be provided based on Equation 1. Accordingly, the model may predict continuous trajectories into future time periods, given past observations.
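The following sketch illustrates interpolation and extrapolation in the base space under the same toy invertible map used in the earlier sketch; in practice, Fθ would be the trained dynamic normalizing flow, and conditional densities, rather than single samples, may be evaluated in the same way.

```python
# Hedged sketch of interpolation/extrapolation in the base space, reusing a
# toy invertible map F(w; tau) = exp(a(tau) * w); in practice F would be the
# trained dynamic normalizing flow.
import numpy as np

a = lambda tau: 1.0 + 0.1 * tau
F = lambda w, tau: np.exp(a(tau) * w)
F_inv = lambda x, tau: np.log(x) / a(tau)

def interpolate(x1, t1, x3, t3, t, rng):
    """Sample X_t given observations x1 at t1 and x3 at t3, t1 < t < t3."""
    w1, w3 = F_inv(x1, t1), F_inv(x3, t3)
    mean = w1 + (t - t1) / (t3 - t1) * (w3 - w1)        # Brownian bridge
    std = np.sqrt((t3 - t) * (t - t1) / (t3 - t1))
    return F(rng.normal(mean, std), t)

def extrapolate(x_s, s, t, rng):
    """Sample X_t given the last observation x_s at time s < t (Equation 1)."""
    w_s = F_inv(x_s, s)
    return F(rng.normal(w_s, np.sqrt(t - s)), t)

rng = np.random.default_rng(2)
print(interpolate(np.array([1.2]), 1.0, np.array([0.8]), 3.0, 2.0, rng))
print(extrapolate(np.array([0.8]), 3.0, 4.5, rng))
```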
In some embodiments, the CTFP model may inherit a Markov property from the Wiener process, which may be a strong assumption and may limit its ability to model stochastic processes with complex temporal dependencies. In order to enhance the expressive power of the CTFP model, in some embodiments, operations may augment the CTFP model with a latent variable Z∈ℝm, whose prior distribution may be an isotropic Gaussian pZ(z)=N(z; 0, Im). In particular, the data distribution may be approximated by a diverse collection of CTFP models conditioned on sampled latent variables z.
The generative model provided by Equation 6 may be augmented to Xτ=Fθ(Wτ; Z, τ), ∀τ∈[0, T], which may provide a conditional distribution of Xτ1, . . . , Xτn given Z=z.
Depending on the sample of the latent variable z, the CTFP model may include different gradient fields and may provide different output distributions.
For ease of notation, the subscripts of density functions may be omitted in the present disclosure. For the augmented generative model, the log-likelihood may be L=log p(xτ1, . . . , xτn)=log ∫ p(xτ1, . . . , xτn|z)pZ(z)dz, which may be intractable to evaluate directly.
Based on examples of variational autoencoder approaches [29], in some embodiments, model operations may include an approximate posterior distribution of Z given Xτ1, . . . , Xτn, denoted q(z|xτ1, . . . , xτn), and may optimize an importance-weighted autoencoder (IWAE) lower bound on the log-likelihood:

LIWAE=Ez1, . . . , zK˜q[log (1/K) Σk=1K (p(xτ1, . . . , xτn|zk)pZ(zk)/q(zk|xτ1, . . . , xτn))],
where K is the number of samples from the approximate posterior distribution.
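The following sketch illustrates an IWAE bound estimator of the form described above; the joint density, approximate posterior, and sampler are one-dimensional Gaussian stand-ins rather than the disclosed encoder and decoder.

```python
# Illustrative estimator of the IWAE bound with K posterior samples. The
# densities below are simple stand-ins; in the disclosed models, p(x, z)
# would combine the conditional likelihood with the latent prior, and q
# would be the encoder's approximate posterior.
import numpy as np
from scipy.stats import norm

def iwae_bound(log_joint, log_q, sample_q, K):
    """Estimate E[log (1/K) sum_k p(x, z_k) / q(z_k | x)] with z_k ~ q."""
    log_w = np.array([log_joint(z) - log_q(z)
                      for z in (sample_q() for _ in range(K))])
    m = log_w.max()                              # stable log-mean-exp
    return m + np.log(np.mean(np.exp(log_w - m)))

rng = np.random.default_rng(3)
x = 1.3                                          # toy scalar observation
log_joint = lambda z: norm.logpdf(x, loc=z) + norm.logpdf(z)
log_q = lambda z: norm.logpdf(z, loc=x / 2, scale=0.8)
sample_q = lambda: rng.normal(loc=x / 2, scale=0.8)
print(iwae_bound(log_joint, log_q, sample_q, K=125))
```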
To illustrate some embodiments of the present disclosure, the following provides description of example models and synthetic data generated from common continuous-time stochastic processes and complex real-world datasets. Embodiments of the CTFP models and latent CTFP models may be compared with baseline models, such as latent ODEs [44] and variational RNNs (VRNNs) [12]. Example latent ODE models with the ODE-RNN encoder may be designed specifically to model time series data with irregular observation times. Example VRNN models may be variational filtering models that may demonstrate superior performance on structured sequential data.
For VRNNs, experiments appended the time gap between two observations as an additional input to the neural network. Both latent CTFP and latent ODE models may utilize ODE-RNN [44] as the inference network; GRU [11] may be used as the RNN cell in latent CTFP, latent ODE, and VRNN models. All latent variable models may have the same latent dimension and GRU hidden state dimension.
Synthetic datasets may be provided. In some experiments, three irregularly-sampled time series datasets may be simulated, and the datasets may be univariate. Geometric Brownian motion (GBM) may be a continuous-time stochastic process widely used in mathematical finance, and may satisfy the following stochastic differential equation: dXτ=μXτdτ+σXτdWτ, where μ and σ are the drift term and variance term, respectively.
Example timestamps of observations may be in the range between 0 and T=30 and may be sampled from a homogeneous Poisson point process with an intensity of λtrain=2. To further evaluate the model's capacity to capture the dynamics of GBM, experiments tested the model with observation time-steps sampled from Poisson point processes with intensities of λtest=2 and λtest=20.
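The following sketch illustrates the described synthetic data generation for GBM, with observation timestamps drawn from a homogeneous Poisson process and exact GBM transitions; the initial value x0=1.0 is an assumption for illustration.

```python
# Sketch of the synthetic GBM dataset generation described above: observation
# timestamps from a homogeneous Poisson process, exact GBM transitions.
import numpy as np

def poisson_timestamps(rate, horizon, rng):
    ts, t = [], 0.0
    while True:
        t += rng.exponential(1.0 / rate)
        if t > horizon:
            return np.array(ts)
        ts.append(t)

def sample_gbm(taus, mu, sigma, x0, rng):
    """Exact GBM samples: X_t = X_s * exp((mu - sigma^2/2)(t-s) + sigma(W_t - W_s))."""
    xs, x, t_prev = [], x0, 0.0
    for t in taus:
        dt = t - t_prev
        x = x * np.exp((mu - 0.5 * sigma ** 2) * dt
                       + sigma * rng.normal(scale=np.sqrt(dt)))
        xs.append(x)
        t_prev = t
    return np.array(xs)

rng = np.random.default_rng(0)
taus = poisson_timestamps(rate=2.0, horizon=30.0, rng=rng)   # lambda_train = 2, T = 30
xs = sample_gbm(taus, mu=0.2, sigma=0.5, x0=1.0, rng=rng)    # x0 is assumed
print(len(taus), xs[:5])
```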
In some scenarios, the Ornstein-Uhlenbeck process (OU process) may be another type of continuous-time stochastic process. The OU process may satisfy the following stochastic differential equation: dXτ=θ(μ−Xτ)dτ+σdWτ. Experiments used the same set of observation intensities as in the above-described GBM experiments to sample observation timestamps in the training and test sets.
In some scenarios, to demonstrate the latent CTFP's capability to model sequences sampled from different continuous-time stochastic processes, experiments were conducted to train some embodiments of models on a dataset generated by mixing the sequences sampled from two different OU processes with different values of θ, μ, σ, and different observation intensities.
Reference is made to
In the table 400 of
The results on the test set sampled from the GBM indicate that the CTFP model may recover the true data generation process as the NLL estimated by CTFP is close to the ground truth. In contrast, latent ODE and VRNN models may fail to recover the true data distribution. On the M-OU dataset, the latent CTFP models show better performance than the other models. Moreover, latent CTFP outperforms CTFP by 0.016 nats, indicating its ability to leverage the latent variables.
Although trained on samples with an observation intensity of λtrain=2, embodiments of the CTFP model may better adapt to samples with a bigger observation intensity (and thus denser time grid) of λtest=20. In some scenarios, the superior performance of CTFP models when λtest=20 may be due to its capability to model continuous stochastic processes, whereas the baseline models may not have the notion of continuity. Such observations may be illustrated in ablation study findings (to be described in the present disclosure), where the base Wiener process may be replaced with i.i.d. Gaussian random variables, such that the base process is no longer continuous in time.
In some scenarios, for the interpolation task, the results of CTFP may be consistent with the ground truth in terms of both point estimation and uncertainty estimation. For latent ODE on the interpolation task,
In addition to observed challenges with the interpolation task, a qualitative comparison between samples may further highlight the importance of embodiments of the CTFP models' continuity when generating samples of continuous dynamics.
Experiments were also conducted on real-world datasets having continuous or complex dynamics.
The following three datasets were considered. First, Mujoco-Hopper [44] includes 10,000 sequences that are simulated by a “Hopper” model from the DeepMind Control Suite in a MuJoCo environment [48].
Second, PTB Diagnostic Database (PTBDB) [4] includes excerpts of ambulatory electrocardiography (ECG) recordings. Each sequence is one-dimensional and the sampling frequency of the recordings is 125 Hz.
Third, the Beijing Air-Quality Dataset (BAQD) [49] is a dataset consisting of multi-year recordings of weather and air quality data across different locations in Beijing. The variables may include temperature, pressure, and wind speed, and the values may have been recorded once per hour. In some experiments, the data was segmented into sequences, each covering the recordings of a whole week.
Similar to the synthetic data experiment settings, experiments compared the CTFP and latent CTFP models against latent ODE and VRNN. In some scenarios, the latent ODE model in the original work [44] used a fixed output variance and was evaluated using mean squared error (MSE). Such a model was adapted with a predicted output variance. In some experiments, the effect of using RealNVP [14] as the invertible mapping Fθ(⋅; τ) was explored. This experiment can be regarded as an ablation study and is described herein with reference to ablation studies.
Reference is made to
For embodiments of the CTFP model provided in the present disclosure, the reported values are exact. For the other three example models, the reported results are based on IWAE bounds using K=125 samples. Lower values may correspond to better performance. Standard deviations were based on 5 independent runs.
The table 600 of
Finite-Dimensional Distribution of CTFP: As described in the present disclosure, Equation 7 (provided above) is the log density of the distribution obtained by applying the normalizing flow models to the finite-dimensional distribution of the Wiener process on a given time grid. In some examples, one may query whether the distribution described by Equation 7 necessarily matches the finite-dimensional distribution of Xτ=Fθ(Wτ, τ). In other words, to justify Equation 7, it may be shown that the distributions of samples obtained in two different ways coincide: (1) first obtaining a sample path of Xτ by applying the transformation defined by Fθ to a sample of Wτ and then obtaining the finite-dimensional observation of Xτ on the time grid; (2) first obtaining the finite-dimensional sample of Wτ and applying the normalizing flows to this finite-dimensional distribution.
In some scenarios, to show the finite-dimensional distribution of CTFP, operations may work with the canonical Wiener space (Ω, Σ) equipped with the unique Wiener measure μW, where Ω=C([0, +∞), ℝd) may be the set of continuous functions from [0, +∞) to ℝd, Σ is the Borel σ-algebra generated by all the cylinder sets of C([0, +∞), ℝd), and Wτ(ω)=ω(τ) for ω∈Ω. Further description may be provided in secondary sources (see e.g., Chapter 2 of [35]).
Given a time grid 0<τ1<τ2< . . . <τn, the distribution of observations of the Wiener process on this discrete time grid may be called the finite-dimensional distribution of Wτ. It may be a push-forward measure on (ℝd×n, B(ℝd×n)) induced by the projection mapping πτ1, . . . , τn: ω↦(ω(τ1), . . . , ω(τn)).
Proposition 1: Let Fθ(⋅,⋅) be defined as in Equations 8 and 9. The mapping from (Ω, Σ, μW) to (Ω, Σ) defined by ω(τ)→Fθ(ω(τ),τ) may be measurable and therefore induces a pushforward measure μW∘Fθ−1.
As an example proof: As Fθ(w, τ) is continuous in both w and τ, it may be shown that Fθ(ω(τ), τ) is also continuous in τ for each ω continuous in τ. As Fθ(⋅, τ) is invertible for each τ, Fθ(⋅, τ) is a homeomorphism between ℝd and ℝd. Therefore, the pre-image of each Borel set of ℝd under Fθ(⋅, τ) for each τ is also Borel. As a result, the pre-image of each cylinder set of C([0, +∞), ℝd) under the mapping defined by Fθ(⋅,⋅) may also be a cylinder set, which may be enough to show the mapping is measurable.
The proposition shows that Xτ is a stochastic process defined on the same space of continuous functions as the Wiener process. The present example provides a solid basis for defining the finite-dimensional distribution of Xτ on ℝd×n in similar ways as the Wiener process, using projection. The two sampling methods mentioned above can be characterized by two different mappings from (Ω, Σ, μW) to (ℝd×n, B(ℝd×n)): (1) applying the transformation defined by Fθ to a function in C([0, +∞), ℝd) and then applying the projection π to the transformed function given a time grid; (2) applying the projection to a continuous function on a time grid and applying the transformation defined by Fθ(⋅, τ) for each τ individually. In some embodiments, the pushforward measures induced by the two mappings may be checked to agree on every Borel set of ℝd×n as their pre-images are the same in (Ω, Σ, μW). Accordingly, the following proposition may be provided.
Proposition 2: Given a finite subset {τ1, τ2, . . . , τn}⊂(0, +∞), the finite-dimensional distribution of Xτ is the same as the distribution of (Fθ(Wτ1; τ1), . . . , Fθ(Wτn; τn)), the density of which is given by Equation 7.
Proof: It suffices to check that given the fixed time grid, for each Borel set B⊂ℝd×n, the preimage of B is the same under the two mappings. They are both {ω|(Fθ(ω(τ1); τ1), . . . , Fθ(ω(τn); τn))∈B}.
To supplement description of embodiments and features thereof that are described in the present disclosure, the following description provides further example details regarding synthetic dataset generation, real-world dataset pre-processing, model architecture as well as training and evaluation settings.
Synthetic Dataset: For the geometric Brownian motion (GBM), in some embodiments, systems may sample 10000 trajectories from a GBM with a drift of μ=0.2 and a variance term of σ=0.5 in the interval of [0, 30]. The timestamps of the observations may be sampled from a homogeneous Poisson point process with an intensity of λtrain=2. Systems may evaluate the model on the observation timestamps sampled from two homogeneous Poisson processes separately with intensity values of λtest=2 and λtest=20.
For the Ornstein-Uhlenbeck (OU) process, the parameters of the process that the system may sample trajectories from are θ=2, μ=1, and σ=10. The system may be configured to also sample 10000 trajectories and utilize the same set of observation intensity values, λtrain and λtest, to sample observation timestamps from homogeneous Poisson processes for training and test.
For the mixture of OU processes (MOU), systems may sample 5000 sequences from each of two different OU processes and mix them to obtain 10000 sequences. One OU process has the parameters of θ=2, μ=1, and σ=10 and the observation timestamps may be sampled from a homogeneous Poisson process with λtrain=2. The other OU process may have the parameters of θ=1.0, μ=2.0, and σ=5.0 with observation timestamps sampled with λtrain=20.
For the 10000 trajectories of each dataset, systems may be configured to use 7000 trajectories for training and 1000 trajectories for validation. Systems may test the model on 2000 trajectories for each value of λtest. To test the model with λtest=20 on GBM and OU process, systems may also use 2000 sequences.
Real-World Dataset Details: As described in the present disclosure, experiments were conducted to compare embodiments of the CTFP models against the baselines on three datasets: Mujoco-Hopper, Beijing Air-Quality dataset (BAQD), and PTB Diagnostic Database (PTBDB). The three datasets may be obtained from publicly available sources.
In some experiments, the system padded all sequences into the same length for each dataset. The sequence length of the Mujoco-Hopper dataset was 200 and the sequence length of BAQD was 168. The maximum sequence length in the PTBDB dataset was 650. Systems were configured to rescale the indices of sequences to real numbers in the interval of [0, 120] and to obtain the rescaled values as observation timestamps for all datasets.
To make the sequences asynchronous or irregularly-sampled, systems were configured to sample observation timestamps {τi}i=1n from a homogeneous Poisson process with an intensity of 2 that is independent of the data. For each sampled timestamp, the value of the closest observation was taken as its corresponding value. The timestamps of all sampled sequences were shifted by a value of 0.2 since W0=0 deterministically for the Wiener process and there may be no variance for the CTFP model's prediction at τ=0.
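The following sketch illustrates the described preprocessing on a stand-in sequence: indices are rescaled to [0, 120], query timestamps are drawn from a Poisson process with intensity 2, the closest observation is taken for each timestamp, and all timestamps are shifted by 0.2; the example sequence itself is synthetic.

```python
# Sketch of the irregular-sampling preprocessing described above.
import numpy as np

def irregularly_sample(values, rng, horizon=120.0, rate=2.0, shift=0.2):
    """values: (seq_len, d) regularly indexed sequence."""
    grid = np.linspace(0.0, horizon, num=len(values))   # rescaled indices
    ts, t = [], 0.0
    while True:
        t += rng.exponential(1.0 / rate)                 # Poisson process, intensity 2
        if t > horizon:
            break
        ts.append(t)
    ts = np.array(ts)
    nearest = np.abs(grid[None, :] - ts[:, None]).argmin(axis=1)
    return ts + shift, values[nearest]                   # shift timestamps by 0.2

rng = np.random.default_rng(0)
seq = np.sin(np.linspace(0, 10, 168))[:, None]           # stand-in BAQD-like sequence
taus, obs = irregularly_sample(seq, rng)
print(taus[:3], obs[:3, 0])
```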
To supplement description of embodiments in the present disclosure, further model architecture details will be described.
To ensure a fair comparison, systems were configured to utilize the same values for hyper-parameters including the latent variable and hidden state dimensions across all models. For experiments, systems were configured to maintain underlying architectures as similar as possible and to use the same experimental protocol across all models.
For CTFP and Latent CTFP, systems were configured to utilize a one-block augmented neural ODE module that maps the base process to the observation process. For the augmented neural ODE model, systems were configured with an MLP model consisting of 4 hidden layers of size 32-64-64-32 for the model in Equation 8 and Equation 12.
In practice, the implementation of g in the two equations may be optional and its representation power may be fully incorporated into f. This architecture may be used for both synthetic and real-world datasets. For the latent CTFP and latent ODE models described above, systems were configured to use the ODE-RNN model as the recognition network. For synthetic datasets, the ODE-RNN model consists of a one-layer GRU cell with a hidden dimension of 20 (the rec-dims parameter in its original implementation) and a one-block neural ODE module that has a single hidden layer of size 100, and it outputs a 10-dimensional latent variable. The same architecture was used by both latent ODE and latent CTFP models.
For real-world datasets, the ODE-RNN architecture used a hidden state of dimension 20 in the GRU cell and an MLP with a 128-dimensional hidden layer in the neural ODE module. The ODE-RNN model produced a 64-dimensional latent variable. For the generation network of the latent ODE (V2) model, systems were configured to use an ODE function with one hidden layer of size 100 for synthetic datasets and 128 for real-world datasets. The decoder network has 4 hidden layers of size 32-64-64-32 and maps a latent trajectory to Gaussian output distributions at different time steps.
The VRNN model is implemented using a GRU network. The hidden state of the VRNN models may be 20-dimensional for synthetic and real-world datasets. The dimension of the latent variable is 64 for real-world datasets and 10 for synthetic datasets. Systems were configured to use an MLP of 4 hidden layers of size 32-64-64-32 for the decoder network, an MLP with one hidden layer that has the same dimension as the hidden state for the prior proposal network, and an MLP with two hidden layers for the posterior proposal network. For synthetic data sampled from geometric Brownian motion, an exponential function is applied to the samples of all models; therefore, the distribution predicted by latent ODE and VRNN at each timestamp is a log-normal distribution.
In some example experiments, the following training and evaluation settings were used. For synthetic data, systems were configured to train all models using the IWAE bound with 3 samples and a flat learning rate of 5×10−4 for all models. Systems were configured to also consider models trained with or without the aggressive training scheme proposed by [22] for latent ODE and latent CTFP.
In some experiments, systems were configured to select the best-performing model among the ones trained with or without the aggressive scheme, based on the IWAE bound estimated with 25 samples on the validation set. The batch size may be 100 for CTFP models and 25 for all the other models. For experiments on real-world datasets, systems were configured to conduct a hyper-parameter search on learning rates over the two values of 5×10−4 and 10−4, and on whether to use the aggressive training scheme for latent CTFP and latent ODE models. Evaluation results of the best-performing model based on the IWAE bound estimated with 125 samples were provided.
Some experiments were configured to provide ablation study results. In some experiments, additional experiment results on real-world datasets were obtained.
Reference is made to
Experiments based on I.I.D. Gaussian as a base process were conducted. In this experiment, systems were configured to replace the base Wiener process with I.I.D. Gaussian random variables and to keep the other components of the models substantially unchanged. This experimental model and its latent variant are named CTFP-IID-Gaussian and latent CTFP-IID-Gaussian. As a result, trajectories sampled from CTFP-IID-Gaussian may not be continuous, and this experiment was conducted to study the continuity property of the models and its impact on modeling irregular time series data with continuous dynamics.
Reference is made to
The table 900 of
Results based on the tables in
Experiments based on CTFP-RealNVP: In the following experiment, systems were configured to replace the continuous normalizing flow in CTFP model with another normalizing flow model, RealNVP [14]. The variant of CTFP used for the experiment described below is named CTFP-RealNVP and its latent version may be termed latent CTFP-RealNVP. Note that the trajectories sampled from CTFP-RealNVP model may still be continuous. We evaluate CTFP-RealNVP and latent CTFP-RealNVP models on datasets with high dimensional data, Mujoco-Hopper, and BAQD.
Reference is made to
The table 1000 of
The table indicates that CTFP-RealNVP outperforms CTFP. However, when incorporating the latent variable, the latent CTFP-RealNVP performs significantly worse than latent CTFP. The worse performance might be because RealNVP cannot make full use of the information in the latent variable due to its structural constraints.
The following description provides additional details for latent ODE models based on Mujoco-Hopper data. In some examples, the original latent ODE model may focus on point estimation and may utilize the mean squared error as the performance metric [44]. When applied to the experimental settings of the present disclosure and evaluated using the log-likelihood, such a model may perform unsatisfactorily.
Reference is made to
To mitigate the above-described issue, in some embodiments, two modified versions of the latent ODE model may be provided. In a first version (V1), given a pretrained (original) latent ODE model, systems may be configured to conduct a logarithmic scale search for the output variance and identify a value that gives the best performance on the validation set.
In a second version (V2), systems may be configured to utilize an MLP to predict the output mean and variance. Both modified versions may have better performance than the original model, as shown in the table 1100 of
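The following sketch illustrates the described logarithmic-scale search for the output variance (V1); the candidate grid and the synthetic predictions are assumptions for illustration.

```python
# Minimal sketch of the logarithmic-scale search for the output variance (V1):
# evaluate the validation log-likelihood for candidate variances and keep the best.
import numpy as np
from scipy.stats import norm

def best_output_variance(preds, targets, candidates=np.logspace(-4, 1, 11)):
    """preds/targets: (n, d) predicted means and observed values."""
    lls = [norm.logpdf(targets, loc=preds, scale=np.sqrt(v)).sum() for v in candidates]
    return candidates[int(np.argmax(lls))]

rng = np.random.default_rng(0)
targets = rng.normal(size=(100, 3))
preds = targets + rng.normal(scale=0.3, size=(100, 3))
print(best_output_variance(preds, targets))   # best candidate lands near 0.3**2
```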
Qualitative Sample for VRNN Model: in some experiments, trajectories were sampled from the VRNN model [12] trained on Geometric Brownian Motion (GBM) by running the model on a dense time grid. The trajectories are illustrated in
A comparison of trajectories sampled from the model with trajectories sampled from GBM is provided. As illustrated, the sampled trajectories from VRNN may not be continuous in time.
In
In some experiments, systems were configured to use VRNN to estimate the marginal density of Xτ for each τ∈(0,5]. Some results are shown in
Conditioned on the sampled latent codes z0 and zτ, VRNN proposes p(xτ|x0, zτ, z0) at the second step. We average the conditional density over 125 samples of Zτ and Z0 to estimate the marginal density.
The marginal density estimated using a time grid with two timestamps may not be consistent with the trajectories sampled on a different dense time grid. The results indicate that the choice of time grid may have an impact on the distribution modeled by VRNN, and the distributions modeled by VRNN on different time grids may be inconsistent. In contrast, embodiments of CTFP models described in the present disclosure may not be susceptible to the above-described issues.
Reference is made to
Embodiments disclosed herein may be applicable to natural processes, such as environmental conditions, vehicle travel statistics over time, electricity consumption over time, asset valuation in capital markets, among other examples. In some other examples, generative models disclosed herein may be applied for natural language processing, recommendation systems, traffic pattern prediction, medical data analysis, or other types of forecasting based on irregular time series data. It may be appreciated that embodiments of the present disclosure may be implemented for other types of data sampling or prediction, likelihood density determination, or inference tasks such as interpolation or extrapolation based on irregular time series data sets.
At operation 1302, the processor may obtain time series data associated with a data query. A data query may be associated with a data sampling operation or a prediction operation, such as traffic pattern prediction, recommendation systems, weather forecasting, among other examples.
For embodiments associated with data sampling operations, the obtained time series data may be a set of time stamps associated with a desired prediction of an observable process. In some embodiments, the obtained time series data may be an incomplete realization of a continuous stochastic process. Accordingly, the desired sampling or prediction operations may be conducted for determining observed data points.
For embodiments associated with likelihood calculations, the obtained time series data may be an irregular time series data set and, as will be described, the irregular time series data set may be an observed process to be mapped based on a reversible mapping function to a set of data points of a Wiener process (or other continuous stochastic process), for which a likelihood determination may be made.
For embodiments associated with inference tasks, such as interpolation or extrapolation, the obtained time series data may include unobserved data points and, as will be described, a conditional density of corresponding data points of the Wiener process may be determined.
At operation 1304, the processor may generate a predicted value based on a sampled realization of the time series data and a continuous time generative model. The continuous time generative model may be trained to define an invertible mapping to maximize a log-likelihood of a set of predicted observation values for a time range associated with the time series data.
As an example, {(xτi, τi)}i=1n may denote the obtained time series data, where each observation xτi∈ℝd is associated with a timestamp τi. The continuous-time flow process {Fθ(Wτ; τ)}τ∈[0,T] may be defined such that

Xτ=Fθ(Wτ;τ), ∀τ∈[0,T],

where Fθ(⋅; τ):ℝd→ℝd is the invertible mapping parametrized by the learnable parameters θ for every τ∈[0, T], and Wτ is a d-dimensional Wiener process. In the present example, the stochastic process may be associated with a joint distribution of (Xτ1, . . . , Xτn), and the continuous time generative model may be trained such that the log-likelihood of the observations,

L=log pXτ1, . . . , Xτn(xτ1, . . . , xτn),

may be maximized.
As described in the present disclosure, a plurality of different normalizing flow models indexed by time τ may be used as Fθ(⋅; τ). For ease of exposition, the continuous normalizing flow and ANODE are provided as an illustrating example because they have a free-form Jacobian and an efficient trace estimator. To illustrate, the following is an instantiation of ANODE as a generative model: For any τ∈[0, T] and wτ∈ℝd, a mapping of wτ to xτ may be provided by solving the following initial value problem:

dhτ(t)/dt=fθ(hτ(t), aτ(t), t), daτ(t)/dt=gθ(aτ(t), t), with hτ(t0)=wτ and aτ(t0)=τ,

where hτ(t)∈ℝd, t∈[t0, t1], fθ:ℝd×ℝ×[t0, t1]→ℝd, and gθ:ℝ×[t0, t1]→ℝ. Then Fθ may be defined as the solution of hτ(t) at t=t1:

Fθ(wτ;τ):=hτ(t1)=hτ(t0)+∫t0t1 fθ(hτ(t), aτ(t), t)dt.

The index t may represent the independent variable in the initial value problem.

Further, the log-likelihood may be provided as follows:

L=Σi=1n[log pWτi|Wτi−1(hτi(t0)|hτi−1(t0))−∫t0t1 Tr(∂fθ/∂hτi(t))dt],

where hτi(t0)=wτi=Fθ−1(xτi; τi) may be computed by solving the initial value problem backwards in t from hτi(t1)=xτi.
In some embodiments, the predicted value may be associated with one or more observed process data points based on a sampled realization of a Wiener process and the invertible mapping, to provide a time continuous observed realization of the Wiener process.
In some embodiments, the invertible mapping may be based on training operations for decoding or deforming a base continuous Wiener process (or other continuous stochastic processes) into a complex observable process based on a dynamic instance of normalizing flows, as described with reference to some embodiments of the present disclosure.
Although the above-described example of an invertible mapping is based on a d-dimensional Wiener process, it may be contemplated that the continuous-time flow process may be based on other types of continuous stochastic process.
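The following sketch illustrates how a predicted value may be generated at query timestamps as part of operation 1304: a realization of the base Wiener process is sampled on the query time grid and pushed through the invertible mapping. The toy map F below is a stand-in assumption for the trained continuous time generative model.

```python
# Hedged sketch of generating predicted values at query timestamps: sample a
# realization of the base Wiener process on the query time grid and push it
# through the invertible mapping (a toy stand-in for the trained model).
import numpy as np

F = lambda w, tau: np.exp((1.0 + 0.1 * tau) * w)     # toy invertible map

def predict(query_taus, d, rng):
    w, t_prev, preds = np.zeros(d), 0.0, []
    for tau in query_taus:
        w = w + rng.normal(scale=np.sqrt(tau - t_prev), size=d)
        preds.append(F(w, tau))
        t_prev = tau
    return np.stack(preds)

print(predict(np.array([0.3, 1.1, 4.0]), d=2, rng=np.random.default_rng(0)))
```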
In some scenarios, it may be beneficial to bolster the expressive power of embodiments of the continuous time generative model disclosed herein with a latent variable Z∈ℝm, whose prior distribution may be an isotropic Gaussian pZ(z)=N(z; 0, Im). Thus, in some embodiments, the invertible mapping may be parameterized by a latent variable having an isotropic Gaussian prior distribution.
In some embodiments, the data distribution may be approximated by a diverse collection of CTFP models conditioned on sampled latent variable z.
In some embodiments, the continuous time generative model may be augmented to Xτ=Fθ(Wτ; Z, τ), ∀τ∈[0, T], which may provide a conditional distribution of Xτ1, . . . , Xτn given Z=z.
Depending on the sample of the latent variable z, the generative model may have different gradient fields, thereby having different output distributions.
For the augmented generative model, the log-likelihood may be L=log p(xτ1, . . . , xτn)=log ∫ p(xτ1, . . . , xτn|z)pZ(z)dz. In some embodiments, an importance-weighted autoencoder (IWAE) bound may be optimized in place of the intractable log-likelihood, using an approximate posterior distribution q(z|xτ1, . . . , xτn):

LIWAE=Ez1, . . . , zK˜q[log (1/K) Σk=1K (p(xτ1, . . . , xτn|zk)pZ(zk)/q(zk|xτ1, . . . , xτn))],
where K may be the number of samples from the approximate posterior distribution.
In some embodiments, the obtained time series data may include observed process data points. The predicted value may represent a likelihood determination of stochastic process data points based on the observed data points and an inverse of the invertible mapping of the continuous time generative model.
In some embodiments, the obtained time series data may include unobserved process data points. The predicted value may represent a conditional probability density of stochastic data points based on the unobserved process data points and an inverse of the invertible mapping of the continuous time generative model. In some embodiments, the conditional probability density provides for data point interpolation associated with the stochastic process based on a Brownian bridge. In some embodiments, the conditional probability density provides for data point extrapolation of data points associated with the stochastic process based on a multivariate Gaussian conditional probability distribution.
Reference is made to
At operation 1402, the processor may obtain time series data. In some embodiments, the time series data may be a sequence of regularly spaced or a sequence of irregularly spaced time series data. The obtained time series data may be an incomplete realization of a continuous stochastic process {Xτ}τ∈[0,T], and the stochastic process may induce a joint distribution of (Xτ1, . . . , Xτn). The continuous time generative model may be trained such that the log-likelihood of the set of observations,

L=log pXτ1, . . . , Xτn(xτ1, . . . , xτn),

may be maximized.
At operation 1404, the processor may generate an invertible mapping associated with a continuous time generative model by maximizing the likelihood of the set of observations. The continuous-time flow {Fθ(Wτ; τ)}τ∈[0,T] may be defined such that Xτ=Fθ(Wτ; τ), ∀τ∈[0, T], where Fθ(⋅; τ):ℝd→ℝd is an invertible mapping parametrized by the learnable parameters θ for every τ∈[0, T], and Wτ is a d-dimensional Wiener process.
In some embodiments, the log-likelihood relation may be reformulated using the change of variables formula, where wτi=Fθ−1(xτi; τi) for each observation, such that

L=Σi=1n[log pWτi|Wτi−1(wτi|wτi−1)−log|det(∂Fθ(wτi;τi)/∂wτi)|],

where τ0=0, W0=0, and the conditional density pWτi|Wτi−1 is described above.
In some embodiments, the generated invertible mapping may be augmented with a latent variable having a prior distribution that may be an isotropic Gaussian, as described in some examples of the present disclosure.
At operation 1406, the processor may update the continuous time generative model based on the invertible mapping.
The invertible mapping and the continuous time generative model may be generated or updated over time based on sequences of irregularly spaced time series data obtained or captured over time. In some embodiments, the invertible mapping and the associated continuous time generative model may be configured to decode base continuous stochastic processes into a complex observable process using a dynamic instance of normalizing flows, thereby enabling inference tasks that may otherwise be unattainable when receiving datasets having complex or multivariate dynamics or when receiving data sets having irregularly spaced or arbitrary timestamps. In some embodiments, the invertible mapping may be augmented with latent variables, and continuous time generative models may be optimized based on variational optimization operations.
Training and generation of embodiments of the continuous time generative model, and the associated invertible mapping, may be used for operations for querying the continuous time generative model, including sampling operations, likelihood determination operations, or inference operations including data interpolation or extrapolation tasks. Example operations using the continuous time generative model may include natural language processing operations, weather forecasting operations, pedestrian behaviour prediction operations for autonomous vehicles, among other examples.
Embodiments described in the present disclosure include systems that may be configured to conduct operations of a continuous-time flow process (CTFP) model, a reversible generative model for stochastic processes, and an associated latent variant. In some embodiments, the systems may be configured to map a continuous-time stochastic process, i.e., the Wiener process, into a more complicated process in the observable space. Beneficial or desirable properties of the Wiener process may be retained, including the efficient sampling of continuous paths, likelihood evaluation on arbitrary timestamps, and inter-/extrapolation given observed data. Example experiment results described in the present disclosure illustrate advantages and superior performance of some embodiments of the proposed models on various datasets, as compared to other models.
[3] Jens Behrmann, Will Grathwohl, Ricky T Q Chen, David Duvenaud, and Joern-Henrik Jacobsen. Invertible residual networks. In International Conference on Machine Learning, pages 573-582, 2019.
[16] James Durbin and Siem Jan Koopman. Time Series Analysis by State Space Methods. Oxford University Press, 2012.
Provisional application: No. 62/971,143, filed February 2020 (US).