The present disclosure is directed at artificial neural networks applied to time series forecasting.
Time series forecasting is among the most well-known problems in many domains, such as sensor network monitoring, traffic and economics planning, astronomy, economic and financial forecasting, inventory planning, and weather and disease propagation forecasting. Although neural networks were first applied to time series forecasting many years ago, the recent advances in Deep Neural Networks (DNNs) have led to a rapid rise in the use of DNNs for time series forecasting, along with many other machine learning tasks. Among the approaches that have been proposed to process sequential inputs, initial works in this field focused primarily on Recurrent Neural Networks (RNNs) such as LSTMs and GRUs.
Salinas et al., "Probabilistic forecasting with autoregressive recurrent networks," International Journal of Forecasting, 36(3): 1181-1191, 2020, proposed DeepAR as an auto-regressive model based on RNNs to model the probabilistic distribution of future series. Although RNN-based models can achieve reasonable results, one of their major drawbacks is the vanishing/exploding gradient problem, which makes them a less suitable choice for predicting long sequence time series. Advances in the state of the art overcame the vanishing gradient problem of RNNs by proposing Transformers, deep neural networks based on self-attention modules. In contrast with RNN-based models, in which the input sequence is processed sequentially, Transformers can process the whole input sequence together using the attention mechanism, which makes the model able to process longer sequences. However, none of these improvements addresses the need for predictions in long sequence time series that improve the ability of the computations to provide model flexibility while at the same time inhibiting computational drift away from the desired solution of the output series.
There are several existing methods targeting the memory and time efficiency of Transformers. However, the focus of these methods is mainly on improving the attention mechanism or adding time-series-based modules such as Series-Decomposition.
Since time-series forecasting plays an important role in many domains, there exists a vast variety of time-series forecasting methods known in the art. Traditional methods such as ARIMA models and deep exponential models have existed for a long time. Recurrent Neural Networks (RNNs) dominated time series forecasting among the early machine learning based methods. DeepAR is based on training an auto-regressive RNN model on a large number of related time series. DeepState combines traditional state-space models with RNNs. Temporal Convolutional Networks (TCNs) later showed comparable or even better results than RNN-based models across a diverse range of tasks and datasets. Recent work in the state of the art applied transformers to time-series forecasting by leveraging self-attention mechanisms to learn complex patterns and dynamics from time series data. Examples of transformer applications in the art include: applying a transformer-based model to multivariate time series forecasting; a probabilistic, non-auto-regressive transformer-based model that integrates state space models, with state-of-the-art accuracy for univariate and multivariate time-series forecasting tasks; the Informer architecture, which uses ProbSparse attention instead of the full attention of the original transformers to improve the time complexity from O(L^2) to O(L log L); and Autoformer, which uses a cross-correlation-based attention (Auto-Correlation) to not only obtain the scores but also compare the local information as well as the point-wise information. However, none of these state of the art improvements addresses the need for predictions in long sequence time series that improve the ability of the computations to provide model flexibility while at the same time inhibiting computational drift away from the desired solution of the output series.
Multiscale Vision Transformers have been used for video and image recognition by connecting the seminal idea of multiscale feature hierarchies with transformer models; however, these methods focus on the spatial domain and are specially designed for computer vision tasks.
It is an object of the present invention to provide a system and method for time series forecasting that obviates or mitigates at least one of the above presented disadvantages.
One object of the present invention is to pass an input series at multiple different resolutions in order to facilitate a neural network in computing and forecasting different components in time series forecasting.
According to a first aspect, there is provided a method for operating a neural network using an encoder-based model to provide a time series forecast, the method comprising: down sampling a time series dataset to generate an initial input having a first scale resolution, such that the first scale resolution is less than a scale resolution of the time series dataset; processing as a first iteration, using the model, the initial input to generate a first output; upsampling by an upsampling function the first output to generate a second input having a second scale resolution, the second scale resolution being higher than the first scale resolution, such that the second input is based on the first output; and processing as a second iteration, using the model, the second input to generate a second output; wherein the second output represents a time series forecast of the time series dataset.
According to another aspect, there is provided a system comprising: a processor; a database storing a time series dataset that is communicatively coupled to the processor; and a memory that is communicatively coupled to the processor and that has stored thereon computer program code that is executable by the processor and that, when executed by the processor, causes the processor to retrieve the time series dataset from the database and to use the time series dataset using an encoder-based model to provide a time series forecast by: down sampling a time series dataset to generate an initial input having a first scale resolution, such that the first scale resolution is less than a scale resolution of the time series dataset; processing as a first iteration, using the model, the initial input to generate a first output; upsampling by an upsampling function the first output to generate a second input having a second scale resolution, the second scale resolution being higher than the first scale resolution, such that the second input is based on the first output; and processing as a second iteration, using the model, the second input to generate a second output; wherein the second output represents a time series forecast of the time series dataset.
According to another aspect, there is provided an artificial neural network for operating a neural network using an encoder-based model to provide a time series forecast, by: down sampling a time series dataset to generate an initial input having a first scale resolution, such that the first scale resolution is less than a scale resolution of the time series dataset; processing as a first iteration, using the model, the initial input to generate a first output; upsampling by an upsampling function the first output to generate a second input having a second scale resolution, the second scale resolution being higher than the first scale resolution, such that the second input is based on the first output; and processing as a second iteration, using the model, the second input to generate a second output; wherein the second output represents a time series forecast of the time series dataset.
According to another aspect, there is provided a non-transitory computer readable medium having stored thereon computer program code that is executable by a processor and that, when executed by the processor, causes the processor to perform a method for operating a neural network using an encoder-based model to provide a time series forecast, the method comprising: down sampling a time series dataset to generate an initial input having a first scale resolution, such that the first scale resolution is less than a scale resolution of the time series dataset; processing as a first iteration, using the model, the initial input to generate a first output; upsampling by an upsampling function the first output to generate a second input having a second scale resolution, the second scale resolution being higher than the first scale resolution, such that the second input is based on the first output; and processing as a second iteration, using the model, the second input to generate a second output; wherein the second output represents a time series forecast of the time series dataset.
This summary does not necessarily describe the entire scope of all aspects. Other aspects, features and advantages will be apparent to those of ordinary skill in the art upon review of the following description of specific embodiments.
In the accompanying drawings, which illustrate one or more example embodiments:
Multi-scale and hierarchical processing is a known technique in the deep neural network (DNN) literature. Different transformations of a time series, such as down-sampling and smoothing, have been used along with the original signal in parallel as a part of the network to better capture temporal patterns and reduce the effect of random noise. Many attempts have been made in the state of the art to improve recurrent neural networks (RNNs) in tasks such as language processing, computer vision, time-series analysis, and speech recognition. However, these methods are mainly focused on proposing a new RNN-based module, which is unfortunately not directly applicable to transformers. The same direction has also been investigated in Transformers, TCN, and MLP models. In the most recent work, multi-scale segment-wise correlations have been used as a multi-scale version of the self-attention mechanism. However, these known works, including applications to Transformers, do not use a model-agnostic framework to utilize multi-scale time series in encoder-based models (e.g. transformers) while keeping the number of parameters and time complexity roughly the same, as is provided by the below discussed method 200 using the provided framework/network 100. As such, it is understood that state of the art solutions using transformers do not address the need, met by the method 200, for predictions in long sequence time series that improve the ability of the computations to provide model flexibility while at the same time inhibiting computational drift away from the desired solution of the output series.
Referring to
In particular, provided is a general multi-scale framework 100 that can be applied to the state-of-the-art transformer-based time series forecasting models 106 (FEDformer, Autoformer, etc.). By iteratively refining a forecasted time series (e.g. dataset 90) at multiple scales (e.g. see
As further described below, we enable scale-awareness (iterative multiscale application of the network 100) showcased by example in
Referring to
It is recognised that the dataset 90 can have implicit/inherent constituent parts present in the observation/time components, such as but not limited to: 1) level, the baseline value for the series if it were a straight line; 2) trend, the optional and often linear increasing or decreasing behavior of the series over time; 3) seasonality, the optional repeating patterns or cycles of behavior over time; and 4) noise, the optional variability in the observations that cannot be explained by the model. For example, all time series datasets 90 can have a level, most have noise, and the trend and seasonality can be optional. Further, it is recognised that many time series datasets 90 feature trends and seasonal variations, while another feature of time series datasets 90 can be that observations close together in time tend to be correlated (serially dependent).
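The constituent parts listed above can be illustrated with a small synthetic series. The following is a minimal sketch only: the additive composition, the function names and the parameter values are assumptions chosen for illustration and are not taken from the dataset 90, and the noise term is omitted so that the result is deterministic.

```python
import math


def synthetic_series(n, level=10.0, slope=0.05, period=24, amplitude=2.0):
    """Additive toy series: level + linear trend + sinusoidal seasonality."""
    series = []
    for t in range(n):
        trend = slope * t
        seasonal = amplitude * math.sin(2.0 * math.pi * t / period)
        series.append(level + trend + seasonal)
    return series


def lag_autocorrelation(series, lag=1):
    """Sample autocorrelation at `lag`; values near 1 indicate serial dependence."""
    n = len(series)
    mean = sum(series) / n
    var = sum((x - mean) ** 2 for x in series)
    cov = sum((series[t] - mean) * (series[t + lag] - mean)
              for t in range(n - lag))
    return cov / var


series = synthetic_series(240)
print(round(lag_autocorrelation(series, lag=1), 3))
```

A lag-1 autocorrelation close to 1 reflects the property noted above that observations close together in time tend to be correlated.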
Referring again to
The transformer model 106 can also employ an embedding of the input 102, 105 so that it has the same number of features as the hidden dimension of the model 106. The embedding can consist of three parts: a value embedding, a temporal embedding, and a position embedding. To emphasize the input scale, we concatenate a new value 1/s_i − 0.5 to the temporal embedding before passing it to the linear layer. We can also sample by a factor of s_i from the position embedding. In addition, so that the transformer model 106 can distinguish between the given lookback values and the predictions of the previous steps, we can further concatenate a binary value to the series before the value embedding, indicating whether each observation comes from the lookback window or from the prediction. See the Appendix for an example of the input embedding function.
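The two concatenation steps described above can be sketched as follows. This is an illustrative sketch only: the function names are hypothetical, the choice of 1.0 for lookback values and 0.0 for predicted values is an assumption (the text only says the value is binary), and the learned linear embedding layers themselves are not shown.

```python
def scale_indicator(scale):
    """Extra temporal feature 1/s_i - 0.5 that marks the current scale s_i."""
    return 1.0 / scale - 0.5


def add_scale_feature(temporal_features, scale):
    """Concatenate the scale indicator to each time step's temporal features."""
    s = scale_indicator(scale)
    return [list(features) + [s] for features in temporal_features]


def tag_observations(lookback, prediction):
    """Append a binary flag to every value before the value embedding.

    Assumed encoding: 1.0 = from the lookback window, 0.0 = from the
    prediction of the previous step.
    """
    tagged = [[value, 1.0] for value in lookback]
    tagged += [[value, 0.0] for value in prediction]
    return tagged


print(add_scale_feature([[0.1, 0.2]], scale=4))
print(tag_observations([1.0, 2.0], [3.0]))
```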
The transformer model 106 uses an encoder-decoder architecture. The encoder 106a can consist of encoding layers that process the input dataset 90 iteratively one layer after another to provide the input 102, while the decoder 106b consists of decoding layers that do the same thing to the encoder's 106a output 104. The function of each encoder 106a layer is to generate encodings that contain information about which parts of the inputs of the dataset 90 are relevant to each other. The encoder 106a passes its encodings to the next encoder 106a layer as inputs. Each decoder 106b layer does the opposite, taking all the encodings and using their incorporated contextual information to generate an output sequence 105, which is then provided as an input for the next selected higher resolution.
To provide for this, each encoder 106a and decoder 106b layer makes use of the attention mechanism. In general, for each input, attention weighs the relevance of every other input and draws from them to produce the output. Each decoder 106b layer has an additional attention mechanism that draws information from the outputs of previous decoders 106b, before the decoder 106b layer draws information from the encodings. Both the encoder 106a and decoder 106b layers can have a feed-forward neural network for additional processing of the outputs and contain residual connections and layer normalization steps, as desired. As an example embodiment, each encoder 106a can consist of two major components: a self-attention mechanism and a feed-forward neural network. The self-attention mechanism accepts input encodings from the previous encoder 106a and weighs their relevance to each other to generate output encodings. The feed-forward neural network further processes each output encoding individually. These output encodings are then passed to the next encoder 106a as its input, as well as to the decoders 106b. For example, the first encoder 106a takes positional information and embeddings of the input sequence dataset 90 as its input, rather than encodings. The positional information is utilized for the transformer model 106 to make use of the order of the sequence of the dataset 90. For example, each decoder 106b can consist of three major components: a self-attention mechanism, an attention mechanism over the encodings, and a feed-forward neural network. The decoder 106b functions in a similar fashion to the encoder 106a, but an additional attention mechanism is inserted which instead can draw relevant information from the encodings generated by the encoders 106a.
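The weighting of each input against every other input can be sketched with scaled dot-product attention. The text above does not specify which attention variant the model 106 uses, so the scaled dot-product form below is an assumption, and the function names are illustrative only. Learned projection matrices, multiple heads, residual connections and layer normalization are omitted.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention(queries, keys, values):
    """For each query, weigh every key and return the weighted sum of values."""
    d = len(keys[0])
    outputs = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
                  for k in keys]
        weights = softmax(scores)
        width = len(values[0])
        outputs.append([sum(w * v[j] for w, v in zip(weights, values))
                        for j in range(width)])
    return outputs


# Self-attention: queries, keys and values all come from the same sequence.
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
print(attention(x, x, x))
```

In the decoder's additional attention mechanism, the queries would come from the decoder side while the keys and values would come from the encodings; the same function covers that case by passing different arguments.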
It is recognized that each iteration 101 can simply use the same encoder 106a used to process the original dataset 90. Alternatively, each stage (e.g. iteration 101) can use a respective encoder 106a/decoder 106b pair, such that the encoder 106a used at each iteration 101 can be different from the previous encoder 106a used for the previous iteration 101.
It should be recognized that the horizon window of the input 102 is selected as the lowest initial resolution of the dataset 90 (e.g. 96—see
Like the first encoder 106a, the first decoder 106b takes positional information and embeddings of the output sequence 104 as its input, rather than encodings. The transformer model 106 does not use the current or future output to predict an output, so the output sequence can be partially masked to inhibit this reverse information flow. The last decoder 106b is followed by a final linear transformation and softmax layer, to produce the output probabilities 104 over the dataset 90.
Also included in the network 100 is the pooling function 108, e.g. a pooling layer, used to reduce (e.g. downsample) the temporal size of the input series of the dataset 90, so that the number of computations in the network 100 can be reduced. For example, pooling 108 performs downsampling by reducing the size of the series dataset 90 and sends only the data considered relevant to the next layers in the transformer model 106. For example, the pooling function 108 is used to select the initial scale resolution (e.g. 96; see Table 1) of the dataset 90 (e.g. as a defined horizon window), taking the dataset 90 and partitioning it into subsections.
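One simple way to realize such a downsampling step is average pooling over non-overlapping windows. The sketch below assumes average pooling and a hypothetical function name; the text above does not fix the type of pooling used by the pooling function 108.

```python
def average_pool(series, factor):
    """Downsample by averaging non-overlapping windows of `factor` values.

    A trailing partial window is averaged over the values it contains.
    """
    if factor < 1:
        raise ValueError("factor must be a positive integer")
    pooled = []
    for start in range(0, len(series), factor):
        window = series[start:start + factor]
        pooled.append(sum(window) / len(window))
    return pooled


print(average_pool([1.0, 3.0, 5.0, 7.0], 2))
```

Each pooled value summarizes one subsection of the original series, so a factor of 4 reduces a series of 96 values to 24.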
Also included in the network 100 is an upsampling function 110, which is used to upsample the output 104 to the next higher scale resolution. For example, the upsampling function 110 can upscale the output 104 from the resolution 96 to the resolution 192, then the output 104 from the resolution 192 to the resolution 336, and then the output 104 from the resolution 336 to the resolution 720. For example, each of the rows of Table 1 represents the results of one individual operation of the network 106, such that row 96 represents one iteration 101 of the network 106, row 192 represents two iterations 101, row 336 represents three iterations 101 and row 720 represents four iterations 101, as discussed above.
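An upsampling step of this kind can be sketched with linear interpolation onto a longer, evenly spaced grid. Linear interpolation and the function name are assumptions made for illustration; the text above does not state which interpolation the upsampling function 110 uses.

```python
def upsample_linear(series, target_len):
    """Linearly interpolate `series` onto `target_len` evenly spaced points."""
    n = len(series)
    if n == 0 or target_len < 1:
        raise ValueError("need a non-empty series and a positive target length")
    if n == 1 or target_len == 1:
        return [float(series[0])] * target_len
    out = []
    for j in range(target_len):
        # Position of output point j on the index axis of the input series.
        pos = j * (n - 1) / (target_len - 1)
        lo = int(pos)
        hi = min(lo + 1, n - 1)
        frac = pos - lo
        out.append(series[lo] * (1.0 - frac) + series[hi] * frac)
    return out


coarse = [float(i) for i in range(96)]
fine = upsample_linear(coarse, 192)
print(len(fine))
```

Because the target length is arbitrary, the same function can be chained for the steps 96 to 192, 192 to 336 and 336 to 720 mentioned above.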
Also included in the network 100 can be a normalization function 112, used to process the output 105 of the decoder using a (e.g. zero-mean) normalization, as further described below. It is recognised that this function 112 can be optional. For example, the normalization function 112 can be used only on the input 102 and not on any of the output 105. For example, the normalization function 112 can be used on the input 102 and on each of the output 105. For example, the normalization function 112 can be omitted for the input 102 and instead used on one or more of the output 105.
Also included in the network 100 can be a loss function 114, used to process the output 105 of the decoder, as further described below. For example, the loss function 114 takes a theoretical proposition of the output 104 to a practical one. Building an accurate predictor model 106 uses constant iteration of the problem. The criterion by which a statistical model 106 is scrutinized is its performance, i.e. how accurate the model's 106 decisions are, as judged by the loss function 114, which calculates how far a particular iteration output 104 of the model 106 is from the actual values (e.g. of the dataset 90). In particular, the loss function 114 measures how far an estimated value output 104 is from its true value. The loss function 114 can be thought of as mapping decisions to their associated costs, as further discussed below. In this way, the loss function 114 operates on the output 104 to provide the output dataset 120 as the time series prediction (e.g. a generated future series based on the original dataset 90).
While the multi-scale framework 100 can reduce the error of the final prediction output 104 for the horizon window of the original input 102, we found that further changes to the loss function 114 can also improve the final results output 104 of the last iteration 101. Using the mean squared error (MSE) loss for training the model 106 can make the training process noisy in the presence of outliers. In such scenarios, using more robust loss functions such as the Huber loss can improve the performance. However, in datasets 90 without significant outliers, using the Huber loss can hinder the training of harder samples. Considering this, we can use the adaptive loss function 114 proposed by Barron, adapted for time-series forecasting via the transformer model 106; see the Appendix for further details.
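The different sensitivity of the MSE and Huber losses to outliers can be seen in a short numerical sketch. The function names and the threshold delta=1.0 are illustrative assumptions and are not taken from the Appendix.

```python
def mse_loss(pred, target):
    """Mean squared error: every residual is squared."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


def huber_loss(pred, target, delta=1.0):
    """Huber loss: quadratic for small residuals, linear beyond `delta`."""
    total = 0.0
    for p, t in zip(pred, target):
        err = abs(p - t)
        if err <= delta:
            total += 0.5 * err ** 2
        else:
            total += delta * (err - 0.5 * delta)
    return total / len(pred)


# One well-predicted point and one outlier with a residual of 10.
pred, target = [0.0, 0.0], [0.0, 10.0]
print(mse_loss(pred, target), huber_loss(pred, target))
```

The single outlier contributes quadratically to the MSE but only linearly to the Huber loss, which is why the latter is less noisy in the presence of outliers.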
In operation of the network 100 of
In our experiments, we used four public datasets with different characteristics to compare our framework 100 with the baselines of Table 1. Electricity Consuming Load (ECL) collects the electricity consumption (kWh) of 321 clients. Due to missing data, the dataset is converted into hourly consumption over 2 years, and ‘MT 320’ is set as the target value. The train/val/test split is 15/3/4 months. Traffic is the hourly occupancy rate of 963 car lanes of San Francisco bay area freeways. Weather contains local climatological data for nearly 1,600 U.S. locations over 4 years from 2010 to 2013, where data points are collected every 1 hour. Each data point consists of the target value “wet bulb” and 11 climate features. The train/val/test split is 28/10/10 months. Exchange-Rate is the collection of the daily exchange rates of eight foreign countries (Australia, Britain, Canada, Switzerland, China, Japan, New Zealand and Singapore) ranging from 1990 to 2016.
In comparison with the baselines, Table 1A shows the results of the final iteration output 104 of the framework 100 and the loss function 114 (as Autoformer-MSA and Informer-MSA, each representing the dataset 90 processed using the iterations 101 described above) as compared with the baselines (entitled Autoformer and Informer). As shown, results using the mean squared error (MSE) and the mean absolute error (MAE) are presented. For a better comparison, each experiment is repeated 5 times and the average is reported. Our operation of the multiscale framework 100 improved on the baselines in almost all of the experiments, and in some cases, such as the exchange-rate dataset with Informer as the baseline, it achieves more than 50% improvement.
Table 1A: Comparison of the MSE and MAE results for the multi-scale framework 100 version of the Informer model 106 and the Autoformer model 106 with their original models as the baseline. Bold numbers indicate the better result between our framework 100 and the baseline version. See Table 1A below.
0.126  0.250  0.168  0.298
0.253  0.373  0.427  0.484
0.441  0.495  0.500  0.535
0.928  0.751  1.017  0.790
0.163  0.226  0.210  0.279
0.221  0.290  0.289  0.333
0.282  0.340  0.418  0.427
0.369  0.396  0.595  0.532
0.188  0.303  0.203  0.315
0.197  0.310  0.219  0.331
0.224  0.333  0.253  0.360
0.249  0.358  0.293  0.390
0.567  0.350  0.597  0.369
0.589  0.360  0.655  0.399
0.619  0.619  0.383  0.761  0.455
0.642  0.397  0.924  0.521
Table 1B is shown below, provided as further results of the method 200.
0.109  0.240  0.126  0.259  0.168  0.298  0.182  0.311  0.179  0.305
0.241  0.353  0.253  0.373  0.427  0.484  0.375  0.446  0.439  0.486
0.452  0.498  0.441  0.495  0.500  0.535  0.605  0.591  0.563  0.577
1.172  0.839  0.928  0.751  1.017  0.799  1.089  0.857  0.867  0.766
0.220  0.289  0.163  0.226  0.210  0.279  0.199  0.263  0.228  0.291
0.341  0.385  0.221  0.290  0.289  0.333  0.294  0.355  0.302  0.357
0.447  0.455  0.282  0.340  0.418  0.427  0.463  0.464  0.441  0.456
0.640  0.565  0.369  0.396  0.595  0.532  0.493  0.471  0.563  0.552
0.182  0.297  0.188  0.303  0.203  0.315  0.183  0.291  0.190  0.300
0.188  0.300  0.197  0.310  0.219  0.331  0.194  0.304  0.200  0.310
0.210  0.324  0.224  0.333  0.253  0.360  0.209  0.321  0.209  0.322
0.232  0.339  0.249  0.358  0.293  0.390  0.234  0.340  0.228  0.335
0.564  0.351  0.567  0.350  0.597  0.369  0.615  0.377  0.612  0.371
0.570  0.349  0.589  0.360  0.655  0.399  0.613  0.367  0.608  0.368
0.576  0.349  0.609  0.383  0.761  0.455  0.617  0.360  0.604  0.356
0.602  0.360  0.642  0.397  0.924  0.521  0.638  0.360  0.634  0.360
2.745  1.075  3.370  1.213  3.742  1.252  3.534  1.121  3.437  1.148
2.748  1.072  3.088  1.164  3.807  1.272  3.652  1.235  4.055  1.248
2.444  1.041  2.891  1.138  3.940  1.272  3.506  1.168  4.055  1.248
2.678  1.071  2.954  1.112  3.670  1.234  3.487  1.177  3.828  1.224
For example, Table 1B shows a comparison of the MSE and MAE results for our multi-scale framework 100 version of different methods (-MSA) with the respective baselines. Results are given in the multi-variate setting, for different lengths of the horizon window. The best results are shown in bold. Our method 200 can outperform the vanilla versions of the baselines over almost all datasets and settings. The average improvement (error reduction) with respect to the base models is shown in numbers at the bottom, recognizing that Table 1B shows
with ξ = (X_i^out − X_i^(H)) in step i. The parameters α and c, which modulate the sensitivity of the loss to outliers, are learnt in an end-to-end fashion during training. To the best of our knowledge, this is the first time this objective has been adapted to the context of time-series forecasting.
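For reference, the general robust loss published by Barron has the form f(ξ, α, c) = (|α − 2| / α) · (((ξ/c)^2 / |α − 2| + 1)^(α/2) − 1), with limiting cases at α = 2 and α = 0. The sketch below evaluates that published form for fixed α and c. It is an assumption that the loss function 114 uses exactly this expression, since the equation itself does not appear in the text above, and in the described embodiment α and c are learnt during training rather than fixed.

```python
import math


def barron_loss(xi, alpha, c):
    """Barron's general robust loss for one residual `xi`.

    alpha controls robustness to outliers; c > 0 is a scale parameter.
    The singular cases alpha == 2 and alpha == 0 use their known limits.
    """
    z = (xi / c) ** 2
    if alpha == 2.0:
        return 0.5 * z                      # quadratic (L2-like) limit
    if alpha == 0.0:
        return math.log(0.5 * z + 1.0)      # Cauchy-like limit
    b = abs(alpha - 2.0)
    return (b / alpha) * ((z / b + 1.0) ** (alpha / 2.0) - 1.0)


# The same residual is penalized less as alpha decreases.
for alpha in (2.0, 1.0, 0.0):
    print(alpha, round(barron_loss(10.0, alpha, 1.0), 4))
```

With α = 2 the loss behaves like a squared error, while α = 1 gives sqrt((ξ/c)^2 + 1) − 1, a smooth Huber-like penalty that grows only linearly for large residuals.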
Provided below is an overview of an example embodiment of the iterative model (see the Appendix for further details on the equations used for the model 106 operated by the network/framework 100). Given the lookback window of the input series χ_L = {x_1^t, ..., x_L^t | x_i^t ϵ}, the goal is to predict the horizon window χ_H = {x_{L+1}^t, ..., x_{L+H}^t | x_i^t ϵ} (as the output 104), in which L and H are respectively the lengths of the lookback window and the horizon window, as provided by the upsampling function 110 for each iteration 101 and by the pooling function 108 for the initial input 102. Following previous works, for passing the input 102, 105 to the transformer model 106, we consider χ_enc = {x_1^t, ..., x_L^t} as the input to the encoder 106a, and we pass half of the observations padded with zeros to form χ_dec = {x_1^t, ..., x_{L/2}^t, 0, 0, ..., 0} as the input to the decoder 106b, with a length of L/2+H. As such, the framework 100 applies successive transformer modules to iteratively refine a time-series forecast at different temporal scales.
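The construction of the encoder and decoder inputs described above can be sketched as follows for a univariate series. The function name is hypothetical, and the sketch follows the indices as written in the text (the first L/2 observations followed by H zeros).

```python
def build_model_inputs(lookback, horizon_len):
    """Return (x_enc, x_dec) for a lookback window of length L.

    x_enc is the full lookback window; x_dec is half of the observations
    padded with `horizon_len` zeros, so its length is L/2 + H.
    """
    half = len(lookback) // 2
    x_enc = list(lookback)
    x_dec = list(lookback[:half]) + [0.0] * horizon_len
    return x_enc, x_dec


x_enc, x_dec = build_model_inputs([1.0, 2.0, 3.0, 4.0], horizon_len=3)
print(x_enc)
print(x_dec)
```

The zero entries are placeholders for the horizon window that the decoder 106b is asked to fill in.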
While current state of the art methods are all focused on improving the performance and efficiency of the attention modules, one missing direction now provided by the network 100 and associated transformer model 106 is instead improving the flexibility of the model 106 in a model-agnostic way, such that successive upsampled outputs 105 are iterated 101 using the same model 106. In other words, the same model 106 is used for each of the different upsampled outputs 105, as well as for the original input 102. As such, the network 100 uses the same model 106 to predict the output 104 at different scales (such that the original input 102 and each successive output 105 are provided at increasing scale resolutions). In other words, the resolution of the output 104 for the first iteration 101 of the model 106 is the lowest resolution (e.g. 96), the next iteration 101 of the model 106 uses the output 105 upsampled to the next higher resolution (e.g. 192), the next iteration 101 of the model 106 uses the output 105 upsampled to the next higher resolution (e.g. 336), and the further iterations 101 continue to be upsampled until the final resolution of the original dataset 90 (e.g. 720) is reached.
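The coarse-to-fine reuse of one model across scales can be sketched as a short loop. This is a minimal sketch under stated assumptions: `toy_model` is a hypothetical stand-in for the model 106 (it merely blends the incoming forecast with the last lookback value), the downsampling factors (4, 2, 1) are illustrative, and average pooling with linear interpolation is assumed for the pooling function 108 and the upsampling function 110.

```python
def average_pool(series, factor):
    """Downsample by averaging non-overlapping windows of `factor` values."""
    return [sum(series[i:i + factor]) / len(series[i:i + factor])
            for i in range(0, len(series), factor)]


def upsample_linear(series, target_len):
    """Linearly interpolate `series` onto `target_len` evenly spaced points."""
    n = len(series)
    if n == 1 or target_len == 1:
        return [float(series[0])] * target_len
    out = []
    for j in range(target_len):
        pos = j * (n - 1) / (target_len - 1)
        lo = int(pos)
        hi = min(lo + 1, n - 1)
        frac = pos - lo
        out.append(series[lo] * (1.0 - frac) + series[hi] * frac)
    return out


def toy_model(lookback, initial_forecast):
    """Hypothetical stand-in for the forecasting model (not a transformer)."""
    last = lookback[-1]
    return [0.5 * f + 0.5 * last for f in initial_forecast]


def multiscale_forecast(lookback, horizon_len, factors=(4, 2, 1), model=toy_model):
    """Apply the same model at increasing resolutions, refining its own output."""
    forecast = None
    outputs = []
    for factor in factors:
        coarse_lookback = average_pool(lookback, factor)
        coarse_h = max(1, horizon_len // factor)
        if forecast is None:
            initial = [0.0] * coarse_h            # first iteration: zero-filled
        else:
            initial = upsample_linear(forecast, coarse_h)
        forecast = model(coarse_lookback, initial)
        outputs.append(forecast)
    return outputs                                # predictions at all scales


for out in multiscale_forecast([1.0] * 16, horizon_len=8):
    print(len(out), out[0])
```

The last element of the returned list is the forecast at the finest resolution; the earlier elements are the intermediate, lower-resolution forecasts that seeded it.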
The framework 100 is shown in
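Given a set of input series (X_{enc,s_i}, X_{dec,s_i}):
\hat{X}_{s_i} = \mathrm{Avg}(X_{enc,s_i}, X_{dec,s_i}) (1a)
X_{dec,s_i} = X_{dec,s_i} - \hat{X}_{s_i} (2a)
X_{enc,s_i} = X_{enc,s_i} - \hat{X}_{s_i} (3a)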
In the above equations, \hat{X}_{s_i} \in R^d is the average over the temporal dimension of the whole series, including the concatenation of both the lookback window (of the upsampling function 110) and the horizon window (of the pooling function 108) lengths. A more detailed explanation and equations of the optional normalization process can be found in the Appendix, including the cross-scale normalization.
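The normalization described above, in which a single temporal mean is computed over the concatenated encoder and decoder inputs and subtracted from both, can be sketched as follows. The function name and the list-of-lists data layout (one list of d feature values per time step) are assumptions made for illustration.

```python
def cross_scale_normalize(x_enc, x_dec):
    """Subtract the shared temporal mean from the encoder and decoder inputs.

    x_enc, x_dec: lists of time steps, each a list of d feature values.
    Returns the normalized inputs and the per-feature mean (length d).
    """
    steps = x_enc + x_dec
    d = len(steps[0])
    mean = [sum(step[j] for step in steps) / len(steps) for j in range(d)]
    enc_n = [[v - m for v, m in zip(step, mean)] for step in x_enc]
    dec_n = [[v - m for v, m in zip(step, mean)] for step in x_dec]
    return enc_n, dec_n, mean


enc_n, dec_n, mean = cross_scale_normalize([[1.0], [3.0]], [[5.0], [7.0]])
print(mean, enc_n, dec_n)
```

Returning the mean alongside the normalized inputs allows it to be added back to the model output afterwards, if that is desired at a given scale.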
Referring to
To verify the performance improvement of both the framework 100 and the adaptive loss function 114, we trained the models 106 on all four combinations of baselines with and without multi-scale resolution iteration 101 (as discussed above) and using MSE loss or Adaptive loss functions 114 for training. Table 2 and Table 3 show the effect of multi-scale MS and loss function 114 respectively for Informer and Autoformer models 106 using the framework 100.
0.210 ± 0.016   0.279 ± 0.016
0.289 ± 0.011   0.333 ± 0.005
0.418 ± 0.039   0.427 ± 0.028
0.595 ± 0.043   0.532 ± 0.024
0.203 ± 0.011   0.315 ± 0.011
0.219 ± 0.002   0.331 ± 0.003
0.253 ± 0.008   0.360 ± 0.007
0.293 ± 0.006   0.390 ± 0.006
0.163 ± 0.008   0.226 ± 0.011
0.221 ± 0.009   0.290 ± 0.016
0.282 ± 0.024   0.340 ± 0.025
0.369 ± 0.041   0.396 ± 0.032
0.188 ± 0.004   0.303 ± 0.005
0.197 ± 0.003   0.310 ± 0.003
0.220 ± 0.003   0.333 ± 0.012
0.249 ± 0.009   0.358 ± 0.007
Table 2A shows Multi-scale framework without cross-scale normalization. Correctly normalizing across different scales (as per our cross-mean normalization) can be used to obtain improved performance when using the multi-scale framework 100.
0.288  0.342  0.191  0.277  0.388  0.435
0.368  0.422  0.281  0.360  0.393  0.434
0.447  0.469  0.364  0.397  0.566  0.528
0.640  0.574  0.425  0.434  0.978  8.723
0.201  0.317  0.197  0.312  0.344  8.421
0.200  0.314  0.219  0.329  0.344  8.426
0.214  0.330  0.263  0.359  0.358  0.440
0.239  0.350  0.290  0.380  0.386  0.452
Table 3A shows a single-scale framework with cross scale normalization “-N”. The cross-scale normalization (which in the single-scale case corresponds to mean-normalization of the output) does not improve the performance of the Autoformer, as it already has an internal trend-cycle normalization component. However, it does improve the results of the Informer and FEDformer.
0.234  0.292  0.267  0.334  0.253  0.333
0.287  0.337  0.323  0.376  0.357  0.408
0.436  0.443  0.364  0.397  0.459  0.461
0.545  0.504  0.425  0.434  0.870  0.676
0.194  0.307  0.197  0.312  0.247  0.356
0.195  0.304  0.219  0.329  0.291  0.394
0.200  0.310  0.263  0.359  0.321  0.416
0.225  0.332  0.380  0.280  0.362  0.434
Referring to
An example Algorithm 1 of the method 200 can be as follows, using the equations provided in the Appendix for example.
Equation (1) and (2) of the paper
Equation (5) and (6) of the paper
return the prediction at all scales
As provided above with respect to the Tables and plots, the example datasets 90 used included four public datasets with different characteristics to evaluate the framework 100. Electricity Consuming Load (ECL) corresponds to the electricity consumption (kWh) of 321 clients. Traffic aggregates the hourly occupancy rate of 963 car lanes of San Francisco bay area freeways. Weather contains 21 meteorological indicators, such as air temperature and humidity, recorded every 10 minutes for the entirety of 2020. Exchange-Rate collects the daily exchange rates of 8 countries (Australia, Britain, Canada, Switzerland, China, Japan, New Zealand and Singapore) from 1990 to 2016. National Illness (ILI) corresponds to the weekly recorded influenza-like illness patients from the US Center for Disease Control and Prevention. We consider horizon lengths of 24, 32, 48, and 64 with an input length of 32.
An example computer system, for implementing the framework 100 and method 200, in respect of which the technology herein described can be implemented is presented as a block diagram in
The computer 406 may contain one or more processors or microprocessors for implementing the method 200 of the framework 100, such as a central processing unit (CPU) 410. The CPU 410 performs arithmetic calculations and control functions to execute software stored in a non-transitory internal memory 412, preferably random access memory (RAM) and/or read only memory (ROM), and possibly additional memory 414. The additional memory 414 is non-transitory and may include, for example, mass memory storage, hard disk drives, optical disk drives (including CD and DVD drives), magnetic disk drives, magnetic tape drives (including LTO, DLT, DAT and DCC), flash drives, program cartridges and cartridge interfaces such as those found in video game devices, removable memory chips such as EPROM or PROM, emerging storage media, such as holographic storage, or similar storage media as known in the art. This additional memory 414 may be physically internal to the computer 406, or external as shown in
The one or more processors or microprocessors may comprise any suitable processing unit such as an artificial intelligence (AI) accelerator, a programmable logic controller, a microcontroller (which comprises both a processing unit and a non-transitory computer readable medium), or a system-on-a-chip (SoC). As an alternative to an implementation that relies on processor-executed computer program code, a hardware-based implementation may be used. For example, an application-specific integrated circuit (ASIC), field programmable gate array (FPGA), or other suitable type of hardware implementation may be used as an alternative to or to supplement an implementation that relies primarily on a processor executing computer program code stored on a computer medium.
Any one or more of the methods described above may be implemented as computer program code and stored in the internal and/or additional memory 414 for execution by the one or more processors or microprocessors to effect neural network pre-training, training, or use of a trained network for inference.
The computer system 400 may also include other similar means for allowing computer programs or other instructions to be loaded (e.g. the model 106 and associated method 200 instructions). Such means can include, for example, a communications interface 416 which allows software and data to be transferred between the computer system 400 and external systems and networks. Examples of communications interface 416 can include a modem, a network interface such as an Ethernet card, a wireless communication interface, or a serial or parallel communications port. Software and data transferred via communications interface 416 are in the form of signals which can be electronic, acoustic, electromagnetic, optical or other signals capable of being received by communications interface 416. Multiple interfaces, of course, can be provided on a single computer system 400.
Input and output to and from the computer 406 is administered by the input/output (I/O) interface 418. This I/O interface 418 administers control of the display 402, keyboard 404A, external devices 408 and other such components of the computer system 400. The computer 406 also includes a graphics processing unit (GPU) 420. The latter may also be used for computational purposes, such as mathematical calculations, as an adjunct to, or instead of, the CPU 410.
The external devices 408 include a microphone 426, a speaker 428 and a camera 430. Although shown as external devices, they may alternatively be built in as part of the hardware of the computer system 400. For example, the camera 430 and microphone 426 may be used to retrieve multi-modal video content for use to train the network 100 and/or method 200, or for processing by a trained network 100 or trained method 200.
The various components of the computer system 400 are coupled to one another either directly or by coupling to suitable buses. The terms “computer system”, “data processing system” and related terms, as used herein, are not limited to any particular type of computer system and encompass servers, desktop computers, laptop computers, networked mobile wireless telecommunication computing devices such as smartphones, tablet computers, as well as other types of computer systems.
In view of the above, it is recognized that the example network 100 and associated method 200 provide the following: (1) a novel iterative scale-refinement paradigm that can be readily adapted to a variety of encoder-based (e.g. transformer) time series forecasting architectures; (2) minimization of potential distribution shifts between scales and windows by introducing cross-scale normalization on outputs of the model 106 at one or more of the iterative steps/scales; (3) an empirical demonstration, using Informer and AutoFormer, two state-of-the-art transformer architectures, as backbones, of the effectiveness of the method 200 on a variety of datasets, in which, depending on the choice of model 106 architecture, the multi-scale framework 100 can result in mean squared error reductions ranging from 5.5% to 38.5%; and (4) a detailed ablation study of our findings, which demonstrates the validity of our architectural and methodological choices.
The above presented framework 100 and method 200 have been shown to be beneficial when applied to transformer-based, deterministic time series forecasting. However, the framework 100 and method 200 are not limited to those settings; rather, the framework 100 and method 200 can be extended to probabilistic forecasting and non-transformer-based encoders 106a,b, both of which are closely coupled with our primary application. It is recognized that what is common to the various described applications of the framework 100 and method 200 is that the forecasting model 106 (e.g. transformer-based, non-transformer-based, etc.) uses an encoder-decoder architecture.
For example, we show that our above presented framework 100 and method 200, using multiscale encoding 106a,b, can improve performance in a probabilistic forecasting setting (please refer to Tables 4, 5 below for example results). We adopt the probabilistic output of DeepAR (Salinas et al., 2020), which is the most common probabilistic forecasting treatment. In this setting, instead of a point estimate, the model 106 has two prediction heads, predicting the mean and standard deviation, trained with a negative log likelihood (NLL) loss. NLL and continuous ranked probability score (CRPS) are used as evaluation metrics. All other hyperparameters remain unchanged. Here, again, the framework 100 and method 200 applied to the non-transformer-based models 106 continue to outperform the probabilistic Informer.
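The two evaluation metrics named above can be illustrated with a minimal sketch. It assumes that the mean and standard deviation heads parameterize a Gaussian predictive distribution, which the text does not state explicitly; the function names `gaussian_nll` and `gaussian_crps` are hypothetical and are not taken from the disclosure. The CRPS uses the standard closed form for a Gaussian.

```python
import math


def gaussian_nll(y, mu, sigma):
    """Negative log likelihood of observation y under N(mu, sigma^2)."""
    return 0.5 * math.log(2.0 * math.pi * sigma ** 2) + (y - mu) ** 2 / (2.0 * sigma ** 2)


def gaussian_crps(y, mu, sigma):
    """Closed-form CRPS of a Gaussian forecast N(mu, sigma^2) at observation y."""
    z = (y - mu) / sigma
    pdf = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)   # standard normal density
    cdf = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))          # standard normal CDF
    return sigma * (z * (2.0 * cdf - 1.0) + 2.0 * pdf - 1.0 / math.sqrt(math.pi))


# Assumed example values: a forecast centred on the observation scores better
# (lower) on both metrics than one that misses by three standard deviations.
nll_hit, nll_miss = gaussian_nll(0.0, 0.0, 1.0), gaussian_nll(3.0, 0.0, 1.0)
crps_hit, crps_miss = gaussian_crps(0.0, 0.0, 1.0), gaussian_crps(3.0, 0.0, 1.0)
```

Both metrics are lower for better forecasts; in practice they would be averaged over all time steps and series of the horizon window.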
While we have mainly focused on improving transformer-based models 106, they are not the only encoders 106a,b. Recent models such as NHits (Challu et al., 2022) and FiLM (Zhou et al., 2022a) attain competitive performance while assuming a fixed-length univariate input/output. They can be less flexible than models that accept variable-length, multivariate input/output, but they result in strong performance and faster inference than transformers, making them interesting to consider. The application of the framework 100 and method 200 demonstrates a statistically significant improvement, on average, when adopted by NHits and FiLM based models 106 to iteratively refine predictions.
The results mentioned above demonstrate that framework 100 and method 200 can adapt to settings distinct from point-wise time-series forecasts with transformers, such as probabilistic forecasts and non-transformer models.
Table 4 shows the comparison of probabilistic methods for Informer by following the probabilistic output of DeepAR (Salinas et al., 2020), which is the most common probabilistic forecasting treatment.
0.202 ± 0.01
0.452 ± 0.0
0.284 ± 0.02
0.818 ± 0.
0.414 ± 0.06
1.724 ± 0.43
0.570 ± 0.03
2.210 ± 0.21
0.250 ± 0.02
0.392 ± 0.
0.294 ± 0.01
0.610 ± 0.04
0.308 ± 0.02
0.728 ± 0.
0.438 ± 0.04
1.270 ± 0.14
0.238 ± 0.01
0.578 ± 0.01
0.290 ±
0.776 ± 0.01
0.324 ± 0.03
0.904 ± 0.10
0.358 ± 0.01
1.022 ± 0.04
0.288 ± 0.01
1.094 ± 0.0
0.312 ± 0.01
1.102 ± 0.04
0.368 ± 0.02
1.194 ± 0.05
0.442 ± 0.02
1.378 ± 0.06
indicates data missing or illegible when filed
Table 5 shows the comparison results of NHits (Challu et al., 2022) and FiLM (Zhou et al., 2022a) as two baselines. For each method, we copy the original model to obtain one model per scale, and we concatenate the input with the output of the previous scale to form the input for the new scale. The training hyperparameters, such as the optimizer and learning rate, are the same as for the previous baselines. The table shows the effect of applying our proposed framework to NHits and FiLM as two non-transformer-based models. Best results are shown in bold.
0.218 ± 0.01
0.087 ± 0.0
0.206 ± 0.00
0.081 ± 0.00
0.197 ± 0.00
0.332 ± 0.01
0.186 ± 0.01
0.306 ± 0.00
0.156 ± 0.00
0.284 ± 0.00
0.347 ± 0.03
0.442 ± 0.02
0.253 ± 0.00
0.378 ± 0.0
0.761 ± 0.20
0.662 ± 0.0
0.728 ± 0.01
0.659 ± 0.00
0.167 ± 0.00
0.211 ± 0.00
0.194 ± 0.00
0.232 ± 0.00
0.208 ± 0.00
0.253 ± 0.00
0.235 ± 0.00
0.269 ± 0.00
0.261 ± 0.00
0.261 ± 0.00
0.294 ± 0.00
0.275 ± 0.00
0.303 ± 0.00
0.331 ± 0.00
0.348 ± 0.00
0.350 ± 0.00
0.337 ± 0.00
indicates data missing or illegible when filed
The embodiments have been described above with reference to flow, sequence, and block diagrams of methods, apparatuses, systems, and computer program products. In this regard, the depicted flow, sequence, and block diagrams illustrate the architecture, functionality, and operation of implementations of various embodiments. For instance, each block of the flow and block diagrams and operation in the sequence diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified action(s). In some alternative embodiments, the action(s) noted in that block or operation may occur out of the order noted in those figures. For example, two blocks or operations shown in succession may, in some embodiments, be executed substantially concurrently, or the blocks or operations may sometimes be executed in the reverse order, depending upon the functionality involved. Some specific examples of the foregoing have been noted above but those noted examples are not necessarily the only examples. Each block of the flow and block diagrams and operation of the sequence diagrams, and combinations of those blocks and operations, may be implemented by special purpose hardware-based systems that perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting. Accordingly, as used herein, the singular forms “a”, “an”, and “the” are intended to include the plural forms as well, unless the context clearly indicates otherwise (e.g., a reference in the claims to “a challenge” or “the challenge” does not exclude embodiments in which multiple challenges are used). It will be further understood that the terms “comprises” and “comprising”, when used in this specification, specify the presence of one or more stated features, integers, steps, operations, elements, and components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and groups. Directional terms such as “top”, “bottom”, “upwards”, “downwards”, “vertically”, and “laterally” are used in the following description for the purpose of providing relative reference only, and are not intended to suggest any limitations on how any article is to be positioned during use, or to be mounted in an assembly or relative to an environment. Additionally, the term “connect” and variants of it such as “connected”, “connects”, and “connecting” as used in this description are intended to include indirect and direct connections unless otherwise indicated. For example, if a first device is connected to a second device, that coupling may be through a direct connection or through an indirect connection via other devices and connections. Similarly, if the first device is communicatively connected to the second device, communication may be through a direct connection or through an indirect connection via other devices and connections. The term “and/or” as used herein in conjunction with a list means any one or more items from that list. For example, “A, B, and/or C” means “any one or more of A, B, and C”.
It is contemplated that any part of any aspect or embodiment discussed in this specification can be implemented or combined with any part of any other aspect or embodiment discussed in this specification.
The scope of the claims should not be limited by the embodiments set forth in the above examples, but should be given the broadest interpretation consistent with the description as a whole.
It should be recognized that features and aspects of the various examples provided above can be combined into further examples that also fall within the scope of the present disclosure. In addition, the figures are not to scale and may have size and shape exaggerated for illustrative purposes.
We denote by $X^{(L)}$ and $X^{(H)}$ the look-back and horizon windows for the forecast, respectively, of corresponding lengths $L$ and $H$. Given a starting time $t_0$, we can express these time series of dimension $d_x$ as follows: $X^{(L)} = \{x_t \mid x_t \in \mathbb{R}^{d_x},\ t \in [t_0, t_0+L]\}$ and $X^{(H)} = \{x_t \mid x_t \in \mathbb{R}^{d_x},\ t \in [t_0+L+1, t_0+L+H]\}$. The goal of the forecasting task is to predict the horizon window $X^{(H)}$ given the look-back window $X^{(L)}$.
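As an illustration of the window definitions above, the following minimal sketch slices a multivariate series into a look-back window and a horizon window. The helper name `split_windows` and the example array are hypothetical. The sketch assumes zero-based, half-open indexing so that the two windows have exactly L and H time steps, which differs slightly from the closed intervals written above.

```python
import numpy as np


def split_windows(series, t0, L, H):
    """Split a (T, d_x) array into a look-back window and a horizon window.

    Zero-based, half-open slices are an assumption of this sketch:
    look-back = series[t0 : t0 + L], horizon = series[t0 + L : t0 + L + H].
    """
    look_back = series[t0:t0 + L]
    horizon = series[t0 + L:t0 + L + H]
    if look_back.shape[0] != L or horizon.shape[0] != H:
        raise ValueError("series is too short for the requested windows")
    return look_back, horizon


# Assumed toy series with T = 10 time steps and d_x = 2 variables.
series = np.arange(20, dtype=float).reshape(10, 2)
x_lookback, x_horizon = split_windows(series, t0=1, L=4, H=3)
```

A forecasting model would be given `x_lookback` as input and trained to reproduce `x_horizon`.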
Given an input time series $X^{(L)}$, we iteratively apply the same neural module multiple times at different temporal scales. Concretely, we consider a set of scales $S = \{s^m, \ldots, s^2, s^1, 1\}$ (i.e. for the default scale of $s = 2$, $S$ is a set of consecutive powers of 2), where $m = \lfloor \log_s L \rfloor - 1$ and $s$ is a downscaling factor. The input to the encoder at the $i$-th step ($0 \le i \le m$) is the original look-back window $X^{(L)}$, downsampled by a scale factor of $s_i \equiv s^{m-i}$ via an average pooling operation. The input to the decoder, on the other hand, is $X^{out}_{i-1}$ upsampled by a factor of $s$ via a linear interpolation.
Finally, $X^{dec}_0$ is initialized to an array of 0s. The model performs the following operations:
where $X^{(L)}_i$ and $X^{(H)}_i$ are the look-back and horizon windows at the $i$-th step at time $t$ with the scale factor of $s^{m-i}$ and with the lengths of $L_i$ and $H_i$, respectively. Assuming $x'_{t,i-1}$ is the output of the forecasting module at step $i-1$ and time $t$, we can define $X^{enc}_i$ and $X^{dec}_i$ as the inputs to the normalization:
Finally, we calculate the error between $X^{(H)}_i$ and $X^{out}_i$ as the loss function to train the model. Please refer to Algorithm 1 for details on the sequence of operations performed during the forward pass.
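The coarse-to-fine iteration described above can be sketched as follows. This is an illustrative outline under stated assumptions, not the implementation of Algorithm 1: the helper names (`num_scales`, `avg_pool`, `upsample_linear`, `multiscale_forecast`, `repeat_last`) are hypothetical, the forecasting module is replaced by a trivial stand-in that repeats the last encoder value, the normalization and per-scale loss are omitted, and the horizon length at scale $s_i$ is assumed to be the horizon length divided by $s_i$.

```python
import numpy as np


def num_scales(L, s):
    """Return m = floor(log_s L) - 1 using integer arithmetic."""
    m = 0
    while s ** (m + 1) <= L:
        m += 1
    return m - 1


def avg_pool(x, factor):
    """Downsample a (T, d) array in time by averaging non-overlapping blocks."""
    T = (x.shape[0] // factor) * factor  # drop any ragged tail
    return x[:T].reshape(-1, factor, x.shape[1]).mean(axis=1)


def upsample_linear(x, new_len):
    """Linearly interpolate a (T, d) array in time to new_len steps."""
    old = np.linspace(0.0, 1.0, x.shape[0])
    new = np.linspace(0.0, 1.0, new_len)
    return np.stack([np.interp(new, old, x[:, j]) for j in range(x.shape[1])], axis=1)


def multiscale_forecast(x_lookback, horizon, forecaster, s=2):
    """Apply the same forecaster from the coarsest scale s^m down to scale 1."""
    L, d = x_lookback.shape
    m = num_scales(L, s)
    if m < 0:
        raise ValueError("look-back window is shorter than the scale factor")
    x_dec = np.zeros((max(horizon // s ** m, 1), d))  # decoder input starts as zeros
    x_out = None
    for i in range(m + 1):
        s_i = s ** (m - i)
        x_enc = avg_pool(x_lookback, s_i)             # encoder input at scale s_i
        if x_out is not None:                         # refine the previous forecast
            x_dec = upsample_linear(x_out, max(horizon // s_i, 1))
        x_out = forecaster(x_enc, x_dec)
    return x_out


def repeat_last(x_enc, x_dec):
    """Stand-in forecaster (assumption): repeat the last encoder value."""
    return np.repeat(x_enc[-1:], x_dec.shape[0], axis=0)


x = np.arange(64, dtype=float).reshape(32, 2)   # L = 32, d_x = 2
out = multiscale_forecast(x, horizon=32, forecaster=repeat_last)
```

With L = 32 and s = 2 the loop visits the scales 16, 8, 4, 2 and 1, so the encoder sees windows of length 2, 4, 8, 16 and 32 while the decoder input is progressively upsampled to the full horizon.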
Given a set of input series (Xienc, Xidec), with dimensions L
where
Following the previous works, we embed our input to have the same number of features as the hidden dimension of the model. The embedding consists of three parts: (1) Value embedding, which uses a linear layer to map the input observations of each step $x_t$ to the same dimension as the model. We further concatenate an additional value 0, 0.5, or 1, respectively showing if each observation is coming from the look-back window, zero initialization, or the prediction of the previous steps. (2) Temporal embedding, which again uses a linear layer to embed the time stamp related to each observation to the hidden dimension of the model. Here we concatenate an additional value $1/s_i - 0.5$ as the current scale for the network before passing to the linear layer. (3) We also use a fixed positional embedding which is adapted to the different scales $s_i$ as follows:
Using the standard MSE objective to train time-series forecasting models leaves them sensitive to outliers. One possible solution is to use objectives more robust to outliers, such as the Huber loss (Huber, 1964). However, when there are no major outliers, such objectives tend to underperform. Given the heterogeneous nature of the data, we instead utilize the adaptive loss (Barron, 2019):
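For orientation, the general robust loss introduced by Barron (2019) can be sketched as below. This reflects the commonly cited form of that loss with shape parameter `alpha` and scale parameter `c`, and is offered as an assumption about the loss referenced above; the function name `general_robust_loss` is hypothetical. In the adaptive variant, `alpha` and `c` are learned jointly with the model, whereas this sketch keeps them fixed and does not handle the limit of `alpha` tending to negative infinity.

```python
import numpy as np


def general_robust_loss(residual, alpha, c):
    """General robust loss of Barron (2019) for fixed shape alpha and scale c."""
    sq = (np.asarray(residual, dtype=float) / c) ** 2
    if alpha == 2.0:   # removable singularity: scaled squared error
        return 0.5 * sq
    if alpha == 0.0:   # removable singularity: Cauchy (log) loss
        return np.log(0.5 * sq + 1.0)
    b = abs(alpha - 2.0)
    return (b / alpha) * ((sq / b + 1.0) ** (alpha / 2.0) - 1.0)


# Assumed example residuals: the last one plays the role of an outlier.
residuals = np.array([0.0, 1.0, 3.0])
loss_l2 = general_robust_loss(residuals, alpha=2.0, c=1.0)       # squared-error behaviour
loss_smooth = general_robust_loss(residuals, alpha=1.0, c=1.0)   # smoothed L1 behaviour
```

Lower values of `alpha` reduce the penalty on large residuals, which is why letting the model adapt `alpha` can trade off robustness to outliers against performance on clean data.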
Implementation details: Following previous work (Xu et al., 2021; Zhou et al., 2021), we pass $X^{enc} = X^{(L)}$ as the input to the encoder. While an array of zero-values would be the default to pass to the decoder, the decoder instead takes as input the second half of the look-back window padded with zeros, $X^{dec} = \{x_{L/2}, \ldots, x_L, 0, 0, \ldots, 0\}$, with length $L/2 + H$. The hidden dimension of models is 512 with a batch size of 32. We use the Adam optimizer with a learning rate of 1e-4. The look-back window size is fixed to 96, and the horizon is varied from 96 to 720. We repeat each experiment 5 times and report average values to reduce randomness. For additional implementation
Number | Date | Country | |
---|---|---|---|
63342399 | May 2022 | US |