This application claims the benefit of priority to Indian Patent Application No. 202311079656, filed Nov. 23, 2023, the entire content of which is incorporated herein by reference.
The present invention relates to a computer-implemented method for training a neural network to forecast multivariate data in a forecast location in a network and further relates to a computer-implemented method for forecasting multivariate data in a forecast location in a network.
The rise of smart devices and Internet of Things (IoT) applications has resulted in a significant increase in mobile data traffic. To meet the growing demands and enhance user experience, cellular networks require infrastructure upgrades. The advent of 5G technology addresses these needs by offering high-speed data transfer with low latency. With the expansion of the 5G network, optimizing the utilization of network resources has become increasingly important. Accurately predicting network traffic and performance is vital for improving the efficiency of network operators, service providers, and equipment manufacturers. However, forecasting network demand is challenging due to its non-stationary characteristics influenced by multiple factors or variables such as user mobility patterns, weather, and social events.
In general, forecasting of multivariate (i.e. multiple variable) data in a computer or telecommunications network can also be complicated by missing data from a time series sequence of data, when one or more, or even all of the values for variables are missing at some time points in the time series.
It is therefore desirable to improve the accuracy of forecasting multivariate data using neural networks. It is also desirable to enhance the accuracy of time series forecasting models when faced with data insufficiency caused by irregular, missing records, without relying on data imputation techniques.
The invention is defined in the independent claims, to which reference should now be made. Further features are set out in the dependent claims.
According to an aspect of the invention there is provided a computer-implemented method for training a neural network to forecast multivariate data in a forecast location in a network. The method comprises inputting a dataset from the forecast location and one or more adjacent locations, the dataset comprising spatio-temporal characteristics of each location and multivariate data recorded at each location; determining a longest time-series sequence length of the dataset for which an occurrence frequency of the longest time-series sequence length appearing in the dataset is higher than a threshold number of the dataset, a time-series sequence length indicating a total length of consecutive time steps with complete data in the dataset; training a forecast location neural network based on the determined longest time-series sequence length to encode the multivariate data from the forecast location into a forecast location vector; training, for each of the one or more adjacent locations, an adjacent location neural network based on the determined longest time-series sequence length to encode the multivariate data from each of the one or more adjacent locations into an adjacent location vector; combining the one or more adjacent location vectors into a combined adjacent location vector; composing the forecast location vector and combined adjacent location vector into a final combined vector; and decoding the final combined vector to generate a forecast from the dataset for the forecast location.
Reference is made, by way of example only, to the accompanying drawings in which:
Supervised deep learning (DL) models like Long Short-Term Memory (LSTM) neural networks and combined Convolutional Neural Networks (CNNs) and LSTMs (called Conv-LSTM) have proven effective for time series forecasting and have been applied to network traffic forecasting. Another popular model for forecasting is the transformer. These models can detect periodicity and seasonality in network traffic data, leveraging temporal patterns for predictions. However, they do not account for the interdependencies between adjacent network cells, which can influence the traffic of the cell under consideration. Additionally, the accuracy of these models relies on abundant historical network traffic data for training, making them vulnerable to the problem of missing data. Traditional methods like imputation with mean or median are ineffective for non-stationary data. Consequently, addressing extensive missing data in a non-stationary context poses a significant challenge when using supervised data imputation techniques.
Forecasting network usage demands is important for network performance analysis and resource allocation. Effective forecasting often requires historic records of long time-series sequences (data from a sequence of consecutive time points). In practice, obtaining longer time sequences is difficult, and there will likely always be missing data in the time series records. These data gaps are primarily caused by user mobility, faulty sensors (for example, faulty or missing cell tower location signals or an imprecise GPS sensor on the mobile device carried by the user), or intermittent network outages. Known models may not be trainable on the longer time series due to the missing data, and hence forecasting accuracy drops. Traditional data imputation techniques, such as mean or median imputation, are commonly employed to handle missing data in time series sequences. However, the inventors found that these techniques are not effective when dealing with 5G data.
Disclosed herein is a proposed novel forecasting model for multi-step, multi-variate, and spatiotemporal time series analysis. In order to address the issue of missing data, the inventors employed a dynamic approach to select representative time step sequences from the dataset for modelling purposes. Specifically, in an example, a Longest Common Continuous Frequent Sequences (LCCFS) algorithm is introduced to dynamically identify the most suitable length of time series sequences for training the model. Also introduced is a method to incorporate spatial and local features by considering the influence of neighbouring cells on the target cell, improving prediction accuracy in complex scenarios. In an example, the inventors' approach utilizes an encoder stack of Bidirectional-LSTM (BiLSTM) networks to capture the impact of changes in neighbouring cells on the target cell. The BiLSTM outputs may be concatenated and a self-attention module may be employed to assess the influence of neighbouring cells. Furthermore, in an example, to achieve multi-step and multi-variate forecasting, a RepeatVector, available at Keras Team, "Keras Documentation: RepeatVector Layer," https://keras.io/api/layers/reshaping_layers/repeat_vector/, accessed on May 6, 2023, a BiLSTM network, and a TimeDistributed dense layer (TDL), available at "Keras Documentation: TimeDistributed Layer," https://keras.io/api/layers/recurrent_layers/time_distributed/, accessed on May 6, 2023, are used.
The dataset table in
The known technique uses an unsupervised method consisting of common statistical (e.g., mean, median, etc.) or interpolation strategies or unsupervised learning to impute missing data in a dataset. As shown in
The inventors plotted the results of the known data imputation techniques for forecasting the multivariate data in a graph 120, titled "Data imputation Performance Average-MAE by taking 10 samples of size 1000 each". The inventors used mean absolute error (MAE) as an indicator of the accuracy of the data imputation technique. The graph shows each of the data imputation techniques above for varying percentages of missing data. The graph shows that, generally, last valid observation and next valid observation imputation provided the lowest MAE and mode imputation generated the highest MAE. Furthermore, the inventors found that for all imputation techniques the MAE increased as the percentage of missing data increased.
The accuracy of models using data imputation techniques is therefore heavily reliant on abundant historical network traffic data for training, making them vulnerable to the problem of missing data. As the percentage of missing data increases, the error in the data imputation also increases, leading to inaccurate forecasts. That is, the data imputation error increases to a very high level as the percentage of noisy data increases. The inventors found that the imputation techniques are not the best proxy for missing data and that they add noisy information to the existing data. Hence, using complete data with all missing values filled with imputed data (but with noisy entries added by the imputation technique) may lead to poor performance in multivariate, multi-step time series forecasting models.
Supervised data imputation techniques require labelled training data for learning. Labelled training data may be difficult to prepare for many data points and often requires human input for verification. Hence, in some instances labelled training data may not be available. Even when using labelled training data, the inventors identified that the missing data imputed using the supervised models would introduce noise into the system and would show a very high error rate with an increase in the percentage of missing data. As a result, multi-variate, multi-step time series forecasting using the imputed data would perform poorly.
In the realm of 5G forecasting using deep learning (DL), researchers have explored various approaches to improve prediction accuracy. Oliveira et al., available at T. P. Oliveira, J. S. Barbar, and A. S. Soares, "Computer network traffic prediction: a comparison between traditional and deep learning neural networks," International Journal of Big Data Intelligence, vol. 3, no. 1, pp. 28-37, 2016, achieved better results with recurrent neural networks (RNNs) compared to stacked auto-encoders for Internet traffic prediction; however, it still used imputed data, thereby introducing unwanted noise into the dataset.
In another forecasting method, Wang et al., available at J. Wang, J. Tang, Z. Xu, Y. Wang, G. Xue, X. Zhang, and D. Yang, "Spatiotemporal modelling and prediction in cellular networks: A big data enabled deep learning approach," in IEEE INFOCOM 2017-IEEE Conference on Computer Communications. IEEE, 2017, pp. 1-9, combined an auto-encoder with a long short-term memory (LSTM) network to consider spatial dependency but faced challenges with lossy representations and capturing nearby cell dependencies. The inventors surprisingly found that the call data records (CDRs) for a forecast location were dependent on CDRs for neighbouring cells. For example, if a user is travelling between locations in a city, their data may be transmitted to multiple nearby cells. The method in Wang et al. failed to capture this dependency, thereby leading to inaccuracy in the forecast. Furthermore, this known method used data imputation techniques, thereby introducing noise into the dataset.
In yet further forecasting methods, Zhang et al., available at C. Zhang, H. Zhang, D. Yuan, and M. Zhang, "Citywide cellular traffic prediction based on densely connected convolutional neural networks," IEEE Communications Letters, vol. 22, no. 8, pp. 1656-1659, 2018, introduced a densely connected convolutional neural network (CNN) for citywide traffic forecasting, considering spatial and temporal dependencies. Recent work by Lin et al., available at J. Lin, Y. Chen, H. Zheng, M. Ding, P. Cheng, and L. Hanzo, "A data-driven base station sleeping strategy based on traffic prediction," IEEE Transactions on Network Science and Engineering, 2021, proposed an intelligent data-driven base station (BS) sleeping mechanism using a multigraph convolutional network (MGCN) to capture spatial information for spatiotemporal cellular traffic prediction. They incorporated hourly, daily, and weekly periodic data into a multi-channel LSTM system to extract temporal features. The MGCN-LSTM model outperformed other models in terms of forecast accuracy. In terms of energy-saving approaches, Gao et al., available at Y. Gao, M. Zhang, J. Chen, J. Han, D. Li, and R. Qiu, "Accurate load prediction algorithms assisted with machine learning for network traffic," in 2021 International Wireless Communications and Mobile Computing (IWCMC). IEEE, 2021, pp. 1683-1688, presented load prediction models for traffic anticipation in cells. They employed a linear ensemble model with sub-models using linear regression and regression tree techniques, and trained the data with a residual convolutional neural network (ResNet), available at K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 770-778. While these previous works addressed spatio-temporal correlation in time series forecasting, the existing methods do not adequately account for challenges arising from missing training records in supervisory data. The use of data imputation techniques in these known methods introduces noisy entries into the dataset, which leads to poor performance of the forecasting models. Furthermore, these models fail to capture the dependency of neighbouring or adjacent cells on the central cell (or forecast location), a factor the inventors found to have an effect on the accuracy of the forecast.
The method developed by the inventors forecasts spatiotemporal, multi-step, and multi-variate 5G usage time series accurately by capturing important characteristics and leveraging the training data. Unlike traditional methods relying on data imputation for missing records, the inventors' approach does not explicitly require such techniques. The inventors' algorithm identifies optimal time series patterns using "time series step-size identification" and constructs a deep learning model that integrates all factors for precise forecasting.
In an inputting step S10 a dataset from a forecast location and one or more adjacent locations is input into the model. The dataset comprises spatio-temporal characteristics of each location and multivariate data recorded at each location.
The inventors utilized Telecom Italia's real-world dataset for time series forecasting of cellular traffic data in Milan. The dataset spans an 11-week period and consists of millions of call data records (CDRs) sampled at 10-minute intervals. However, of course any suitable dataset may be used.
The CDRs in the Telecom Italia's dataset contain eight features related to cellular network usage, including CellID (cellular site identifier), Datetime, SmsIn (incoming SMS count), SmsOut (outgoing SMS count), CallIn (incoming call count), CallOut (outgoing call count), and Internet (internet activity). Milan's city map is divided into 100×100 areas, each covering 0.05 km2, and the dataset's CDRs are spatially aggregated based on the coverage area of the processing base station. Despite substantial missing data records, researchers have successfully utilized this dataset for predicting future network traffic and studying cellular network dynamics. Its significance extends not only to 4G networks but also to 5G and beyond, making it an invaluable resource for time-series forecasting in the realm of cellular networks.
In step S10 the data may be prepared or pre-processed before being input into the model. For example, the original Milan dataset used by the inventors was initially collected at 10-minute intervals, but forecasting at this level of granularity may lead to network instability or excessive overhead. To overcome these challenges, the inventors resampled the data by aggregating the traffic on an hourly basis. For training the proposed learning model, the inventors extracted multiple features from the Milan dataset. In addition to the original network features such as network cell ID, day of the week, and call records (internet, SMS-in/out, calls-in/out), the inventors included six additional specially selected features. These features encompassed the day of the observation, the time of the day, an indicator variable for whether it is a working hour (9 am to 5 pm), a holiday indicator, a weekday indicator, and the cell ID. The units of the processed dataset used in this method are the same as those of the dataset discussed in relation to
To preprocess the continuous variables, the inventors applied clipping to handle outliers by setting the threshold at the 95th percentile. Following that, they performed min-max standardization to scale the variables within the range of 0 to 1. The inventors' time series forecasting analysis considered the correlation between the call data records of the targeted cell, that is, the forecast location, and its adjacent (neighbouring) cells. The inventors found that accounting for all eight neighbouring cells surrounding each cell was preferable and produced the highest accuracy forecasts. However, capturing the effect of the dependency of one or more adjacent cells (or adjacent locations) also improved the forecasting accuracy for the forecast location. In situations where cells were located at corners or sides and lacked all eight neighbouring cells, the inventors employed zero-based padding to prepare the time series data. For the neural network, the inventors provided the original five features from each of the eight neighbouring cells as inputs. Additionally, for the central cell (C0), all 11 features, incorporating the six selected features, were used.
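By way of illustration only, the following is a minimal Python sketch of the preprocessing described above (hourly resampling, calendar feature extraction, 95th-percentile clipping, and min-max scaling). The file name, column names, and working-hour boundaries are illustrative assumptions rather than the inventors' actual pipeline, and the holiday indicator is omitted as it requires an external holiday calendar.

```python
import pandas as pd

# Hypothetical CDR frame: one row per (cell, 10-minute interval).
df = pd.read_csv("milan_cdrs.csv", parse_dates=["datetime"])

cdr_cols = ["sms_in", "sms_out", "call_in", "call_out", "internet"]

# Resample the original 10-minute records to hourly totals per cell.
hourly = (df.set_index("datetime")
            .groupby("cell_id")[cdr_cols]
            .resample("1H")
            .sum()
            .reset_index())

# Engineered calendar features for the central cell.
hourly["hour"] = hourly["datetime"].dt.hour
hourly["day_of_week"] = hourly["datetime"].dt.dayofweek
hourly["is_weekday"] = (hourly["day_of_week"] < 5).astype(int)
hourly["is_working_hour"] = hourly["hour"].between(9, 17).astype(int)

# Clip outliers at the 95th percentile, then min-max scale to [0, 1].
for col in cdr_cols:
    upper = hourly[col].quantile(0.95)
    hourly[col] = hourly[col].clip(upper=upper)
    lo, hi = hourly[col].min(), hourly[col].max()
    hourly[col] = (hourly[col] - lo) / (hi - lo + 1e-9)
```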
In a determining step S20, a longest time-series sequence length of the dataset is determined. The time-series sequence length indicates a total length of consecutive time steps with complete data in the dataset, with the occurrence frequency of the longest time-series sequence length appearing in the dataset being higher than a threshold number of the dataset.
The inventors developed a method of “Longest Common Continuous Frequent Sequences” (LCCFS) to identify a longest time-series sequence length in the dataset. More details of the LCCFS are provided in relation to
In a training step S30 a forecast location neural network is trained based on the determined time series sequence length to encode the multivariate data from the forecast location into a forecast location vector. For example, the inventors used a Bidirectional Long Short-Term Memory (Bi-LSTM) recurrent neural network, but of course any appropriate neural network may be used. For example, an appropriate neural network may be another recurrent neural network.
A bidirectional Long-short term memory (BiLSTM) model is a type of recurrent neural network (RNN) that can learn long-term dependencies in sequence data. It does this by using two LSTMs, one that reads the sequence from left to right and the other that reads it from right to left (that is, one LSTM reads the sequence forwards in time, and one reads the sequence backwards in time). The outputs of the two LSTMs are then combined to give a representation of the sequence that takes into account both the past and the future.
The BiLSTM is updated in two directions, forward and backward, using, for example, a backpropagation algorithm. Backpropagation is an algorithm used to update the weights of a neural network in order to minimize a loss function. In the case of the BiLSTM, the loss function is the error between the predicted output of the BiLSTM and the actual output in the sequence (for example, determined using mean absolute error).
The functionality of the BiLSTM can be shown as follows:

$h_t = \mathrm{LSTM}_{\text{forward}}(x_t, h_{t-1})$

$g_t = \mathrm{LSTM}_{\text{backward}}(x_t, g_{t+1})$

$y_t = f(\mathrm{concatenate}(h_t, g_t))$

Here $x_t$ represents the input at time step t, $h_t$ represents the hidden state of the forward LSTM at time step t, $g_t$ represents the hidden state of the backward LSTM at time step t, $h_{t-1}$ represents the hidden state of the forward LSTM at the previous time step t−1 and $g_{t+1}$ represents the hidden state of the backward LSTM at the next time step t+1.

$\mathrm{LSTM}_{\text{forward}}$ and $\mathrm{LSTM}_{\text{backward}}$ denote the LSTM functions for the forward and backward directions, respectively. 'concatenate' is the operation that concatenates the outputs of the forward and backward LSTMs. $f$ is the activation function that transforms the concatenated output into the final output $y_t$. The Bidirectional LSTM processes the input sequence $x_t$ from left to right with the forward LSTM and from right to left with the backward LSTM. The final output $y_t$ goes to the next layer.
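By way of illustration, the following minimal Keras sketch builds a BiLSTM encoder consistent with the equations above; the sequence length, feature count, and hidden size are illustrative assumptions.

```python
import tensorflow as tf
from tensorflow.keras import layers

n_steps, n_features, units = 5, 11, 64  # illustrative sizes

# A bidirectional LSTM concatenates the final forward and backward hidden
# states, so the encoded vector has dimension 2 * units.
encoder = tf.keras.Sequential([
    layers.Bidirectional(layers.LSTM(units)),
])

x = tf.random.uniform((1, n_steps, n_features))
print(encoder(x).shape)  # (1, 128)
```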
In a training step S40, for each of the one or more adjacent locations, an adjacent location neural network is trained based on the determined time series sequence length to encode the multivariate data from each of the one or more adjacent locations into an adjacent location vector. As above, for each of the adjacent location neural networks the inventors trained a BiLSTM neural network.
In a combining step S50 the one or more adjacent location vectors are combined into a combined adjacent location vector. In an example with more than one adjacent location, the combining step may comprise concatenating the adjacent location vectors for each adjacent location into a concatenated location vector. The concatenated location vector may be input into a sequence of two multilayer perceptron layers, the output of a first multilayer perceptron layer being input into a second multilayer perceptron layer. The inventors found that the multilayer perceptron layers may capture the influence of neighbouring cells on the central cells in a subsequent stage.
The skilled person would understand that in an example with one adjacent location vector, the combining step may combine, or transform, the one adjacent location vector into the combined adjacent location vector. For example, the one adjacent location vector may be passed through the multilayer perceptron layers with the output being the combined adjacent location vector.
In a composing step S60 the forecast location vector and combined adjacent location vector are composed into a final combined vector. The forecast location vector and combined location vector may be composed into the final combined vector by concatenating the forecast location vector and combined adjacent location vectors into a final latent vector. The composing may further comprise inputting the final latent vector into a self-attention mechanism and determining as an output the final combined vector.
In a decoding step S70 the final combined vector may be decoded to generate a forecast from the dataset for the forecast location. The decoding step may comprise training an output neural network based on the determined time series sequence length to encode the final combined vector, and inputting hidden layers of the trained output neural network into a time distributed dense layer to generate an output, the output from the time distributed dense layer being the forecast.
The output neural network may be a BiLSTM neural network. Hence, each hidden layer input into the TimeDistributed dense layer may be a hidden layer of the Bi-LSTM neural network. The skilled person would understand that the forecast generated by the TimeDistributed dense layer may be used to train each of the above neural networks, that is, the forecast location neural network, each of the adjacent location neural networks and the output neural network. For example, in a first training run the neural networks may be assigned random weights and the multivariate time series data input into them. The forecasting method may then generate a forecast.
The generated forecast may then be used to inform a backpropagation step for training each of the neural networks. For example, the generated forecast may be used to calculate a mean absolute error for backpropagation. More detail of backpropagation is given in connection with
The final combined vector may be replicated using, for example, a repeat vector unit and each replicated final combined vector may be decoded to generate a forecast at a different time point. Using such a method, a multi-step forecast may be generated. That is, the method may be used to generate a forecast for a time step t+1 and a time step t+2.
The inventors found that the accuracy of the forecast may be improved by training more than one model using different time-series sequence lengths. For example, the forecasting method may further include determining one or more shorter time-series sequence lengths of the dataset and repeating, for the one or more shorter time-series lengths, the steps of training the forecast location neural network and the one or more adjacent location neural networks, combining into a combined adjacent location vector, composing into a final combined vector, and decoding to generate a forecast. Once the neural networks are trained, an ensemble of the generated forecasts for each of the longest time-series sequence length and the one or more shorter time-series lengths may be determined to generate a final forecast for the forecast location. That is, for example, an ensemble of each of the trained models may be taken and the output of the ensemble used as a forecast. The ensemble may be a weighted average ensemble determined using, for example, a grid search method.
Example details of the LCCFS process are as follows:
In a determining step s100, a number of data records with complete data may be determined. A data record may comprise, for example, multivariate data recorded at each location at each time step in the dataset.
Let d represent a day number, with 1≤d, and let t represent the hour of the day, with 1≤t≤24. Now, considering that K% of the data may be missing in total, based on the values of d and t, the total number of data records with complete data (that is, valid time series records) may be determined using the following equation:

$N_{\text{valid}} = d \times t \times \left(1 - \frac{K}{100}\right)$
In a setting step s110, a threshold number may be set from the complete data in the dataset. For example, in the following steps for determining a longest continuous common sequence, an occurrence frequency of a time-series may be compared to the threshold number set from the complete data. The inventors found through an empirical method that, with 35% missing data, setting the threshold number as 50% of the complete data (for example, 12 in a single-day dataset, using the figures above) generated the most accurate forecast. Of course, the threshold number may vary depending on the percentage of missing data in the dataset.
In a second determining step s120, a length and occurrence frequency for each continuous common sequence in the multivariate data may be determined. The length may be a number of complete consecutive multivariate data in the dataset and the occurrence frequency the number of times the continuous common sequence occurs in the dataset.
An example of determining the length and occurrence frequency is as follows. Let $D_i$ represent the data corresponding to the ith day, where 1≤i≤(total number of days). Additionally, let $H_{d=i,t=1}, H_{d=i,t=2}, \ldots, H_{d=i,t=24}$ denote the hours for the ith day,
where, as before, d represents a day number, with 1≤d, and t represents the hour of the day, with 1≤t≤24.
However, due to the presence of missing data, some of $H_{d=i,t=1}, H_{d=i,t=2}, \ldots, H_{d=i,t=24}$ may contain null (or missing) values, which can be spread across all the dates. The inventors scanned all the records of all the dates to obtain a list of all continuous common sequences along with their occurrence frequency (and the total percentage of valid time series records (i.e., the records where $H_{d,t}$ is not null) may also be obtained). Let the length of a continuous common sequence of hourly records without any missing or null values be denoted by $L_C$, where 1≤$L_C$≤24. The list of all time series sequences with a step size of w may be denoted by $S_{L_C=w}$ and be given by:

$S_{L_C=w} = \left\{ (H_{d,t=h}, H_{d,t=h+1}, \ldots, H_{d,t=h+w-1}) \;\middle|\; H_{d,t} \text{ is not null for } h \le t \le h+w-1 \right\}$
where 1≤h≤24. The count of the number of time series sequences with a step size of w may further be represented as $|S_{L_C=w}|$. To obtain the LCCFS, the inventors calculated all $S_{L_C}$ for each value of $L_C$ and determined their respective counts $|S_{L_C}|$.
In a setting step s130, the (longest) continuous common sequence in the multivariate data with an occurrence frequency above the threshold value may be set as the longest time-series sequence length. Valid time series records, excluding missing data, may be determined based on the percentage K % of missing data. For example, the selection based on the LCCFS may apply the following steps.
1) For all values of w, check $S_{L_C=w}$ against the threshold number set in step s110: eliminate any list $S_{L_C=w}$ whose occurrence count $|S_{L_C=w}|$ is below the threshold number.
2) From the remaining list of time series, select the list of time series from $S_{L_C}$ having the highest value of w. Suppose w′ is the highest value of w after the elimination applied in the previous step. In this example, this results in the selection of the list of time series $S_{L_C=w'}$.
In this example, the inventors used the Longest Common Continuous Frequent Sequences method to determine a longest time-series sequence length of the dataset.
However, of course other methods may be used to determine an appropriate length for the longest time-series sequence.
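By way of illustration only, the following Python sketch implements an LCCFS-style selection consistent with steps s100 to s130 above. It assumes a per-day boolean completeness mask and counts, for each window length w, the number of continuous complete sub-sequences of that length; the data structure and function names are hypothetical.

```python
from collections import Counter

def lccfs(complete_mask_by_day, threshold):
    """complete_mask_by_day maps each day to a list of 24 booleans, True
    where the hourly record has no missing values. Returns the longest
    window length w whose occurrence count across all days is at least
    `threshold` (0 if no length qualifies)."""
    counts = Counter()
    for mask in complete_mask_by_day.values():
        run = 0
        for valid in list(mask) + [False]:  # sentinel flushes the final run
            if valid:
                run += 1
                continue
            # A maximal run of `run` complete hours contains run - w + 1
            # continuous sequences of every length w <= run.
            for w in range(1, run + 1):
                counts[w] += run - w + 1
            run = 0
    frequent = [w for w, c in counts.items() if c >= threshold]
    return max(frequent) if frequent else 0

# Example: one day with complete runs of 7 and 5 hours.
masks = {1: [True] * 7 + [False] + [True] * 5 + [False] * 11}
print(lccfs(masks, threshold=3))  # -> 5
```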
The inventors found that with the Milan dataset, the computed LCCFS size was 7. From a multi-step model design perspective, the best model employed the initial five time series steps for training and the subsequent two continuous steps for forecasting. From the longest time-series sequence length, the inventors were able to acquire a strong understanding of time series patterns and enable effective predictions. In an example method using three neural network models with different time series sequence lengths, the highest-ranked sequence length from LCCFS was 7, the second highest was 6 and the third highest was 5. The inventors utilized these first three ranks of LCCFS sequence lengths for designing the ensemble of the system disclosed herein.
Module 1: In a first module, the impact of time series features from eight neighbouring cells on a central/targeted cell may be examined. As discussed above the inventors used the Milan dataset and investigated multivariate time-series data associated with 5G network usage. Of course, this method may be applied to any multivariate data. The inventors assumed that changes in 5G network usage requirements for each cell Cj (where 1≤j≤8) over time may influence the forecast for the targeted cell (or target location) C0. To account for the influence of each adjacent neighbouring cell Cj (or adjacent locations), where (1≤j≤8), on C0, the inventors utilized a stack of neural networks. In this example the neural networks used by the inventors were a stack of BiLSTM networks 610, 615a-615g (that is, recurrent neural networks). However, any suitable neural network may be used. The inventors found that the BiLSTM encoders effectively capture temporal dynamics and encode neighbourhood data for accurate forecasting in multivariate-multistep time series scenarios.
As discussed in relation to
The dataset used by the inventors, and as discussed in relation to
The dataset table in
The dataset may be processed as discussed in relation to
Let $x_j = (x_{j1}, x_{j2}, \ldots, x_{ji}, \ldots, x_{jn})$ be the input sequence representing the features from cell $C_j$, where $x_{ji}$ denotes the feature list (i.e., the data) at the ith time step. Further, let the total number of time steps be denoted as n. For the adjacent neighbouring cells (or locations), the inventors considered the 5 features originally available in the dataset, i.e., k=5 features available at each time step. For the forecast location (that is, the central cell), the inventors considered the 5 original features available in the dataset and also the 6 additional features they generated. Hence, for the central cell $C_0$ the value of k=11.
For each of the adjacent locations C1-C8, the inventors trained a neural network (which may be called adjacent location neural networks, and in this example were Bi-LSTM models) based on the determined time series sequence lengths to encode the multivariate data from each adjacent location into an adjacent location vector.
When the sequence $x_j$ is passed through a BiLSTM layer, the output may be represented as $h_j = (h_{j1}, \ldots, h_{ji}, \ldots, h_{jn})$, where $h_{ji}$ is the output at time step i and may be computed as:

$h_{ji} = \mathrm{concatenate}\!\left(\overrightarrow{h_{ji}}, \overleftarrow{h_{ji}}\right)$

where $\overrightarrow{h_{ji}}$ is the forward hidden state at time step i, and $\overleftarrow{h_{ji}}$ represents the backward hidden state at time step i.
The inventors utilized the output of the last layer of the BiLSTM, that is, an adjacent location vector for each of the adjacent cells, as the impact of that cell's 5G network usage. Furthermore, the last hidden layer of the Bidirectional Long Short-Term Memory (BiLSTM), denoted as $h_{jn}$, which serves as the output of the BiLSTM stack, was used. The inventors combined each of the adjacent location vectors into a combined adjacent location vector.
For example, the outputs of the Bi-LSTM models, which may be denoted as $y_{n,C_j}$ for all 1≤j≤8, may be combined by a concatenation step 620 to form a single vector in the subsequent step. The concatenation operation inherently captures the spatial arrangement of neighbouring cells through its fixed order of arrangement. The concatenated output vector may be represented as x′ and given as:

$x' = \mathrm{concatenate}\!\left(y_{n,C_1}, y_{n,C_2}, \ldots, y_{n,C_8}\right)$
Furthermore, the combining step may comprise inputting the concatenated location vector into a sequence of two multilayer perceptron layers, the output of a first multilayer perceptron layer being input into a second multilayer perceptron layer. For example, the inventors passed the vector x′ through a sequence of two multilayer perceptron (MLP) layers. The inventors found that this may capture the influence of neighbouring cells on the central cells in a subsequent stage.
The output Y′ after the first perceptron layer may be computed as:

$Y' = f\!\left(W' \cdot x' + b'\right)$
where $W' = (w_1', w_2', \ldots, w_n')$ is a weight vector, b′ is a scalar bias term and f represents the activation function. In this example the inventors used the SeLU activation function, available at D. Pedamonti, "Comparison of non-linear activation functions for deep neural networks on mnist classification task," arXiv preprint arXiv:1804.02763, 2018. The output Y′ from this layer may be further given to a second perceptron layer, which may perform a similar operation as in the equation above, resulting in an output Y″.
Similarly, for the forecast location (i.e., the central cell C0), the inventors trained a neural network (which may be referred to as a forecast location neural network, and in this example is a Bi-LSTM model 610) based on the determined time series sequence lengths to encode the multivariate data from the forecast location into a forecast location vector. The temporal influence of the 5G usage features in the dataset and other factors may be utilized. An objective was to capture the impact of temporal changes in 5G usage on the central cell and predict its future 5G data usage. To achieve this, the inventors utilized the BiLSTM layer 610 by inputting the central cell's features. By using the same type of BiLSTM layer as described before, the output of the last block specifically for the central cell, denoted $y_{m,C_0}$, was determined.
In a composing step the forecast location vector and combined adjacent location vector may be composed into a final combined vector. For example, to compose the forecast location vector and combined adjacent location vector into a final combined vector the inventors used a concatenation operation 640. The output Y″ and the output of the last block of the central cell's BiLSTM model were concatenated into a single vector. Let z represent the concatenation of both outputs:

$z = \mathrm{concatenate}\!\left(Y'', y_{m,C_0}\right)$
The final combined vector may be replicated using a repeat vector unit 650. For example, the repeat vector unit is shown as a second notational module in
Module 2: This module of the architecture incorporates the RepeatVector unit to expand the output of the forecasting model for a multi-step forecast. By utilizing a RepeatVector operation, the output z obtained in the previous module may be replicated r times. Hence, given an input with, for example, $z \in \mathbb{R}^{n \times 1}$, the RepeatVector operation may produce $Z_{\text{repeat}}$ as shown below:

$Z_{\text{repeat}} = \left[z^{(1)}, z^{(2)}, \ldots, z^{(r)}\right], \quad z^{(i)} = z \text{ for all } 1 \le i \le r$
Here, each element $z^{(i)}$ within the matrix $Z_{\text{repeat}}$ corresponds to the original vector z. The inventors used the model to forecast the output for two consecutive time-steps. That is, for example, given an input at time t, the model may predict a forecast for times t+1 and t+2. Hence, as the model was set to predict two consecutive time-steps, the value of r was set to 2. Of course, the model may be used to predict fewer than two consecutive time steps (i.e., one time step) or more than two consecutive time steps.
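By way of illustration, a minimal Keras usage of the RepeatVector operation described above, with an illustrative vector dimension:

```python
import tensorflow as tf
from tensorflow.keras import layers

z = tf.random.uniform((1, 192))        # illustrative combined vector z
z_repeat = layers.RepeatVector(2)(z)   # r = 2: forecasts for t+1 and t+2
print(z_repeat.shape)                  # (1, 2, 192)
```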
Composing the forecast location vector and combined adjacent location vector into the final combined vector may additionally comprise inputting the single vector (for example, the concatenated vector z above) into a self-attention mechanism 660a, 660b and determining as an output the final combined vector. The self-attention mechanism 660a, 660b is shown as part of a notional module 3 in
Module 3: In this particular module of the architecture, a self-attention layer may be employed to facilitate and enhance the multi-variate time-series forecast in the final output. The self-attention layer, originally proposed in A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, “Attention is all you need,” Advances in neural information processing systems, vol. 30, 2017, may be used to evaluate the importance of the features computed in the preceding stages by utilizing learned attention weights.
In this example, with two vectors generated from the repeat vector operation, each vector $z^{(i)}$ within the $Z_{\text{repeat}}$ matrix underwent a self-attention operation, resulting in the calculation of context vectors $V_l^{(i)}$ for each of the n elements in $z^{(i)}$ as follows:

$V_l^{(i)} = \sum_{m=1}^{n} \alpha_{l,m}^{(i)} \, z_m^{(i)}$
Here $z_m^{(i)}$ denotes the mth element of $z^{(i)}$. Further, the attention weight $\alpha_{l,m}^{(i)}$, assigned to the mth element of the input sequence when computing the lth element of the context vector, is determined through a softmax function applied to a set of learned attention scores. These attention scores, for example, quantify the similarity between each pair of elements in the input sequence. The calculation of attention weights may be defined as follows:

$\alpha_{l,m}^{(i)} = \frac{\exp\!\left(\mathrm{score}\!\left(z_l^{(i)}, z_m^{(i)}\right)\right)}{\sum_{m'=1}^{n} \exp\!\left(\mathrm{score}\!\left(z_l^{(i)}, z_{m'}^{(i)}\right)\right)}$
Here, $\mathrm{score}(z_l^{(i)}, z_m^{(i)})$ represents a learned function that computes a similarity score between the lth and mth elements of the input sequence. The inventors used a score function which employed a scaled dot product, which may be expressed as follows:

$\mathrm{score}\!\left(z_l^{(i)}, z_m^{(i)}\right) = \frac{\left(W'_{\text{query}} z_l^{(i)}\right)^{\top} \left(W'_{\text{key}} z_m^{(i)}\right)}{\sqrt{d_k}}$
In this expression, $W'_{\text{query}}$ and $W'_{\text{key}}$ denote learned weight matrices that project the input sequence into the query and key spaces, respectively, and $d_k$ denotes the dimension of those projections (the scaling factor of the scaled dot product).
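By way of illustration only, the following Keras sketch implements a simplified self-attention layer matching the scaled dot-product equations above; the projection dimension and class name are illustrative assumptions, and biases and a value projection are omitted for brevity.

```python
import tensorflow as tf
from tensorflow.keras import layers

class ScaledDotSelfAttention(layers.Layer):
    """Simplified self-attention per the equations above: learned
    query/key projections, scaled dot-product scores, softmax weights."""

    def __init__(self, dim):
        super().__init__()
        self.wq = layers.Dense(dim, use_bias=False)  # W'_query
        self.wk = layers.Dense(dim, use_bias=False)  # W'_key
        self.dim = dim

    def call(self, z):                               # z: (batch, n, d)
        q, k = self.wq(z), self.wk(z)
        scores = tf.matmul(q, k, transpose_b=True) / tf.sqrt(float(self.dim))
        alpha = tf.nn.softmax(scores, axis=-1)       # attention weights
        return tf.matmul(alpha, z)                   # context vectors V

v = ScaledDotSelfAttention(32)(tf.random.uniform((1, 2, 192)))
print(v.shape)  # (1, 2, 192)
```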
The final combined vector may be decoded in a decoding step to generate a forecast from the dataset for the forecast location. The decoding step (or decoding module) may form part of the notional module 3 shown in
In this example, the outputs $V_l^{(i)}$ obtained from the self-attention layer are passed through a stack of BiLSTM layers 670a, 670b. That is, the neural network used by the inventors was a BiLSTM neural network. Of course, any suitable neural network may be used. However, in contrast to the previous BiLSTM layers (i.e., the Bi-LSTM layer used in the notional module 2), all the hidden layers of these BiLSTM layers may be collectively provided as input to a subsequent TimeDistributed dense layer (TDL) 680a, 680b. The inventors found that the inclusion of the TDL may further enhance the prediction of multivariate features by independently processing the hidden layer outputs for each time step.
In this example, within the TimeDistributed dense layer (TDL), the hidden state outputs $(h_1^{(i)}, h_2^{(i)}, \ldots, h_l^{(i)}, \ldots, h_n^{(i)})$ are processed individually. Each element $h_l^{(i)}$ at time step l was passed through L dense layers with shared learnable parameters θ, resulting in an output $y_{o_l}^{(i)}$ computed as follows:

$y_{o_l}^{(i)} = f\!\left(D_{\theta}^{(L)}\!\left(D_{\theta}^{(L-1)}\!\left(\cdots D_{\theta}^{(1)}\!\left(h_l^{(i)}\right)\right)\right)\right)$

where each $D_{\theta}^{(\cdot)}$ denotes a dense layer with the shared parameters θ.
Here, f represents an optional activation function, for which, for example, the inventors used SeLU. Furthermore, in this example the value of L was set to 5, chosen as a design parameter. The inventors repeated the operation defined above for all l in the range of 1 to K, where K represents the total number of variables in the multi-variate output. In an example where a vector repeat operation has been used, as discussed above, the entire operation described in this module may be repeated for all r vectors in $Z_{\text{repeat}}$.
Hence, in this example the proposed system utilizes a sequence of length n as the input. Each element in this sequence may comprise 11 features from a central target cell C0 and 5 features from each of the eight neighbouring cells (C1 to C8) up to a time 't'. From this given input, the proposed system may forecast the network usage of C0 for the subsequent two time steps (t+1 and t+2) for five usage variables: smsin, smsout, callin, callout, and internet.
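By way of illustration only, the following Keras functional-API sketch assembles the three notional modules described above into a single model: BiLSTM encoders for the central and neighbouring cells, concatenation and two SeLU perceptron layers, a RepeatVector for the two output steps, self-attention (here the built-in scaled dot-product Attention layer stands in for the custom layer sketched earlier), a BiLSTM decoder, and a TimeDistributed dense head. All layer sizes are illustrative assumptions, not the inventors' reported hyperparameters.

```python
import tensorflow as tf
from tensorflow.keras import layers

n_steps, r_out = 5, 2            # 5 input steps; forecasts for t+1 and t+2
k_centre, k_neigh, n_vars = 11, 5, 5
units = 64                       # illustrative hidden size

# Module 1: BiLSTM encoders for the central cell C0 and neighbours C1..C8.
c0_in = layers.Input(shape=(n_steps, k_centre), name="C0")
c0_vec = layers.Bidirectional(layers.LSTM(units))(c0_in)

neigh_in, neigh_vecs = [], []
for j in range(1, 9):
    inp = layers.Input(shape=(n_steps, k_neigh), name=f"C{j}")
    neigh_in.append(inp)
    neigh_vecs.append(layers.Bidirectional(layers.LSTM(units))(inp))

x_prime = layers.Concatenate()(neigh_vecs)                 # x'
y_pp = layers.Dense(64, activation="selu")(                # Y'' after two
    layers.Dense(128, activation="selu")(x_prime))         # MLP layers

# Module 2: compose z and replicate it for the multi-step output.
z = layers.Concatenate()([c0_vec, y_pp])
z_rep = layers.RepeatVector(r_out)(z)

# Module 3: self-attention, BiLSTM decoder, TimeDistributed dense head.
att = layers.Attention(use_scale=True)([z_rep, z_rep])
dec = layers.Bidirectional(layers.LSTM(units, return_sequences=True))(att)
out = layers.TimeDistributed(layers.Dense(n_vars))(dec)

model = tf.keras.Model([c0_in] + neigh_in, out)
model.compile(optimizer="adam", loss="mae")
model.summary()
```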
The inventors trained the neural network using backpropagation techniques. Mean Absolute Error (MAE) is a widely used metric to assess the accuracy of time series forecasting models (see T. Chai and R. R. Draxler, "Root mean square error (RMSE) or mean absolute error (MAE)? Arguments against avoiding RMSE in the literature," Geoscientific Model Development, vol. 7, no. 3, pp. 1247-1250, 2014). It measures the average absolute difference between each predicted value and its corresponding actual value. The formula for MAE in time series forecasting is:

$\mathrm{MAE} = \frac{1}{n} \sum_{i=1}^{n} \left| Y_i - \hat{Y}_i \right|$
Here, n is the total number of observations in the dataset, Yi represents the actual value of the i-th observation, Ŷi represents the predicted value of the i-th observation, |⋅| denotes the absolute value function, and Σ indicates the sum of the absolute differences between the predicted and actual values across all observations.
To train the neural network, the computed MAE loss was back-propagated through the system and the weights of each neural network in the system were optimized using an Adam optimizer based on stochastic gradient descent. For example, the training progress was monitored, and the process was halted if the MAE improvement was less than or equal to 0.01 for five consecutive epochs. The last saved model was chosen as the final model. Of course, other training regimes may be used to train the neural network.
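By way of illustration, the following sketch mirrors the described training regime (MAE loss, Adam optimizer, and early stopping when the MAE improves by no more than 0.01 for five consecutive epochs); the stand-in model and random data are purely illustrative.

```python
import numpy as np
import tensorflow as tf
from tensorflow.keras import layers

# Toy stand-in model and data; in practice these would be the architecture
# and training windows described above.
model = tf.keras.Sequential([
    layers.Bidirectional(layers.LSTM(32)),
    layers.RepeatVector(2),
    layers.LSTM(32, return_sequences=True),
    layers.TimeDistributed(layers.Dense(5)),
])
X = np.random.rand(256, 5, 11).astype("float32")
Y = np.random.rand(256, 2, 5).astype("float32")

# Stop when the validation MAE loss improves by no more than 0.01 for
# five consecutive epochs, mirroring the criterion described above.
early_stop = tf.keras.callbacks.EarlyStopping(
    monitor="val_loss", min_delta=0.01, patience=5, restore_best_weights=True)

model.compile(optimizer=tf.keras.optimizers.Adam(), loss="mae")
model.fit(X, Y, validation_split=0.2, epochs=200, callbacks=[early_stop])
```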
In this example, the architecture of a single neural network model and training method for that method has been described. However, the inventors found that multiple neural network models may be trained, and an ensemble taken to improve the accuracy of the forecast. Example methods for training an ensemble of neural network models are given in connection with
To train the neural network model the inventors used multivariate data, in this case taken from the Milan dataset. To use the multivariate data in training, an assumption of complex relations between multiple time-series was made. To emphasize the relationships among multiple time-series, the problem of multivariate time-series forecasting was formulated based on a data structure called a multivariate temporal graph (in this case, the inventors treated it as a case of non-Euclidean learning). The graph may be denoted as:

$G = (X, W)$

where $X = \{x_{it}\} \in \mathbb{R}^{N \times T}$ stands for the multivariate time-series input, N is the number of time-series (nodes), and T is the number of timestamps. The observed values at timestamp t may be denoted as $X_t \in \mathbb{R}^{N}$. $W \in \mathbb{R}^{N \times N}$ is the adjacency matrix, where $w_{ij} > 0$ indicates that there is an edge connecting nodes i and j, and $w_{ij}$ indicates the strength of this edge.

Given observed values of the previous K timestamps, $X_{t-K}, \ldots, X_{t-1}$, the task of multivariate time-series forecasting aims to predict the node values in the multivariate temporal graph $G = (X, W)$ for the next H timestamps, denoted by $\hat{X}_t, \hat{X}_{t+1}, \ldots, \hat{X}_{t+H-1}$. These values can be inferred by the forecasting model M with parameters φ and a graph structure G, where G can be input as a prior or automatically inferred from the data.
Accurately predicting 5G network usage may be useful for efficient network resource allocation and high-quality service delivery. However, dynamic usage patterns, non-stationary usage distribution, and the presence of a high amount of missing data make forecasting future network usage statistics challenging. To address these challenges, the inventors propose a neural network based time series forecasting approach which may handle multi-step, multi-variate, and spatiotemporal time series forecasting problems. The proposed approach may scan the data for continuity of time-steps and identify an optimal sequence length in order to address the issue of missing records in supervisory data. It may use separate BiLSTM layers to capture the impact of temporal 5G data usage changes in adjacent networks. Further, it may utilize the adjacent cells' impact, captured in a permutationally invariant ordering, for the central/targeted cell. In examples disclosed herein, the learning techniques used to capture the impact of changes in neighbouring cells on the target cell and for multi-step and multivariate forecasting are: self-attention, a RepeatVector, a BiLSTM network, and a TimeDistributed dense layer. The proposed approach outperforms existing state-of-the-art methods on the Milan dataset and may improve forecast accuracy in various applications, including network traffic forecasting.
In a first input block 705a, 705b, input data may be loaded into the system. As above, the data used by the inventors was taken from the Milan dataset. For the forecast location, the 5 features in the dataset were input along with 6 further selected features. For each of the neighbouring cells, the 5 features in the Milan dataset were input.
The input data may be input into an Encoder unit 710, 715. For example, the forecast location data may be input into Encoder unit 1, 710. Encoder unit 1 may consist of a neural network such as the BiLSTM neural network discussed in relation to
Similarly, for each of the adjacent locations (e.g., the neighbouring cells), an Encoder unit 2, 715, may encode the input features (e.g., the call data records, CDRs), each into an adjacent location vector. As before, Encoder unit 2 may consist of BiLSTM layers for each of the adjacent locations.
Outputs from Encoder unit 1 and Encoder unit 2 may be input into a concatenation layer 720. For example, the output of Encoder unit 1 may be a forecast location vector. The output of Encoder unit 2 may be adjacent location vectors for each of the adjacent cells. The adjacent cell vectors may be combined, using, for example, concatenation, into a concatenated location vector. The concatenated location vector may be input into a sequence of two multilayer perceptron layers, the output of a first multilayer perceptron layer being input into a second multilayer perceptron layer. The second multilayer perceptron layer may output a combined adjacent location vector.
The concatenation layer 720 may further compose the forecast location vector and combined adjacent location vector into a final combined vector. That is, the concatenation layer may concatenate the forecast location vector and combined adjacent location vector into a final latent vector.
In an example where consecutive multi-step time-series are forecast by the model, the final latent vector may be input into a repeat operation unit (not shown), for example, the repeat operation discussed in relation to
The final latent vector (or final latent vectors if a repeat operation is used) may be input into a self-attention unit 730. For example, the self-attention mechanism may be the self-attention mechanism discussed in relation to
The output from the self-attention unit may be input into a Decoder unit 740. The decoder unit may include a neural network (which may be referred to as an output neural network) and a Time distributed wrapper, as described above. The neural network may be a BiLSTM neural network as in
In an example where a repeat vector is used, the decoding unit may output multiple time-step forecasts (or predictions). For example, a first output 750a may be a forecast containing call data records for the forecast location at a time t+1. A second output 750b may be a forecast containing call data records for the forecast location at a time t+2.
The inventors evaluated the effectiveness of the neural network against other known methods for forecasting. The inventors utilized two baseline models for comparison purposes. To ensure accurate reproduction of the baseline results, the methodologies outlined in C. Zhang, H. Zhang, D. Yuan, and M. Zhang, "Citywide cellular traffic prediction based on densely connected convolutional neural networks," IEEE Communications Letters, vol. 22, no. 8, pp. 1656-1659, 2018, and M. Mohseni, S. Nikan, and A. Shami, "AI-based traffic forecasting in 5G network," in 2022 IEEE Canadian Conference on Electrical and Computer Engineering (CCECE). IEEE, 2022, pp. 188-192, were thoroughly reviewed and the inventors followed the same experimental procedures. The subsequent sections provide further details on the inventors' approach.
Spatiotemporal baseline: An LSTM, RNN and Perceptron based model is considered (also represented as the AR-LSTM based model by Mohseni et al.). The work reported in Mohseni et al. considered the model disclosed therein as a baseline for spatiotemporal forecasting. To reproduce the findings of Mohseni et al. for evaluation purposes, the inventors also used the same architecture and settings throughout the entire experiment.
2D-ConvLSTM: The 2D-ConvLSTM framework comprises four layers of 2D Convolutional LSTM and one 3D convolutional layer. It takes input data with a shape of (24, 100, 100, 4) to perform multivariate analysis. To reproduce the 2D-ConvLSTM model, the inventors followed the settings outlined by Mohseni et al. for all their experiments. The inventors used the reproduced model with the Milan dataset to produce a 'like-for-like' comparison. Each grid in the Milan dataset consists of 1487 records after resampling over a one-hour period. In line with Mohseni et al., the inventors allocated 70% (1040 records) of the dataset to the training set, 20% (298 records) to the validation set, and the remaining 10% (149 records) to the test set for each grid. Two separate experiments were conducted to demonstrate the effectiveness of the system disclosed herein.
Experiment-1: In contrast to Mohseni et al., the inventors' experimental setup involved training their model on 80% of the available data and testing it on the remaining 20%. To prevent overfitting and optimize model performance, the inventors implemented early stopping based on Mean Absolute Error (MAE). The training progress was monitored, and the process was halted if the MAE improvement was less than or equal to 0.01 for five consecutive epochs. The last saved model was chosen as the final model. In this example, in the testing phase, the inventors employed a rolling-based forecasting approach to evaluate the system disclosed herein against the baseline models. This involved using a sliding window technique to forecast data for each day. Starting with a 5-time-step history, the inventors predicted the next two time steps. The inventors then appended the forecasted data to the previous 3-time-step history and repeated the process until the entire 24-hour period was forecasted. The resulting forecasted data was stored separately. Finally, as Mohseni et al. did not explicitly disclose the method by which missing data was imputed, the inventors used the last 5 hours of the forecasted data for day-1 to fill in any gaps in the given 24-hour data for day-1. Using this modified data, the inventors proceeded to forecast data for day-2 using the same sliding window approach. This process continued for subsequent days from 21st December to 1st January, which constituted the test data, comprising 20% of the total timeframe.
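By way of illustration only, a minimal sketch of the rolling, sliding-window forecasting procedure described above; the function name and the dummy predictor are hypothetical.

```python
import numpy as np

def rolling_forecast(predict_fn, history, horizon=24, n_in=5):
    """Slide an (n_in -> n_out) forecaster forward, feeding its forecasts
    back in as history, until `horizon` steps have been produced.
    `predict_fn` maps a (1, n_in, n_vars) array to an (n_out, n_vars) array."""
    window = list(history[-n_in:])
    out = []
    while len(out) < horizon:
        pred = predict_fn(np.asarray(window)[None, ...])
        out.extend(list(pred))
        window = (window + list(pred))[-n_in:]
    return np.asarray(out[:horizon])

# Toy usage with a dummy 2-step-ahead predictor over 5 variables.
dummy = lambda x: x[0, -2:, :]                 # just repeats the last two steps
day1 = rolling_forecast(dummy, np.random.rand(5, 5))
print(day1.shape)                              # (24, 5)
```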
Experiment-2: The inventors further tested the model disclosed herein in a second experiment. The inventors' model may employ smaller input and output time step sizes of 5 and 2, respectively, in the multivariate setting. This allows it to predict 2 output time steps by using 5 input time steps for any given cell and time. That is, the longest time series sequence length may be split into training and test data. For example, as described in connection with
The inventors' analysis of the results shows that the system developed by the inventors outperforms the existing state of the art based on the data presented in Table I. The inventors' system may be multi-variate, and the reported value of 113.17* represents the time taken to load the model and forecast all five features (Internet, SMS-IN, SMS-Out, Call-IN, and Call-Out). In line with Zhang et al. and Mohseni et al., the inventors combined the forecasts for SMS-IN and SMS-Out into SMS, and Call-IN and Call-Out into Call. Further, Table II provides evidence of the effectiveness of the system disclosed herein in forecasting results for any cell and time interval, with consistent results regardless of the specific time and cell. However, the overall mean absolute error (MAE) score for the entire dataset is slightly lower. This could be due to the fact that the last 20% of the dataset includes significant events and holidays, such as year-end celebrations and the New Year, which were not present during the training phase. These results demonstrate that the system disclosed herein performs well in a variety of settings and can effectively forecast multiple features simultaneously. Furthermore, the model has the potential to be used in real-world applications where forecasting accuracy is crucial.
In conclusion, the model disclosed herein introduces a novel approach for addressing missing data in multi-step, multi-variate, and spatiotemporal time series forecasting without relying on data imputation techniques. Instead, the inventors propose identifying LCCFS-based time step sequences that frequently occur in the dataset. Additionally, in an example, the inventors leverage spatial and local features by analyzing the impact of changes in neighbouring cells on the target cell using separate Bidirectional Long Short-Term Memory (BiLSTM) networks. By concatenating the BiLSTM outputs and applying self-attention, the influence of neighbouring cells on the target cell was assessed. In an example, the approach disclosed herein also incorporates a RepeatVector, a BiLSTM network, and a TimeDistributed dense layer for achieving multi-step and multivariate forecasting.
Each of the models, model 1 810a, model 2 810b, and model 3 810c, may consist of the same, or substantially the same, architecture as the neural network model described in relation to
The three models may be trained in substantially the same way as the one neural network model described above. However, for each of the three models, a different time-series sequence length may be used to train the model. For example, as described above, the longest common continuous frequent sequence algorithm may be used to determine a longest time-series sequence length of the dataset. The longest time-series sequence length may be used to train model 1. A shorter time-series sequence length may be determined. The shorter time-series length may be the next-longest time series sequence length determined using the LCCFS algorithm. The shorter time-series sequence length may then be used to train model 2. A third-longest time-series sequence length may be determined, which then may be used to train model 3.
As discussed above, the inventors determined that, in the Milan dataset, the longest time series sequence length of the dataset was 7 hours. The second-highest time series sequence length was 6 hours, and the third-highest was 5 hours. Of course, with a different dataset, the LCCFS algorithm may determine a different longest time-series sequence length. As before, for the 7-hour sequence, the inventors used the initial five time series steps for training and the subsequent two continuous steps for forecasting. For the 6-hour sequence, the first 4 steps were used for training and the last 2 for forecasting, and for the 5-hour sequence the first 3 steps were used for training and the last 2 for forecasting. In this example, the inventors used the dataset to forecast the next 2 time steps. Of course, if a different number of time steps is to be forecast, the time series sequence may be partitioned differently, as shown in the sketch below. That is, for example, if the model were to predict the next 3 time steps, the longest time series in the example, 7 hours, may be partitioned to use the first 4 hours for training and the next 3 hours for forecasting.
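By way of illustration, a minimal sketch of partitioning a complete time-series run into training inputs and forecast targets, as described above; the function name and array shapes are illustrative.

```python
import numpy as np

def make_windows(run, n_in=5, n_out=2):
    """Split one complete time-series run (shape: steps x variables) into
    (input, target) pairs: n_in steps in, the next n_out steps out."""
    X, Y = [], []
    for i in range(len(run) - n_in - n_out + 1):
        X.append(run[i:i + n_in])
        Y.append(run[i + n_in:i + n_in + n_out])
    return np.asarray(X), np.asarray(Y)

# A 7-step run yields exactly one (5 in, 2 out) pair.
X, Y = make_windows(np.random.rand(7, 5))
print(X.shape, Y.shape)  # (1, 5, 5) (1, 2, 5)
```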
Thus, in this example, a proposed method for training the model may begin by inputting a dataset into the model for training. The dataset may be input from a forecast location and one or more adjacent locations. In this example the dataset comprises spatio-temporal characteristics of each location and multivariate data recorded at each location. The spatio-temporal characteristics of the network coverage area may be input data such as cell location and time of day. Furthermore, the input into the forecasting model may include historical call data records (CDRs) for various network services (e.g., internet, calls, text messages) recorded at different timestamps.
In this example, the one or more adjacent locations comprised 8 adjacent cells, or nearest neighbours, to a forecast location (or central cell). Hence, the model may take as an input the features of the central cell, C0, and the features of the adjacent cells (C1-C8). The inventors found that, by using the one or more adjacent locations along with the forecast location, the forecast for the central location may be improved, as the model takes into account the spatial influence of adjacent cells on the central cell. For example, this may be achieved by combining the features of the central cell and adjacent cells and processing them further through an attention layer. As a result, the forecast made for the central cell is influenced by the information from its adjacent cells.
Consider, for each of model 1 810a, model 2 and model 3, a functional block diagram which is the same as, or substantially the same as, the functional block diagram given in
For each of the models, a decoder side of the model may have a sequence of bi-directional LSTM networks that take in the latent representations to predict CDRs for the central cell (C0) at time t+1 and t+2.
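By way of example only, the following Keras sketch (in keeping with the Python/Keras implementation described later in this disclosure) illustrates one way the encoder-decoder described above might be assembled: a BiLSTM encoder per cell, concatenation of the adjacent-cell encodings followed by self-attention, and a repeat vector, BiLSTM and time-distributed dense decoder. The hidden size, the pooling step and the use of Keras's built-in Attention layer are illustrative assumptions rather than details disclosed by the inventors.

```python
from tensorflow.keras import layers, Model

N_STEPS_IN, N_STEPS_OUT = 5, 2      # e.g. the 7-hour sequence split above
N_FEAT_C0, N_FEAT_ADJ = 6, 5        # central-cell and adjacent-cell features
N_ADJ = 8                           # eight neighbouring cells C1-C8
UNITS = 64                          # assumed hidden size

# One BiLSTM encoder for the central cell C0.
inp_c0 = layers.Input((N_STEPS_IN, N_FEAT_C0), name="C0")
enc_c0 = layers.Bidirectional(layers.LSTM(UNITS))(inp_c0)

# One BiLSTM encoder per adjacent cell C1-C8.
adj_inputs, adj_encoded = [], []
for i in range(1, N_ADJ + 1):
    inp = layers.Input((N_STEPS_IN, N_FEAT_ADJ), name=f"C{i}")
    adj_inputs.append(inp)
    adj_encoded.append(layers.Bidirectional(layers.LSTM(UNITS))(inp))

# Concatenate the adjacent-cell encodings and weigh their influence on the
# central cell with self-attention (each cell treated as one attention step).
adj_stack = layers.Concatenate(axis=1)(
    [layers.Reshape((1, 2 * UNITS))(v) for v in adj_encoded])
attn = layers.Attention()([adj_stack, adj_stack])
adj_vec = layers.GlobalAveragePooling1D()(attn)

# Compose central and adjacent representations, then decode into CDR
# forecasts for C0 at times t+1 and t+2.
combined = layers.Concatenate()([enc_c0, adj_vec])
dec = layers.RepeatVector(N_STEPS_OUT)(combined)
dec = layers.Bidirectional(layers.LSTM(UNITS, return_sequences=True))(dec)
out = layers.TimeDistributed(layers.Dense(N_FEAT_ADJ))(dec)

model = Model([inp_c0] + adj_inputs, out)
```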
Three encoder-decoder models may be trained to address missing-data problems. For example, as discussed above, each model may be trained using a time-series sequence length determined using an LCCFS algorithm. In this example, the models accept input sequence lengths of 3, 4 and 5 time steps, calculated via the LCCFS algorithm. The inventors found that the model trained using the longer time series had more discriminative power to understand temporal variations, while the model trained on the shortest time-series length (in this case a time-series length of 3 hours) had been trained on more samples and could associate variable sequence patterns.
In the training phase the inventors combined the three models for generating a forecast from the input dataset. The inventors took an ensemble 815 of the three models. For example, the ensemble may be a weighted average ensemble determined using a grid search method. As an example, the grid search method used by the inventors gave a weighting of 0.58 to model 1, a weighting of 0.29 to model 2 and a weighting of 0.13 to model 3. Hence, the final time-series predictions may be an ensemble of the three models, weighted accordingly. In the ensemble of models, performing model decision averaging may involve combining the predictions of multiple models by calculating the average of their individual predictions.
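A minimal sketch of such weighted decision averaging is given below, using the weightings reported above; the function name and the dummy prediction arrays are assumptions made for illustration.

```python
import numpy as np

def weighted_ensemble(predictions, weights):
    """Weighted average of per-model predictions.

    predictions: list of equally shaped arrays, one per model.
    weights: one scalar weight per model (here found by grid search)."""
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()                            # normalise defensively
    return np.tensordot(w, np.stack(predictions), axes=1)

# Dummy per-model predictions: 10 samples, 2 future steps, 5 features.
preds = [np.random.rand(10, 2, 5) for _ in range(3)]
y_final = weighted_ensemble(preds, [0.58, 0.29, 0.13])
```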
In this example a system is proposed which leverages 5G spatiotemporal multivariate data with missing information by employing a novel approach. The system uses three shorter time-series sequences as input (with, in this example, step lengths of 5, 4 and 3 consecutive time steps, estimated via an LCCFS algorithm) and a forecast step length of 2 as the output. The proposed system consists of three shallow graph-type neural network architectures for learning patterns in the data, each focusing on a specific time-series sequence length. While the longer sequences capture the variation of patterns, they are fewer in number due to missing records. On the other hand, shorter sequences are more numerous and help to identify deviations or explore the depth of each feature. By incorporating an ensemble-based mechanism, the system combines insights from multiple architectures to make accurate predictions and provide forecasts for multiple variables and future time steps.
In this example the inventors trained three models using the three different-length time series determined using an LCCFS algorithm. The inventors used the minimum odd number of models for ensembling as a design choice, a trade-off between effective ensembling and minimal computational overhead. However, a different number of models may of course be trained using the time-series data; for example, one model, two models or more may be used.
The inventors found the forecasting method to be a multipurpose method, capable of complex spatiotemporal, multivariate and multi-step time-series forecasting of 5G traffic and resource requirements in both cases: (a) in the case of a large amount of missing data (e.g., 35% missing data) and (b) in a normal case (without any unsupervised data imputation strategy). The inventors found that an effect of the proposed training method and system was that the need for any supervised or unsupervised data imputation strategies was eliminated. Furthermore, the training method disclosed herein showed better accuracy with respect to traditional systems, and worked in the complex cases of both missing data and normal data (having no missing records).
The ensemble of models may be trained using, for example, a mean absolute error as the loss function. As described above, the complete system may comprise three identical models, built on a Bi-LSTM architecture, which take sequences of varying lengths as input. The final output may be obtained by averaging these models using a weighted ensemble.
In this example each of the models was separately trained using an input sequence length computed using the LCCFS algorithm. In the training phase, each model is independently trained and optimized using mean absolute error (MAE) as the loss function. The MAE loss function may be defined by:

$$\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n}\left|Y_i - \hat{Y}_i\right|$$
as before. Hence, during the training process of each model, the mean absolute error (MAE) may be determined by calculating the average absolute difference between the predicted model outputs (Ŷi) and the actual output values (Yi) for each of the i training examples across a total of n training samples. The computed MAE loss may be back-propagated and the weights of the model optimized through an ADAM-based optimizer based on stochastic gradient descent.
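Continuing the model sketch above, the compilation and training of each model might look as follows in Keras; the learning rate, epoch count, batch size and the dummy training arrays are assumed values for illustration only.

```python
import numpy as np
from tensorflow.keras.optimizers import Adam

# Dummy training windows matching the sketched model's inputs.
X_c0 = np.random.rand(256, 5, 6)                        # central cell C0
X_adj = [np.random.rand(256, 5, 5) for _ in range(8)]   # neighbours C1-C8
y_train = np.random.rand(256, 2, 5)                     # 2 steps, 5 features

# MAE loss back-propagated with an ADAM-based optimizer, as described.
model.compile(optimizer=Adam(learning_rate=1e-3), loss="mae")
model.fit([X_c0] + X_adj, y_train,
          epochs=50, batch_size=64,                     # assumed values
          validation_split=0.1)
```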
After the three models were optimized, a final weighted ensemble average may be computed as the final prediction of the system, where the weights of the ensemble may be computed through a grid search. In this example, a backpropagation method using an ADAM optimizer was used to train each model; however, any backpropagation method may be used. Furthermore, in this example each model was trained using backpropagation and then combined in an ensemble, but these steps may of course be switched, such that an ensemble is determined first and then backpropagated.
Hence, the proposed system and method may automatically handle the data insufficiency issues caused by irregular or missing records in spatiotemporal, multi-step, multivariate 5G forecasting for network resource partitioning and traffic forecasting. The system may address the missing-data issue by using an ensemble of three shallow neural network architectures. These networks take time-series sequences of longer and shorter lengths, capturing the variations in the time-series data. The method overcomes the need for data imputation techniques, which often introduce a significant amount of noise to the data.
In a step S920, the dataset may be pre-processed by selecting the longest time-series sequence lengths to use with the models. In this example, three models were trained on different sequence lengths; hence three sequence lengths were determined from the data. A Longest Common Continuous Frequent Sequence (LCCFS) algorithm was used to identify the optimum time-series step size.
Instead of focusing on very high step sizes (and imputing missing data, as in known teachings), the inventors focused on smaller time-step sizes. By doing so, patterns for two different aspects of the time-series data may be collected. For this, the inventors applied the following strategies to select the three optimal time-step sizes.
Considering, for example, a dataset with more than 30% missing data, a step may involve finding a number (representing the time-step size) which is near to 30% of the total time-interval scale and which captures the maximum number (more than 50%) of the data records. A detailed description of the LCCFS method is given in connection with
A further step may involve identifying two consecutive lower time-step sizes which capture more than a second threshold of complete data records; for example, the inventors used a 60% threshold of the records. In the Milan dataset, a division of time steps into 24 hours for one day was considered and a 1-hour scale was used. In the case of the Milan dataset, more than 30% of the data was missing.
Based on the above rule, the inventors identified 7 hours as the first time-series step size, which captures more than 50% of the total record count. Then, 6 hours and 5 hours were selected as the next two time-series step sizes (the next longest time-series step sizes which satisfied the above rules).
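The selection rule described above might be sketched as follows; this is only one possible interpretation, and the run-length computation, the coverage measure and the handling of the near-30%-of-daily-scale criterion are all assumptions, with the full LCCFS method being described in connection with the referenced figure.

```python
import numpy as np

def run_lengths(complete):
    """Lengths of maximal runs of consecutive complete time steps.

    complete: boolean sequence, True where all variables were recorded."""
    lengths, run = [], 0
    for ok in complete:
        if ok:
            run += 1
        else:
            if run:
                lengths.append(run)
            run = 0
    if run:
        lengths.append(run)
    return np.array(lengths)

def coverage(lengths, step):
    """Fraction of complete records captured by carving windows of
    `step` consecutive steps out of each run."""
    return ((lengths // step) * step).sum() / lengths.sum()

def select_step_sizes(complete, scale=24, first_thr=0.5, second_thr=0.6):
    """Largest step size covering >50% of records (around 30% of the
    daily scale in the Milan example), then the next two smaller sizes
    covering more than the second (60%) threshold."""
    lengths = run_lengths(np.asarray(complete, dtype=bool))
    first = max(s for s in range(2, scale + 1)
                if coverage(lengths, s) > first_thr)
    lower = [s for s in range(first - 1, 1, -1)
             if coverage(lengths, s) > second_thr][:2]
    return [first] + lower

# Illustrative hourly completeness flags over 62 days, ~35% missing.
flags = np.random.rand(24 * 62) > 0.35
print(select_step_sizes(flags))
```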
The inventors found the following benefits with the time-series step sizes selected using the above method: the higher time-step-length records capture the breadth of the time-series patterns, while the lower time-step-length records (which are more numerous) capture the depth of the time-series patterns.
In the step S920, further preprocessing of the data may occur. For example, based on the three identified time steps, the data may be organized into a format suitable for input into the forecasting model. In this step the dataset may be prepared so that each central cell C0, with its additional 6 features, and the 8 neighbouring cells C1-C8, with 5 features each, are prepared for all the sequence lengths, for example into a data feeding pipeline.
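As a simple illustration of such a data feeding pipeline, the prepared windows might be arranged into the per-cell inputs expected by the model sketched earlier; the function name and array shapes are assumptions.

```python
import numpy as np

def make_inputs(windows_c0, windows_adj):
    """Arrange prepared windows into per-cell model inputs.

    windows_c0: array of shape (n_samples, n_steps, 6), central cell C0.
    windows_adj: array of shape (n_samples, 8, n_steps, 5), cells C1-C8."""
    inputs = {"C0": windows_c0}
    for i in range(8):
        inputs[f"C{i + 1}"] = windows_adj[:, i]
    return inputs

# Example for the 5-step input sequence length.
inputs = make_inputs(np.random.rand(256, 5, 6), np.random.rand(256, 8, 5, 5))
```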
In a step S930, the pre-processed data may be input into each of a model 1, a model 2 and a model 3. Each of the models may be the same or substantially the same as the models described in relation to
For each of the three models, the training steps described in relation to
In a step S940, an ensemble of the three models may be taken, for example a weighted average ensemble of the system. The ensemble may be taken once the models have been trained; the models may be trained using, for example, backpropagation. The weights may be determined using a grid search system, although of course any method of determining weights for the ensemble may be used. In the example using the Milan dataset, the inventors found that a weighted ensemble with weightings of 0.58 for model 1, 0.29 for model 2 and 0.13 for model 3 produced optimum forecasts. If another dataset were used, the weightings may of course be different. Furthermore, the inventors trained three models and took an ensemble of the three models; however, the skilled person would understand that more or fewer than three models may be trained.
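A simple grid search over the ensemble weights might be sketched as follows, minimising the validation MAE over all weight triples summing to one; the grid resolution and the dummy validation arrays are assumptions.

```python
import itertools
import numpy as np

def grid_search_weights(val_preds, y_val, step=0.01):
    """Exhaustive search over w1 + w2 + w3 = 1 minimising validation MAE."""
    stacked = np.stack(val_preds)                 # shape: (3, ...)
    grid = np.arange(0.0, 1.0 + step, step)
    best_w, best_mae = None, np.inf
    for w1, w2 in itertools.product(grid, grid):
        w3 = 1.0 - w1 - w2
        if w3 < 0:
            continue
        blend = np.tensordot([w1, w2, w3], stacked, axes=1)
        mae = np.abs(blend - y_val).mean()
        if mae < best_mae:
            best_w, best_mae = (w1, w2, w3), mae
    return best_w, best_mae

# Dummy validation predictions from the three models.
y_val = np.random.rand(100, 2, 5)
val_preds = [y_val + 0.1 * np.random.randn(*y_val.shape) for _ in range(3)]
print(grid_search_weights(val_preds, y_val))
```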
A step S950 shows a trained neural network model (or system) comprising the three trained models with ensemble weights. Each of the three models, trained with different time-series step sizes, may be optimized using backpropagation techniques, such as an ADAM-based optimizer based on stochastic gradient descent. As described in connection with
In an input step S1010, a method for forecasting multivariate data in a forecast location in a network may comprise inputting a dataset from the forecast location and one or more adjacent locations, the dataset comprising spatio-temporal characteristics of each location and multivariate data recorded at each location.
The input step S1010 may further comprise providing a longest time-series sequence length of the dataset, the time-series sequence length indicating a total length of consecutive time steps with complete data in the dataset, an occurrence frequency of the longest time-series sequence length appearing in the dataset being higher than a threshold number of the dataset. As before, the inventors used the Milan dataset and partitioned the time-series data into 5 hours for training and 2 hours for testing for the 7-hour time series, 4 hours and 2 hours for the 6-hour time series, and 3 hours and 2 hours for the 5-hour time series.
An implementation step S1020 may comprise using a pretrained forecast location neural network based on the time-series sequence length to encode the multivariate data from the forecast location into a forecast location vector, and using, for each of the one or more adjacent locations, a pretrained adjacent location neural network based on the time-series sequence length to encode the multivariate data from each of the one or more adjacent locations into an adjacent location vector. Each of the pretrained models, model 1, model 2 and model 3, may use a different time-series sequence length. Although in this example three models are described, the skilled person would understand that more or fewer than three models may be used. That is, a single pretrained model consisting of a pretrained forecast location neural network and one or more pretrained adjacent location neural networks may be used, or two, four or five models, for example.
The implementation step S1020 may further comprise, for each of the models 1, 2 and 3, combining the one or more adjacent location vectors into a combined adjacent location vector and composing the forecast location vector and combined adjacent location vector into a final combined vector. The skilled person would understand that, in an example with one adjacent location vector, the combining step may involve transforming the one adjacent location vector into a final combined vector. For example, the combining step may comprise the steps of concatenating the one or more adjacent location vectors into a concatenated location vector, and inputting the concatenated location vector into a sequence of two multilayer perceptron layers, the output of a first multilayer perceptron layer being input into a second multilayer perceptron layer. Hence, with one adjacent location, the combining step may comprise inputting the adjacent location vector into the multilayer perceptron layers.
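Continuing the earlier Keras sketch, this concatenation followed by two multilayer perceptron layers might look as follows; the layer widths and activations are assumptions, and this combination is an alternative to the attention-based combination sketched earlier.

```python
from tensorflow.keras import layers

# adj_encoded: list of adjacent location vectors from the BiLSTM encoders;
# enc_c0: the forecast location vector (both from the earlier sketch).
concat = layers.Concatenate()(adj_encoded)           # concatenated location vector
mlp1 = layers.Dense(128, activation="relu")(concat)  # first MLP layer (width assumed)
combined_adj = layers.Dense(64, activation="relu")(mlp1)  # second MLP layer

# Compose the forecast location vector and the combined adjacent location
# vector into the final combined vector passed to the decoder.
final_vec = layers.Concatenate()([enc_c0, combined_adj])
```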
In a decoding step S1030 the final combined vector may be decoded to generate a forecast from the dataset for the forecast location. In an example of generating a forecast from multivariate data in a network, a forecast may be generated for each cell, node or location in that network. That is, each cell in the network may be assigned as the central cell and a forecast generated for that cell using the cell's neighbouring (or adjacent) cells. Hence, the method for forecasting multivariate data in a forecast location in a network may comprise forecasting multivariate values for given time steps for all grids (spatial units) in that network.
As described above, in relation to a single model, an MAE may also be calculated in the testing phase to determine the accuracy of the model. The inventors also compared the accuracy of the ensemble of neural networks for forecasting the multivariate data.
Table III shows a comparison of the MAE of the model proposed by the inventors against known models in the art. The known models are disclosed in Ref-1: Mohseni, M., Nikan, S., & Shami, A. (2022, September). AI-based traffic forecasting in 5G network. In 2022 IEEE Canadian Conference on Electrical and Computer Engineering (CCECE) (pp. 188-192). IEEE.
As shown in Table III, the training method and model proposed by the inventors had a much improved MAE (of 0.08984) compared with the methods known in the art. The inventors further compared their proposed model with spatiotemporal baselines, as shown in Table IV below. Across all of the call data records (internet, SMS and calls), the inventors found that the method disclosed herein generated a lower MAE than the methods known in the art, and therefore generated more accurate forecasts.
In both the proposed method and the methods known in the art, the Milan dataset was used. A total of 62 days of data was available. The data is divided into 1440 minutes of data per day for 10K grids (i.e., 10 thousand cells in a network), so the total size of the data is 62×1440×10000. The dataset contains a large amount of missing data and outliers; the data used by the inventors contained 35% missing data. Ref-1 cited above used the first 80% of the data for training and the next 20% of the data for testing; hence, the inventors used the same partitioning of data in this example.
A comparison between predicted internet usage at 3 am 1105 and predicted internet usage at 8 pm 1110 is shown in
The forecast generated using the methods described herein may be used in multiple application areas. Application areas for the potential use of the proposed robust forecasting model, and the sources of missing data in each such application, are described in detail below.
Predictive maintenance: In 5G networks, equipment and infrastructure require continuous monitoring to identify potential faults or failures. Multivariate time-series forecasts can be built to identify areas where the network service can potentially break down, resulting in complete network outage or failure. The forecast data in this case typically consist of information related to sensor readings, network performance metrics and other quality-of-service parameters. Here, missing data records can result from faulty sensors, intermittent connectivity or device malfunctions.
Network performance analysis: Modern networks generate vast amounts of multivariate time-series data related to network performance metrics such as signal strength, latency, throughput and quality of service. Forecasting models may be employed to identify network performance bottlenecks or may be utilized to effectively optimize the network infrastructure. Missing data can occur due to network congestion, signal interference or technical issues.
Resource management and optimization: Efficient resource management is essential in 5G networks to ensure optimal utilization of network resources, such as bandwidth allocation (including network slicing), frequency allocation and power control. The forecast data consist of various network service usage features such as internet usage, voice-call/text-messaging traffic, or data such as net power usage in each network grid. Multivariate time-series data with spatial information are collected to monitor resource usage and network capacity, and an efficient forecasting system may be utilized to plan such resource management. The reasons for missing records are manifold in this case, including network monitoring and measurement issues, sensor instrumentation failures, data transmission and communication errors, sampling and reporting processes, data preprocessing and filtering techniques, as well as privacy and security reasons (intentional masking of data). Thus, a forecast generated using the method disclosed herein may automatically bring more cells or nodes into action in a network during busy or high-demand periods, or turn cells or nodes off during quiet periods.
User experience management: Ensuring a high-quality user experience is a critical objective in modern networks such as 5G. Multivariate time-series forecasting can be used to avoid deterioration in user experience metrics such as data rate, signal coverage and call quality. User mobility, network handovers or temporary signal loss can result in missing data records.
Network security and anomaly detection: Communication networks face security threats and vulnerabilities that require continuous monitoring. Multivariate time-series data with spatial information can be analysed to detect network anomalies, abnormal traffic patterns or potential security breaches. Missing data can occur in such applications due to network attacks, packet loss or data filtering mechanisms.
As shown in
A user may enter their selection of the date, time and feature by clicking a select button 1220. The select button may instruct the model to generate a forecast for the entered features. In this example, calls incoming has been selected and a call incoming forecast 1225 has been generated. The GUI may also generate a key for the forecast. For example, the key may show the number of data records of calls incoming. In this example, the key ranges from 0 calls to over 600 calls for each forecast location. Each pixel in the forecast may represent a different forecast location and hence may be assigned a different colour shown in the key. The inventors used normalised values for each feature in the multivariate data to train the model(s). Hence, to obtain the absolute values shown in the key, the inventors assumed an average population density at each forecast location and multiplied the forecast value for each location by the average density. If absolute values were used to train the models, then there may not be a need to multiply by the average density.
Each of the pixels in the generated forecast may represent a forecast location. That is, the GUI for the forecasting system may allow a user to input a date and time for the forecast, and the entire forecast for, in this example, the city of Milan may then be displayed. A user may hover over or click on a pixel in the forecast and a pop-up information box 1235 may display forecast information for that pixel (or location). In this example, 5G network usage in Milan is shown; hence, each pixel represents the network coverage area of a single cell in the network. The pop-up box may present the user with the cell ID (shown as Cell 7408 in
The GUI may include additional buttons 1240 with which a user may interact. For example, from left to right, the additional buttons may include: a camera button which downloads the forecast image, for example as a PNG image; a magnifying button which zooms in on the forecast to a preset zoom level; a panning button which allows the user to pan around the image; a plus button which allows the user to zoom further in on the image; a minus button which allows the user to zoom out of the image; a scale button which auto-scales the forecast image; and a home button which restores the original forecast image. The GUI may also have a share button 1245 which allows a user to share the forecast with other users.
The computing device 1300 comprises a processor 1303 and memory 1304. Optionally, the computing device also includes a network interface 1307 for communication with other such computing devices, for example with other computing devices of invention embodiments. Optionally, the computing device also includes one or more input mechanisms such as keyboard and mouse 1306, and a display unit such as one or more monitors 1305. These elements may facilitate user interaction. The components are connectable to one another via a bus 1302.
The memory 1304 may include a computer readable medium, which term may refer to a single medium or multiple media (e.g., a centralized or distributed database and/or associated caches and servers) configured to carry computer-executable instructions. Computer-executable instructions may include, for example, instructions and data accessible by and causing a computer (e.g., one or more processors) to perform one or more functions or operations. For example, the computer-executable instructions may include those instructions for implementing a method disclosed herein, or any method steps disclosed herein, for example any of steps S10-S70. Thus, the term “computer-readable storage medium” may also include any medium that is capable of storing, encoding or carrying a set of instructions for execution by the machine and that cause the machine to perform any one or more of the method steps of the present disclosure. The term “computer-readable storage medium” may accordingly be taken to include, but not be limited to, solid-state memories, optical media and magnetic media. By way of example, and not limitation, such computer-readable media may include non-transitory computer-readable storage media, including Random Access Memory (RAM), Read-Only Memory (ROM), Electrically Erasable Programmable Read-Only Memory (EEPROM), Compact Disc Read-Only Memory (CD-ROM) or other optical disk storage, magnetic disk storage or other magnetic storage devices, flash memory devices (e.g., solid state memory devices).
The processor 1303 is configured to control the computing device and execute processing operations, for example executing computer program code stored in the memory 1304 to implement any of the method steps described herein. The memory 1304 stores data being read and written by the processor 1303 and may store at least one neural network model and/or at least one encoder and/or at least one decoder and/or other data, described above, and/or programs for executing any of the method steps described above. These entities may be in the form of code blocks which are called when required and executed in a processor.
As referred to herein, a processor may include one or more general-purpose processing devices such as a microprocessor, central processing unit, or the like. The processor may include a complex instruction set computing (CISC) microprocessor, reduced instruction set computing (RISC) microprocessor, very long instruction word (VLIW) microprocessor, or a processor implementing other instruction sets or processors implementing a combination of instruction sets. The processor may also include one or more special-purpose processing devices such as an application specific integrated circuit (ASIC), a field programmable gate array (FPGA), a digital signal processor (DSP), network processor, or the like. In one or more embodiments, a processor is configured to execute instructions for performing the operations discussed herein. The processor 1303 may be considered to comprise any of the units described above. Any operations described as being implemented by a unit may be implemented as a method by a computer, e.g. by the processor 1303.
For training the model described herein, the inventors used one Nvidia A30 GPU with 24 GB RAM. The processor used was an Intel Xeon Silver 4314 CPU with 256 GB RAM. Inferences were performed on the same system. The inventors also successfully tested the model on a CPU-based system with an Intel Core i5 processor. Hence, the inventors found that the recommended minimum hardware requirement for training/testing the system disclosed herein may be an Intel Core i5 processor with at least 32 GB RAM, preferably with a 16 GB GPU for accelerated training/testing.
The display unit 1305 may display a representation of data stored and/or generated by the computing device, such as a generated image and/or GUI windows (such as the GUI shown in
The network interface (network I/F) 1307 may be connected to a network, such as the Internet, and is connectable to other such computing devices via the network. The network I/F 1307 may control data input/output from/to other apparatus via the network. Other peripheral devices, such as a microphone, speakers, a printer, a power supply unit, a fan, a case, a scanner, a trackball, etc., may be included in the computing device.
Methods embodying the present invention may be carried out on a computing device/apparatus 1300 such as that illustrated in
A method embodying the present invention may be carried out by a plurality of computing devices operating in cooperation with one another. One or more of the plurality of computing devices may be a data storage server storing at least a portion of the data. For example, the neural network model(s) or forecasting model may be stored on a separate server from other units.
The invention may be implemented in digital electronic circuitry, or in computer hardware, firmware, software, or in combinations of them. The invention may be implemented as a computer program or computer program product, i.e., a computer program tangibly embodied in a non-transitory information carrier, e.g., in a machine-readable storage device, or in a propagated signal, for execution by, or to control the operation of, one or more hardware modules.
A computer program may be in the form of a stand-alone program, a computer program portion or more than one computer program and may be written in any form of programming language, including compiled or interpreted languages, and it may be deployed in any form, including as a stand-alone program or as a module, component, subroutine, or other unit suitable for use in a data processing environment. A computer program may be deployed to be executed on one module or on multiple modules at one site or distributed across multiple sites and interconnected by a communication network.
The system described herein was developed in the Python programming language with the Keras API (TensorFlow backend). Additional Python libraries, such as pandas, NumPy, seaborn and plotly, were also used for data handling and visualization. Of course, any suitable programming language may be used.
Method steps of the invention may be performed by one or more programmable processors executing a computer program to perform functions of the invention by operating on input data and generating output. Apparatus of the invention may be implemented as programmed hardware or as special purpose logic circuitry, including e.g., an FPGA (field programmable gate array) or an ASIC (application-specific integrated circuit).
Processors suitable for the execution of a computer program include, by way of example, both general and special purpose microprocessors, and any one or more processors of any kind of digital computer. Generally, a processor will receive instructions and data from a read-only memory or a random-access memory or both. The essential elements of a computer are a processor for executing instructions coupled to one or more memory devices for storing instructions and data.
The above-described embodiments of the present invention may advantageously be used independently of any other of the embodiments or in any feasible combination with one or more others of the embodiments.
For the avoidance of doubt, the invention relates to the following numbered clauses.
The invention is described in terms of particular embodiments. Other embodiments are within the scope of the following claims. For example, the steps of the invention may be performed in a different order and still achieve desirable results.
The skilled person will appreciate that except where mutually exclusive, a feature described in relation to any one of the above aspects may be applied mutatis mutandis to any other aspect. Furthermore, except where mutually exclusive, any feature described herein may be applied to any aspect and/or combined with any other feature described herein.