This non-provisional application claims priority under 35 U.S.C. § 119(a) on Patent Application No. 202110282461.2 filed in China on Mar. 16, 2021, the entire contents of which are hereby incorporated by reference.
This disclosure relates to a hyper-parameter configuration method of a time series forecasting model based on machine learning.
Artificial Intelligence (AI) has become a crucial part of our daily life. AI emulates human capabilities in understanding, reasoning, planning, communication, and perception. Although AI is a powerful technology, developing AI models is no trivial matter, since there is often a reality gap between the development and deployment stages. Failure to bridge this reality gap yields false insights that cascade errors and escalate unwanted risks. Therefore, it is critical to ensure the model's performance.
An AI model's performance is often measured or evaluated in terms of accuracy, so it is natural for AI modelers to optimize this objective. To do so, AI modelers perform hyper-parameter tuning to achieve the best accuracy. During the development stage, hyper-parameter tuning is performed on the training and validation sets. However, the AI model tuned with this set of hyper-parameters could still fail on the test set during the deployment stage. That is, there is a performance gap, often measured in accuracy, between the development and the deployment stages.
One of AI's numerous applications is producing forecasts for multiple time-series data with a forecasting model. A time series is a sequence of quantities describing the changes of a certain phenomenon, arranged in time order. The development trend of the phenomenon may be deduced from the time series, so the direction and magnitude of its development may be predicted. For example, one may use a forecasting model to forecast the daily temperatures of multiple cities, or use another forecasting model to forecast the customer demands of multiple products.
In order to predict multiple time-series, one can resort to a separate forecasting model, which can be a neural network model, for each of the time-series. However, given a large amount of time-series data to be predicted, this approach may not be feasible because of the computational complexity and memory required by such a large number of forecasting models.
If a single forecasting model is adopted instead, that single forecasting model must take all of these multiple time-series data into consideration. When all of these time-series data are used to train the forecasting model, the model may overfit the training data.
When a conventional single time-series forecasting model is applied to multiple time-series, the performance gap between the development and the deployment stages generally comes from two sources. Firstly, the model fails to generalize to different time frames. Secondly, the model, that is trained on one set of time-series data, fails to generalize to a different set of time-series. In other words, the conventional forecasting model cannot handle either unknown time frames, or unknown products.
According to one or more embodiments of the present disclosure, a hyper-parameter configuration method of a time-series forecasting model comprises: storing N datasets respectively corresponding to N products by a storage device, wherein each of the datasets is a time-series; determining a forecasting model; and performing a hyper-parameter searching procedure by a processor, wherein the hyper-parameter searching procedure comprises: generating M sets of hyper-parameters for the forecasting model by the processor; applying each of the M sets of hyper-parameters to the forecasting model by the processor; training the forecasting model applied with each of the M sets of hyper-parameters according to a first strategy and a second strategy respectively by the processor, wherein the first strategy and the second strategy respectively comprise performing a selection of a part of the N datasets as a training dataset according to two different data dimensions; validating the forecasting model applied with each of the M sets of hyper-parameters according to the first strategy and the second strategy to generate two error arrays by the processor, wherein the first strategy and the second strategy respectively comprise performing another selection of another part of the N datasets as a validation dataset according to the two different data dimensions, and each of the two error arrays has M error values; performing a weighting computation or a sorting operation according to a first weight, a second weight and the two error arrays by the processor; determining a target set of hyper-parameters according to the two error arrays by the processor, wherein the target set of hyper-parameters is one of the M sets of hyper-parameters, and the two error values corresponding to the target set of hyper-parameters in the two error arrays are two relative minimum values in the two error arrays; outputting the target set of hyper-parameters by the processor when the target set of
hyper-parameters is determined; and increasing a value of M and performing the hyper-parameter searching procedure by the processor when the target set of hyper-parameters cannot be determined.
The present disclosure will become more fully understood from the detailed description given hereinbelow and the accompanying drawings which are given by way of illustration only and thus are not limitative of the present disclosure and wherein:
In the following detailed description, for purposes of explanation, numerous specific details are set forth in order to provide a thorough understanding of the disclosed embodiments. It will be apparent, however, that one or more embodiments may be practiced without these specific details. In other instances, well-known structures and devices are schematically shown in order to simplify the drawings.
We use an example to illustrate a situation to which the present disclosure is adapted: consider the task of developing an accurate forecasting model that aims to predict the sales of the next 12 months for ten products. To do so successfully, the forecasting model needs to capture the temporal sales pattern within each product and the sales dynamics across products. A good forecasting model comprises a good set of hyper-parameters.
In step S1, a storage device stores N datasets respectively corresponding to N products, wherein each of the datasets is a time-series of a product. For example, the time-series is the monthly sales of the product over the past three years.
In step S2, a forecasting model is determined. In an embodiment of the present disclosure, the forecasting model is a long short-term memory (LSTM) model. LSTM is a variant of the recurrent neural network (RNN). LSTM scales with large data volumes and can take multiple variables as input, which helps the forecasting model solve the logistics problem. LSTM can also model long-term and short-term dependencies owing to its forget and update mechanism. An embodiment of the present disclosure adopts LSTM as the time-series forecasting model.
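For illustration only, the forget and update mechanism referred to above may be sketched as a single LSTM cell step in NumPy. The function name `lstm_cell_step` and the stacked weight layout are assumptions of this sketch, not elements of the claimed method.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_cell_step(x, h_prev, c_prev, W, U, b):
    """One LSTM time step: the forget gate (f) decides what to discard from
    the cell state and the input gate (i) decides what to add; this is the
    mechanism that lets LSTM model long- and short-term dependencies."""
    H = h_prev.shape[0]
    z = W @ x + U @ h_prev + b          # stacked pre-activations, shape (4H,)
    i = sigmoid(z[0:H])                 # input (update) gate
    f = sigmoid(z[H:2*H])               # forget gate
    o = sigmoid(z[2*H:3*H])             # output gate
    g = np.tanh(z[3*H:4*H])             # candidate cell state
    c = f * c_prev + i * g              # new cell state
    h = o * np.tanh(c)                  # new hidden state
    return h, c
```

In practice a framework implementation of LSTM would be used; the sketch only makes the gating arithmetic explicit.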
Steps S3-S6 describe a flow for the processor to find a set of hyper-parameters suitable for the forecasting model of step S2.
In step S3, the processor performs a hyper-parameter searching procedure. In step S4, the processor determines whether a target set of hyper-parameters is found in step S3. If the determination result of step S4 is positive, step S5 is then performed to output the target set of hyper-parameters. On the other hand, if the determination result of step S4 is negative, step S6 is performed to increase a range for searching the hyper-parameters, and then step S3 is performed again.
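For illustration only, the flow of steps S3-S6 may be sketched as the following Python loop. The callable `search_procedure`, the growth factor, and the round limit are assumptions of this sketch rather than limitations of the disclosure.

```python
def hyper_parameter_search(search_procedure, m_initial=1000, growth=2, max_rounds=5):
    """Outer loop of steps S3-S6: run the searching procedure (step S3); if a
    target set is found (step S4 positive), output it (step S5); otherwise
    enlarge the search by increasing M (step S6) and try again."""
    m = m_initial
    for _ in range(max_rounds):
        target = search_procedure(m)    # step S3: returns a target set or None
        if target is not None:          # step S4: target found?
            return target               # step S5: output the target set
        m *= growth                     # step S6: widen the search range
    return None                         # give up after max_rounds attempts
```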
In step S31, the processor generates M sets of hyper-parameters corresponding to the forecasting model, where M is a relatively large number such as 1000. In practice, the processor generates the M sets of hyper-parameters randomly. Each set of hyper-parameters comprises a plurality of hyper-parameters. For example, hyper-parameters adopted by LSTM comprise a dropout rate of the hidden layer's neurons, a kernel size, and a number of layers of the multilayer perceptron (MLP); and hyper-parameters adopted by the light gradient boosting machine (LightGBM) comprise a number of leaves and a tree depth.
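For illustration only, step S31 may be sketched as follows; the specific value ranges and key names are assumptions of this sketch, not values prescribed by the disclosure.

```python
import random

def generate_hyper_parameter_sets(m, seed=0):
    """Step S31: randomly draw M sets of hyper-parameters.  Each set here
    holds example LSTM hyper-parameters named in the text; the ranges are
    illustrative assumptions."""
    rng = random.Random(seed)
    return [
        {
            "dropout_rate": round(rng.uniform(0.0, 0.5), 2),  # hidden-layer dropout
            "kernel_size": rng.choice([1, 3, 5, 7]),          # convolutional kernel size
            "mlp_layers": rng.randint(1, 4),                  # number of MLP layers
        }
        for _ in range(m)
    ]
```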
In step S32, the processor applies each of the M sets of hyper-parameters to the forecasting model. Therefore, M forecasting models are generated in step S32, each configured with a different set of hyper-parameters.
In step S33 and step S34, the processor trains the forecasting model applied with each of the M sets of hyper-parameters according to a first strategy and a second strategy respectively. In step S35 and step S36, the processor validates the forecasting model applied with each of the M sets of hyper-parameters according to the first strategy and the second strategy to generate two error arrays. Specifically, the first strategy and the second strategy respectively comprise performing a selection of a part of the N datasets as a training dataset according to two different data dimensions. The first strategy and the second strategy respectively comprise performing another selection of another part of the N datasets as a validation dataset according to the two different data dimensions. The two different data dimensions comprise a data dimension of time-series and a data dimension of product.
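For illustration only, the two selection strategies may be sketched as follows: the first function holds out a time segment of every series (the data dimension of time-series), and the second holds out whole products (the data dimension of product). The function names and the list-of-lists data representation are assumptions of this sketch.

```python
def split_by_time(datasets, fold, n_folds):
    """First strategy: every product appears in training, but a distinct time
    segment of each series is held out for validation, testing generalization
    to unknown time frames."""
    train, valid = [], []
    for series in datasets:
        fold_len = len(series) // n_folds
        start, end = fold * fold_len, (fold + 1) * fold_len
        valid.append(series[start:end])          # held-out time segment
        train.append(series[:start] + series[end:])
    return train, valid

def split_by_product(datasets, fold, n_folds):
    """Second strategy: whole products are held out for validation, testing
    generalization to unknown products."""
    fold_size = len(datasets) // n_folds
    start, end = fold * fold_size, (fold + 1) * fold_size
    valid = datasets[start:end]                  # held-out products
    train = datasets[:start] + datasets[end:]
    return train, valid
```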
The second strategy, specially proposed by the present disclosure, considers the data dimension of the product, shown as the vertical axis in the corresponding figure.
When the forecasting model well trained in step S33 and step S34 performs the N-fold cross-validation, the forecasting model generates an error (loss) in every fold. The error is the difference between the predicted value outputted by the forecasting model and the actual value in the validation data set. In step S35 and step S36, the errors of all N folds are summed to obtain a total error (hereinafter referred to as “error value”). Therefore, the M error values may be obtained by performing validation with the first strategy on M forecasting models, wherein the M error values form an error array; and M error values may be obtained by performing validation with the second strategy on M forecasting models, wherein the M error values form another error array. In short, two error arrays are obtained in step S35 and step S36, each of the two error arrays has M error values.
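For illustration only, the summation of per-fold errors into an error value may be sketched as follows. Here `model_error` stands in for training the configured model and measuring its validation loss, and `split_fn` stands in for either selection strategy; both names are assumptions of this sketch.

```python
def cross_validation_error(model_error, datasets, split_fn, n_folds):
    """Steps S35/S36: an error value is the sum over all N folds of the
    validation errors of one configured forecasting model.  `split_fn`
    selects the training and validation parts for a given fold, and
    `model_error(train, valid)` returns the loss measured on `valid`."""
    return sum(model_error(*split_fn(datasets, fold, n_folds))
               for fold in range(n_folds))

def build_error_array(model_errors, datasets, split_fn, n_folds):
    """Collect the M error values of the M configured models (one entry per
    set of hyper-parameters) into the error array for one strategy."""
    return [cross_validation_error(me, datasets, split_fn, n_folds)
            for me in model_errors]
```

Running this once per strategy yields the two error arrays of M error values each.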
Please refer to step S37 in the corresponding figure.
In step S41, the processor applies the first weight to each of the M error values of the error array corresponding to the first strategy. In step S42, the processor applies the second weight to each of the M error values of the error array corresponding to the second strategy. In step S43, the processor computes a plurality of sums of the two error values corresponding to each other in the two error arrays.
For better understanding, the present disclosure assumes the error array corresponding to the first strategy is [e11, e21, e31, …, eM1] and the error array corresponding to the second strategy is [e12, e22, e32, …, eM2], wherein eiP represents the ith error value of the Pth strategy.
The present disclosure assumes the first weight is ω1 and the second weight is ω2. After performing the process of steps S41-S43, the present disclosure generates a new array [E1, E2, E3, …, EM], which comprises M weighted error values, with Ei = ω1·ei1 + ω2·ei2.
Through adjustment of the first weight and the second weight, the present disclosure may shift the forecasting model's focus between temporal prediction accuracy and prediction accuracy for unknown products.
In step S44, the processor sorts the plurality of sums in ascending order. Specifically, the processor arranges the values E1, E2, E3, …, EM from small to large. In step S45, the processor selects the set of hyper-parameters corresponding to a minimum value of the plurality of sums as the target set of hyper-parameters. Specifically, the weighted error value Etarget of the target set of hyper-parameters satisfies Etarget ≤ Ei for every i ∈ {1, 2, 3, …, M}. After the sorting operation of step S44, Etarget is the first element of the sorted array.
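For illustration only, the weighting computation of steps S41-S45 may be sketched as follows; the function name `select_by_weighting` and the plain-list representation are assumptions of this sketch.

```python
def select_by_weighting(errors_s1, errors_s2, w1, w2):
    """Steps S41-S45: apply the first weight to the first error array and the
    second weight to the second (S41/S42), sum element-wise into the weighted
    errors Ei = w1*ei1 + w2*ei2 (S43), sort in ascending order (S44), and
    return the index of the hyper-parameter set with the minimum sum (S45)."""
    weighted = [w1 * e1 + w2 * e2 for e1, e2 in zip(errors_s1, errors_s2)]
    order = sorted(range(len(weighted)), key=weighted.__getitem__)  # step S44
    return order[0]                                                 # step S45
```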
In step S51, the processor sorts the M error values in the error array corresponding to the first strategy in ascending order. In step S52, the processor sorts the M error values in the error array corresponding to the second strategy in ascending order. In step S53, the processor traverses the two error arrays from the minimal index and checks the two error values corresponding to the same index in the two error arrays. In step S54, the processor determines whether both of the two error values correspond to an identical one of the M sets of hyper-parameters.
If the determination result of step S54 is positive, step S55 is then performed to determine said one of the M sets of hyper-parameters as the target set of hyper-parameters. In other words, when both of the two error values correspond to the same one of the M sets of hyper-parameters, said one of the M sets of hyper-parameters serves as the target set of hyper-parameters. At this time, the determination result of step S4 is positive.
If the determination result of step S54 is negative, step S56 is then performed. In step S56, the processor increases the array's index.
For better understanding, the following uses practical values to illustrate the process of steps S51-S56, assuming the two error arrays corresponding to the first strategy and the second strategy are as shown in Table 1.
When the processor finishes step S51 and step S52, the result is as shown in Table 2.
Please refer to the example shown in Table 2. In step S53, the minimal index of the error arrays is "1", so the processor first checks the two error values "2" and "4" corresponding to the index "1". The error value "2" corresponds to the 9th set of hyper-parameters, and the error value "4" corresponds to the 1st set of hyper-parameters.
In step S54, these two error values "2" and "4" do not correspond to the same set of hyper-parameters (9 ≠ 1); therefore, step S56 is performed next to increase the array index from "1" to "2", and the process then returns to step S53. This loop is performed repeatedly until the index reaches "7", where both error values "69" and "54" correspond to the 8th set of hyper-parameters; therefore, step S55 is then performed and the target set of hyper-parameters is set to the 8th set of hyper-parameters.
During the loop of steps S54 and S56, it is possible that the processor traverses all the indices of the arrays without finding an index at which the two error values correspond to the same set of hyper-parameters. At this time, the determination result of step S4 is negative.
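For illustration only, the sorting operation of steps S51-S56 may be sketched as follows; the function name `select_by_sorting` and returning `None` for the not-found case are assumptions of this sketch.

```python
def select_by_sorting(errors_s1, errors_s2):
    """Steps S51-S56: sort both error arrays in ascending order while keeping
    track of which hyper-parameter set produced each error value (S51/S52),
    then walk both rankings index by index (S53/S56); the first index at
    which both rankings name the same set yields the target set (S54/S55).
    None is returned when no such index exists, i.e. step S4 is negative."""
    rank1 = sorted(range(len(errors_s1)), key=errors_s1.__getitem__)  # step S51
    rank2 = sorted(range(len(errors_s2)), key=errors_s2.__getitem__)  # step S52
    for idx in range(len(rank1)):                                     # steps S53/S56
        if rank1[idx] == rank2[idx]:                                  # step S54
            return rank1[idx]                                         # step S55
    return None                                                       # no target found
```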
To produce a single time-series forecasting model that takes all time-series data into consideration without overfitting, the present disclosure proposes a hyper-parameter configuration method of a time-series forecasting model based on machine learning. A good forecasting model comprises a good set of hyper-parameters, and the hyper-parameter searching procedure proposed in the present disclosure employs two complementary cross-validation strategies to generate such a set. The present disclosure builds the hyper-parameter configuration method on top of existing cross-validation techniques with generalization as the core concern. For this purpose, the present disclosure applies appropriate cross-validation techniques on in-class and out-class data points simultaneously to ensure the AI model generalizes well in both in-class and out-class cases.
In view of the above description, the proposed hyper-parameter configuration method of a time-series forecasting model is applicable to any machine-learning based time-series forecasting model. The present disclosure captures the temporal sales pattern within each product and captures the dynamics across products.
Number | Date | Country | Kind |
---|---|---|---|
202110282461.2 | Mar 2021 | CN | national |