This document relates generally to computer-implemented statistical analysis techniques and more particularly to generating forecasts.
Time series data are time-stamped data collected over time. Some examples of time series data are
As can be seen, the frequency associated with the time series varies with the problem at hand. The frequency or time interval may be hourly, daily, weekly, monthly, quarterly, yearly, or many other variants of the basic time intervals.
Associated with a time series could be a seasonal cycle (seasonality) or a business cycle. For example, the length of seasonality for a monthly time series is usually assumed to be twelve because there are twelve months in a year. Likewise, the seasonality of a daily time series is usually assumed to be seven. The usual seasonality assumption may not always hold, however. For example, if a particular business's seasonal cycle is fourteen days long, the seasonality is fourteen, not seven. Seasonality considerations constitute just some of the difficulties confronting analysis of a time series. These difficulties grow significantly if many time series have to be analyzed.
In accordance with the teachings provided herein, systems and methods for operation upon data processing devices are provided in order to overcome one or more of the aforementioned disadvantages or other disadvantages concerning time series analysis. For example, a computer-implemented system and method can be configured to provide a forecast using time series data that is indicative of a data generation activity occurring over a period of time. Candidate models and candidate input variables are received. For each candidate model, transfer functions are determined for the candidate input variables in order to relate a variable to be forecasted to the time series data. For each candidate model, a selection is then made of which candidate input variables to include, based upon the determined transfer functions. A model is selected from the candidate models to forecast the time series data using the selected input variables of the selected model.
A time series model 38 is applied to the time series data 34 in order to generate a fitted model 40. A time series model 38 describes the data generating process 36. Assuming that a particular data generating process 36 produced a time series 34, a time series model 38 can be selected that approximates this data generating process 36. Applying the statistical features associated with this model 38 generates forecasts 32 for future time series values. A time series model 38 is not dependent on any specific time series data.
A fitted model 40 results from applying a time series model 38 to specific time series data (e.g., data 34). Given a time series 34 and a time series model 38, model parameter estimates can be optimized to fit the time series data. The fitted model 40 is used to forecast the time series 34.
A fitted model 40 can be used to generate time series components such as seasonal components, trend components, etc. These components help explain the time series data 34 from different vantage points, such as to help explain seasonality aspects and/or trend aspects that might be present in the time series data 34. Such explanations improve the forecasting capability.
As depicted in
In addition to selection of one or more input variables from a pool of input variable candidates 54,
Based upon the time series data 50 and the variable to be forecast 52, a model analysis process 80 generates one or more models 84 having their own selected input variables as determined by input variable selection process 60. Based upon model selection criteria 92, a model selection process 90 selects a model 94 from the pool 84 for use in forecasting or other data model analysis.
The model analysis process 80 can perform outlier analysis 86 with respect to each of the input model candidates 82. For a detected outlier, dummy regressors can be created for use in forecasting the time series data. Examples of detected outliers include additive outliers, level shift outliers and combinations thereof.
The input model candidates can be drawn from different (e.g., rich) families of models, such as ARIMA, UCM, and other families of models. A model selection list can be used to specify a list of candidate model specifications and how to choose which model specification is best suited to forecast a particular time series. Different techniques can be utilized in determining how to select a model. As an illustration, the model selection techniques discussed in the Forecasting Provisional Application can be used.
Models in the list can be associated with components that are not only useful for forecasting but also for describing how the time series evolves over time. The forecasting model decomposes the series into its various components. For example, the local trend component describes the trend (up or down) at each point in time, and the final trend component describes the expected future trend. These forecasting models can also indicate departures from previous behavior or can be used to cluster time series.
The parameter estimates (weights or component variances) describe how fast the component is changing with time. Weights or component variances near zero indicate a relative constant component; weights near one or large component variances indicate a relatively variable component. For example, a seasonal weight near zero or a component variance near zero represents a stable seasonal component; a seasonal weight near one or a large component variance represents an unstable seasonal component. Parameter estimates should be optimized for each time series for best results.
Examples of models include: local level models, local trend models, local seasonal models, local models, ARIMA models, causal models, transformed models, intermittent demand models, external and user-defined models, etc.
The local level models are used to forecast time series whose level (or mean) component varies with time. These models predict the local level for future periods.
Examples of local level models are Simple Exponential Smoothing and the Local Level Unobserved Component Model. These models have one parameter (level), which describes how the local level evolves. The forecasts for the future periods are simply the final local level (a constant).
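As an illustrative sketch (standard notation, not a reproduction of any particular listing in this document), simple exponential smoothing maintains a single smoothed level that is updated with each new observation, and every future-period forecast is that final level:

\ell_t = \alpha\, y_t + (1 - \alpha)\, \ell_{t-1}, \qquad 0 < \alpha < 1, \qquad \hat{y}_{t+h \mid t} = \ell_t, \quad h = 1, 2, \ldots

Here y_t is the observed series, \ell_t is the local level, and \alpha is the level weight (the single parameter). A weight near zero yields a nearly constant level, while a weight near one tracks the most recent observations closely, consistent with the parameter discussion above.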
Local trend models are used to forecast time series whose level or trend/slope components vary with time. These models predict the local level and trend for future periods.
Examples of local trend models are Double (Brown), Linear (Holt), Damped-Trend Exponential Smoothing, and Local Trend Unobserved Component Model. The double model has one parameter (level/trend weight), the linear model has two parameters (level and trend), and the damped-trend model has three parameters (level, trend, and damping weights). The damping weight dampens the trend over time. The forecasts for the future periods are a combination of the final local level and the final local trend.
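A sketch of the damped-trend form in standard notation is given below; Holt's linear method is the special case with damping weight \phi = 1, and this is illustrative rather than a reproduction of the procedure's internals:

\ell_t = \alpha\, y_t + (1 - \alpha)(\ell_{t-1} + \phi\, b_{t-1})
b_t = \beta\,(\ell_t - \ell_{t-1}) + (1 - \beta)\, \phi\, b_{t-1}
\hat{y}_{t+h \mid t} = \ell_t + (\phi + \phi^2 + \cdots + \phi^h)\, b_t

where \alpha, \beta, and \phi are the level, trend, and damping weights. With \phi < 1 the projected trend contribution levels off as the horizon h grows, which is the dampening behavior described above.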
Local seasonal models are used to forecast time series whose level or seasonal components vary with time. These models predict the local level and season for future periods.
Examples of local seasonal models are Seasonal Exponential Smoothing and the Local Seasonal Unobserved Component Model. The seasonal model has two parameters (level and seasonal). The forecasts for the future periods are a combination of the final local level and the final local season.
The local models are used to forecast time series whose level, trend, or seasonal components vary with time. These models predict the local level, trend, and seasonal component for future periods.
Examples of local models are the Winters Method (additive or multiplicative) and the Basic Structural Model. These models have three parameters (level, trend, and seasonal). The forecasts for the future periods are a combination of the final local level, the final local trend, and final local season.
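For the additive case, a standard sketch of the recursions (illustrative only) is:

\ell_t = \alpha\,(y_t - s_{t-m}) + (1 - \alpha)(\ell_{t-1} + b_{t-1})
b_t = \beta\,(\ell_t - \ell_{t-1}) + (1 - \beta)\, b_{t-1}
s_t = \gamma\,(y_t - \ell_t) + (1 - \gamma)\, s_{t-m}
\hat{y}_{t+h \mid t} = \ell_t + h\, b_t + s_{t+h-m}

where m is the length of the seasonal cycle and \alpha, \beta, and \gamma are the level, trend, and seasonal weights. The multiplicative Winters Method replaces the additive seasonal adjustments with ratios.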
The Autoregressive Integrated Moving Average Models (ARIMA) are used to forecast time series whose level, trend, or seasonal properties vary with time. These models predict the future values of the time series by applying non-seasonal or seasonal polynomial filters to the disturbances. Using different types of polynomial filters permits the modeling of various properties of the time series.
Examples of ARIMA models are the Exponentially Weighted Moving Average (EWMA), moving average processes (MA), integrated moving average processes (IMA), autoregressive processes (AR), integrated autoregressive processes (IAR), and autoregressive moving average processes (ARMA).
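In standard notation (a general sketch, not a listing from this document), a seasonal ARIMA model applies autoregressive, differencing, and moving-average polynomial filters to the disturbances a_t:

\phi(B)\,\Phi(B^{s})\,(1 - B)^{d}\,(1 - B^{s})^{D}\, y_t = \mu + \theta(B)\,\Theta(B^{s})\, a_t

where B is the backshift operator, s is the seasonal period, d and D are the simple and seasonal differencing orders, \phi and \Phi are the simple and seasonal autoregressive polynomials, and \theta and \Theta are the simple and seasonal moving-average polynomials. The special cases listed above correspond to particular choices of these polynomials; for example, an IMA(1,1) model yields the exponentially weighted moving average forecast.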
Causal time series models are used to forecast time series data that are influenced by causal factors. Input variables (regressor or predictor variables) and calendar events (indicator, dummy, or intervention variables) are examples of causal factors. These independent (exogenous) time series causally influence the dependent (response, endogenous) time series and, therefore, can aid the forecasting of the dependent time series.
Examples of causal time series models are Autoregressive Integrated Moving Average with exogenous inputs (ARIMAX), which are also known as transfer function models or dynamic regression models, and Unobserved Component Models (UCM), which are also known as state-space models and structural time series models. These models may be formulated as follows:
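As an illustrative sketch in standard notation, an ARIMAX (transfer function) model and a UCM can be written as:

\text{ARIMAX:}\quad y_t = \mu + \sum_{i} \frac{\omega_i(B)}{\delta_i(B)}\, B^{b_i}\, x_{i,t} + \frac{\theta(B)}{\phi(B)}\, a_t
\text{UCM:}\quad y_t = \mu_t + \gamma_t + \psi_t + \sum_{i} \beta_i\, x_{i,t} + \varepsilon_t

In the ARIMAX form, each input x_{i,t} enters through a rational transfer function with delay b_i, and the disturbance follows an ARMA process driven by white noise a_t. In the UCM form, \mu_t is the level/trend component, \gamma_t the seasonal component, \psi_t a cycle, and \varepsilon_t the irregular component. Differencing and functional transformations may be applied to y_t and the inputs before these forms are fitted.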
These regression models are dynamic in that they take into account the autocorrelation between observations recorded at different times. Dynamic regression includes and extends multiple linear regression (static regression).
Input variables can be continuous-valued time series. They represent causal factors that influence the dependent time series throughout the time range. Examples of input variables are prices, temperatures, and other economic or natural factors. Input variables are contained in the time series data set.
Calendar events can be represented by indicator variables that are typically discrete-valued. They indicate when the causal factor influences the dependent time series. Typically, zero values indicate the absence of the event and nonzero values indicate the presence of the event. These dummy regressors can consist of pulses (points), steps (shifts), ramps, and temporary changes and combinations of these primitive shapes. The values of the indicator variable depend on the time interval. For example, if the calendar event is New Year's Day and the time interval is monthly, a pulse indicator variable will be nonzero for each January and zero otherwise.
In addition to the causal factors, the causal model can contain components described in preceding sections: local level, local trend, and local seasonal. Causal models decompose the time series into causal factors and the local components. This decomposition is useful for demand analysis (promotional analysis and intervention analysis).
With the exception of the Winters Method Multiplicative Model, the preceding forecasting models are linear; that is, the components must be added together to re-create the series. Since time series are not always linear with respect to these components, transformed versions of the preceding forecasting models must be considered when using automatic forecasting. Some useful time series transformations are the log, square root, logistic, and Box-Cox transformations.
For example, suppose the underlying process that generated the series has one of the following nonlinear forms:
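Two illustrative possibilities (used here only as examples of nonlinear forms) are a multiplicative component structure and an exponential trend, both of which become linear after a log transformation:

y_t = \mu_t \cdot \gamma_t \cdot \varepsilon_t \;\Longrightarrow\; \log y_t = \log \mu_t + \log \gamma_t + \log \varepsilon_t
y_t = \exp(a + b\,t)\,\varepsilon_t \;\Longrightarrow\; \log y_t = a + b\,t + \log \varepsilon_t

In such cases a linear forecasting model is fitted to the transformed series, and the resulting forecasts are inverse-transformed back to the original scale.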
Intermittent demand models (IDM) or interrupted time series models are used to forecast intermittent time series data. Since intermittent series are mostly constant valued (usually zero) except on relatively few occasions, it is often easier to predict when the series departs from this constant value, and by how much, than to predict the next value directly. An example of an intermittent demand model is Croston's Method.
Intermittent demand models decompose the time series into two parts: the interval series and the size series. The interval series measures the number of time periods between departures. The size series measures the magnitude of the departures. After this decomposition, each part is modeled and forecast independently. The interval forecast predicts when the next departure will occur. The size forecast predicts the magnitude of the next departure. After the interval and size predictions are computed, they are combined (predicted magnitude divided by predicted number of periods for the next departure) to produce a forecast for the average departure from the constant value for the next time period.
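In symbols, if \hat{q} denotes the forecast of the demand size and \hat{a} the forecast of the demand interval (each typically obtained by smoothing its own series), the average departure per period is forecast as:

\hat{y} = \hat{q} / \hat{a}

which can then be combined with the constant (base) value of the series to forecast the next period.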
In addition to the previously described general families (e.g., classes) of Exponential Smoothing Models (ESM), Unobserved Component Models (UCM), Autoregressive Integrated Moving Average Models (ARIMA), and Intermittent Demand Models (IDM), external models and user-defined models can also be permitted.
External models are used for forecasts that are provided external to the system. These external forecasts may have originated from an external statistical model from another software package, may have been provided by an outside organization (e.g., marketing organization, government agency) or may be based on judgment. External models allow for the evaluation of external forecasts and for tests for unbiasedness.
User-defined models are external models that are implemented with the SAS programming language or the C programming language by the user of HPF software. (HPF is described in the Forecasting Provisional Application.) For these models, users of HPF create their own computational algorithm to generate the forecasts. They are considered external models because they were not implemented in HPF.
With such models and through use of an appropriate forecast function, a decision-making process can generate forecasts (forecast scores) based on future causal factor values with little analytical and computational effort. Due to the iterative nature of decision-making processes, forecast functions make large-scale decision-making processes more tractable. The model specification and forecast function can be stored for use by decision-making processes.
The models and their input variables may be selected through many different techniques. For example as shown in
As part of the transfer function identification, numerator-denominator processing 102 and cross-correlation determination processing 110 are performed. Numerator and denominator polynomial orders are determined at 102 for each functionally transformed regressor. This determination can be made by comparing the patterns at 104 that result from processes 106 and 108. Process 106 fits a regression with a high-order distributed lag, and process 108 fits a transfer function using possible pairs of numerator and denominator orders.
The cross-correlation determination processing 110 includes selection of the candidate input variables based upon computing cross-correlations between the residuals related to the inputs 114 and the residuals related to the forecast variable 116. The input residuals 114 are determined by estimating residuals resulting from determining a model for a candidate input variable, and the forecast variable residuals 116 are determined by estimating residuals resulting from prewhitening the variable to be forecast using the model determined from the candidate input variable. For each candidate model, there is an automatic selection of which of the candidate input variables to include in each of the candidate models based upon the determined transfer functions.
As an illustration, the transfer functions can be determined from a white noise reference model by determining a functional transformation and stationary transformation for each regressor, determining the delay for each transformed regressor, determining simple numerator and denominator polynomial orders for each functionally transformed regressor, and determining the disturbance ARMA polynomials.
Such operations can be performed as described in the Forecasting Provisional Application. For example, the HPFDIAGNOSE procedure provides a set of tools for automated univariate time series model identification. Time series data can have outliers, structural changes, and calendar effects. In the past, finding a good model for time series data usually required experience and expertise in time series analysis.
The HPFDIAGNOSE procedure automatically diagnoses the statistical characteristics of time series and identifies appropriate models. The models that HPFDIAGNOSE considers for each time series include ARIMAX, Exponential Smoothing, Intermittent Demand and Unobserved Components models. Log transformation and stationarity tests are automatically performed. The ARIMAX model diagnostics find the AR and MA orders, detect outliers, and select the best input variables. The Unobserved Components Model diagnostics find the best components and select the best input variables.
The HPFDIAGNOSE procedure can be configured, inter alia, to provide one or more of the following functionality:
The following illustrates use of the HPFDIAGNOSE procedure and shows examples of how to create ARIMA, ESM, and UCM model specifications.
The following example prints the diagnostic tests of an ARIMA model. In the HPFDIAGNOSE statement, the SEASONALITY=12 option specifies the length of the seasonal cycle of the time series, and the PRINT=SHORT option prints the chosen model specification. The FORECAST statement specifies the dependent variable (AIR). The ARIMAX statement specifies that an ARIMA model is to be diagnosed.
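A minimal sketch of such a call is shown below; the data set name (SASHELP.AIR, the classic airline series with variables DATE and AIR) is assumed here for illustration, and exact syntax should be checked against the Forecasting Provisional Application.

   proc hpfdiagnose data=sashelp.air   /* assumed airline data set              */
                    seasonality=12     /* length of the seasonal cycle          */
                    print=short;       /* print the chosen model specification  */
      forecast air;                    /* dependent variable to be diagnosed    */
      arimax;                          /* diagnose an ARIMA(X) model            */
   run;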
The following example prints the diagnostic tests of an ESM for airline data. The ID statement INTERVAL=MONTH option specifies an implied seasonality of 12. The ESM statement specifies that an ESM model is to be diagnosed.
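A corresponding sketch, again assuming the SASHELP.AIR data set, uses the ID statement so that the monthly interval implies a seasonality of twelve:

   proc hpfdiagnose data=sashelp.air print=short;
      id date interval=month;          /* implied seasonality of 12                */
      forecast air;
      esm;                             /* diagnose an exponential smoothing model  */
   run;

The UCM example described next is analogous, with the ESM statement replaced by a UCM statement.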
The following example prints the diagnostic tests of a UCM for airline data. The UCM statement specifies that a UCM model is to be diagnosed.
When the column SELECTED=YES, the component is significant. When the column SELECTED=NO, the component is insignificant in
When SELECTED=YES, the STOCHASTIC column has either YES or NO. STOCHASTIC=YES indicates a component has a statistically significant variance, indicating the component is changing over time; STOCHASTIC=NO indicates the variance of a component is not statistically significant, but the component itself is still significant.
The following example shows how to pass a model specification created by the HPFDIAGNOSE procedure to the HPFENGINE procedure.
An ARIMAX model specification file, a model selection list, and a model repository SASUSER.MYCAT are created by the HPFDIAGNOSE procedure. The ARIMAX model specification file and the model selection list are contained in the SASUSER.MYCAT repository.
The OUTEST=data set is used to transmit the diagnostic results to the HPFENGINE procedure via the INEST=option. The WORK.EST_ONE data set contains the information about the data set variable and the model selection list.
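A sketch of this handoff might look as follows; the repository-related option names (MODELREPOSITORY= on HPFDIAGNOSE, REPOSITORY= on HPFENGINE) and the OUT= option are assumptions to be verified against the HPF documentation in the Forecasting Provisional Application.

   proc hpfdiagnose data=sashelp.air print=short
                    modelrepository=sasuser.mycat   /* assumed option name       */
                    outest=work.est_one;            /* diagnostic results        */
      id date interval=month;
      forecast air;
      arimax;
   run;

   proc hpfengine data=sashelp.air
                  repository=sasuser.mycat          /* assumed option name       */
                  inest=work.est_one                /* results from HPFDIAGNOSE  */
                  out=work.forecasts;               /* assumed forecast output   */
      id date interval=month;
      forecast air;
   run;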
The following example shows how the HPFDIAGNOSE and HPFENGINE procedures can be used to select a single model specification from among multiple candidate model specifications.
In this example the HPFDIAGNOSE procedure creates three model specifications and adds them to the model repository SASUSER.MYCAT created in the previous example.
If new model specification files are added to a model repository that already exists, then the numeric suffixes of the model specification file names and the model selection list file names are assigned sequentially.
This example adds three model specification files, DIAG2, DIAG3, and DIAG4 to the model repository SASUSER.MYCAT which already contains DIAG0 and DIAG1.
The following example shows the HPFDIAGNOSE procedure with the default settings.
It should be noted that the HPFDIAGNOSE procedure always performs the intermittency test first. If the HPFDIAGNOSE procedure determines that the series is intermittent, then the above example is equivalent to the following code:
However, if the HPFDIAGNOSE procedure determines that the series is not intermittent, then the default settings are equivalent to the following code:
The HPFDIAGNOSE procedure can be configured to perform the intermittency test first regardless of which model statement is specified. The IDM statement only controls the intermittency test using the INTERMITTENT= and BASE=options.
The following example specifies the IDM statement to control the intermittency test. If the HPFDIAGNOSE procedure determines that the series is intermittent, then an intermittent demand model is fitted to the data.
However, if the series is not intermittent, ARIMAX and ESM models are fitted to the data, even though the IDM statement is specified.
The following example specifies the ESM statement. If the series is intermittent, an intermittent demand model is fitted to the data, even though the ESM statement is specified. But, if the series is not intermittent, an ESM model is fitted to the data. The same is true when the ARIMAX and UCM statements are specified.
The HPFDIAGNOSE procedure uses the following statements:
A description of these statements is provided in
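Collecting the statements described in the remainder of this section, a skeleton of a HPFDIAGNOSE call is sketched below; angle brackets mark placeholders, and the detailed syntax of each statement is as described in the Forecasting Provisional Application.

   proc hpfdiagnose data=<input data set> <options>;
      by <variables>;                          /* optional BY-group processing                 */
      id <time ID variable> interval=<interval> <options>;
      event <event-names | _ALL_>;             /* events from INEVENT= or predefined keywords  */
      forecast <dependent variables> / <options>;
      input <regressor variables> / <options>;
      transform <options>;                     /* functional transformation test               */
      trend <options>;                         /* simple/seasonal differencing test            */
      arimax <options>;                        /* and/or the ESM, IDM, and UCM statements      */
   run;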
The following options can be used in the PROC HPFDIAGNOSE statement which has the following expression:
A BY statement can be used in the HPFDIAGNOSE procedure to process a data set in groups of observations defined by the BY variables:
The ID statement names a numeric variable that identifies observations in the input and output data sets and has the following format.
The ID variable's values are assumed to be SAS date, time, or datetime values. In addition, the ID statement specifies the (desired) frequency associated with the time series. The ID statement options also specify how the observations are accumulated and how the time ID values are aligned to form the time series. The information specified affects all variables specified in subsequent FORECAST statements. If the ID statement is specified, the INTERVAL=option must also be specified. If an ID statement is not specified, the observation number, with respect to the BY group, is used as the time ID.
The EVENT statement names event-names that identify the events in the INEVENT= data-set or predefined event-keywords or _ALL_. The statement has the following format:
The EVENT statement names either event-names or _ALL_. The event names identify the events in the INEVENT=data-set or are the SAS predefined event-keywords.
_ALL_ is used to indicate that all simple events in the INEVENT=data set should be included in processing. If combination events exist in the INEVENT=data set and are to be included, then they must be specified in a separate EVENT statement. The HPFDIAGNOSE procedure does not currently process group events, although if the simple events associated with the group are defined in the INEVENT=data set, they can be included in processing, either by event-name or using _ALL_. The EVENT statement requires the ID statement.
For more information on the EVENT statement, see the Forecasting Provisional Application.
The following option can be used in the EVENT statement:
Any number of FORECAST statements can be used in the HPFDIAGNOSE procedure. The statement has the following format:
The FORECAST statement lists the variables in the DATA=data set to be diagnosed. The variables are dependent or response variables that you wish to forecast in the HPFENGINE procedure. The following options can be used in the FORECAST statement:
Any number of INPUT statements can be used in the HPFDIAGNOSE procedure. The statement has the following format:
The INPUT statement lists the variables in the DATA=data set to be diagnosed as regressors. The variables are independent or predictor variables to be used to forecast dependent or response variables.
The following options can be used in the INPUT statement:
A TRANSFORM statement can be used to specify the functional transformation of the series. The statement can have the following format:
The following options can be used in the TRANSFORM statement:
A TREND statement can be used to test whether or not the dependent series requires simple or seasonal differencing, or both. The statement can have the following format:
The augmented Dickey-Fuller test (Dickey and Fuller 1979) is used for the simple unit root test. If the seasonality is less than or equal to 12, the seasonal augmented Dickey-Fuller (ADF) test (Dickey, Hasza and Fuller 1984) is used for the seasonal unit root test; otherwise, an AR(1) seasonal dummy test is used. Likewise, when the seasonality is less than or equal to 12, the joint simple and seasonal differencing test uses the Hasza-Fuller test (Hasza and Fuller 1979, 1984); otherwise, the ADF test and the season dummy test are used.
The following options can be used in the TREND statement:
An ARIMAX statement can be used to find an appropriate ARIMAX specification. The statement can have the following format:
The HPFDIAGNOSE procedure performs the intermittency test first. If the series is intermittent, an intermittent demand model is fitted to the data and the ARIMAX statement is not applicable. If the series is not intermittent, an ARIMAX model is fitted to the data. If a model statement is not specified, the HPFDIAGNOSE procedure diagnoses ARIMAX and ESM models if the series is not intermittent, but diagnoses an IDM model if the series is intermittent.
The following options can be used in the ARIMAX statement:
If the OUTLIER=option is not specified, the HPFDIAGNOSE procedure performs the outlier detection with the OUTLIER=(DETECT=MAYBE MAXNUM=2 MAXPCT=2 SIGLEVEL=0.01) option as the default.
If the PREFILTER=EXTREME option is specified and extreme values are found, then these values are treated as potential outliers. With the PREFILTER=EXTREME option, outliers may be detected even if the DETECT=NO option is specified, and more than n outliers can be detected even if the MAXNUM=n option is specified.
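A sketch combining these options is shown below; the placement of PREFILTER= on the PROC statement is an assumption, and the OUTLIER= values shown are the stated defaults.

   proc hpfdiagnose data=sashelp.air seasonality=12 print=short
                    prefilter=extreme;   /* flag extreme values as potential outliers (assumed placement) */
      forecast air;
      arimax outlier=(detect=maybe maxnum=2 maxpct=2 siglevel=0.01);   /* stated default settings */
   run;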
An ESM statement can be used to find an appropriate ESM model specification based on the model selection criterion (McKenzie 1984). The statement can have the following format:
The HPFDIAGNOSE procedure performs the intermittency test first. If the series is intermittent, an intermittent demand model is fitted to the data and the ESM statement is not applicable. If the series is not intermittent, an ESM model is fitted to the data.
If a model statement is not specified, the HPFDIAGNOSE procedure diagnoses ARIMAX and ESM models if the series is not intermittent, but diagnoses an IDM model if the series is intermittent.
An IDM statement is used to control the intermittency test. The HPFDIAGNOSE procedure performs the intermittency test first. The statement can have the following format:
If the series is intermittent, an intermittent demand model is fitted to the data based on the model selection criterion. However, if the series is not intermittent, ARIMAX and ESM models are fitted to the data.
If a model statement is not specified, the HPFDIAGNOSE procedure diagnoses ARIMAX and ESM models if the series is not intermittent, but diagnoses an IDM model if the series is intermittent.
A UCM statement can be used to find an appropriate UCM model specification (Harvey 1989, 2001; Durbin and Koopman 2001). The statement can have the following format:
The HPFDIAGNOSE procedure performs the intermittency test first. If the series is intermittent, an intermittent demand model is fitted to the data and the UCM statement is not applicable. If the series is not intermittent, a UCM model is fitted to the data.
The following options can be used in the UCM statement:
With respect to data preparation, the HPFDIAGNOSE procedure does not use missing data at the beginning and/or end of the series. Missing values in the middle of the series to be forecast could be handled with the PREFILTER=MISSING or PREFILTER=YES option. The PREFILTER=MISSING option uses smoothed values for missing data for tentative order selection in the ARIMAX modeling and for tentative components selection in the UCM modeling, but the original values for the final diagnostics. The PREFILTER=YES option uses smoothed values for missing data and for all diagnostics.
Extreme values in the middle of the series to be forecast can be handled with the PREFILTER=EXTREME option in the ARIMA modeling. The HPFDIAGNOSE procedure replaces extreme values with missing values when determining a tentative ARIMA model, but the original values are used for the final diagnostics. The PREFILTER=EXTREME option detects extreme values if the absolute values of residuals are greater than 3×STDDEV from a proper smoothed model.
If there are missing values in the middle of data for the input series, the procedure uses an interpolation method based on exponential smoothing to fill in the missing values.
The following data set provides a scenario for explaining the PREFILTER=EXTREME option.
In the following SAS code, the HPFDIAGNOSE procedure diagnoses the new data set AIR-EXTREME without the PREFILTER=EXTREME option.
In
In the following SAS code, the HPFDIAGNOSE procedure diagnoses the new data set AIR-EXTREME with the PREFILTER=EXTREME option.
In
With respect to functional transformation, the log transform test compares the MSE or MAPE value after fitting an AR(p) model to the original data and to the logged data. If the MSE or MAPE value is smaller for the AR(p) model fitted to the logged data, then the HPFDIAGNOSE procedure will perform the log transformation.
The next two SAS programs specify the same log transformation test.
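A sketch of the two equivalent programs, assuming that the default transformation test corresponds to an explicit TRANSFORM TYPE=AUTO statement:

   proc hpfdiagnose data=sashelp.air seasonality=12 print=short;
      forecast air;
      arimax;
   run;

   proc hpfdiagnose data=sashelp.air seasonality=12 print=short;
      forecast air;
      transform type=auto;   /* explicit request for the automatic log-transform test */
      arimax;
   run;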
The Functional Transformation Table shown in
The stationarity test decides whether the data requires differencing. Note that d is the simple differencing order, and D is the seasonal differencing order.
The next two SAS programs specify the same trend test.
The simple augmented Dickey-Fuller test is used to determine the simple differencing order. If there is no unit root, then the HPFDIAGNOSE procedure will set d=0. If there is a unit root, then the double unit root test is applied; if there is a double unit root, then the HPFDIAGNOSE procedure will set d=2, otherwise d=1.
The seasonal augmented Dickey-Fuller test is used to identify the seasonal differencing order. If the seasonality is greater than 12, the season dummy regression test is used. If there is no seasonal unit root, the HPFDIAGNOSE procedure will set D=0. If there is a seasonal unit root, the HPFDIAGNOSE procedure will set D=1.
Hasza and Fuller (1979, 1984) proposed the joint unit root test. If the seasonality is less than or equal to 12, this test is used. If there is a joint unit root, then the HPFDIAGNOSE procedure will set D=1 and d=1.
If the seasonality is greater than 12, the seasonal dummy test is used to decide the seasonal differencing order. The seasonal dummy test compares the criterion (AIC) of two AR(1) models and the joint significance of the seasonal dummy parameters, where one has seasonal dummy variables and the other does not have the seasonal dummy variables.
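As a sketch of the underlying test, the augmented Dickey-Fuller regression has the standard form

\Delta y_t = \alpha + \beta\, t + \rho\, y_{t-1} + \sum_{i=1}^{p} \gamma_i\, \Delta y_{t-i} + e_t

and the unit root null hypothesis is \rho = 0; failure to reject the null indicates that differencing is needed. The seasonal version replaces \Delta = 1 - B and y_{t-1} with \Delta_s = 1 - B^{s} and y_{t-s}.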
For ARMA order selection, the tentative simple autoregressive and moving-average orders (AR=p* and MA=q*) are found using the ESACF, MINIC, or SCAN method.
The next two SAS programs result in the same diagnoses.
The simple autoregressive and moving-average orders (p and q) are found by minimizing the SBC/AIC values from the models among 0≤p≤p* and 0≤q≤q*, where p* and q* are the tentative simple autoregressive and moving-average orders.
The seasonal AR and MA orders (P and Q) are found by minimizing the SBC/AIC values from the models among 0≤P≤2 and 0≤Q≤2.
In order to determine whether the model has a constant, two models are fitted: (p,d,q)(P,D,Q)_s and C+(p,d,q)(P,D,Q)_s, where s is the seasonal period. The model with the smaller SBC/AIC value is chosen.
The ARIMA model uses the conditional least-squares estimates for the parameters.
A transfer function filter has delay, numerator, and denominator parameters, denoted (b,k,r), where b is the delay, k is the numerator order, and r is the denominator order.
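In standard notation, such a filter applied to an input x_t is the rational lag polynomial

\frac{\omega(B)}{\delta(B)}\, B^{b}\, x_t, \qquad \omega(B) = \omega_0 - \omega_1 B - \cdots - \omega_k B^{k}, \qquad \delta(B) = 1 - \delta_1 B - \cdots - \delta_r B^{r}

so that b shifts the input b periods into the past, the numerator order k allows a distributed-lag response, and the denominator order r allows the response to decay gradually over time.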
The default of functional transformation for the inputs is no transformation. The TESTINPUT=TRANSFORM option specifies that the same functional transformation is applied to the inputs as is used for the variable to be forecast.
Using the TESTINPUT=TRANSFORM option, you can test whether the log transformation is applied to the inputs.
The default of the simple and seasonal differencing for the inputs is the same as the simple and seasonal differencing applied to the variable to be forecast.
Using the TESTINPUT=TREND option, you can test whether the differencing is applied to the inputs.
The cross-correlations between the variable (yt) to be forecast and each input variable (xit) are used to identify the delay parameters. The following steps are used to prewhiten the variable to be forecast in order to identify the delay parameter (b).
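A sketch of the standard prewhitening idea: an ARMA filter \pi(B) is fitted to the input so that the filtered input is approximately white noise, the same filter is applied to the variable to be forecast, and the delay is taken as the smallest lag with a significant cross-correlation:

\alpha_t = \pi(B)\, x_{it}, \qquad \beta_t = \pi(B)\, y_t, \qquad b = \min\{\, j \ge 0 : r_{\alpha\beta}(j)\ \text{is significant} \,\}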
The high-order lag regression model and the transfer function model are compared to identify the simple numerator and denominator orders.
Fit the high-order lag regression model (lag=15) and get the coefficients. Fit the transfer function C+(b,k,r), where C is a constant term, b is the delay parameter found in the previous section, 0≤k≤2, and 0≤r≤2, and get the impulse weight function (lag=15) of the transfer model. Compare the pattern of the coefficients from the high-order regression model and the transfer model.
The following SAS code shows how to select significant input variables.
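A sketch of such a program is given below; the data set (WORK.SALES) and variable names (Y, X1, X2) are hypothetical, and the placement of the TESTINPUT= option on the PROC statement is an assumption to be verified against the HPF documentation.

   proc hpfdiagnose data=work.sales print=short   /* hypothetical data set                    */
                    testinput=both;               /* test transformation and differencing     */
                                                  /* of the inputs (assumed option placement) */
      id date interval=month;
      forecast y;                                 /* hypothetical dependent variable          */
      input x1 x2;                                /* candidate input variables (regressors)   */
      arimax;
   run;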
The ARIMA Input Selection Table shown in
Outlier detection is the default in the ARIMAX modeling. There are two types of outliers: the additive outlier (AO) and the level shift (LS). For each detected outlier, dummy regressors or indicator variables are created. The ARIMAX model and the dummy regressors are fitted to the data.
The detection of outliers follows a forward method. First, find a significant outlier. If there are no other significant outliers, outlier detection stops at this point. Otherwise, include this outlier in the model as an input and find another significant outlier. The same functional differencing is applied to the outlier dummy regressors as is used for the variable to be forecast.
The data shown in
The HPFDIAGNOSE procedure selects an appropriate intermittent demand model (IDM) based on the model selection criterion. If a series is intermittent or interrupted, a proper IDM is selected by either individually modeling both the demand interval and size component or jointly modeling these components using the average demand component (demand size divided by demand interval).
The following example prints the diagnostics of an intermittent demand series. The INTERMITTENT=2.5 and BASE=0 are specified.
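A sketch of the program is given below; the data set and variable names are hypothetical.

   proc hpfdiagnose data=work.demand print=short;   /* hypothetical intermittent demand data */
      id date interval=day;
      forecast qty;                                 /* hypothetical demand variable          */
      idm intermittent=2.5 base=0;                  /* controls for the intermittency test   */
   run;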
The HPFDIAGNOSE procedure selects an appropriate exponential smoothing model (ESM) based on the model selection criterion. The following example prints the ESM model specification.
The ESM model specification in
The UCM statement is used to find the proper components among the level, trend, seasonal, cycles, and regression effects.
With respect to differencing variables in a UCM, the variable to be forecast and the events are not differenced regardless of the result of the TREND statement. Differencing of the input variables follows the result of the option TESTINPUT=TREND or TESTINPUT=BOTH.
With respect to the transfer function in a UCM, the functional transformation, simple and seasonal differencing, and delay parameters for the transfer function in a UCM are the same as those that are used for the transfer function in an ARIMAX model.
The series that consists of the yearly river flow readings of the Nile, recorded at Aswan (Cobb 1978), is studied. The data consists of readings from the years 1871 to 1970. The DATA step statements shown in
The series is known to have had a shift in the level starting at the year 1899, and the years 1877 and 1913 are suspected to be outlying points. The following SAS code creates the NILE-DATA data set with the Shift1899, Event1877, and Event1913 variables.
The following SAS code prints the diagnoses of the UCM model specification.
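A sketch of the diagnosis step is given below; the data set name and the river-flow variable name are assumptions, while SHIFT1899, EVENT1877, and EVENT1913 are the dummy variables created above.

   proc hpfdiagnose data=nile_data print=short;   /* assumed data set name                    */
      id year interval=year;                      /* assumed time ID variable                 */
      forecast riverflow;                         /* assumed name of the river-flow variable  */
      input shift1899 event1877 event1913;        /* level-shift and outlier dummy variables  */
      ucm;
   run;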
The following example has the same results as
A holdout sample is useful to find models that have better out-of-sample forecasts. If the HOLDOUT=or HOLDOUTPCT=option is specified, the model selection criterion is computed using only the holdout sample region.
The ARIMA model specification in
Calendar effects such as holidays and trading days are defined by the HPFEVENTS procedure or by predefined event-keywords. The HPFEVENTS procedure creates the OUT= data set for the event definitions, and the HPFDIAGNOSE procedure uses these event definitions by specifying the INEVENT=option in the ARIMAX or UCM model.
With respect to Events in an ARIMAX Model, the simple and seasonal differencing for the events in an ARIMAX are the same as those that are used for the variable to be forecast. No functional transformations are applied to the events.
With respect to events in a UCM, the simple and seasonal differencing for the events in a UCM model are not applied to the events. No functional transformations are applied to the events.
The following SAS code shows how the HPFEVENTS procedure can be used to create the event data set, OUT=EVENTDATA.
The following SAS code shows that the HPFDIAGNOSE procedure uses this event data by specifying the INEVENT=EVENTDATA option. The EVENT statement specifies the names of events defined in the INEVENT=EVENTDATA data set.
The following program generates the same results as the previous example without specifying an INEVENT=data set. In this example, SAS predefined event-keywords are specified in the EVENT statement.
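A sketch of this variant is shown below; the particular predefined keywords (CHRISTMAS and NEWYEAR) are assumptions used only for illustration.

   proc hpfdiagnose data=sashelp.air print=short;
      id date interval=month;
      forecast air;
      event christmas newyear;   /* predefined event keywords (assumed names) */
      arimax;
   run;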
The HPFDIAGNOSE procedure diagnoses and the HPFENGINE procedure forecasts. There are different ways to communicate between the two procedures. One way is to specify the OUTEST=data set from the HPFDIAGNOSE procedure as the INEST=data set in the HPFENGINE procedure. Another way is to use the HPFSELECT procedure to communicate between the HPFDIAGNOSE procedure and the HPFENGINE procedure.
The ALPHA=, CRITERION=, HOLDOUT=, and HOLDOUTPCT=options can be changed using the HPFSELECT procedure before these options are transmitted to the HPFENGINE procedure. Otherwise the values specified in the HPFDIAGNOSE procedure are transmitted directly to the HPFENGINE procedure.
Missing values in the input series are handled differently in the HPFDIAGNOSE procedure than in the HPFENGINE procedure. The HPFDIAGNOSE procedure uses the smoothed missing values for inputs, but the HPFENGINE procedure does not include the inputs that have missing values. This difference can produce different statistical results between the two procedures.
The model specification files created by the HPFDIAGNOSE procedure can be compared with benchmark model specifications using the HPFESMSPEC, HPFIDMSPEC, HPFARIMASPEC, and HPFUCMSPEC procedures.
The following example shows how to combine these procedures to diagnose a time series. Create a diagnosed model specification.
Create an ARIMA(0,1,1)(0,1,1)s model specification.
Create a model selection list that includes a diagnosed model (DIAG0) and a specified model (BENCHMODEL).
Select a better model from the model specification list.
The OUTEST=data set contains information that maps data set variables to model symbols and references the model specification file and model selection list files for each variable to be forecast. This information is used by the HPFENGINE procedure for further model selection, parameter estimation, and forecasts.
In addition, this information can be used by the HPFSELECT procedure to create customized model specification files.
The OUTEST=data set has the following columns:
Here are two examples. The first has one model specification file with a model selection list file; the second has two model selection list files and four model specification files.
The first example uses the BASENAME=AIRSPEC and the new model repository SASUSER.MYMODEL.
The next example uses the new BASENAME=GNPSPEC and the new model repository SASUSER.MYGNP. The ESM and ARIMAX statements are specified for two variables to be forecast.
The model selection list GNPSPEC2 contains the two model specifications; GNPSPEC0 is the ARIMAX model specification, and GNPSPEC1 is the ESM model specification for the variable to be forecast, CONSUMP.
The model selection list GNPSPEC5 contains the two model specifications; GNPSPEC3 is the ARIMAX model specification, and GNPSPEC4 is the ESM model specification for the variable to be forecast, INVEST.
The HPFDIAGNOSE procedure assigns a name to each table it creates. You can use these names to reference the table when using the Output Delivery System (ODS) to select tables and create output data sets. These names are listed in the table of
The following example of selection of input variables requests testing of the transformation and differencing of the input variables independent of the variable to be forecast.
The output shown in
The output shown in
The output shown in
The output shown in
This example demonstrates how to select events and input variables.
The output shown in
The output shown in
The output shown in
The output shown in
This example shows that the data is an intermittent demand series.
The output shown in
This example illustrates the use of exponential smoothing models (ESM).
The output shown in
This example illustrates the use of the UCM statement in the HPFDIAGNOSE procedure and uses the code shown in
The output shown in
The output shown in
The operations of a diagnostic software program can be configured in many different ways
If a series is not intermittent or interrupted as determined at decision step 202, then pre-filtering is performed at step 206 in order to find extreme values which can affect a baseline model. At step 206, extreme values in the middle of the series to be forecast can be handled with the PREFILTER=EXTREME option in the ARIMA modeling. This holds extreme values and treats them the same as events. Extreme values are replaced with missing values when determining a tentative ARIMA model, but the original values are used for the final diagnostics.
Decision step 208 tests if the series needs a log transformation. The decision whether to transform the data depends on the test result or on a given transformation function. For example, the log transform test at decision step 208 can compare the MSE or MAPE value after fitting an AR(p) model to the original data and to the logged data. If the MSE or MAPE value is smaller for the AR(p) model fitted to the logged data, then the log transformation will be performed at step 214. Step 214 can use the following statement to perform this operation: Transform TYPE=AUTO, LOG, SQRT, LOGISTIC, and BOX-COX(n). It is noted that if the seasonality is specified, a SEASON DUMMY test is first performed.
Step 210 fits an Exponential Smoothing Model to the time series data if events and inputs are not available. Step 210 can use the Statement/Option “ESM” in order to find a proper (best) ESM based on the model selection criterion. This is then used as the model specification 270.
However, if an ESM is not to be used, then processing continues at decision step 212. Decision step 212 tests whether the series needs a simple differencing (d) and/or a seasonal differencing (D). Decision step 212 can use the following statement/option to perform this: “Trend DIF= SDIF=.”
The simple augmented Dickey-Fuller test is used to determine the simple differencing order d. If there is no unit root as determined at decision step 212, then d=0 and processing continues at model determination steps 240 and 250. If there is a unit root as determined at decision step 212, then at step 216 the double unit root test is applied; if there is a double unit root, then d=2, otherwise d=1.
The seasonal augmented Dickey-Fuller test is used to identify the seasonal differencing order D. If the seasonality is greater than 12, the season dummy regression test is used. If there is no seasonal unit root, then D=0. If there is a seasonal unit root, then D=1. If the seasonality is less than or equal to 12, then the Hasza-Fuller joint unit roots test is used. If there is a joint unit root, then D=1 and d=1.
A seasonal dummy test is also performed as follows: if the seasonality is greater than 12, the seasonal dummy test is used to decide the seasonal differencing order. The seasonal dummy test compares the criterion (AIC) of two AR(1) models and the joint significance of the seasonal dummy parameters, where one has seasonal dummy variables and the other does not have the seasonal dummy variables. Processing continues at model determination steps 240 and 250.
At model determination step 240, an ARIMAX model is fitted. The Statement/Option “ARIMAX” can be used. This step considers events, inputs, and outliers in order to find an ARIMA model to serve as a benchmark and to find proper events, inputs, and outliers which can explain the data better than the benchmark model.
The tentative simple autoregressive and moving-average orders (AR=p* and MA=q*) are found using the ESACF, MINIC, or SCAN method.
The simple autoregressive and moving-average orders (p and q) are found by minimizing the SBC/AIC values from the models among 0<=p<=p* and 0<=q<=q* where p* and q* are the tentative simple autoregressive and moving-average orders.
The seasonal AR and MA orders (P and Q) are found by minimizing the SBC/AIC values from the models among 0<=P<=2 and 0<=Q<=2.
In order to determine whether the model has a constant, two models are fitted: (p,d,q)(P,D,Q)_s and C+(p,d,q)(P,D,Q)_s, where s is a season period. The model with the smaller SBC/AIC value is chosen.
To help build the ARIMAX model, a functional transformation may be applied to the input variables that are received at step 230. An IDM test is performed at step 232 so that, for intermittent series, testing for functional and stationary transformations and identifying transfer functions can be avoided.
A transfer function determination process which is used to build the ARIMAX model can be performed using the following operations:
With respect to functional transformation for input variables, Step 234 determines whether a functional transformation should occur. The TESTINPUT=TRANSFORM option specifies that the same functional transformation is applied to the inputs as is used for the variable to be forecast. Using the TESTINPUT=TRANSFORM option, step 234 can test whether the log transformation should be applied to the inputs.
With respect to simple and seasonal differencing orders for input variables, the default of the simple and seasonal differencing for the inputs is the same as the simple and seasonal differencing applied to the variable to be forecast. At decision step 236, using the TESTINPUT=TREND option, a test is performed as to whether the differencing is applied to the inputs.
With respect to cross-correlations between forecast and input variables, the cross-correlations between the variable (y_t) to be forecast and each input variable (x_{it}) are used to identify the delay parameters. The following steps are used to prewhiten the variable to be forecast in order to identify the delay parameter (b).
With respect to determination of the simple numerator (k) and denominator (r) orders, the high-order lag regression model and the transfer function model are compared to identify the simple numerator and denominator orders. Fit the high-order lag regression model (lag=15) and get the coefficients. Fit the transfer function C+(b,k,r), where C is a constant term. The output 238 of the transfer function is then provided in order to build the ARIMAX model at step 240.
Events can be considered in building an ARIMAX Model. Event data is received at step 220 and the same functional differencing is applied to the events as is used for the variable to be forecast.
Outliers can be considered when building an ARIMAX model. Outlier data is received at step 260 and can be of two types: the additive outlier (AO) and the level shift (LS). For each detected outlier, dummy regressors or indicator variables are created. The ARIMAX model and the dummy regressors are fitted to the data.
The detection of outliers follows a forward method: first find a significant outlier; if there are no other significant outliers, detecting outlier stops at this point. Otherwise, include this outlier into a model as an input and find another significant outlier. The same functional differencing is applied to the outlier dummy regressors as is used for the variable to be forecast.
Step 250 fits the UCM model. Step 250 finds a useful components model to serve as a benchmark and finds proper events and inputs which can explain the data better than the benchmark model. The Statement/Option can be used: UCM Components=( . . . )
Step 250 considers events and inputs, but no outliers are considered and there is no differencing for the variable to be forecast and the events. Proper components are found among the LEVEL, TREND, SEASON, CYCLES, DEPLAG(1), AUTOREG and regression effects. If the data has a season, the CYCLES component is not considered; otherwise two CYCLES are estimated. When the TREND component is specified, the LEVEL is always included in the model. Only a DEPLAG component of order 1 is considered in the model.
The variable to be forecast and the events are not differenced regardless of the result of the TREND statement. Differencing of the input variables follows the result of the option TESTINPUT=TREND or TESTINPUT=BOTH.
The functional transformation, simple and seasonal differencing, and delay parameters for the transfer function in a UCM are the same as those that are used for the transfer function in building an ARIMAX model (see step 238).
To help select which of the constructed models to use, holdout sample analysis is performed. The holdout sample is a subset of the dependent time series ending at the last non-missing observation. The statistics of a model selection criterion are computed using only the holdout sample.
While examples have been used to disclose the invention, including the best mode, and also to enable any person skilled in the art to make and use the invention, the patentable scope of the invention is defined by claims, and may include other examples that occur to those skilled in the art. Accordingly the examples disclosed herein are to be considered non-limiting. As an illustration, it should be understood that the steps and the order of the processing flows described herein may be altered, modified, deleted and/or augmented and still achieve the desired outcome.
It is noted that the systems and methods may be implemented on various types of computer architectures, such as for example on a single general purpose computer or workstation, or on a networked system, or in a client-server configuration, or in an application service provider configuration.
It is further noted that the systems and methods may include data signals conveyed via networks (e.g., local area network, wide area network, the Internet, etc.), fiber optic medium, carrier waves, wireless networks, etc. for communication with one or more data processing devices. The data signals can carry any or all of the data disclosed herein that is provided to or from a device.
Additionally, the methods and systems described herein may be implemented on many different types of processing devices by program code comprising program instructions that are executable by the device processing subsystem. The software program instructions may include source code, object code, machine code, or any other stored data that is operable to cause a processing system to perform methods described herein. Other implementations may also be used, however, such as firmware or even appropriately designed hardware configured to carry out the methods and systems described herein.
The systems' and methods' data (e.g., associations, mappings, etc.) may be stored and implemented in one or more different types of computer-implemented ways, such as different types of storage devices and programming constructs (e.g., data stores, RAM, ROM, Flash memory, flat files, databases, programming data structures, programming variables, IF-THEN (or similar type) statement constructs, etc.). It is noted that data structures describe formats for use in organizing and storing data in databases, programs, memory, or other computer-readable media for use by a computer program.
The systems and methods may be provided on many different types of computer-readable media including computer storage mechanisms (e.g., CD-ROM, diskette, RAM, flash memory, computer's hard drive, etc.) that contain instructions for use in execution by a processor to perform the methods' operations and implement the systems described herein.
The computer components, software modules, functions, data stores and data structures described herein may be connected directly or indirectly to each other in order to allow the flow of data needed for their operations. It is also noted that a module or processor includes but is not limited to a unit of code that performs a software operation, and can be implemented for example as a subroutine unit of code, or as a software function unit of code, or as an object (as in an object-oriented paradigm), or as an applet, or in a computer script language, or as another type of computer code. The software components and/or functionality may be located on a single computer or distributed across multiple computers depending upon the situation at hand.
It should be understood that as used in the description herein and throughout the claims that follow, the meaning of “a,” “an,” and “the” includes plural reference unless the context clearly dictates otherwise. Also, as used in the description herein and throughout the claims that follow, the meaning of “in” includes “in” and “on” unless the context clearly dictates otherwise. Finally, as used in the description herein and throughout the claims that follow, the meanings of “and” and “or” include both the conjunctive and disjunctive and may be used interchangeably unless the context expressly dictates otherwise; the phrase “exclusive or” may be used to indicate a situation where only the disjunctive meaning may apply.
This application is related to and claims the benefit of and priority to U.S. Provisional Patent Application 60/679,093 filed May 9, 2005 entitled “Computer-Implemented Forecasting Systems And Methods,” the entire document (e.g., specification, drawings, etc.) of which is herein expressly incorporated by reference and hereinafter referred to herein as the “Forecasting Provisional Application.”