I’m currently working on a multivariate time series forecasting project. Initially, I developed a VAR (Vector Autoregression) model, but the forecast error was significantly high. Upon diagnostic checking, I noticed large residuals at specific time points, indicating the presence of outliers that the endogenous variables couldn't explain.
To address this, I’m shifting to a VARX model. My plan is to use dummy variables as exogenous inputs (X), where each dummy equals 1 at a time point where the residual is an outlier and 0 otherwise. My question is: should these dummy variables be included in both the training and testing datasets?
Any advice or best practices for handling "Outlier Dummies" would be greatly appreciated!
I guess you are using PROC VARMAX. Correct?
PROC VARMAX in SAS is used for multivariate time series analysis and modeling, but it does NOT have a dedicated, automated OUTLIER statement like some univariate procedures (e.g., PROC ARIMA or PROC X13).
You could work with outlier dummies but only in the training data and not in the testing data. Your testing data is historical data, but ultimately your model is used to forecast the future. And the outliers in the future are unknown and cannot be forecast (predicted). So, for an honest assessment of the forecast error after deployment of your model in production ... don't put the outlier dummies in your test data.
It is important that (a few) outliers do not disrupt the training process so that you can perform pattern recognition optimally. But then your model is industrialized (deployment in production), and then you really have to hope that outliers no longer occur (a vain hope, of course).
Perhaps you will find some sort of forecast driver that underlies the outliers. If that is the case, you must include that forecast driver as an independent variable in the model. And perhaps you know the future of that independent variable?
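The outlier-dummy approach described above could be set up roughly as follows. This is only a sketch under assumptions: the dataset name work.train, the monthly ID variable date, the series y1 and y2, the two outlier dates, and the lag order p=2 are all hypothetical placeholders, not values from this thread.

```sas
/* Hypothetical names: work.train holds the in-sample data with a       */
/* monthly ID variable "date" and two dependent series y1 and y2.       */
/* The two outlier dates are placeholders for the dates you found in    */
/* your own residual diagnostics.                                       */
data work.train_x;
   set work.train;
   /* A comparison in SAS evaluates to 1 (true) or 0 (false) */
   ao1 = (date = '01MAR2020'd);   /* dummy for first outlier date  */
   ao2 = (date = '01NOV2021'd);   /* dummy for second outlier date */
run;

proc varmax data=work.train_x;
   id date interval=month;
   /* Variables to the right of "=" are treated as exogenous.       */
   /* XLAG=0 uses only the current value of each dummy.             */
   model y1 y2 = ao1 ao2 / p=2 xlag=0;
run;
```

Each dummy absorbs the shock at its own date, so that observation has much less influence on the autoregressive coefficients.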
BR, Koen
Thank you.
My pleasure.
To be honest, I wasn't quite sure what you meant by "testing data". A hold-out set or an out-of-sample region?
But it doesn’t matter for my answer.
There are 1, 2 or 3 regions when you want to model a univariate time series: the in-sample region, the hold-out set, and the out-of-sample region.
In any case, it is better to smooth out outliers (additive, level shift) only in the in-sample region to prevent them from distorting parameter estimation and to allow for proper pattern recognition.
And thus ... it is better to not remove shocks in the hold-out and out-of-sample regions for the reasons mentioned earlier. After all, the hold-out set is meant to mimic "unseen" future data.
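For the hold-out assessment, something like the following could be used to compare forecasts with the untouched actuals. It is a sketch under assumptions: work.fcst is taken to be an OUTPUT OUT= dataset from PROC VARMAX (where FOR1 holds the forecasts of the first dependent variable), and work.holdout is a hypothetical dataset with the actual y1 values for the hold-out dates. Both are assumed to be sorted by date.

```sas
/* Assumed inputs (hypothetical names):                                 */
/*   work.fcst    : OUTPUT OUT= dataset from PROC VARMAX, with date     */
/*                  and FOR1 (forecast of the first dependent variable) */
/*   work.holdout : actual values of y1 for the hold-out dates          */
data work.eval;
   merge work.fcst(keep=date for1)
         work.holdout(keep=date y1 in=in_hold);
   by date;
   if in_hold;                      /* keep hold-out dates only          */
   abs_err = abs(y1 - for1);        /* absolute forecast error           */
   if y1 ne 0 then ape = abs_err / abs(y1);  /* absolute percentage error */
run;

/* Mean absolute error and mean absolute percentage error */
proc means data=work.eval mean;
   var abs_err ape;
run;
```

Because the hold-out values are not cleaned and carry no outlier dummies, the resulting error is closer to what you can expect after deployment.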
BR, Koen
In my case, I also only use 2 regions, just like you mentioned.
Btw, I have a follow-up question regarding the implementation in SAS PROC VARMAX. I’m trying to model the dummy variables as exogenous variables. However, when I run the procedure using only my in-sample data (e.g., 100 observations), SAS does not produce any forecast output. I then tried to generate the forecasts manually using the estimated coefficients, but the forecast values gradually became smaller and even turned negative over time. I’m wondering whether I made a mistake in the data setup. Should the dataset already include the future time ID rows (with missing values for Y but fill the future dummy (X) values with 0s)?
@tugasakhir wrote: Should the dataset already include the future time ID rows (with missing values for Y but fill the future dummy (X) values with 0s)?
Yes!
See
Problem Note 37474: Incorrect or no forecasts are produced when future values of exogenous variables are not provided
https://support.sas.com/kb/37/474.html
So, just put zeros as future values for your outlier dummies.
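A minimal sketch of that data setup, under assumptions: work.train_x is a hypothetical in-sample dataset that already contains the monthly ID variable date, the series y1 and y2, and the dummies ao1 and ao2. The last in-sample month, the horizon of 12, and p=2 are placeholders as well.

```sas
/* Build 12 future rows: dependent variables missing, dummies set to 0. */
data work.future;
   do h = 1 to 12;
      /* '01DEC2023'd is an assumed last in-sample month */
      date = intnx('month', '01DEC2023'd, h);
      y1 = .;  y2 = .;     /* unknown future values of the series */
      ao1 = 0; ao2 = 0;    /* no known outliers in the future     */
      output;
   end;
   drop h;
run;

/* Stack the in-sample rows and the future rows */
data work.all;
   set work.train_x work.future;
   format date monyy7.;
run;

proc varmax data=work.all;
   id date interval=month;
   model y1 y2 = ao1 ao2 / p=2 xlag=0;
   /* LEAD= is set to the number of appended future rows */
   output out=work.fcst lead=12;
run;
```

With the dummies at 0 in the future rows, the forecasts are driven only by the autoregressive part of the model.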
BR, Koen
OK, thank you so much for your help.