Solved: Re: Improving VAR Model Accuracy using VARX with Outlier Dummy Variables

tugasakhir

I’m currently working on a multivariate time series forecasting project. Initially, I developed a VAR (Vector Autoregression) model, but the forecast error was significantly high. Upon diagnostic checking, I noticed large residuals at specific time points, indicating the presence of outliers that the endogenous variables couldn't explain.

To address this, I’m shifting to a VARX model. My plan is to use dummy variables as exogenous inputs (X), where each dummy equals 1 at a specific time point where the residual is an outlier and 0 otherwise. My question is: should these dummy variables be included in both the training and testing datasets?

Any advice or best practices for handling "Outlier Dummies" would be greatly appreciated!

sbxkoenk

I guess you are using PROC VARMAX. Correct?

PROC VARMAX in SAS is primarily used for multivariate time series analysis and modeling, but it does NOT have a dedicated, automated OUTLIER statement like some univariate procedures (e.g., PROC ARIMA or PROC X13).

You can add the outlier dummies yourself, but only in the training data and not in the testing data. Your testing data is historical data, but ultimately your model is used to forecast the future. And the outliers in the future are unknown and cannot be forecast (predicted). So, for an honest assessment of the forecast error after deployment of your model in production ... don't put the outlier dummies in your test data.
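As a language-neutral illustration of the advice above (this is Python, not PROC VARMAX syntax), here is a minimal sketch that flags outliers using only the training residuals and builds one 0/1 dummy per flagged time point, leaving every test-region entry at 0. The function name `outlier_dummies`, the median/MAD rule, and the threshold `k=3.5` are assumptions for the example; the thread does not say how the outliers were identified.

```python
from statistics import median

def outlier_dummies(train_residuals, n_test, k=3.5):
    """Build one 0/1 dummy column per outlier found in the TRAINING residuals.

    Assumed rule (not from the thread): a residual is an outlier when it lies
    more than k scaled MADs away from the median of the training residuals.
    Every column has length len(train_residuals) + n_test, and all entries in
    the test region are 0 because test residuals are never inspected.
    """
    center = median(train_residuals)
    mad = median(abs(r - center) for r in train_residuals)
    scale = 1.4826 * mad  # makes the MAD comparable to a standard deviation
    if scale == 0:
        return {}  # no spread in the residuals, nothing can be flagged

    flagged = [t for t, r in enumerate(train_residuals)
               if abs(r - center) > k * scale]
    total = len(train_residuals) + n_test
    # One dummy per outlier: 1 at its own time index, 0 everywhere else.
    return {t: [1 if i == t else 0 for i in range(total)] for t in flagged}


if __name__ == "__main__":
    residuals = [0.1, -0.2, 0.0, 0.15, -0.1, 8.0, 0.05, -0.05, 0.1, -0.15]
    for t, column in outlier_dummies(residuals, n_test=3).items():
        print(t, column)
```

A robust rule is used here because, in a short series, a single large residual inflates the ordinary standard deviation enough to hide itself. Each resulting column would then be passed to the model as one exogenous variable.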

It is important that (a few) outliers do not disrupt the training process so that you can perform pattern recognition optimally. But once your model is industrialized (deployed in production), you really have to hope that outliers no longer occur (a vain hope, of course).

Perhaps you will find some sort of forecast driver that underlies the outliers. If that is the case, you must include that forecast driver as an independent variable in the model. And perhaps you know the future of that independent variable?

BR, Koen

tugasakhir

Thank you.

sbxkoenk

My pleasure.

To be honest, I wasn't quite sure what you meant by "testing data". A hold-out set or an out-of-sample region?
But it doesn’t matter for my answer.

There are 1, 2 or 3 regions when you want to model a univariate time series:

The in-sample (training) region,
the hold-out (validation) region and
the out-of-sample (testing) region.
(I often drop the latter to ensure there is enough data left for modelling)
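The three regions listed above can be sketched as a simple chronological split. This is a generic Python illustration, not SAS code; the function name `split_regions` and its parameters are made up for the example, and setting `n_out_of_sample=0` corresponds to dropping the third region.

```python
def split_regions(series, n_holdout, n_out_of_sample=0):
    """Split a time-ordered series into in-sample, hold-out and out-of-sample parts.

    The split is chronological: the earliest observations form the in-sample
    (training) region, followed by the hold-out (validation) region, and the
    most recent observations form the out-of-sample (testing) region.
    """
    if n_holdout < 0 or n_out_of_sample < 0:
        raise ValueError("region sizes must be non-negative")
    n_in = len(series) - n_holdout - n_out_of_sample
    if n_in <= 0:
        raise ValueError("no observations left for the in-sample region")

    in_sample = series[:n_in]
    holdout = series[n_in:n_in + n_holdout]
    out_of_sample = series[n_in + n_holdout:]
    return in_sample, holdout, out_of_sample


if __name__ == "__main__":
    # Two regions only: 100 in-sample observations and 12 hold-out observations.
    train, valid, test = split_regions(list(range(112)), n_holdout=12)
    print(len(train), len(valid), len(test))
```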

In any case, it is better to smooth out outliers (additive, level shift) only in the in-sample region to prevent them from distorting parameter estimation and to allow for proper pattern recognition.

And thus ... it is better to not remove shocks in the hold-out and out-of-sample regions for the reasons mentioned earlier. After all, the hold-out set is meant to mimic "unseen" future data.

BR, Koen

tugasakhir

In my case, I also only use 2 regions, just like you mentioned.

Btw, I have a follow-up question regarding the implementation in SAS PROC VARMAX. I’m trying to model the dummy variables as exogenous variables. However, when I run the procedure using only my in-sample data (e.g., 100 observations), SAS does not produce any forecast output. I then tried to generate the forecasts manually using the estimated coefficients, but the forecast values gradually became smaller and even turned negative over time. I’m wondering whether I made a mistake in the data setup. Should the dataset already include the future time ID rows (with missing values for Y but fill the future dummy (X) values with 0s)?

sbxkoenk

@tugasakhir wrote:
Should the dataset already include the future time ID rows (with missing values for Y but fill the future dummy (X) values with 0s)?

Yes!
See
Problem Note 37474: Incorrect or no forecasts are produced when future values of exogenous variables are not provided
https://support.sas.com/kb/37/474.html

So, just put zeros as future values for your outlier dummies.
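The data layout described here (future rows with missing Y and zero-valued dummies) can be sketched as follows. This is a Python illustration of the table shape only, not SAS code: `append_future_rows` and the column names are hypothetical, the time ID is assumed to be a plain integer index, and `None` stands in for a SAS missing value.

```python
def append_future_rows(rows, horizon, y_cols, dummy_cols, time_col="t"):
    """Return the rows extended by `horizon` future rows.

    In each future row the endogenous variables (y_cols) are missing (None)
    and the outlier dummies (dummy_cols) are 0, so the exogenous inputs are
    known for every period to be forecast. Assumes an integer time index
    that increases by 1 per observation.
    """
    last_t = rows[-1][time_col]
    future = []
    for h in range(1, horizon + 1):
        row = {time_col: last_t + h}
        row.update({col: None for col in y_cols})   # Y is unknown in the future
        row.update({col: 0 for col in dummy_cols})  # no outliers assumed ahead
        future.append(row)
    return rows + future  # new list; the input rows are left unchanged


if __name__ == "__main__":
    history = [
        {"t": 1, "y1": 10.2, "y2": 5.1, "d2": 0},
        {"t": 2, "y1": 25.0, "y2": 5.3, "d2": 1},  # flagged outlier at t = 2
        {"t": 3, "y1": 10.9, "y2": 5.2, "d2": 0},
    ]
    for row in append_future_rows(history, horizon=2,
                                  y_cols=["y1", "y2"], dummy_cols=["d2"]):
        print(row)
```

The same shape in a SAS data set would have one row per future time ID, with the Y columns missing and each dummy column set to 0.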

BR, Koen

tugasakhir

OK, thank you so much for your help.
