Jade_SAS
Pyrite | Level 9

Hi All,

 

    I have a quick question:

    Is it necessary to separate the original data set into a training set, a validation set, and a test set when doing forecasting in Forecast Studio? What's the general practice here? Thank you!

 

Thanks,

Jade

 

 

1 ACCEPTED SOLUTION
alexchien
Pyrite | Level 9

Hi Jade,

Forecast Studio supports using validation and test data sets to evaluate model performance. You can set HOLDOUT (number of periods) or HOLDOUT_PCT (percent of total periods) to use a validation data set to diagnose and select models. You can use BACK (number of periods) to use a test data set to evaluate true model performance, since the test data is not used in any way during the modeling process.

In the data mining world, the training data is partitioned randomly (or in a fashion that is not time related) to form the training, validation, and test data. Models are built on the training data and guarded by the validation data, and that's the end of the modeling process. In forecasting, however, the data has to be partitioned by time sequence, since you are not building a model to forecast some random period in the past; the model is built to forecast future periods in sequence. The test data has to be the most recent observations, then the validation data, and then the training data. You might lose the recency effect (the most recent data are typically important for forecasting) if you are holding out data for validation (or as test data).

In Forecast Studio, the validation data set (via the HOLDOUT option) is used during diagnosis to create the model candidate list. Then the training + validation data are used to select the best model from the candidate list and generate forecasts. This is the most common practice. However, if the data has a long enough history, you do have the luxury of comparing models against the test data (via the BACK option). But I would use BACK to report the expected model performance, and then set BACK to 0 and generate forecasts with the selected model in order to utilize the latest data.

sorry for the long reply... have a nice weekend

alex 
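
For readers working outside the Forecast Studio GUI, here is a minimal sketch of the same workflow in batch code using PROC ESM (the data set WORK.SALES, its DATE and SALES variables, and the ADDWINTERS model are all hypothetical placeholders): BACK= holds back the most recent periods as a test set for an honest accuracy check, and a second run with BACK=0 refits on all of the data before generating the final forecasts. The HOLDOUT / HOLDOUT_PCT settings described above are Forecast Studio settings used during model diagnosis.

/* Step 1: hold back the last 6 months as a test set (BACK=6) */
/* and compare the forecasts against them.                    */
proc esm data=work.sales back=6 lead=6
         print=statistics outfor=work.eval;
   id date interval=month;
   forecast sales / model=addwinters;
run;

/* Step 2: after choosing a model, set BACK=0 so the most      */
/* recent data are used in fitting, and forecast 12 months out. */
proc esm data=work.sales back=0 lead=12 outfor=work.forecasts;
   id date interval=month;
   forecast sales / model=addwinters;
run;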


8 REPLIES
Reeza
Super User

Yes, it's necessary, and yes, it's standard procedure when doing predictive modeling to split the data three ways.

Ksharp
Super User

@Reeza It is a forecasting model, not a predictive model (which generally exists in data mining).

Jade_SAS
Pyrite | Level 9

Yes, I am asking whether it's general practice to have training, validation, and test sets for forecasting models. Thank you!

 

Reeza
Super User

Do you have a model already built that you're forecasting, or are you building a model?

If you're building a model, the current standard is the three-way split. This is more of an industry standard than a SAS rule.

 

Here's a video from Coursera that describes why this is done. If you prefer reading, there's a text transcript below the video.

 

https://www.coursera.org/learn/machine-learning/lecture/QGKbr/model-selection-and-train-validation-t...
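
As a concrete illustration of the split being discussed, here is a minimal DATA step sketch of a three-way partition for time series data (the data set WORK.SALES, its DATE variable, and the cutoff dates are all hypothetical). Note that, as alexchien explains above, the partition for forecasting is by time rather than at random: the test set is the most recent data.

data train validate test;
   set work.sales;                                    /* assumed sorted by DATE  */
   if date < '01JAN2018'd then output train;          /* oldest observations     */
   else if date < '01JAN2019'd then output validate;  /* next most recent        */
   else output test;                                  /* most recent = test set  */
run;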


ccaulkins9
Pyrite | Level 9
Forecasting models are a part of Data Mining.
e-SAS regards,

