Jade_SAS
Pyrite | Level 9

Hi All,

 

    I have a quick question:

    Is it necessary to separate the original data set into a training set, a validation set, and a test set when doing forecasting in Forecast Studio? What's the general practice here? Thank you!

 

Thanks,

Jade

 

 

8 REPLIES
Reeza
Super User

Yes, it's necessary, and yes, it's standard procedure in predictive modeling to split the data three ways.

Ksharp
Super User

@Reeza It is a forecasting model, not a predictive model (as generally found in data mining).

Jade_SAS
Pyrite | Level 9

Yes, I am asking whether it is general practice to have training, validation, and test sets for forecasting models. Thank you!

 

Reeza
Super User

Do you have a model already built that you're forecasting or are you building a model?

If you're building a model, the current standard is the three-way split. This is more of an industry standard than a SAS rule.
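
As an aside, a minimal sketch of what such a three-way split looks like in SAS code (the data set and fractions here are just placeholders; for forecasting the split has to respect time order instead, as discussed further down in the thread):

/* Sketch only: a conventional 60/20/20 random partition for a
   predictive-modeling project (SASHELP.CLASS is just a stand-in) */
data train valid test;
   set sashelp.class;
   if _n_ = 1 then call streaminit(1234);   /* reproducible random split */
   u = rand('uniform');
   if u < 0.6      then output train;       /* 60% training   */
   else if u < 0.8 then output valid;       /* 20% validation */
   else                 output test;        /* 20% test       */
   drop u;
run;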

 

Here's a video from Coursera that describes why this is done. If you prefer reading, there's a text transcript below the video.

 

https://www.coursera.org/learn/machine-learning/lecture/QGKbr/model-selection-and-train-validation-t...

alexchien
Pyrite | Level 9

Hi Jade, Forecast Studio supports using validation and test data sets to evaluate model performance. You can set HOLDOUT (number of periods) or HOLDOUT_PCT (percent of total periods) to use a validation data set to diagnose and select models. You can use BACK (number of periods) to use a test data set to evaluate true model performance, since the test data is not used in any way during the modeling process.

In the data mining world, the data is partitioned randomly (or in some fashion that is not time related) to form the training, validation, and test sets. Models are built on the training data and guarded by the validation data, and that's the end of the modeling process. In forecasting, however, the data has to be partitioned in time sequence, since you are not building a model to forecast some random period in the past; the model is built to forecast future periods in sequence. The test data has to be the most recent observations, then the validation data, and then the training data. You might lose the recency effect (the most recent data are typically the most important for forecasting) if you are holding out data for validation (or as test data).

In Forecast Studio, the validation data (via the HOLDOUT option) is used in the diagnose step to create the model candidate list. Then the training + validation data are used to select the best model from the candidate list and generate forecasts. This is the most common practice. However, if the data has a long enough history, you do have the luxury of comparing models against the test data (via the BACK option). But I would use BACK only for reporting the expected model performance, and then set BACK to 0 and generate forecasts with the selected model in order to utilize the latest data.
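
For illustration, the equivalent options look roughly like the PROC HPF sketch below (Forecast Studio is built on the high-performance forecasting procedures and exposes these as project settings; exact option names can differ by release, so treat this as a sketch rather than the code the GUI generates):

/* Sketch only, using the sample SASHELP.AIR series:
   BACK reserves a test window, HOLDOUT a validation window */
proc hpf data=sashelp.air outfor=work.fc
         lead=12      /* forecast 12 future periods                   */
         back=12;     /* reserve the last 12 periods as the test set  */
   id date interval=month;
   forecast air / model=bestall   /* diagnose and compare candidate models  */
                  holdout=12;     /* last 12 in-sample periods = validation */
run;

For the final forecast run you would set BACK to 0 so the selected model is re-fit on the full history, as described above.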

sorry for the long reply... have a nice weekend

alex 

ccaulkins9
Pyrite | Level 9
Models are a part of data mining.
e-SAS regards,

