Solved: Considerations on Data Splitting

pvareschi · Posted 06-08-2020 07:36 AM

Re: Predictive Modeling Using Logistic Regression

Apologies if this is not directly related to the topics covered in the course text (page 1.19).

After splitting the data and identifying the best model based on the performance on the validation dataset, would it make sense to merge together the training and validation datasets and re-fit the chosen model on the full set of observations to obtain more accurate estimates of its parameters?

Is this approach used in practice? If so, I can see how it would work for a regression or neural network model, but what about decision trees? Even if the inputs were restricted to those selected in the initial fit, the split points could change. Would that be acceptable?
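To make the decision tree concern concrete, here is a minimal sketch in plain Python (not SAS, and not from the course text). It fits a one-split tree (a stump) to a hypothetical training partition and then to the pooled training and validation data. The data, the partition sizes, and the helper names `sse`, `best_split`, and `make_data` are all assumptions made up for illustration.

```python
import random

def sse(values):
    """Sum of squared deviations from the mean."""
    m = sum(values) / len(values)
    return sum((v - m) ** 2 for v in values)

def best_split(xs, ys):
    """Threshold on x that minimizes total SSE for a one-split tree (a stump)."""
    pairs = sorted(zip(xs, ys))
    best_t, best_err = None, float("inf")
    for i in range(1, len(pairs)):
        lo, hi = pairs[i - 1][0], pairs[i][0]
        if lo == hi:
            continue  # equal x values cannot be separated by a threshold
        err = sse([y for _, y in pairs[:i]]) + sse([y for _, y in pairs[i:]])
        if err < best_err:
            best_t, best_err = (lo + hi) / 2, err
    return best_t

# Synthetic data, made up for illustration: a noisy step at x = 5.
rng = random.Random(7)

def make_data(n):
    xs = [rng.uniform(0, 10) for _ in range(n)]
    ys = [(1.0 if x > 5 else 0.0) + rng.gauss(0, 0.4) for x in xs]
    return xs, ys

train_x, train_y = make_data(60)   # hypothetical training partition
valid_x, valid_y = make_data(40)   # hypothetical validation partition

split_train = best_split(train_x, train_y)
split_pooled = best_split(train_x + valid_x, train_y + valid_y)
print("split on training only:", split_train)
print("split on pooled data:  ", split_pooled)
```

The two thresholds may or may not coincide on a given draw. The point is only that the refit tree is free to choose a different split, so it is no longer exactly the model that was validated.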

sasmlp · Posted 06-10-2020 12:07 PM

I do not recommend combining the training and validation data sets to obtain more precise parameter estimates. The main purpose of data splitting is to see how well the model generalizes to new data, and comparing the model's accuracy on the training and validation data sets addresses that question. If your sample is too small for data splitting to be feasible, you can try bootstrapping or k-fold cross-validation.
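For readers who have not used k-fold cross-validation, the idea can be sketched in a few lines of plain Python (not SAS). This is an illustration under stated assumptions: a one-input least-squares line stands in for the logistic regression model discussed in the course, the data are synthetic, and the helper names (`k_fold_indices`, `fit_line`, `mse`, `cross_validate`) are hypothetical.

```python
import random

def k_fold_indices(n, k, seed=0):
    """Shuffle indices 0..n-1 and deal them into k near-equal folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def fit_line(xs, ys):
    """Ordinary least squares for y = a + b*x with a single input."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    b = sxy / sxx
    return my - b * mx, b

def mse(model, xs, ys):
    """Mean squared prediction error of a fitted (intercept, slope) pair."""
    a, b = model
    return sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys)) / len(xs)

def cross_validate(xs, ys, k=5, seed=0):
    """Average held-out MSE over k folds; each row is held out exactly once."""
    scores = []
    for held_out in k_fold_indices(len(xs), k, seed):
        held = set(held_out)
        train = [i for i in range(len(xs)) if i not in held]
        model = fit_line([xs[i] for i in train], [ys[i] for i in train])
        scores.append(mse(model, [xs[i] for i in held_out],
                          [ys[i] for i in held_out]))
    return sum(scores) / k

# Synthetic data, made up for illustration only.
rng = random.Random(1)
xs = [rng.uniform(0, 10) for _ in range(40)]
ys = [2.0 + 0.5 * x + rng.gauss(0, 1) for x in xs]

cv_mse = cross_validate(xs, ys, k=5)
print("5-fold cross-validated MSE:", round(cv_mse, 3))
```

Every observation is used for fitting in k-1 folds and for assessment in one, which is why this is attractive when the sample is too small to set aside a separate validation partition.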

PaigeMiller · Posted 06-08-2020 07:39 AM

would it make sense to merge together the training and validation datasets and re-fit the chosen model on the full set of observations to obtain more accurate estimates of its parameters

Doing this would result in more precise (not more accurate) estimates, because of the larger sample size. But you lose your protection against over-fitting, which is what splitting the data provides.
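The precise-versus-accurate distinction can be seen in a small simulation, sketched here in plain Python. It assumes a deliberately simple setting: the "parameter" is a population mean of 5.0 with standard deviation 2.0, and the partition sizes (600 training rows, 1000 pooled rows) are made up for illustration. The helper name `sample_means` is hypothetical.

```python
import random
import statistics

def sample_means(n, reps=500, seed=0):
    """Estimate the mean from `reps` independent samples of size n."""
    rng = random.Random(seed)
    return [statistics.fmean(rng.gauss(5.0, 2.0) for _ in range(n))
            for _ in range(reps)]

train_only = sample_means(600)        # hypothetical training partition size
pooled = sample_means(1000, seed=1)   # hypothetical training + validation size

for label, est in (("training only", train_only), ("pooled", pooled)):
    # Center of the estimates reflects accuracy; their spread reflects precision.
    print(label,
          "center:", round(statistics.fmean(est), 3),
          "spread:", round(statistics.stdev(est), 4))
```

Both sets of estimates center on the true value, so pooling does not make them more accurate. The spread shrinks roughly in proportion to 1/sqrt(n), which is the gain in precision. What the simulation cannot show is the cost: once the validation rows are used for fitting, there is nothing left to check the refit model against.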

--
Paige Miller

gcjfernandez · Posted 06-11-2020 11:43 AM

In predictive modeling, the selected model and its parameters are always data-specific. Therefore, I would recommend taking the most recent "TEST" data, comparing the model selected with the validation partition against the model refit on the pooled data, and choosing the champion from the two.
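One way to read this suggestion is sketched below in plain Python (not SAS). The same model form is fit once on the training partition and once on the pooled training and validation data, and both fits are scored on a separate test set. The data are synthetic, a one-input least-squares line stands in for the real model, and the names `fit_line`, `mse`, and `make_data` are hypothetical.

```python
import random

def fit_line(xs, ys):
    """Ordinary least squares for y = a + b*x with a single input."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    b = sxy / sxx
    return my - b * mx, b

def mse(model, xs, ys):
    """Mean squared prediction error of a fitted (intercept, slope) pair."""
    a, b = model
    return sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys)) / len(xs)

def make_data(n, rng):
    """Synthetic rows, made up for illustration only."""
    xs = [rng.uniform(0, 10) for _ in range(n)]
    ys = [2.0 + 0.5 * x + rng.gauss(0, 1) for x in xs]
    return xs, ys

rng = random.Random(42)
train_x, train_y = make_data(60, rng)
valid_x, valid_y = make_data(40, rng)
test_x, test_y = make_data(50, rng)   # stands in for the most recent "TEST" data

candidates = {
    "training only": fit_line(train_x, train_y),
    "pooled": fit_line(train_x + valid_x, train_y + valid_y),
}

# Score both fits on data that neither of them saw during fitting.
test_mse = {name: mse(model, test_x, test_y) for name, model in candidates.items()}
champion = min(test_mse, key=test_mse.get)
print(test_mse)
print("champion:", champion)
```

The test rows must stay out of both fits. Otherwise the comparison has the same weakness as judging the pooled model on data it was trained on.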
