pvareschi
Quartz | Level 8

Re: Predictive Modeling Using Logistic Regression

Apologies if this is not directly related to the topics covered in the course text (page 1.19).

After splitting the data and identifying the best model based on its performance on the validation dataset, would it make sense to merge the training and validation datasets and re-fit the chosen model on the full set of observations to obtain more accurate estimates of its parameters?

Is this approach used in practice? If so, I can see how it would work for a regression or neural network model, but what about decision trees? Even if the inputs were constrained to those found by the initial fit, the split points could still change: would that be acceptable?
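To make the question concrete, here is a minimal sketch of the workflow I have in mind, written in Python with scikit-learn rather than the SAS procedures from the course; the data and candidate models are made up purely for illustration.

    # Hypothetical sketch (Python/scikit-learn, simulated data) of the workflow
    # in question: pick a model on the validation split, then refit it on the
    # pooled training + validation observations.
    import numpy as np
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import roc_auc_score
    from sklearn.model_selection import train_test_split

    rng = np.random.default_rng(0)
    X = rng.normal(size=(1000, 5))                    # made-up inputs
    y = (X[:, 0] + rng.normal(size=1000) > 0).astype(int)

    X_train, X_valid, y_train, y_valid = train_test_split(
        X, y, test_size=0.3, stratify=y, random_state=0)

    # Step 1: choose among candidate models by validation performance.
    fits = {c: LogisticRegression(C=c, max_iter=1000).fit(X_train, y_train)
            for c in (0.01, 0.1, 1.0)}
    best_c = max(fits, key=lambda c: roc_auc_score(
        y_valid, fits[c].predict_proba(X_valid)[:, 1]))

    # Step 2 (the step I am asking about): refit the chosen model on the
    # pooled data. Note there is no honest validation estimate left after this.
    final_model = LogisticRegression(C=best_c, max_iter=1000).fit(
        np.vstack([X_train, X_valid]), np.concatenate([y_train, y_valid]))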


3 REPLIES
PaigeMiller
Diamond | Level 26

"would it make sense to merge the training and validation datasets and re-fit the chosen model on the full set of observations to obtain more accurate estimates of its parameters?"

Doing this would result in more precise (not more accurate) estimates, because of the larger sample size. But you lose your protection against over-fitting, which is what splitting the data provides.

--
Paige Miller
sasmlp
SAS Employee (Accepted Solution)

I do not recommend combining the training and validation data sets to obtain more precise parameter estimates. The main purpose of data splitting is to see how well the model generalizes to new data, and comparing the model's accuracy on the training and validation data sets addresses that question. If the sample is too small for data splitting to be feasible, you can try bootstrapping or K-fold cross-validation.
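For anyone who wants to see the small-sample alternative outside of SAS, a minimal K-fold cross-validation sketch in Python with scikit-learn follows; the data are simulated purely for illustration.

    # Minimal K-fold cross-validation sketch (Python/scikit-learn, simulated
    # data): every observation is used for both fitting and assessment, so no
    # separate validation partition is needed when the sample is small.
    import numpy as np
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import StratifiedKFold, cross_val_score

    rng = np.random.default_rng(1)
    X = rng.normal(size=(200, 4))                     # small made-up sample
    y = (X[:, 0] + rng.normal(size=200) > 0).astype(int)

    cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)
    scores = cross_val_score(LogisticRegression(max_iter=1000), X, y,
                             cv=cv, scoring="roc_auc")
    print(scores.mean(), scores.std())                # generalization estimate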

gcjfernandez
SAS Employee

In predictive modeling, the selected model and its parameters are always specific to the data used to fit them. I would therefore recommend holding out the most recent "TEST" data, comparing the model selected on the validation partition against the one refit on the pooled data, and selecting the champion based on test performance.
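As a rough illustration of that comparison (again in Python with scikit-learn rather than SAS, on simulated data):

    # Rough sketch (Python/scikit-learn, simulated data): hold out a TEST
    # partition, then compare the model fit on training data alone against one
    # refit on pooled training + validation, and keep the champion.
    import numpy as np
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import roc_auc_score
    from sklearn.model_selection import train_test_split

    rng = np.random.default_rng(2)
    X = rng.normal(size=(1500, 5))
    y = (X[:, 0] - X[:, 1] + rng.normal(size=1500) > 0).astype(int)

    # 50% train / 25% validation / 25% test
    X_tr, X_rest, y_tr, y_rest = train_test_split(X, y, test_size=0.5, random_state=2)
    X_va, X_te, y_va, y_te = train_test_split(X_rest, y_rest, test_size=0.5, random_state=2)

    train_only = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
    pooled = LogisticRegression(max_iter=1000).fit(
        np.vstack([X_tr, X_va]), np.concatenate([y_tr, y_va]))

    for name, model in [("train only", train_only), ("pooled", pooled)]:
        auc = roc_auc_score(y_te, model.predict_proba(X_te)[:, 1])
        print(name, round(auc, 3))                    # champion = higher test AUC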