Re: Predictive Modeling Using Logistic Regression
Apologies if this is not directly related to the topics covered in the course text (page 1.19).
After splitting the data and identifying the best model based on the performance on the validation dataset, would it make sense to merge together the training and validation datasets and re-fit the chosen model on the full set of observations to obtain more accurate estimates of its parameters?
Is this approach used in practice? If so, I can see how it would work for a regression or neural network model; however, what about decision trees? Even if the inputs were constrained to those found by the initial fitting, the split points could still change: would that be acceptable?
I do not recommend combining the training and validation data sets to obtain more precise parameter estimates. The main purpose of data splitting is to see how well the model generalizes to new data, and comparing the model's accuracy on the training and validation data sets addresses that question. If your sample size is too small for data splitting to be feasible, you can try bootstrapping or K-fold cross-validation instead.
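To make the K-fold suggestion concrete, here is a minimal sketch of K-fold cross-validation in plain Python. The `fit`/`score` interface, the toy data, and the constant-predictor "model" are all illustrative assumptions, not anything from the course material:

```python
# Minimal K-fold cross-validation sketch. The fit/score interface,
# toy data, and constant-mean "model" below are illustrative only.
from statistics import mean

def k_fold_indices(n, k):
    """Yield (train_idx, val_idx) index lists for k folds over n rows."""
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        val_idx = list(range(start, start + size))
        train_idx = [i for i in range(n) if i < start or i >= start + size]
        yield train_idx, val_idx
        start += size

def cross_validate(x, y, k, fit, score):
    """Fit on k-1 folds, score on the held-out fold, average the scores."""
    scores = []
    for train_idx, val_idx in k_fold_indices(len(x), k):
        model = fit([x[i] for i in train_idx], [y[i] for i in train_idx])
        scores.append(score(model,
                           [x[i] for i in val_idx],
                           [y[i] for i in val_idx]))
    return mean(scores)

# Toy example: the "model" is just the training-fold mean of y,
# scored by negative mean squared error on the validation fold.
x = list(range(10))
y = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
fit = lambda xs, ys: mean(ys)
score = lambda m, xs, ys: -mean((yi - m) ** 2 for yi in ys)
cv_score = cross_validate(x, y, 5, fit, score)
```

Every observation serves in a validation fold exactly once, so the averaged score estimates generalization without permanently sacrificing any data to a fixed validation partition.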
"would it make sense to merge together the training and validation datasets and re-fit the chosen model on the full set of observations to obtain more accurate estimates of its parameters?"
Doing this would result in more precise (not more accurate) estimates, because of the larger sample size. But you lose your protection against over-fitting, which is what splitting the data provides.
In predictive modeling, the selected model and its parameters are always data specific. I would therefore recommend holding out a recent "TEST" partition, scoring both the model selected on the validation partition and the model refit on the pooled data against it, and selecting the champion model.
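The champion-model comparison above can be sketched in a few lines. The `score` interface and the constant-predictor "models" are assumptions made for illustration, not part of the original reply:

```python
# Hedged sketch of picking a champion on a held-out test partition.
# The score interface and constant-predictor "models" are illustrative.
def pick_champion(models, x_test, y_test, score):
    """Return the candidate with the best score on the held-out test data."""
    return max(models, key=lambda m: score(m, x_test, y_test))

# Toy candidates: constant predictors 0.2 and 0.8, scored by
# negative total absolute error against the test targets.
models = [0.2, 0.8]
x_test = [None] * 4          # features unused by these toy models
y_test = [1, 1, 1, 0]
score = lambda m, xs, ys: -sum(abs(yi - m) for yi in ys)
champion = pick_champion(models, x_test, y_test, score)
```

Because the test partition played no role in either fitting or model selection, the champion's test score remains an honest estimate of out-of-sample performance.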