pvareschi
Quartz | Level 8

The lesson "What Is Separate Sampling?" under "Lesson 7: Model Assessment Using SAS Enterprise Miner" (module 3: Predictive Modeler using SAS Enterprise Miner) seems to imply that oversampling is done for efficiency reasons only, with minimal impact on the resulting model.

However, based on some tests I have done using the "INQ2005" dataset, using oversampling instead of the full population can have a significant impact on the final model. In particular, a comparison of two Decision Tree models (full dataset with primary proportion = 3.15% vs oversampling with primary proportion = 50%) showed the following results:

 

  1. Different optimal subtrees are selected (based on Average Squared Error, ASE); the model based on the full dataset results in a subtree with 26 leaves vs 16 for the model based on oversampling
  2. Some of the splitting variables selected differ; this is also confirmed by differences in the list of variables reported under "Variable Importance"
  3. More importantly, the performance (as measured by ASE) of the model based on the full sample shows a marked divergence between the training and validation datasets (pointing to overfitting), whereas the oversampled model behaves more consistently across the two
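For readers who want to reproduce the comparison in point 3 outside Enterprise Miner, here is a minimal Python sketch. It assumes ASE for a binary target is the mean of (actual - predicted probability)^2 over all cases; the function names `average_squared_error` and `overfit_gap` are hypothetical helpers, not part of any SAS product.

```python
def average_squared_error(targets, probabilities):
    """Mean of (actual - predicted)^2 over all cases.

    targets: 0/1 outcomes (1 = primary outcome).
    probabilities: predicted probability of the primary outcome.
    Assumption: this matches ASE as reported for a binary target.
    """
    if not targets or len(targets) != len(probabilities):
        raise ValueError("targets and probabilities must be non-empty and equal in length")
    squared_errors = [(y - p) ** 2 for y, p in zip(targets, probabilities)]
    return sum(squared_errors) / len(squared_errors)


def overfit_gap(train_ase, valid_ase):
    """Validation ASE minus training ASE; a large positive gap suggests overfitting."""
    return valid_ase - train_ase
```

Computing the gap for each candidate subtree on both the full and the oversampled data would make the divergence described above directly comparable, provided both models are assessed on data with the same primary proportion.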

 

Overall, my take on this is that oversampling is not just a matter of making the whole process more efficient: it also leads to "better models", in the sense that using a balanced sample (i.e. a 50/50 split between primary and secondary outcomes) seems to help the model give equal weight to positive and negative cases, resulting in more stable models (i.e. with less overfitting).
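To make the "balanced sample" concrete, here is a rough Python sketch of separate sampling, assuming each row is a dictionary with a 0/1 field named `target` (a hypothetical layout, not the actual INQ2005 schema). It draws the same number of cases from each outcome class, which for a rare primary outcome means keeping every primary case and a random subset of secondary cases.

```python
import random


def balanced_sample(rows, target_key="target", seed=0):
    """Return a 50/50 sample of primary (1) and secondary (0) cases.

    Assumption: rows is a list of dicts and rows[i][target_key] is 0 or 1.
    The size of each class in the output is the size of the rarer class.
    """
    primary = [row for row in rows if row[target_key] == 1]
    secondary = [row for row in rows if row[target_key] == 0]
    per_class = min(len(primary), len(secondary))
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    return rng.sample(primary, per_class) + rng.sample(secondary, per_class)
```

With a primary proportion of 3.15%, this keeps all primary cases and discards most secondary cases, so the training set becomes much smaller than the full population.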

 

I would appreciate hearing other opinions on the above.

Thanks

 

1 REPLY
gcjfernandez
SAS Employee
I agree with your final conclusion. In building predictive models, we are after the champion model that can score new data more accurately. (The goal is different from using SurveyLogistic for a population survey model or Proc Logistic for inferential statistics.)
Performing separate sampling or oversampling is similar to using an equal number of replicates in clinical studies. Therefore, my recommendation is to try a balanced oversample when your target variable is a rare event.
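One practical point when scoring new data with a model trained on a balanced oversample: the predicted probabilities reflect the 50% sample proportion rather than the population rate. The sketch below shows the standard prior-correction formula in Python, as an illustration only; `adjust_for_priors` is a hypothetical helper name, and the 3.15% population proportion is taken from the original post.

```python
def adjust_for_priors(p_sample, sample_prior, population_prior):
    """Rescale a probability from an oversampled model to the population scale.

    p_sample: predicted probability of the primary outcome from the oversampled model.
    sample_prior: primary proportion in the training sample (e.g. 0.5).
    population_prior: primary proportion in the population (e.g. 0.0315).
    """
    for name, value in (("sample_prior", sample_prior),
                        ("population_prior", population_prior)):
        if not 0.0 < value < 1.0:
            raise ValueError(f"{name} must be strictly between 0 and 1")
    if not 0.0 <= p_sample <= 1.0:
        raise ValueError("p_sample must be between 0 and 1")
    # Weight each class by the ratio of its population share to its sample share.
    primary = p_sample * population_prior / sample_prior
    secondary = (1.0 - p_sample) * (1.0 - population_prior) / (1.0 - sample_prior)
    return primary / (primary + secondary)
```

For example, a score of 0.5 from a model trained on a 50/50 sample maps back to roughly the 3.15% population rate, since the model is saying the case is no more likely than average to be a primary outcome.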

 
