Lesson "What Is Separate Sampling?" under "Lesson 7: Model Assessment Using SAS Enterprise Miner" (module 3: Predictive Modeler using SAS Enterprise Miner) seems to imply that oversampling is done for efficiency reasons only, with minimal impact on the resulting model.
However, based on some tests I have done using dataset "INQ2005", the use of oversampling vs the full population can have a significant impact on the final model. In particular, the comparison of two Decision Tree models (full dataset with primary proportion = 3.15% vs oversampling with primary proportion = 50%) has shown the following results:
- Different optimal sub-trees are selected (based on Average Square Error); the model based on the full dataset results in a subtree with 26 leaves vs 16 from the model based on oversampling
- Differences in (some of) the splitting variables selected; this is also confirmed by differences in the list of variables reported under "Variable Importance"
- More importantly, the ASE of the model based on the full sample diverges markedly between the training and validation datasets (pointing to overfitting), whereas the oversampled model behaves more stably
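To make the training vs validation comparison concrete, here is a minimal sketch of the check described above. It assumes ASE is the mean squared difference between the 0/1 outcome and the predicted probability; the function names `average_squared_error` and `overfit_gap` and the toy numbers are hypothetical and are not output from INQ2005 or from Enterprise Miner.

```python
def average_squared_error(actual, predicted):
    """Mean of (y - p)^2, where y is the 0/1 outcome and p the predicted probability.

    Assumption: this is the ASE definition being compared in the post.
    """
    if not actual or len(actual) != len(predicted):
        raise ValueError("actual and predicted must be non-empty and the same length")
    return sum((y - p) ** 2 for y, p in zip(actual, predicted)) / len(actual)


def overfit_gap(train_ase, valid_ase):
    """Validation ASE minus training ASE; a large positive gap hints at overfitting."""
    return valid_ase - train_ase


# Toy, made-up values purely for illustration (not results from any real model).
train_actual, train_pred = [1, 0, 0, 1], [0.9, 0.1, 0.2, 0.8]
valid_actual, valid_pred = [1, 0, 0, 1], [0.6, 0.4, 0.5, 0.3]

train_ase = average_squared_error(train_actual, train_pred)
valid_ase = average_squared_error(valid_actual, valid_pred)
print("train ASE:", train_ase)
print("valid ASE:", valid_ase)
print("gap:", overfit_gap(train_ase, valid_ase))
```

Computing the gap for each candidate subtree size, once per model, would show whether the full-sample tree really drifts apart faster than the oversampled one.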
Overall, my take on this is that oversampling is not just a matter of making the whole process more efficient: it also leads to "better models", in the sense that a balanced sample (i.e. a 50/50 split between primary and secondary outcomes) seems to help the model give equal importance to positive and negative cases, resulting in more stable models (i.e. with less overfitting).
I would appreciate hearing other opinions on the above.
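For anyone who wants to reproduce the balanced sample outside Enterprise Miner, a minimal sketch follows. It assumes the 50/50 sample is built by keeping every primary-outcome case and randomly drawing an equal number of secondary-outcome cases; `separate_sample`, `is_primary` and the synthetic population are hypothetical names and data, not the INQ2005 dataset or the Sample node's actual implementation.

```python
import random


def separate_sample(cases, is_primary, seed=0):
    """Keep all primary cases and draw an equal number of secondary cases at random.

    Assumption: this mirrors a 50/50 separate (balanced) sample.
    """
    rng = random.Random(seed)
    primary = [c for c in cases if is_primary(c)]
    secondary = [c for c in cases if not is_primary(c)]
    if len(secondary) < len(primary):
        raise ValueError("not enough secondary cases to match the primary cases")
    balanced = primary + rng.sample(secondary, len(primary))
    rng.shuffle(balanced)  # avoid having all primary cases first
    return balanced


# Synthetic population: 315 primary cases out of 10,000 (3.15%, as in the post).
population = [{"id": i, "target": 1 if i < 315 else 0} for i in range(10_000)]
balanced = separate_sample(population, lambda c: c["target"] == 1, seed=42)

n_primary = sum(c["target"] for c in balanced)
print(len(balanced), n_primary / len(balanced))
```

Note that probabilities predicted by a model trained on such a sample reflect the 50% sample proportion rather than the 3.15% population proportion.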
Thanks
Performing separate sampling (oversampling) is similar to using an equal number of replicates in clinical studies. Therefore, my recommendation is to try a balanced oversample when your target variable is a rare event.