The lesson "What Is Separate Sampling?" under "Lesson 7: Model Assessment Using SAS Enterprise Miner" (module 3: Predictive Modeler using SAS Enterprise Miner) seems to imply that oversampling is done for efficiency reasons only, with minimal impact on the resulting model.
However, based on some tests I have run on the "INQ2005" dataset, the choice between oversampling and the full population can have a significant impact on the final model. In particular, comparing two Decision Tree models (full dataset with a primary proportion of 3.15% vs an oversampled dataset with a primary proportion of 50%) showed the following:
- Different optimal subtrees are selected (based on Average Square Error): the full-dataset model selects a subtree with 26 leaves versus 16 for the oversampled model
- Differences in some of the splitting variables selected; this is also confirmed by differences in the lists reported under "Variable Importance"
- More importantly, the performance (as measured by ASE) of the full-dataset model shows a marked divergence between the training and validation datasets (pointing to overfitting), whereas training and validation performance track each other much more closely for the oversampled model
Overall, my take on this is that oversampling is not just a matter of making the process more efficient: it appears to lead to "better models", in the sense that a balanced sample (i.e. a 50/50 split between the primary and secondary outcomes) seems to help the model give equal weight to positive and negative cases, resulting in more stable models (i.e. less overfitting).
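For concreteness, here is a minimal sketch (in Python, outside Enterprise Miner) of the balanced-sample construction I mean: keep every rare primary-outcome case and draw a random subset of secondary cases so that the primary proportion hits the target. The function name, row layout, and 3%-event toy population are all illustrative assumptions, not Enterprise Miner's actual implementation of separate sampling.

```python
import random

def separate_sample(rows, is_event, target_prop=0.5, seed=42):
    """Build a balanced sample: keep every rare 'primary' (event) row and
    randomly sample 'secondary' (non-event) rows until events make up
    target_prop of the result. Illustrative sketch only."""
    rng = random.Random(seed)
    events = [r for r in rows if is_event(r)]
    non_events = [r for r in rows if not is_event(r)]
    # number of non-events needed so events form target_prop of the sample
    n_non = round(len(events) * (1 - target_prop) / target_prop)
    sample = events + rng.sample(non_events, min(n_non, len(non_events)))
    rng.shuffle(sample)
    return sample

# toy population with roughly 3% events, mimicking the 3.15% primary
# proportion in INQ2005 (the data itself is synthetic)
population = [{"id": i, "event": i % 32 == 0} for i in range(10_000)]
balanced = separate_sample(population, lambda r: r["event"])
```

Note that when starting from a rare event, reaching 50/50 this way effectively undersamples the majority class; either way, a model fit on the balanced sample sees the two outcomes in equal proportion rather than at the population rate.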
I would appreciate hearing other opinions on the above.
Thanks