The lesson "What Is Separate Sampling?" under "Lesson 7: Model Assessment Using SAS Enterprise Miner" (module 3: Predictive Modeler using SAS Enterprise Miner) seems to imply that oversampling is done for efficiency reasons only, with minimal impact on the resulting model.
However, based on some tests I have run on the "INQ2005" dataset, the choice between oversampling and the full population can have a significant impact on the final model. In particular, comparing two Decision Tree models (full dataset with a primary proportion of 3.15% vs an oversampled dataset with a primary proportion of 50%) showed the following:
- Different optimal subtrees are selected (based on Average Square Error): the full-dataset model selects a subtree with 26 leaves versus 16 for the oversampled model
- Differences in some of the splitting variables selected; this is also confirmed by differences in the lists reported under "Variable Importance"
- More importantly, the performance (as measured by ASE) of the full-dataset model shows a marked divergence between the training and validation datasets (pointing to overfitting), whereas training and validation performance track each other much more closely for the oversampled model
Overall, my take on this is that oversampling is not just a matter of making the process more efficient: it appears to lead to "better models", in the sense that a balanced sample (i.e. a 50/50 split between the primary and secondary outcomes) seems to help the model give equal weight to positive and negative cases, resulting in more stable models (i.e. less overfitting).
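For concreteness, here is a minimal sketch (in Python, outside Enterprise Miner) of the balanced-sample construction I mean: keep every rare primary-outcome case and draw a random subset of secondary cases so that the primary proportion hits the target. The function name, row layout, and 3%-event toy population are all illustrative assumptions, not Enterprise Miner's actual implementation of separate sampling.

```python
import random

def separate_sample(rows, is_event, target_prop=0.5, seed=42):
    """Build a balanced sample: keep every rare 'primary' (event) row and
    randomly sample 'secondary' (non-event) rows until events make up
    target_prop of the result. Illustrative sketch only."""
    rng = random.Random(seed)
    events = [r for r in rows if is_event(r)]
    non_events = [r for r in rows if not is_event(r)]
    # number of non-events needed so events form target_prop of the sample
    n_non = round(len(events) * (1 - target_prop) / target_prop)
    sample = events + rng.sample(non_events, min(n_non, len(non_events)))
    rng.shuffle(sample)
    return sample

# toy population with roughly 3% events, mimicking the 3.15% primary
# proportion in INQ2005 (the data itself is synthetic)
population = [{"id": i, "event": i % 32 == 0} for i in range(10_000)]
balanced = separate_sample(population, lambda r: r["event"])
```

Note that when starting from a rare event, reaching 50/50 this way effectively undersamples the majority class; either way, a model fit on the balanced sample sees the two outcomes in equal proportion rather than at the population rate.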
I would appreciate hearing other opinions on the above.
Thanks