Hi EM experts:
I am doing a classification research project with an extremely unbalanced design (target:nontarget ratio > 1:1000) and a huge dataset. Obviously I need to do some over-/under-sampling of the target/nontarget records, and I have run into this dilemma: should I do the over-/under-sampling before I split the dataset into train/validation, or after?
The reason I am asking is that I can't seem to find a good compromise, and a number of articles (e.g. this one on data leakage and this one on train-test contamination) specifically point out:
“You can minimize data leakage in machine learning in many different ways. You can start by partitioning your data into training and test subsets before engaging in any preprocessing.”
To avoid data leakage, I should split the dataset into train/validation before any over-/under-sampling is performed, but that creates very different target:nontarget ratios in the final training vs. validation datasets. How should I deal with this in EM? Using a raw sample of the population as the validation dataset did not work for me: because the evaluation metric in EM is misclassification rate (not balanced accuracy or F1, which account for recall/precision on the rare class), no useful final model gets selected. With a 1:1000 event ratio, the trivial model that classifies everything as nontarget already has a misclassification rate of about 0.1%, so the default model is always "best".
If I do the over-/under-sampling first and then split into train/validation sets, it works fine except that I have a data leakage problem. So I have to use a separate test set (production-level data with the same target ratio as the population) to pick the best models and adjust the thresholds for a production environment.
I also tried not doing any over-/under-sampling at all, keeping the same event ratio as the population for both the training and validation sets while adjusting the prior distribution, etc. It works, but the random error is quite high, since with so few events in each partition the sampling error is large.
Any suggestions?
I would do it like this.
Despite your very rare target event, it appears you have room for a test set. I would split that hold-out set off beforehand (e.g. 20% of the total number of observations, chosen at random). With the remaining 80% of the observations you will model. First partition in a stratified way (with 2 strata: event and non-event), assigning 65% (of that 80%) to a training set and 35% (of that 80%) to a validation set. Then (as a next step!) you can under-sample the non-events so that you go from a ratio of 1:1000 to, e.g., a ratio of 2.5:100.
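For what it's worth, here is a rough Base SAS sketch of that partitioning scheme outside of EM's nodes. The dataset WORK.FULL, the binary flag TARGET (1 = event, 0 = non-event), the seeds and the 39:1 under-sampling factor (which gives roughly 2.5 events per 100 observations) are all placeholders for illustration, not your actual setup.

proc sort data=work.full out=work.full_s;   /* SURVEYSELECT needs input sorted by the strata variable */
   by target;
run;

/* 1. Split off a stratified 20% hold-out (test) set. */
proc surveyselect data=work.full_s out=work.split1 samprate=0.20
                  seed=2023 outall;
   strata target;
run;

data work.test work.modeling;
   set work.split1;
   if selected = 1 then output work.test;
   else output work.modeling;
   drop selected;
run;

/* 2. Stratified 65/35 train/validation partition of the remaining 80%. */
proc sort data=work.modeling; by target; run;

proc surveyselect data=work.modeling out=work.split2 samprate=0.65
                  seed=2024 outall;
   strata target;
run;

data work.train work.valid;
   set work.split2;
   if selected = 1 then output work.train;
   else output work.valid;
   drop selected;
run;

/* 3. Under-sample the non-events in the training partition:
      keep all events plus ~39 non-events per event. */
proc sql noprint;
   select count(*) into :n_events trimmed
   from work.train where target = 1;
quit;

proc surveyselect data=work.train(where=(target=0)) out=work.train_nonevt
                  sampsize=%eval(&n_events * 39) seed=2025;
run;

data work.train_us;
   set work.train(where=(target=1)) work.train_nonevt;
run;

Inside EM itself, a stratified Data Partition node followed by a Sample node on the training data would normally play these roles; the code just makes the order of the steps explicit.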
Afterwards, do not forget to adjust the posterior probabilities to the real priors (instead of the sample priors). I don't know how important proper calibration of the probabilities is in your context.
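In case it helps, the standard correction from sample priors back to population priors can be written as a short DATA step. This is only a sketch: the scored dataset WORK.SCORED, the posterior variable P_TARGET1, and the 0.001 / 0.025 event rates are made-up placeholders you would replace with your own values.

/* Population vs. under-sampled event rates (placeholders). */
%let pi1  = 0.001;   /* event rate in the full population (~1 in 1000) */
%let rho1 = 0.025;   /* event rate in the under-sampled training data  */

data work.scored_adj;
   set work.scored;                       /* hypothetical scored data set */
   /* Re-weight the posterior by the ratio of true to sample priors,
      then renormalize so the two class probabilities sum to 1. */
   num = p_target1 * (&pi1 / &rho1);
   den = num + (1 - p_target1) * ((1 - &pi1) / (1 - &rho1));
   p_target1_adj = num / den;
   drop num den;
run;

As far as I know, specifying the true prior probabilities in EM's decision processing performs the same adjustment automatically; the DATA step just shows the arithmetic behind it.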
You can then score the test set in a separate sub-flow (attached to the main flow via the Score node) and get the ultimate, honest estimate of the generalization error. Your hold-out set will tell you how well your model works on data it has never seen. However ... always keep this in mind ... you now have an out-of-sample error estimate, but not an out-of-time error estimate! In other words, your data could show covariate shift (in one or more variables) or concept drift, and then the performance in production will still end up deviating from the performance on the hold-out set.
Good luck with that.
Koen
Thanks Koen!
What you suggested is exactly what we have been doing as a practical approach. However, it occurs to me that the hold-out/test set is essentially being treated as a validation set if we do it this way, since we basically pick our models based on test-set performance metrics instead of the validation set. It is also quite a hassle to write macros to plot precision-recall curves and other performance metrics (e.g. ROC, confusion matrices) on the test set within EM, whether with the Cutoff node or with SAS code.
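For the record, a bare-bones way to get a precision-recall table from a scored test set without the Cutoff node might look like the sketch below. WORK.TEST_SCORED, TARGET and P_TARGET1_ADJ are placeholder names, and every distinct score is treated as a candidate cutoff.

/* Total number of events in the test set (denominator of recall). */
proc sql noprint;
   select sum(target = 1) into :n_pos trimmed from work.test_scored;
quit;

/* Sort by descending score; each row then corresponds to lowering the
   cutoff by one observation, so cumulative TP/FP give precision/recall. */
proc sort data=work.test_scored out=work.sorted;
   by descending p_target1_adj;
run;

data work.pr_curve;
   set work.sorted;
   tp + (target = 1);
   fp + (target = 0);
   cutoff    = p_target1_adj;
   precision = tp / (tp + fp);
   recall    = tp / &n_pos;
   keep cutoff tp fp precision recall;
run;

proc sgplot data=work.pr_curve;
   series x=recall y=precision;
   xaxis label="Recall";
   yaxis label="Precision";
run;

Using every distinct score as a cutoff is fine for a moderately sized test set; for a huge scored set you would thin the grid of cutoffs first.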
One thing I want to point out is that the model picked on the sub-sample, as you suggested, sometimes (not always) performs really poorly on the test/hold-out dataset due to the obvious bias of sub-sampling (large variance within the original population). That is the real reason I wanted to calibrate and select the models based on the test/hold-out dataset.
Any comments/suggestions?
Thanks