Hi EM experts: I am working on a classification research project with an extremely unbalanced design (target:nontarget ratio worse than 1:1000) and a very large dataset. Obviously I need to do some oversampling/undersampling of the target/nontarget samples, and I have run into this dilemma: should I do the over/undersampling before I split the dataset into train/validation sets, or after? I ask because I can't seem to find a good compromise, and a number of articles (e.g. this one on data leakage and this one on train-test contamination) specifically point out: "You can minimize data leakage in machine learning in many different ways. You can start by partitioning your data into training and test subsets before engaging in any preprocessing."

To avoid data leakage, I should split the dataset into train/validation sets before any over/undersampling is performed, but that leaves the final training and validation datasets with very different target:nontarget ratios. How should I deal with this in EM? Using a raw sample of the population as the validation dataset did not work for me: because the evaluation metric in EM is misclassification rate (not balanced accuracy or F1, which account for recall/precision on the rare class), no useful final model is selected, since the trivial model that predicts everything as nontarget always has the lowest misclassification rate.

If I over/undersample first and then split into train/validation sets, it works fine except that I have a data leakage problem. I then have to use a separate test set (production-level data with the same target ratio as the population) to pick the best models and adjust the thresholds for a production environment.

I also tried skipping over/undersampling entirely and keeping the population event ratio in both the training and validation sets while adjusting the prior distribution, etc. That works, but the random error is quite high because with so few events the sampling error is very large. Any suggestions?
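To make sure I'm describing the "split first, then resample only the training partition" option clearly, here is a minimal sketch of the idea. It is in Python/scikit-learn rather than EM, so the data, the 1:10 undersampling ratio, and all variable names are just placeholders for illustration, not my actual setup:

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)

# Toy data with ~0.1% positives to mimic the extreme imbalance (placeholder).
n = 200_000
X = rng.normal(size=(n, 5))
y = (rng.random(n) < 0.001).astype(int)

# 1) Split BEFORE any resampling; stratify so both partitions keep the
#    population event rate. The validation set is never touched again.
X_tr, X_va, y_tr, y_va = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42
)

# 2) Undersample the majority class in the TRAINING partition only,
#    e.g. down to roughly a 1:10 target:nontarget ratio.
pos = np.where(y_tr == 1)[0]
neg = np.where(y_tr == 0)[0]
neg_keep = rng.choice(neg, size=min(len(neg), 10 * len(pos)), replace=False)
keep = np.concatenate([pos, neg_keep])
X_bal, y_bal = X_tr[keep], y_tr[keep]

# 3) A model would then be fit on (X_bal, y_bal) and evaluated on the
#    untouched, population-ratio validation set (X_va, y_va), ideally with
#    a rank-based metric and a tuned cutoff rather than raw
#    misclassification rate.
```

The catch, as I describe above, is step 3: with the validation set at the raw population ratio, EM's default misclassification rate no longer selects a meaningful model, which is exactly the part I'm stuck on.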