Hi EM experts: I am working on a classification research project with an extremely unbalanced design (target:nontarget ratio worse than 1:1000) and a very large dataset. Obviously I need to do some oversampling/undersampling of the target/nontarget samples, and I have run into this dilemma: should I do the over/undersampling before I split the dataset into train/validation sets, or after? I ask because I can't seem to find a good compromise, and a number of articles (e.g. this one on data leakage and this one on train-test contamination) specifically point out: "You can minimize data leakage in machine learning in many different ways. You can start by partitioning your data into training and test subsets before engaging in any preprocessing."

To avoid data leakage, I should split the dataset into train/validation sets before any over/undersampling is performed, but that leaves the final training and validation datasets with very different target:nontarget ratios. How should I deal with this in EM? Using a raw sample of the population as the validation dataset did not work for me: because the evaluation metric in EM is misclassification rate (not balanced accuracy or F1, which account for recall/precision on the rare class), no useful final model is selected, since the trivial model that predicts everything as nontarget always has the lowest misclassification rate.

If I over/undersample first and then split into train/validation sets, it works fine except that I have a data leakage problem. I then have to use a separate test set (production-level data with the same target ratio as the population) to pick the best models and adjust the thresholds for a production environment.

I also tried skipping over/undersampling entirely and keeping the population event ratio in both the training and validation sets while adjusting the prior distribution, etc. That works, but the random error is quite high because with so few events the sampling error is very large. Any suggestions?
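To make sure I'm describing the "split first, then resample only the training partition" option clearly, here is a minimal sketch of the idea. It is in Python/scikit-learn rather than EM, so the data, the 1:10 undersampling ratio, and all variable names are just placeholders for illustration, not my actual setup:

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)

# Toy data with ~0.1% positives to mimic the extreme imbalance (placeholder).
n = 200_000
X = rng.normal(size=(n, 5))
y = (rng.random(n) < 0.001).astype(int)

# 1) Split BEFORE any resampling; stratify so both partitions keep the
#    population event rate. The validation set is never touched again.
X_tr, X_va, y_tr, y_va = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42
)

# 2) Undersample the majority class in the TRAINING partition only,
#    e.g. down to roughly a 1:10 target:nontarget ratio.
pos = np.where(y_tr == 1)[0]
neg = np.where(y_tr == 0)[0]
neg_keep = rng.choice(neg, size=min(len(neg), 10 * len(pos)), replace=False)
keep = np.concatenate([pos, neg_keep])
X_bal, y_bal = X_tr[keep], y_tr[keep]

# 3) A model would then be fit on (X_bal, y_bal) and evaluated on the
#    untouched, population-ratio validation set (X_va, y_va), ideally with
#    a rank-based metric and a tuned cutoff rather than raw
#    misclassification rate.
```

The catch, as I describe above, is step 3: with the validation set at the raw population ratio, EM's default misclassification rate no longer selects a meaningful model, which is exactly the part I'm stuck on.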