Re: Handling imbalanced data

lukholoman

Greetings, I need your assistance with handling my heavily imbalanced dataset. I am predicting the probability of a student passing after being accepted into the school. As part of the application process, prospective students complete a survey that includes details such as their study periods, study habits, location, academic marks, and other related information. Using this data, I aim to predict the probability of failure. The issue is that we are working with historical data from 2021 to the present, and it is heavily imbalanced. In the training set, we have 1,465 students who failed and 58,744 who passed. My model is not performing well, as it fails to correctly predict students who are likely to fail at various thresholds (class_pred = 0.3 to 0.6). Could you please assist me in addressing this problem? I have tried oversampling, but I am unsure if this is the best approach. I also plan to experiment with techniques such as undersampling and SMOTE. I am currently working in SAS Enterprise Guide and also have access to Enterprise Miner.

quickbluefish

It's quite possible your model has poor predictive ability because failure in this case might be largely explained by factors your data does not contain, and there's really not anything that's going to fix that. However, you might start by posting the parameter estimates and other model output from, e.g., a Cox model (PROC PHREG), assuming you have time to failure. Given the time period, I would also definitely try to incorporate something related to the pandemic, as the effect of that on academic success might vary quite a lot by place and over time. A more detailed list of the predictors you're using (and how they're captured -- categorical, continuous, etc.) would help us answer your question better.

Ksharp

Yeah, that is a big issue.
You could try a tree-based method, such as
a decision tree:
PROC HPSPLIT

or a random forest:
PROC HPFOREST

You could also try partial least squares regression:
PROC PLS

or a nonparametric version of the logistic model:
https://blogs.sas.com/content/iml/2016/03/23/nonparametric-regression-binary-response-sas.html

sbxkoenk

Undersampling the majority class in your binary classification model might be worthwhile.
Oversampling the minority class in your binary classification model might be worthwhile.
See "Mitigating the Effects of Class Imbalance Using SMOTE".
Adding business features might be worthwhile (extra inputs, or derived or composite inputs, that are also relevant for explaining and predicting the target).
Calculating statistical and machine learning features might be worthwhile.
For example, Enterprise Miner has a node for variable clustering. Cluster your variables and model with the first principal component of every cluster as inputs / candidate predictors.
Do not forget to adjust your posterior probabilities for the real priors. You can use the target profiler for this.
See "Prior Probabilities" in the SAS(R) Enterprise Miner(TM) 14.1 Extension Nodes: Developer's Guide.
Search for the threshold that gives a good balance between precision and recall (true positive rate), or look at the F1-score.
https://en.wikipedia.org/wiki/Confusion_matrix
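
The last two points (adjusting for the real priors, then searching for a threshold) can be sketched in a few lines. This is a minimal, language-neutral illustration in Python, not what Enterprise Miner does internally. It assumes the model was fit on a resampled set with a 50/50 class mix, takes the real event rate from the counts in the original question (1,465 failed, 58,744 passed), and uses made-up scores; the function names are hypothetical.

```python
def adjust_for_priors(p_sample, true_prior, sample_prior):
    """Rescale an event probability from a resampled training set
    back to the population event rate (standard prior correction)."""
    event = p_sample * true_prior / sample_prior
    non_event = (1.0 - p_sample) * (1.0 - true_prior) / (1.0 - sample_prior)
    return event / (event + non_event)


def f1_at_threshold(y_true, scores, threshold):
    """F1 score for the event class (1 = failed) at a given cutoff."""
    tp = sum(1 for y, s in zip(y_true, scores) if y == 1 and s >= threshold)
    fp = sum(1 for y, s in zip(y_true, scores) if y == 0 and s >= threshold)
    fn = sum(1 for y, s in zip(y_true, scores) if y == 1 and s < threshold)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def best_threshold(y_true, scores, candidates):
    """Return the candidate cutoff with the highest F1 (first one on ties)."""
    return max(candidates, key=lambda t: f1_at_threshold(y_true, scores, t))


# Real event rate from the counts in the original question.
TRUE_PRIOR = 1465 / (1465 + 58744)   # about 0.024
SAMPLE_PRIOR = 0.5                   # assumption: model was fit on a 50/50 resample

# Hypothetical scores from the resampled model, for illustration only.
y_true = [1, 1, 0, 1, 0, 0, 0, 0]
raw = [0.95, 0.80, 0.70, 0.60, 0.40, 0.30, 0.20, 0.10]
adjusted = [adjust_for_priors(p, TRUE_PRIOR, SAMPLE_PRIOR) for p in raw]

# Scan cutoffs from 0.01 to 0.99 on the adjusted scale.
candidates = [i / 100 for i in range(1, 100)]
cutoff = best_threshold(y_true, adjusted, candidates)
```

The correction keeps the ranking of students unchanged but pulls every probability toward the real event rate, so on the adjusted scale a useful cutoff is likely to sit much closer to 2.4% than to 0.3 to 0.6. In practice the threshold search should be run on a validation set that has the natural class mix.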

Ciao,

Koen

PaigeMiller

When you have lots of "good" and relatively few "bad", there is definitely the possibility that the variables you have do not predict the bads. In fact, this is a common situation. This is not necessarily your fault or the fault of the model; that's often the way it is. You can try oversampling (see here and here). You can also read a gazillion commentaries on oversampling: go to your favorite internet search engine and type in

statistical oversampling good vs bad
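
For reference, plain random oversampling amounts to duplicating rare-class rows until the classes are balanced. Below is a minimal sketch in Python (used here only as a language-neutral illustration; in SAS the same idea would be done with a sampling step). The field names `student_id` and `failed`, the function name, and the toy data are all hypothetical.

```python
import random


def oversample_minority(rows, label_key="failed", seed=0):
    """Randomly duplicate minority-class rows (sampling with replacement)
    until both classes have the same number of rows."""
    rng = random.Random(seed)
    minority = [r for r in rows if r[label_key] == 1]
    majority = [r for r in rows if r[label_key] == 0]
    # Make sure "minority" really is the smaller class.
    if len(minority) > len(majority):
        minority, majority = majority, minority
    if not minority:
        raise ValueError("need at least one row in each class")
    extra = [rng.choice(minority) for _ in range(len(majority) - len(minority))]
    balanced = majority + minority + extra
    rng.shuffle(balanced)
    return balanced


# Toy data: 3 failures among 100 students.
train = [{"student_id": i, "failed": 1 if i < 3 else 0} for i in range(100)]
balanced = oversample_minority(train)
```

Resampling like this should be applied to the training partition only; the validation and test data should keep the natural class mix, and the predicted probabilities from a model fit on the balanced data will be too high until they are adjusted back to the real event rate.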

--
Paige Miller
