Greetings, I need your assistance with handling my heavily imbalanced dataset. I am predicting the probability of a student passing after being accepted into the school. As part of the application process, prospective students complete a survey that includes details such as their study periods, study habits, location, academic marks, and other related information. Using this data, I aim to predict the probability of failure. The issue is that we are working with historical data from 2021 to the present, and it is heavily imbalanced. In the training set, we have 1,465 students who failed and 58,744 who passed. My model is not performing well, as it fails to correctly predict students who are likely to fail at various thresholds (class_pred = 0.3 to 0.6). Could you please assist me in addressing this problem? I have tried oversampling, but I am unsure if this is the best approach. I also plan to experiment with techniques such as undersampling and SMOTE. I am currently working in SAS Enterprise Guide and also have access to Enterprise Miner.
It's quite possible your model has poor predictive ability because failure in this case might be largely explained by factors your data does not contain, and there's really not anything that's going to fix that. However, you might start by posting the parameter estimates and other model output from, e.g., a Cox model (PROC PHREG), assuming you have time to failure. Given the time period, I would also definitely try to incorporate something related to the pandemic, as the effect of that on academic success might vary quite a lot by place and over time. A more detailed list of the predictors you're using (and how they're captured -- categorical, continuous, etc.) would help us answer your question better.
Ciao,
Koen
When you have lots of "good" and relatively few "bad", there is definitely the possibility that the variables you have do not predict the bads. In fact, this is a common situation. This is not necessarily your fault or the fault of the model; that's often the way it is. You can try oversampling (see here and here). You can also read a gazillion commentaries on oversampling: go to your favorite internet search engine and type in
statistical oversampling good vs bad
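To make the oversampling idea concrete, here is a minimal sketch in Python (not SAS, and not anyone's actual workflow from this thread). It uses the class counts quoted in the question. The function names `base_rate`, `oversample_minority`, and `correct_for_prior` are hypothetical helpers made up for illustration. The sketch assumes simple random oversampling with replacement, and it applies the standard prior-correction formula to map probabilities from a model trained on the resampled data back to the original event rate.

```python
import random

# Class counts quoted in the question (training set).
N_FAIL, N_PASS = 1465, 58744


def base_rate(n_event, n_nonevent):
    """Share of events (here: failures) in the data."""
    return n_event / (n_event + n_nonevent)


def oversample_minority(rows, is_event, target_rate, seed=0):
    """Randomly duplicate event rows (with replacement) until events
    make up roughly target_rate of the returned rows.
    Hypothetical helper, for illustration only."""
    rng = random.Random(seed)
    events = [r for r in rows if is_event(r)]
    nonevents = [r for r in rows if not is_event(r)]
    # Number of events needed so that events / total == target_rate.
    n_needed = round(target_rate * len(nonevents) / (1 - target_rate))
    extra = [rng.choice(events) for _ in range(max(0, n_needed - len(events)))]
    return nonevents + events + extra


def correct_for_prior(p_sampled, true_rate, sampled_rate):
    """Map a probability from a model fit on resampled data back to the
    original event rate (standard prior-correction formula)."""
    a = p_sampled * true_rate / sampled_rate
    b = (1 - p_sampled) * (1 - true_rate) / (1 - sampled_rate)
    return a / (a + b)


if __name__ == "__main__":
    # Toy rows: 1 = failed, 0 = passed (no predictors, labels only).
    rows = [1] * N_FAIL + [0] * N_PASS
    balanced = oversample_minority(rows, lambda r: r == 1, target_rate=0.5)
    print("original failure rate:", round(base_rate(N_FAIL, N_PASS), 4))
    print("failure rate after oversampling:", sum(balanced) / len(balanced))
    print("score 0.5 on the original scale:",
          round(correct_for_prior(0.5, base_rate(N_FAIL, N_PASS), 0.5), 4))
```

With these counts, failures are only about 2.4% of the training data. A score of 0.5 from a model trained on a 50/50 oversampled set therefore corresponds to roughly 2.4% on the original scale. It may be worth checking which scale your 0.3 to 0.6 cutoffs are applied on before concluding that the model cannot find the failures.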