lukholoman
New User | Level 1

Greetings, I need your assistance with handling a heavily imbalanced dataset. I am predicting whether a student will pass or fail after being accepted into the school. As part of the application process, prospective students complete a survey covering their study periods, study habits, location, academic marks, and other related information. Using this data, I aim to predict the probability of failure.

The issue is that we are working with historical data from 2021 to the present, and it is heavily imbalanced: the training set has 1,465 students who failed and 58,744 who passed (about 2.4% failures). My model is not performing well: it fails to identify the students who are likely to fail at any of the classification thresholds I have tried (class_pred = 0.3 to 0.6).

Could you please help me address this problem? I have tried oversampling, but I am unsure whether it is the best approach. I also plan to experiment with undersampling and SMOTE. I am working in SAS Enterprise Guide and also have access to Enterprise Miner.

2 REPLIES
quickbluefish
Barite | Level 11

It's quite possible your model has poor predictive ability because failure in this case might be largely explained by factors your data does not contain, and no resampling or modeling technique is going to fix that. However, you might start by posting the parameter estimates and other model output from, e.g., a Cox model (PROC PHREG), assuming you have time to failure. Given the time period, I would also definitely try to incorporate something related to the pandemic, as its effect on academic success might vary quite a lot by place and over time. A more detailed list of the predictors you're using (and how they're captured: categorical, continuous, etc.) would help us answer your question better.

Ksharp
Super User
Yeah. That is a big issue.
You could try a tree-based method, such as

a decision tree:
PROC HPSPLIT

or a random forest:
PROC HPFOREST

or partial least squares regression:
PROC PLS

or a nonparametric version of the logistic model:
https://blogs.sas.com/content/iml/2016/03/23/nonparametric-regression-binary-response-sas.html
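
On the thresholds mentioned in the question (0.3 to 0.6): with 1,465 failures among 60,209 students, the failure rate is about 2.4%, so a model fit on the original data may rarely produce probabilities that high. A model fit on oversampled data has the opposite problem, since its probabilities are inflated relative to the real population. Below is a minimal Python sketch of the usual prior-probability correction that maps a resampled-scale probability back to the original scale. The function names and the 50/50 resampling ratio are illustrative assumptions, not something taken from the thread; only the two counts come from the question.

```python
# Counts taken from the question; everything else is an illustrative assumption.
N_FAIL = 1465
N_PASS = 58744


def base_rate(n_event: int, n_nonevent: int) -> float:
    """Share of events (failures) in the original, unsampled data."""
    return n_event / (n_event + n_nonevent)


def correct_for_oversampling(p_sampled: float, sample_rate: float, true_rate: float) -> float:
    """Map a probability from a model fit on resampled data back to the original scale.

    p_sampled:   predicted event probability from the model fit on resampled data.
    sample_rate: event share in the resampled training data (0.5 if fully balanced).
    true_rate:   event share in the original population.
    """
    event = p_sampled * true_rate / sample_rate
    nonevent = (1.0 - p_sampled) * (1.0 - true_rate) / (1.0 - sample_rate)
    return event / (event + nonevent)


if __name__ == "__main__":
    pi = base_rate(N_FAIL, N_PASS)
    print(f"failure rate in the training data: {pi:.4f}")
    # Assumed scenario: the model was fit on a 50/50 oversampled training set.
    for p in (0.3, 0.5, 0.6, 0.9):
        corrected = correct_for_oversampling(p, sample_rate=0.5, true_rate=pi)
        print(f"resampled-scale p = {p:.1f} -> original-scale p = {corrected:.4f}")
```

Under that assumed 50/50 scenario, a resampled-scale probability of 0.5 maps back to exactly the base rate, which suggests comparing predictions against a cutoff near the failure rate, or simply ranking students by predicted risk, rather than using a fixed 0.3 to 0.6. In SAS, the PRIOREVENT= option on the SCORE statement of PROC LOGISTIC should apply the same kind of adjustment when scoring.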
