Winnie19
Calcite | Level 5

I am working on a project that uses logistic regression to predict a specific target. The problem is that the dataset is imbalanced: the training dataset has 3377 observations at response level 0 and 200 at response level 1. Which methods can I use to handle the imbalanced dataset? I am using SAS Enterprise Guide.

6 REPLIES
Ksharp
Super User
Oversample the events so that the two classes have a more reasonable proportion, such as 6:4 or 7:3, and then adjust the predicted probabilities for this in PROC LOGISTIC.
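
The adjustment mentioned above can be illustrated outside SAS with a small sketch. It assumes the standard prior-correction formula for a model fit on resampled data; the function name `correct_probability` and the 6:4 sample rate are illustrative choices, not something from the thread. In PROC LOGISTIC, the PRIOREVENT= option of the SCORE statement is one way to request a comparable adjustment.

```python
def correct_probability(p_sample, sample_rate, true_rate):
    """Map a probability from a model fit on resampled data
    back to the event rate of the original population."""
    # Rescale the event and non-event parts by the ratio of
    # the true prior to the prior in the resampled data.
    event = p_sample * true_rate / sample_rate
    nonevent = (1.0 - p_sample) * (1.0 - true_rate) / (1.0 - sample_rate)
    return event / (event + nonevent)


# Event rate in the original training data from the question.
true_rate = 200 / (200 + 3377)
# Assumed event rate after oversampling to 6:4 (non-events:events).
sample_rate = 0.4

for p in (0.2, 0.4, 0.8):
    print(p, round(correct_probability(p, sample_rate, true_rate), 4))
```

A score equal to the oversampled event rate maps back to the original event rate, which is a quick sanity check that the correction is wired up in the right direction.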
fierceanalytics
Obsidian | Level 7

You can reuse the 200 events (event=1) in every sample. Randomly sample from the non-events, say 1,000 of them (any count that gives a reasonable event percentage alongside the 200 events). Draw, say, 5 or 10 such samples, build a regular logistic regression model on each, then ensemble (average) the predicted scores. This is often called a K-fold method, although strictly it is closer to bagging over undersampled data than to K-fold cross-validation. Sometimes the motivation is not the ratio of 1s to 0s but the scanty count of 1s. This can be a bit of a hack, especially if you draw a lot of samples, and it does not help with explaining drivers.
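
The sampling and averaging steps above can be sketched as follows. This is only an outline under stated assumptions: observations are represented by integer IDs, the helper names `make_undersampled_sets` and `average_scores` are hypothetical, and the model-fitting step is left out (the scores at the end are placeholders, not fitted results).

```python
import random


def make_undersampled_sets(event_ids, nonevent_ids, n_sets=5,
                           n_nonevents=1000, seed=42):
    """Build n_sets training sets: all events plus a fresh random
    draw of n_nonevents non-events for each set."""
    rng = random.Random(seed)
    sets = []
    for _ in range(n_sets):
        # Sample non-events without replacement within each set.
        drawn = rng.sample(nonevent_ids, n_nonevents)
        sets.append(list(event_ids) + drawn)
    return sets


def average_scores(scores_per_model):
    """Average predicted scores across models, observation by observation."""
    n_models = len(scores_per_model)
    return [sum(col) / n_models for col in zip(*scores_per_model)]


# Counts from the question: 200 events and 3377 non-events.
event_ids = list(range(200))
nonevent_ids = list(range(200, 3577))
training_sets = make_undersampled_sets(event_ids, nonevent_ids)

# Placeholder scores from three hypothetical models for two observations.
ensemble = average_scores([[0.10, 0.70], [0.20, 0.60], [0.30, 0.80]])
print(len(training_sets), len(training_sets[0]), ensemble)
```

Each model here would be fit on data with an event rate of 200 in 1,200, so the averaged scores reflect that sampled rate rather than the original one.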

 

If possible, run PROC HPSPLIT or Enterprise Miner (EM) decision trees. Sometimes the non-event count dominates because of some unbalanced exclusion; finding that root cause with a tree is a much more business-friendly method.

Oversampling: with logistic regression, it is perfectly fine to try.

Jia Xin

GuyTreepwood
Obsidian | Level 7

How are you planning to apply the results of the model? What is the objective function (what are you trying to maximize/minimize)? It is difficult to answer this question without additional context regarding the problem you are trying to solve since you can go in many different directions regarding the solution. 

 

Your dataset has a target rate of about 5.6% (200 of 3,577), which could be enough to build a strong model on its own without additional processing. Doing nothing could be a feasible solution if you are using precision, recall, or F1 as the model selection criteria. Also, since you are working with logistic regression, you can assign a higher weight to the observations that have a target value of 1 in the training dataset using the WEIGHT statement in PROC LOGISTIC. If you have the time, you can experiment with under/oversampling as suggested in the other replies in this thread, or create additional synthetic versions of the existing target observations (e.g. using SMOTE) in the training dataset. Again, it depends on the specifics of the problem you are trying to solve.
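
To show why precision, recall, and F1 are more informative than accuracy at this target rate, here is a small arithmetic sketch. The first confusion matrix is a model that always predicts 0 on the counts from the question; the second is a purely hypothetical model whose counts are made up for illustration.

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute accuracy, precision, recall and F1 from confusion-matrix counts."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    # Guard against division by zero when nothing is predicted positive.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# A model that always predicts 0: every event is missed.
always_zero = classification_metrics(tp=0, fp=0, fn=200, tn=3377)

# Hypothetical model (made-up counts) that finds 120 of the 200 events.
hypothetical = classification_metrics(tp=120, fp=180, fn=80, tn=3197)

print(always_zero)
print(hypothetical)
```

The always-zero model scores above 94% on accuracy while its recall and F1 are zero, which is why accuracy alone is a poor selection criterion on data like this.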

fierceanalytics
Obsidian | Level 7

The mere fact that there are only 200 events requires treatment, so the real issue behind the original question may not be imbalance. No, 200 events are not going to give you a strong, sustainable model regardless.

Manually applying weights is essentially the same as oversampling/undersampling: weighting goes by count, sampling goes by ratio. Manual weighting is not recommended because it almost guarantees that the resampled total weight will not sum properly. When using the WEIGHT statement of PROC LOGISTIC, pay attention to the NORMALIZE (NORM) option on that statement. You will see the impact of using versus not using that option on your model.
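
The point about the weight total can be made concrete with a little arithmetic. This sketch assumes one common weighting choice (events weighted so that both classes contribute equally) and then rescales the weights to sum to the actual sample size, which is what the NORMALIZE option is documented to do; the variable names are illustrative.

```python
# Counts from the question.
n_events, n_nonevents = 200, 3377
n_total = n_events + n_nonevents

# Assumed raw weights: events are up-weighted so both classes
# contribute the same total weight.
w_event = n_nonevents / n_events
w_nonevent = 1.0
raw_total = n_events * w_event + n_nonevents * w_nonevent

# Rescale so the weights add up to the actual sample size.
scale = n_total / raw_total
norm_event = w_event * scale
norm_nonevent = w_nonevent * scale
norm_total = n_events * norm_event + n_nonevents * norm_nonevent

print("raw weights:", w_event, w_nonevent, "total:", raw_total)
print("normalized:", round(norm_event, 4), round(norm_nonevent, 4),
      "total:", round(norm_total, 4))
```

Without rescaling, the raw weights add up to roughly twice the real number of observations, which typically makes the model look more certain than the data justify. Normalizing keeps the ratio between the two weights and only changes their total.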

 

Last but not least, I doubt this subject has much to do with the objective function.
