Dealing with Imbalanced data

Winnie19 · Posted 10-10-2022 07:39 AM

I am doing a project that uses logistic regression to predict a specific target. The problem is that the dataset is imbalanced: the training dataset has 3,377 observations at response level 0 and 200 at response level 1. Which methods can I use to handle the imbalanced dataset? I am using SAS Enterprise Guide.

Ksharp · Posted 10-10-2022 08:45 AM

Oversample the events so that the two classes have a reasonable proportion, such as 6:4 or 7:3, and then adjust the predicted probabilities for the oversampling in PROC LOGISTIC.
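For anyone who wants to see the arithmetic behind that adjustment, here is a minimal sketch in Python rather than SAS. It assumes the standard prior correction for a model fit on resampled data; the function name `adjust_for_oversampling` and the 30% sample event rate are illustrative choices, not anything taken from PROC LOGISTIC itself.

```python
def adjust_for_oversampling(p_sample, sample_event_rate, true_event_rate):
    """Map a probability from a model fit on oversampled data back to the
    population scale (standard prior correction; hypothetical helper)."""
    # How much events and non-events were over/under-represented in the sample
    event_ratio = true_event_rate / sample_event_rate
    non_event_ratio = (1.0 - true_event_rate) / (1.0 - sample_event_rate)
    numerator = p_sample * event_ratio
    return numerator / (numerator + (1.0 - p_sample) * non_event_ratio)


# Counts from the original question: 200 events, 3,377 non-events
true_rate = 200 / (200 + 3377)
# Assumed event share after oversampling to roughly 7:3
sample_rate = 0.30

for p in (0.10, 0.30, 0.50, 0.90):
    adjusted = adjust_for_oversampling(p, sample_rate, true_rate)
    print(f"sample-scale p = {p:.2f} -> population-scale p = {adjusted:.4f}")
```

A score equal to the sample event rate maps back to the true event rate, which is a quick sanity check on the correction. The ranking of observations does not change, only the scale of the probabilities.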

Ksharp · Posted 10-10-2022 08:45 AM

Calling @StatDave

PaigeMiller · Posted 10-10-2022 09:58 AM

See https://support.sas.com/kb/22/601.html

--
Paige Miller

fierceanalytics · Posted 12-16-2022 04:40 PM

You can reuse all 200 events (event=1) and randomly sample from the non-events, say 1,000 of them (any count that gives a reasonable proportion against the 200 events). Draw, say, 5 or 10 such samples and build a regular LR model on each. Then ensemble (average) the predicted scores. This is sometimes loosely called a K-fold method, although it is closer to bagging over undersampled data than to K-fold cross-validation. Sometimes the motivation is not the ratio of 1s to 0s but the scanty count of 1s. This can be a bit of a hack, especially if you draw many samples, and it does not help with explaining drivers.
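The resampling and averaging steps can be sketched roughly as follows, in Python rather than SAS. This only shows the bookkeeping: row ids stand in for observations, and the "models" are placeholder scoring functions, since fitting the logistic regressions is left to whatever tool you use. The helper names `build_samples` and `average_scores` are hypothetical.

```python
import random


def build_samples(events, non_events, n_non_events, n_samples, seed=0):
    """Return n_samples training sets, each holding ALL events plus a fresh
    random draw (without replacement) of n_non_events non-events."""
    rng = random.Random(seed)
    samples = []
    for _ in range(n_samples):
        drawn = rng.sample(non_events, n_non_events)
        samples.append(list(events) + drawn)
    return samples


def average_scores(models, row):
    """Ensemble step: average the predicted scores of the fitted models."""
    return sum(model(row) for model in models) / len(models)


# Row ids standing in for the 200 events and 3,377 non-events in the question
events = list(range(200))
non_events = list(range(200, 3577))

samples = build_samples(events, non_events, n_non_events=1000, n_samples=5)
print([len(s) for s in samples])  # each sample: 200 events + 1000 non-events

# Placeholder "models": one scoring function per sample (assumption, not real fits)
models = [lambda row, score=score: score for score in (0.2, 0.3, 0.4)]
print(average_scores(models, row=None))
```

Each sample has an event share of 200 out of 1,200, so the averaged scores are on the resampled scale and still need the usual prior adjustment if you want population-scale probabilities.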

If possible, run HPSPLIT or EM decision trees. Sometimes the non-event count overpowers the events because of some unbalanced exclusion upstream; tracing that root cause is a much more business-friendly method.

Oversampling: with logistic regression, it is perfectly fine to try it.

Jia Xin

GuyTreepwood · Posted 12-21-2022 09:06 AM

How are you planning to apply the results of the model? What is the objective function (what are you trying to maximize/minimize)? It is difficult to answer this question without additional context regarding the problem you are trying to solve since you can go in many different directions regarding the solution.

Your dataset has a target rate of about 5.6% (200 out of 3,577), which could be enough to build a strong model on its own without additional processing. Doing nothing could be a feasible solution if you are using precision, recall, or F1 as the model selection criteria. Also, since you are working with logistic regression, you can assign a higher weight to the observations that have a target value of 1 in the training dataset using the WEIGHT statement in PROC LOGISTIC. If you have the time, you can experiment with under/oversampling as suggested in the other replies in this thread, or create additional synthetic versions of the existing target observations (e.g. using SMOTE) in the training dataset. Again, it depends on the specifics of the problem you are trying to solve.
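To make the weighting idea concrete, here is a small Python sketch of the arithmetic (not SAS code). It computes the target rate from the counts in the question and the per-event weight that would give the events a chosen share of the total weight. The 30% target share and the helper name `event_weight_for_share` are assumptions for illustration only.

```python
def event_weight_for_share(n_events, n_non_events, target_share):
    """Weight w for each event (non-events keep weight 1) such that
    w * n_events / (w * n_events + n_non_events) == target_share."""
    return target_share * n_non_events / ((1.0 - target_share) * n_events)


n_events, n_non_events = 200, 3377

event_rate = n_events / (n_events + n_non_events)
print(f"target rate: {event_rate:.2%}")

# Assumed goal: events carry 30% of the total weight
w = event_weight_for_share(n_events, n_non_events, target_share=0.30)
weighted_share = w * n_events / (w * n_events + n_non_events)
print(f"event weight: {w:.3f}, weighted event share: {weighted_share:.2%}")
```

The resulting value is what you would put in the weight variable for the event rows, with 1 for the non-event rows.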

fierceanalytics · Posted 12-21-2022 10:58 AM

The mere fact that there are only 200 events requires treatment, so the original question may not really be about imbalance. No, 200 events is not going to give you a strong, sustainable model regardless.

Manual weighting is essentially the same as oversampling/undersampling: weighting goes by count, sampling goes by ratio. Manual weighting is not recommended because it almost guarantees that the total weight will not sum properly to the sample size. When using the WEIGHT statement of PROC LOGISTIC, pay attention to the NORMALIZE (NORM) option on the statement. You will see the impact of using versus not using that option on your model.
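A rough illustration of what normalizing does, in Python rather than SAS: the weights are rescaled by a single constant so that they sum to the actual number of observations. The weight of 7 on events is an assumed value, and `normalize_weights` is a hypothetical helper, not a SAS function.

```python
def normalize_weights(weights):
    """Rescale weights by one constant so they sum to the number of rows."""
    n = len(weights)
    total = sum(weights)
    return [w * n / total for w in weights]


# Assumed manual weights: 7 for each of the 200 events, 1 for each non-event
raw = [7.0] * 200 + [1.0] * 3377
normalized = normalize_weights(raw)

print("rows:", len(raw))
print("sum of raw weights:", sum(raw))
print("sum of normalized weights:", round(sum(normalized), 6))
# The event-to-non-event weight ratio is unchanged by the rescaling
print("weight ratio:", round(normalized[0] / normalized[-1], 6))
```

Because every weight is multiplied by the same constant, the relative weighting and the coefficient estimates stay the same. What changes is the total weight the procedure treats as the sample size, which is why the standard errors and tests differ between the two runs.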

Last but not least, I doubt this subject has much to do with the objective function.
