topic Re: Fixing imbalanced data in SAS Data Science

Fixing imbalanced data

Solly7 — Mon, 07 Jun 2021 11:27:13 GMT

Hi,

I'm working on a binary classification model using logistic regression in SAS Base, but my data is extremely imbalanced. I need help balancing the data, or strategies for working with this kind of imbalanced data in SAS Base. See the screenshot below for my data.

Re: Fixing imbalanced data

PaigeMiller — Mon, 07 Jun 2021 11:57:31 GMT

Let's say you want to have twice as many 0s as 1s (so 1/3 of the data is now 1). You can randomly select records with 0 to be removed so that you have 4572 0s and 2286 1s. Or if you want 1/2 0s and 1/2 1s, you can modify the selection process to produce 2286 0s and 2286 1s.
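As a rough sketch of that selection step, something like the following could work. The data set name full_data, the variable name target, and the seed are assumptions for illustration, not names from the original data; only the counts 2286 and 4572 come from the text above.

```sas
/* Assumed names: work.full_data with a 0/1 variable called target */
data events nonevents;
    set full_data;
    if target = 1 then output events;          /* keep every 1 */
    else if target = 0 then output nonevents;  /* the 0s to be sampled down */
run;

/* Simple random sample of 4572 of the 0s (twice the 2286 1s) */
proc surveyselect data=nonevents out=nonevents_sample
                  method=srs sampsize=4572 seed=12345;
run;

/* Stack the 1s and the sampled 0s into one modeling data set */
data balanced;
    set events nonevents_sample;
run;
```

For a 1/2 and 1/2 split, sampsize=2286 would be used instead.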

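In SAS documentation this is called "oversampling": the rare event level ends up over-represented relative to the population, even though it is the 0s that are sampled down (elsewhere the same procedure is often called undersampling the majority class). Here is a way to handle oversampled data in your logistic regression in SAS: https://support.sas.com/kb/22/601.html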
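One adjustment available in PROC LOGISTIC is the PRIOREVENT= option on the SCORE statement, which rescales the predicted probabilities to the event rate in the original population. The sketch below assumes a sampled data set named balanced, a 0/1 variable target, predictors x1 and x2, and a data set new_data to score; the value 0.02 is a made-up placeholder for the true event proportion, which you would compute from your full data.

```sas
/* Assumed names: balanced, new_data, target, x1, x2 */
/* 0.02 is a placeholder, not a value from this thread */
proc logistic data=balanced;
    model target(event='1') = x1 x2;
    /* Adjust posterior probabilities to the population event rate */
    score data=new_data out=scored priorevent=0.02;
run;
```

Without such an adjustment, predicted probabilities from a model fit on the sampled data will be too high for the event level.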

Re: Fixing imbalanced data

Ksharp — Mon, 07 Jun 2021 12:02:30 GMT

Here are three ways you could go:
1) Oversample to 1:1, 1:2, 1:3 or 1:4.

or
2) Use exact logistic regression, but because your sample size is large, that could be computationally infeasible.

or
3) Use penalized logistic regression via the FIRTH option:
proc logistic.......
model ............ / firth ;
run;
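Filled in with placeholder names, the FIRTH call might look like the sketch below. The data set full_data and the variables target, x1 and x2 are assumptions for illustration only.

```sas
/* Assumed names: full_data, target (0/1), predictors x1 and x2 */
proc logistic data=full_data;
    /* FIRTH requests Firth's penalized maximum likelihood estimation */
    model target(event='1') = x1 x2 / firth;
run;
```

Here the model is fit on the full, unsampled data, so no correction for oversampling is needed afterwards.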

Re: Fixing imbalanced data

Solly7 — Mon, 07 Jun 2021 12:03:32 GMT

Hi, thanks for your prompt response. Let's say I have a sample of 20000 observations, call it full_data. Do I need to split full_data into training and testing sets and then oversample the training data? Or am I misunderstanding?

Re: Fixing imbalanced data

PaigeMiller — Mon, 07 Jun 2021 12:05:12 GMT

I would oversample first (reduce the imbalance), and then split that data randomly into training and validation.
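A minimal sketch of the split step, assuming the already-sampled data is in a data set called balanced. That name, the 70/30 ratio, and the seed are assumptions for illustration.

```sas
/* Assumed name: balanced = the data after reducing the imbalance */
/* OUTALL keeps every row and adds a 0/1 flag variable named Selected */
proc surveyselect data=balanced out=balanced_split
                  method=srs samprate=0.7 seed=12345 outall;
run;

/* Selected=1 rows go to training, the rest to validation */
data train valid;
    set balanced_split;
    if Selected = 1 then output train;
    else output valid;
run;
```

The model would then be fit on train and checked on valid.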