Balancing Imbalanced Data

Contributor
Posts: 44

Hi,

I have a dataset where the percentage of bads is quite low, at 5%. Can anyone suggest a way to balance such a dataset?

PROC Star
Posts: 634

Re: Balancing Imbalanced Data

You have to provide more information than this, and preferably some sample data :)

Contributor
Posts: 44

Re: Balancing Imbalanced Data

Here is a sample: 1 = Bad and 0 = Good. I need to predict Bad. Thanks in advance.

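ID   X1    X2    X3  X4  X5  Target
1    .     .     2   0   0   0
2    .     .     1   0   0   0
3    .     .     3   0   0   1
4    .     .     1   0   0   1
5    .     .     4   0   0   1
6    .     .     1   0   0   1
71325133001
8    .     .     4   0   0   1
9    397   .     1   0   0   1
10   98    .     5   0   0   1
11   125   .     5   0   0   1
12   .     .     2   0   0   1
13   .     .     4   0   0   1
14   .     .     3   0   0   0
15   .     .     1   0   0   0
16   .     .     1   0   0   0
17   .     .     2   0   0   0
18   .     .     1   0   0   0
19   .     .     2   0   0   0
20   .     75    6   0   0   0
21   722   .     2   0   0   0
22   .     .     2   0   0   0
23   .     .     2   0   0   0
24   .     .     1   0   0   1
25   .     .     4   0   0   1
26   .     .     2   0   0   0
27   75    .     1   0   0   0
28   .     75    4   0   0   0
29   .     .     1   0   0   0
30   .     .     5   0   0   0
31   .     101   1   0   0   0
327442160021010
33   .     .     1   0   0   0
34   134   .     4   1   0   0
35   .     .     5   1   0   0
36   .     .     1   0   0   0
37   .     .     1   0   0   0
38   1492  .     2   0   0   0
39   .     .     3   0   0   0
40   .     .     1   0   0   0
41   .     .     1   0   0   0
42   .     .     4   0   0   0
43   .     .     5   0   0   0
44   .     .     1   0   0   0
45   3682  .     1   0   0   0
46   95    .     3   0   1   0
47   1530  .     4   0   0   0
48   .     .     2   0   0   0
49   .     .     2   0   0   0
50   .     .     2   0   0   0
51   .     .     2   0   0   0
52   .     1736  1   0   0   0
53   .     .     4   0   0   0
54   .     .     3   0   0   0
55   100   .     5   0   0   1
56   .     .     2   0   0   1
57   .     .     1   0   0   0
58   .     .     1   0   0   0
59   .     .     1   0   0   0
60   .     .     2   1   0   0
61   .     .     1   0   0   0
62   .     .     2   0   0   0
63   .     .     1   0   0   0
64   .     .     5   0   0   0
65   .     .     1   0   0   0
66   .     .     1   0   0   0
67   .     75    2   0   0   0
68   .     .     6   0   0   0
69   .     .     1   0   0   0
70   780   .     5   0   0   0
71   .     .     1   0   0   0
72   .     .     2   0   0   0
73   373   .     4   0   0   0
74   .     .     1   0   0   0
75   .     .     3   0   0   0
76   .     .     1   0   0   1
77   .     .     2   1   0   0
78   281   .     1   0   0   1
79   .     .     1   0   0   0
80   .     .     1   0   0   1
81   .     .     5   0   0   1
82   .     367   2   0   0   0
83110791201000
84   .     .     3   0   0   1
85   .     .     1   0   0   0
86   110   .     3   0   0   0
871251251000
88   .     .     1   0   0   0
89   .     327   4   0   0   0
90   .     .     3   1   0   0
91   .     1326  4   0   0   0
92   .     .     1   0   0   0
93   .     .     3   0   0   0
94   .     176   1   0   0   0
95   .     .     2   0   0   0
96   .     .     5   0   0   0
97   2266  .     3   0   0   0
98   .     .     4   0   0   0
99   .     .     1   0   0   0
100  .     .     1   0   0   0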
SAS Super FREQ
Posts: 3,547

Re: Balancing Imbalanced Data

Sorry to bear bad news, but using this data to predict the target variable is going to be very difficult.  If you do a frequency analysis for the x3-x5 variables, you will see that only the X3 variable has enough information to be used to model the target:

/* Cross-tabulate each of X3, X4, and X5 against Target, showing cell counts only */
proc freq data=Have;
   tables (x3-x5)*target / nocum norow nocol nopercent;
run;

The X1 and X2 variables have so many missing values that about the only thing you can say is that the target is empirically associated with low (univariate) values of the X1 and X2 variables.  
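
A quick way to see the missing-value pattern is to summarize X1 and X2 within each level of the target. This is only a sketch: it assumes the posted sample has been read into a data set named Have (the name used in the PROC FREQ call above), and the choice of statistics is illustrative.

```sas
/* Assumption: the posted sample is stored in a data set named Have. */
/* For each Target level, report nonmissing count, missing count,    */
/* and a few location statistics for X1 and X2.                      */
proc means data=Have n nmiss min median max;
   class target;
   var x1 x2;
run;
```

Comparing the N and NMISS columns shows how few observations in each target group carry a usable X1 or X2 value.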

My best advice is to get more data and better data.

SAS Super FREQ
Posts: 3,547

Re: Balancing Imbalanced Data

To add to my previous message, if you want to use a logistic model, about the only model that has sufficient data is Target = X3, and that model is not significantly different from the intercept-only model:

/* Model Target on X3 as a classification effect, using only rows with X3 < 6 */
proc logistic data=Have plots=all;
   where x3 < 6;
   class X3;
   model target(event='1') = x3;
run;
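
The comparison with the intercept-only model is what PROC LOGISTIC reports in its "Testing Global Null Hypothesis: BETA=0" table. As a sketch, again assuming the data set is named Have, the ODS SELECT statement below limits the printed output to that table (GlobalTests is its ODS table name):

```sas
/* Assumption: the posted sample is stored in a data set named Have. */
/* Print only the global null hypothesis tests for the X3 model.     */
ods select GlobalTests;
proc logistic data=Have;
   where x3 < 6;
   class x3;
   model target(event='1') = x3;
run;
```

Large p-values for the likelihood ratio, score, and Wald tests in that table indicate that adding X3 does not improve on the intercept-only model.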