Hi,
I have a dataset where the percentage of bads is quite low, at about 5%. Can anyone suggest a way to balance such a dataset?
You have to provide more information than this and preferably some sample data 🙂
Here is a sample: 1 = Bad and 0 = Good. I need to predict Bad. Thanks in advance.
ID | X1 | X2 | X3 | X4 | X5 | Target |
1 | . | . | 2 | 0 | 0 | 0 |
2 | . | . | 1 | 0 | 0 | 0 |
3 | . | . | 3 | 0 | 0 | 1 |
4 | . | . | 1 | 0 | 0 | 1 |
5 | . | . | 4 | 0 | 0 | 1 |
6 | . | . | 1 | 0 | 0 | 1 |
7 | 132 | 513 | 3 | 0 | 0 | 1 |
8 | . | . | 4 | 0 | 0 | 1 |
9 | 397 | . | 1 | 0 | 0 | 1 |
10 | 98 | . | 5 | 0 | 0 | 1 |
11 | 125 | . | 5 | 0 | 0 | 1 |
12 | . | . | 2 | 0 | 0 | 1 |
13 | . | . | 4 | 0 | 0 | 1 |
14 | . | . | 3 | 0 | 0 | 0 |
15 | . | . | 1 | 0 | 0 | 0 |
16 | . | . | 1 | 0 | 0 | 0 |
17 | . | . | 2 | 0 | 0 | 0 |
18 | . | . | 1 | 0 | 0 | 0 |
19 | . | . | 2 | 0 | 0 | 0 |
20 | . | 75 | 6 | 0 | 0 | 0 |
21 | 722 | . | 2 | 0 | 0 | 0 |
22 | . | . | 2 | 0 | 0 | 0 |
23 | . | . | 2 | 0 | 0 | 0 |
24 | . | . | 1 | 0 | 0 | 1 |
25 | . | . | 4 | 0 | 0 | 1 |
26 | . | . | 2 | 0 | 0 | 0 |
27 | 75 | . | 1 | 0 | 0 | 0 |
28 | . | 75 | 4 | 0 | 0 | 0 |
29 | . | . | 1 | 0 | 0 | 0 |
30 | . | . | 5 | 0 | 0 | 0 |
31 | . | 101 | 1 | 0 | 0 | 0 |
32 | 7442 | 16002 | 1 | 0 | 1 | 0 |
33 | . | . | 1 | 0 | 0 | 0 |
34 | 134 | . | 4 | 1 | 0 | 0 |
35 | . | . | 5 | 1 | 0 | 0 |
36 | . | . | 1 | 0 | 0 | 0 |
37 | . | . | 1 | 0 | 0 | 0 |
38 | 1492 | . | 2 | 0 | 0 | 0 |
39 | . | . | 3 | 0 | 0 | 0 |
40 | . | . | 1 | 0 | 0 | 0 |
41 | . | . | 1 | 0 | 0 | 0 |
42 | . | . | 4 | 0 | 0 | 0 |
43 | . | . | 5 | 0 | 0 | 0 |
44 | . | . | 1 | 0 | 0 | 0 |
45 | 3682 | . | 1 | 0 | 0 | 0 |
46 | 95 | . | 3 | 0 | 1 | 0 |
47 | 1530 | . | 4 | 0 | 0 | 0 |
48 | . | . | 2 | 0 | 0 | 0 |
49 | . | . | 2 | 0 | 0 | 0 |
50 | . | . | 2 | 0 | 0 | 0 |
51 | . | . | 2 | 0 | 0 | 0 |
52 | . | 1736 | 1 | 0 | 0 | 0 |
53 | . | . | 4 | 0 | 0 | 0 |
54 | . | . | 3 | 0 | 0 | 0 |
55 | 100 | . | 5 | 0 | 0 | 1 |
56 | . | . | 2 | 0 | 0 | 1 |
57 | . | . | 1 | 0 | 0 | 0 |
58 | . | . | 1 | 0 | 0 | 0 |
59 | . | . | 1 | 0 | 0 | 0 |
60 | . | . | 2 | 1 | 0 | 0 |
61 | . | . | 1 | 0 | 0 | 0 |
62 | . | . | 2 | 0 | 0 | 0 |
63 | . | . | 1 | 0 | 0 | 0 |
64 | . | . | 5 | 0 | 0 | 0 |
65 | . | . | 1 | 0 | 0 | 0 |
66 | . | . | 1 | 0 | 0 | 0 |
67 | . | 75 | 2 | 0 | 0 | 0 |
68 | . | . | 6 | 0 | 0 | 0 |
69 | . | . | 1 | 0 | 0 | 0 |
70 | 780 | . | 5 | 0 | 0 | 0 |
71 | . | . | 1 | 0 | 0 | 0 |
72 | . | . | 2 | 0 | 0 | 0 |
73 | 373 | . | 4 | 0 | 0 | 0 |
74 | . | . | 1 | 0 | 0 | 0 |
75 | . | . | 3 | 0 | 0 | 0 |
76 | . | . | 1 | 0 | 0 | 1 |
77 | . | . | 2 | 1 | 0 | 0 |
78 | 281 | . | 1 | 0 | 0 | 1 |
79 | . | . | 1 | 0 | 0 | 0 |
80 | . | . | 1 | 0 | 0 | 1 |
81 | . | . | 5 | 0 | 0 | 1 |
82 | . | 367 | 2 | 0 | 0 | 0 |
83 | 11079 | 120 | 1 | 0 | 0 | 0 |
84 | . | . | 3 | 0 | 0 | 1 |
85 | . | . | 1 | 0 | 0 | 0 |
86 | 110 | . | 3 | 0 | 0 | 0 |
87 | 125 | 125 | 1 | 0 | 0 | 0 |
88 | . | . | 1 | 0 | 0 | 0 |
89 | . | 327 | 4 | 0 | 0 | 0 |
90 | . | . | 3 | 1 | 0 | 0 |
91 | . | 1326 | 4 | 0 | 0 | 0 |
92 | . | . | 1 | 0 | 0 | 0 |
93 | . | . | 3 | 0 | 0 | 0 |
94 | . | 176 | 1 | 0 | 0 | 0 |
95 | . | . | 2 | 0 | 0 | 0 |
96 | . | . | 5 | 0 | 0 | 0 |
97 | 2266 | . | 3 | 0 | 0 | 0 |
98 | . | . | 4 | 0 | 0 | 0 |
99 | . | . | 1 | 0 | 0 | 0 |
100 | . | . | 1 | 0 | 0 | 0 |
Sorry to bear bad news, but using this data to predict the target variable is going to be very difficult. If you do a frequency analysis of the X3-X5 variables against Target, you will see that only the X3 variable has enough information to be used to model the target:
/* Cross-tabulate each of X3, X4, and X5 against Target, showing cell counts only */
proc freq data=Have;
tables (x3-x5)*target / nocum norow nocol nopercent;
run;
The X1 and X2 variables have so many missing values that about the only thing you can say is that the bad outcome (Target=1) is empirically associated with low (univariate) values of the X1 and X2 variables.
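If it helps, here is a quick sketch for checking how sparse X1 and X2 are. It assumes the same data set name (Have) used above and simply counts nonmissing and missing values, along with a few summary statistics, within each level of Target:

```sas
/* Sketch only: assumes the sample has been read into a data set named Have */
/* N = nonmissing count, NMISS = missing count, reported per Target level   */
proc means data=Have n nmiss min median max;
   class target;
   var x1 x2;
run;
```

The NMISS column shows how few observations actually contribute to any statement about X1 and X2.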
My best advice is to get more data and better data.
To add to my previous message, if you want to use a logistic model, about the only model that has sufficient data is Target = X3, and that model is not significantly different from the intercept-only model:
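/* Logistic regression of Target on X3, treated as a classification variable */
proc logistic data=Have plots=all;
where x3 < 6;   /* drops X3=6, which occurs only twice in the sample (both Target=0) */
class X3;
model target(event='1') = x3;   /* model the probability that Target=1 (Bad) */
run;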
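To see the comparison with the intercept-only model without scrolling through all of the output, one option is to restrict the output to the "Testing Global Null Hypothesis: BETA=0" table. The sketch below assumes that the ODS name of that table is GlobalTests (worth confirming with ODS TRACE ON in your release); the model itself is unchanged:

```sas
/* Sketch only: GlobalTests is assumed to be the ODS name of the        */
/* "Testing Global Null Hypothesis: BETA=0" table in PROC LOGISTIC      */
ods select GlobalTests;
proc logistic data=Have;
   where x3 < 6;
   class X3;
   model target(event='1') = x3;
run;
```

Large p-values for the likelihood ratio, score, and Wald tests in that table indicate that adding X3 does not improve significantly on the intercept-only model.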