Balancing Imbalanced Data

Lopa2016 · Posted 01-11-2017 04:59 AM

Hi,

I have a dataset where the percentage of bads is quite low, at 5%. Can anyone suggest a way to balance such a dataset?

PeterClemmensen · Posted 01-11-2017 05:00 AM

You have to provide more information than this and preferably some sample data 🙂

Lopa2016 · Posted 01-11-2017 05:08 AM

Here is a sample: 1 = bad and 0 = good. I need to predict bad. Thanks in advance.

ID	X1	X2	X3	X4	X5	Target
1	.	.	2	0	0	0
2	.	.	1	0	0	0
3	.	.	3	0	0	1
4	.	.	1	0	0	1
5	.	.	4	0	0	1
6	.	.	1	0	0	1
7	132	513	3	0	0	1
8	.	.	4	0	0	1
9	397	.	1	0	0	1
10	98	.	5	0	0	1
11	125	.	5	0	0	1
12	.	.	2	0	0	1
13	.	.	4	0	0	1
14	.	.	3	0	0	0
15	.	.	1	0	0	0
16	.	.	1	0	0	0
17	.	.	2	0	0	0
18	.	.	1	0	0	0
19	.	.	2	0	0	0
20	.	75	6	0	0	0
21	722	.	2	0	0	0
22	.	.	2	0	0	0
23	.	.	2	0	0	0
24	.	.	1	0	0	1
25	.	.	4	0	0	1
26	.	.	2	0	0	0
27	75	.	1	0	0	0
28	.	75	4	0	0	0
29	.	.	1	0	0	0
30	.	.	5	0	0	0
31	.	101	1	0	0	0
32	7442	16002	1	0	1	0
33	.	.	1	0	0	0
34	134	.	4	1	0	0
35	.	.	5	1	0	0
36	.	.	1	0	0	0
37	.	.	1	0	0	0
38	1492	.	2	0	0	0
39	.	.	3	0	0	0
40	.	.	1	0	0	0
41	.	.	1	0	0	0
42	.	.	4	0	0	0
43	.	.	5	0	0	0
44	.	.	1	0	0	0
45	3682	.	1	0	0	0
46	95	.	3	0	1	0
47	1530	.	4	0	0	0
48	.	.	2	0	0	0
49	.	.	2	0	0	0
50	.	.	2	0	0	0
51	.	.	2	0	0	0
52	.	1736	1	0	0	0
53	.	.	4	0	0	0
54	.	.	3	0	0	0
55	100	.	5	0	0	1
56	.	.	2	0	0	1
57	.	.	1	0	0	0
58	.	.	1	0	0	0
59	.	.	1	0	0	0
60	.	.	2	1	0	0
61	.	.	1	0	0	0
62	.	.	2	0	0	0
63	.	.	1	0	0	0
64	.	.	5	0	0	0
65	.	.	1	0	0	0
66	.	.	1	0	0	0
67	.	75	2	0	0	0
68	.	.	6	0	0	0
69	.	.	1	0	0	0
70	780	.	5	0	0	0
71	.	.	1	0	0	0
72	.	.	2	0	0	0
73	373	.	4	0	0	0
74	.	.	1	0	0	0
75	.	.	3	0	0	0
76	.	.	1	0	0	1
77	.	.	2	1	0	0
78	281	.	1	0	0	1
79	.	.	1	0	0	0
80	.	.	1	0	0	1
81	.	.	5	0	0	1
82	.	367	2	0	0	0
83	11079	120	1	0	0	0
84	.	.	3	0	0	1
85	.	.	1	0	0	0
86	110	.	3	0	0	0
87	125	125	1	0	0	0
88	.	.	1	0	0	0
89	.	327	4	0	0	0
90	.	.	3	1	0	0
91	.	1326	4	0	0	0
92	.	.	1	0	0	0
93	.	.	3	0	0	0
94	.	176	1	0	0	0
95	.	.	2	0	0	0
96	.	.	5	0	0	0
97	2266	.	3	0	0	0
98	.	.	4	0	0	0
99	.	.	1	0	0	0
100	.	.	1	0	0	0
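
For anyone who wants to run code against this sample, a minimal sketch of a DATA step that reads it is shown below. Only the first ten rows are typed out here; the remaining rows would follow in the same layout. The data set name Have is an assumption, chosen to match the name used in the replies.

```sas
/* Read the posted sample with list input; a period is a missing numeric value. */
/* Only rows 1-10 are shown; append the remaining rows in the same layout.      */
data Have;
   input ID X1 X2 X3 X4 X5 Target;
   datalines;
1 . . 2 0 0 0
2 . . 1 0 0 0
3 . . 3 0 0 1
4 . . 1 0 0 1
5 . . 4 0 0 1
6 . . 1 0 0 1
7 132 513 3 0 0 1
8 . . 4 0 0 1
9 397 . 1 0 0 1
10 98 . 5 0 0 1
;
run;
```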

Rick_SAS · Posted 01-11-2017 10:52 AM

Sorry to bear bad news, but using this data to predict the target variable is going to be very difficult. If you do a frequency analysis for the x3-x5 variables, you will see that only the X3 variable has enough information to be used to model the target:

/* Cross-tabulate each of X3, X4, and X5 against Target; show counts only */
proc freq data=Have;
   tables (x3-x5)*target / nocum norow nocol nopercent;
run;

The X1 and X2 variables have so many missing values that about the only thing you can say is that the target is empirically associated with low (univariate) values of the X1 and X2 variables.
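
The extent of the missingness can be checked directly. The following is a minimal sketch, assuming the posted sample is in a data set named Have (the name used in the PROC FREQ step above). It reports, for each level of Target, how many X1 and X2 values are present and missing, along with a few summary statistics of the nonmissing values.

```sas
/* Count nonmissing (N) and missing (NMISS) values of X1 and X2 by Target. */
/* Assumes the posted sample has been read into a data set named Have.     */
proc means data=Have n nmiss min median max;
   class target;
   var x1 x2;
run;
```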

My best advice is to get more data and better data.

Rick_SAS · Posted 01-11-2017 10:56 AM

To add to my previous message, if you want to use a logistic model, about the only model that has sufficient data is Target = X3, and that model is not significantly different from the intercept-only model:

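proc logistic data=Have plots=all;
   where x3 < 6;                    /* drop the rows with X3=6 */
   class X3;                        /* treat X3 as a categorical predictor */
   model target(event='1') = x3;    /* model the probability that Target=1 */
run;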
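
The comparison with the intercept-only model appears in the PROC LOGISTIC table titled "Testing Global Null Hypothesis: BETA=0". As a sketch, assuming the same Have data set, the step below reruns the model and uses ODS SELECT to display only that table.

```sas
/* Show only the global tests (likelihood ratio, score, Wald) of the model */
/* against the intercept-only model. Assumes the data set Have exists.     */
ods select GlobalTests;
proc logistic data=Have;
   where x3 < 6;
   class X3;
   model target(event='1') = x3;
run;
```

Large p-values in that table are what "not significantly different from the intercept-only model" refers to.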
