Write and run SAS programs in your web browser

How to cut dataset such that the proportion of observations for a binary variable is 80-20.

Accepted Solution Solved
Reply
Contributor
Posts: 25
Accepted Solution

How to cut dataset such that the proportion of observations for a binary variable is 80-20.

[ Edited ]

Hi SAS experts, 

I am using SAS University Edition v.9.4.

 

 I need your help with writing a code that will help me doing the following task:

 

I must undersample (or oversample) the dataset such that the observations linked to the binary variable churn (0 no churn and 1 churn) are distributed with a proportion of 80% not-churn and 20% churn. In other words, with a dataset of i.e. 100 observations I must end up with 80% observations being churn=0 and 20% being churn=1.

 

Right now I have a dataset that has only 0.67% of customers being churners and 99.927% of non-churners. My dataset has approximately 65790 observations and the frequency of churners is only 330 of the dataset, I must oversample or undersample (i don't know what is best) in order to have an 80-20 distribution. Unfortunately, I cannot come up with an actionable code that succeeds in the task.

Below a small data step of 100 observations that represents what my real dataset is. Hopefully, you can play around with it to create the code that will allow me to achieve the desired goal.

 

data have;
input customer_id_ano churn;
cards;
1	0
2	0
3	0
4	0
5	0
6	0
7	0
8	0
9	0
10	0
11	0
12	0
13	0
14	0
15	0
16	0
17	0
18	0
19	0
20	0
21	0
22	0
23	0
24	0
25	0
26	0
27	0
28	0
29	0
30	0
31	0
32	0
33	0
34	0
35	0
36	0
37	0
38	0
39	0
40	0
41	0
42	0
43	0
44	0
45	0
46	0
47	0
48	0
49	0
50	0
51	0
52	0
53	0
54	0
55	0
56	0
57	0
58	0
59	0
60	0
61	1
62	0
63	0
64	0
65	0
66	0
67	0
68	0
69	0
70	0
71	0
72	1
73	1
74	0
75	0
76	0
77	0
78	0
79	0
80	0
81	0
82	1
83	0
84	0
85	0
86	0
87	0
88	0
89	0
90	0
91	0
92	0
93	1
94	0
95	0
96	0
97	0
98	0
99	0
100	0;
run;

 

Thank you in advance for the precious help you always provide me with!



Accepted Solutions
Solution
‎07-29-2017 02:36 AM
Super User
Posts: 11,343

Re: How to cut dataset such that the proportion of observations for a binary variable is 80-20.

Since your example data only contains 5% churn there is NO way to get 20%.

 

data have;
   do customer=1 to 60000;
      churn = (rand('uniform') < 0.007);
      output;
   end;
 run;

/* to show that we may a reasonable facsimile of your data*/
proc freq data=have;
   tables churn;
run;

/* algebra gives us a maximum number of Churn to select and the appropriate
   associated nonchurn to select, place the values into macro variables
   for use in Proc survey select
   if you want to use fewer churn then write a data step or that will create
   the macro variables with the number desired instead of count(*)
*/
proc sql noprint;
   select ceil( (count(*)/0.2)*0.8), count(*) into : NonChurnCnt, : ChurnCnt
   from have
   where churn=1
   ;
quit;

%put &NonChurnCnt &ChurnCnt;
/* we are going to use STRATA in survey select requiring the data to be
   sorted
*/
proc sort data=have;
   by churn;
run;

proc surveyselect data=have out=want
   sampsize=(&NonChurnCnt &ChurnCnt);
   strata churn;
run;
/*examine the output*/
proc freq data=want;
   tables churn;
run;

View solution in original post


All Replies
Solution
‎07-29-2017 02:36 AM
Super User
Posts: 11,343

Re: How to cut dataset such that the proportion of observations for a binary variable is 80-20.

Since your example data only contains 5% churn there is NO way to get 20%.

 

data have;
   do customer=1 to 60000;
      churn = (rand('uniform') < 0.007);
      output;
   end;
 run;

/* to show that we may a reasonable facsimile of your data*/
proc freq data=have;
   tables churn;
run;

/* algebra gives us a maximum number of Churn to select and the appropriate
   associated nonchurn to select, place the values into macro variables
   for use in Proc survey select
   if you want to use fewer churn then write a data step or that will create
   the macro variables with the number desired instead of count(*)
*/
proc sql noprint;
   select ceil( (count(*)/0.2)*0.8), count(*) into : NonChurnCnt, : ChurnCnt
   from have
   where churn=1
   ;
quit;

%put &NonChurnCnt &ChurnCnt;
/* we are going to use STRATA in survey select requiring the data to be
   sorted
*/
proc sort data=have;
   by churn;
run;

proc surveyselect data=have out=want
   sampsize=(&NonChurnCnt &ChurnCnt);
   strata churn;
run;
/*examine the output*/
proc freq data=want;
   tables churn;
run;
Contributor
Posts: 25

Re: How to cut dataset such that the proportion of observations for a binary variable is 80-20.

Dear @ballardw,

Why I cannot achieve an 80-20 proportion if I start from a 5-95 proportion?

Thank you in advance.

-Noemi
Super User
Posts: 11,343

Re: How to cut dataset such that the proportion of observations for a binary variable is 80-20.


noemi_b wrote:
Dear @ballardw,

Why I cannot achieve an 80-20 proportion if I start from a 5-95 proportion?

Thank you in advance.

-Noemi

The phrasing of your intial question sounded like you wanted 20 to me.

Contributor
Posts: 25

Re: How to cut dataset such that the proportion of observations for a binary variable is 80-20.

My original dataset has a proportion of 99.927% of customers being non-churners, while only the 0.67% of the sample is churner. Now, I would like to change this distribution of non-churners vs. Churners to an 80% and 20%, respectively. I want to do this in order to remove the problem of running a logistic regression with a rare event that would lead me to a biased regression result with biased estimates and probabilities.

So, I run your code and I must admit that it seems like achieving my desired goal. However, before posting me the code you wrote to me the following:

"Since your example data only contains 5% of churn there is now way to achieve 20%"

So, What did you mean by saying that? :-)
Super User
Posts: 11,343

Re: How to cut dataset such that the proportion of observations for a binary variable is 80-20.

This part of your original post:

"100 observations I must end up with 80% observations being churn=0 and 20% being churn=1."

made me think that the resulting data set should have 20. Since you provided an example data set of 100 I thought that was the data set you were refering to.

Contributor
Posts: 25

Re: How to cut dataset such that the proportion of observations for a binary variable is 80-20.

whatever the starting sample size was, I simply wanted to end up with 80% of the sample being non-churners and 20% being churners. And apparently proc surveyselect is the right procedure I needed. Thank you for the help! :-)
☑ This topic is solved.

Need further help from the community? Please ask a new question.

Discussion stats
  • 6 replies
  • 262 views
  • 1 like
  • 2 in conversation