Hi SAS experts,
I am using SAS University Edition v.9.4.
I need your help with writing a code that will help me doing the following task:
I must undersample (or oversample) the dataset such that the observations linked to the binary variable churn (0 no churn and 1 churn) are distributed with a proportion of 80% not-churn and 20% churn. In other words, with a dataset of i.e. 100 observations I must end up with 80% observations being churn=0 and 20% being churn=1.
Right now I have a dataset that has only 0.67% of customers being churners and 99.927% of non-churners. My dataset has approximately 65790 observations and the frequency of churners is only 330 of the dataset, I must oversample or undersample (i don't know what is best) in order to have an 80-20 distribution. Unfortunately, I cannot come up with an actionable code that succeeds in the task.
Below a small data step of 100 observations that represents what my real dataset is. Hopefully, you can play around with it to create the code that will allow me to achieve the desired goal.
data have;
input customer_id_ano churn;
cards;
1 0
2 0
3 0
4 0
5 0
6 0
7 0
8 0
9 0
10 0
11 0
12 0
13 0
14 0
15 0
16 0
17 0
18 0
19 0
20 0
21 0
22 0
23 0
24 0
25 0
26 0
27 0
28 0
29 0
30 0
31 0
32 0
33 0
34 0
35 0
36 0
37 0
38 0
39 0
40 0
41 0
42 0
43 0
44 0
45 0
46 0
47 0
48 0
49 0
50 0
51 0
52 0
53 0
54 0
55 0
56 0
57 0
58 0
59 0
60 0
61 1
62 0
63 0
64 0
65 0
66 0
67 0
68 0
69 0
70 0
71 0
72 1
73 1
74 0
75 0
76 0
77 0
78 0
79 0
80 0
81 0
82 1
83 0
84 0
85 0
86 0
87 0
88 0
89 0
90 0
91 0
92 0
93 1
94 0
95 0
96 0
97 0
98 0
99 0
100 0;
run;
Thank you in advance for the precious help you always provide me with!
Since your example data only contains 5% churn there is NO way to get 20%.
data have; do customer=1 to 60000; churn = (rand('uniform') < 0.007); output; end; run; /* to show that we may a reasonable facsimile of your data*/ proc freq data=have; tables churn; run; /* algebra gives us a maximum number of Churn to select and the appropriate associated nonchurn to select, place the values into macro variables for use in Proc survey select if you want to use fewer churn then write a data step or that will create the macro variables with the number desired instead of count(*) */ proc sql noprint; select ceil( (count(*)/0.2)*0.8), count(*) into : NonChurnCnt, : ChurnCnt from have where churn=1 ; quit; %put &NonChurnCnt &ChurnCnt; /* we are going to use STRATA in survey select requiring the data to be sorted */ proc sort data=have; by churn; run; proc surveyselect data=have out=want sampsize=(&NonChurnCnt &ChurnCnt); strata churn; run; /*examine the output*/ proc freq data=want; tables churn; run;
Since your example data only contains 5% churn there is NO way to get 20%.
data have; do customer=1 to 60000; churn = (rand('uniform') < 0.007); output; end; run; /* to show that we may a reasonable facsimile of your data*/ proc freq data=have; tables churn; run; /* algebra gives us a maximum number of Churn to select and the appropriate associated nonchurn to select, place the values into macro variables for use in Proc survey select if you want to use fewer churn then write a data step or that will create the macro variables with the number desired instead of count(*) */ proc sql noprint; select ceil( (count(*)/0.2)*0.8), count(*) into : NonChurnCnt, : ChurnCnt from have where churn=1 ; quit; %put &NonChurnCnt &ChurnCnt; /* we are going to use STRATA in survey select requiring the data to be sorted */ proc sort data=have; by churn; run; proc surveyselect data=have out=want sampsize=(&NonChurnCnt &ChurnCnt); strata churn; run; /*examine the output*/ proc freq data=want; tables churn; run;
This part of your original post:
"100 observations I must end up with 80% observations being churn=0 and 20% being churn=1."
made me think that the resulting data set should have 20. Since you provided an example data set of 100 I thought that was the data set you were refering to.
Are you ready for the spotlight? We're accepting content ideas for SAS Innovate 2025 to be held May 6-9 in Orlando, FL. The call is open until September 25. Read more here about why you should contribute and what is in it for you!
What’s the difference between SAS Enterprise Guide and SAS Studio? How are they similar? Just ask SAS’ Danny Modlin.
Find more tutorials on the SAS Users YouTube channel.