BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
noemi_b
Obsidian | Level 7

Hi SAS experts, 

I am using SAS University Edition v.9.4.

 

 I need your help with writing a code that will help me doing the following task:

 

I must undersample (or oversample) the dataset such that the observations linked to the binary variable churn (0 no churn and 1 churn) are distributed with a proportion of 80% not-churn and 20% churn. In other words, with a dataset of i.e. 100 observations I must end up with 80% observations being churn=0 and 20% being churn=1.

 

Right now I have a dataset that has only 0.67% of customers being churners and 99.927% of non-churners. My dataset has approximately 65790 observations and the frequency of churners is only 330 of the dataset, I must oversample or undersample (i don't know what is best) in order to have an 80-20 distribution. Unfortunately, I cannot come up with an actionable code that succeeds in the task.

Below a small data step of 100 observations that represents what my real dataset is. Hopefully, you can play around with it to create the code that will allow me to achieve the desired goal.

 

data have;
input customer_id_ano churn;
cards;
1	0
2	0
3	0
4	0
5	0
6	0
7	0
8	0
9	0
10	0
11	0
12	0
13	0
14	0
15	0
16	0
17	0
18	0
19	0
20	0
21	0
22	0
23	0
24	0
25	0
26	0
27	0
28	0
29	0
30	0
31	0
32	0
33	0
34	0
35	0
36	0
37	0
38	0
39	0
40	0
41	0
42	0
43	0
44	0
45	0
46	0
47	0
48	0
49	0
50	0
51	0
52	0
53	0
54	0
55	0
56	0
57	0
58	0
59	0
60	0
61	1
62	0
63	0
64	0
65	0
66	0
67	0
68	0
69	0
70	0
71	0
72	1
73	1
74	0
75	0
76	0
77	0
78	0
79	0
80	0
81	0
82	1
83	0
84	0
85	0
86	0
87	0
88	0
89	0
90	0
91	0
92	0
93	1
94	0
95	0
96	0
97	0
98	0
99	0
100	0;
run;

 

Thank you in advance for the precious help you always provide me with!


1 ACCEPTED SOLUTION

Accepted Solutions
ballardw
Super User

Since your example data only contains 5% churn there is NO way to get 20%.

 

data have;
   do customer=1 to 60000;
      churn = (rand('uniform') < 0.007);
      output;
   end;
 run;

/* to show that we may a reasonable facsimile of your data*/
proc freq data=have;
   tables churn;
run;

/* algebra gives us a maximum number of Churn to select and the appropriate
   associated nonchurn to select, place the values into macro variables
   for use in Proc survey select
   if you want to use fewer churn then write a data step or that will create
   the macro variables with the number desired instead of count(*)
*/
proc sql noprint;
   select ceil( (count(*)/0.2)*0.8), count(*) into : NonChurnCnt, : ChurnCnt
   from have
   where churn=1
   ;
quit;

%put &NonChurnCnt &ChurnCnt;
/* we are going to use STRATA in survey select requiring the data to be
   sorted
*/
proc sort data=have;
   by churn;
run;

proc surveyselect data=have out=want
   sampsize=(&NonChurnCnt &ChurnCnt);
   strata churn;
run;
/*examine the output*/
proc freq data=want;
   tables churn;
run;

View solution in original post

6 REPLIES 6
ballardw
Super User

Since your example data only contains 5% churn there is NO way to get 20%.

 

data have;
   do customer=1 to 60000;
      churn = (rand('uniform') < 0.007);
      output;
   end;
 run;

/* to show that we may a reasonable facsimile of your data*/
proc freq data=have;
   tables churn;
run;

/* algebra gives us a maximum number of Churn to select and the appropriate
   associated nonchurn to select, place the values into macro variables
   for use in Proc survey select
   if you want to use fewer churn then write a data step or that will create
   the macro variables with the number desired instead of count(*)
*/
proc sql noprint;
   select ceil( (count(*)/0.2)*0.8), count(*) into : NonChurnCnt, : ChurnCnt
   from have
   where churn=1
   ;
quit;

%put &NonChurnCnt &ChurnCnt;
/* we are going to use STRATA in survey select requiring the data to be
   sorted
*/
proc sort data=have;
   by churn;
run;

proc surveyselect data=have out=want
   sampsize=(&NonChurnCnt &ChurnCnt);
   strata churn;
run;
/*examine the output*/
proc freq data=want;
   tables churn;
run;
noemi_b
Obsidian | Level 7
Dear @ballardw,

Why I cannot achieve an 80-20 proportion if I start from a 5-95 proportion?

Thank you in advance.

-Noemi
ballardw
Super User

@noemi_b wrote:
Dear @ballardw,

Why I cannot achieve an 80-20 proportion if I start from a 5-95 proportion?

Thank you in advance.

-Noemi

The phrasing of your intial question sounded like you wanted 20 to me.

noemi_b
Obsidian | Level 7
My original dataset has a proportion of 99.927% of customers being non-churners, while only the 0.67% of the sample is churner. Now, I would like to change this distribution of non-churners vs. Churners to an 80% and 20%, respectively. I want to do this in order to remove the problem of running a logistic regression with a rare event that would lead me to a biased regression result with biased estimates and probabilities.

So, I run your code and I must admit that it seems like achieving my desired goal. However, before posting me the code you wrote to me the following:

"Since your example data only contains 5% of churn there is now way to achieve 20%"

So, What did you mean by saying that? 🙂
ballardw
Super User

This part of your original post:

"100 observations I must end up with 80% observations being churn=0 and 20% being churn=1."

made me think that the resulting data set should have 20. Since you provided an example data set of 100 I thought that was the data set you were refering to.

noemi_b
Obsidian | Level 7
whatever the starting sample size was, I simply wanted to end up with 80% of the sample being non-churners and 20% being churners. And apparently proc surveyselect is the right procedure I needed. Thank you for the help! 🙂

sas-innovate-2024.png

Join us for SAS Innovate April 16-19 at the Aria in Las Vegas. Bring the team and save big with our group pricing for a limited time only.

Pre-conference courses and tutorials are filling up fast and are always a sellout. Register today to reserve your seat.

 

Register now!

SAS Enterprise Guide vs. SAS Studio

What’s the difference between SAS Enterprise Guide and SAS Studio? How are they similar? Just ask SAS’ Danny Modlin.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 6 replies
  • 1455 views
  • 1 like
  • 2 in conversation