turn on suggestions

Auto-suggest helps you quickly narrow down your search results by suggesting possible matches as you type.

Showing results for

Find a Community

- Home
- /
- SAS Programming
- /
- SAS Studio
- /
- How to cut dataset such that the proportion of obs...

Topic Options

- Subscribe to RSS Feed
- Mark Topic as New
- Mark Topic as Read
- Float this Topic for Current User
- Bookmark
- Subscribe
- Printer Friendly Page

- Mark as New
- Bookmark
- Subscribe
- Subscribe to RSS Feed
- Highlight
- Email to a Friend
- Report Inappropriate Content

07-28-2017 10:12 AM - edited 07-28-2017 11:10 AM

Hi SAS experts,

I am using SAS University Edition v.9.4.

I need your help with writing a code that will help me doing the following task:

I must undersample (or oversample) the dataset such that the observations linked to the binary variable churn (0 no churn and 1 churn) are distributed with a proportion of 80% not-churn and 20% churn. In other words, with a dataset of i.e. 100 observations I must end up with 80% observations being churn=0 and 20% being churn=1.

Right now I have a dataset that has only 0.67% of customers being churners and 99.927% of non-churners. My dataset has approximately 65790 observations and the frequency of churners is only 330 of the dataset, I must oversample or undersample (i don't know what is best) in order to have an 80-20 distribution. Unfortunately, I cannot come up with an actionable code that succeeds in the task.

Below a small data step of 100 observations that represents what my real dataset is. Hopefully, you can play around with it to create the code that will allow me to achieve the desired goal.

```
data have;
input customer_id_ano churn;
cards;
1 0
2 0
3 0
4 0
5 0
6 0
7 0
8 0
9 0
10 0
11 0
12 0
13 0
14 0
15 0
16 0
17 0
18 0
19 0
20 0
21 0
22 0
23 0
24 0
25 0
26 0
27 0
28 0
29 0
30 0
31 0
32 0
33 0
34 0
35 0
36 0
37 0
38 0
39 0
40 0
41 0
42 0
43 0
44 0
45 0
46 0
47 0
48 0
49 0
50 0
51 0
52 0
53 0
54 0
55 0
56 0
57 0
58 0
59 0
60 0
61 1
62 0
63 0
64 0
65 0
66 0
67 0
68 0
69 0
70 0
71 0
72 1
73 1
74 0
75 0
76 0
77 0
78 0
79 0
80 0
81 0
82 1
83 0
84 0
85 0
86 0
87 0
88 0
89 0
90 0
91 0
92 0
93 1
94 0
95 0
96 0
97 0
98 0
99 0
100 0;
run;
```

Thank you in advance for the precious help you always provide me with!

Accepted Solutions

Solution

07-29-2017
02:36 AM

- Mark as New
- Bookmark
- Subscribe
- Subscribe to RSS Feed
- Highlight
- Email to a Friend
- Report Inappropriate Content

Posted in reply to noemi_b

07-28-2017 11:33 AM

Since your example data only contains 5% churn there is NO way to get 20%.

data have; do customer=1 to 60000; churn = (rand('uniform') < 0.007); output; end; run; /* to show that we may a reasonable facsimile of your data*/ proc freq data=have; tables churn; run; /* algebra gives us a maximum number of Churn to select and the appropriate associated nonchurn to select, place the values into macro variables for use in Proc survey select if you want to use fewer churn then write a data step or that will create the macro variables with the number desired instead of count(*) */ proc sql noprint; select ceil( (count(*)/0.2)*0.8), count(*) into : NonChurnCnt, : ChurnCnt from have where churn=1 ; quit; %put &NonChurnCnt &ChurnCnt; /* we are going to use STRATA in survey select requiring the data to be sorted */ proc sort data=have; by churn; run; proc surveyselect data=have out=want sampsize=(&NonChurnCnt &ChurnCnt); strata churn; run; /*examine the output*/ proc freq data=want; tables churn; run;

All Replies

Solution

07-29-2017
02:36 AM

- Mark as New
- Bookmark
- Subscribe
- Subscribe to RSS Feed
- Highlight
- Email to a Friend
- Report Inappropriate Content

Posted in reply to noemi_b

07-28-2017 11:33 AM

Since your example data only contains 5% churn there is NO way to get 20%.

data have; do customer=1 to 60000; churn = (rand('uniform') < 0.007); output; end; run; /* to show that we may a reasonable facsimile of your data*/ proc freq data=have; tables churn; run; /* algebra gives us a maximum number of Churn to select and the appropriate associated nonchurn to select, place the values into macro variables for use in Proc survey select if you want to use fewer churn then write a data step or that will create the macro variables with the number desired instead of count(*) */ proc sql noprint; select ceil( (count(*)/0.2)*0.8), count(*) into : NonChurnCnt, : ChurnCnt from have where churn=1 ; quit; %put &NonChurnCnt &ChurnCnt; /* we are going to use STRATA in survey select requiring the data to be sorted */ proc sort data=have; by churn; run; proc surveyselect data=have out=want sampsize=(&NonChurnCnt &ChurnCnt); strata churn; run; /*examine the output*/ proc freq data=want; tables churn; run;

- Mark as New
- Bookmark
- Subscribe
- Subscribe to RSS Feed
- Highlight
- Email to a Friend
- Report Inappropriate Content

Posted in reply to ballardw

07-28-2017 02:42 PM

Dear @ballardw,

Why I cannot achieve an 80-20 proportion if I start from a 5-95 proportion?

Thank you in advance.

-Noemi

Why I cannot achieve an 80-20 proportion if I start from a 5-95 proportion?

Thank you in advance.

-Noemi

- Mark as New
- Bookmark
- Subscribe
- Subscribe to RSS Feed
- Highlight
- Email to a Friend
- Report Inappropriate Content

Posted in reply to noemi_b

07-28-2017 03:41 PM

noemi_b wrote:

Dear @ballardw,

Why I cannot achieve an 80-20 proportion if I start from a 5-95 proportion?

Thank you in advance.

-Noemi

The phrasing of your intial question sounded like you wanted 20 to me.

- Mark as New
- Bookmark
- Subscribe
- Subscribe to RSS Feed
- Highlight
- Email to a Friend
- Report Inappropriate Content

Posted in reply to noemi_b

07-28-2017 03:49 PM

My original dataset has a proportion of 99.927% of customers being non-churners, while only the 0.67% of the sample is churner. Now, I would like to change this distribution of non-churners vs. Churners to an 80% and 20%, respectively. I want to do this in order to remove the problem of running a logistic regression with a rare event that would lead me to a biased regression result with biased estimates and probabilities.

So, I run your code and I must admit that it seems like achieving my desired goal. However, before posting me the code you wrote to me the following:

"Since your example data only contains 5% of churn there is now way to achieve 20%"

So, What did you mean by saying that? :-)

So, I run your code and I must admit that it seems like achieving my desired goal. However, before posting me the code you wrote to me the following:

"Since your example data only contains 5% of churn there is now way to achieve 20%"

So, What did you mean by saying that? :-)

- Mark as New
- Bookmark
- Subscribe
- Subscribe to RSS Feed
- Highlight
- Email to a Friend
- Report Inappropriate Content

Posted in reply to noemi_b

07-28-2017 06:36 PM

This part of your original post:

"100 observations I must end up with 80% observations being churn=0 and 20% being churn=1."

made me think that the resulting data set should have 20. Since you provided an example data set of 100 I thought that was the data set you were refering to.

- Mark as New
- Bookmark
- Subscribe
- Subscribe to RSS Feed
- Highlight
- Email to a Friend
- Report Inappropriate Content

Posted in reply to ballardw

07-29-2017 02:38 AM

whatever the starting sample size was, I simply wanted to end up with 80% of the sample being non-churners and 20% being churners. And apparently proc surveyselect is the right procedure I needed. Thank you for the help! :-)