Auto-suggest helps you quickly narrow down your search results by suggesting possible matches as you type.

Showing results for

Find a Community

- Home
- /
- SAS Programming
- /
- SAS Studio
- /
- How to cut dataset such that the proportion of obs...

- Subscribe to RSS Feed
- Mark Topic as New
- Mark Topic as Read
- Float this Topic for Current User
- Bookmark
- Subscribe
- Printer Friendly Page

- Mark as New
- Bookmark
- Subscribe
- Subscribe to RSS Feed
- Highlight
- Email to a Friend
- Report Inappropriate Content

3 weeks ago - last edited 3 weeks ago

Hi SAS experts,

I am using SAS University Edition v.9.4.

I need your help with writing a code that will help me doing the following task:

I must undersample (or oversample) the dataset such that the observations linked to the binary variable churn (0 no churn and 1 churn) are distributed with a proportion of 80% not-churn and 20% churn. In other words, with a dataset of i.e. 100 observations I must end up with 80% observations being churn=0 and 20% being churn=1.

Right now I have a dataset that has only 0.67% of customers being churners and 99.927% of non-churners. My dataset has approximately 65790 observations and the frequency of churners is only 330 of the dataset, I must oversample or undersample (i don't know what is best) in order to have an 80-20 distribution. Unfortunately, I cannot come up with an actionable code that succeeds in the task.

Below a small data step of 100 observations that represents what my real dataset is. Hopefully, you can play around with it to create the code that will allow me to achieve the desired goal.

```
data have;
input customer_id_ano churn;
cards;
1 0
2 0
3 0
4 0
5 0
6 0
7 0
8 0
9 0
10 0
11 0
12 0
13 0
14 0
15 0
16 0
17 0
18 0
19 0
20 0
21 0
22 0
23 0
24 0
25 0
26 0
27 0
28 0
29 0
30 0
31 0
32 0
33 0
34 0
35 0
36 0
37 0
38 0
39 0
40 0
41 0
42 0
43 0
44 0
45 0
46 0
47 0
48 0
49 0
50 0
51 0
52 0
53 0
54 0
55 0
56 0
57 0
58 0
59 0
60 0
61 1
62 0
63 0
64 0
65 0
66 0
67 0
68 0
69 0
70 0
71 0
72 1
73 1
74 0
75 0
76 0
77 0
78 0
79 0
80 0
81 0
82 1
83 0
84 0
85 0
86 0
87 0
88 0
89 0
90 0
91 0
92 0
93 1
94 0
95 0
96 0
97 0
98 0
99 0
100 0;
run;
```

Thank you in advance for the precious help you always provide me with!

Accepted Solutions

Solution

3 weeks ago

- Mark as New
- Bookmark
- Subscribe
- Subscribe to RSS Feed
- Highlight
- Email to a Friend
- Report Inappropriate Content

3 weeks ago

Since your example data only contains 5% churn there is NO way to get 20%.

data have; do customer=1 to 60000; churn = (rand('uniform') < 0.007); output; end; run; /* to show that we may a reasonable facsimile of your data*/ proc freq data=have; tables churn; run; /* algebra gives us a maximum number of Churn to select and the appropriate associated nonchurn to select, place the values into macro variables for use in Proc survey select if you want to use fewer churn then write a data step or that will create the macro variables with the number desired instead of count(*) */ proc sql noprint; select ceil( (count(*)/0.2)*0.8), count(*) into : NonChurnCnt, : ChurnCnt from have where churn=1 ; quit; %put &NonChurnCnt &ChurnCnt; /* we are going to use STRATA in survey select requiring the data to be sorted */ proc sort data=have; by churn; run; proc surveyselect data=have out=want sampsize=(&NonChurnCnt &ChurnCnt); strata churn; run; /*examine the output*/ proc freq data=want; tables churn; run;

All Replies

Solution

3 weeks ago

- Mark as New
- Bookmark
- Subscribe
- Subscribe to RSS Feed
- Highlight
- Email to a Friend
- Report Inappropriate Content

3 weeks ago

Since your example data only contains 5% churn there is NO way to get 20%.

data have; do customer=1 to 60000; churn = (rand('uniform') < 0.007); output; end; run; /* to show that we may a reasonable facsimile of your data*/ proc freq data=have; tables churn; run; /* algebra gives us a maximum number of Churn to select and the appropriate associated nonchurn to select, place the values into macro variables for use in Proc survey select if you want to use fewer churn then write a data step or that will create the macro variables with the number desired instead of count(*) */ proc sql noprint; select ceil( (count(*)/0.2)*0.8), count(*) into : NonChurnCnt, : ChurnCnt from have where churn=1 ; quit; %put &NonChurnCnt &ChurnCnt; /* we are going to use STRATA in survey select requiring the data to be sorted */ proc sort data=have; by churn; run; proc surveyselect data=have out=want sampsize=(&NonChurnCnt &ChurnCnt); strata churn; run; /*examine the output*/ proc freq data=want; tables churn; run;

- Mark as New
- Bookmark
- Subscribe
- Subscribe to RSS Feed
- Highlight
- Email to a Friend
- Report Inappropriate Content

3 weeks ago

Dear @ballardw,

Why I cannot achieve an 80-20 proportion if I start from a 5-95 proportion?

Thank you in advance.

-Noemi

Why I cannot achieve an 80-20 proportion if I start from a 5-95 proportion?

Thank you in advance.

-Noemi

- Mark as New
- Bookmark
- Subscribe
- Subscribe to RSS Feed
- Highlight
- Email to a Friend
- Report Inappropriate Content

3 weeks ago

noemi_b wrote:

Dear @ballardw,

Why I cannot achieve an 80-20 proportion if I start from a 5-95 proportion?

Thank you in advance.

-Noemi

The phrasing of your intial question sounded like you wanted 20 to me.

- Mark as New
- Bookmark
- Subscribe
- Subscribe to RSS Feed
- Highlight
- Email to a Friend
- Report Inappropriate Content

3 weeks ago

My original dataset has a proportion of 99.927% of customers being non-churners, while only the 0.67% of the sample is churner. Now, I would like to change this distribution of non-churners vs. Churners to an 80% and 20%, respectively. I want to do this in order to remove the problem of running a logistic regression with a rare event that would lead me to a biased regression result with biased estimates and probabilities.

So, I run your code and I must admit that it seems like achieving my desired goal. However, before posting me the code you wrote to me the following:

"Since your example data only contains 5% of churn there is now way to achieve 20%"

So, What did you mean by saying that? :-)

So, I run your code and I must admit that it seems like achieving my desired goal. However, before posting me the code you wrote to me the following:

"Since your example data only contains 5% of churn there is now way to achieve 20%"

So, What did you mean by saying that? :-)

- Mark as New
- Bookmark
- Subscribe
- Subscribe to RSS Feed
- Highlight
- Email to a Friend
- Report Inappropriate Content

3 weeks ago

This part of your original post:

"100 observations I must end up with 80% observations being churn=0 and 20% being churn=1."

made me think that the resulting data set should have 20. Since you provided an example data set of 100 I thought that was the data set you were refering to.

- Mark as New
- Bookmark
- Subscribe
- Subscribe to RSS Feed
- Highlight
- Email to a Friend
- Report Inappropriate Content

3 weeks ago

whatever the starting sample size was, I simply wanted to end up with 80% of the sample being non-churners and 20% being churners. And apparently proc surveyselect is the right procedure I needed. Thank you for the help! :-)