Proc Survey Select - Stratified Random Sampling
02-15-2018 05:06 AM - edited 02-15-2018 05:10 AM

When using PROC SURVEYSELECT (SAS 9.4) to split my data into Train and Validation data sets 80-20, I find that the number of records in the Train set is not exactly 80% of the original (nor exactly 20% in the Validation set). Is this normal? One of my strata variables contains missing values (about 2% of variable D is missing).

The original data set, *overall_new*, contains 15,573 rows, and 80% of that is 12,458 rows.

In the resulting data sets, the Train set, *intime_TR*, has 12,475 rows, which is more than 12,458. Any ideas why this might be so?

Thanks!

```
proc surveyselect data=overall_new
out=sorted_intime
noprint
seed=1234
method=srs
samprate=80
outall;
strata A B C D;
run;
```

```
/* 12475 rows */
data intime_TR;
set sorted_intime;
if selected = 1;
run;

/* 3098 rows */
data intime_VAL;
set sorted_intime;
if selected = 0;
run;
```

Accepted Solutions

Solution

02-19-2018
04:16 AM

Posted in reply to nstdt

02-15-2018 11:29 AM

From the documentation:

PROC SURVEYSELECT treats missing values of STRATA and SAMPLINGUNIT variables like any other STRATA or SAMPLINGUNIT variable value.

This means that D has one more stratum level than it has nonmissing values: the records with missing D form their own strata in combination with A, B, and C, which likely adds to the number of small strata.

Consider a stratum that has only 7 members when you request a samprate of 80. How many records would you expect in the output? (Hint: 7 * 0.8 = 5.6, which rounds to 6.) The same goes for 80 percent of 23, or of practically any stratum size: you will have rounding issues.

You may be seeing several round-up effects because of the sizes of your strata.
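The rounding effect can be checked with a short DATA step. This is only an illustrative sketch: the data set name `rounding_demo` and the stratum sizes are made up, and both rounding rules are shown because which one PROC SURVEYSELECT applies to SAMPRATE= in your release should be confirmed in the documentation.

```
/* Hypothetical stratum sizes, used only to illustrate rounding */
data rounding_demo;
  input n_stratum;
  exact   = 0.8 * n_stratum;  /* 80 percent of the stratum, usually not an integer */
  nearest = round(exact);     /* rounded to the nearest integer */
  up      = ceil(exact);      /* rounded up to the next integer */
  datalines;
7
23
;
run;

proc print data=rounding_demo noobs;
run;
```

For 7 records the exact value is 5.6, so either rule selects 6. For 23 records the exact value is 18.4, so the two rules give 18 and 19. Summed over many small strata, these differences can move the total away from 80 percent of the whole data set.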

Run this code:

```
proc freq data=overall_new;
tables a*b*c*d / list missing;
run;
```

and see how many records you have per combination of the strata variables.

You don't mention how many levels each of your strata variables has, but if each has 5 or more levels and the records are roughly evenly distributed, you don't have many records per combination: with 5 levels each there are 5^4 = 625 combinations, or about 25 records per combination. With more levels, the number of records per combination drops further, which increases the effect of rounding to 80 percent within each stratum.

You might be better served by summarizing the input data by the strata variables to get an explicit count of available records (PROC MEANS or PROC SUMMARY; don't forget the MISSING option), using a DATA step to do your own rounding per combination, and then using the result as a SAMPSIZE= data set.
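A minimal sketch of that approach follows, untested against your data. The names `overall_sorted`, `strata_counts`, `strata_sizes`, and `n_avail` are hypothetical. It assumes that the SAMPSIZE= data set must contain the STRATA variables plus a sample-size variable named `_NSIZE_`, in the same sorted order as the input, and the rounding rule in the DATA step is just one possible choice.

```
/* Sort by the strata variables so both data sets share the same order */
proc sort data=overall_new out=overall_sorted;
  by A B C D;
run;

/* Count available records per stratum; MISSING keeps missing D as its own level */
proc summary data=overall_sorted nway missing;
  class A B C D;
  output out=strata_counts(drop=_type_ rename=(_freq_=n_avail));
run;

/* Apply an explicit rounding rule per stratum (here: nearest integer) */
data strata_sizes;
  set strata_counts;
  _NSIZE_ = round(0.8 * n_avail);
run;

/* Request the precomputed stratum sample sizes instead of a rate */
proc surveyselect data=overall_sorted
                  out=sorted_intime
                  noprint
                  seed=1234
                  method=srs
                  sampsize=strata_sizes
                  outall;
  strata A B C D;
run;
```

Even with explicit sizes, the per-stratum values are still integers, so the total will not necessarily equal exactly 12,458. Summing `_NSIZE_` in `strata_sizes` before sampling shows what total to expect, and the DATA step is the place to adjust individual strata if an exact total matters.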
