Solved
Contributor
Posts: 46

# Proc Survey Select - Stratified Random Sampling


I am using PROC SURVEYSELECT (SAS 9.4) to split a data set into Train and Validation sets in an 80-20 split. The number of records in the Train set is not exactly 80% of the original, and the Validation set is not exactly 20%. Is this normal? One of my strata variables contains missing values (about 2% of variable D is missing).

The original data set, overall_new, contains 15,573 rows, and 80% of that is 12,458 rows.

In the resulting data sets, the Train set, intime_TR, has 12,475 rows, which is more than 12,458. Any ideas why this might be so?

Thanks!

```
proc surveyselect data=overall_new
  out=sorted_intime
  noprint
  seed=1234
  method=srs
  samprate=80
  outall;
  strata A B C D;
run;

/*12475 rows*/
data intime_TR;
  set sorted_intime;
  if selected = 1;
run;

/*3098 rows*/
data intime_VAL;
  set sorted_intime;
  if selected = 0;
run;
```

Accepted Solutions
Solution
‎02-19-2018 04:16 AM
Super User
Posts: 13,583

## Re: Proc Survey Select - Stratified Random Sampling

From the documentation:

PROC SURVEYSELECT treats missing values of STRATA and SAMPLINGUNIT variables like any other STRATA or SAMPLINGUNIT.

This means that D has one more stratum level than it has distinct non-missing values, which is likely creating extra small strata in combination with A, B, and C.

Consider a stratum that has only 7 members when you request a samprate of 80. How many would you expect in the output? (Hint: 7 * 0.8 = 5.6, which rounds to 6.) The same goes for 80 percent of 23, or practically any stratum size: you'll have rounding issues.

You may be having multiple round-up issues due to the sizes of your strata.
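To make the rounding effect concrete, here is a small sketch. The stratum sizes (7, 23, 10, 4) are made up for illustration, and it assumes the per-stratum sample size is rounded up, as described above; `ceil` stands in for that rule and `round` shows nearest-integer rounding for comparison.

```sas
/* Hypothetical stratum sizes; not taken from overall_new */
data rounding_demo;
  input n_avail @@;
  exact      = 0.8 * n_avail;   /* 80 percent before rounding    */
  rounded_up = ceil(exact);     /* assumed round-up rule         */
  nearest    = round(exact);    /* nearest integer, for contrast */
  datalines;
7 23 10 4
;
run;

/* Show each stratum and the column totals */
proc print data=rounding_demo noobs;
  sum n_avail exact rounded_up nearest;
run;
```

Each stratum whose 80 percent is not a whole number adds a fraction of a row when rounded up, so with many small strata the total sample can end up noticeably above 80 percent of the full data set.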

Run this code:

```
proc freq data=overall_new;
  tables a*b*c*d / list missing;
run;
```

and see how many records per combination of the strata you have.
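If the list is long, one option is to write the counts to a data set and look at the smallest combinations first. This is only a sketch: the data set name `strata_freq` and the cutoff of 20 rows are arbitrary choices, not something from the original post.

```sas
/* Same table as above, written to a data set instead of printed.
   MISSING keeps the missing level of D as its own combination. */
proc freq data=overall_new noprint;
  tables a*b*c*d / list missing out=strata_freq;
run;

/* COUNT is the frequency variable that PROC FREQ writes to OUT= */
proc sort data=strata_freq;
  by count;
run;

/* The 20 smallest strata combinations */
proc print data=strata_freq(obs=20) noobs;
run;
```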

You don't mention how many levels each of your strata variables has, but if each has 5 or more and the records are roughly evenly distributed, you don't have many records per combination: four variables with 5 levels each give 5^4 = 625 combinations, or about 25 records per combination (15,573 / 625). With more levels the count per combination drops further, which makes the rounding to 80 percent per stratum a bigger issue.

You might be better served by summarizing the input data by the strata variables to get an explicit count of available records per combination (PROC MEANS or PROC SUMMARY; don't forget the MISSING option), using a data step to do your own rounding per combination, and then supplying the result as a SAMPSIZE= data set.
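A minimal sketch of that approach follows. It has not been run against your data. The data set names `strata_counts`, `strata_sizes`, and `overall_sorted`, the variable `n_avail`, and the choice of nearest-integer rounding are all assumptions for illustration. The `_NSIZE_` variable is what PROC SURVEYSELECT reads from a SAMPSIZE= data set.

```sas
/* 1. Count records per A*B*C*D combination.
      NWAY keeps only the full four-way combinations;
      MISSING keeps missing D as its own level. */
proc summary data=overall_new nway missing;
  class A B C D;
  output out=strata_counts(drop=_type_ rename=(_freq_=n_avail));
run;

/* 2. Do the rounding yourself; nearest integer is an assumed choice */
data strata_sizes;
  set strata_counts;
  _nsize_ = round(0.8 * n_avail);
run;

/* 3. Sort both data sets by the strata variables */
proc sort data=strata_sizes;
  by A B C D;
run;

proc sort data=overall_new out=overall_sorted;
  by A B C D;
run;

/* 4. Sample with explicit per-stratum sizes instead of SAMPRATE= */
proc surveyselect data=overall_sorted
  out=sorted_intime
  noprint
  seed=1234
  method=srs
  sampsize=strata_sizes
  outall;
  strata A B C D;
run;
```

The total may still differ from 80 percent of 15,573 by a few rows, since each stratum is still rounded to a whole number, but the rounding rule is now explicit and you can adjust it per combination in step 2.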


☑ This topic is solved.
