I am using PROC SURVEYSELECT (SAS 9.4) to split a data set into Train and Validation sets in an 80-20 split, and I find that the number of records in the Train set is not exactly 80% of the original (nor exactly 20% in the Validation set). Is this normal? One of my strata variables contains missing values (about 2% of variable D is missing).
The original data set, overall_new, contains 15,573 rows, and 80% of that is 12,458 rows.
In the resulting data sets, the Train set, intime_TR, has 12,475 rows, which is more than 12,458. Any ideas why this might be so?
Thanks!
proc surveyselect data=overall_new
out=sorted_intime
noprint
seed=1234
method=srs
samprate=80
outall;
strata A B C D;
run;
/*12475 rows*/
data intime_TR;
set sorted_intime;
if selected =1;
run;
/*3098 rows*/
data intime_VAL;
set sorted_intime;
if selected =0;
run;
From the documentation:
PROC SURVEYSELECT treats missing values of STRATA and SAMPLINGUNIT variables like any other STRATA or SAMPLINGUNIT variable value.
This means that the missing values of D form one additional stratum level, which splits the A B C combinations into even more (and smaller) strata.
Consider a stratum that has only 7 members when you request a SAMPRATE of 80. How many records would you expect in the output? (Hint: 7 * 0.8 = 5.6, which rounds to 6.) The same happens with 80 percent of 23, or of practically any stratum size that is not a multiple of 5: you will have rounding issues.
You may be accumulating many such round-ups because the rate is applied within each stratum, not to the data set as a whole.
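To make the arithmetic concrete, here is a small sketch of what happens per stratum. The stratum sizes are made up for illustration, and I am assuming the procedure rounds the computed size up (that is my reading of the default behavior; check the SAMPRATE= documentation for your release). The nearest-integer column is only there for comparison.

```sas
/* Hypothetical stratum sizes, only to illustrate the rounding */
data rounding_demo;
   input stratum_size;
   exact_80pct = 0.8 * stratum_size;   /* what 80% would be without rounding */
   rounded_up  = ceil(exact_80pct);    /* assumed default: round up */
   nearest     = round(exact_80pct);   /* nearest integer, for comparison */
   extra_rows  = rounded_up - exact_80pct;
   datalines;
7
23
100
;
run;

proc print data=rounding_demo noobs;
run;
```

Each stratum whose size is not a multiple of 5 contributes a fraction of a row beyond 80%, and those fractions add up across strata.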
Run this code:
proc freq data=overall_new;
tables a*b*c*d/list missing;
run;
and see how many records you have per combination of the strata variables.
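You can also check this directly on the output you already have. A rough sketch, assuming sorted_intime was created with OUTALL as in your code, so that it contains the 0/1 variable Selected; the table name stratum_check and the column names are just ones I made up:

```sas
proc sql;
   /* one row per stratum: size, number selected, and rows beyond an exact 80% */
   create table stratum_check as
   select A, B, C, D,
          count(*)      as n_total,
          sum(Selected) as n_selected,
          calculated n_selected - 0.8 * calculated n_total as excess
   from sorted_intime
   group by A, B, C, D;

   /* overall totals; the summed excess should account for the extra Train rows */
   select sum(n_total)    as total_rows,
          sum(n_selected) as selected_rows,
          sum(excess)     as total_excess
   from stratum_check;
quit;
```

If the per-stratum rounding is the cause, total_excess should be close to the gap between your Train count and 80% of 15,573.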
You don't mention how many levels each of your strata variables has, but if each has 5 or more and the records are roughly evenly distributed, you don't have many records per combination of strata variables: at most about 25 per combination (15,573 rows over 5 * 5 * 5 * 5 = 625 combinations). With more levels, the number of records per combination goes down further, which increases the effect of rounding to 80 percent within each stratum.
You might be better served by summarizing the input data by the strata variables to get an explicit count of available records (PROC MEANS or PROC SUMMARY; don't forget the MISSING option), using a DATA step to do your own rounding per combination, and passing the result to PROC SURVEYSELECT as a SAMPSIZE= data set.
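Something along these lines, as an untested sketch rather than a finished solution. The data set names overall_sorted, strata_counts and strata_sizes and the variable n_available are my own; _NSIZE_ is the variable name PROC SURVEYSELECT looks for in a SAMPSIZE= data set. I sort first because the strata groups need to appear in the same order in both data sets. Rounding to the nearest integer is just one possible choice; it will get you closer to 80% overall but still does not guarantee exactly 12,458 rows.

```sas
/* STRATA processing expects the input sorted by the strata variables */
proc sort data=overall_new out=overall_sorted;
   by A B C D;
run;

/* Count available records per stratum; MISSING keeps missing D as its own level */
proc summary data=overall_sorted nway missing;
   class A B C D;
   output out=strata_counts(drop=_type_ rename=(_freq_=n_available));
run;

/* Do the rounding yourself; nearest integer here, change the rule as needed */
data strata_sizes;
   set strata_counts;
   _NSIZE_ = round(0.8 * n_available);
run;

/* Same selection as before, but with explicit per-stratum sample sizes */
proc surveyselect data=overall_sorted
                  out=sorted_intime
                  noprint
                  seed=1234
                  method=srs
                  sampsize=strata_sizes
                  outall;
   strata A B C D;
run;
```

If you need the total to hit an exact number, you would have to adjust _NSIZE_ in a few strata by hand (or by rule) in the DATA step before running the selection.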