nstdt
Quartz | Level 8

I am using PROC SURVEYSELECT (SAS 9.4) to split a data set into Train and Validation sets in an 80-20 ratio. The Train set does not contain exactly 80% of the original records, and the Validation set does not contain exactly 20%. Is this normal? One of my strata variables contains missing values (about 2% of variable D is missing).

 

The original data set, overall_new, contains 15,573 rows, and 80% of that is 12,458 rows.

In the output, the Train set, intime_TR, has 12,475 rows, which is more than 12,458. Any ideas why this might be so?

 

Thanks!

 


   


proc surveyselect data=overall_new
                  out=sorted_intime
                  noprint
                  seed=1234
                  method=srs
                  samprate=80
                  outall;
   strata A B C D;
run;

/*12475 rows*/
data intime_TR;
   set sorted_intime;
   if selected = 1;
run;

/*3098 rows*/
data intime_VAL;
   set sorted_intime;
   if selected = 0;
run;

 

1 ACCEPTED SOLUTION

ballardw
Super User

From the documentation:

PROC SURVEYSELECT treats missing values of STRATA and SAMPLINGUNIT variables like any other STRATA or SAMPLINGUNIT variable value.

 

This means the missing values of D form a stratum level of their own, so D has one more level than it has distinct non-missing values. That adds more (and probably small) strata when D is crossed with the A B C combinations.

 

Consider a stratum that has only 7 members when you request a samprate of 80. How many records would you expect in the output? (Hint: 7 * 0.8 = 5.6, which rounds to 6.) The same happens with 80 percent of 23, or of practically any stratum size: the result is rarely a whole number, so it has to be rounded.

You may be seeing many of these round-ups accumulate because of the sizes of your strata.
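To make the arithmetic concrete, here is a small sketch with four made-up stratum sizes (7, 23, 11, 4; only the first two come from the text above). The data set and variable names are hypothetical. It compares the exact 80% figure with the per-stratum count rounded to the nearest integer and rounded up:

```sas
/* Hypothetical stratum sizes, purely for illustration */
data rounding_demo;
   input n_stratum @@;
   exact_80  = 0.8 * n_stratum;   /* unrounded 80% of the stratum  */
   n_nearest = round(exact_80);   /* rounded to the nearest integer */
   n_up      = ceil(exact_80);    /* rounded up                     */
   datalines;
7 23 11 4
;
run;

/* Column totals show how per-stratum rounding drifts from the exact 80% */
proc print data=rounding_demo noobs;
   sum n_stratum exact_80 n_nearest n_up;
run;
```

The 45 records give an exact 80% of 36, but rounding each stratum up yields 6 + 19 + 9 + 4 = 38. A surplus like the 12,475 versus 12,458 in the question is consistent with this kind of per-stratum rounding upward accumulating over many strata.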

 

Run this code:

proc freq data=overall_new;
   tables a*b*c*d / list missing;
run;

and see how many records per combination of the strata you have.
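If the list is long, one possible follow-up is to write the counts to a data set and total the rounded per-stratum sizes, so the sum can be compared with the 12,475 rows actually selected. This is only a sketch: strata_freq, strata_expected and the derived variable names are hypothetical, and COUNT is the frequency variable that PROC FREQ writes to an OUT= data set.

```sas
/* Save one row per A*B*C*D combination, missing levels included */
proc freq data=overall_new noprint;
   tables a*b*c*d / missing out=strata_freq;
run;

/* Per-stratum 80% figure under two rounding rules */
data strata_expected;
   set strata_freq;
   exact_80  = 0.8 * count;
   n_nearest = round(exact_80);
   n_up      = ceil(exact_80);
run;

/* N = number of strata; MIN/MAX show how small the strata get;
   SUM gives the total sample size implied by each rounding rule */
proc means data=strata_expected n min max sum maxdec=1;
   var count exact_80 n_nearest n_up;
run;
```

If the sum of n_up matches the size of intime_TR, the difference from 12,458 is explained entirely by rounding within strata.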

 

You don't mention how many levels each of your strata variables has, but if each has 5 or more and the records are roughly evenly distributed, you don't have many records per combination of strata variables: at most about 25 per combination (15,573 / 5^4 ≈ 25). With more levels, the count per combination drops quickly, which makes the rounding to 80 percent per stratum matter more.

 

You might be better served by summarizing the input data by the strata variables to get an explicit count of available records per combination (PROC MEANS or PROC SUMMARY; don't forget the MISSING option), using a DATA step to do your own rounding per combination, and then supplying the result as a SAMPSIZE= data set.
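A minimal sketch of that approach follows. It assumes the secondary data set passed to SAMPSIZE= must carry the STRATA variables plus a sample-size variable named _NSIZE_, in the same sort order as the input; check the PROC SURVEYSELECT documentation for your release. The names overall_sorted, strata_counts, strata_sizes and n_stratum are hypothetical, and rounding to the nearest integer is just one possible rule.

```sas
/* Sort so the input and the sample-size data set share stratum order */
proc sort data=overall_new out=overall_sorted;
   by A B C D;
run;

/* One row per stratum; MISSING keeps missing values as their own level */
proc summary data=overall_sorted nway missing;
   class A B C D;
   output out=strata_counts(drop=_type_ rename=(_freq_=n_stratum));
run;

/* Apply your own rounding rule to each stratum */
data strata_sizes;
   set strata_counts;
   _NSIZE_ = round(0.8 * n_stratum);   /* nearest integer; never exceeds n_stratum */
   keep A B C D _NSIZE_;
run;

/* Draw the sample with explicit per-stratum sizes instead of SAMPRATE= */
proc surveyselect data=overall_sorted
                  out=sorted_intime
                  noprint
                  seed=1234
                  method=srs
                  sampsize=strata_sizes
                  outall;
   strata A B C D;
run;
```

Rounding to the nearest integer within each stratum will usually land closer to 12,458 than rounding up, but it still does not guarantee an exact 80% overall. Hitting an exact total would need an extra adjustment step in the DATA step.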


