Solved: Re: Proc Survey Stratified Random Sampling

Jeff_DOC · Posted 03-05-2018 12:19 PM

Good afternoon.

Please see the example file at: https://communities.sas.com/t5/SAS-Procedures/Proc-Survey-Select-Stratified-Random-Sampling/m-p/4414...

I'm hoping someone out there can help me with PROC SURVEYSELECT. I have a dataset of multiple locations, and within each location are multiple caseloads. What I need is to randomly select 1 case from each caseload, up to a maximum of 11 per audit location. If there are more caseloads than 11 in the audit location I just want a random selection of 11 even if that means not selecting a case from one caseload. If a location has fewer than 11 caseloads, I need as many cases from each caseload as it takes to get to 11. I've read the documentation and searched the boards, and some answers almost get me there, but not quite.

I am using Enterprise Guide version 7.15 HF2

Thank you for any help you can provide.

Here's one of the many iterations I've tried, along with a file attachment for practice:

    proc surveyselect data = have01
        stats
        n = 1
        out = want01
        sampsize = 11
        selectall;
        /* method=sys*/
        /* n=11;*/
        /* control caseload;*/
        strata audit_location caseload ;
    run;

ballardw · Posted 03-05-2018 03:42 PM

@Jeff_DOC wrote:

Hey Ballardw.

Thanks for taking the time to look it over. Sorry my explanation wasn't all that clear.

What I'm looking for is a random sample of 1 per caseload but no more than 11 from any one location.

Thanks for taking the time to try to help.

That would imply Caseload as the strata variable, but PROC SURVEYSELECT will select at least one unit from each stratum.

Perhaps this is a two-stage process. Step one would be to summarize the count of caseloads per location (PROC FREQ).

Then use that summary to create a SAMPSIZE= data set. The summary data set could be used to draw a sample of caseloads from the locations with more than 11 caseloads, and to build a separate set for locations with fewer than 11 caseloads. The SAMPSIZE= data set would contain the Location and Caseload values to select from, plus an N for how many to select from that combination (the procedure expects that variable to be named _NSIZE_ or SAMPSIZE). Then combine the created sample size sets for use with the full data. You would likely need to keep the first-stage sampling probabilities to calculate a final selection probability and weight.
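
One way the two-stage idea could look in code is sketched below. This is only a sketch under assumptions, not a tested solution: the input is taken to be have01 with variables audit_location and caseload (as in the original post), the intermediate data set names and seed values are made up for illustration, and the rule for topping up small locations (spread 11 as evenly as possible over the caseloads) is just one possible reading of the requirement.

```sas
/* Sort once so STRATA and BY processing see the data in order */
proc sort data=have01 out=have_sorted;
    by audit_location caseload;
run;

/* Stage 0: one row per location/caseload (COUNT = cases in that caseload) */
proc freq data=have_sorted noprint;
    tables audit_location*caseload
           / out=caseload_counts(keep=audit_location caseload count);
run;

/* Stage 1: pick at most 11 caseloads per location.                  */
/* SELECTALL keeps every caseload when a location has 11 or fewer.   */
/* stage1 also carries SelectionProb for the first-stage selection.  */
proc surveyselect data=caseload_counts out=stage1
                  method=srs n=11 selectall seed=20180305; /* arbitrary seed */
    strata audit_location;
run;

/* Build the SAMPSIZE= data set: spread 11 over the k chosen caseloads */
data nsize;
    /* first pass: k = number of chosen caseloads in this location */
    do k = 1 by 1 until (last.audit_location);
        set stage1;
        by audit_location;
    end;
    /* second pass: re-read the same k rows and assign a sample size */
    do i = 1 to k;
        set stage1;
        _nsize_ = floor(11 / k) + (i <= mod(11, k));
        output;
    end;
    keep audit_location caseload _nsize_;
run;

/* Keep only cases that belong to a chosen caseload */
data frame;
    merge have_sorted
          nsize(in=picked keep=audit_location caseload);
    by audit_location caseload;
    if picked;
run;

/* Stage 2: draw _NSIZE_ cases from each chosen caseload */
proc surveyselect data=frame out=want01
                  method=srs sampsize=nsize selectall seed=20180306; /* arbitrary seed */
    strata audit_location caseload;
run;
```

With 11 or more caseloads every chosen caseload gets _nsize_ = 1; with 3 caseloads the allocation would be 4, 4, 3. If a caseload holds fewer cases than its _nsize_, SELECTALL takes all of them, so that location can still end up with fewer than 11 cases.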

ballardw · Posted 03-05-2018 12:38 PM

The first time you posted this I spent some time trying to figure out what you wanted. I couldn't get anything that really made sense in terms of a "stratified" sample. Stratified to me means you have one or more categorical variables that subdivide your data in some order (state, then county, for example). But your wording, "from each caseload" together with "more caseloads than 11 in the audit location I just want a random selection of 11 even if that means not selecting a case from one caseload", makes it hard to tell which level matters most.

It might help to provide a small dummy data set and show what an actual "sample" would look like from that data.

I would likely start by dropping caseload from your strata statement and see if the result comes close to what you want.
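
As a rough illustration of that starting point, the call might look like the sketch below. It assumes the have01 data set and audit_location variable from the original post; the output name and seed are placeholders. Note that it samples 11 cases per location without regard to caseload, so it does not guarantee one case per caseload.

```sas
/* STRATA expects the input sorted by the strata variable */
proc sort data=have01 out=have_sorted;
    by audit_location;
run;

/* 11 cases per location; SELECTALL takes every case when a */
/* location has fewer than 11 cases in total                */
proc surveyselect data=have_sorted out=want_by_location
                  method=srs n=11 selectall seed=12345; /* arbitrary seed */
    strata audit_location;
run;
```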

Jeff_DOC · Posted 03-05-2018 12:46 PM

Hey Ballardw.

Thanks for taking the time to look it over. Sorry my explanation wasn't all that clear.

What I'm looking for is a random sample of 1 per caseload but no more than 11 from any one location.

Thanks for taking the time to try to help.

Jeff_DOC · Posted 03-06-2018 11:28 AM

I think you have the right idea about a two-step process. That's not something I'd thought of before. Perhaps I can randomly select one from each caseload in step 1 and then, from that, select 11 from each location in step 2? Thanks for the help and the idea.

Jeff_DOC · Posted 03-06-2018 06:28 PM

The two-step process seems to have worked. First I stratified the data by caseload, and then the second time by location. Thanks for the help.
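
The exact code used was not posted, so the following is only a guess at what those two passes might look like. It assumes the have01 data set with audit_location and caseload from the first post; the intermediate names, the renamed probability variables, and the seeds are hypothetical. The last step multiplies the two selection probabilities, following the earlier advice to keep the first-stage probabilities.

```sas
proc sort data=have01 out=have_sorted;
    by audit_location caseload;
run;

/* Pass 1: one random case from every caseload.                    */
/* Rename the stage-1 statistics so pass 2 does not overwrite them */
proc surveyselect data=have_sorted method=srs n=1 seed=1001 /* arbitrary seed */
        out=one_per_caseload(rename=(SelectionProb=prob_stage1
                                     SamplingWeight=weight_stage1));
    strata audit_location caseload;
run;

/* Pass 2: at most 11 of those cases per location.             */
/* SELECTALL keeps them all when a location has fewer than 11  */
proc surveyselect data=one_per_caseload out=stage2
                  method=srs n=11 selectall seed=1002; /* arbitrary seed */
    strata audit_location;
run;

/* Overall selection probability and weight across both passes */
data want01;
    set stage2;
    final_prob   = prob_stage1 * SelectionProb;
    final_weight = 1 / final_prob;
run;
```

This version never returns more than one case per caseload, so a location with fewer than 11 caseloads ends up with fewer than 11 cases.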
