Tommy1
Quartz | Level 8

Hi SAS Community,

 

I have a distribution of how often data occurs in certain regions in dataset 1. I am trying to create a stratified random sample of another dataset with the same region variable, so that the sample has the same proportion of accounts from each region as dataset 1.

 

I have been trying to use PROC SURVEYSELECT to do this, but I am struggling to figure out how to tie the distribution from one dataset to another and the documentation has only made me more confused. 

 

An example of what I am trying to do is:

Dataset 1:

State  Count  Percent_of_total
PA     10     10%
NY     30     30%
DE     60     60%

 

Dataset 2 (N=200):

State  Count  Percent_of_total
PA     40     20%
NY     50     25%
DE     110    55%

Let's say I want to randomly sample 100 of the 200. I want the sample to match the proportions from dataset 1, so that PA makes up 10% of the 100, NY 30%, and DE 60%. That means selecting 10 of the 40 PA records, 30 of the 50 NY records, and 60 of the 110 DE records.

 

I hope this makes sense what I am trying to do. Happy to clarify if what I am saying doesn't make sense. 

Thanks so much for the help!

 

 

 

Accepted Solutions
ballardw
Super User

PROC SURVEYSELECT assumes that you have some sort of population file that you want to select observations from.

 

I don't see a clear statement that you have such a data set.

Do you want to select a COUNT or a PERCENTAGE of records? You get to pick one with surveyselect.

If you specify a percentage (SAMPRATE=), that rate is applied within each stratum when a STRATA statement is present, which is almost certainly not what you want. If your population set is large enough and you select a random sample of size N without any strata, the sample's proportions for a variable like state will be close to those in the population data, but likely won't be exact.
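To see why a single rate does not give the split the question asks for, here is a minimal sketch. The data set names pop_demo and rate_sample and the seed value are made up for illustration; the stratum counts mirror dataset 2 from the question.

```sas
/* Hypothetical population: one row per account, counts as in dataset 2. */
data pop_demo;
   length state $2;
   input state $ n;
   do i = 1 to n;
      account_id + 1;      /* running id across all states */
      output;
   end;
   drop i n;
datalines;
DE 110
NY 50
PA 40
;

/* SAMPRATE= with STRATA applies the same rate inside every stratum. */
proc surveyselect data=pop_demo out=rate_sample
   samprate=0.5            /* half of each state's records */
   seed=2024               /* arbitrary value, only for repeatability */
;
   strata state;
run;

/* Count the selected records per state. */
proc freq data=rate_sample;
   tables state;
run;
```

Each state contributes half of its records (55 DE, 25 NY, 20 PA), so the sample keeps the population's 55/25/20 split rather than the 60/30/10 target.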

SAMPSIZE= lets you specify an exact number for each stratum, which SAS will attempt to match if there are enough records in the population.

 

One example:

/* create a population data set */
data dummy;
  do i=1 to 10000;
     assign=rand('integer',3);
     select (assign);
        when(1) state='PA';
        when(2) state='NY';
        when(3) state='DE';
        otherwise;
     end;
     output;
  end;
run;
/* strata wants sorted data by the strata variable*/
proc sort data=dummy;
   by state;
run;
/*_nsize_ is specific keyword to specify 
   count of observations, other keywords
   for other criteria
   The strata variable(s) values must match that
   of the population set 
*/
data selectcontrol;
   input state :$2. _nsize_;
datalines;
PA 10
NY 30
DE 60
;
/* sort order for the control data set
   has to match the strata
*/
proc sort data=selectcontrol;
   by state;
run;


proc surveyselect data=dummy out=selected
   sampsize=selectcontrol  /* <= this tells SAS to use the control set */
;
   strata state;
run;

This would be a bit of overkill once you understand how the STRATA and SAMPSIZE options interact. You can specify a list of values for SAMPSIZE, such as SAMPSIZE=(60 30 10). The FIRST number is the number of observations to select from the first stratum, the second from the second stratum, the third from the third (and so on). With the population sorted by state, the strata come in the order DE, NY, PA, so this list asks for 60 from DE, 30 from NY, and 10 from PA. This gets a bit more complicated with two strata variables, because each number has to match a combination of values. Look closely at your output data set from sorting the population.
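One quick way to check the order that a positional SAMPSIZE list relies on is to tabulate the strata variable. This is only a sketch, and it assumes the sorted DUMMY data set from the example above already exists.

```sas
/* Assumes DUMMY from the example above exists and is sorted by STATE. */
proc freq data=dummy;
   tables state / nocum;   /* rows appear in sorted order: DE, NY, PA */
run;
```

The rows of the table give the order in which the SAMPSIZE values are consumed, and the frequencies show whether each stratum has enough records for the count you plan to request.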

This will create a similar selection to the previous:

proc surveyselect data=dummy out=selected2
   sampsize=(60 30 10)  /* sorted strata order: DE NY PA */
;
   strata state;
run;

If you want to be able to repeat the same selection with the exact same code, specify SEED= with a positive integer. Note that changing the OS or SAS version may still produce different results because of other factors out of your control.
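Tying this back to the original question, the control data set does not have to be typed by hand. The sketch below derives _NSIZE_ from a reference distribution and fixes the seed. The data set names dist1, pop2, control, and matched, the variable percent, the macro variable total, and the seed value are all assumptions made for illustration; the numbers are the ones from the question.

```sas
/* Hypothetical reference distribution (dataset 1 in the question). */
data dist1;
   length state $2;
   input state $ percent;
datalines;
DE 60
NY 30
PA 10
;

/* Hypothetical population (dataset 2): one row per account. */
data pop2;
   length state $2;
   input state $ n;
   do i = 1 to n;
      account_id + 1;      /* running id across all states */
      output;
   end;
   drop i n;
datalines;
DE 110
NY 50
PA 40
;

%let total = 100;          /* desired overall sample size */

/* Turn each percentage into a per-stratum count in _NSIZE_. */
data control;
   set dist1;
   _nsize_ = round(&total * percent / 100);
   keep state _nsize_;
run;

/* Both data sets must be in the same strata order. */
proc sort data=control; by state; run;
proc sort data=pop2;    by state; run;

proc surveyselect data=pop2 out=matched
   sampsize=control
   seed=12345              /* arbitrary positive integer */
;
   strata state;
run;

/* Check the result: 60 DE, 30 NY, 10 PA. */
proc freq data=matched;
   tables state;
run;
```

With percentages that do not divide evenly, the rounded counts may not add up to the intended total, so it is worth checking the sum of _NSIZE_ before sampling. Each count also has to be no larger than the number of records available in its stratum.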

 

 

Tommy1
Quartz | Level 8

@ballardw Thank you so much for the speedy reply! In my effort to create a simplified example to explain what I am trying to do with the data, I didn't show that I do in fact have a population file.

 

You explain this so well and make it so easy to understand. I feel like the documentation was so confusing that I couldn't figure out which options to use. I was able to modify your code to do exactly what I wanted. I am creating a ton of different versions for what I need to do and this allows me to make all those different versions. 

 

Thank you so much for your help!
