SAS Programming

Tommy1 · Posted 10-17-2023 03:55 PM

Hi SAS Community,

I have a distribution of how often data occurs in certain regions in dataset 1. I am trying to create a stratified random sample of another dataset with the same region variable, so that the sample has the same proportion of accounts from each region as dataset 1.

I have been trying to use PROC SURVEYSELECT to do this, but I am struggling to figure out how to tie the distribution from one dataset to another and the documentation has only made me more confused.

An example of what I am trying to do is:

Dataset 1:

State  Count  Percent_of_total
PA     10     10%
NY     30     30%
DE     60     60%

Dataset 2 (N=200):

State  Count  Percent_of_total
PA     40     20%
NY     50     25%
DE     110    55%

Let's say I want to randomly sample 100 of the 200. I would want to match the proportions from dataset 1, so that PA made up 10% of the sample of 100, NY 30%, and DE 60%. That would mean selecting 10 of the 40 PA records, 30 of the 50 NY records, and 60 of the 110 DE records.

I hope it is clear what I am trying to do. Happy to clarify if it isn't.

Thanks so much for the help!

ballardw · Posted 10-17-2023 05:30 PM

PROC SURVEYSELECT assumes that you have some sort of population file that you want to select observations from.

I don't see a clear statement that you have such a data set.

Do you want to select a COUNT or a PERCENTAGE of records? You get to pick one with surveyselect.

If you specify a percentage (SAMPRATE), that percentage is applied within each STRATA group, if any, which is almost certainly not what you want. If your population set is large enough and you select a random sample of size N without any strata, the result from SURVEYSELECT will be close to the population's proportion of any variable, like state, in the data, but it likely won't be exact.
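To make the SAMPRATE point concrete, here is a minimal sketch. The data set name pop_example and the seed value are made up for illustration; the counts mirror Dataset 2 from the question.

```sas
/* Hypothetical population mirroring Dataset 2 in the question:
   40 PA, 50 NY and 110 DE records */
data pop_example;
   length state $2;
   do i = 1 to 40;  state = 'PA'; output; end;
   do i = 1 to 50;  state = 'NY'; output; end;
   do i = 1 to 110; state = 'DE'; output; end;
   drop i;
run;

/* STRATA expects the input sorted by the strata variable */
proc sort data=pop_example;
   by state;
run;

/* A single rate is applied to every stratum separately */
proc surveyselect data=pop_example out=rate_sample
   samprate=0.5 seed=12345;
   strata state;
run;

/* Show how many records were selected from each state */
proc freq data=rate_sample;
   tables state;
run;
```

Half of each state is selected, so the sample keeps the population's own mix (20% PA, 25% NY, 55% DE) rather than the 10/30/60 split that was asked for.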

SAMPSIZE will let you specify an exact number for each stratum, which SAS will attempt to match if there are enough observations in the population.

One example:

/* create a population data set */
data dummy;
  do i=1 to 10000;
     /* random integer from 1 to 3 picks the state */
     assign=rand('integer',3);
     select (assign);
        when(1) state='PA';
        when(2) state='NY';
        when(3) state='DE';
        otherwise;
     end;
     output;
  end;
run;
/* STRATA wants the data sorted by the strata variable */
proc sort data=dummy;
   by state;
run;
/* _nsize_ is the specific keyword to specify the
   count of observations; other keywords exist
   for other criteria.
   The strata variable(s) values must match those
   of the population set.
*/
data selectcontrol;
   input state :$2. _nsize_;
datalines;
PA 10
NY 30
DE 60
;
/* the sort order for the control data set
   has to match the strata
*/
proc sort data=selectcontrol;
   by state;
run;


Proc surveyselect data=dummy out=selected
   sampsize=selectcontrol  /*<= this tells SAS to use the control set*/
;
strata state;
run;

This would be a bit of overkill once you understand how the STRATA and SAMPSIZE options interact. You can specify a list of values for SAMPSIZE, such as SAMPSIZE=(60 30 10). The FIRST number is the number of observations to select from the first strata value, the second from the second, the third from the third (and so on), following the sorted order of the strata variable: here DE, NY, PA. This gets a bit more complicated with two strata variables, because each number has to be matched to a combination of values. Look closely at your output data set from sorting the population.

This will create a similar selection to the previous:

Proc surveyselect data=dummy out=selected2
   sampsize=(60 30 10)  /* sorted strata order: DE=60, NY=30, PA=10 */
;
strata state;
run;
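For the two-strata case mentioned above, a control data set is usually easier to keep straight than a list of numbers. This is only a sketch: the SEGMENT variable, the data set names, the counts, and the seed are all invented for illustration.

```sas
/* Hypothetical population with a second strata variable, SEGMENT:
   100 records for each state/segment combination */
data pop2way;
   length state $2 segment $1;
   do state = 'DE', 'NY', 'PA';
      do segment = 'A', 'B';
         do i = 1 to 100;
            output;
         end;
      end;
   end;
   drop i;
run;

/* Sort by both strata variables, in the order used on STRATA */
proc sort data=pop2way;
   by state segment;
run;

/* One row per state/segment combination, with the count to select */
data control2way;
   input state :$2. segment :$1. _nsize_;
datalines;
DE A 40
DE B 20
NY A 20
NY B 10
PA A 5
PA B 5
;

/* The control set must be in the same order as the population */
proc sort data=control2way;
   by state segment;
run;

proc surveyselect data=pop2way out=selected2way
   sampsize=control2way seed=12345;
   strata state segment;
run;
```

The list form for the same request would be sampsize=(40 20 20 10 5 5), one number per combination in sorted order, which is where it gets easy to misplace a value.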

If you want to be able to repeat the same selection with the exact same code, specify the SEED= option with a fixed positive integer. Note that changing the OS or SAS version may keep the results from being duplicated, because of other factors out of your control.
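If typing the counts by hand gets tedious, the control data set can also be built from the first data set's distribution. The following is a sketch under assumptions, not a tested recipe for your data: dataset1, population, ref_dist and the macro variable total_n are hypothetical names, dataset1 is assumed to have one row per account, and STATE is assumed to have the same type and length in both data sets. The two small DATA steps only recreate the counts from the question so the example runs on its own.

```sas
/* Stand-ins for the real data, using the counts from the question */
data dataset1;
   length state $2;
   do i = 1 to 10; state = 'PA'; output; end;
   do i = 1 to 30; state = 'NY'; output; end;
   do i = 1 to 60; state = 'DE'; output; end;
   drop i;
run;

data population;
   length state $2;
   do i = 1 to 40;  state = 'PA'; output; end;
   do i = 1 to 50;  state = 'NY'; output; end;
   do i = 1 to 110; state = 'DE'; output; end;
   drop i;
run;

/* Total sample size wanted (assumed value) */
%let total_n = 100;

/* Distribution of STATE in the reference data;
   the OUT= data set holds COUNT and PERCENT per state */
proc freq data=dataset1 noprint;
   tables state / out=ref_dist;
run;

/* Convert each percentage into a per-stratum sample size */
data selectcontrol;
   set ref_dist;
   _nsize_ = round(&total_n * percent / 100);
   keep state _nsize_;
run;

/* Both data sets must be in the same strata order */
proc sort data=selectcontrol;
   by state;
run;
proc sort data=population;
   by state;
run;

proc surveyselect data=population out=selected
   sampsize=selectcontrol seed=12345;
   strata state;
run;

/* Check the selected counts per state */
proc freq data=selected;
   tables state;
run;
```

With these inputs the control set asks for 10 PA, 30 NY and 60 DE. On real data the rounded counts may not add up to exactly the total you wanted, and a stratum that asks for more records than the population holds will cause an error by default; I believe the SELECTALL option covers that case, but check the documentation.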

Tommy1 · Posted 10-18-2023 12:58 PM

@ballardw Thank you so much for the speedy reply! In my effort to create a simplified example to explain what I am trying to do with the data, I didn't show that I do in fact have a population file.

You explain this so well and make it so easy to understand. I feel like the documentation was so confusing that I couldn't figure out which options to use. I was able to modify your code to do exactly what I wanted. I am creating a ton of different versions for what I need to do and this allows me to make all those different versions.

Thank you so much for your help!
