Hi SAS Community,
I have a distribution of how often data occurs in certain regions in dataset 1. I am trying to create a stratified random sample of another dataset with the same region variable, so that the sample has the same proportion of accounts from each region as dataset 1.
I have been trying to use PROC SURVEYSELECT to do this, but I am struggling to figure out how to tie the distribution from one dataset to another and the documentation has only made me more confused.
An example of what I am trying to do is:
Dataset 1:
State  Count  Percent_of_total
PA     10     10%
NY     30     30%
DE     60     60%
Dataset 2 (N=200):
State  Count  Percent_of_total
PA     40     20%
NY     50     25%
DE     110    55%
Let's say I want to randomly sample 100 of the 200. I would want to match the proportions from dataset 1, so that PA makes up 10% of the sample of 100, NY 30%, and DE 60%. That would mean selecting 10 of the 40 PA records, 30 of the 50 NY records, and 60 of the 110 DE records.
I hope this makes sense what I am trying to do. Happy to clarify if what I am saying doesn't make sense.
Thanks so much for the help!
PROC SURVEYSELECT assumes that you have some sort of population file that you want to select observations from.
I don't see a clear statement that you have such a data set.
Do you want to select a COUNT or a PERCENTAGE of records? You get to pick one with surveyselect.
If you specify a percentage (SAMPRATE), that percentage is applied within each stratum when you use STRATA, which is almost certainly not what you want. If your population set is large enough and you select a random sample of size N without any strata, the result from SURVEYSELECT will be close to the population's proportion for any variable like state, but it likely won't be exact.
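To make the SAMPRATE behavior concrete, here is a minimal sketch. The data set name pop2 and the seed value are made up for illustration; the counts mirror "Dataset 2" from the question. Because the same rate is applied inside every state, the sample should keep the population's own 20/25/55 mix rather than the 10/30/60 mix that is wanted.

```sas
/* Hypothetical population mirroring "Dataset 2": 40 PA, 50 NY, 110 DE */
data pop2;
   length state $2;
   do i = 1 to 40;  state = 'PA'; output; end;
   do i = 1 to 50;  state = 'NY'; output; end;
   do i = 1 to 110; state = 'DE'; output; end;
   drop i;
run;

/* STRATA expects the input sorted by the strata variable */
proc sort data=pop2;
   by state;
run;

/* SAMPRATE= with STRATA: the 50% rate is applied within each state */
proc surveyselect data=pop2 out=rate_sample method=srs
                  samprate=0.5 seed=12345;   /* seed value is arbitrary */
   strata state;
run;

/* Check how many records were drawn from each state */
proc freq data=rate_sample;
   tables state;
run;
```

The PROC FREQ output shows the per-state counts, which makes it easy to see that a single rate cannot impose a different distribution on the sample.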
SAMPSIZE lets you specify an exact number for each stratum, which SAS will attempt to match if there are enough observations in the population.
One example:
/* Create a population data set */
data dummy;
   do i = 1 to 10000;
      assign = rand('integer', 3);
      select (assign);
         when (1) state = 'PA';
         when (2) state = 'NY';
         when (3) state = 'DE';
         otherwise;
      end;
      output;
   end;
run;

/* STRATA wants the data sorted by the strata variable */
proc sort data=dummy;
   by state;
run;

/* _NSIZE_ is the specific variable name used to give the count of
   observations to select; other names exist for other criteria.
   The strata variable values must match those in the population set. */
data selectcontrol;
   input state :$2. _nsize_;
datalines;
PA 10
NY 30
DE 60
;

/* The sort order of the control data set has to match the strata */
proc sort data=selectcontrol;
   by state;
run;

proc surveyselect data=dummy out=selected
   sampsize=selectcontrol   /* <= this tells SAS to use the control set */
   ;
   strata state;
run;
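The control data set does not have to be typed by hand. As a rough sketch of how the distribution in one data set could be tied to the sample drawn from another, the per-state counts can be derived from the reference data with PROC FREQ. The names ref, ref_dist, and the macro variable total are assumptions made for this illustration, and ref is built here only to mirror "Dataset 1" from the question.

```sas
/* Hypothetical reference data mirroring "Dataset 1": one row per account */
data ref;
   length state $2;
   do i = 1 to 10; state = 'PA'; output; end;
   do i = 1 to 30; state = 'NY'; output; end;
   do i = 1 to 60; state = 'DE'; output; end;
   drop i;
run;

/* OUT= gets one row per state with COUNT and PERCENT,
   ordered by state (the default ORDER=INTERNAL) */
proc freq data=ref noprint;
   tables state / out=ref_dist;
run;

%let total = 100;   /* desired overall sample size (assumption) */

/* Convert each state's percentage into a per-stratum sample size */
data selectcontrol;
   set ref_dist;
   _nsize_ = round(percent / 100 * &total);
   keep state _nsize_;
run;

proc print data=selectcontrol noobs;
run;
```

The resulting selectcontrol can be passed as SAMPSIZE=selectcontrol with STRATA state. Two caveats: rounding can make the stratum sizes sum to slightly more or less than the target when the percentages are not round numbers, and the state variable should have the same type and values as in the population data set.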
This would be a bit of overkill once you understand how the STRATA and SAMPSIZE options interact. You can specify a list of values for SAMPSIZE, such as SAMPSIZE=(60 30 10). The first number is the number of observations to select from the first stratum, the second from the second stratum, the third from the third, and so on, with the strata taken in sorted order (here DE, NY, PA). This gets a bit more complicated with two strata variables, because each number has to be matched to a combination of values. Look closely at your output data set from sorting the population.
This will create a similar selection to the previous:
/* Sizes follow the sorted strata order: DE=60, NY=30, PA=10 */
proc surveyselect data=dummy out=selected2 sampsize=(60 30 10);
   strata state;
run;
If you want to be able to repeat the same selection with the exact same code, specify SEED= with a positive integer of your choosing. Note that changing the OS or SAS version may keep you from duplicating the results, because of other factors out of your control.
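As a small sketch of the SEED= point, the two runs below use the same seed and are then compared. The population pop2 is a hypothetical stand-in built to mirror "Dataset 2" from the question, and the seed value and output names are arbitrary.

```sas
/* Hypothetical population mirroring "Dataset 2": 110 DE, 50 NY, 40 PA */
data pop2;
   length state $2;
   do id = 1 to 200;
      if id <= 110 then state = 'DE';
      else if id <= 160 then state = 'NY';
      else state = 'PA';
      output;
   end;
run;

proc sort data=pop2;
   by state;
run;

/* Same code, same seed, run twice; sizes follow sorted order DE, NY, PA */
proc surveyselect data=pop2 out=run1 sampsize=(60 30 10) seed=20250 noprint;
   strata state;
run;

proc surveyselect data=pop2 out=run2 sampsize=(60 30 10) seed=20250 noprint;
   strata state;
run;

/* On the same system and SAS release the two samples should match */
proc compare base=run1 compare=run2;
run;

/* Per-state counts in the sample */
proc freq data=run1;
   tables state;
run;
```

The PROC FREQ step doubles as a check that each state received the intended number of selected records.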
@ballardw Thank you so much for the speedy reply! In my effort to create a simplified example to explain what I am trying to do with the data, I didn't show that I do in fact have a population file.
You explain this so well and make it so easy to understand. I feel like the documentation was so confusing that I couldn't figure out which options to use. I was able to modify your code to do exactly what I wanted. I am creating a ton of different versions for what I need to do and this allows me to make all those different versions.
Thank you so much for your help!