
Divide data into multiple samples by groups/strata

I have a data set of unique individuals with basic demographics such as gender and age group. How can I divide the data into multiple samples, each with a demographic distribution similar to that of the original data set? My guess is to use PROC SURVEYSELECT, but I'm not sure how to set it up.

For example, the file below has 30 individuals with gender and age_group information, and I'd like to divide it into four samples whose demographic distribution is similar to that of the original 30 individuals. Similarly, if there are 500 distinct individuals in 10 demographic strata, I'd like to have 5 data sets with the same distribution as the original data. How can I achieve that? Thanks a lot!

data person_fl;
infile datalines truncover dsd;
input Person_ID gender $ age_group $9.;
datalines;
1,F,Age 21-30
2,F,Age 31-40
3,M,Age 51-60
4,M,Age 41-50
5,F,Age 21-30
6,M,Age 31-40
7,F,Age 51-60
8,F,Age 41-50
9,F,Age 21-30
10,M,Age 31-40
11,M,Age 51-60
12,F,Age 41-50
13,M,Age 21-30
14,F,Age 31-40
15,F,Age 51-60
16,F,Age 41-50
17,M,Age 21-30
18,M,Age 31-40
19,F,Age 51-60
20,M,Age 41-50
21,F,Age 21-30
22,F,Age 31-40
23,M,Age 51-60
24,M,Age 41-50
25,F,Age 21-30
26,M,Age 31-40
27,F,Age 51-60
28,F,Age 41-50
29,M,Age 21-30
30,M,Age 31-40
;
run;
proc freq data=person_fl;
table gender*age_group/list missing;
run;

1 ACCEPTED SOLUTION

SAS Employee

Re: Divide data into multiple samples by groups/strata

Hi @lichee, you're right: PROC SURVEYSELECT with the GROUPS=n option will do the trick here.

First you'll want to make sure the data is sorted by the variables that will define your strata.

```
proc sort data=person_fl out=sorted;
   by gender age_group;
run;
```

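Then run PROC SURVEYSELECT with the GROUPS=n option and a STRATA statement. Within each stratum, every observation is randomly assigned to one of the n groups, and the output data set identifies the assignment in a GroupID variable.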

```
proc surveyselect data=sorted groups=4 out=sampled;
   strata gender age_group;
run;
```
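Once the step above has run without error, a quick way to check the split and to write each group to its own data set might look like the sketch below. It assumes the output data set is named SAMPLED and that four groups were requested; the names sample1 through sample4 are hypothetical and only for illustration.

```sas
/* Compare the demographic mix across groups.                      */
/* GroupID is the variable that GROUPS= adds to the OUT= data set. */
proc freq data=sampled;
   tables GroupID*gender*age_group / list missing;
run;

/* Optional: route each group to its own data set.                 */
/* sample1-sample4 are hypothetical names; adjust the list to      */
/* match the number of groups actually requested.                  */
data sample1 sample2 sample3 sample4;
   set sampled;
   select (GroupID);
      when (1) output sample1;
      when (2) output sample2;
      when (3) output sample3;
      when (4) output sample4;
      otherwise;
   end;
run;
```

Keeping everything in one data set and filtering on GroupID is often simpler than creating separate data sets, so the second step is only needed if separate files are really required.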

Be careful to choose a number of groups no larger than the number of observations in the smallest stratum; otherwise the procedure stops with an error. I ran this with GROUPS=4 on your sample data set and got an error, because the smallest gender-by-age_group strata there contain only 3 individuals.
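One way to avoid that error is to look up the smallest stratum size before choosing the number of groups. The following is a minimal sketch, not the only way to do it: the macro variables min_n and n_groups, the output name sampled_safe, and the seed value are all assumptions made for illustration.

```sas
/* Find the size of the smallest gender-by-age_group stratum. */
proc sql noprint;
   select min(n) into :min_n trimmed
   from (select count(*) as n
           from person_fl
          group by gender, age_group);
quit;

%put NOTE: Smallest stratum has &min_n observations.;

/* Cap the requested number of groups (4 here) at that size.  */
%let n_groups = %sysfunc(min(4, &min_n));

/* SEED= makes the random assignment reproducible.            */
/* SORTED is the output of the PROC SORT step above.          */
proc surveyselect data=sorted groups=&n_groups seed=20240101 out=sampled_safe;
   strata gender age_group;
run;
```

If the capped value is smaller than the number of samples you need, the alternative is to define coarser strata (for example, wider age bands) so that every stratum has at least as many observations as groups.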

Here's the documentation explaining the GROUPS option: https://documentation.sas.com/doc/en/pgmsascdc/9.4_3.4/statug/statug_surveyselect_syntax01.htm#statu...
