I have a data set of unique individuals with basic demographics such as gender and age group. How can I divide the data into multiple samples, each with a distribution of demographics similar to the original data set? My guess is that PROC SURVEYSELECT is the tool to use, but I'm not sure how to set it up.
For example, the file below has 30 individuals with gender and age_group information. I'd like to divide the file into four samples, each with a demographic distribution similar to the original 30 individuals. Similarly, if there are 500 distinct individuals in 10 demographic strata, I'd like to have 5 data sets with the same distribution as the original data. How can I achieve that? Thanks a lot!
data person_fl;
infile datalines truncover dsd;
input Person_ID gender $ age_group $9.;
datalines;
1,F,Age 21-30
2,F,Age 31-40
3,M,Age 51-60
4,M,Age 41-50
5,F,Age 21-30
6,M,Age 31-40
7,F,Age 51-60
8,F,Age 41-50
9,F,Age 21-30
10,M,Age 31-40
11,M,Age 51-60
12,F,Age 41-50
13,M,Age 21-30
14,F,Age 31-40
15,F,Age 51-60
16,F,Age 41-50
17,M,Age 21-30
18,M,Age 31-40
19,F,Age 51-60
20,M,Age 41-50
21,F,Age 21-30
22,F,Age 31-40
23,M,Age 51-60
24,M,Age 41-50
25,F,Age 21-30
26,M,Age 31-40
27,F,Age 51-60
28,F,Age 41-50
29,M,Age 21-30
30,M,Age 31-40
;
run;
proc freq data=person_fl;
table gender*age_group/list missing;
run;
Hi @lichee, you're right: PROC SURVEYSELECT will do the trick here with the GROUPS=n option.
First you'll want to make sure the data is sorted by the variables that will define your strata.
proc sort data=person_fl out=sorted;
by gender age_group;
run;
Then run PROC SURVEYSELECT with the GROUPS=n option and the STRATA statement.
proc surveyselect data=sorted groups=4 out=sampled;
strata gender age_group;
run;
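To confirm that the groups really do mirror the original demographics, one option is to cross-tabulate the group identifier against the strata variables. The sketch below assumes that the output data set carries a GroupID variable identifying each observation's group, and it uses GROUPS=3 and an arbitrary SEED= value purely for illustration on the 30-row sample data (both are assumptions, not part of the code above).

```sas
/* Assumes person_fl from the question already exists. */
proc sort data=person_fl out=sorted;
    by gender age_group;
run;

/* GROUPS=3 and SEED=12345 are illustrative assumptions:      */
/* 3 fits the sample data, and the seed makes the split       */
/* repeatable from run to run.                                */
proc surveyselect data=sorted groups=3 seed=12345 out=sampled;
    strata gender age_group;
run;

/* One row per group-by-stratum combination, so the counts    */
/* can be compared across groups.                             */
proc freq data=sampled;
    tables GroupID*gender*age_group / list missing;
run;
```

If the split worked as intended, each stratum's count should differ by at most one between groups.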
Be careful to choose a number of groups no bigger than the number of observations in the smallest stratum; otherwise the procedure throws an error. I tried running this with GROUPS=4 on your sample dataset and got an error, because the smallest strata there contain only 3 observations.
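A quick way to find that upper limit before running the split is to count the observations in each stratum and take the minimum. This is a minimal sketch using PROC SQL; the names strata_counts, n_obs, and smallest_stratum are made up for illustration.

```sas
/* Count observations per gender/age_group stratum, then      */
/* report the size of the smallest one. That value is the     */
/* largest number of groups the strata can support.           */
proc sql;
    select min(n_obs) as smallest_stratum
    from (select gender, age_group, count(*) as n_obs
          from person_fl
          group by gender, age_group) as strata_counts;
quit;
```

For the 30-row sample data the smallest strata hold 3 individuals, so this query returns 3.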
Here's the documentation explaining the GROUPS option: https://documentation.sas.com/doc/en/pgmsascdc/9.4_3.4/statug/statug_surveyselect_syntax01.htm#statu...
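Since the question asks for separate data sets rather than one file with a group label, a DATA step can route each observation by its group. The sketch below assumes the PROC SURVEYSELECT output is named sampled, that it contains a GroupID variable numbered from 1, and that three groups were requested; the names sample1 through sample3 are hypothetical.

```sas
/* Write each observation to the data set that matches its    */
/* group. Add one WHEN branch and one output data set per     */
/* additional group.                                          */
data sample1 sample2 sample3;
    set sampled;
    select (GroupID);
        when (1) output sample1;
        when (2) output sample2;
        when (3) output sample3;
        otherwise; /* ignore unexpected values */
    end;
run;
```

In many cases it is simpler to keep the single data set and filter with a WHERE clause on GroupID, which avoids maintaining several copies of the data.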