lichee
Quartz | Level 8

I have a data set with unique individuals and their basic demographics, such as gender and age group. How can I divide the data into multiple samples, each with a distribution of demographics similar to that of the original data set? My guess is that PROC SURVEYSELECT is the right tool, but I'm not sure how to set it up.

For example, the file below has 30 individuals with gender and age_group information. I'd like to divide the file into four samples, each with a demographic distribution similar to that of the original 30 individuals. Similarly, if there are 500 distinct individuals in 10 demographic strata, I'd like to have 5 data sets with the same distribution as the original data. How can I achieve that? Thanks a lot!

 

data person_fl;
infile datalines truncover dsd;
input Person_ID gender $ age_group $9.;
datalines;
1,F,Age 21-30
2,F,Age 31-40
3,M,Age 51-60
4,M,Age 41-50
5,F,Age 21-30
6,M,Age 31-40
7,F,Age 51-60
8,F,Age 41-50
9,F,Age 21-30
10,M,Age 31-40
11,M,Age 51-60
12,F,Age 41-50
13,M,Age 21-30
14,F,Age 31-40
15,F,Age 51-60
16,F,Age 41-50
17,M,Age 21-30
18,M,Age 31-40
19,F,Age 51-60
20,M,Age 41-50
21,F,Age 21-30
22,F,Age 31-40
23,M,Age 51-60
24,M,Age 41-50
25,F,Age 21-30
26,M,Age 31-40
27,F,Age 51-60
28,F,Age 41-50
29,M,Age 21-30
30,M,Age 31-40
;
run;
proc freq data=person_fl;
table gender*age_group/list missing;
run;

 

Accepted Solution
antonbcristina
SAS Employee

Hi @lichee, you're right: PROC SURVEYSELECT will do the trick here with the GROUPS=n option.

 

First you'll want to make sure the data is sorted by the variables that will define your strata. 

proc sort data=person_fl out=sorted;
by gender age_group;
run;

 

Then run PROC SURVEYSELECT with the GROUPS=n option and a STRATA statement.

proc surveyselect data=sorted groups=4 out=sampled;
   strata gender age_group;
run;

 

Be careful to choose a number of groups no larger than the number of observations in the smallest stratum; otherwise the procedure stops with an error. I tried running this with GROUPS=4 on your sample data set and got an error, because the smallest gender-by-age_group strata in that data contain only three observations.
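One way to find a safe value for GROUPS= is to count the observations per stratum first. The following is a minimal sketch rather than a definitive recipe: the data set name stratum_counts, the macro variable min_stratum, and the seed value 12345 are made up for illustration, and the SEED= option is only there to make the random assignment reproducible.

```sas
/* Count the observations in each gender-by-age_group stratum */
proc freq data=person_fl noprint;
   tables gender*age_group / out=stratum_counts;
run;

/* Store the smallest stratum size in a macro variable (hypothetical name) */
proc sql noprint;
   select min(count) into :min_stratum trimmed
   from stratum_counts;
quit;

%put NOTE: Smallest stratum has &min_stratum observations.;

/* Use that size as the upper limit for the number of groups */
proc surveyselect data=sorted groups=&min_stratum seed=12345 out=sampled;
   strata gender age_group;
run;
```

This yields fewer than four groups for the posted sample, so getting four would require coarser strata (for example, fewer age groups) or more data. Within a stratum whose size is not a multiple of the number of groups, group sizes differ by one, so the groups have similar rather than identical distributions.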

 

Here's the documentation explaining the GROUPS option: https://documentation.sas.com/doc/en/pgmsascdc/9.4_3.4/statug/statug_surveyselect_syntax01.htm#statu...
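Once the PROC SURVEYSELECT step runs without error, the output data set contains a GroupID variable identifying the group each observation was assigned to. As a rough sketch, assuming three groups and the hypothetical output names sample1 through sample3, you could compare the distributions and then write each group to its own data set like this:

```sas
/* Cross-tabulate group membership against the strata to compare distributions */
proc freq data=sampled;
   tables GroupID*gender*age_group / list missing;
run;

/* Write each group to its own data set (three groups assumed here) */
data sample1 sample2 sample3;
   set sampled;
   if GroupID = 1 then output sample1;
   else if GroupID = 2 then output sample2;
   else if GroupID = 3 then output sample3;
run;
```

For a different number of groups, adjust the list of output data sets and the IF conditions to match the GROUPS= value.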


