Hi everyone,
I have a dataset with two identifiers: one (person_id) for each person, and the other (house_id) for each household. Multiple members with different person_ids in a household share one house_id. Is there a quick way to draw randomly one member out of each household?
Below is my sample data:
person_id | house_id | Gender | Category |
12345611 | 123456 | F | 1 |
12345612 | 123456 | M | 2 |
12345613 | 123456 | M | 2 |
23456711 | 234567 | M | 1 |
23456712 | 234567 | F | 3 |
45678911 | 456789 | M | 2 |
45678912 | 456789 | F | 2 |
65432111 | 654321 | M | 3 |
65432112 | 654321 | F | 1 |
The target dataset could be:
person_id | house_id | Gender | Category |
12345611 | 123456 | F | 1 |
23456712 | 234567 | F | 3 |
45678911 | 456789 | M | 2 |
65432112 | 654321 | F | 1 |
Thank you!
Lizi
Hi @lizzy28
You can also use PROC SURVEYSELECT as follows:
data have;
infile datalines dlm="09"x;
input person_id house_id Gender $ Category;
datalines;
12345611 123456 F 1
12345612 123456 M 2
12345613 123456 M 2
23456711 234567 M 1
23456712 234567 F 3
45678911 456789 M 2
45678912 456789 F 2
65432111 654321 M 3
65432112 654321 F 1
;
run;
proc surveyselect data=have method=srs n=1 out=want (drop=selectionprob samplingweight) noprint;
strata house_id;
run;
The STRATA statement names the stratification variables HOUSE_ID. In the PROC SURVEYSELECT statement, the METHOD=SRS option specifies simple random sampling. The N=1 option specifies a sample size of 1 observation for each stratum.
If your data are sorted by house_id, then you can:
Hi @lizzy28 ,
The code-implemetation of @mkeintz is given below:
data want;
call streaminit(123);
do count = 1 by 1 until(last.house_id);
set household;
by house_id;
end;
rnd = ceil(rand('UNIFORM') * count);
do count = 1 by 1 until(last.house_id);
set household;
by house_id;
if rnd = count then output;
end;
drop count rnd ;
run;
Hi @lizzy28
You can also use PROC SURVEYSELECT as follows:
data have;
infile datalines dlm="09"x;
input person_id house_id Gender $ Category;
datalines;
12345611 123456 F 1
12345612 123456 M 2
12345613 123456 M 2
23456711 234567 M 1
23456712 234567 F 3
45678911 456789 M 2
45678912 456789 F 2
65432111 654321 M 3
65432112 654321 F 1
;
run;
proc surveyselect data=have method=srs n=1 out=want (drop=selectionprob samplingweight) noprint;
strata house_id;
run;
The STRATA statement names the stratification variables HOUSE_ID. In the PROC SURVEYSELECT statement, the METHOD=SRS option specifies simple random sampling. The N=1 option specifies a sample size of 1 observation for each stratum.
Are you ready for the spotlight? We're accepting content ideas for SAS Innovate 2025 to be held May 6-9 in Orlando, FL. The call is open until September 25. Read more here about why you should contribute and what is in it for you!
SAS' Charu Shankar shares her PROC SQL expertise by showing you how to master the WHERE clause using real winter weather data.
Find more tutorials on the SAS Users YouTube channel.