Hi everyone,
I have a dataset with two identifiers: one (person_id) for each person, and the other (house_id) for each household. Multiple members with different person_ids in a household share one house_id. Is there a quick way to draw randomly one member out of each household?
Below is my sample data:
person_id | house_id | Gender | Category |
12345611 | 123456 | F | 1 |
12345612 | 123456 | M | 2 |
12345613 | 123456 | M | 2 |
23456711 | 234567 | M | 1 |
23456712 | 234567 | F | 3 |
45678911 | 456789 | M | 2 |
45678912 | 456789 | F | 2 |
65432111 | 654321 | M | 3 |
65432112 | 654321 | F | 1 |
The target dataset could be:
person_id | house_id | Gender | Category |
12345611 | 123456 | F | 1 |
23456712 | 234567 | F | 3 |
45678911 | 456789 | M | 2 |
65432112 | 654321 | F | 1 |
Thank you!
Lizi
Hi @lizzy28
You can also use PROC SURVEYSELECT as follows:
data have;
infile datalines dlm="09"x;
input person_id house_id Gender $ Category;
datalines;
12345611 123456 F 1
12345612 123456 M 2
12345613 123456 M 2
23456711 234567 M 1
23456712 234567 F 3
45678911 456789 M 2
45678912 456789 F 2
65432111 654321 M 3
65432112 654321 F 1
;
run;
proc surveyselect data=have method=srs n=1 out=want (drop=selectionprob samplingweight) noprint;
strata house_id;
run;
The STRATA statement names the stratification variables HOUSE_ID. In the PROC SURVEYSELECT statement, the METHOD=SRS option specifies simple random sampling. The N=1 option specifies a sample size of 1 observation for each stratum.
If your data are sorted by house_id, then you can:
Hi @lizzy28 ,
The code-implemetation of @mkeintz is given below:
data want;
call streaminit(123);
do count = 1 by 1 until(last.house_id);
set household;
by house_id;
end;
rnd = ceil(rand('UNIFORM') * count);
do count = 1 by 1 until(last.house_id);
set household;
by house_id;
if rnd = count then output;
end;
drop count rnd ;
run;
Hi @lizzy28
You can also use PROC SURVEYSELECT as follows:
data have;
infile datalines dlm="09"x;
input person_id house_id Gender $ Category;
datalines;
12345611 123456 F 1
12345612 123456 M 2
12345613 123456 M 2
23456711 234567 M 1
23456712 234567 F 3
45678911 456789 M 2
45678912 456789 F 2
65432111 654321 M 3
65432112 654321 F 1
;
run;
proc surveyselect data=have method=srs n=1 out=want (drop=selectionprob samplingweight) noprint;
strata house_id;
run;
The STRATA statement names the stratification variables HOUSE_ID. In the PROC SURVEYSELECT statement, the METHOD=SRS option specifies simple random sampling. The N=1 option specifies a sample size of 1 observation for each stratum.
Join us for SAS Innovate 2025, our biggest and most exciting global event of the year, in Orlando, FL, from May 6-9. Sign up by March 14 for just $795.
SAS' Charu Shankar shares her PROC SQL expertise by showing you how to master the WHERE clause using real winter weather data.
Find more tutorials on the SAS Users YouTube channel.