BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
lizzy28
Quartz | Level 8

Hi everyone,

 

I have a dataset with two identifiers: one (person_id) for each person, and the other (house_id) for each household. Multiple members with different person_ids in a household share one house_id. Is there a quick way to draw randomly one member out of each household?

 

Below is my sample data:

person_id house_id Gender Category
12345611 123456 F 1
12345612 123456 M 2
12345613 123456 M 2
23456711 234567 M 1
23456712 234567 F 3
45678911 456789 M 2
45678912 456789 F 2
65432111 654321 M 3
65432112 654321 F 1

 

The target dataset could be:

person_id house_id Gender Category
12345611 123456 F 1
23456712 234567 F 3
45678911 456789 M 2
65432112 654321 F 1

 

Thank you!

Lizi

1 ACCEPTED SOLUTION

Accepted Solutions
ed_sas_member
Meteorite | Level 14

Hi @lizzy28 

You can also use PROC SURVEYSELECT as follows:

data have;
	infile datalines dlm="09"x;
	input person_id	house_id Gender $ Category;
	datalines;
12345611	123456	F	1
12345612	123456	M	2
12345613	123456	M	2
23456711	234567	M	1
23456712	234567	F	3
45678911	456789	M	2
45678912	456789	F	2
65432111	654321	M	3
65432112	654321	F	1
;
run;

proc surveyselect data=have method=srs n=1 out=want (drop=selectionprob samplingweight) noprint;
	strata house_id;
run;

The STRATA statement names the stratification variables HOUSE_ID. In the PROC SURVEYSELECT statement, the METHOD=SRS option specifies simple random sampling. The N=1 option specifies a sample size of 1 observation for each stratum.

View solution in original post

4 REPLIES 4
mkeintz
PROC Star

If your data are sorted by house_id, then you can:

  1. Read and count consecutive person records in one household setting _N_PERS to the count.
  2. Generate a random integer _RAND_DRAW between 1 and _N_PERS
  3. Re-read the same household and output the record whose sequence matches _RAND_DRAW
  4. Do the same sequence for the next household

 

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------
KachiM
Rhodochrosite | Level 12

Hi @lizzy28 ,

 

The code-implemetation of @mkeintz is given below:

 

data want;
   call streaminit(123);
   do count = 1 by 1 until(last.house_id);
      set household;
      by house_id;
   end;
   rnd = ceil(rand('UNIFORM') * count);
   do count = 1 by 1 until(last.house_id);
      set household;
      by house_id;
      if rnd = count then output; 
    end;
drop count rnd ;
run;

 

 

lizzy28
Quartz | Level 8
Thank you!

I tested the data. The only issue is that the outcome data turn out to have 'M' gender only. I believe it is related to how the data is sorted.
ed_sas_member
Meteorite | Level 14

Hi @lizzy28 

You can also use PROC SURVEYSELECT as follows:

data have;
	infile datalines dlm="09"x;
	input person_id	house_id Gender $ Category;
	datalines;
12345611	123456	F	1
12345612	123456	M	2
12345613	123456	M	2
23456711	234567	M	1
23456712	234567	F	3
45678911	456789	M	2
45678912	456789	F	2
65432111	654321	M	3
65432112	654321	F	1
;
run;

proc surveyselect data=have method=srs n=1 out=want (drop=selectionprob samplingweight) noprint;
	strata house_id;
run;

The STRATA statement names the stratification variables HOUSE_ID. In the PROC SURVEYSELECT statement, the METHOD=SRS option specifies simple random sampling. The N=1 option specifies a sample size of 1 observation for each stratum.

sas-innovate-2024.png

Available on demand!

Missed SAS Innovate Las Vegas? Watch all the action for free! View the keynotes, general sessions and 22 breakouts on demand.

 

Register now!

Mastering the WHERE Clause in PROC SQL

SAS' Charu Shankar shares her PROC SQL expertise by showing you how to master the WHERE clause using real winter weather data.

Find more tutorials on the SAS Users YouTube channel.

Discussion stats
  • 4 replies
  • 550 views
  • 3 likes
  • 4 in conversation