I have a dataset with 71 million household ids that are associated with 73 million people. The issue is that each person can have a differing number of multiple house ids. For example:
person_id house_id
1 1
1 2
1 3
2 2
3 2
3 3
4 1
4 3
5 6
5 4
5 5
6 6
6 4
What is the best way to figure out which people belong to the same house? In the above example house id 1 would include person Ids 1-4 because they belong to household Ids 1-3. House Id 2 would include person Ids 5-6, because they belong to household Ids 4-6. I would like the data to look like this:
house_id person_one person_two person_three person_four
1 1 2 3 4
2 5 6 . .
Ideally there would be one, unique house Id per row (which subsumes all house Ids that correspond to the same people). Each column is person 1, person 2, person 3, etc. associated with that house Id.
I thought this was a network problem, in which you are trying to identify collections of nodes (person id's) that are connected (i.e. belong to at least one instance of the same household), but have no such connection to persons in other households.
But if that's the definition of SET you have in mind, then how does your data end up with 2 sets?
Here is a table of your house id's (rows) and person id's (cols). I don't see the first (1,2,3,4) or second (4,6) set as you post it.
Person ID | ||||||
1 | 2 | 3 | 4 | 5 | ||
H_ID | 1 | x | x | |||
2 | x | x | x | |||
3 | x | x | x | |||
6 | x | x |
I recognize that I can transpose the data to get the Ids associated with each household, but I need to go a step further, and create one household id that links all members. So households '1', '2', and '3' need to be collapsed into one Id that contains person Ids 1-4. There should be one row per house id for all affiliated members.
For clarity sake, just which "id" is represented by 'id_one' etch in this?
set id_one id_two id_three id_four 1 1 2 3 4 2 4 6 . .
You show exactly two variables, both named something_id and similar ranges of values. So the above is not very clear as to intent or rules or interpretation.
Can you clearly state what you want the result to show? "Best way" would depend on what questions need to be answered from the data.
The 'set' variable is the created household Id that links all members affiliated with all houses. 'id_one', 'id_two', 'id_three', etc. corresponds to each person in the household. The number of people associated with the same households will correspond to the number of non-missing columns.
@Caetreviop543 wrote:
The 'set' variable is the created household Id that links all members affiliated with all houses. 'id_one', 'id_two', 'id_three', etc. corresponds to each person in the household. The number of people associated with the same households will correspond to the number of non-missing columns.
Then why does SET 1 contain person id's 1, 2, 3, and 4, but does not contain person id 5, which shares a house id (#6) with person id 4, In your data example, it appears to me that you have a "super-household" containing all person id's in the data. That is, there is a "path" (i.e. sequence of shared households) from every person to every other person.
That's why I showed the table.
I put an example of a person belonging to two ostensibly different households because there is bad data; I was thinking I would later create a rule to remove duplicate people.
For the sake of this question, assume the same person can belong to more than one household. So person 4 can belong to both households 1-3 and 6. (I will clarify in the original post).
I am trying to figure out how to collapse households 1-3 into one identifier that encompasses all members, person Id 1 to person Id 4.
You seem to have circular logic going on here.
The only thing that links households 1 to 3 together (that doesn't also include the other households) is that fact it is the set of households that person 1 is a member of.
Person ID | ||||||
1 | 2 | 3 | 4 | 5 | ||
H_ID | 1 | x | x | |||
2 | x | x | x | |||
3 | x | x | x | |||
6 | x | x |
Can you create a better example?
You are right. I tried to simplify the example by indicating that people belong to disinct households.
@Caetreviop543 wrote:
I put an example of a person belonging to two ostensibly different households because there is bad data; I was thinking I would later create a rule to remove duplicate people.
If you know there is bad data then one question is can you identify it? Then clean it up so it is no longer bad. Or is that part of the purpose of this exercise?
You are still not clearly describing what you are considering to be a "set". You are obviously applying some rule(s) but have not shared them with us.
Another thing to consider is how are the person and house id values actually created. It may be that the actual values, since I really doubt the real values are simple integers in sequence as shown, have content buried in them. For example the in the US Social Security Number values the first three digits indicate a geography related to when/where a person applied for the number. Perhaps that sort of information is available and provides additional information for the process. Or perhaps the value of house id could contain implied time of assigned values (much larger for example) and a "person" associated with two values "far enough apart" might make one of the values unlikely.
What will this identified data actually be used for? 70+ million records is fairly large when discussing people. Perhaps a sample would make sense to simplify what ever process the information is used for.
The data is medical; unfortunately different, but connected house ids don't contain a numeric pattern. They do contain eligibility dates, but annoyingly, these dates do not match for patients with the same house Id. I don't think I could use a sample, because then some people in the same household would be eliminated.
Have you looked very closely at your data source definition of "house id"? That may give some clues.
I do feel your pain as I have been asked about "unique" patients when each medical record system has different patient identification and I do not typically have much detail but do know the id's are not unique to individuals across systems. So any person seen by two or more different service providers has two or more Id values.
/*When I am looking at your data after an hour,
I think you should make your person_id and house_id are different from each other
by adding prefix or do something , Like:*/
data have(rename=(person_id=from house_id=to));
infile cards ;
input person_id $ house_id $;
cards;
person_1 house_1
person_1 house_2
person_1 house_3
person_2 house_2
person_3 house_2
person_3 house_3
person_4 house_1
person_4 house_3
person_5 house_6
person_5 house_4
person_5 house_5
person_6 house_6
person_6 house_4
;
run;
data full;
set have end=last;
if _n_ eq 1 then do;
declare hash h();
h.definekey('node');
h.definedata('node');
h.definedone();
end;
output;
node=from; h.replace();
from=to; to=node;
output;
node=from; h.replace();
if last then h.output(dataset:'node');
drop node;
run;
data want(keep=node household);
declare hash ha(ordered:'a');
declare hiter hi('ha');
ha.definekey('count');
ha.definedata('last');
ha.definedone();
declare hash _ha(hashexp: 20);
_ha.definekey('key');
_ha.definedone();
if 0 then set full;
declare hash from_to(dataset:'full(where=(from is not missing and to is not missing))',hashexp:20,multidata:'y');
from_to.definekey('from');
from_to.definedata('to');
from_to.definedone();
if 0 then set node;
declare hash no(dataset:'node');
declare hiter hi_no('no');
no.definekey('node');
no.definedata('node');
no.definedone();
do while(hi_no.next()=0);
household+1; output;
count=1;
key=node;_ha.add();
last=node;ha.add();
rc=hi.first();
do while(rc=0);
from=last;rx=from_to.find();
do while(rx=0);
key=to;ry=_ha.check();
if ry ne 0 then do;
node=to;output;rr=no.remove(key:node);
key=to;_ha.add();
count+1;
last=to;ha.add();
end;
rx=from_to.find_next();
end;
rc=hi.next();
end;
ha.clear();_ha.clear();
end;
stop;
run;
proc transpose data=want(where=(node=:'person')) out=final_want(drop=_:) prefix=person;
by household;
var node;
run;
Are you ready for the spotlight? We're accepting content ideas for SAS Innovate 2025 to be held May 6-9 in Orlando, FL. The call is open until September 25. Read more here about why you should contribute and what is in it for you!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.