Re: Grouping data by sets

Caetreviop543 · Posted 07-19-2023 12:00 PM

I have a dataset with 71 million household ids that are associated with 73 million people. The issue is that each person can have a differing number of multiple house ids. For example:

person_id  house_id
1               1
1               2
1               3
2               2
3               2
3               3
4               1
4               3
5               6
5               4
5               5
6               6
6               4

What is the best way to figure out which people belong to the same house? In the above example house id 1 would include person Ids 1-4 because they belong to household Ids 1-3. House Id 2 would include person Ids 5-6, because they belong to household Ids 4-6. I would like the data to look like this:

house_id  person_one  person_two  person_three  person_four 
1         1           2           3             4           
2         5           6           .             .

Ideally there would be one, unique house Id per row (which subsumes all house Ids that correspond to the same people). Each column is person 1, person 2, person 3, etc. associated with that house Id.

mkeintz · Posted 07-19-2023 12:42 PM

I thought this was a network problem, in which you are trying to identify collections of nodes (person id's) that are connected (i.e. belong to at least one instance of the same household), but have no such connection to persons in other households.

But if that's the definition of SET you have in mind, then how does your data end up with 2 sets?

Here is a table of your house id's (rows) and person id's (cols). I don't see the first (1,2,3,4) or second (4,6) set as you post it.

		Person ID
		1	2	3	4	5
H_ID	1	x			x
	2	x	x	x
	3	x		x	x
	6				x	x

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

Caetreviop543 · Posted 07-19-2023 01:07 PM

I recognize that I can transpose the data to get the Ids associated with each household, but I need to go a step further, and create one household id that links all members. So households '1', '2', and '3' need to be collapsed into one Id that contains person Ids 1-4. There should be one row per house id for all affiliated members.

ballardw · Posted 07-19-2023 12:58 PM

For clarity sake, just which "id" is represented by 'id_one' etch in this?

set  id_one id_two id_three id_four 
1     1          2          3           4           
2     4          6          .           .

You show exactly two variables, both named something_id and similar ranges of values. So the above is not very clear as to intent or rules or interpretation.

Can you clearly state what you want the result to show? "Best way" would depend on what questions need to be answered from the data.

Caetreviop543 · Posted 07-19-2023 01:11 PM

The 'set' variable is the created household Id that links all members affiliated with all houses. 'id_one', 'id_two', 'id_three', etc. corresponds to each person in the household. The number of people associated with the same households will correspond to the number of non-missing columns.

mkeintz · Posted 07-19-2023 01:19 PM

@Caetreviop543 wrote:

The 'set' variable is the created household Id that links all members affiliated with all houses. 'id_one', 'id_two', 'id_three', etc. corresponds to each person in the household. The number of people associated with the same households will correspond to the number of non-missing columns.

Then why does SET 1 contain person id's 1, 2, 3, and 4, but does not contain person id 5, which shares a house id (#6) with person id 4, In your data example, it appears to me that you have a "super-household" containing all person id's in the data. That is, there is a "path" (i.e. sequence of shared households) from every person to every other person.

That's why I showed the table.

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

Caetreviop543 · Posted 07-19-2023 01:28 PM

I put an example of a person belonging to two ostensibly different households because there is bad data; I was thinking I would later create a rule to remove duplicate people.

For the sake of this question, assume the same person can belong to more than one household. So person 4 can belong to both households 1-3 and 6. (I will clarify in the original post).

I am trying to figure out how to collapse households 1-3 into one identifier that encompasses all members, person Id 1 to person Id 4.

Tom · Posted 07-19-2023 01:51 PM

You seem to have circular logic going on here.

The only thing that links households 1 to 3 together (that doesn't also include the other households) is that fact it is the set of households that person 1 is a member of.

		Person ID
		1	2	3	4	5
H_ID	1	x			x
	2	x	x	x
	3	x		x	x
	6				x	x

Can you create a better example?

Caetreviop543 · Posted 07-19-2023 02:14 PM

You are right. I tried to simplify the example by indicating that people belong to disinct households.

ballardw · Posted 07-19-2023 01:51 PM

@Caetreviop543 wrote:

I put an example of a person belonging to two ostensibly different households because there is bad data; I was thinking I would later create a rule to remove duplicate people.

If you know there is bad data then one question is can you identify it? Then clean it up so it is no longer bad. Or is that part of the purpose of this exercise?

You are still not clearly describing what you are considering to be a "set". You are obviously applying some rule(s) but have not shared them with us.

Another thing to consider is how are the person and house id values actually created. It may be that the actual values, since I really doubt the real values are simple integers in sequence as shown, have content buried in them. For example the in the US Social Security Number values the first three digits indicate a geography related to when/where a person applied for the number. Perhaps that sort of information is available and provides additional information for the process. Or perhaps the value of house id could contain implied time of assigned values (much larger for example) and a "person" associated with two values "far enough apart" might make one of the values unlikely.

What will this identified data actually be used for? 70+ million records is fairly large when discussing people. Perhaps a sample would make sense to simplify what ever process the information is used for.

Caetreviop543 · Posted 07-19-2023 02:19 PM

The data is medical; unfortunately different, but connected house ids don't contain a numeric pattern. They do contain eligibility dates, but annoyingly, these dates do not match for patients with the same house Id. I don't think I could use a sample, because then some people in the same household would be eliminated.

ballardw · Posted 07-19-2023 03:54 PM

Have you looked very closely at your data source definition of "house id"? That may give some clues.

I do feel your pain as I have been asked about "unique" patients when each medical record system has different patient identification and I do not typically have much detail but do know the id's are not unique to individuals across systems. So any person seen by two or more different service providers has two or more Id values.

Ksharp · Posted 07-20-2023 07:39 AM

/*When I am looking at your data after an hour,
I think you should make your person_id and house_id are different from each other   
by adding prefix or do something , Like:*/
data have(rename=(person_id=from house_id=to));
infile cards ;
input person_id $ house_id $;
cards;
person_1               house_1
person_1               house_2
person_1               house_3
person_2               house_2
person_3               house_2
person_3               house_3
person_4               house_1
person_4               house_3
person_5               house_6
person_5               house_4
person_5               house_5
person_6               house_6
person_6               house_4
;
run;
data full;
  set have end=last;
  if _n_ eq 1 then do;
   declare hash h();
    h.definekey('node');
     h.definedata('node');
     h.definedone();
  end;
  output;
  node=from; h.replace();
  from=to; to=node;
  output;
  node=from; h.replace();
  if last then h.output(dataset:'node');
  drop node;
run;

data want(keep=node household);
declare hash ha(ordered:'a');
declare hiter hi('ha');
ha.definekey('count');
ha.definedata('last');
ha.definedone();
declare hash _ha(hashexp: 20);
_ha.definekey('key');
_ha.definedone();
if 0 then set full;
declare hash from_to(dataset:'full(where=(from is not missing and to is not missing))',hashexp:20,multidata:'y');
 from_to.definekey('from');
 from_to.definedata('to');
 from_to.definedone();
if 0 then set node;
declare hash no(dataset:'node');
declare hiter hi_no('no');
 no.definekey('node');
 no.definedata('node');
 no.definedone();
do while(hi_no.next()=0);
 household+1; output;
 count=1;
 key=node;_ha.add();
 last=node;ha.add();
 rc=hi.first();
 do while(rc=0);
   from=last;rx=from_to.find();
   do while(rx=0);
     key=to;ry=_ha.check();
      if ry ne 0 then do;
       node=to;output;rr=no.remove(key:node);
       key=to;_ha.add();
       count+1;
       last=to;ha.add();
      end;
      rx=from_to.find_next();
   end;
   rc=hi.next();
end;
ha.clear();_ha.clear();
end;
stop;
run;
proc transpose data=want(where=(node=:'person')) out=final_want(drop=_:) prefix=person;
by household;
var node;
run;

Registration is open

SAS Training: Just a Click Away