Solved: Group IDs from Unique IDs

danjan86 · Posted 01-27-2024 09:46 AM

I found a lot of posts about assigning group IDs, but all of them seemed a little different from what I'm looking for.

I have a dataset that includes individuals and their family members - some of these family members are included in the same study (and have their own unique study ID and information about their relationship) as well as some information on some other family members not included in the study (and do not have a unique ID but have their relationship to those in the study). I'm wanting to assign a group ID for each family.

The have data looks like this - where relativeID indicates the relatives ID if they are included in the study, relativerelationship is their relationship to the ID case, and IDs within a family aren't necessarily sequential.

ID	RelativeID	RelativeRelationship
10001	.	Grandfather
10001	10010	Mother
10001	12200	Brother
10010	.	Father
10010	10001	Son
10010	12200	Son
11100	33000	Mother
12200	.	Grandfather
12200	10001	Brother
12200	10010	Mother
33000	11100	Son

The want data would look like this. I'm wanting to assign an ID for each unique family group, including for those who do not have a relativeID:

ID	RelativeID	RelativeRelationship	FamilyID
10001	.	Grandfather	111
10001	10010	Mother	111
10001	12200	Brother	111
10010	.	Father	111
10010	10001	Son	111
10010	12200	Son	111
11100	33000	Mother	222
12200	.	Grandfather	111
12200	10001	Brother	111
12200	10010	Mother	111
33000	11100	Son	222

Maybe I'm overthinking things, but it seems like the family members that don't have their own unique RelativeID kind of complicates things. I'm struggling to think of an efficient way to do this. My SQL coding is pretty rusty, but my brain is saying this is probably an issue best solved with SQL, but I could be wrong.

Any insight on this would be appreciated.

Kurt_Bremser · Posted 01-27-2024 10:05 AM

DATA step with hash object makes this easy:

data have;
infile datalines;
input ID RelativeID RelativeRelationship :$15.;
datalines;
10001 . Grandfather
10001 10010 Mother
10001 12200 Brother
10010 . Father
10010 10001 Son
10010 12200 Son
11100 33000 Mother
12200 . Grandfather
12200 10001 Brother
12200 10010 Mother
33000 11100 Son
;

data want;
set have;
retain r_family 0;
if _n_ = 1
then do;
  length family_id 8;
  declare hash fam ();
  fam.definekey("id");
  fam.definedata("family_id");
  fam.definedone();
end;
if fam.find() ne 0
then do;
  r_family + 1;
  family_id = r_family;
  rc = fam.add();
end;
if relativeid ne . and fam.check(key:relativeid) ne 0
then rc = fam.add(key:relativeid,data:family_id);
drop r_family;
run;

You only need to tweak the code for creating a new family_id.

It may be that this code runs out of memory if it has to deal with a VERY large dataset; each distinct id should consume 48 bytes of memory.

Maxims of Maximally Efficient SAS Programmers
How to convert datasets to data steps
The macro for direct download as ZIP
How to post code
Please vote for Provide Sequential Search Capability for Hash Objects
How to deal with locked files on UNIX

View solution in original post

Kurt_Bremser · Posted 01-27-2024 10:05 AM

DATA step with hash object makes this easy:

data have;
infile datalines;
input ID RelativeID RelativeRelationship :$15.;
datalines;
10001 . Grandfather
10001 10010 Mother
10001 12200 Brother
10010 . Father
10010 10001 Son
10010 12200 Son
11100 33000 Mother
12200 . Grandfather
12200 10001 Brother
12200 10010 Mother
33000 11100 Son
;

data want;
set have;
retain r_family 0;
if _n_ = 1
then do;
  length family_id 8;
  declare hash fam ();
  fam.definekey("id");
  fam.definedata("family_id");
  fam.definedone();
end;
if fam.find() ne 0
then do;
  r_family + 1;
  family_id = r_family;
  rc = fam.add();
end;
if relativeid ne . and fam.check(key:relativeid) ne 0
then rc = fam.add(key:relativeid,data:family_id);
drop r_family;
run;

You only need to tweak the code for creating a new family_id.

It may be that this code runs out of memory if it has to deal with a VERY large dataset; each distinct id should consume 48 bytes of memory.

Maxims of Maximally Efficient SAS Programmers
How to convert datasets to data steps
The macro for direct download as ZIP
How to post code
Please vote for Provide Sequential Search Capability for Hash Objects
How to deal with locked files on UNIX

danjan86 · Posted 01-27-2024 10:19 AM

This is exactly what I was looking for!

I've put off learning about hash objects for years, but this is really cool. Looks like it's time for me to learn more about them.

Thanks for the help!

Ksharp · Posted 01-28-2024 09:48 PM

data x;
infile datalines;
input ID RelativeID RelativeRelationship :$15.;
datalines;
10001 . Grandfather
10001 10010 Mother
10001 12200 Brother
10010 . Father
10010 10001 Son
10010 12200 Son
11100 33000 Mother
12200 . Grandfather
12200 10001 Brother
12200 10010 Mother
33000 11100 Son
;


data have;
 set x(rename=(ID=from RelativeID=to));
 keep from to;
run;


data full;
  set have end=last;
  if _n_ eq 1 then do;
   declare hash h();
    h.definekey('node');
     h.definedata('node');
     h.definedone();
  end;
  output;
  node=from; h.replace();
  from=to; to=node;
  output;
  node=from; h.replace();
  if last then h.output(dataset:'node');
  drop node;
run;


data want(keep=node household);
declare hash ha(ordered:'a');
declare hiter hi('ha');
ha.definekey('count');
ha.definedata('last');
ha.definedone();
declare hash _ha(hashexp: 20);
_ha.definekey('key');
_ha.definedone();

if 0 then set full;
declare hash from_to(dataset:'full(where=(from is not missing and to is not missing))',hashexp:20,multidata:'y');
 from_to.definekey('from');
 from_to.definedata('to');
 from_to.definedone();

if 0 then set node;
declare hash no(dataset:'node');
declare hiter hi_no('no');
 no.definekey('node');
 no.definedata('node');
 no.definedone();
 

do while(hi_no.next()=0);
 household+1; output;
 count=1;
 key=node;_ha.add();
 last=node;ha.add();
 rc=hi.first();
 do while(rc=0);
   from=last;rx=from_to.find();
   do while(rx=0);
     key=to;ry=_ha.check();
      if ry ne 0 then do;
       node=to;output;rr=no.remove(key:node);
       key=to;_ha.add();
       count+1;
       last=to;ha.add();
      end;
      rx=from_to.find_next();
   end;
   rc=hi.next();
end;
ha.clear();_ha.clear();
end;
stop;
run;


data final_want;
 if _n_=1 then do;
  if 0 then set want(rename=(household=FamilyID));
  declare hash h(dataset:'want(rename=(household=FamilyID))');
  h.definekey('node');
  h.definedata('FamilyID');
  h.definedone();
 end;
set x;
call missing(FamilyID);
rc=h.find(key:id);
drop rc node;
run;

Group IDs from Unique IDs

Re: Group IDs from Unique IDs

Re: Group IDs from Unique IDs

Re: Group IDs from Unique IDs

Re: Group IDs from Unique IDs

Group IDs from Unique IDs

Re: Group IDs from Unique IDs

Re: Group IDs from Unique IDs

Re: Group IDs from Unique IDs

Re: Group IDs from Unique IDs

Registration is open

SAS Training: Just a Click Away