BookmarkSubscribeRSS Feed
☑ This topic is solved. Need further help from the community? Please sign in and ask a new question.
danjan86
Calcite | Level 5

I found a lot of posts about assigning group IDs, but all of them seemed a little different from what I'm looking for.

 

I have a dataset that includes individuals and their family members - some of these family members are included in the same study (and have their own unique study ID and information about their relationship) as well as some information on some other family members not included in the study (and do not have a unique ID but have their relationship to those in the study).  I'm wanting to assign a group ID for each family.  

The have data looks like this - where relativeID indicates the relatives ID if they are included in the study, relativerelationship is their relationship to the ID case, and IDs within a family aren't necessarily sequential. 

IDRelativeIDRelativeRelationship
10001.Grandfather
1000110010Mother
1000112200Brother
10010.Father
1001010001Son
1001012200Son
1110033000Mother
12200.Grandfather
1220010001Brother
1220010010Mother
3300011100Son

 

The want data would look like this. I'm wanting to assign an ID for each unique family group, including for those who do not have a relativeID:

 

IDRelativeIDRelativeRelationshipFamilyID
10001.Grandfather111
1000110010Mother111
1000112200Brother111
10010.Father111
1001010001Son111
1001012200Son111
1110033000Mother222
12200.Grandfather111
1220010001Brother111
1220010010Mother111
3300011100Son222

 

Maybe I'm overthinking things, but it seems like the family members that don't have their own unique RelativeID kind of complicates things. I'm struggling to think of an efficient way to do this. My SQL coding is pretty rusty, but my brain is saying this is probably an issue best solved with SQL, but I could be wrong.

Any insight on this would be appreciated.

1 ACCEPTED SOLUTION

Accepted Solutions
Kurt_Bremser
Super User

DATA step with hash object makes this easy:

data have;
infile datalines;
input ID RelativeID RelativeRelationship :$15.;
datalines;
10001 . Grandfather
10001 10010 Mother
10001 12200 Brother
10010 . Father
10010 10001 Son
10010 12200 Son
11100 33000 Mother
12200 . Grandfather
12200 10001 Brother
12200 10010 Mother
33000 11100 Son
;

data want;
set have;
retain r_family 0;
if _n_ = 1
then do;
  length family_id 8;
  declare hash fam ();
  fam.definekey("id");
  fam.definedata("family_id");
  fam.definedone();
end;
if fam.find() ne 0
then do;
  r_family + 1;
  family_id = r_family;
  rc = fam.add();
end;
if relativeid ne . and fam.check(key:relativeid) ne 0
then rc = fam.add(key:relativeid,data:family_id);
drop r_family;
run;

You only need to tweak the code for creating a new family_id.

It may be that this code runs out of memory if it has to deal with a VERY large dataset; each distinct id should consume 48 bytes of memory.

View solution in original post

3 REPLIES 3
Kurt_Bremser
Super User

DATA step with hash object makes this easy:

data have;
infile datalines;
input ID RelativeID RelativeRelationship :$15.;
datalines;
10001 . Grandfather
10001 10010 Mother
10001 12200 Brother
10010 . Father
10010 10001 Son
10010 12200 Son
11100 33000 Mother
12200 . Grandfather
12200 10001 Brother
12200 10010 Mother
33000 11100 Son
;

data want;
set have;
retain r_family 0;
if _n_ = 1
then do;
  length family_id 8;
  declare hash fam ();
  fam.definekey("id");
  fam.definedata("family_id");
  fam.definedone();
end;
if fam.find() ne 0
then do;
  r_family + 1;
  family_id = r_family;
  rc = fam.add();
end;
if relativeid ne . and fam.check(key:relativeid) ne 0
then rc = fam.add(key:relativeid,data:family_id);
drop r_family;
run;

You only need to tweak the code for creating a new family_id.

It may be that this code runs out of memory if it has to deal with a VERY large dataset; each distinct id should consume 48 bytes of memory.

danjan86
Calcite | Level 5

This is exactly what I was looking for!

I've put off learning about hash objects for years, but this is really cool. Looks like it's time for me to learn more about them.

 

Thanks for the help!

Ksharp
Super User
data x;
infile datalines;
input ID RelativeID RelativeRelationship :$15.;
datalines;
10001 . Grandfather
10001 10010 Mother
10001 12200 Brother
10010 . Father
10010 10001 Son
10010 12200 Son
11100 33000 Mother
12200 . Grandfather
12200 10001 Brother
12200 10010 Mother
33000 11100 Son
;


data have;
 set x(rename=(ID=from RelativeID=to));
 keep from to;
run;


data full;
  set have end=last;
  if _n_ eq 1 then do;
   declare hash h();
    h.definekey('node');
     h.definedata('node');
     h.definedone();
  end;
  output;
  node=from; h.replace();
  from=to; to=node;
  output;
  node=from; h.replace();
  if last then h.output(dataset:'node');
  drop node;
run;


data want(keep=node household);
declare hash ha(ordered:'a');
declare hiter hi('ha');
ha.definekey('count');
ha.definedata('last');
ha.definedone();
declare hash _ha(hashexp: 20);
_ha.definekey('key');
_ha.definedone();

if 0 then set full;
declare hash from_to(dataset:'full(where=(from is not missing and to is not missing))',hashexp:20,multidata:'y');
 from_to.definekey('from');
 from_to.definedata('to');
 from_to.definedone();

if 0 then set node;
declare hash no(dataset:'node');
declare hiter hi_no('no');
 no.definekey('node');
 no.definedata('node');
 no.definedone();
 

do while(hi_no.next()=0);
 household+1; output;
 count=1;
 key=node;_ha.add();
 last=node;ha.add();
 rc=hi.first();
 do while(rc=0);
   from=last;rx=from_to.find();
   do while(rx=0);
     key=to;ry=_ha.check();
      if ry ne 0 then do;
       node=to;output;rr=no.remove(key:node);
       key=to;_ha.add();
       count+1;
       last=to;ha.add();
      end;
      rx=from_to.find_next();
   end;
   rc=hi.next();
end;
ha.clear();_ha.clear();
end;
stop;
run;


data final_want;
 if _n_=1 then do;
  if 0 then set want(rename=(household=FamilyID));
  declare hash h(dataset:'want(rename=(household=FamilyID))');
  h.definekey('node');
  h.definedata('FamilyID');
  h.definedone();
 end;
set x;
call missing(FamilyID);
rc=h.find(key:id);
drop rc node;
run;

Ready to join fellow brilliant minds for the SAS Hackathon?

Build your skills. Make connections. Enjoy creative freedom. Maybe change the world. Registration is now open through August 30th. Visit the SAS Hackathon homepage.

Register today!
How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 3 replies
  • 551 views
  • 1 like
  • 3 in conversation