BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
alvarogonzalez
Calcite | Level 5

Good morning everyone,

 

I describe my problem and the solution I need. I have a table with pairs that identify relationships between two people (for example, friends in a social network). My end result should be a table in which each record defines a group in which all people are related, directly or through other people.

 

For example, starting from the following image

alvarogonzalez_0-1609152678632.png

My input table looks like this:

alvarogonzalez_1-1609152898580.png

data have;
input member1 member2;
datalines;
1 2
1 4
2 3
2 4
3 8
4 5
4 6
5 7
5 8
6 7
7 8
9 10
9 11
9 12
10 12
11 12
;

And my output table should be the following:

alvarogonzalez_2-1609153189626.png

Initially I don't know the maximum number of people in a relationship (but I don't think it will exceed 20 in this problem)

 

Has anyone faced a similar problem? Can you think of any way to solve it?

 

Thanks for your time, Merry Christmas and Happy 2021!!

1 ACCEPTED SOLUTION

Accepted Solutions
Ksharp
Super User

If you don't have SAS/OR .

 



data have;
infile cards ;
input from $  to $ ;
cards;
1 2
1 4
2 3
2 4
3 8
4 5
4 6
5 7
5 8
6 7
7 8
9 10
9 11
9 12
10 12
11 12
;
run;
data full;
  set have end=last;
  if _n_ eq 1 then do;
   declare hash h();
    h.definekey('node');
     h.definedata('node');
     h.definedone();
  end;
  output;
  node=from; h.replace();
  from=to; to=node;
  output;
  node=from; h.replace();
  if last then h.output(dataset:'node');
  drop node;
run;


data want(keep=node household);
declare hash ha(ordered:'a');
declare hiter hi('ha');
ha.definekey('count');
ha.definedata('last');
ha.definedone();
declare hash _ha(hashexp: 20);
_ha.definekey('key');
_ha.definedone();

if 0 then set full;
declare hash from_to(dataset:'full(where=(from is not missing and to is not missing))',hashexp:20,multidata:'y');
 from_to.definekey('from');
 from_to.definedata('to');
 from_to.definedone();

if 0 then set node;
declare hash no(dataset:'node');
declare hiter hi_no('no');
 no.definekey('node');
 no.definedata('node');
 no.definedone();
 

do while(hi_no.next()=0);
 household+1; output;
 count=1;
 key=node;_ha.add();
 last=node;ha.add();
 rc=hi.first();
 do while(rc=0);
   from=last;rx=from_to.find();
   do while(rx=0);
     key=to;ry=_ha.check();
      if ry ne 0 then do;
       node=to;output;rr=no.remove(key:node);
       key=to;_ha.add();
       count+1;
       last=to;ha.add();
      end;
      rx=from_to.find_next();
   end;
   rc=hi.next();
end;
ha.clear();_ha.clear();
end;
stop;
run;

View solution in original post

14 REPLIES 14
Ksharp
Super User

If you don't have SAS/OR .

 



data have;
infile cards ;
input from $  to $ ;
cards;
1 2
1 4
2 3
2 4
3 8
4 5
4 6
5 7
5 8
6 7
7 8
9 10
9 11
9 12
10 12
11 12
;
run;
data full;
  set have end=last;
  if _n_ eq 1 then do;
   declare hash h();
    h.definekey('node');
     h.definedata('node');
     h.definedone();
  end;
  output;
  node=from; h.replace();
  from=to; to=node;
  output;
  node=from; h.replace();
  if last then h.output(dataset:'node');
  drop node;
run;


data want(keep=node household);
declare hash ha(ordered:'a');
declare hiter hi('ha');
ha.definekey('count');
ha.definedata('last');
ha.definedone();
declare hash _ha(hashexp: 20);
_ha.definekey('key');
_ha.definedone();

if 0 then set full;
declare hash from_to(dataset:'full(where=(from is not missing and to is not missing))',hashexp:20,multidata:'y');
 from_to.definekey('from');
 from_to.definedata('to');
 from_to.definedone();

if 0 then set node;
declare hash no(dataset:'node');
declare hiter hi_no('no');
 no.definekey('node');
 no.definedata('node');
 no.definedone();
 

do while(hi_no.next()=0);
 household+1; output;
 count=1;
 key=node;_ha.add();
 last=node;ha.add();
 rc=hi.first();
 do while(rc=0);
   from=last;rx=from_to.find();
   do while(rx=0);
     key=to;ry=_ha.check();
      if ry ne 0 then do;
       node=to;output;rr=no.remove(key:node);
       key=to;_ha.add();
       count+1;
       last=to;ha.add();
      end;
      rx=from_to.find_next();
   end;
   rc=hi.next();
end;
ha.clear();_ha.clear();
end;
stop;
run;
alvarogonzalez
Calcite | Level 5
Goo morning everyone,

I have tested this code on my dataset (over 300,000 pairs), and it works perfectly. Thank you very much for your time!!! Also, it has helped me to discover the use of hashing in SAS.

Happy New Year to all!
adornodj
Obsidian | Level 7

@Ksharp - Very helpful code!  This post and your solution are helping me on a challenge I currently have.  Would there be a way to score individuals in a household to say how many connections they have, or how "critical" they are to the household?  For example, in the original post, Person #4 has 4 connections, so a way to give them a score of 4 (or something similar).  I'm trying to explore ways to dive into the results I'm getting where some households have 500+ nodes.   I do not have SAS/OR, only access to BASE and SAS/STAT.  

 

Thank you in advance! 

Ksharp
Super User

This post is long time ago . I strongly suggest you to start a brand-new session .

Anyway, Here is an example you are looking for.

 


data have;
input _start $ _end $;
cards;
A      F
F      G
G      J
J      C
A      B
B      C
A      D
D      C
;
run;
 
proc sql;
create table ancient as
 select _start,_end from have
  where _start not in (select distinct _end from have);
quit;
data want(keep=path n_word);
if _n_ eq 1 then do;
length path _path  $ 800 ;
if 0 then set have;
declare hash ha(hashexp:20,dataset:'have(where=(_start is not missing and _end is not missing))',multidata:'y');
ha.definekey('_start');
ha.definedata('_end');
ha.definedone();

declare hash pa(ordered:'y');
declare hiter hi_path('pa');
pa.definekey('n');
pa.definedata('n','path');
pa.definedone();

end;


set ancient;
count=1;n=1;_n=1;
path=catx('|',_start,_end);
   
pa.add();
do while(hi_path.next()=0);
 if n ne 1 then pa.remove(key:_n);_n=n;
 _path=path;  
 _start=scan(path,-1,'|');
 rc=ha.find();
 if rc ne 0 then do;
  found=1;
  n_word=countw(path,'|');
  output;
 end;
 
if not found then do;
 do while(rc=0);
  if not findw(path,strip(_end),'|') then do;
   if length(path)+length(_end)+1 gt lengthc(path) then do;
    putlog 'ERROR: The length of path and _path are set too short';
    stop;
   end;
   
   count+1;n=count;
   path=catx('|',path,_end);
   pa.add();
   path=_path;
  end;
  rc=ha.find_next();
 end;
end;

end;
pa.clear();
run;
Ksharp
Super User

If you have SAS/OR .

 


data have;
infile cards ;
input from $  to $ ;
cards;
1 2
1 4
2 3
2 4
3 8
4 5
4 6
5 7
5 8
6 7
7 8
9 10
9 11
9 12
10 12
11 12
;
run;


/* Same code as SAS/OR */
proc optnet data_links=have out_nodes=want GRAPH_DIRECTION=UNDIRECTED;
data_links_var from=from to=to;
concomp;
run;
RichardDeVen
Barite | Level 11

You can use a Hash of Hash approach to conceptually construct a list of sets that is output tall wise as set_id, set_member_id which in turn is transposed into the desired wanted data layout.

 

Each member of a pair in the original data is either

- both not found in the list of sets

  - a new set must be created and added to the list of sets

- only one is found in the list of sets

  - add the other member to the set containing the found member

- both are found but in different sets

  - merge the smaller set into the larger set

 

Sample code with extra relationship data that requires a merge case

 

* new.add.merge in data simply there to indicate what the processing will do;
data have;
input member1 member2;
datalines;
1 2  new 1
1 4  add 1
2 3  add 1
2 4  both 1
3 8  add 1
4 5  add 1
4 6  add 1
5 7  add 1
5 8  both 1
6 7  both 1
7 8  both 1
9 10  new 2
9 11  add 2
9 12  add 2
10 12 both 2
11 12 both 2
13 14 new 3
14 15 add 3
16 17 new 4
17 19 add 4
14 18 add 3
15 16 merge 3 4
;

* separate data into distinct sets of vertices corresponding to linked nodes;
data want_tall;
  call missing (id, member);

  declare hash set set1 set2 setP setQ;
  declare hiter set_iter;

  if _n_ = 1 then do;
    declare hash sets(ordered:'a');
    sets.defineKey('id');
    sets.defineData('id', 'set');
    sets.defineDone();

    declare hiter sets_iter('sets');
  end;

  set have end=done;

  rc = sets_iter.first();
  do while (rc=0);
    if not found1 then 
      found1 = ifn(set.check(key:member1) eq 0,id,0);

    if not found2 then
      found2 = ifn(set.check(key:member2) eq 0,id,0);

    rc = sets_iter.next(); * advance to next item, ensures set gets unlocked from iteration tracking;

    if found1 and found2 then leave;
  end;

  if not found1 and not found2 then do;
*    put 'newing ' _n_=;

    set = _new_ hash(ordered:'a');
    set.defineKey('member');
    set.defineData('member');
    set.defineDone();

    set.add(key:member1, data:member1);
    set.add(key:member2, data:member2);

    id = _n_;
    sets.add();
  end;
  else if found1 ne found2 then do;
    * case is either;
    * - one member not found and should be added to others set,
    * - or each member is in a different set and the sets need to be merged;

    if found1 = 0 or found2 = 0 then do;
*      put 'adding ' _n_=;

      id = found1 + found2;
      sets.find();

      if found1 
        then set.add(key:member2, data:member2);
        else set.add(key:member1, data:member1);
    end;
    else do;
*      put 'merging ' _n_= found1= found2=;

      sets.find(key:found1); set1 = set;
      sets.find(key:found2); set2 = set;

      if set1.num_items > set2.num_items then do;
        setP = set1;    * merge 2 into 1;
        setQ = set2;
        keyQ = found2;
      end;
      else do;
        setP = set2;    * merge 1 into 2;
        setQ = set1;
        keyQ = found1;
      end;

* merge smaller set into the larger set and remove smaller set from memory; set_iter = _new_ hiter('setQ'); do while (set_iter.next() = 0); setP.add(); end; rc = set_iter.delete(); sets.remove(key:keyQ); setQ.delete(); end; end; if done then do; declare hiter i1('sets'); do while(i1.next() eq 0); declare hiter i2('set'); do while (i2.next() eq 0); output; end; end; end; keep id member; run; proc transpose data=want_tall out=want(drop=_name_) prefix=member_; by id; var member; format member: 4.; run;

Output table after transpose

 

RichardADeVenezia_0-1609186112813.png

 

s_lassen
Meteorite | Level 14

Here is a data step solution without any hash tables, just indexed access:

proc sql;
  create table roots as select distinct member1 from have
  where member1 not in(select member2 from have);
  create index member1 on have(member1);
quit;

data want;
  array members(*) 8 member1-member20;
  set roots;
  do p=1 to dim(members) until(missing(members(p)));
    m1=members(p);
    do until(0);
      set have(rename=(member1=m1 member2=m2)) key=m1;
      if _iorc_ then do;
        _error_=0;
        leave;
        end;
      else if not whichn(m2,of members(*)) then do;
        if n(of members(*))=dim(members) then do;
          error 'Array dimension too small';
          stop;
          end;
        else members(n(of members(*))+1)=m2;
        end;
      end;
    end;
  call sortn(of members(*));
/* missing come first with SORTN, correct that */ do _N_=1 to p-1; members(_N_)=members(_N_+dim(members)-p+1); end; do _N_=p to dim(members); call missing(members(_N_)); end; drop p m1 m2; run;
RichardDeVen
Barite | Level 11

Just a caution that the root detection will only work if the constraint rule member1 < member2 is in force.

 

For example, if a relationship link between 1 and 8 is entered as data (1, 😎 the query will still detect 1 as the root. However, if the data is (8, 1) the query will not detect a root for the blue network.

 

The robust root detection would ensure ordered (or constrained) linkage data

proc sql;
  create table roots as 
  select distinct member1 from
  (select 
    case when member1 < member1 then member1 else member2 end as member1
   from have
  ) 
  where member1 not in (
    select 
      case when member1 < member1 then member2 else member1 end
    from have
  );

could still be defeated by a self linked node.

 

The keyed the looks up would also have account for the non-constrained cases.

s_lassen
Meteorite | Level 14

You may be right. But the easiest solution to this seems to be sorting the relations (they were sorted already in the example data), e.g.:

data temp;
  set have;
  where member1 ne member2; /* get rid of self-linked codes */
  call sortn(member1,member2);
run;

and then continue with the TEMP table as input instead of HAVE. 

Ksharp
Super User

What about this dataset ? your code would get wrong result .

 

data have;
infile cards ;
input member1 member2 ;
cards;
1 2
1 4
2 3
2 4
4 1
3 8
4 5
4 6
5 7
5 8
6 7
7 8
9 10
9 11
9 12
10 12
11 12
;
run;
s_lassen
Meteorite | Level 14

That's exactly the kind of situation that my answer to @RichardDeVen will remedy.

 

So the total code is now:

data have;
infile cards ;
input member1 member2 ;
cards;
1 2
1 4
2 3
2 4
4 1
3 8
4 5
4 6
5 7
5 8
6 7
7 8
9 10
9 11
9 12
10 12
11 12
;
run;

data temp;
  set have;
  where member1 ne member2; /* get rid of self-linked codes */
  call sortn(member1,member2);
run;

proc sql;
  create table roots as select distinct member1 from temp
  where member1 not in(select member2 from temp);
  create index member1 on temp(member1);
quit;

data want;
  array members(*) 8 member1-member20;
  set roots;
  do p=1 to dim(members) until(missing(members(p)));
    m1=members(p);
    do until(0);
      set temp(rename=(member1=m1 member2=m2)) key=m1;
      if _iorc_ then do;
        _error_=0;
        leave;
        end;
      else if not whichn(m2,of members(*)) then do;
        if n(of members(*))=dim(members) then do;
          error 'Array dimension too small';
          stop;
          end;
        else members(n(of members(*))+1)=m2;
        end;
      end;
    end;
  call sortn(of members(*));  /* missing come first with SORTN, correct that */
  do _N_=1 to p-1;
    members(_N_)=members(_N_+dim(members)-p+1);
    end;
  do _N_=p to dim(members);
    call missing(members(_N_));
    end;
  drop p m1 m2;
run;
Ksharp
Super User
data have;
infile cards ;
input member1 member2 ;
cards;
1 2
1 4
2 3
2 4
8 1
3 8
4 5
4 6
5 7
5 8
6 7
7 8
9 10
9 11
9 12
10 12
11 12
14 .
;
run;

What if this dataset ? Your code still get wrong result.

s_lassen
Meteorite | Level 14

When I run my code against your example, I get two lines with the "families", and 14 all alone in the world. Is that wrong?

 

But you have a point: If the data contains multiple "singles" (member1 or member2 missing), these will be grouped together because "missing" is their common parent. So the code may have to be adjusted for that, unless that is how it should be. If there are any such singles, that is. The example data did not show any of those. 

 

But the code I have posted gets the right result with the original input data, and with your increasingly contrived exceptions.

 

 

Ksharp
Super User
If data like this ,Your code still get wrong result.
You can't expect data always look like what you want .

...........
10 12
11 12
14 14

Ready to join fellow brilliant minds for the SAS Hackathon?

Build your skills. Make connections. Enjoy creative freedom. Maybe change the world. Registration is now open through August 30th. Visit the SAS Hackathon homepage.

Register today!
How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 14 replies
  • 6674 views
  • 20 likes
  • 5 in conversation