- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
Good morning everyone,
I describe my problem and the solution I need. I have a table with pairs that identify relationships between two people (for example, friends in a social network). My end result should be a table in which each record defines a group in which all people are related, directly or through other people.
For example, starting from the following image
My input table looks like this:
data have;
input member1 member2;
datalines;
1 2
1 4
2 3
2 4
3 8
4 5
4 6
5 7
5 8
6 7
7 8
9 10
9 11
9 12
10 12
11 12
;
And my output table should be the following:
Initially I don't know the maximum number of people in a relationship (but I don't think it will exceed 20 in this problem)
Has anyone faced a similar problem? Can you think of any way to solve it?
Thanks for your time, Merry Christmas and Happy 2021!!
Accepted Solutions
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
If you don't have SAS/OR .
data have;
infile cards ;
input from $ to $ ;
cards;
1 2
1 4
2 3
2 4
3 8
4 5
4 6
5 7
5 8
6 7
7 8
9 10
9 11
9 12
10 12
11 12
;
run;
data full;
set have end=last;
if _n_ eq 1 then do;
declare hash h();
h.definekey('node');
h.definedata('node');
h.definedone();
end;
output;
node=from; h.replace();
from=to; to=node;
output;
node=from; h.replace();
if last then h.output(dataset:'node');
drop node;
run;
data want(keep=node household);
declare hash ha(ordered:'a');
declare hiter hi('ha');
ha.definekey('count');
ha.definedata('last');
ha.definedone();
declare hash _ha(hashexp: 20);
_ha.definekey('key');
_ha.definedone();
if 0 then set full;
declare hash from_to(dataset:'full(where=(from is not missing and to is not missing))',hashexp:20,multidata:'y');
from_to.definekey('from');
from_to.definedata('to');
from_to.definedone();
if 0 then set node;
declare hash no(dataset:'node');
declare hiter hi_no('no');
no.definekey('node');
no.definedata('node');
no.definedone();
do while(hi_no.next()=0);
household+1; output;
count=1;
key=node;_ha.add();
last=node;ha.add();
rc=hi.first();
do while(rc=0);
from=last;rx=from_to.find();
do while(rx=0);
key=to;ry=_ha.check();
if ry ne 0 then do;
node=to;output;rr=no.remove(key:node);
key=to;_ha.add();
count+1;
last=to;ha.add();
end;
rx=from_to.find_next();
end;
rc=hi.next();
end;
ha.clear();_ha.clear();
end;
stop;
run;
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
If you don't have SAS/OR .
data have;
infile cards ;
input from $ to $ ;
cards;
1 2
1 4
2 3
2 4
3 8
4 5
4 6
5 7
5 8
6 7
7 8
9 10
9 11
9 12
10 12
11 12
;
run;
data full;
set have end=last;
if _n_ eq 1 then do;
declare hash h();
h.definekey('node');
h.definedata('node');
h.definedone();
end;
output;
node=from; h.replace();
from=to; to=node;
output;
node=from; h.replace();
if last then h.output(dataset:'node');
drop node;
run;
data want(keep=node household);
declare hash ha(ordered:'a');
declare hiter hi('ha');
ha.definekey('count');
ha.definedata('last');
ha.definedone();
declare hash _ha(hashexp: 20);
_ha.definekey('key');
_ha.definedone();
if 0 then set full;
declare hash from_to(dataset:'full(where=(from is not missing and to is not missing))',hashexp:20,multidata:'y');
from_to.definekey('from');
from_to.definedata('to');
from_to.definedone();
if 0 then set node;
declare hash no(dataset:'node');
declare hiter hi_no('no');
no.definekey('node');
no.definedata('node');
no.definedone();
do while(hi_no.next()=0);
household+1; output;
count=1;
key=node;_ha.add();
last=node;ha.add();
rc=hi.first();
do while(rc=0);
from=last;rx=from_to.find();
do while(rx=0);
key=to;ry=_ha.check();
if ry ne 0 then do;
node=to;output;rr=no.remove(key:node);
key=to;_ha.add();
count+1;
last=to;ha.add();
end;
rx=from_to.find_next();
end;
rc=hi.next();
end;
ha.clear();_ha.clear();
end;
stop;
run;
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
I have tested this code on my dataset (over 300,000 pairs), and it works perfectly. Thank you very much for your time!!! Also, it has helped me to discover the use of hashing in SAS.
Happy New Year to all!
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
@Ksharp - Very helpful code! This post and your solution are helping me on a challenge I currently have. Would there be a way to score individuals in a household to say how many connections they have, or how "critical" they are to the household? For example, in the original post, Person #4 has 4 connections, so a way to give them a score of 4 (or something similar). I'm trying to explore ways to dive into the results I'm getting where some households have 500+ nodes. I do not have SAS/OR, only access to BASE and SAS/STAT.
Thank you in advance!
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
This post is long time ago . I strongly suggest you to start a brand-new session .
Anyway, Here is an example you are looking for.
data have;
input _start $ _end $;
cards;
A F
F G
G J
J C
A B
B C
A D
D C
;
run;
proc sql;
create table ancient as
select _start,_end from have
where _start not in (select distinct _end from have);
quit;
data want(keep=path n_word);
if _n_ eq 1 then do;
length path _path $ 800 ;
if 0 then set have;
declare hash ha(hashexp:20,dataset:'have(where=(_start is not missing and _end is not missing))',multidata:'y');
ha.definekey('_start');
ha.definedata('_end');
ha.definedone();
declare hash pa(ordered:'y');
declare hiter hi_path('pa');
pa.definekey('n');
pa.definedata('n','path');
pa.definedone();
end;
set ancient;
count=1;n=1;_n=1;
path=catx('|',_start,_end);
pa.add();
do while(hi_path.next()=0);
if n ne 1 then pa.remove(key:_n);_n=n;
_path=path;
_start=scan(path,-1,'|');
rc=ha.find();
if rc ne 0 then do;
found=1;
n_word=countw(path,'|');
output;
end;
if not found then do;
do while(rc=0);
if not findw(path,strip(_end),'|') then do;
if length(path)+length(_end)+1 gt lengthc(path) then do;
putlog 'ERROR: The length of path and _path are set too short';
stop;
end;
count+1;n=count;
path=catx('|',path,_end);
pa.add();
path=_path;
end;
rc=ha.find_next();
end;
end;
end;
pa.clear();
run;
- Tags:
- T
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
If you have SAS/OR .
data have;
infile cards ;
input from $ to $ ;
cards;
1 2
1 4
2 3
2 4
3 8
4 5
4 6
5 7
5 8
6 7
7 8
9 10
9 11
9 12
10 12
11 12
;
run;
/* Same code as SAS/OR */
proc optnet data_links=have out_nodes=want GRAPH_DIRECTION=UNDIRECTED;
data_links_var from=from to=to;
concomp;
run;
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
You can use a Hash of Hash approach to conceptually construct a list of sets that is output tall wise as set_id, set_member_id which in turn is transposed into the desired wanted data layout.
Each member of a pair in the original data is either
- both not found in the list of sets
- a new set must be created and added to the list of sets
- only one is found in the list of sets
- add the other member to the set containing the found member
- both are found but in different sets
- merge the smaller set into the larger set
Sample code with extra relationship data that requires a merge case
* new.add.merge in data simply there to indicate what the processing will do;
data have;
input member1 member2;
datalines;
1 2 new 1
1 4 add 1
2 3 add 1
2 4 both 1
3 8 add 1
4 5 add 1
4 6 add 1
5 7 add 1
5 8 both 1
6 7 both 1
7 8 both 1
9 10 new 2
9 11 add 2
9 12 add 2
10 12 both 2
11 12 both 2
13 14 new 3
14 15 add 3
16 17 new 4
17 19 add 4
14 18 add 3
15 16 merge 3 4
;
* separate data into distinct sets of vertices corresponding to linked nodes;
data want_tall;
call missing (id, member);
declare hash set set1 set2 setP setQ;
declare hiter set_iter;
if _n_ = 1 then do;
declare hash sets(ordered:'a');
sets.defineKey('id');
sets.defineData('id', 'set');
sets.defineDone();
declare hiter sets_iter('sets');
end;
set have end=done;
rc = sets_iter.first();
do while (rc=0);
if not found1 then
found1 = ifn(set.check(key:member1) eq 0,id,0);
if not found2 then
found2 = ifn(set.check(key:member2) eq 0,id,0);
rc = sets_iter.next(); * advance to next item, ensures set gets unlocked from iteration tracking;
if found1 and found2 then leave;
end;
if not found1 and not found2 then do;
* put 'newing ' _n_=;
set = _new_ hash(ordered:'a');
set.defineKey('member');
set.defineData('member');
set.defineDone();
set.add(key:member1, data:member1);
set.add(key:member2, data:member2);
id = _n_;
sets.add();
end;
else if found1 ne found2 then do;
* case is either;
* - one member not found and should be added to others set,
* - or each member is in a different set and the sets need to be merged;
if found1 = 0 or found2 = 0 then do;
* put 'adding ' _n_=;
id = found1 + found2;
sets.find();
if found1
then set.add(key:member2, data:member2);
else set.add(key:member1, data:member1);
end;
else do;
* put 'merging ' _n_= found1= found2=;
sets.find(key:found1); set1 = set;
sets.find(key:found2); set2 = set;
if set1.num_items > set2.num_items then do;
setP = set1; * merge 2 into 1;
setQ = set2;
keyQ = found2;
end;
else do;
setP = set2; * merge 1 into 2;
setQ = set1;
keyQ = found1;
end;
* merge smaller set into the larger set and remove smaller set from memory;
set_iter = _new_ hiter('setQ');
do while (set_iter.next() = 0);
setP.add();
end;
rc = set_iter.delete();
sets.remove(key:keyQ);
setQ.delete();
end;
end;
if done then do;
declare hiter i1('sets');
do while(i1.next() eq 0);
declare hiter i2('set');
do while (i2.next() eq 0);
output;
end;
end;
end;
keep id member;
run;
proc transpose data=want_tall out=want(drop=_name_) prefix=member_;
by id;
var member;
format member: 4.;
run;
Output table after transpose
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
Here is a data step solution without any hash tables, just indexed access:
proc sql;
create table roots as select distinct member1 from have
where member1 not in(select member2 from have);
create index member1 on have(member1);
quit;
data want;
array members(*) 8 member1-member20;
set roots;
do p=1 to dim(members) until(missing(members(p)));
m1=members(p);
do until(0);
set have(rename=(member1=m1 member2=m2)) key=m1;
if _iorc_ then do;
_error_=0;
leave;
end;
else if not whichn(m2,of members(*)) then do;
if n(of members(*))=dim(members) then do;
error 'Array dimension too small';
stop;
end;
else members(n(of members(*))+1)=m2;
end;
end;
end;
call sortn(of members(*)); /* missing come first with SORTN, correct that */
do _N_=1 to p-1;
members(_N_)=members(_N_+dim(members)-p+1);
end;
do _N_=p to dim(members);
call missing(members(_N_));
end;
drop p m1 m2;
run;
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
Just a caution that the root detection will only work if the constraint rule member1 < member2 is in force.
For example, if a relationship link between 1 and 8 is entered as data (1, 😎 the query will still detect 1 as the root. However, if the data is (8, 1) the query will not detect a root for the blue network.
The robust root detection would ensure ordered (or constrained) linkage data
proc sql; create table roots as select distinct member1 from (select case when member1 < member1 then member1 else member2 end as member1 from have ) where member1 not in ( select case when member1 < member1 then member2 else member1 end from have );
could still be defeated by a self linked node.
The keyed the looks up would also have account for the non-constrained cases.
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
You may be right. But the easiest solution to this seems to be sorting the relations (they were sorted already in the example data), e.g.:
data temp;
set have;
where member1 ne member2; /* get rid of self-linked codes */
call sortn(member1,member2);
run;
and then continue with the TEMP table as input instead of HAVE.
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
What about this dataset ? your code would get wrong result .
data have; infile cards ; input member1 member2 ; cards; 1 2 1 4 2 3 2 4 4 1 3 8 4 5 4 6 5 7 5 8 6 7 7 8 9 10 9 11 9 12 10 12 11 12 ; run;
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
That's exactly the kind of situation that my answer to @RichardDeVen will remedy.
So the total code is now:
data have;
infile cards ;
input member1 member2 ;
cards;
1 2
1 4
2 3
2 4
4 1
3 8
4 5
4 6
5 7
5 8
6 7
7 8
9 10
9 11
9 12
10 12
11 12
;
run;
data temp;
set have;
where member1 ne member2; /* get rid of self-linked codes */
call sortn(member1,member2);
run;
proc sql;
create table roots as select distinct member1 from temp
where member1 not in(select member2 from temp);
create index member1 on temp(member1);
quit;
data want;
array members(*) 8 member1-member20;
set roots;
do p=1 to dim(members) until(missing(members(p)));
m1=members(p);
do until(0);
set temp(rename=(member1=m1 member2=m2)) key=m1;
if _iorc_ then do;
_error_=0;
leave;
end;
else if not whichn(m2,of members(*)) then do;
if n(of members(*))=dim(members) then do;
error 'Array dimension too small';
stop;
end;
else members(n(of members(*))+1)=m2;
end;
end;
end;
call sortn(of members(*)); /* missing come first with SORTN, correct that */
do _N_=1 to p-1;
members(_N_)=members(_N_+dim(members)-p+1);
end;
do _N_=p to dim(members);
call missing(members(_N_));
end;
drop p m1 m2;
run;
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
data have; infile cards ; input member1 member2 ; cards; 1 2 1 4 2 3 2 4 8 1 3 8 4 5 4 6 5 7 5 8 6 7 7 8 9 10 9 11 9 12 10 12 11 12 14 . ; run;
What if this dataset ? Your code still get wrong result.
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
When I run my code against your example, I get two lines with the "families", and 14 all alone in the world. Is that wrong?
But you have a point: If the data contains multiple "singles" (member1 or member2 missing), these will be grouped together because "missing" is their common parent. So the code may have to be adjusted for that, unless that is how it should be. If there are any such singles, that is. The example data did not show any of those.
But the code I have posted gets the right result with the original input data, and with your increasingly contrived exceptions.
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
You can't expect data always look like what you want .
...........
10 12
11 12
14 14