Hi,
I'd like to come up with a more efficient way of searching a fairly large list of id pairs (let's say 200k+ id pairs) and creating chains of ids from them.
Example (12 pairs of linked ids, 3 chains of ids in total):
data have;
length ids $7;
input ids $;
datalines;
123-456
111-222
111-333
222-333
333-444
333-555
666-777
444-888
444-555
888-777
123-987
654-321
;
run;
The result I want is:
123-456-987
111-222-333-444-555-888-777-666
654-321
The output order is not important (e.g. 456-987-123 would still be fine in the first chain). In the real data each id is 14 characters long.
I have a macro which gives me this result (below) but to process 19k pairs of linked ids takes around 20 minutes which I'd really like to improve (ideally it would process 200k+ pairs in around 20 minutes).
Can anyone help me improve the efficiency of the logic I have, or, describe an alternative approach?
I found this link - http://support.sas.com/kb/26/160.html - which does something similar using hash tables (and runs in seconds with a test dataset of 100k+ id pairs) but if I break my 'have' dataset down into id and new_id and run it through that code it doesn't find all the links and produces duplicate chains.
This is my current code:
%macro chain();
/* count the number of links yet to become part of a chain */
proc sql noprint; select count(*) into :link_count from work.have; quit;
%do %until (%eval(&link_count = 0));
/* create a one row table containing each id pair */
proc transpose data=work.have out=work.have_t (drop=_name_) prefix = lp_;
var ids;
run;
/* do the linking */
data work.links (keep=all_linked_ids);
set work.have_t;
attrib all_linked_ids = $50;
array linked_ids[*] lp_: _temporary_;
do link_being_checked = 1 to 1;
all_linked_ids = linked_ids[link_being_checked];
do check_pstn = 1 to dim(linked_ids);
/* if the id on the left of the dash matches an id already in the chain */
/* and the id on the right of the dash hasn't already been added to */
/* the chain, add it to the chain and reset the check_pstn var so */
/* every id in the chain is always tested vs every available id. */
if findw(all_linked_ids, substr(linked_ids[check_pstn], 1, 3) and
check_pstn ^= link_being_checked and
findw(all_linked_ids, substr(linked_ids[check_pstn], 5, 3)) < 1
then do; all_linked_ids = catx('-', all_linked_ids, substr(linked_ids[check_pstn], 5, 3));
check_pstn = 1;
end;
else
/* if the id on the right of the dash matches an id already in the chain */
/* and the id on the left of the dash hasn't already been added to */
/* the chain, add it to the chain and reset the check_pstn var so */
/* every id in the chain is always tested vs every available id. */
if findw(all_linked_ids, substr(linked_ids[check_pstn], 5, 3) and
check_pstn ^= link_being_checked and
findw(all_linked_ids, substr(linked_ids[check_pstn], 1, 3)) < 1
then do; all_linked_ids = catx('-', all_linked_ids, substr(linked_ids[check_pstn], 1, 3));
check_pstn = 1;
end;
end;
/* when the end of the ids in the array is reached, output a record of this id chain */
output;
end;
run;
/* separate the chained ids out into individual ids */
data work.links_1 (keep=id);
set work.links;
length id $3;
do j = 1 to (length(compress(all_linked_ids))+1) / 4 until(p eq 0);
call scan(all_linked_ids, j, p, l);
id = substrn(all_linked_ids, p, l);
output;
run;
/* create a list of the ids in this chain so it can be removed from the main list of ids */
proc sql noprint; select "'"||id||"'" into :cid_list separated by ' ' from work.links_1; quit;
/* delete the ids from the main list */
proc sql noprint; delete from work.have where substr(ids, 1, 3) in (&cid_list.) or substr(ids, 5, 3) in (&cid_list.); quit;
/* maintain a list of all the link chains created so far */
proc append base=work.all_links
data=work.links;
run;
/* count the number of links yet to be tested to become part of a chain */
proc sql noprint; select count(*) into :link_count from work.have; quit;
%end;
%mend chain;
%chain;
I'm working in SAS 9.4. Many thanks in advance for any advice or suggestions.
Cy
What you describe amounts to finding the connected components in an undirected graph. Proc optnet (part of SAS/OR) has a routine for that:
data have;
infile datalines dlm="-";
length id1 id2 $7;
input (id1 id2) (:$7.);
datalines;
123-456
111-222
111-333
222-333
333-444
333-555
666-777
444-888
444-555
888-777
123-987
654-321
;
proc optnet data_links=have graph_direction=undirected out_nodes=groups;
data_links_var from=id1 to=id2;
concomp;
run;
proc sort data=groups; by concomp node; run;
data chains;
length chain $100;
do until(last.concomp);
set groups; by concomp;
chain = catx("-", chain, node);
end;
keep concomp chain;
rename concomp=chain_ID;
run;
proc print data=chains noobs; run;
chain chain_ID 123-456-987 1 111-222-333-444-555-666-777-888 2 321-654 3
If you don't have SAS/OR .Try this one .
data have;
infile datalines dlm="-";
length from to $7;
input (from to) (:$7.);
datalines;
123-456
111-222
111-333
222-333
333-444
333-555
666-777
444-888
444-555
888-777
123-987
654-321
;
data full;
set have end=last;
if _n_ eq 1 then do;
declare hash h();
h.definekey('node');
h.definedata('node');
h.definedone();
end;
output;
node=from; h.replace();
from=to; to=node;
output;
node=from; h.replace();
if last then h.output(dataset:'node');
drop node;
run;
data want(keep=node household);
declare hash ha(ordered:'a');
declare hiter hi('ha');
ha.definekey('count');
ha.definedata('last');
ha.definedone();
declare hash _ha(hashexp: 20);
_ha.definekey('key');
_ha.definedone();
if 0 then set full;
declare hash from_to(dataset:'full(where=(from is not missing and to is not missing))',hashexp:20,multidata:'y');
from_to.definekey('from');
from_to.definedata('to');
from_to.definedone();
if 0 then set node;
declare hash no(dataset:'node');
declare hiter hi_no('no');
no.definekey('node');
no.definedata('node');
no.definedone();
do while(hi_no.next()=0);
household+1; output;
count=1;
key=node;_ha.add();
last=node;ha.add();
rc=hi.first();
do while(rc=0);
from=last;rx=from_to.find();
do while(rx=0);
key=to;ry=_ha.check();
if ry ne 0 then do;
node=to;output;rr=no.remove(key:node);
key=to;_ha.add();
count+1;
last=to;ha.add();
end;
rx=from_to.find_next();
end;
rc=hi.next();
end;
ha.clear();_ha.clear();
end;
stop;
run;
data final_want;
length chain $ 800;
do until(last.household);
set want;
by household;
chain = catx("-", chain, node);
end;
keep household chain;
run;
You can also make use of the subGraphs macro given here, this way :
data have;
infile datalines dlm="-";
length id1 id2 $7;
input (id1 id2) (:$7.);
datalines;
123-456
111-222
111-333
222-333
333-444
333-555
666-777
444-888
444-555
888-777
123-987
654-321
;
%include "&sasforum\subGraphsMacro.sas";
%SubGraphs(have,from=id1,to=id2,out=groups);
proc sort data=groups; by clust node; run;
data chains;
length chain $100;
chain_ID + 1;
do until(last.clust);
set groups; by clust;
chain = catx("-", chain, node);
end;
keep chain_ID chain;
run;
proc print data=chains noobs; run;
chain chain_ID 123-456-987 1 111-222-333-444-555-666-777-888 2 321-654 3
Are you ready for the spotlight? We're accepting content ideas for SAS Innovate 2025 to be held May 6-9 in Orlando, FL. The call is open until September 25. Read more here about why you should contribute and what is in it for you!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.