Hi, I'd like to come up with a more efficient way of searching a fairly large list of id pairs (let's say 200k+ id pairs) and creating chains of ids from them. Example (12 pairs of linked ids, 3 chains of ids in total): data have; length ids $7; input ids $; datalines; 123-456 111-222 111-333 222-333 333-444 333-555 666-777 444-888 444-555 888-777 123-987 654-321 ; run; The result I want is: 123-456-987 111-222-333-444-555-888-777-666 654-321 The output order is not important (e.g. 456-987-123 would still be fine in the first chain). In the real data each id is 14 characters long. I have a macro which gives me this result (below) but to process 19k pairs of linked ids takes around 20 minutes which I'd really like to improve (ideally it would process 200k+ pairs in around 20 minutes). Can anyone help me improve the efficiency of the logic I have, or, describe an alternative approach? I found this link - http://support.sas.com/kb/26/160.html - which does something similar using hash tables (and runs in seconds with a test dataset of 100k+ id pairs) but if I break my 'have' dataset down into id and new_id and run it through that code it doesn't find all the links and produces duplicate chains. This is my current code: %macro chain(); /* count the number of links yet to become part of a chain */ proc sql noprint; select count(*) into :link_count from work.have; quit; %do %until (%eval(&link_count = 0)); /* create a one row table containing each id pair */ proc transpose data=work.have out=work.have_t (drop=_name_) prefix = lp_; var ids; run; /* do the linking */ data work.links (keep=all_linked_ids); set work.have_t; attrib all_linked_ids = $50; array linked_ids[*] lp_: _temporary_; do link_being_checked = 1 to 1; all_linked_ids = linked_ids[link_being_checked]; do check_pstn = 1 to dim(linked_ids); /* if the id on the left of the dash matches an id already in the chain */ /* and the id on the right of the dash hasn't already been added to */ /* the chain, add it to the chain and reset the check_pstn var so */ /* every id in the chain is always tested vs every available id. */ if findw(all_linked_ids, substr(linked_ids[check_pstn], 1, 3) and check_pstn ^= link_being_checked and findw(all_linked_ids, substr(linked_ids[check_pstn], 5, 3)) < 1 then do; all_linked_ids = catx('-', all_linked_ids, substr(linked_ids[check_pstn], 5, 3)); check_pstn = 1; end; else /* if the id on the right of the dash matches an id already in the chain */ /* and the id on the left of the dash hasn't already been added to */ /* the chain, add it to the chain and reset the check_pstn var so */ /* every id in the chain is always tested vs every available id. */ if findw(all_linked_ids, substr(linked_ids[check_pstn], 5, 3) and check_pstn ^= link_being_checked and findw(all_linked_ids, substr(linked_ids[check_pstn], 1, 3)) < 1 then do; all_linked_ids = catx('-', all_linked_ids, substr(linked_ids[check_pstn], 1, 3)); check_pstn = 1; end; end; /* when the end of the ids in the array is reached, output a record of this id chain */ output; end; run; /* separate the chained ids out into individual ids */ data work.links_1 (keep=id); set work.links; length id $3; do j = 1 to (length(compress(all_linked_ids))+1) / 4 until(p eq 0); call scan(all_linked_ids, j, p, l); id = substrn(all_linked_ids, p, l); output; run; /* create a list of the ids in this chain so it can be removed from the main list of ids */ proc sql noprint; select "'"||id||"'" into :cid_list separated by ' ' from work.links_1; quit; /* delete the ids from the main list */ proc sql noprint; delete from work.have where substr(ids, 1, 3) in (&cid_list.) or substr(ids, 5, 3) in (&cid_list.); quit; /* maintain a list of all the link chains created so far */ proc append base=work.all_links data=work.links; run; /* count the number of links yet to be tested to become part of a chain */ proc sql noprint; select count(*) into :link_count from work.have; quit; %end; %mend chain; %chain; I'm working in SAS 9.4. Many thanks in advance for any advice or suggestions. Cy
... View more