BookmarkSubscribeRSS Feed
CyFerguson
Fluorite | Level 6

Hi,

 

I'd like to come up with a more efficient way of searching a fairly large list of id pairs (let's say 200k+ id pairs) and creating chains of ids from them.

 

Example (12 pairs of linked ids, 3 chains of ids in total):

 

data have;

  length ids $7;

  input ids $;

 

datalines;

123-456

111-222

111-333

222-333

333-444

333-555

666-777

444-888

444-555

888-777

123-987

654-321

;

run; 

 

The result I want is:

123-456-987

111-222-333-444-555-888-777-666

654-321

 

The output order is not important (e.g. 456-987-123 would still be fine in the first chain). In the real data each id is 14 characters long.

 

I have a macro which gives me this result (below) but to process 19k pairs of linked ids takes around 20 minutes which I'd really like to improve (ideally it would process 200k+ pairs in around 20 minutes).

 

Can anyone help me improve the efficiency of the logic I have, or, describe an alternative approach?

 

I found this link - http://support.sas.com/kb/26/160.html - which does something similar using hash tables (and runs in seconds with a test dataset of 100k+ id pairs) but if I break my 'have' dataset down into id and new_id and run it through that code it doesn't find all the links and produces duplicate chains. 

 

This is my current code:

 

%macro chain();

 

  /* count the number of links yet to become part of a chain */

  proc sql noprint; select count(*) into :link_count from work.have; quit;

 

  %do %until (%eval(&link_count = 0));

 

  /* create a one row table containing each id pair */

  proc transpose data=work.have out=work.have_t (drop=_name_) prefix = lp_;

    var ids;

  run;

 

  /* do the linking */

  data work.links (keep=all_linked_ids);

    set work.have_t;

    attrib all_linked_ids = $50;

 

    array linked_ids[*] lp_: _temporary_;

 

    do link_being_checked = 1 to 1;

      all_linked_ids = linked_ids[link_being_checked];

      

        do check_pstn = 1 to dim(linked_ids);

 

        /* if the id on the left of the dash matches an id already in the chain */

        /* and the id on the right of the dash hasn't already been added to   */

        /* the chain, add it to the chain and reset the check_pstn var so       */

        /* every id in the chain is always tested vs every available id.           */

 

        if findw(all_linked_ids, substr(linked_ids[check_pstn], 1, 3) and

           check_pstn ^= link_being_checked and

           findw(all_linked_ids, substr(linked_ids[check_pstn], 5, 3)) < 1

 

       then do; all_linked_ids = catx('-', all_linked_ids, substr(linked_ids[check_pstn], 5, 3));

         check_pstn = 1;

       end;

 

      else

 

        /* if the id on the right of the dash matches an id already in the chain */

        /* and the id on the left of the dash hasn't already been added to   */

        /* the chain, add it to the chain and reset the check_pstn var so       */

        /* every id in the chain is always tested vs every available id.           */

 

        if findw(all_linked_ids, substr(linked_ids[check_pstn], 5, 3) and

           check_pstn ^= link_being_checked and

           findw(all_linked_ids, substr(linked_ids[check_pstn], 1, 3)) < 1

 

       then do; all_linked_ids = catx('-', all_linked_ids, substr(linked_ids[check_pstn], 1, 3));

         check_pstn = 1;

       end;

 

      end;

 

    /* when the end of the ids in the array is reached, output a record of this id chain */

    output;

  

    end;

 

  run;

 

    /* separate the chained ids out into individual ids */

    data work.links_1 (keep=id);

      set work.links;

      length id $3;

        do j = 1 to (length(compress(all_linked_ids))+1) / 4 until(p eq 0);

          call scan(all_linked_ids, j, p, l);

          id = substrn(all_linked_ids, p, l);

          output;

    run;

 

    /* create a list of the ids in this chain so it can be removed from the main list of ids */

    proc sql noprint; select "'"||id||"'" into :cid_list separated by ' ' from work.links_1; quit;

 

    /* delete the ids from the main list */

    proc sql noprint; delete from work.have where substr(ids, 1, 3) in (&cid_list.) or substr(ids, 5, 3) in (&cid_list.); quit;

 

    /* maintain a list of all the link chains created so far */

    proc append base=work.all_links

                         data=work.links;

    run;

 

    /* count the number of links yet to be tested to become part of a chain */

    proc sql noprint; select count(*) into :link_count from work.have; quit;

 

  %end;

 

%mend chain;

 

%chain;

    

I'm working in SAS 9.4. Many thanks in advance for any advice or suggestions.

 

Cy

5 REPLIES 5
PGStats
Opal | Level 21

What you describe amounts to finding the connected components in an undirected graph. Proc optnet (part of SAS/OR) has a routine for that:

 

data have;
  infile datalines dlm="-";
  length id1 id2 $7;
  input (id1 id2) (:$7.);
datalines;
123-456
111-222
111-333
222-333
333-444
333-555
666-777
444-888
444-555
888-777
123-987
654-321
;

proc optnet data_links=have graph_direction=undirected out_nodes=groups;
data_links_var from=id1 to=id2;
concomp;
run;

proc sort data=groups; by concomp node; run;

data chains;
length chain $100;
do until(last.concomp);
    set groups; by concomp;
    chain = catx("-", chain, node);
    end;
keep concomp chain;
rename concomp=chain_ID;
run;

proc print data=chains noobs; run;
                    chain                              chain_ID

                    123-456-987                            1
                    111-222-333-444-555-666-777-888        2
                    321-654                                3
PG
Ksharp
Super User

If you don't have SAS/OR .Try this one .

 

data have;
  infile datalines dlm="-";
  length from to $7;
  input (from to) (:$7.);
datalines;
123-456
111-222
111-333
222-333
333-444
333-555
666-777
444-888
444-555
888-777
123-987
654-321
;



data full;
  set have end=last;
  if _n_ eq 1 then do;
   declare hash h();
    h.definekey('node');
     h.definedata('node');
     h.definedone();
  end;
  output;
  node=from; h.replace();
  from=to; to=node;
  output;
  node=from; h.replace();
  if last then h.output(dataset:'node');
  drop node;
run;


data want(keep=node household);
declare hash ha(ordered:'a');
declare hiter hi('ha');
ha.definekey('count');
ha.definedata('last');
ha.definedone();
declare hash _ha(hashexp: 20);
_ha.definekey('key');
_ha.definedone();

if 0 then set full;
declare hash from_to(dataset:'full(where=(from is not missing and to is not missing))',hashexp:20,multidata:'y');
 from_to.definekey('from');
 from_to.definedata('to');
 from_to.definedone();

if 0 then set node;
declare hash no(dataset:'node');
declare hiter hi_no('no');
 no.definekey('node');
 no.definedata('node');
 no.definedone();
 

do while(hi_no.next()=0);
 household+1; output;
 count=1;
 key=node;_ha.add();
 last=node;ha.add();
 rc=hi.first();
 do while(rc=0);
   from=last;rx=from_to.find();
   do while(rx=0);
     key=to;ry=_ha.check();
      if ry ne 0 then do;
       node=to;output;rr=no.remove(key:node);
       key=to;_ha.add();
       count+1;
       last=to;ha.add();
      end;
      rx=from_to.find_next();
   end;
   rc=hi.next();
end;
ha.clear();_ha.clear();
end;
stop;
run;



data final_want;
length chain $ 800;
do until(last.household);
    set want; 
    by household;
    chain = catx("-", chain, node);
    end;
keep household chain;
run;
CyFerguson
Fluorite | Level 6
PGStats - thanks, unfortunately we don't have the SAS/OR package at my site - pity as it looks like such a good solution!
 
Reeza and Ksharp, thanks for your replies, they both look encouraging - I will test them on Thursday when I'm back on site and provide an update.
 
Cy
PGStats
Opal | Level 21

You can also make use of the subGraphs macro given here, this way :

 

data have;
  infile datalines dlm="-";
  length id1 id2 $7;
  input (id1 id2) (:$7.);
datalines;
123-456
111-222
111-333
222-333
333-444
333-555
666-777
444-888
444-555
888-777
123-987
654-321
;

%include "&sasforum\subGraphsMacro.sas";

%SubGraphs(have,from=id1,to=id2,out=groups);

proc sort data=groups; by clust node; run;

data chains;
length chain $100;
chain_ID + 1;
do until(last.clust);
    set groups; by clust;
    chain = catx("-", chain, node);
    end;
keep chain_ID chain;
run;

proc print data=chains noobs; run;
                    chain                              chain_ID

                    123-456-987                            1
                    111-222-333-444-555-666-777-888        2
                    321-654                                3
PG

hackathon24-white-horiz.png

2025 SAS Hackathon: There is still time!

Good news: We've extended SAS Hackathon registration until Sept. 12, so you still have time to be part of our biggest event yet – our five-year anniversary!

Register Now

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

SAS Training: Just a Click Away

 Ready to level-up your skills? Choose your own adventure.

Browse our catalog!

Discussion stats
  • 5 replies
  • 2311 views
  • 4 likes
  • 4 in conversation