BookmarkSubscribeRSS Feed
CyFerguson
Fluorite | Level 6

Hi,

 

I'd like to come up with a more efficient way of searching a fairly large list of id pairs (let's say 200k+ id pairs) and creating chains of ids from them.

 

Example (12 pairs of linked ids, 3 chains of ids in total):

 

data have;

  length ids $7;

  input ids $;

 

datalines;

123-456

111-222

111-333

222-333

333-444

333-555

666-777

444-888

444-555

888-777

123-987

654-321

;

run; 

 

The result I want is:

123-456-987

111-222-333-444-555-888-777-666

654-321

 

The output order is not important (e.g. 456-987-123 would still be fine in the first chain). In the real data each id is 14 characters long.

 

I have a macro which gives me this result (below) but to process 19k pairs of linked ids takes around 20 minutes which I'd really like to improve (ideally it would process 200k+ pairs in around 20 minutes).

 

Can anyone help me improve the efficiency of the logic I have, or, describe an alternative approach?

 

I found this link - http://support.sas.com/kb/26/160.html - which does something similar using hash tables (and runs in seconds with a test dataset of 100k+ id pairs) but if I break my 'have' dataset down into id and new_id and run it through that code it doesn't find all the links and produces duplicate chains. 

 

This is my current code:

 

%macro chain();

 

  /* count the number of links yet to become part of a chain */

  proc sql noprint; select count(*) into :link_count from work.have; quit;

 

  %do %until (%eval(&link_count = 0));

 

  /* create a one row table containing each id pair */

  proc transpose data=work.have out=work.have_t (drop=_name_) prefix = lp_;

    var ids;

  run;

 

  /* do the linking */

  data work.links (keep=all_linked_ids);

    set work.have_t;

    attrib all_linked_ids = $50;

 

    array linked_ids[*] lp_: _temporary_;

 

    do link_being_checked = 1 to 1;

      all_linked_ids = linked_ids[link_being_checked];

      

        do check_pstn = 1 to dim(linked_ids);

 

        /* if the id on the left of the dash matches an id already in the chain */

        /* and the id on the right of the dash hasn't already been added to   */

        /* the chain, add it to the chain and reset the check_pstn var so       */

        /* every id in the chain is always tested vs every available id.           */

 

        if findw(all_linked_ids, substr(linked_ids[check_pstn], 1, 3) and

           check_pstn ^= link_being_checked and

           findw(all_linked_ids, substr(linked_ids[check_pstn], 5, 3)) < 1

 

       then do; all_linked_ids = catx('-', all_linked_ids, substr(linked_ids[check_pstn], 5, 3));

         check_pstn = 1;

       end;

 

      else

 

        /* if the id on the right of the dash matches an id already in the chain */

        /* and the id on the left of the dash hasn't already been added to   */

        /* the chain, add it to the chain and reset the check_pstn var so       */

        /* every id in the chain is always tested vs every available id.           */

 

        if findw(all_linked_ids, substr(linked_ids[check_pstn], 5, 3) and

           check_pstn ^= link_being_checked and

           findw(all_linked_ids, substr(linked_ids[check_pstn], 1, 3)) < 1

 

       then do; all_linked_ids = catx('-', all_linked_ids, substr(linked_ids[check_pstn], 1, 3));

         check_pstn = 1;

       end;

 

      end;

 

    /* when the end of the ids in the array is reached, output a record of this id chain */

    output;

  

    end;

 

  run;

 

    /* separate the chained ids out into individual ids */

    data work.links_1 (keep=id);

      set work.links;

      length id $3;

        do j = 1 to (length(compress(all_linked_ids))+1) / 4 until(p eq 0);

          call scan(all_linked_ids, j, p, l);

          id = substrn(all_linked_ids, p, l);

          output;

    run;

 

    /* create a list of the ids in this chain so it can be removed from the main list of ids */

    proc sql noprint; select "'"||id||"'" into :cid_list separated by ' ' from work.links_1; quit;

 

    /* delete the ids from the main list */

    proc sql noprint; delete from work.have where substr(ids, 1, 3) in (&cid_list.) or substr(ids, 5, 3) in (&cid_list.); quit;

 

    /* maintain a list of all the link chains created so far */

    proc append base=work.all_links

                         data=work.links;

    run;

 

    /* count the number of links yet to be tested to become part of a chain */

    proc sql noprint; select count(*) into :link_count from work.have; quit;

 

  %end;

 

%mend chain;

 

%chain;

    

I'm working in SAS 9.4. Many thanks in advance for any advice or suggestions.

 

Cy

5 REPLIES 5
PGStats
Opal | Level 21

What you describe amounts to finding the connected components in an undirected graph. Proc optnet (part of SAS/OR) has a routine for that:

 

data have;
  infile datalines dlm="-";
  length id1 id2 $7;
  input (id1 id2) (:$7.);
datalines;
123-456
111-222
111-333
222-333
333-444
333-555
666-777
444-888
444-555
888-777
123-987
654-321
;

proc optnet data_links=have graph_direction=undirected out_nodes=groups;
data_links_var from=id1 to=id2;
concomp;
run;

proc sort data=groups; by concomp node; run;

data chains;
length chain $100;
do until(last.concomp);
    set groups; by concomp;
    chain = catx("-", chain, node);
    end;
keep concomp chain;
rename concomp=chain_ID;
run;

proc print data=chains noobs; run;
                    chain                              chain_ID

                    123-456-987                            1
                    111-222-333-444-555-666-777-888        2
                    321-654                                3
PG
Ksharp
Super User

If you don't have SAS/OR .Try this one .

 

data have;
  infile datalines dlm="-";
  length from to $7;
  input (from to) (:$7.);
datalines;
123-456
111-222
111-333
222-333
333-444
333-555
666-777
444-888
444-555
888-777
123-987
654-321
;



data full;
  set have end=last;
  if _n_ eq 1 then do;
   declare hash h();
    h.definekey('node');
     h.definedata('node');
     h.definedone();
  end;
  output;
  node=from; h.replace();
  from=to; to=node;
  output;
  node=from; h.replace();
  if last then h.output(dataset:'node');
  drop node;
run;


data want(keep=node household);
declare hash ha(ordered:'a');
declare hiter hi('ha');
ha.definekey('count');
ha.definedata('last');
ha.definedone();
declare hash _ha(hashexp: 20);
_ha.definekey('key');
_ha.definedone();

if 0 then set full;
declare hash from_to(dataset:'full(where=(from is not missing and to is not missing))',hashexp:20,multidata:'y');
 from_to.definekey('from');
 from_to.definedata('to');
 from_to.definedone();

if 0 then set node;
declare hash no(dataset:'node');
declare hiter hi_no('no');
 no.definekey('node');
 no.definedata('node');
 no.definedone();
 

do while(hi_no.next()=0);
 household+1; output;
 count=1;
 key=node;_ha.add();
 last=node;ha.add();
 rc=hi.first();
 do while(rc=0);
   from=last;rx=from_to.find();
   do while(rx=0);
     key=to;ry=_ha.check();
      if ry ne 0 then do;
       node=to;output;rr=no.remove(key:node);
       key=to;_ha.add();
       count+1;
       last=to;ha.add();
      end;
      rx=from_to.find_next();
   end;
   rc=hi.next();
end;
ha.clear();_ha.clear();
end;
stop;
run;



data final_want;
length chain $ 800;
do until(last.household);
    set want; 
    by household;
    chain = catx("-", chain, node);
    end;
keep household chain;
run;
CyFerguson
Fluorite | Level 6
PGStats - thanks, unfortunately we don't have the SAS/OR package at my site - pity as it looks like such a good solution!
 
Reeza and Ksharp, thanks for your replies, they both look encouraging - I will test them on Thursday when I'm back on site and provide an update.
 
Cy
PGStats
Opal | Level 21

You can also make use of the subGraphs macro given here, this way :

 

data have;
  infile datalines dlm="-";
  length id1 id2 $7;
  input (id1 id2) (:$7.);
datalines;
123-456
111-222
111-333
222-333
333-444
333-555
666-777
444-888
444-555
888-777
123-987
654-321
;

%include "&sasforum\subGraphsMacro.sas";

%SubGraphs(have,from=id1,to=id2,out=groups);

proc sort data=groups; by clust node; run;

data chains;
length chain $100;
chain_ID + 1;
do until(last.clust);
    set groups; by clust;
    chain = catx("-", chain, node);
    end;
keep chain_ID chain;
run;

proc print data=chains noobs; run;
                    chain                              chain_ID

                    123-456-987                            1
                    111-222-333-444-555-666-777-888        2
                    321-654                                3
PG

Ready to join fellow brilliant minds for the SAS Hackathon?

Build your skills. Make connections. Enjoy creative freedom. Maybe change the world. Registration is now open through August 30th. Visit the SAS Hackathon homepage.

Register today!
How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 5 replies
  • 1615 views
  • 4 likes
  • 4 in conversation