BookmarkSubscribeRSS Feed
Giamma14
Fluorite | Level 6

Dear All,

I need to group pairs into groups / clusters.

For instance, in the following table:

Pair1

Pair2

1

2

1

3

2

4

5

6

The first 3 rows will be in the same cluster because they all have one pair in common and the cluster will then include pairs 1, 2, 3 and 4.

The last row will be in a different cluster because pairs 5 and 6 are not in any other row.

The final dataset will then be:

Pair1

Pair2

Cluster

1

2

1

1

3

1

2

4

1

5

6

2

I’ve written a code that works well but it’s inefficient (it uses a lot of memory and it's slow): when the input dataset is bigger than 50,000 pairs it goes in memory overflow.

Any help would be really appreciated.

Thanks a lot in advance

5 REPLIES 5
Peter_C
Rhodochrosite | Level 12

looks like a job for a hash table

Giamma14
Fluorite | Level 6

Thanks Peter,

I don't know hash table, could you please address me on the right topic or provide me an example?

art297
Opal | Level 21

Take a look at that different approaches suggested in the following thread: http://communities.sas.com/message/104329#104329

A hash approach is included in one of the suggestions.

Tom
Super User Tom
Super User

It looks like you are doing graph analysis. You want to get the connected sub-graphs. Here is a simplified version of a macro I did years ago for this.  Not sure how fast it is, but it will work as long as your disk is large enough. I think this version might be for directed graphs.  You could either fix the selection logic to check both directions or just duplicate your data with the values of pair1/pair2 flipped.

%macro subnet(in=,out=,from=from,to=to,subnet=subnet);

/*----------------------------------------------------------------------

SUBNET - Build connected subnets from pairs of nodes.

Input Table :FROM TO pairs of rows

Output Table:input data with &subnet added

Work Tables:

  NODES - List of all nodes in input.

  NEW - List of new nodes to assign to current subnet.

Algorithm:

Pick next unassigned node and grow the subnet by adding all connected

nodes. Repeat until all unassigned nodes are put into a subnet.

----------------------------------------------------------------------*/

%local subnetid next getnext ;

%*----------------------------------------------------------------------

Put code to get next unassigned node into macro variable because it is

used in two places in the program.

-----------------------------------------------------------------------;

%let getnext= select node into :next from nodes where subnet=.;

%*----------------------------------------------------------------------

Initialize subnet id counter.

-----------------------------------------------------------------------;

%let subnetid=0;

proc sql noprint;

*----------------------------------------------------------------------;

* Get list of all nodes ;

*----------------------------------------------------------------------;

  create table nodes as

    select distinct . as subnet, &from as node from &in

    union

    select distinct . as subnet, &to as node from &in

  ;

*----------------------------------------------------------------------;

* Get next unassigned node ;

*----------------------------------------------------------------------;

  &getnext;

%do %while (&sqlobs) ;

*----------------------------------------------------------------------;

* Set subnet to next id ;

*----------------------------------------------------------------------;

  %let subnetid=%eval(&subnetid+1);

  update nodes set subnet=&subnetid where node=&next;

  %do %while (&sqlobs) ;

*----------------------------------------------------------------------;

* Get list of connected nodes for this subnet ;

*----------------------------------------------------------------------;

    create table new as

      select distinct a.&to as node

        from &in a, nodes b, nodes c

        where a.&from= b.node

          and a.&to= c.node

          and b.subnet = &subnetid

          and c.subnet = .

    ;

*----------------------------------------------------------------------;

* Update subnet for these nodes ;

*----------------------------------------------------------------------;

    update nodes set subnet=&subnetid

      where node in

          (select node from new )

    ;

  %end;

*----------------------------------------------------------------------;

* Get next unassigned node ;

*----------------------------------------------------------------------;

  &getnext;

%end;

*----------------------------------------------------------------------;

* Create output dataset by adding subnet number. ;

*----------------------------------------------------------------------;

  create table &out as

    select distinct a.*,b.subnet as &subnet

      from &in a , nodes b

      where a.&from = b.node

  ;

quit;

%mend subnet ;

data in;

  input pair1 pair2 @@;

cards;

1 2 1 3 2 4 5 6

run;

%subnet(in=in,out=out,from=pair1,to=pair2,subnet=group);

proc print data=out;

run;

Obs    pair1    pair2    group

1       1        2        1

2       1        3        1

3       2        4        1

4       5        6        2

Ksharp
Super User

I want to know whether these obs is populated in the dataset randomly? there some obs maybe at the end of dataset . look like:

Pair1 Pair2

1    2

1       3

5       6

2    4

If it were so. I have coded it for a user ------- sas_Forum.

http://communities.sas.com/message/104261#104261

For Art297. I found the link is not available any more . Why ? because sas_Forum has deleted it?

Fortunately I have a copy.

The following code is an example for your question.

data test;
input  ( pan1 pan2 pan3 add1  ) ($);
datalines;
aaa   bbb    ccc     ddd    
. . . .
qqq   rrr       www   aaa   
rrr     ppp    mmm lll       
uuu   zzz    ffff      ppp     
p       l        m          n
. . . .
. . . .
jjjj    eee     rrr       ooo    
sss   www  .    .             
 .         .        .        eee 
;
run;

 

options compress=yes;
data want(keep=pan1 pan2 pan3 add1 household);
/*to make speed faster*/
 declare hash ha(hashexp : 20,ordered : 'a');
 declare hiter hi('ha');
  ha.definekey('count');
  ha.definedata('count','pan1','pan2','pan3','add1');
  ha.definedone();
 declare hash _ha(hashexp: 20,ordered : 'a');
  _ha.definekey('key');
  _ha.definedone();

do until(last);
 set test end=last; 
/*Remove obs which variable's are all missing firstly*/
 if cmiss(pan1,pan2,pan3,add1) lt 4 then do;
                                           count+1;
                                           ha.add();
                                         end;
end;

length key $ 40;
array h{4} $ 40 pan1 pan2 pan3 add1;
/*copy the first obs from Hash Table HA into PDV*/
_rc=hi.first();
do while(_rc eq 0); *until the end of Hash Table HA;
/*assign a unique cluster flag(i.e. household)*/
 household+1;
 do i=1 to 4;
/*push not missing value of current obs into another Hash Table _HA*/
   if not missing(h{i}) then do; key=h{i}; _ha.replace();end;
 end;   
/*start to run over Hash Table HA ,until can not find any more 
 observation which is the same cluster with current observation*/
 do until(x=1);
    x=1;
/*copy the first obs from Hash Table HA into PDV*/
    rc=hi.first();
    do while(rc=0);
      found=0;
      do j=1 to 4;
/*find whether any one of value is included in the current obs*/
      key=h{j};rcc=_ha.check();
      if rcc =0 then found=1;
      end;
      if found then do;
/*if any one of value is included,then push the obs which is copied from
Hash Table HA into Hash Tables _HA,flag it the same cluster with the 
current obs and output it into dataset*/
                      do k=1 to 4;
                      if not missing(h{k}) then do; key=h{k};_ha.replace();end;
                      end;
                      output;x=0; _count=count;*keep this found obs's index; 
                    end;
      rc=hi.next();
/*remove the found obs from Hash Table HA,since it has been seared*/
      if found then rx=ha.remove(key : _count);
     end;
 end;
/*clear up all the index which is the same cluster with the current obs*/ 
 _ha.clear();
/*copy the first obs from Hash Table HA into PDV*/
_rc=hi.first();
end; 
run;


Ksharp

sas-innovate-2024.png

Don't miss out on SAS Innovate - Register now for the FREE Livestream!

Can't make it to Vegas? No problem! Watch our general sessions LIVE or on-demand starting April 17th. Hear from SAS execs, best-selling author Adam Grant, Hot Ones host Sean Evans, top tech journalist Kara Swisher, AI expert Cassie Kozyrkov, and the mind-blowing dance crew iLuminate! Plus, get access to over 20 breakout sessions.

 

Register now!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 5 replies
  • 1476 views
  • 0 likes
  • 5 in conversation