Grouping pairs into groups

Giamma14 · Posted 10-31-2011 11:18 AM

Dear All,

I need to group pairs into groups / clusters.

For instance, in the following table:

Pair1	Pair2
1	2
1	3
2	4
5	6

The first 3 rows will be in the same cluster because they all have one pair in common and the cluster will then include pairs 1, 2, 3 and 4.

The last row will be in a different cluster because pairs 5 and 6 are not in any other row.

The final dataset will then be:

Pair1	Pair2	Cluster
1	2	1
1	3	1
2	4	1
5	6	2

I’ve written a code that works well but it’s inefficient (it uses a lot of memory and it's slow): when the input dataset is bigger than 50,000 pairs it goes in memory overflow.

Any help would be really appreciated.

Thanks a lot in advance

Peter_C · Posted 10-31-2011 11:22 AM

looks like a job for a hash table

Giamma14 · Posted 10-31-2011 12:39 PM

Thanks Peter,

I don't know hash table, could you please address me on the right topic or provide me an example?

art297 · Posted 10-31-2011 12:52 PM

Take a look at that different approaches suggested in the following thread: http://communities.sas.com/message/104329#104329

A hash approach is included in one of the suggestions.

Tom · Posted 10-31-2011 05:00 PM

It looks like you are doing graph analysis. You want to get the connected sub-graphs. Here is a simplified version of a macro I did years ago for this. Not sure how fast it is, but it will work as long as your disk is large enough. I think this version might be for directed graphs. You could either fix the selection logic to check both directions or just duplicate your data with the values of pair1/pair2 flipped.

%macro subnet(in=,out=,from=from,to=to,subnet=subnet);

/*----------------------------------------------------------------------

SUBNET - Build connected subnets from pairs of nodes.

Input Table :FROM TO pairs of rows

Output Table:input data with &subnet added

Work Tables:

NODES - List of all nodes in input.

NEW - List of new nodes to assign to current subnet.

Algorithm:

Pick next unassigned node and grow the subnet by adding all connected

nodes. Repeat until all unassigned nodes are put into a subnet.

----------------------------------------------------------------------*/

%local subnetid next getnext ;

%*----------------------------------------------------------------------

Put code to get next unassigned node into macro variable because it is

used in two places in the program.

-----------------------------------------------------------------------;

%let getnext= select node into :next from nodes where subnet=.;

%*----------------------------------------------------------------------

Initialize subnet id counter.

-----------------------------------------------------------------------;

%let subnetid=0;

proc sql noprint;

*----------------------------------------------------------------------;

* Get list of all nodes ;

*----------------------------------------------------------------------;

create table nodes as

select distinct . as subnet, &from as node from &in

union

select distinct . as subnet, &to as node from &in

;

*----------------------------------------------------------------------;

* Get next unassigned node ;

*----------------------------------------------------------------------;

&getnext;

%do %while (&sqlobs) ;

*----------------------------------------------------------------------;

* Set subnet to next id ;

*----------------------------------------------------------------------;

%let subnetid=%eval(&subnetid+1);

update nodes set subnet=&subnetid where node=&next;

%do %while (&sqlobs) ;

*----------------------------------------------------------------------;

* Get list of connected nodes for this subnet ;

*----------------------------------------------------------------------;

create table new as

select distinct a.&to as node

from &in a, nodes b, nodes c

where a.&from= b.node

and a.&to= c.node

and b.subnet = &subnetid

and c.subnet = .

;

*----------------------------------------------------------------------;

* Update subnet for these nodes ;

*----------------------------------------------------------------------;

update nodes set subnet=&subnetid

where node in

(select node from new )

;

%end;

*----------------------------------------------------------------------;

* Get next unassigned node ;

*----------------------------------------------------------------------;

&getnext;

%end;

*----------------------------------------------------------------------;

* Create output dataset by adding subnet number. ;

*----------------------------------------------------------------------;

create table &out as

select distinct a.*,b.subnet as &subnet

from &in a , nodes b

where a.&from = b.node

;

quit;

%mend subnet ;

data in;
  input pair1 pair2 @@;
cards;
1 2 1 3 2 4 5 6
run;
%subnet(in=in,out=out,from=pair1,to=pair2,subnet=group);
proc print data=out;
 run;

Obs pair1 pair2 group

1 1 2 1

2 1 3 1

3 2 4 1

4 5 6 2

Ksharp · Posted 11-01-2011 01:33 AM

I want to know whether these obs is populated in the dataset randomly? there some obs maybe at the end of dataset . look like:

Pair1 Pair2

1 2

1 3

5 6

2 4

If it were so. I have coded it for a user ------- sas_Forum.

http://communities.sas.com/message/104261#104261

For Art297. I found the link is not available any more . Why ? because sas_Forum has deleted it?

Fortunately I have a copy.

The following code is an example for your question.

data test;
input  ( pan1 pan2 pan3 add1  ) ($);
datalines;
aaa   bbb    ccc     ddd    
. . . .
qqq   rrr       www   aaa   
rrr     ppp    mmm lll       
uuu   zzz    ffff      ppp     
p       l        m          n
. . . .
. . . .
jjjj    eee     rrr       ooo    
sss   www  .    .             
 .         .        .        eee 
;
run;

 

options compress=yes;
data want(keep=pan1 pan2 pan3 add1 household);
/*to make speed faster*/
 declare hash ha(hashexp : 20,ordered : 'a');
 declare hiter hi('ha');
  ha.definekey('count');
  ha.definedata('count','pan1','pan2','pan3','add1');
  ha.definedone();
 declare hash _ha(hashexp: 20,ordered : 'a');
  _ha.definekey('key');
  _ha.definedone();

do until(last);
 set test end=last; 
/*Remove obs which variable's are all missing firstly*/
 if cmiss(pan1,pan2,pan3,add1) lt 4 then do;
                                           count+1;
                                           ha.add();
                                         end;
end;

length key $ 40;
array h{4} $ 40 pan1 pan2 pan3 add1;
/*copy the first obs from Hash Table HA into PDV*/
_rc=hi.first();
do while(_rc eq 0); *until the end of Hash Table HA;
/*assign a unique cluster flag(i.e. household)*/
 household+1;
 do i=1 to 4;
/*push not missing value of current obs into another Hash Table _HA*/
   if not missing(h{i}) then do; key=h{i}; _ha.replace();end;
 end;   
/*start to run over Hash Table HA ,until can not find any more 
 observation which is the same cluster with current observation*/
 do until(x=1);
    x=1;
/*copy the first obs from Hash Table HA into PDV*/
    rc=hi.first();
    do while(rc=0);
      found=0;
      do j=1 to 4;
/*find whether any one of value is included in the current obs*/
      key=h{j};rcc=_ha.check();
      if rcc =0 then found=1;
      end;
      if found then do;
/*if any one of value is included,then push the obs which is copied from
Hash Table HA into Hash Tables _HA,flag it the same cluster with the 
current obs and output it into dataset*/
                      do k=1 to 4;
                      if not missing(h{k}) then do; key=h{k};_ha.replace();end;
                      end;
                      output;x=0; _count=count;*keep this found obs's index; 
                    end;
      rc=hi.next();
/*remove the found obs from Hash Table HA,since it has been seared*/
      if found then rx=ha.remove(key : _count);
     end;
 end;
/*clear up all the index which is the same cluster with the current obs*/ 
 _ha.clear();
/*copy the first obs from Hash Table HA into PDV*/
_rc=hi.first();
end; 
run;

Ksharp