DATA Step, Macro, Functions and more

Grouping pairs into groups

Reply
Occasional Contributor
Posts: 16

Grouping pairs into groups

Dear All,

I need to group pairs into groups / clusters.

For instance, in the following table:

Pair1

Pair2

1

2

1

3

2

4

5

6

The first 3 rows will be in the same cluster because they all have one pair in common and the cluster will then include pairs 1, 2, 3 and 4.

The last row will be in a different cluster because pairs 5 and 6 are not in any other row.

The final dataset will then be:

Pair1

Pair2

Cluster

1

2

1

1

3

1

2

4

1

5

6

2

I’ve written a code that works well but it’s inefficient (it uses a lot of memory and it's slow): when the input dataset is bigger than 50,000 pairs it goes in memory overflow.

Any help would be really appreciated.

Thanks a lot in advance

Valued Guide
Posts: 2,177

Re: Grouping pairs into groups

looks like a job for a hash table

Occasional Contributor
Posts: 16

Grouping pairs into groups

Thanks Peter,

I don't know hash table, could you please address me on the right topic or provide me an example?

PROC Star
Posts: 7,484

Grouping pairs into groups

Take a look at that different approaches suggested in the following thread: http://communities.sas.com/message/104329#104329

A hash approach is included in one of the suggestions.

Super User
Super User
Posts: 7,062

Re: Grouping pairs into groups

It looks like you are doing graph analysis. You want to get the connected sub-graphs. Here is a simplified version of a macro I did years ago for this.  Not sure how fast it is, but it will work as long as your disk is large enough. I think this version might be for directed graphs.  You could either fix the selection logic to check both directions or just duplicate your data with the values of pair1/pair2 flipped.

%macro subnet(in=,out=,from=from,to=to,subnet=subnet);

/*----------------------------------------------------------------------

SUBNET - Build connected subnets from pairs of nodes.

Input Table :FROM TO pairs of rows

Output Table:input data with &subnet added

Work Tables:

  NODES - List of all nodes in input.

  NEW - List of new nodes to assign to current subnet.

Algorithm:

Pick next unassigned node and grow the subnet by adding all connected

nodes. Repeat until all unassigned nodes are put into a subnet.

----------------------------------------------------------------------*/

%local subnetid next getnext ;

%*----------------------------------------------------------------------

Put code to get next unassigned node into macro variable because it is

used in two places in the program.

-----------------------------------------------------------------------;

%let getnext= select node into :next from nodes where subnet=.;

%*----------------------------------------------------------------------

Initialize subnet id counter.

-----------------------------------------------------------------------;

%let subnetid=0;

proc sql noprint;

*----------------------------------------------------------------------;

* Get list of all nodes ;

*----------------------------------------------------------------------;

  create table nodes as

    select distinct . as subnet, &from as node from &in

    union

    select distinct . as subnet, &to as node from &in

  ;

*----------------------------------------------------------------------;

* Get next unassigned node ;

*----------------------------------------------------------------------;

  &getnext;

%do %while (&sqlobs) ;

*----------------------------------------------------------------------;

* Set subnet to next id ;

*----------------------------------------------------------------------;

  %let subnetid=%eval(&subnetid+1);

  update nodes set subnet=&subnetid where node=&next;

  %do %while (&sqlobs) ;

*----------------------------------------------------------------------;

* Get list of connected nodes for this subnet ;

*----------------------------------------------------------------------;

    create table new as

      select distinct a.&to as node

        from &in a, nodes b, nodes c

        where a.&from= b.node

          and a.&to= c.node

          and b.subnet = &subnetid

          and c.subnet = .

    ;

*----------------------------------------------------------------------;

* Update subnet for these nodes ;

*----------------------------------------------------------------------;

    update nodes set subnet=&subnetid

      where node in

          (select node from new )

    ;

  %end;

*----------------------------------------------------------------------;

* Get next unassigned node ;

*----------------------------------------------------------------------;

  &getnext;

%end;

*----------------------------------------------------------------------;

* Create output dataset by adding subnet number. ;

*----------------------------------------------------------------------;

  create table &out as

    select distinct a.*,b.subnet as &subnet

      from &in a , nodes b

      where a.&from = b.node

  ;

quit;

%mend subnet ;

data in;

  input pair1 pair2 @@;

cards;

1 2 1 3 2 4 5 6

run;

%subnet(in=in,out=out,from=pair1,to=pair2,subnet=group);

proc print data=out;

run;

Obs    pair1    pair2    group

1       1        2        1

2       1        3        1

3       2        4        1

4       5        6        2

Super User
Posts: 10,041

Re: Grouping pairs into groups

I want to know whether these obs is populated in the dataset randomly? there some obs maybe at the end of dataset . look like:

Pair1 Pair2

1    2

1       3

5       6

2    4

If it were so. I have coded it for a user ------- sas_Forum.

http://communities.sas.com/message/104261#104261

For Art297. I found the link is not available any more . Why ? because sas_Forum has deleted it?

Fortunately I have a copy.

The following code is an example for your question.

data test;
input  ( pan1 pan2 pan3 add1  ) ($);
datalines;
aaa   bbb    ccc     ddd    
. . . .
qqq   rrr       www   aaa   
rrr     ppp    mmm lll       
uuu   zzz    ffff      ppp     
p       l        m          n
. . . .
. . . .
jjjj    eee     rrr       ooo    
sss   www  .    .             
 .         .        .        eee 
;
run;

 

options compress=yes;
data want(keep=pan1 pan2 pan3 add1 household);
/*to make speed faster*/
 declare hash ha(hashexp : 20,ordered : 'a');
 declare hiter hi('ha');
  ha.definekey('count');
  ha.definedata('count','pan1','pan2','pan3','add1');
  ha.definedone();
 declare hash _ha(hashexp: 20,ordered : 'a');
  _ha.definekey('key');
  _ha.definedone();

do until(last);
 set test end=last; 
/*Remove obs which variable's are all missing firstly*/
 if cmiss(pan1,pan2,pan3,add1) lt 4 then do;
                                           count+1;
                                           ha.add();
                                         end;
end;

length key $ 40;
array h{4} $ 40 pan1 pan2 pan3 add1;
/*copy the first obs from Hash Table HA into PDV*/
_rc=hi.first();
do while(_rc eq 0); *until the end of Hash Table HA;
/*assign a unique cluster flag(i.e. household)*/
 household+1;
 do i=1 to 4;
/*push not missing value of current obs into another Hash Table _HA*/
   if not missing(h{i}) then do; key=h{i}; _ha.replace();end;
 end;   
/*start to run over Hash Table HA ,until can not find any more 
 observation which is the same cluster with current observation*/
 do until(x=1);
    x=1;
/*copy the first obs from Hash Table HA into PDV*/
    rc=hi.first();
    do while(rc=0);
      found=0;
      do j=1 to 4;
/*find whether any one of value is included in the current obs*/
      key=h{j};rcc=_ha.check();
      if rcc =0 then found=1;
      end;
      if found then do;
/*if any one of value is included,then push the obs which is copied from
Hash Table HA into Hash Tables _HA,flag it the same cluster with the 
current obs and output it into dataset*/
                      do k=1 to 4;
                      if not missing(h{k}) then do; key=h{k};_ha.replace();end;
                      end;
                      output;x=0; _count=count;*keep this found obs's index; 
                    end;
      rc=hi.next();
/*remove the found obs from Hash Table HA,since it has been seared*/
      if found then rx=ha.remove(key : _count);
     end;
 end;
/*clear up all the index which is the same cluster with the current obs*/ 
 _ha.clear();
/*copy the first obs from Hash Table HA into PDV*/
_rc=hi.first();
end; 
run;


Ksharp

Ask a Question
Discussion stats
  • 5 replies
  • 237 views
  • 0 likes
  • 5 in conversation