topic Re: Create clusters from pairs in SAS Programming

Create clusters from pairs

peter_sjogarde — Thu, 19 Mar 2015 16:07:42 GMT

I have a large dataset consisting of pairs (about 130 miljon rows). I would like to create clusters from all observations that are connected.

Example

1 2

1 3

4 5

5 2

6 7

Two groups should be created from the example data; group 1: 1,2,3,4,5 and group 2: 6,7.

Any ideas on an approach for this?

Thank you for any help!

Re: Create clusters from pairs

ballardw — Thu, 19 Mar 2015 16:42:30 GMT

What is the rule that makes values of 1,2,3,4,5 go to group 1 and 6 and 7 go to group 2? "Connected" is very much dependent on your knowledge.

Are there two different variables involved (not quite obvious from your example) or more than 2?

Are you looking to add a variable that says what "group" a record belongs in?

Suppose a few thousand records below your example we have 7 2. What group would this record be in?

Re: Create clusters from pairs

peter_sjogarde — Thu, 19 Mar 2015 16:52:20 GMT

Thank you for your question. I will try to make my self mor clear. The numbers are id:s and the connections are binary. All id:s that are connected, either directly or indirectly, should be clustered in the same group. 6 and 7 do not have any connections with 1-5, neither directly or indirectly so they do not cluster with that group. If we add a pair of 7 2 this will connect group 1 and 2 in the example and create one group.

Re: Create clusters from pairs

Astounding — Thu, 19 Mar 2015 16:55:40 GMT

Just a few questions to be able to think about this ... from most important to least important ...

How many different identifiers do you have (ballpark figure)?

It looks like your individual identifiers (1 through 7 in the example) will be numeric. Is that correct?

Do you have any preconceptions about the cluster sizes?

If you end up with just 1 cluster, would that be acceptable or would you want to apply a different set of rules?

Ballardw, if there was a record with 7 2, it would belong in both clusters. That means that the two clusters would get collapsed into one, and there would only be one cluster.

Re: Create clusters from pairs

peter_sjogarde — Thu, 19 Mar 2015 17:07:46 GMT

There are about 1.8 million unique identifiers and they are numeric. The cluster sizes will probably be from 2 to about 2000. If only connected identifiers are clustered I will not end up in one cluster, since I know that most observations are not linked. There will probably be something about 5-10 observations in each cluster in average, but I expect a skewed distribution so most clusters will only consist of a few observations.

Re: Create clusters from pairs

ballardw — Thu, 19 Mar 2015 17:26:15 GMT

Astounding

I wanted to make sure that the data did not rely on contiguous order. The example data showed a potential for that as an unstated rule.

Re: Create clusters from pairs

peter_sjogarde — Thu, 19 Mar 2015 18:11:52 GMT

I have restructured the indata so that the observations is divided into blocks, so the clustering can be done within each block which probably will faciliate the task.

block id1 id2

a 1 2

a 1 3

a 4 5

a 5 2

a 6 7

b 1 2

b 1 3

ballardw - I was trying to make the data in countiguous order earlier on but I could not figure out a way to do that. It would probably be a good sollution.

Re: Create clusters from pairs

Reeza — Thu, 19 Mar 2015 18:45:57 GMT

Does order matter, can you switch around ID1/ID2 without any repercussions?

If so it might be worth sorting so that ID1<ID2.

If you have SAS/OR look into PROC BOM.

Otherwise I think the macro here will work for you, courtesy of

Re: Create clusters from pairs

Astounding — Thu, 19 Mar 2015 21:26:56 GMT

Good news / bad news ... I have a plan that will get you most of the way there. However, it involves too much in the way of hashing skills and someone else would have to put the program together.

By "most of the way there", here's the output I envision for your sample data.

ID ClusterNum

------------------------

1 1

2 1

3 1

4 1

5 1

6 3

7 3

Each ID would end up assigned to a cluster number. And, as illustrated, the possibility exists that some cluster numbers would be unused (due to collapsing of clusters into one another as the processing proceeds).

It would remain to use this output to actually assign each pair to a cluster, but that is relatively easy.

If that's an acceptable result, I can sketch out what the program needs to do.

Re: Create clusters from pairs

PGStats — Thu, 19 Mar 2015 22:07:33 GMT

The OPTNET procedure in SAS/OR will find connected components for both directed and undirected graphs. Look at the CONCOMP statement.

If you don't have a licence for SAS/OR and your graph is undirected, then use my macro, as mentioned by , and set the hash size factor fairly high (15-18, I would guess).

Unfortunately, neither OPTNET or the SubGraphs macro supports BY processing.

Re: Create clusters from pairs

peter_sjogarde — Fri, 20 Mar 2015 08:08:33 GMT

Thank you for all answers helping me move forward in this process. I tried iterating over the data creating a dataset for each block and running the Subgraphs macro for each dataset. This was (of course) to slow, but it seemed to work. I will take a look at the script to see if I could add a BY-option to the SubGraphs code (any help in this matter would be greatly appreciated).

Astounding – That is an acceptable result. Unfortunately I am new to hash, so I would probably not be able to putting such a script together. I realize that I need to study the hash object.

Re: Create clusters from pairs

Ksharp — Fri, 20 Mar 2015 10:39:00 GMT

Here is the code I wrote to find out who belong to the same household. But Your data is too big.

I don't know if you have enough memory to handle this :


data have;
infile cards ;
input from $  to $ ;
cards;
1     2
1     3
4     5
5     2
9     4
6     7
8     7
;
run;
data full;
  set have end=last;
  if _n_ eq 1 then do;
   declare hash h();
    h.definekey('node');
     h.definedata('node');
     h.definedone();
  end;
  output;
  node=from; h.replace();
  from=to; to=node; 
  output;
  node=from; h.replace();
  if last then h.output(dataset:'node');
  drop node;
run;


data want(keep=node household);
declare hash ha(ordered:'a');
declare hiter hi('ha');
ha.definekey('count');
ha.definedata('last');
ha.definedone();
declare hash _ha(hashexp: 20);
_ha.definekey('key');
_ha.definedone();

if 0 then set full;
declare hash from_to(dataset:'full(where=(from is not missing and to is not missing))',hashexp:20,multidata:'y');
 from_to.definekey('from');
 from_to.definedata('to');
 from_to.definedone();

if 0 then set node;
declare hash no(dataset:'node');
declare hiter hi_no('no');
 no.definekey('node');
 no.definedata('node');
 no.definedone();
 

do while(hi_no.next()=0);
 household+1; output;
 count=1;
 key=node;_ha.add();
 last=node;ha.add();
 rc=hi.first();
 do while(rc=0);
   from=last;rx=from_to.find();
   do while(rx=0);
     key=to;ry=_ha.check();
      if ry ne 0 then do;
       node=to;output;rr=no.remove(key:node);
       key=to;_ha.add();
       count+1;
       last=to;ha.add();
      end;
      rx=from_to.find_next();
   end;
   rc=hi.next();
end;
ha.clear();_ha.clear();
end;
stop;
run;

Xia Keshan

Message was edited by: xia keshan

Re: Create clusters from pairs

peter_sjogarde — Fri, 20 Mar 2015 16:08:54 GMT

I managed to do the clustering with the SubGraphs macro. I divided the data into sections of about 1 million records and iterated over these datasets. I used a script found here:

http://snipplr.com/view/6212/

The hash factor was set to 17. I do not know if that is optimal but the script was fast so that was probably a good level.

Many thanks for all the help I got on this issue!

Re: Create clusters from pairs

Ksharp — Sat, 21 Mar 2015 06:24:22 GMT

That code is to split a table into lots of sub-tables according to a group variable . I can do it better than him via Hash Table.

"The hash factor was set to 17. I do not know if that is optimal but the script was fast so that was probably a good level."

I think the following Hash should be set the maximize value 20 ,on account of your big table .

if 0 then set full;

declare hash from_to(dataset:'full',hashexp:20,multidata:'y');

Re: Create clusters from pairs

Ksharp — Sat, 21 Mar 2015 06:56:21 GMT

PG,

"Unfortunately, neither OPTNET or the SubGraphs macro supports BY processing."

I can do that .

Xia Keshan

Re: Create clusters from pairs

billfish — Thu, 09 Apr 2015 19:16:41 GMT

Although the question is deemed answered, I want to put in 2 cents into the discussion.

I do not have access to SAS/OR so I cannot evaluate the chosen solution.

There are 2 offered solutions: 1) by xia keshan, and 2) SubGraphsMacro.sas (by PGStats)

In the following I propose a different solution.

First the input dataset t_a (here I will characterize it as 100000 (150000)):

/***********************/
/**** input dataset ****/
/***********************/
data t_a;
do _N_ = 1 to 100000;
      a1 = int(150000*ranuni(3));
      a2 = int(150000*ranuni(5));
      output;
end;
run;

Now some benchmarks for the 3 solutions: 1) xia 2) current 3) SubGraphsMacro

                          xia       current     SubGraphsMacro
100000   (150000) -> 54.00 sec ( 2.00 sec)   (7:00 min)
200000   (300000) ->   1:50 min ( 4.06 sec) (25:57 min)
400000   (600000) ->   3:35 min ( 9.00 sec)

1000000 (1500000) -> 9:15 min (13.50 sec)
2000000 (3000000) -> 19:19 min (34:00 sec)
4000000 (6000000) -> 37:45 min ( 1:03 min)
8000000 (12000000) -> 106:58 min ( 2:25 min)
16000000 (24000000) -> (not done) (memory crash)

Observations:
   1) SubGraphsMacro is by far the slowest, about 10 times slower that xia's solution.
   2) The current solution is about 30 times faster than xia's solution.
   3) If enough RAM resources are available the current solution can find the clusters for 130M record dataset in less than 1 hour.

The main ingredient of the proposed solution is the "symmetrization" of the dataset:

/*****************************/
/*** t_b = symmetrized t_a ***/
/*****************************/
data t_b(keep=a1 a2);
set t_a(rename=(a1=b1 a2=b2));
if b1=b2 then do; a1=b1; a2=b2; output; end;
else do; a1=b1; a2=b2; output;
a1=b2; a2=b1; output; end;
run;

/************************************/
/***** proposed hash solution 1 *****/
/************************************/
data _null_;
length yMbr Mbr SetNo a1 a2 SET_NO SetNo 8.;

   if _N_=1 then do;
      declare hash ha(dataset:'t_b',multidata:'Y');
         ha.definekey ('a1');
         ha.definedata('a1','a2');
         declare hiter aIter('ha');
         ha.definedone();
         call missing(a1,a2);

      declare hash hx(multidata:'N');
         hx.definekey ('xMbr');
         hx.definedata('xMbr');
         declare hiter xIter('hx');
         hx.definedone();

      declare hash hy(multidata:'N');
         hy.definekey ('yMbr');
         hy.definedata('yMbr');
         declare hiter yIter('hy');
         hy.definedone();

      declare hash hz(multidata:'N');
         hz.definekey ('Mbr');
         hz.definedata('Mbr','SetNo');
         hz.definedone();
   end;

   do until (aDone);
      set t_b end=aDone;
      xMbr=a1; hx.ref();
      xMbr=a2; hx.ref();
   end;
   /***************************/
   /*** start of clustering ***/
   /***************************/
   Set_No=0;
   aSum=0;
   k1=xIter.first();
   do while (k1=0);
      aSum+1;
      a1=xMbr; Mbr=xMbr;
      k2=ha.find();
      j2=hz.find();
      z1=hy.num_items;
      if (k2=0) and (not(j2=0)) and (z1=0) then do;
         Set_No+1; zChange=1;
         yMbr=a1; hy.ref();
         yMbr=a2; hy.ref();
         do until (zChange=0);
            zChange=0;
            k3=yIter.first();
            do while(k3=0);
               a1=yMbr;
               k4=ha.find();
               if (k4=0) then do; d1=ha.remove(key:a1); end;
               do while (k4=0);
                  yMbr=a2; zAdd=hy.add();
                  zChange+(zAdd=0);
                  k4=ha.find_next();
                  if (k4=0) then do; d1=ha.remove(key:a1); end;
               end;
               k3=yIter.next();
            end;
         end;
         k4=yIter.first();
         do while (k4=0);
            Mbr=yMbr; SETNO=SET_NO; hz.ref();
            k4=yIter.next();
         end;
         z1=hy.num_items;
         hy.clear();
         z1=hy.num_items;
      end;
      k1=xIter.next();
   end;
   hz.output(dataset:'t_c');
run;