peter_sjogarde
Fluorite | Level 6

I have a large dataset consisting of pairs (about 130 million rows). I would like to create clusters from all observations that are connected.

Example

1     2
1     3
4     5
5     2
6     7

Two groups should be created from the example data; group 1: 1,2,3,4,5 and group 2: 6,7.

Any ideas on an approach for this?

Thank you for any help!

1 ACCEPTED SOLUTION

PGStats
Opal | Level 21

The OPTNET procedure in SAS/OR will find connected components for both directed and undirected graphs. Look at the CONCOMP statement.

If you don't have a licence for SAS/OR and your graph is undirected, then use my macro, as mentioned by Reeza, and set the hash size factor fairly high (15-18, I would guess).

Unfortunately, neither OPTNET nor the SubGraphs macro supports BY processing.

PG


View solution in original post

15 REPLIES
ballardw
Super User

What is the rule that makes the values 1,2,3,4,5 go to group 1 and 6 and 7 go to group 2? "Connected" depends very much on your domain knowledge.

Are there two different variables involved (not quite obvious from your example) or more than 2?

Are you looking to add a variable that says what "group" a record belongs in?

Suppose that, a few thousand records below your example, we have the pair 7 2. What group would that record be in?

peter_sjogarde
Fluorite | Level 6

Thank you for your question. I will try to make myself more clear. The numbers are IDs and the connections are binary. All IDs that are connected, either directly or indirectly, should be clustered in the same group. 6 and 7 do not have any connections with 1-5, either directly or indirectly, so they do not cluster with that group. If we add a pair 7 2, this will connect groups 1 and 2 in the example and create one group.

Astounding
PROC Star

Just a few questions to be able to think about this ... from most important to least important ...

How many different identifiers do you have (ballpark figure)?

It looks like your individual identifiers (1 through 7 in the example) will be numeric.  Is that correct?

Do you have any preconceptions about the cluster sizes?

If you end up with just 1 cluster, would that be acceptable or would you want to apply a different set of rules?

Ballardw, if there were a record with 7 2, it would belong in both clusters. That means the two clusters would be collapsed into one, and there would only be one cluster.

peter_sjogarde
Fluorite | Level 6

There are about 1.8 million unique identifiers and they are numeric. The cluster sizes will probably range from 2 to about 2000. If only connected identifiers are clustered, I will not end up with one cluster, since I know that most observations are not linked. There will probably be about 5-10 observations per cluster on average, but I expect a skewed distribution, so most clusters will consist of only a few observations.

ballardw
Super User

Astounding - I wanted to make sure that the data did not rely on contiguous order. The example data showed a potential for that as an unstated rule.

peter_sjogarde
Fluorite | Level 6

I have restructured the input data so that the observations are divided into blocks, so the clustering can be done within each block, which will probably facilitate the task.

block     id1     id2
a         1       2
a         1       3
a         4       5
a         5       2
a         6       7
b         1       2
b         1       3

ballardw - I was trying to put the data in contiguous order earlier on, but I could not figure out a way to do that. It would probably be a good solution.

Reeza
Super User

Does order matter? Can you switch ID1/ID2 around without any repercussions?

If so it might be worth sorting so that ID1<ID2.

If you have SAS/OR look into PROC BOM.

Otherwise I think the macro here will work for you, courtesy of PGStats.
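A minimal sketch of that ID1/ID2 normalization, assuming the pairs sit in a data set called PAIRS with numeric ID1 and ID2 (placeholder names, not from the original post):

data pairs_norm;
   set pairs;
   if id1 > id2 then do;   /* swap so the smaller id always comes first */
      _t  = id1;
      id1 = id2;
      id2 = _t;
   end;
   drop _t;
run;

proc sort data=pairs_norm nodupkey;   /* drop duplicate edges such as (2,5) vs (5,2) */
   by id1 id2;
run;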

PGStats
Opal | Level 21

The OPTNET procedure in SAS/OR will find connected components for both directed and undirected graphs. Look at the CONCOMP statement.

If you don't have a licence for SAS/OR and your graph is undirected, then use my macro, as mentioned by Reeza, and set the hash size factor fairly high (15-18, I would guess).

Unfortunately, neither OPTNET nor the SubGraphs macro supports BY processing.

PG

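For reference, an untested sketch of what the OPTNET call could look like; the data set and variable names (pairs, id1, id2, clusters) are placeholders for the poster's data:

proc optnet
   data_links = pairs       /* edge list: one row per connected pair        */
   out_nodes  = clusters;   /* one row per id, with a CONCOMP variable      */
   data_links_var from = id1 to = id2;
   concomp;                 /* label each node with its connected component */
run;

The CONCOMP variable in the OUT_NODES= data set identifies the cluster each id belongs to.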
peter_sjogarde
Fluorite | Level 6

Thank you for all the answers helping me move forward with this. I tried iterating over the data, creating a dataset for each block and running the SubGraphs macro for each dataset. This was (of course) too slow, but it seemed to work. I will take a look at the script to see if I could add a BY option to the SubGraphs code (any help in this matter would be greatly appreciated).

Astounding – That is an acceptable result. Unfortunately I am new to hash, so I would probably not be able to put such a script together. I realize that I need to study the hash object.

Ksharp
Super User

PG,

"Unfortunately, neither OPTNET or the SubGraphs macro supports BY processing."

I can do that. :)

Xia Keshan

Astounding
PROC Star

Good news / bad news ... I have a plan that will get you most of the way there.  However, it involves too much in the way of hashing skills and someone else would have to put the program together.

By "most of the way there", here's the output I envision for your sample data.

ID   ClusterNum
------------------------
1             1
2             1
3             1
4             1
5             1
6             3
7             3

Each ID would end up assigned to a cluster number.  And, as illustrated, the possibility exists that some cluster numbers would be unused (due to collapsing of clusters into one another as the processing proceeds).

It would remain to use this output to actually assign each pair to a cluster, but that is relatively easy.

If that's an acceptable result, I can sketch out what the program needs to do.
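A sketch of that last step, assuming the ID-to-cluster mapping lands in a data set CLUSTERS (variables ID and CLUSTERNUM) and the pairs are in PAIRS (ID1, ID2); all of these names are placeholders, not part of the plan above:

proc sql;
   create table pairs_clustered as
   select p.id1, p.id2, c.clusternum
   from pairs as p
        left join clusters as c
        on p.id1 = c.id;   /* id1 and id2 share a cluster by construction */
quit;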

Ksharp
Super User

Here is the code I wrote to find out who belongs to the same household. But your data is too big; I don't know if you have enough memory to handle this:


data have;
infile cards ;
input from $  to $ ;
cards;
1     2
1     3
4     5
5     2
9     4
6     7
8     7
;
run;
data full;
  set have end=last;
  if _n_ eq 1 then do;
    /* hash H collects every distinct node id */
    declare hash h();
    h.definekey('node');
    h.definedata('node');
    h.definedone();
  end;
  /* write each edge in both directions so the graph can be walked as undirected */
  output;
  node=from; h.replace();
  from=to; to=node;
  output;
  node=from; h.replace();
  /* dump the list of distinct nodes to dataset NODE */
  if last then h.output(dataset:'node');
  drop node;
run;


data want(keep=node household);
/* HA holds the members of the component being grown, in insertion order;
   HI iterates over it as a work list */
declare hash ha(ordered:'a');
declare hiter hi('ha');
ha.definekey('count');
ha.definedata('last');
ha.definedone();
/* _HA records the nodes already assigned to the current component */
declare hash _ha(hashexp: 20);
_ha.definekey('key');
_ha.definedone();

/* FROM_TO is the doubled edge list, keyed on FROM with multiple TO values */
if 0 then set full;
declare hash from_to(dataset:'full(where=(from is not missing and to is not missing))',hashexp:20,multidata:'y');
 from_to.definekey('from');
 from_to.definedata('to');
 from_to.definedone();

/* NO lists the nodes not yet assigned to any component; HI_NO picks the next seed */
if 0 then set node;
declare hash no(dataset:'node');
declare hiter hi_no('no');
 no.definekey('node');
 no.definedata('node');
 no.definedone();


/* grow one component (household) per unassigned seed node */
do while(hi_no.next()=0);
 household+1; output;
 count=1;
 key=node;_ha.add();
 last=node;ha.add();
 rc=hi.first();
 /* walk the members found so far, picking up new ones as they are added */
 do while(rc=0);
   from=last;rx=from_to.find();
   /* visit every neighbour of the current member */
   do while(rx=0);
     key=to;ry=_ha.check();
      if ry ne 0 then do;   /* neighbour not seen yet: add it to the component */
       node=to;output;rr=no.remove(key:node);
       key=to;_ha.add();
       count+1;
       last=to;ha.add();
      end;
      rx=from_to.find_next();
   end;
   rc=hi.next();
end;
/* reset the per-component work structures before the next seed */
ha.clear();_ha.clear();
end;
stop;
run;


Xia Keshan


peter_sjogarde
Fluorite | Level 6

I managed to do the clustering with the SubGraphs macro. I divided the data into sections of about 1 million records and iterated over these datasets. I used a script found here:

http://snipplr.com/view/6212/

The hash factor was set to 17. I do not know if that is optimal, but the script was fast, so that was probably a good level.

Many thanks for all the help I got on this issue!

Ksharp
Super User

That code splits a table into lots of sub-tables according to a group variable. I can do it better than that via a hash table.

"The hash factor was set to 17. I do not know if that is optimal, but the script was fast, so that was probably a good level."

I think the following hash should be set to the maximum value of 20, on account of your big table:

if 0 then set full;

declare hash from_to(dataset:'full',hashexp:20,multidata:'y');
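For illustration, a rough sketch of that kind of hash-based split, assuming a data set BLOCKS sorted by a group variable BLOCK with pair variables ID1 and ID2 (all placeholder names):

data _null_;
   set blocks;
   by block;
   if first.block then do;
      /* start a fresh hash instance for each block */
      declare hash h(ordered:'a', multidata:'y');
      h.definekey('id1','id2');
      h.definedata('block','id1','id2');
      h.definedone();
   end;
   rc = h.add();
   if last.block then do;
      h.output(dataset: cats('block_', block));   /* e.g. BLOCK_A, BLOCK_B */
      h.delete();                                 /* release the instance  */
   end;
run;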
