DATA Step, Macro, Functions and more

Create clusters from pairs

Accepted Solution Solved
Reply
Contributor
Posts: 22
Accepted Solution

Create clusters from pairs

I have a large dataset consisting of pairs (about 130 miljon rows). I would like to create clusters from all observations that are connected.

Example

1     2

1     3

4     5

5     2

6     7

Two groups should be created from the example data; group 1: 1,2,3,4,5 and group 2: 6,7.

Any ideas on an approach for this?

Thank you for any help!


Accepted Solutions
Solution
‎03-19-2015 06:07 PM
Respected Advisor
Posts: 4,641

Re: Create clusters from pairs

The OPTNET procedure in SAS/OR will find connected components for both directed and undirected graphs. Look at the CONCOMP statement.

If you don't have a licence for SAS/OR and your graph is undirected, then use my macro, as mentioned by , and set the hash size factor fairly high (15-18, I would guess).

Unfortunately, neither OPTNET or the SubGraphs macro supports BY processing.

PG

PG

View solution in original post


All Replies
Super User
Posts: 10,483

Re: Create clusters from pairs

What is the rule that makes values of 1,2,3,4,5 go to group 1 and 6 and 7 go to group 2? "Connected" is very much dependent on your knowledge.

Are there two different variables involved (not quite obvious from your example) or more than 2?

Are you looking to add a variable that says what "group" a record belongs in?

Suppose a few thousand records below your example we have 7 2. What group would this record be in?

Contributor
Posts: 22

Re: Create clusters from pairs

Thank you for your question. I will try to make my self mor clear. The numbers are id:s and the connections are binary. All id:s that are connected, either directly or indirectly, should be clustered in the same group. 6 and 7 do not have any connections with 1-5, neither directly or indirectly so they do not cluster with that group. If we add a pair of 7 2 this will connect group 1 and 2 in the example and create one group.

Super User
Posts: 5,079

Re: Create clusters from pairs

Just a few questions to be able to think about this ... from most important to least important ...

How many different identifiers do you have (ballpark figure)?

It looks like your individual identifiers (1 through 7 in the example) will be numeric.  Is that correct?

Do you have any preconceptions about the cluster sizes?

If you end up with just 1 cluster, would that be acceptable or would you want to apply a different set of rules?

Ballardw, if there was a record with 7 2, it would belong in both clusters.  That means that the two clusters would get collapsed into one, and there would only be one cluster.

Contributor
Posts: 22

Re: Create clusters from pairs

There are about 1.8 million unique identifiers and they are numeric. The cluster sizes will probably be from 2 to about 2000. If only connected identifiers are clustered I will not end up in one cluster, since I know that most observations are not linked. There will probably be something about 5-10 observations in each cluster in average, but I expect a skewed distribution so most clusters will only consist of a few observations.

Super User
Posts: 10,483

Re: Create clusters from pairs

Astounding

I wanted to make sure that the data did not rely on contiguous order. The example data showed a potential for that as an unstated rule.

Contributor
Posts: 22

Re: Create clusters from pairs

I have restructured the indata so that the observations is divided into blocks, so the clustering can be done within each block which probably will faciliate the task.

block     id1     id2

a     1     2

a     1     3

a     4     5

a     5     2

a     6     7

b     1     2

b     1     3

ballardw - I was trying to make the data in countiguous order earlier on but I could not figure out a way to do that. It would probably be a good sollution.

Super User
Posts: 17,780

Re: Create clusters from pairs

Does order matter, can you switch around ID1/ID2 without any repercussions?

If so it might be worth sorting so that ID1<ID2.

If you have SAS/OR look into PROC BOM.

Otherwise I think the macro here will work for you, courtesy of

Solution
‎03-19-2015 06:07 PM
Respected Advisor
Posts: 4,641

Re: Create clusters from pairs

The OPTNET procedure in SAS/OR will find connected components for both directed and undirected graphs. Look at the CONCOMP statement.

If you don't have a licence for SAS/OR and your graph is undirected, then use my macro, as mentioned by , and set the hash size factor fairly high (15-18, I would guess).

Unfortunately, neither OPTNET or the SubGraphs macro supports BY processing.

PG

PG
Contributor
Posts: 22

Re: Create clusters from pairs

Thank you for all answers helping me move forward in this process. I tried iterating over the data creating a dataset for each block and running the Subgraphs macro for each dataset. This was (of course) to slow, but it seemed to work. I will take a look at the script to see if I could add a BY-option to the SubGraphs code (any help in this matter would be greatly appreciated).

Astounding – That is an acceptable result. Unfortunately I am new to hash, so I would probably not be able to putting such a script together. I realize that I need to study the hash object.

Super User
Posts: 9,671

Re: Create clusters from pairs

PG,

"Unfortunately, neither OPTNET or the SubGraphs macro supports BY processing."

I can do that .  Smiley Happy

Xia Keshan

Super User
Posts: 5,079

Re: Create clusters from pairs

Good news / bad news ... I have a plan that will get you most of the way there.  However, it involves too much in the way of hashing skills and someone else would have to put the program together.

By "most of the way there", here's the output I envision for your sample data.

ID   ClusterNum

------------------------

1             1

2             1

3             1

4             1

5             1

6             3

7             3

Each ID would end up assigned to a cluster number.  And, as illustrated, the possibility exists that some cluster numbers would be unused (due to collapsing of clusters into one another as the processing proceeds).

It would remain to use this output to actually assign each pair to a cluster, but that is relatively easy.

If that's an acceptable result, I can sketch out what the program needs to do.

Super User
Posts: 9,671

Re: Create clusters from pairs

Here is the code I wrote to find out who belong to the same household. But Your data is too big.

I don't know if you have enough memory to handle this :


data have;
infile cards ;
input from $  to $ ;
cards;
1     2
1     3
4     5
5     2
9     4
6     7
8     7
;
run;
data full;
  set have end=last;
  if _n_ eq 1 then do;
   declare hash h();
    h.definekey('node');
     h.definedata('node');
     h.definedone();
  end;
  output;
  node=from; h.replace();
  from=to; to=node; 
  output;
  node=from; h.replace();
  if last then h.output(dataset:'node');
  drop node;
run;


data want(keep=node household);
declare hash ha(ordered:'a');
declare hiter hi('ha');
ha.definekey('count');
ha.definedata('last');
ha.definedone();
declare hash _ha(hashexp: 20);
_ha.definekey('key');
_ha.definedone();

if 0 then set full;
declare hash from_to(dataset:'full(where=(from is not missing and to is not missing))',hashexp:20,multidata:'y');
 from_to.definekey('from');
 from_to.definedata('to');
 from_to.definedone();

if 0 then set node;
declare hash no(dataset:'node');
declare hiter hi_no('no');
 no.definekey('node');
 no.definedata('node');
 no.definedone();
 

do while(hi_no.next()=0);
 household+1; output;
 count=1;
 key=node;_ha.add();
 last=node;ha.add();
 rc=hi.first();
 do while(rc=0);
   from=last;rx=from_to.find();
   do while(rx=0);
     key=to;ry=_ha.check();
      if ry ne 0 then do;
       node=to;output;rr=no.remove(key:node);
       key=to;_ha.add();
       count+1;
       last=to;ha.add();
      end;
      rx=from_to.find_next();
   end;
   rc=hi.next();
end;
ha.clear();_ha.clear();
end;
stop;
run;


Xia Keshan

Message was edited by: xia keshan

Contributor
Posts: 22

Re: Create clusters from pairs

I managed to do the clustering with the SubGraphs macro. I divided the data into sections of about 1 million records and iterated over these datasets. I used a script found here:

http://snipplr.com/view/6212/

The hash factor was set to 17. I do not know if that is optimal but the script was fast so that was probably a good level.

Many thanks for all the help I got on this issue!

Super User
Posts: 9,671

Re: Create clusters from pairs

That code is to split a table into lots of sub-tables according to a group variable . I can do it better than him  via Hash Table.

"The hash factor was set to 17. I do not know if that is optimal but the script was fast so that was probably a good level."

I think the following Hash should be set the maximize value 20 ,on account of your big table .

if 0 then set full;

declare hash from_to(dataset:'full',hashexp:20,multidata:'y');

☑ This topic is SOLVED.

Need further help from the community? Please ask a new question.

Discussion stats
  • 15 replies
  • 484 views
  • 6 likes
  • 7 in conversation