I have a large dataset consisting of pairs (about 130 miljon rows). I would like to create clusters from all observations that are connected.
Example
1 2
1 3
4 5
5 2
6 7
Two groups should be created from the example data; group 1: 1,2,3,4,5 and group 2: 6,7.
Any ideas on an approach for this?
Thank you for any help!
The OPTNET procedure in SAS/OR will find connected components for both directed and undirected graphs. Look at the CONCOMP statement.
If you don't have a licence for SAS/OR and your graph is undirected, then use my macro, as mentioned by , and set the hash size factor fairly high (15-18, I would guess).
Unfortunately, neither OPTNET or the SubGraphs macro supports BY processing.
PG
What is the rule that makes values of 1,2,3,4,5 go to group 1 and 6 and 7 go to group 2? "Connected" is very much dependent on your knowledge.
Are there two different variables involved (not quite obvious from your example) or more than 2?
Are you looking to add a variable that says what "group" a record belongs in?
Suppose a few thousand records below your example we have 7 2. What group would this record be in?
Thank you for your question. I will try to make my self mor clear. The numbers are id:s and the connections are binary. All id:s that are connected, either directly or indirectly, should be clustered in the same group. 6 and 7 do not have any connections with 1-5, neither directly or indirectly so they do not cluster with that group. If we add a pair of 7 2 this will connect group 1 and 2 in the example and create one group.
Just a few questions to be able to think about this ... from most important to least important ...
How many different identifiers do you have (ballpark figure)?
It looks like your individual identifiers (1 through 7 in the example) will be numeric. Is that correct?
Do you have any preconceptions about the cluster sizes?
If you end up with just 1 cluster, would that be acceptable or would you want to apply a different set of rules?
Ballardw, if there was a record with 7 2, it would belong in both clusters. That means that the two clusters would get collapsed into one, and there would only be one cluster.
There are about 1.8 million unique identifiers and they are numeric. The cluster sizes will probably be from 2 to about 2000. If only connected identifiers are clustered I will not end up in one cluster, since I know that most observations are not linked. There will probably be something about 5-10 observations in each cluster in average, but I expect a skewed distribution so most clusters will only consist of a few observations.
Astounding
I wanted to make sure that the data did not rely on contiguous order. The example data showed a potential for that as an unstated rule.
I have restructured the indata so that the observations is divided into blocks, so the clustering can be done within each block which probably will faciliate the task.
block id1 id2
a 1 2
a 1 3
a 4 5
a 5 2
a 6 7
b 1 2
b 1 3
ballardw - I was trying to make the data in countiguous order earlier on but I could not figure out a way to do that. It would probably be a good sollution.
The OPTNET procedure in SAS/OR will find connected components for both directed and undirected graphs. Look at the CONCOMP statement.
If you don't have a licence for SAS/OR and your graph is undirected, then use my macro, as mentioned by , and set the hash size factor fairly high (15-18, I would guess).
Unfortunately, neither OPTNET or the SubGraphs macro supports BY processing.
PG
Thank you for all answers helping me move forward in this process. I tried iterating over the data creating a dataset for each block and running the Subgraphs macro for each dataset. This was (of course) to slow, but it seemed to work. I will take a look at the script to see if I could add a BY-option to the SubGraphs code (any help in this matter would be greatly appreciated).
Astounding – That is an acceptable result. Unfortunately I am new to hash, so I would probably not be able to putting such a script together. I realize that I need to study the hash object.
PG,
"Unfortunately, neither OPTNET or the SubGraphs macro supports BY processing."
I can do that .  
Xia Keshan
Good news / bad news ... I have a plan that will get you most of the way there. However, it involves too much in the way of hashing skills and someone else would have to put the program together.
By "most of the way there", here's the output I envision for your sample data.
ID ClusterNum
------------------------
1 1
2 1
3 1
4 1
5 1
6 3
7 3
Each ID would end up assigned to a cluster number. And, as illustrated, the possibility exists that some cluster numbers would be unused (due to collapsing of clusters into one another as the processing proceeds).
It would remain to use this output to actually assign each pair to a cluster, but that is relatively easy.
If that's an acceptable result, I can sketch out what the program needs to do.
Here is the code I wrote to find out who belong to the same household. But Your data is too big.
I don't know if you have enough memory to handle this :
data have;
infile cards ;
input from $  to $ ;
cards;
1     2
1     3
4     5
5     2
9     4
6     7
8     7
;
run;
data full;
  set have end=last;
  if _n_ eq 1 then do;
   declare hash h();
    h.definekey('node');
     h.definedata('node');
     h.definedone();
  end;
  output;
  node=from; h.replace();
  from=to; to=node; 
  output;
  node=from; h.replace();
  if last then h.output(dataset:'node');
  drop node;
run;
data want(keep=node household);
declare hash ha(ordered:'a');
declare hiter hi('ha');
ha.definekey('count');
ha.definedata('last');
ha.definedone();
declare hash _ha(hashexp: 20);
_ha.definekey('key');
_ha.definedone();
if 0 then set full;
declare hash from_to(dataset:'full(where=(from is not missing and to is not missing))',hashexp:20,multidata:'y');
 from_to.definekey('from');
 from_to.definedata('to');
 from_to.definedone();
if 0 then set node;
declare hash no(dataset:'node');
declare hiter hi_no('no');
 no.definekey('node');
 no.definedata('node');
 no.definedone();
 
do while(hi_no.next()=0);
 household+1; output;
 count=1;
 key=node;_ha.add();
 last=node;ha.add();
 rc=hi.first();
 do while(rc=0);
   from=last;rx=from_to.find();
   do while(rx=0);
     key=to;ry=_ha.check();
      if ry ne 0 then do;
       node=to;output;rr=no.remove(key:node);
       key=to;_ha.add();
       count+1;
       last=to;ha.add();
      end;
      rx=from_to.find_next();
   end;
   rc=hi.next();
end;
ha.clear();_ha.clear();
end;
stop;
run;
Xia Keshan
Message was edited by: xia keshan
I managed to do the clustering with the SubGraphs macro. I divided the data into sections of about 1 million records and iterated over these datasets. I used a script found here:
The hash factor was set to 17. I do not know if that is optimal but the script was fast so that was probably a good level.
Many thanks for all the help I got on this issue!
That code is to split a table into lots of sub-tables according to a group variable . I can do it better than him via Hash Table.
"The hash factor was set to 17. I do not know if that is optimal but the script was fast so that was probably a good level."
I think the following Hash should be set the maximize value 20 ,on account of your big table .
if 0 then set full;
declare hash from_to(dataset:'full',hashexp:20,multidata:'y');
It's finally time to hack! Remember to visit the SAS Hacker's Hub regularly for news and updates.
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.
