🔒 This topic is solved and locked.
peter_sjogarde
Fluorite | Level 6

I have a large dataset consisting of pairs (about 130 million rows). I would like to create clusters from all observations that are connected.

Example

1     2

1     3

4     5

5     2

6     7

Two groups should be created from the example data; group 1: 1,2,3,4,5 and group 2: 6,7.

Any ideas on an approach for this?

Thank you for any help!
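For context, the transitive grouping the question describes is what graph theory calls finding the connected components of an undirected graph. A minimal illustration of the idea in Python (not SAS, purely to show the algorithm) using a union-find structure:

```python
def connected_groups(pairs):
    """Group node ids into connected components via union-find."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving keeps trees shallow
            x = parent[x]
        return x

    for a, b in pairs:
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[rb] = ra  # merge the two components

    groups = {}
    for node in parent:
        groups.setdefault(find(node), set()).add(node)
    return sorted(groups.values(), key=min)

pairs = [(1, 2), (1, 3), (4, 5), (5, 2), (6, 7)]
print(connected_groups(pairs))  # → [{1, 2, 3, 4, 5}, {6, 7}]
```

This reproduces the two groups from the example data; the SAS solutions below implement the same idea with hash objects or PROC OPTNET.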

ballardw
Super User

What is the rule that makes the values 1,2,3,4,5 go to group 1, and 6 and 7 go to group 2? "Connected" very much depends on your domain knowledge.

Are there two different variables involved (not quite obvious from your example) or more than 2?

Are you looking to add a variable that says what "group" a record belongs in?

Suppose that, a few thousand records below your example, we have the pair 7 2. What group would this record be in?

peter_sjogarde
Fluorite | Level 6

Thank you for your question. I will try to make myself more clear. The numbers are IDs and the connections are binary. All IDs that are connected, either directly or indirectly, should be clustered in the same group. 6 and 7 do not have any connections with 1-5, either directly or indirectly, so they do not cluster with that group. If we add the pair 7 2, it will connect groups 1 and 2 in the example and create one group.

Astounding
PROC Star

Just a few questions to be able to think about this ... from most important to least important ...

How many different identifiers do you have (ballpark figure)?

It looks like your individual identifiers (1 through 7 in the example) will be numeric.  Is that correct?

Do you have any preconceptions about the cluster sizes?

If you end up with just 1 cluster, would that be acceptable or would you want to apply a different set of rules?

Ballardw, if there were a record with 7 2, it would belong in both clusters. That means the two clusters would get collapsed into one, and there would only be one cluster.

peter_sjogarde
Fluorite | Level 6

There are about 1.8 million unique identifiers and they are numeric. The cluster sizes will probably range from 2 to about 2000. If only connected identifiers are clustered, I will not end up with one cluster, since I know that most observations are not linked. There will probably be about 5-10 observations per cluster on average, but I expect a skewed distribution, so most clusters will consist of only a few observations.

ballardw
Super User

Astounding

I wanted to make sure that the data did not rely on contiguous order. The example data showed a potential for that as an unstated rule.

peter_sjogarde
Fluorite | Level 6

I have restructured the input data so that the observations are divided into blocks; the clustering can then be done within each block, which will probably facilitate the task.

block     id1     id2

a     1     2

a     1     3

a     4     5

a     5     2

a     6     7

b     1     2

b     1     3

ballardw - I was trying to put the data in contiguous order earlier on, but I could not figure out a way to do that. It would probably be a good solution.
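The block structure shown above means each block can be clustered independently, which is exactly the BY-group behaviour discussed later in the thread. A hypothetical Python sketch (illustration only, not SAS) of per-block components using a breadth-first walk:

```python
from collections import defaultdict, deque

def components_within_block(rows):
    """rows: (block, id1, id2) triples. Cluster ids separately inside
    each block; a component never crosses block boundaries."""
    edges = defaultdict(lambda: defaultdict(set))   # block -> adjacency list
    for block, a, b in rows:
        edges[block][a].add(b)
        edges[block][b].add(a)

    result = {}
    for block, adj in edges.items():
        seen, comps = set(), []
        for start in adj:
            if start in seen:
                continue
            comp, queue = set(), deque([start])
            while queue:                            # breadth-first walk
                n = queue.popleft()
                if n in comp:
                    continue
                comp.add(n)
                queue.extend(adj[n] - comp)
            seen |= comp
            comps.append(comp)
        result[block] = comps
    return result

rows = [('a', 1, 2), ('a', 1, 3), ('a', 4, 5), ('a', 5, 2), ('a', 6, 7),
        ('b', 1, 2), ('b', 1, 3)]
out = components_within_block(rows)
# block 'a' yields {1,2,3,4,5} and {6,7}; block 'b' yields {1,2,3}
```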

Reeza
Super User

Does order matter? Can you switch around ID1/ID2 without any repercussions?

If so it might be worth sorting so that ID1<ID2.
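Orienting each pair so the smaller id comes first (and dropping the duplicates that creates) can be shown in a line of Python; in SAS the equivalent would be a short data step followed by PROC SORT with NODUPKEY:

```python
# Treat edges as undirected: put the smaller id first, then deduplicate.
pairs = [(5, 2), (1, 3), (2, 1), (1, 2)]
normalized = sorted({(min(a, b), max(a, b)) for a, b in pairs})
print(normalized)  # → [(1, 2), (1, 3), (2, 5)]
```

Note that (2, 1) and (1, 2) collapse to a single edge, which roughly halves the edge list for symmetric data.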

If you have SAS/OR look into PROC BOM.

Otherwise I think the macro here will work for you, courtesy of

PGStats
Opal | Level 21

The OPTNET procedure in SAS/OR will find connected components for both directed and undirected graphs. Look at the CONCOMP statement.

If you don't have a licence for SAS/OR and your graph is undirected, then use my macro, as mentioned by , and set the hash size factor fairly high (15-18, I would guess).

Unfortunately, neither OPTNET nor the SubGraphs macro supports BY processing.

PG
peter_sjogarde
Fluorite | Level 6

Thank you for all the answers helping me move forward in this process. I tried iterating over the data, creating a dataset for each block and running the SubGraphs macro on each dataset. This was (of course) too slow, but it seemed to work. I will take a look at the script to see if I can add a BY option to the SubGraphs code (any help in this matter would be greatly appreciated).

Astounding – That is an acceptable result. Unfortunately I am new to hashing, so I would probably not be able to put such a script together. I realize that I need to study the hash object.

Ksharp
Super User

PG,

"Unfortunately, neither OPTNET or the SubGraphs macro supports BY processing."

I can do that. 🙂

Xia Keshan

Astounding
PROC Star

Good news / bad news ... I have a plan that will get you most of the way there.  However, it involves too much in the way of hashing skills and someone else would have to put the program together.

By "most of the way there", here's the output I envision for your sample data.

ID   ClusterNum
---------------
 1        1
 2        1
 3        1
 4        1
 5        1
 6        3
 7        3

Each ID would end up assigned to a cluster number.  And, as illustrated, the possibility exists that some cluster numbers would be unused (due to collapsing of clusters into one another as the processing proceeds).

It would remain to use this output to actually assign each pair to a cluster, but that is relatively easy.

If that's an acceptable result, I can sketch out what the program needs to do.
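The bookkeeping described here, sequential cluster numbers with some numbers left unused after collapses, can be sketched in Python (illustration only; the exact numbers differ from the table above depending on merge direction):

```python
def assign_clusters(pairs):
    """Assign each id a cluster number; collapse clusters on contact."""
    cluster = {}                          # id -> cluster number
    next_num = 1
    for a, b in pairs:
        ca, cb = cluster.get(a), cluster.get(b)
        if ca is None and cb is None:     # neither id seen: new cluster
            cluster[a] = cluster[b] = next_num
            next_num += 1
        elif ca is None:                  # a joins b's cluster
            cluster[a] = cb
        elif cb is None:                  # b joins a's cluster
            cluster[b] = ca
        elif ca != cb:                    # collapse: cb's number goes unused
            for k, v in cluster.items():
                if v == cb:
                    cluster[k] = ca
    return cluster

# 1-5 end up sharing one number, 6-7 another, and one number is left unused
print(assign_clusters([(1, 2), (1, 3), (4, 5), (5, 2), (6, 7)]))
```

The relabel-on-merge loop is the expensive part; in SAS that is exactly the step a hash object (or the union-find trick) would speed up.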

Ksharp
Super User

Here is the code I wrote to find out who belongs to the same household. But your data is too big; I don't know if you have enough memory to handle this:


data have;
  infile cards;
  input from $ to $;
cards;
1     2
1     3
4     5
5     2
9     4
6     7
8     7
;
run;

/* Double every edge (the graph is undirected) and collect the distinct
   node ids in a hash, written out as dataset NODE at the end. */
data full;
  set have end=last;
  if _n_ eq 1 then do;
    declare hash h();
    h.definekey('node');
    h.definedata('node');
    h.definedone();
  end;
  output;                          /* original direction       */
  node=from; h.replace();
  from=to; to=node;                /* swap, output the reverse */
  output;
  node=from; h.replace();
  if last then h.output(dataset:'node');
  drop node;
run;

/* Traverse the graph: take an unassigned node, then repeatedly follow
   edges from every node reached so far; everything reached gets the
   same HOUSEHOLD number. */
data want(keep=node household);
  declare hash ha(ordered:'a');    /* visit queue for the current component */
  declare hiter hi('ha');
  ha.definekey('count');
  ha.definedata('last');
  ha.definedone();
  declare hash _ha(hashexp:20);    /* nodes already visited in this component */
  _ha.definekey('key');
  _ha.definedone();

  if 0 then set full;
  declare hash from_to(dataset:'full(where=(from is not missing and to is not missing))',hashexp:20,multidata:'y');
  from_to.definekey('from');
  from_to.definedata('to');
  from_to.definedone();

  if 0 then set node;
  declare hash no(dataset:'node'); /* nodes not yet assigned; neighbours are
                                      removed from it as they are reached */
  declare hiter hi_no('no');
  no.definekey('node');
  no.definedata('node');
  no.definedone();

  do while(hi_no.next()=0);
    household+1; output;           /* seed a new component */
    count=1;
    key=node; _ha.add();
    last=node; ha.add();
    rc=hi.first();
    do while(rc=0);                /* expand the frontier */
      from=last; rx=from_to.find();
      do while(rx=0);              /* every edge leaving LAST */
        key=to; ry=_ha.check();
        if ry ne 0 then do;        /* unvisited neighbour */
          node=to; output; rr=no.remove(key:node);
          key=to; _ha.add();
          count+1;
          last=to; ha.add();
        end;
        rx=from_to.find_next();
      end;
      rc=hi.next();
    end;
    ha.clear(); _ha.clear();
  end;
  stop;
run;


Xia Keshan


peter_sjogarde
Fluorite | Level 6

I managed to do the clustering with the SubGraphs macro. I divided the data into sections of about 1 million records and iterated over these datasets. I used a script found here:

http://snipplr.com/view/6212/

The hash factor was set to 17. I do not know if that is optimal, but the script was fast, so that was probably a good level.

Many thanks for all the help I got on this issue!

Ksharp
Super User

That code splits a table into many sub-tables according to a group variable. I can do that better via a hash table.

"The hash factor was set to 17. I do not know if that is optimal but the script was fast so that was probably a good level."

I think the following hash should be set to the maximum value, 20, on account of your big table:

if 0 then set full;

declare hash from_to(dataset:'full',hashexp:20,multidata:'y');


Discussion stats
  • 15 replies
  • 3148 views
  • 6 likes
  • 7 in conversation