Dear All,
I need to group pairs into groups / clusters.
For instance, in the following table:
Pair1 | Pair2 |
1 | 2 |
1 | 3 |
2 | 4 |
5 | 6 |
The first 3 rows will be in the same cluster because they all have one pair in common and the cluster will then include pairs 1, 2, 3 and 4.
The last row will be in a different cluster because pairs 5 and 6 are not in any other row.
The final dataset will then be:
Pair1 | Pair2 | Cluster |
1 | 2 | 1 |
1 | 3 | 1 |
2 | 4 | 1 |
5 | 6 | 2 |
I’ve written a code that works well but it’s inefficient (it uses a lot of memory and it's slow): when the input dataset is bigger than 50,000 pairs it goes in memory overflow.
Any help would be really appreciated.
Thanks a lot in advance
looks like a job for a hash table
Thanks Peter,
I don't know hash table, could you please address me on the right topic or provide me an example?
Take a look at that different approaches suggested in the following thread: http://communities.sas.com/message/104329#104329
A hash approach is included in one of the suggestions.
It looks like you are doing graph analysis. You want to get the connected sub-graphs. Here is a simplified version of a macro I did years ago for this. Not sure how fast it is, but it will work as long as your disk is large enough. I think this version might be for directed graphs. You could either fix the selection logic to check both directions or just duplicate your data with the values of pair1/pair2 flipped.
%macro subnet(in=,out=,from=from,to=to,subnet=subnet);
/*----------------------------------------------------------------------
SUBNET - Build connected subnets from pairs of nodes.
Input Table :FROM TO pairs of rows
Output Table:input data with &subnet added
Work Tables:
NODES - List of all nodes in input.
NEW - List of new nodes to assign to current subnet.
Algorithm:
Pick next unassigned node and grow the subnet by adding all connected
nodes. Repeat until all unassigned nodes are put into a subnet.
----------------------------------------------------------------------*/
%local subnetid next getnext ;
%*----------------------------------------------------------------------
Put code to get next unassigned node into macro variable because it is
used in two places in the program.
-----------------------------------------------------------------------;
%let getnext= select node into :next from nodes where subnet=.;
%*----------------------------------------------------------------------
Initialize subnet id counter.
-----------------------------------------------------------------------;
%let subnetid=0;
proc sql noprint;
*----------------------------------------------------------------------;
* Get list of all nodes ;
*----------------------------------------------------------------------;
create table nodes as
select distinct . as subnet, &from as node from &in
union
select distinct . as subnet, &to as node from &in
;
*----------------------------------------------------------------------;
* Get next unassigned node ;
*----------------------------------------------------------------------;
&getnext;
%do %while (&sqlobs) ;
*----------------------------------------------------------------------;
* Set subnet to next id ;
*----------------------------------------------------------------------;
%let subnetid=%eval(&subnetid+1);
update nodes set subnet=&subnetid where node=&next;
%do %while (&sqlobs) ;
*----------------------------------------------------------------------;
* Get list of connected nodes for this subnet ;
*----------------------------------------------------------------------;
create table new as
select distinct a.&to as node
from &in a, nodes b, nodes c
where a.&from= b.node
and a.&to= c.node
and b.subnet = &subnetid
and c.subnet = .
;
*----------------------------------------------------------------------;
* Update subnet for these nodes ;
*----------------------------------------------------------------------;
update nodes set subnet=&subnetid
where node in
(select node from new )
;
%end;
*----------------------------------------------------------------------;
* Get next unassigned node ;
*----------------------------------------------------------------------;
&getnext;
%end;
*----------------------------------------------------------------------;
* Create output dataset by adding subnet number. ;
*----------------------------------------------------------------------;
create table &out as
select distinct a.*,b.subnet as &subnet
from &in a , nodes b
where a.&from = b.node
;
quit;
%mend subnet ;
data in;
input pair1 pair2 @@;
cards;
1 2 1 3 2 4 5 6
run;
%subnet(in=in,out=out,from=pair1,to=pair2,subnet=group);
proc print data=out;
run;
Obs pair1 pair2 group
1 1 2 1
2 1 3 1
3 2 4 1
4 5 6 2
I want to know whether these obs is populated in the dataset randomly? there some obs maybe at the end of dataset . look like:
Pair1 Pair2
1 2
1 3
5 6
2 4
If it were so. I have coded it for a user ------- sas_Forum.
http://communities.sas.com/message/104261#104261
For Art297. I found the link is not available any more . Why ? because sas_Forum has deleted it?
Fortunately I have a copy.
The following code is an example for your question.
data test; input ( pan1 pan2 pan3 add1 ) ($); datalines; aaa bbb ccc ddd . . . . qqq rrr www aaa rrr ppp mmm lll uuu zzz ffff ppp p l m n . . . . . . . . jjjj eee rrr ooo sss www . . . . . eee ; run; options compress=yes; data want(keep=pan1 pan2 pan3 add1 household); /*to make speed faster*/ declare hash ha(hashexp : 20,ordered : 'a'); declare hiter hi('ha'); ha.definekey('count'); ha.definedata('count','pan1','pan2','pan3','add1'); ha.definedone(); declare hash _ha(hashexp: 20,ordered : 'a'); _ha.definekey('key'); _ha.definedone(); do until(last); set test end=last; /*Remove obs which variable's are all missing firstly*/ if cmiss(pan1,pan2,pan3,add1) lt 4 then do; count+1; ha.add(); end; end; length key $ 40; array h{4} $ 40 pan1 pan2 pan3 add1; /*copy the first obs from Hash Table HA into PDV*/ _rc=hi.first(); do while(_rc eq 0); *until the end of Hash Table HA; /*assign a unique cluster flag(i.e. household)*/ household+1; do i=1 to 4; /*push not missing value of current obs into another Hash Table _HA*/ if not missing(h{i}) then do; key=h{i}; _ha.replace();end; end; /*start to run over Hash Table HA ,until can not find any more observation which is the same cluster with current observation*/ do until(x=1); x=1; /*copy the first obs from Hash Table HA into PDV*/ rc=hi.first(); do while(rc=0); found=0; do j=1 to 4; /*find whether any one of value is included in the current obs*/ key=h{j};rcc=_ha.check(); if rcc =0 then found=1; end; if found then do; /*if any one of value is included,then push the obs which is copied from Hash Table HA into Hash Tables _HA,flag it the same cluster with the current obs and output it into dataset*/ do k=1 to 4; if not missing(h{k}) then do; key=h{k};_ha.replace();end; end; output;x=0; _count=count;*keep this found obs's index; end; rc=hi.next(); /*remove the found obs from Hash Table HA,since it has been seared*/ if found then rx=ha.remove(key : _count); end; end; /*clear up all the index which is the same cluster with the current obs*/ _ha.clear(); /*copy the first obs from Hash Table HA into PDV*/ _rc=hi.first(); end; run;
Ksharp
Don't miss out on SAS Innovate - Register now for the FREE Livestream!
Can't make it to Vegas? No problem! Watch our general sessions LIVE or on-demand starting April 17th. Hear from SAS execs, best-selling author Adam Grant, Hot Ones host Sean Evans, top tech journalist Kara Swisher, AI expert Cassie Kozyrkov, and the mind-blowing dance crew iLuminate! Plus, get access to over 20 breakout sessions.
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.