Dear All,
I need to group pairs into groups / clusters.
For instance, in the following table:
Pair1 | Pair2 |
1 | 2 |
1 | 3 |
2 | 4 |
5 | 6 |
The first 3 rows will be in the same cluster because they all have one pair in common and the cluster will then include pairs 1, 2, 3 and 4.
The last row will be in a different cluster because pairs 5 and 6 are not in any other row.
The final dataset will then be:
Pair1 | Pair2 | Cluster |
1 | 2 | 1 |
1 | 3 | 1 |
2 | 4 | 1 |
5 | 6 | 2 |
I’ve written a code that works well but it’s inefficient (it uses a lot of memory and it's slow): when the input dataset is bigger than 50,000 pairs it goes in memory overflow.
Any help would be really appreciated.
Thanks a lot in advance
looks like a job for a hash table
Thanks Peter,
I don't know hash table, could you please address me on the right topic or provide me an example?
Take a look at that different approaches suggested in the following thread: http://communities.sas.com/message/104329#104329
A hash approach is included in one of the suggestions.
It looks like you are doing graph analysis. You want to get the connected sub-graphs. Here is a simplified version of a macro I did years ago for this. Not sure how fast it is, but it will work as long as your disk is large enough. I think this version might be for directed graphs. You could either fix the selection logic to check both directions or just duplicate your data with the values of pair1/pair2 flipped.
%macro subnet(in=,out=,from=from,to=to,subnet=subnet);
/*----------------------------------------------------------------------
SUBNET - Build connected subnets from pairs of nodes.
Input Table :FROM TO pairs of rows
Output Table:input data with &subnet added
Work Tables:
NODES - List of all nodes in input.
NEW - List of new nodes to assign to current subnet.
Algorithm:
Pick next unassigned node and grow the subnet by adding all connected
nodes. Repeat until all unassigned nodes are put into a subnet.
----------------------------------------------------------------------*/
%local subnetid next getnext ;
%*----------------------------------------------------------------------
Put code to get next unassigned node into macro variable because it is
used in two places in the program.
-----------------------------------------------------------------------;
%let getnext= select node into :next from nodes where subnet=.;
%*----------------------------------------------------------------------
Initialize subnet id counter.
-----------------------------------------------------------------------;
%let subnetid=0;
proc sql noprint;
*----------------------------------------------------------------------;
* Get list of all nodes ;
*----------------------------------------------------------------------;
create table nodes as
select distinct . as subnet, &from as node from &in
union
select distinct . as subnet, &to as node from &in
;
*----------------------------------------------------------------------;
* Get next unassigned node ;
*----------------------------------------------------------------------;
&getnext;
%do %while (&sqlobs) ;
*----------------------------------------------------------------------;
* Set subnet to next id ;
*----------------------------------------------------------------------;
%let subnetid=%eval(&subnetid+1);
update nodes set subnet=&subnetid where node=&next;
%do %while (&sqlobs) ;
*----------------------------------------------------------------------;
* Get list of connected nodes for this subnet ;
*----------------------------------------------------------------------;
create table new as
select distinct a.&to as node
from &in a, nodes b, nodes c
where a.&from= b.node
and a.&to= c.node
and b.subnet = &subnetid
and c.subnet = .
;
*----------------------------------------------------------------------;
* Update subnet for these nodes ;
*----------------------------------------------------------------------;
update nodes set subnet=&subnetid
where node in
(select node from new )
;
%end;
*----------------------------------------------------------------------;
* Get next unassigned node ;
*----------------------------------------------------------------------;
&getnext;
%end;
*----------------------------------------------------------------------;
* Create output dataset by adding subnet number. ;
*----------------------------------------------------------------------;
create table &out as
select distinct a.*,b.subnet as &subnet
from &in a , nodes b
where a.&from = b.node
;
quit;
%mend subnet ;
data in;
input pair1 pair2 @@;
cards;
1 2 1 3 2 4 5 6
run;
%subnet(in=in,out=out,from=pair1,to=pair2,subnet=group);
proc print data=out;
run;
Obs pair1 pair2 group
1 1 2 1
2 1 3 1
3 2 4 1
4 5 6 2
I want to know whether these obs is populated in the dataset randomly? there some obs maybe at the end of dataset . look like:
Pair1 Pair2
1 2
1 3
5 6
2 4
If it were so. I have coded it for a user ------- sas_Forum.
http://communities.sas.com/message/104261#104261
For Art297. I found the link is not available any more . Why ? because sas_Forum has deleted it?
Fortunately I have a copy.
The following code is an example for your question.
data test;
input ( pan1 pan2 pan3 add1 ) ($);
datalines;
aaa bbb ccc ddd
. . . .
qqq rrr www aaa
rrr ppp mmm lll
uuu zzz ffff ppp
p l m n
. . . .
. . . .
jjjj eee rrr ooo
sss www . .
. . . eee
;
run;
options compress=yes;
data want(keep=pan1 pan2 pan3 add1 household);
/*to make speed faster*/
declare hash ha(hashexp : 20,ordered : 'a');
declare hiter hi('ha');
ha.definekey('count');
ha.definedata('count','pan1','pan2','pan3','add1');
ha.definedone();
declare hash _ha(hashexp: 20,ordered : 'a');
_ha.definekey('key');
_ha.definedone();
do until(last);
set test end=last;
/*Remove obs which variable's are all missing firstly*/
if cmiss(pan1,pan2,pan3,add1) lt 4 then do;
count+1;
ha.add();
end;
end;
length key $ 40;
array h{4} $ 40 pan1 pan2 pan3 add1;
/*copy the first obs from Hash Table HA into PDV*/
_rc=hi.first();
do while(_rc eq 0); *until the end of Hash Table HA;
/*assign a unique cluster flag(i.e. household)*/
household+1;
do i=1 to 4;
/*push not missing value of current obs into another Hash Table _HA*/
if not missing(h{i}) then do; key=h{i}; _ha.replace();end;
end;
/*start to run over Hash Table HA ,until can not find any more
observation which is the same cluster with current observation*/
do until(x=1);
x=1;
/*copy the first obs from Hash Table HA into PDV*/
rc=hi.first();
do while(rc=0);
found=0;
do j=1 to 4;
/*find whether any one of value is included in the current obs*/
key=h{j};rcc=_ha.check();
if rcc =0 then found=1;
end;
if found then do;
/*if any one of value is included,then push the obs which is copied from
Hash Table HA into Hash Tables _HA,flag it the same cluster with the
current obs and output it into dataset*/
do k=1 to 4;
if not missing(h{k}) then do; key=h{k};_ha.replace();end;
end;
output;x=0; _count=count;*keep this found obs's index;
end;
rc=hi.next();
/*remove the found obs from Hash Table HA,since it has been seared*/
if found then rx=ha.remove(key : _count);
end;
end;
/*clear up all the index which is the same cluster with the current obs*/
_ha.clear();
/*copy the first obs from Hash Table HA into PDV*/
_rc=hi.first();
end;
run;
Ksharp
April 27 – 30 | Gaylord Texan | Grapevine, Texas
Walk in ready to learn. Walk out ready to deliver. This is the data and AI conference you can't afford to miss.
Register now and lock in 2025 pricing—just $495!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.