I have several duplicates that don't look like duplicates to SAS but I know they are. (They were determined by fuzzy matching.) The only way to know they are associated is by looking at their clustering. A is related to B, and B to E and E to A.
I need to ID each cluster, so I can run code by ClusterID. The next step will be to remove some records from each cluster based on additional requirements.
This is what I have:
data have; input left $ right $; cards; A B B E C D D C E A F G ; run; proc print data=have;run;
I want to create any unique ID per row, a count is fine but it doesnt need to be consecutive.
I need the data to look like this:
Want:
left right ClusterID
A B 1
B E 1
C D 2
D C 2
E A 1
F G 3Any ideas?
data have(rename=(left=from right=to));
infile cards expandtabs truncover;
input left $ right $;
cards;
A B
B E
C D
D C
E A
F G
;
run;
data full;
set have end=last;
if _n_ eq 1 then do;
declare hash h();
h.definekey('node');
h.definedata('node');
h.definedone();
end;
output;
node=from; h.replace();
from=to; to=node;
output;
node=from; h.replace();
if last then h.output(dataset:'node');
drop node;
run;
data want(keep=node household);
declare hash ha(ordered:'a');
declare hiter hi('ha');
ha.definekey('count');
ha.definedata('last');
ha.definedone();
declare hash _ha(hashexp: 20);
_ha.definekey('key');
_ha.definedone();
if 0 then set full;
declare hash from_to(dataset:'full(where=(from is not missing and to is not missing))',hashexp:20,multidata:'y');
from_to.definekey('from');
from_to.definedata('to');
from_to.definedone();
if 0 then set node;
declare hash no(dataset:'node');
declare hiter hi_no('no');
no.definekey('node');
no.definedata('node');
no.definedone();
do while(hi_no.next()=0);
household+1; output;
count=1;
key=node;_ha.add();
last=node;ha.add();
rc=hi.first();
do while(rc=0);
from=last;rx=from_to.find();
do while(rx=0);
key=to;ry=_ha.check();
if ry ne 0 then do;
node=to;output;rr=no.remove(key:node);
key=to;_ha.add();
count+1;
last=to;ha.add();
end;
rx=from_to.find_next();
end;
rc=hi.next();
end;
ha.clear();_ha.clear();
end;
stop;
run;
data have(rename=(left=from right=to));
infile cards expandtabs truncover;
input left $ right $;
cards;
A B
B E
C D
D C
E A
F G
;
run;
data full;
set have end=last;
if _n_ eq 1 then do;
declare hash h();
h.definekey('node');
h.definedata('node');
h.definedone();
end;
output;
node=from; h.replace();
from=to; to=node;
output;
node=from; h.replace();
if last then h.output(dataset:'node');
drop node;
run;
data want(keep=node household);
declare hash ha(ordered:'a');
declare hiter hi('ha');
ha.definekey('count');
ha.definedata('last');
ha.definedone();
declare hash _ha(hashexp: 20);
_ha.definekey('key');
_ha.definedone();
if 0 then set full;
declare hash from_to(dataset:'full(where=(from is not missing and to is not missing))',hashexp:20,multidata:'y');
from_to.definekey('from');
from_to.definedata('to');
from_to.definedone();
if 0 then set node;
declare hash no(dataset:'node');
declare hiter hi_no('no');
no.definekey('node');
no.definedata('node');
no.definedone();
do while(hi_no.next()=0);
household+1; output;
count=1;
key=node;_ha.add();
last=node;ha.add();
rc=hi.first();
do while(rc=0);
from=last;rx=from_to.find();
do while(rx=0);
key=to;ry=_ha.check();
if ry ne 0 then do;
node=to;output;rr=no.remove(key:node);
key=to;_ha.add();
count+1;
last=to;ha.add();
end;
rx=from_to.find_next();
end;
rc=hi.next();
end;
ha.clear();_ha.clear();
end;
stop;
run;
Thank you so much!!
That's a great solution. I have been trying to understand it but I am struggling with some aspects.
I know what it does but don't understand why.
node=from; h.replace(); from=to; to=node; output; node=from; h.replace(); if last then h.output(dataset:'node');
Then the making of the want is a mystery. What is going on there? It definitely worked, even with more my more complex and much larger dataset.
This code create full path which is used to search a TREE. For example:
From To
A B
would exchange the position and get :
From To
A B
B A
And the code generated WANT table is a long story to explain.
I have no time to explain the details to you.
If you are familiar with Hash Table, I think you could understand the mystery as long as you read it with a lot of time .
Good Luck.
April 27 – 30 | Gaylord Texan | Grapevine, Texas
Walk in ready to learn. Walk out ready to deliver. This is the data and AI conference you can't afford to miss.
Register now and lock in 2025 pricing—just $495!
SAS' Charu Shankar shares her PROC SQL expertise by showing you how to master the WHERE clause using real winter weather data.
Find more tutorials on the SAS Users YouTube channel.