Solved: ID or Count each cluster /set of fuzzy duplicates

catnipper · Posted 02-11-2019 04:33 PM

I have several duplicates that don't look like duplicates to SAS but I know they are. (They were determined by fuzzy matching.) The only way to know they are associated is by looking at their clustering. A is related to B, and B to E and E to A.

I need to ID each cluster, so I can run code by ClusterID. The next step will be to remove some records from each cluster based on additional requirements.

This is what I have:

data have;
input left $ right $;
cards;
A	B
B	E
C	D
D	C
E	A
F	G 
;
run;
proc print data=have;run;

I want to create any unique ID per row, a count is fine but it doesnt need to be consecutive.

I need the data to look like this:

Want:
      left     right      ClusterID
	A	  B		 1
	B	  E		 1
	C	  D		 2
	D	  C		 2
	E	  A		 1
	F	  G 	         3

Any ideas?

Ksharp · Posted 02-12-2019 07:28 AM

data have(rename=(left=from right=to));
infile cards expandtabs truncover;
input left $ right $;
cards;
A	B
B	E
C	D
D	C
E	A
F	G 
;
run;



data full;
  set have end=last;
  if _n_ eq 1 then do;
   declare hash h();
    h.definekey('node');
     h.definedata('node');
     h.definedone();
  end;
  output;
  node=from; h.replace();
  from=to; to=node;
  output;
  node=from; h.replace();
  if last then h.output(dataset:'node');
  drop node;
run;


data want(keep=node household);
declare hash ha(ordered:'a');
declare hiter hi('ha');
ha.definekey('count');
ha.definedata('last');
ha.definedone();
declare hash _ha(hashexp: 20);
_ha.definekey('key');
_ha.definedone();

if 0 then set full;
declare hash from_to(dataset:'full(where=(from is not missing and to is not missing))',hashexp:20,multidata:'y');
 from_to.definekey('from');
 from_to.definedata('to');
 from_to.definedone();

if 0 then set node;
declare hash no(dataset:'node');
declare hiter hi_no('no');
 no.definekey('node');
 no.definedata('node');
 no.definedone();
 

do while(hi_no.next()=0);
 household+1; output;
 count=1;
 key=node;_ha.add();
 last=node;ha.add();
 rc=hi.first();
 do while(rc=0);
   from=last;rx=from_to.find();
   do while(rx=0);
     key=to;ry=_ha.check();
      if ry ne 0 then do;
       node=to;output;rr=no.remove(key:node);
       key=to;_ha.add();
       count+1;
       last=to;ha.add();
      end;
      rx=from_to.find_next();
   end;
   rc=hi.next();
end;
ha.clear();_ha.clear();
end;
stop;
run;

View solution in original post

Ksharp · Posted 02-12-2019 07:28 AM

data have(rename=(left=from right=to));
infile cards expandtabs truncover;
input left $ right $;
cards;
A	B
B	E
C	D
D	C
E	A
F	G 
;
run;



data full;
  set have end=last;
  if _n_ eq 1 then do;
   declare hash h();
    h.definekey('node');
     h.definedata('node');
     h.definedone();
  end;
  output;
  node=from; h.replace();
  from=to; to=node;
  output;
  node=from; h.replace();
  if last then h.output(dataset:'node');
  drop node;
run;


data want(keep=node household);
declare hash ha(ordered:'a');
declare hiter hi('ha');
ha.definekey('count');
ha.definedata('last');
ha.definedone();
declare hash _ha(hashexp: 20);
_ha.definekey('key');
_ha.definedone();

if 0 then set full;
declare hash from_to(dataset:'full(where=(from is not missing and to is not missing))',hashexp:20,multidata:'y');
 from_to.definekey('from');
 from_to.definedata('to');
 from_to.definedone();

if 0 then set node;
declare hash no(dataset:'node');
declare hiter hi_no('no');
 no.definekey('node');
 no.definedata('node');
 no.definedone();
 

do while(hi_no.next()=0);
 household+1; output;
 count=1;
 key=node;_ha.add();
 last=node;ha.add();
 rc=hi.first();
 do while(rc=0);
   from=last;rx=from_to.find();
   do while(rx=0);
     key=to;ry=_ha.check();
      if ry ne 0 then do;
       node=to;output;rr=no.remove(key:node);
       key=to;_ha.add();
       count+1;
       last=to;ha.add();
      end;
      rx=from_to.find_next();
   end;
   rc=hi.next();
end;
ha.clear();_ha.clear();
end;
stop;
run;

catnipper · Posted 02-14-2019 12:55 PM

Thank you so much!!

That's a great solution. I have been trying to understand it but I am struggling with some aspects.

I know what it does but don't understand why.

node=from; h.replace();
  from=to; to=node;
  output;
  node=from; h.replace();
  if last then h.output(dataset:'node');

Then the making of the want is a mystery. What is going on there? It definitely worked, even with more my more complex and much larger dataset.

Ksharp · Posted 02-15-2019 06:32 AM

This code create full path which is used to search a TREE. For example:

From To

A B

would exchange the position and get :

From To

A B

B A

And the code generated WANT table is a long story to explain.

I have no time to explain the details to you.

If you are familiar with Hash Table, I think you could understand the mystery as long as you read it with a lot of time .

Good Luck.

ID or Count each cluster /set of fuzzy duplicates

Re: ID or Count each cluster /set of fuzzy duplicates

Re: ID or Count each cluster /set of fuzzy duplicates

Re: ID or Count each cluster /set of fuzzy duplicates

Re: ID or Count each cluster /set of fuzzy duplicates

Catch up on SAS Innovate 2026