topic Re: Identifying non-exact duplicates in the same dataset in SAS Programming

Identifying non-exact duplicates in the same dataset

Melk — Wed, 08 Jan 2020 01:29:44 GMT

I have some data with first name, last name, and address. I would like to flag all the observations that have the same first and last name, and also ~similar~ address. i.e. (p.s. I already used upcase and compress to fix case and formatting, just need to worry about the actual string characters)

First Last Address flag

BOB SMITH 100 SAS RD 1

BOB SMITH 100 N SAS RD 1

BOB JONES 1000 R RD 0

I know there are a lot of functions for fuzzy matching but that seems to only be between 2 datasets, not within the same column. Any advice on how to tackle this?

Re: Identifying non-exact duplicates in the same dataset

novinosrin — Wed, 08 Jan 2020 02:10:04 GMT

Hi @Melk How about experimenting SPEDIS with a penalty set at 15?


data have;
input First $   Last  $     Address   & $15.;              
cards;
BOB    SMITH  100 SAS RD            1
BOB    SMITH  100 N SAS RD        1
BOB    JONES  1000 R RD              0
;

data want ;
  set have;
  by first last notsorted;
  length _addr $15;
  retain flag _addr; 
  if  first.last or  spedis(_addr,address)<15 then do;
   _addr=address;
   Flag=1;
  end;
  else 	 if  spedis(_addr,address)>=15 then Flag=0;
  drop _addr;
run;

Re: Identifying non-exact duplicates in the same dataset

Reeza — Wed, 08 Jan 2020 03:38:06 GMT

It doesn't matter if it's the same column or not, you can self join or do a self look up.

Re: Identifying non-exact duplicates in the same dataset

PGStats — Wed, 08 Jan 2020 05:40:59 GMT

Inspired by @novinosrin , here is a solution using complev

data have;
input First $ Last $ Address &:$15.;              
cards;
BOB    SMITH  100 SAS RD     
BOB    SMITH  100 N SAS RD   
BOB    JONES  1000 R RD      
;

/* Sort so that the longest address is kept as the *true* address */
proc sql;
create table temp as 
select * 
from have
order by last, first, length(address) desc, address;
quit;
 
/* Use the Levenshtein distance, with a cutoff, to detect *real*
   changes in address spellings */    
data want ;
set temp; by last first;
if first.first then do;
	addr = address;
	id + 1;
	end;
else if complev(addr, address, 5) >= 5 then do;
	addr = address;
	id + 1;
	end;
retain addr;
run;

proc print data=want; run;

Obs. 	First 	Last 	Address 	addr 	id
1 	BOB 	JONES 	1000 R RD 	1000 R RD 	1
2 	BOB 	SMITH 	100 N SAS RD 	100 N SAS RD 	2
3 	BOB 	SMITH 	100 SAS RD 	100 N SAS RD 	2