I have some data with first name, last name, and address. I would like to flag all the observations that have the same first and last name, and also ~similar~ address. i.e. (p.s. I already used upcase and compress to fix case and formatting, just need to worry about the actual string characters)
First Last Address flag
BOB SMITH 100 SAS RD 1
BOB SMITH 100 N SAS RD 1
BOB JONES 1000 R RD 0
I know there are a lot of functions for fuzzy matching but that seems to only be between 2 datasets, not within the same column. Any advice on how to tackle this?
Inspired by @novinosrin , here is a solution using complev
data have;
input First $ Last $ Address &:$15.;
cards;
BOB SMITH 100 SAS RD
BOB SMITH 100 N SAS RD
BOB JONES 1000 R RD
;
/* Sort so that the longest address is kept as the *true* address */
proc sql;
create table temp as
select *
from have
order by last, first, length(address) desc, address;
quit;
/* Use the Levenshtein distance, with a cutoff, to detect *real*
changes in address spellings */
data want ;
set temp; by last first;
if first.first then do;
addr = address;
id + 1;
end;
else if complev(addr, address, 5) >= 5 then do;
addr = address;
id + 1;
end;
retain addr;
run;
proc print data=want; run;
Obs. First Last Address addr id 1 BOB JONES 1000 R RD 1000 R RD 1 2 BOB SMITH 100 N SAS RD 100 N SAS RD 2 3 BOB SMITH 100 SAS RD 100 N SAS RD 2
Hi @Melk How about experimenting SPEDIS with a penalty set at 15?
data have;
input First $ Last $ Address & $15.;
cards;
BOB SMITH 100 SAS RD 1
BOB SMITH 100 N SAS RD 1
BOB JONES 1000 R RD 0
;
data want ;
set have;
by first last notsorted;
length _addr $15;
retain flag _addr;
if first.last or spedis(_addr,address)<15 then do;
_addr=address;
Flag=1;
end;
else if spedis(_addr,address)>=15 then Flag=0;
drop _addr;
run;
Inspired by @novinosrin , here is a solution using complev
data have;
input First $ Last $ Address &:$15.;
cards;
BOB SMITH 100 SAS RD
BOB SMITH 100 N SAS RD
BOB JONES 1000 R RD
;
/* Sort so that the longest address is kept as the *true* address */
proc sql;
create table temp as
select *
from have
order by last, first, length(address) desc, address;
quit;
/* Use the Levenshtein distance, with a cutoff, to detect *real*
changes in address spellings */
data want ;
set temp; by last first;
if first.first then do;
addr = address;
id + 1;
end;
else if complev(addr, address, 5) >= 5 then do;
addr = address;
id + 1;
end;
retain addr;
run;
proc print data=want; run;
Obs. First Last Address addr id 1 BOB JONES 1000 R RD 1000 R RD 1 2 BOB SMITH 100 N SAS RD 100 N SAS RD 2 3 BOB SMITH 100 SAS RD 100 N SAS RD 2
Are you ready for the spotlight? We're accepting content ideas for SAS Innovate 2025 to be held May 6-9 in Orlando, FL. The call is open until September 25. Read more here about why you should contribute and what is in it for you!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.