I have some data with first name, last name, and address. I would like to flag all the observations that have the same first and last name, and also ~similar~ address. i.e. (p.s. I already used upcase and compress to fix case and formatting, just need to worry about the actual string characters)
First Last Address flag
BOB SMITH 100 SAS RD 1
BOB SMITH 100 N SAS RD 1
BOB JONES 1000 R RD 0
I know there are a lot of functions for fuzzy matching but that seems to only be between 2 datasets, not within the same column. Any advice on how to tackle this?
Inspired by @novinosrin , here is a solution using complev
data have;
input First $ Last $ Address &:$15.;
cards;
BOB SMITH 100 SAS RD
BOB SMITH 100 N SAS RD
BOB JONES 1000 R RD
;
/* Sort so that the longest address is kept as the *true* address */
proc sql;
create table temp as
select *
from have
order by last, first, length(address) desc, address;
quit;
/* Use the Levenshtein distance, with a cutoff, to detect *real*
changes in address spellings */
data want ;
set temp; by last first;
if first.first then do;
addr = address;
id + 1;
end;
else if complev(addr, address, 5) >= 5 then do;
addr = address;
id + 1;
end;
retain addr;
run;
proc print data=want; run;
Obs. First Last Address addr id 1 BOB JONES 1000 R RD 1000 R RD 1 2 BOB SMITH 100 N SAS RD 100 N SAS RD 2 3 BOB SMITH 100 SAS RD 100 N SAS RD 2
Hi @Melk How about experimenting SPEDIS with a penalty set at 15?
data have;
input First $ Last $ Address & $15.;
cards;
BOB SMITH 100 SAS RD 1
BOB SMITH 100 N SAS RD 1
BOB JONES 1000 R RD 0
;
data want ;
set have;
by first last notsorted;
length _addr $15;
retain flag _addr;
if first.last or spedis(_addr,address)<15 then do;
_addr=address;
Flag=1;
end;
else if spedis(_addr,address)>=15 then Flag=0;
drop _addr;
run;
Inspired by @novinosrin , here is a solution using complev
data have;
input First $ Last $ Address &:$15.;
cards;
BOB SMITH 100 SAS RD
BOB SMITH 100 N SAS RD
BOB JONES 1000 R RD
;
/* Sort so that the longest address is kept as the *true* address */
proc sql;
create table temp as
select *
from have
order by last, first, length(address) desc, address;
quit;
/* Use the Levenshtein distance, with a cutoff, to detect *real*
changes in address spellings */
data want ;
set temp; by last first;
if first.first then do;
addr = address;
id + 1;
end;
else if complev(addr, address, 5) >= 5 then do;
addr = address;
id + 1;
end;
retain addr;
run;
proc print data=want; run;
Obs. First Last Address addr id 1 BOB JONES 1000 R RD 1000 R RD 1 2 BOB SMITH 100 N SAS RD 100 N SAS RD 2 3 BOB SMITH 100 SAS RD 100 N SAS RD 2
April 27 – 30 | Gaylord Texan | Grapevine, Texas
Walk in ready to learn. Walk out ready to deliver. This is the data and AI conference you can't afford to miss.
Register now and lock in 2025 pricing—just $495!
Still thinking about your presentation idea? The submission deadline has been extended to Friday, Nov. 14, at 11:59 p.m. ET.
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.