BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
Melk
Lapis Lazuli | Level 10

I have some data with first name, last name, and address. I would like to flag all the observations that have the same first and last name, and also ~similar~ address. i.e. (p.s. I already used upcase and compress to fix case and formatting, just need to worry about the actual string characters)

 

First    Last       Address                    flag

BOB    SMITH  100 SAS RD            1

BOB    SMITH  100 N SAS RD        1

BOB    JONES  1000 R RD              0

 

I know there are a lot of functions for fuzzy matching but that seems to only be between 2 datasets, not within the same column. Any advice on how to tackle this?

1 ACCEPTED SOLUTION

Accepted Solutions
PGStats
Opal | Level 21

Inspired by @novinosrin , here is a solution using complev

 

data have;
input First $ Last $ Address &:$15.;              
cards;
BOB    SMITH  100 SAS RD     
BOB    SMITH  100 N SAS RD   
BOB    JONES  1000 R RD      
;

/* Sort so that the longest address is kept as the *true* address */
proc sql;
create table temp as 
select * 
from have
order by last, first, length(address) desc, address;
quit;
 
/* Use the Levenshtein distance, with a cutoff, to detect *real*
   changes in address spellings */    
data want ;
set temp; by last first;
if first.first then do;
	addr = address;
	id + 1;
	end;
else if complev(addr, address, 5) >= 5 then do;
	addr = address;
	id + 1;
	end;
retain addr;
run;

proc print data=want; run;
Obs. 	First 	Last 	Address 	addr 	id
1 	BOB 	JONES 	1000 R RD 	1000 R RD 	1
2 	BOB 	SMITH 	100 N SAS RD 	100 N SAS RD 	2
3 	BOB 	SMITH 	100 SAS RD 	100 N SAS RD 	2
PG

View solution in original post

3 REPLIES 3
novinosrin
Tourmaline | Level 20

Hi @Melk  How about experimenting SPEDIS with a penalty set at 15?

 


data have;
input First $   Last  $     Address   & $15.;              
cards;
BOB    SMITH  100 SAS RD            1
BOB    SMITH  100 N SAS RD        1
BOB    JONES  1000 R RD              0
;

data want ;
  set have;
  by first last notsorted;
  length _addr $15;
  retain flag _addr; 
  if  first.last or  spedis(_addr,address)<15 then do;
   _addr=address;
   Flag=1;
  end;
  else 	 if  spedis(_addr,address)>=15 then Flag=0;
  drop _addr;
run;




 
Reeza
Super User
It doesn't matter if it's the same column or not, you can self join or do a self look up.
PGStats
Opal | Level 21

Inspired by @novinosrin , here is a solution using complev

 

data have;
input First $ Last $ Address &:$15.;              
cards;
BOB    SMITH  100 SAS RD     
BOB    SMITH  100 N SAS RD   
BOB    JONES  1000 R RD      
;

/* Sort so that the longest address is kept as the *true* address */
proc sql;
create table temp as 
select * 
from have
order by last, first, length(address) desc, address;
quit;
 
/* Use the Levenshtein distance, with a cutoff, to detect *real*
   changes in address spellings */    
data want ;
set temp; by last first;
if first.first then do;
	addr = address;
	id + 1;
	end;
else if complev(addr, address, 5) >= 5 then do;
	addr = address;
	id + 1;
	end;
retain addr;
run;

proc print data=want; run;
Obs. 	First 	Last 	Address 	addr 	id
1 	BOB 	JONES 	1000 R RD 	1000 R RD 	1
2 	BOB 	SMITH 	100 N SAS RD 	100 N SAS RD 	2
3 	BOB 	SMITH 	100 SAS RD 	100 N SAS RD 	2
PG
How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

SAS Training: Just a Click Away

 Ready to level-up your skills? Choose your own adventure.

Browse our catalog!

Discussion stats
  • 3 replies
  • 1397 views
  • 5 likes
  • 4 in conversation