BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
Melk
Lapis Lazuli | Level 10

I have some data with first name, last name, and address. I would like to flag all the observations that have the same first and last name, and also ~similar~ address. i.e. (p.s. I already used upcase and compress to fix case and formatting, just need to worry about the actual string characters)

 

First    Last       Address                    flag

BOB    SMITH  100 SAS RD            1

BOB    SMITH  100 N SAS RD        1

BOB    JONES  1000 R RD              0

 

I know there are a lot of functions for fuzzy matching but that seems to only be between 2 datasets, not within the same column. Any advice on how to tackle this?

1 ACCEPTED SOLUTION

Accepted Solutions
PGStats
Opal | Level 21

Inspired by @novinosrin , here is a solution using complev

 

data have;
input First $ Last $ Address &:$15.;              
cards;
BOB    SMITH  100 SAS RD     
BOB    SMITH  100 N SAS RD   
BOB    JONES  1000 R RD      
;

/* Sort so that the longest address is kept as the *true* address */
proc sql;
create table temp as 
select * 
from have
order by last, first, length(address) desc, address;
quit;
 
/* Use the Levenshtein distance, with a cutoff, to detect *real*
   changes in address spellings */    
data want ;
set temp; by last first;
if first.first then do;
	addr = address;
	id + 1;
	end;
else if complev(addr, address, 5) >= 5 then do;
	addr = address;
	id + 1;
	end;
retain addr;
run;

proc print data=want; run;
Obs. 	First 	Last 	Address 	addr 	id
1 	BOB 	JONES 	1000 R RD 	1000 R RD 	1
2 	BOB 	SMITH 	100 N SAS RD 	100 N SAS RD 	2
3 	BOB 	SMITH 	100 SAS RD 	100 N SAS RD 	2
PG

View solution in original post

3 REPLIES 3
novinosrin
Tourmaline | Level 20

Hi @Melk  How about experimenting SPEDIS with a penalty set at 15?

 


data have;
input First $   Last  $     Address   & $15.;              
cards;
BOB    SMITH  100 SAS RD            1
BOB    SMITH  100 N SAS RD        1
BOB    JONES  1000 R RD              0
;

data want ;
  set have;
  by first last notsorted;
  length _addr $15;
  retain flag _addr; 
  if  first.last or  spedis(_addr,address)<15 then do;
   _addr=address;
   Flag=1;
  end;
  else 	 if  spedis(_addr,address)>=15 then Flag=0;
  drop _addr;
run;




 
Reeza
Super User
It doesn't matter if it's the same column or not, you can self join or do a self look up.
PGStats
Opal | Level 21

Inspired by @novinosrin , here is a solution using complev

 

data have;
input First $ Last $ Address &:$15.;              
cards;
BOB    SMITH  100 SAS RD     
BOB    SMITH  100 N SAS RD   
BOB    JONES  1000 R RD      
;

/* Sort so that the longest address is kept as the *true* address */
proc sql;
create table temp as 
select * 
from have
order by last, first, length(address) desc, address;
quit;
 
/* Use the Levenshtein distance, with a cutoff, to detect *real*
   changes in address spellings */    
data want ;
set temp; by last first;
if first.first then do;
	addr = address;
	id + 1;
	end;
else if complev(addr, address, 5) >= 5 then do;
	addr = address;
	id + 1;
	end;
retain addr;
run;

proc print data=want; run;
Obs. 	First 	Last 	Address 	addr 	id
1 	BOB 	JONES 	1000 R RD 	1000 R RD 	1
2 	BOB 	SMITH 	100 N SAS RD 	100 N SAS RD 	2
3 	BOB 	SMITH 	100 SAS RD 	100 N SAS RD 	2
PG

SAS Innovate 2025: Call for Content

Are you ready for the spotlight? We're accepting content ideas for SAS Innovate 2025 to be held May 6-9 in Orlando, FL. The call is open until September 25. Read more here about why you should contribute and what is in it for you!

Submit your idea!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 3 replies
  • 790 views
  • 5 likes
  • 4 in conversation