I have an external file which contains mailing addresses (City, State, zip, etc ...) and I need a way to join that data with data in a table which contains the same information. Addresses are often misspelled, and this can happen with any aspect be it the state, zip, city, street, etc... I need a way to find the record in the database which most "likely" is the record represented in the datafile. A simple join or Proc Compare would only give exact matches or the mismatches, and that would be too limiting. Is there a process which would find the exact matches and the closest match based on matching variables (e.g. External.Zip - Internal.Zip, External.City_name - Internal.CIty, etc...). Also, any process which allows me to prioritize the variables and provides output on the level of the match (90%, 95%, ...) would be ideal.
You could concatenate them together:
where soundex(catx(",",a.city,a.state,a.street))=soundex(catx(",",b.city,b.state,b.street))
Or just and them together (which is probably better):
where soundex(a.city)=soundex(b.city) and soundex(a.state)=... and ...
First guess would be soundex() function. Have a look at this paper, has an example of matching cities where there is misspellings:
Is there a way to do this with multiple variables at once (e.g. City, State, and street)?
You could concatenate them together:
where soundex(catx(",",a.city,a.state,a.street))=soundex(catx(",",b.city,b.state,b.street))
Or just and them together (which is probably better):
where soundex(a.city)=soundex(b.city) and soundex(a.state)=... and ...
That second option may work...I'm just hoping the AND doesn't leave out too many records which have a missing state or city value in one data set. Ideally, if city is missing in one, then match on zip. If zip is missing then match on street, etc... Looks like it will have to be several SQL processes with unions, using one field at a time as the join. Thanks.
Just layer your where clause:
...
where (a.city is null and a.state is null and soundex(a.street)=soundex(b.street) ) or
(a.city is null and soundex(a.state)=soundex(b.state) and soundex(a.street)=soundex(b.street)) or
(soundex(a.city)=soundex(b.city) and soundex(...)
Hi. If you are planning to use the SOUNDEX+SQL approach, the SOUNDS LIKE operator (=*) does a SOUNDEX comparison. Also, when comparing the content of multiple variables, you might consider adding a score. For example, if you find that all three variables (street, city, state) match using SOUNDEX, the matched obervations would receive a higher score that if fewer than all three matched. By the way, there are no rules in the SOUNDEX routine to for numbers, so you cannot compare zip codes.
data test;
input (name1 name2) (:$10.) @@;
datalines;
SMITH SMITH SMITHE SMYTHE SMITH ZMITH 12203 12203 13345 12209
;
proc sql;
title "SOUNDEX FUNCTION";
select * from test where soundex(name1) = soundex(name2);
title "SOUNDS LIKE OPERATOR";
select * from test where name1 =* name2;
title "SOUNDS LIKE OPERATOR + ADD A NUMERIC SCORE";
select *, name1 =* name2 as score from test;
quit;
SOUNDEX FUNCTION
name1 name2
SMITH SMITH
SMITHE SMYTHE
12203 12203
13345 12209
SOUNDS LIKE OPERATOR
name1 name2
SMITH SMITH
SMITHE SMYTHE
12203 12203
13345 12209
SOUNDS LIKE OPERATOR + ADD A NUMERIC SCORE
name1 name2 score
SMITH SMITH 1
SMITHE SMYTHE 1
SMITH ZMITH 0
12203 12203 1
13345 12209 1
However, I think that you are taking a really easy way out by just using SOUNDEX to match records. There are a lot of SAS character functions (and CALL ROUTINES) that allow you to do string comparisons. A few are SPEDIS, COMPGED, and COMPLEV. Each of those provides you with a score that can be used to evaluatae just how close two strings. Modifying the online example in SAS help for rhe SPEDIS function ...
data words;
input (operation query keyword) ($);
spedis = spedis(query, keyword);
op_sndx = soundex(query);
ky_sndx = soundex(keyword);
compged = compged(query,keyword);
complev = complev(query,keyword);
datalines;
match fuzzy fuzzy
singlet fuzy fuzzy
doublet fuuzzy fuzzy
swap fzuzy fuzzy
truncate fuzz fuzzy
append fuzzys fuzzy
delete fzzy fuzzy
insert fluzzy fuzzy
replace fizzy fuzzy
firstdel uzzy fuzzy
firstins pfuzzy fuzzy
firstrep wuzzy fuzzy
several floozy fuzzy
;
SPEDIS, SOUNDEX, COMPGED, COMPLEV
operation query keyword spedis op_sndx ky_sndx compged complev
match fuzzy fuzzy 0 F2 F2 0 0
singlet fuzy fuzzy 6 F2 F2 20 1
doublet fuuzzy fuzzy 8 F2 F2 20 1
swap fzuzy fuzzy 10 F22 F2 20 2
truncate fuzz fuzzy 12 F2 F2 10 1
append fuzzys fuzzy 5 F22 F2 50 1
delete fzzy fuzzy 12 F2 F2 100 1
insert fluzzy fuzzy 16 F42 F2 100 1
replace fizzy fuzzy 20 F2 F2 100 1
firstdel uzzy fuzzy 25 U2 F2 200 1
firstins pfuzzy fuzzy 33 P2 F2 200 1
firstrep wuzzy fuzzy 40 W2 F2 200 1
several floozy fuzzy 50 F42 F2 300 3
Last, there's a wealth of material in SAS papers on matching character strings. It all comes down to how much work you'd like to invest in the process. Like I said, SOUNDEX seems (at least to me) like you are using something that is easy and fast but not necessarily the best approach. Try some reading ...
Fuzzy Matching
http://www.sascommunity.org/wiki/Fuzzy_Matching
Compged makes matches easy to see!
http://www.geocities.ws/nyasug2002/COMPGED.pdf
The Fuzzy Feeling SASâ Provides: Electronic Matching of Records without Common Keys (great paper, pre-SPEDIS/COMPGED/COMPLEV
http://ftp.sas.com/techsup/download/observations/obswww15/obswww15.pdf
Do a Google search on "sas fuzzy matching".
Do a Google search on "sas address matching" ... NOTE: one thing that you'll notice in a lot of address matching papers is the pre-processing of data sets to STANDARDIZE addresses according to some set of rules. A long time ago (it's at least 25 years now), I wrote a macro to standardize street types (AVE, AVENUE, STREET, STRT, ST, etc.) prior to matching ... still "out there" ...
http://www.albany.edu/~msz03/nesug/combined.zip
The above zip file also contains a macro for a SOUNDEX alternative named NYSIIS (developed by the New York State Identification and Intelligence System) ... SOUNDEX is not good for names with lots of vowels since the first rule in the SOUNDEX routine is ...
Retain the first letter in the argument and discard the following letters: A E H I O U W Y
NYSIIS rules are on Wikipedia (what isn't) ... https://en.wikipedia.org/wiki/New_York_State_Identification_and_Intelligence_System
My reply is not intended to be a comprehensive overview of fuzzy matching. There are lots of SAS samples on the internet.
Soundex was designed by genealogists to deal with differences in spelling of names. It doesn't handle numbers well (for example, 1 is not interpreted as ONE, both 112 and 1122 are simply treated as numbers so the soundex function will miss the similarity entirely). It doesn't expand abbreviations (AVE is not expanded to AVENUE, FT ST JOHN is not interpreted as FORT SAINT JOHN), or vice versa.
So, soundex will work reasonably well on some variables (e.g. City or State), but less well on variables such as Street. However, some data cleansing may be required prior to attempting a match (e.g. ensuring State names are all abbreviated or vice versa).
There are other SAS Functions that can help with fuzzy matching, and that may be superior to soundex depending on the situation. (Example: http://blogs.sas.com/content/sgf/2015/01/27/how-to-perform-a-fuzzy-match-using-sas-functions/)
Thanks for the additional help. I've successfully joined the two data sets using the =* function, and created a score for the match on city name using the COMPLEV function:
data COMPARE; set ADDRESS_ID;
compscore = max(1-(complev(city1, city2)/length(city1)), 0);
format compscore percent7.;
run;
I will use matches >=90% (the =* joined some cases which were not valid such as SAN FRANCISCO and SAN BERNARDINO, so a score of sorts is necessary).
After I ran my union query, I am left with duplicate rows (should be unique by ID and Code variables), after I did two comparisons: STATE CODE-CITY-ZIP, and STATE CODE - CITY (=*). The varaible I created to show the type of match (Match_type) has the values: CITY AND ZIP, CITY, NO MATCH. How do I remove the duplicate rows on ID AND Code, so that I don't have duplicate rows with different values (e.g. CITY AND ZIP, CITY)? If two records exist with ID = 50 and Code = FGHJ67, one of which has CITY AND ZIP and one which has CITY for the variable MATCH_TYPE, I want to retain the row which has CITY AND ZIP and remove the other.
Hi. I'm sure that there are lot of solutions, but I'd assign a numeric value to CITY AND ZIP (lvl = 2) and a lower value to CITY (lvl =1). Then you could use PROC SQL to keep only those observations with the highest vale of LVL ...
data test;
input id :$2. code :$6. lvl;
datalines;
50 FGHJ67 1
30 ABCD99 2
50 FGHJ67 2
30 ABCD99 1
44 ZXCV88 1
99 VCXZ66 2
;
title "MAX LVL WITHIN GROUPS";
proc sql;
select * from test
group by id, code
having lvl = max(lvl);
quit;
MAX LVL WITHIN GROUPS
id code lvl
30 ABCD99 2
44 ZXCV88 1
50 FGHJ67 2
99 VCXZ66 2
Don't miss out on SAS Innovate - Register now for the FREE Livestream!
Can't make it to Vegas? No problem! Watch our general sessions LIVE or on-demand starting April 17th. Hear from SAS execs, best-selling author Adam Grant, Hot Ones host Sean Evans, top tech journalist Kara Swisher, AI expert Cassie Kozyrkov, and the mind-blowing dance crew iLuminate! Plus, get access to over 20 breakout sessions.
ANOVA, or Analysis Of Variance, is used to compare the averages or means of two or more populations to better understand how they differ. Watch this tutorial for more.
Find more tutorials on the SAS Users YouTube channel.