Programming the statistical procedures from SAS

Help with finding "closely" matched records in data sets

Accepted Solution Solved
Reply
Occasional Contributor
Posts: 12
Accepted Solution

Help with finding "closely" matched records in data sets

I have an external file which contains mailing addresses (City, State, zip, etc ...) and I need a way to join that data with data in a table which contains the same information.   Addresses are often misspelled, and this can happen with any aspect be it the state, zip, city, street, etc...   I need a way to find the record in the database which most "likely" is the record represented in the datafile.  A simple join or Proc Compare would only give exact matches or the mismatches, and that would be too limiting.  Is there a process which would find the exact matches and the closest match based on matching variables (e.g.  External.Zip - Internal.Zip, External.City_name - Internal.CIty, etc...).  Also, any process which allows me to prioritize the variables and provides output on the level of the match (90%, 95%, ...) would be ideal.   


Accepted Solutions
Solution
‎07-30-2015 09:08 AM
Super User
Super User
Posts: 7,565

Re: Help with finding "closely" matched records in data sets

You could concatenate them together:

where soundex(catx(",",a.city,a.state,a.street))=soundex(catx(",",b.city,b.state,b.street))


Or just and them together (which is probably better):

where soundex(a.city)=soundex(b.city) and soundex(a.state)=... and ...

View solution in original post


All Replies
Super User
Super User
Posts: 7,565

Re: Help with finding "closely" matched records in data sets

First guess would be soundex() function.  Have a look at this paper, has an example of matching cities where there is misspellings:

http://www2.sas.com/proceedings/sugi29/072-29.pdf

Occasional Contributor
Posts: 12

Re: Help with finding "closely" matched records in data sets

Is there a way to do this with multiple variables at once (e.g. City, State, and street)?

Solution
‎07-30-2015 09:08 AM
Super User
Super User
Posts: 7,565

Re: Help with finding "closely" matched records in data sets

You could concatenate them together:

where soundex(catx(",",a.city,a.state,a.street))=soundex(catx(",",b.city,b.state,b.street))


Or just and them together (which is probably better):

where soundex(a.city)=soundex(b.city) and soundex(a.state)=... and ...

Occasional Contributor
Posts: 12

Re: Help with finding "closely" matched records in data sets

That second option may work...I'm just hoping the AND doesn't leave out too many records which have a missing state or city value in one data set.  Ideally, if city is missing in one, then match on zip.  If zip is missing  then match on street, etc...  Looks like it will have to be several SQL processes with unions, using one field at a time as the join.  Thanks.

Super User
Super User
Posts: 7,565

Re: Help with finding "closely" matched records in data sets

Just layer your where clause:

...

where    (a.city is null and a.state is null and soundex(a.street)=soundex(b.street) ) or

               (a.city is null and soundex(a.state)=soundex(b.state) and soundex(a.street)=soundex(b.street)) or

               (soundex(a.city)=soundex(b.city) and soundex(...)

Valued Guide
Posts: 765

Re: Help with finding "closely" matched records in data sets

Hi.  If you are planning to use the SOUNDEX+SQL approach, the SOUNDS LIKE operator (=*) does a SOUNDEX comparison.  Also, when comparing the content of multiple variables, you might consider adding a score.  For example, if you find that all three variables (street, city, state) match using SOUNDEX, the matched obervations would receive a higher score that if fewer than all three matched.  By the way, there are no rules in the SOUNDEX routine to for numbers, so you cannot compare zip codes.

data test;

input (name1 name2) (:$10.) @@;

datalines;

SMITH SMITH SMITHE SMYTHE SMITH ZMITH 12203 12203 13345 12209

;

proc sql;

title "SOUNDEX FUNCTION";

select * from test where soundex(name1) = soundex(name2);

title "SOUNDS LIKE OPERATOR";

select * from test where name1 =* name2;

title "SOUNDS LIKE OPERATOR + ADD A NUMERIC SCORE";

select *, name1 =* name2 as score from test;

quit;

SOUNDEX FUNCTION

name1       name2

SMITH       SMITH

SMITHE      SMYTHE

12203       12203

13345       12209

SOUNDS LIKE OPERATOR

name1       name2

SMITH       SMITH

SMITHE      SMYTHE

12203       12203

13345       12209

SOUNDS LIKE OPERATOR + ADD A NUMERIC SCORE

name1       name2          score

SMITH       SMITH              1

SMITHE      SMYTHE             1

SMITH       ZMITH              0

12203       12203              1

13345       12209              1

However, I think that you are taking a really easy way out by just using SOUNDEX to match records.  There are a lot of SAS character functions (and CALL ROUTINES) that allow you to do string comparisons.  A few are SPEDIS, COMPGED, and COMPLEV.  Each of those provides you with a score that can be used to evaluatae just how close two strings.  Modifying the online example in SAS help for rhe SPEDIS function ...

data words;

input (operation query keyword) ($);

spedis  = spedis(query, keyword);

op_sndx = soundex(query);

ky_sndx = soundex(keyword);

compged = compged(query,keyword);

complev = complev(query,keyword);

datalines;

match       fuzzy        fuzzy

singlet     fuzy         fuzzy

doublet     fuuzzy       fuzzy

swap        fzuzy        fuzzy

truncate    fuzz         fuzzy

append      fuzzys       fuzzy

delete      fzzy         fuzzy

insert      fluzzy       fuzzy

replace     fizzy        fuzzy

firstdel    uzzy         fuzzy

firstins    pfuzzy       fuzzy

firstrep    wuzzy        fuzzy

several     floozy       fuzzy

;

SPEDIS, SOUNDEX, COMPGED, COMPLEV

operation    query     keyword    spedis    op_sndx    ky_sndx    compged    complev

match        fuzzy      fuzzy        0        F2         F2           0         0

singlet      fuzy       fuzzy        6        F2         F2          20         1

doublet      fuuzzy     fuzzy        8        F2         F2          20         1

swap         fzuzy      fuzzy       10        F22        F2          20         2

truncate     fuzz       fuzzy       12        F2         F2          10         1

append       fuzzys     fuzzy        5        F22        F2          50         1

delete       fzzy       fuzzy       12        F2         F2         100         1

insert       fluzzy     fuzzy       16        F42        F2         100         1

replace      fizzy      fuzzy       20        F2         F2         100         1

firstdel     uzzy       fuzzy       25        U2         F2         200         1

firstins     pfuzzy     fuzzy       33        P2         F2         200         1

firstrep     wuzzy      fuzzy       40        W2         F2         200         1

several      floozy     fuzzy       50        F42        F2         300         3

Last, there's a wealth of material in SAS papers on matching character strings.  It all comes down to how much work you'd like to invest in the process.  Like I said, SOUNDEX seems (at least to me) like you are using something that is easy and fast but not necessarily the best approach.  Try some reading ...

Fuzzy Matching

http://www.sascommunity.org/wiki/Fuzzy_Matching

Compged makes matches easy to see!

http://www.geocities.ws/nyasug2002/COMPGED.pdf

The Fuzzy Feeling SASâ Provides: Electronic Matching of Records without Common Keys (great paper, pre-SPEDIS/COMPGED/COMPLEV

http://ftp.sas.com/techsup/download/observations/obswww15/obswww15.pdf

Do a Google search on "sas fuzzy matching".

Do a Google search on "sas address matching" ... NOTE:  one thing that you'll notice in a lot of address matching papers is the pre-processing of data sets to STANDARDIZE addresses according to some set of rules.  A long time ago (it's at least 25 years now), I wrote a macro to standardize street types (AVE, AVENUE, STREET, STRT, ST, etc.) prior to matching ... still "out there" ...

http://www.albany.edu/~msz03/nesug/combined.zip

The above zip file also contains a macro for a SOUNDEX alternative named NYSIIS (developed by the New York State Identification and Intelligence System) ... SOUNDEX is not good for names with lots of vowels since the first rule in the SOUNDEX routine is ...

Retain the first letter in the argument and discard the following letters: A E H I O U W Y

NYSIIS rules are on Wikipedia (what isn't) ... https://en.wikipedia.org/wiki/New_York_State_Identification_and_Intelligence_System

Super Contributor
Posts: 307

Re: Help with finding "closely" matched records in data sets

My reply is not intended to be a comprehensive overview of fuzzy matching. There are lots of SAS samples on the internet.

Soundex was designed by genealogists to deal with differences in spelling of names. It doesn't handle numbers well (for example, 1 is not interpreted as ONE, both 112 and 1122 are simply treated as numbers so the soundex function will miss the similarity entirely). It doesn't expand abbreviations (AVE is not expanded to AVENUE, FT ST JOHN is not interpreted as FORT SAINT JOHN), or vice versa.

So, soundex will work reasonably well on some variables (e.g. City or State), but less well on variables such as Street. However, some data cleansing may be required prior to attempting a match (e.g. ensuring State names are all abbreviated or vice versa).

There are other SAS Functions that can help with fuzzy matching, and that may be superior to soundex depending on the situation. (Example: http://blogs.sas.com/content/sgf/2015/01/27/how-to-perform-a-fuzzy-match-using-sas-functions/)

Occasional Contributor
Posts: 12

Re: Help with finding "closely" matched records in data sets

Thanks for the additional help.  I've successfully joined the two data sets using the =* function, and created a score for the match on city name using the COMPLEV function:

data COMPARE; set ADDRESS_ID;

compscore =  max(1-(complev(city1, city2)/length(city1)), 0);

format compscore percent7.;

run;

I will use matches >=90% (the =* joined some cases which were not valid such as SAN FRANCISCO and SAN BERNARDINO, so a score of sorts is necessary).

After I ran my union query, I am left with duplicate rows (should be unique by ID and Code variables), after I did two comparisons: STATE CODE-CITY-ZIP, and STATE CODE - CITY (=*).   The varaible I created to show the type of match (Match_type) has the values:  CITY AND ZIP, CITY, NO MATCH.  How do I remove the duplicate rows on ID AND Code, so that I don't have duplicate rows with different values (e.g. CITY AND ZIP, CITY)?  If two records exist with ID = 50 and Code = FGHJ67, one of which has CITY AND ZIP and one which has CITY for the variable MATCH_TYPE, I want to retain the row which has CITY AND ZIP and remove the other.

Valued Guide
Posts: 765

Re: Help with finding "closely" matched records in data sets

Hi.  I'm sure that there are lot of solutions, but I'd assign a numeric value to CITY AND ZIP (lvl =  2) and a lower value to CITY (lvl =1).  Then you could use PROC SQL to keep only those observations with the highest vale of LVL ...

data test;

input id :$2. code :$6. lvl;

datalines;

50 FGHJ67 1

30 ABCD99 2

50 FGHJ67 2

30 ABCD99 1

44 ZXCV88 1

99 VCXZ66 2

;

title "MAX LVL WITHIN GROUPS";

proc sql;

select * from test

group by id, code

having lvl = max(lvl);

quit;

MAX LVL WITHIN GROUPS

id  code         lvl

30  ABCD99         2

44  ZXCV88         1

50  FGHJ67         2

99  VCXZ66         2

🔒 This topic is solved and locked.

Need further help from the community? Please ask a new question.

Discussion stats
  • 9 replies
  • 753 views
  • 6 likes
  • 4 in conversation