Solved: Re: Help with finding "closely" matched records in data sets

WGE914 · Posted 07-30-2015 08:35 AM

I have an external file which contains mailing addresses (City, State, zip, etc ...) and I need a way to join that data with data in a table which contains the same information. Addresses are often misspelled, and this can happen with any aspect be it the state, zip, city, street, etc... I need a way to find the record in the database which most "likely" is the record represented in the datafile. A simple join or Proc Compare would only give exact matches or the mismatches, and that would be too limiting. Is there a process which would find the exact matches and the closest match based on matching variables (e.g. External.Zip - Internal.Zip, External.City_name - Internal.CIty, etc...). Also, any process which allows me to prioritize the variables and provides output on the level of the match (90%, 95%, ...) would be ideal.

RW9 · Posted 07-30-2015 09:08 AM

You could concatenate them together:

where soundex(catx(",",a.city,a.state,a.street))=soundex(catx(",",b.city,b.state,b.street))

Or just and them together (which is probably better):

where soundex(a.city)=soundex(b.city) and soundex(a.state)=... and ...

View solution in original post

RW9 · Posted 07-30-2015 08:49 AM

First guess would be soundex() function. Have a look at this paper, has an example of matching cities where there is misspellings:

http://www2.sas.com/proceedings/sugi29/072-29.pdf

WGE914 · Posted 07-30-2015 08:54 AM

Is there a way to do this with multiple variables at once (e.g. City, State, and street)?

RW9 · Posted 07-30-2015 09:08 AM

You could concatenate them together:

where soundex(catx(",",a.city,a.state,a.street))=soundex(catx(",",b.city,b.state,b.street))

Or just and them together (which is probably better):

where soundex(a.city)=soundex(b.city) and soundex(a.state)=... and ...

WGE914 · Posted 07-30-2015 09:17 AM

That second option may work...I'm just hoping the AND doesn't leave out too many records which have a missing state or city value in one data set. Ideally, if city is missing in one, then match on zip. If zip is missing then match on street, etc... Looks like it will have to be several SQL processes with unions, using one field at a time as the join. Thanks.

RW9 · Posted 07-30-2015 10:09 AM

Just layer your where clause:

...

where (a.city is null and a.state is null and soundex(a.street)=soundex(b.street) ) or

(a.city is null and soundex(a.state)=soundex(b.state) and soundex(a.street)=soundex(b.street)) or

(soundex(a.city)=soundex(b.city) and soundex(...)

MikeZdeb · Posted 07-30-2015 02:11 PM

Hi. If you are planning to use the SOUNDEX+SQL approach, the SOUNDS LIKE operator (=*) does a SOUNDEX comparison. Also, when comparing the content of multiple variables, you might consider adding a score. For example, if you find that all three variables (street, city, state) match using SOUNDEX, the matched obervations would receive a higher score that if fewer than all three matched. By the way, there are no rules in the SOUNDEX routine to for numbers, so you cannot compare zip codes.

data test;

input (name1 name2) (:$10.) @@;

datalines;

SMITH SMITH SMITHE SMYTHE SMITH ZMITH 12203 12203 13345 12209

;

proc sql;

title "SOUNDEX FUNCTION";

select * from test where soundex(name1) = soundex(name2);

title "SOUNDS LIKE OPERATOR";

select * from test where name1 =* name2;

title "SOUNDS LIKE OPERATOR + ADD A NUMERIC SCORE";

select *, name1 =* name2 as score from test;

quit;

SOUNDEX FUNCTION

name1 name2

SMITH SMITH

SMITHE SMYTHE

12203 12203

13345 12209

SOUNDS LIKE OPERATOR

name1 name2

SMITH SMITH

SMITHE SMYTHE

12203 12203

13345 12209

SOUNDS LIKE OPERATOR + ADD A NUMERIC SCORE

name1 name2 score

SMITH SMITH 1

SMITHE SMYTHE 1

SMITH ZMITH 0

12203 12203 1

13345 12209 1

However, I think that you are taking a really easy way out by just using SOUNDEX to match records. There are a lot of SAS character functions (and CALL ROUTINES) that allow you to do string comparisons. A few are SPEDIS, COMPGED, and COMPLEV. Each of those provides you with a score that can be used to evaluatae just how close two strings. Modifying the online example in SAS help for rhe SPEDIS function ...

data words;

input (operation query keyword) ($);

spedis = spedis(query, keyword);

op_sndx = soundex(query);

ky_sndx = soundex(keyword);

compged = compged(query,keyword);

complev = complev(query,keyword);

datalines;

match fuzzy fuzzy

singlet fuzy fuzzy

doublet fuuzzy fuzzy

swap fzuzy fuzzy

truncate fuzz fuzzy

append fuzzys fuzzy

delete fzzy fuzzy

insert fluzzy fuzzy

replace fizzy fuzzy

firstdel uzzy fuzzy

firstins pfuzzy fuzzy

firstrep wuzzy fuzzy

several floozy fuzzy

;

SPEDIS, SOUNDEX, COMPGED, COMPLEV

operation query keyword spedis op_sndx ky_sndx compged complev

match fuzzy fuzzy 0 F2 F2 0 0

singlet fuzy fuzzy 6 F2 F2 20 1

doublet fuuzzy fuzzy 8 F2 F2 20 1

swap fzuzy fuzzy 10 F22 F2 20 2

truncate fuzz fuzzy 12 F2 F2 10 1

append fuzzys fuzzy 5 F22 F2 50 1

delete fzzy fuzzy 12 F2 F2 100 1

insert fluzzy fuzzy 16 F42 F2 100 1

replace fizzy fuzzy 20 F2 F2 100 1

firstdel uzzy fuzzy 25 U2 F2 200 1

firstins pfuzzy fuzzy 33 P2 F2 200 1

firstrep wuzzy fuzzy 40 W2 F2 200 1

several floozy fuzzy 50 F42 F2 300 3

Last, there's a wealth of material in SAS papers on matching character strings. It all comes down to how much work you'd like to invest in the process. Like I said, SOUNDEX seems (at least to me) like you are using something that is easy and fast but not necessarily the best approach. Try some reading ...

Fuzzy Matching

http://www.sascommunity.org/wiki/Fuzzy_Matching

Compged makes matches easy to see!

http://www.geocities.ws/nyasug2002/COMPGED.pdf

The Fuzzy Feeling SASâ Provides: Electronic Matching of Records without Common Keys (great paper, pre-SPEDIS/COMPGED/COMPLEV

http://ftp.sas.com/techsup/download/observations/obswww15/obswww15.pdf

Do a Google search on "sas fuzzy matching".

Do a Google search on "sas address matching" ... NOTE: one thing that you'll notice in a lot of address matching papers is the pre-processing of data sets to STANDARDIZE addresses according to some set of rules. A long time ago (it's at least 25 years now), I wrote a macro to standardize street types (AVE, AVENUE, STREET, STRT, ST, etc.) prior to matching ... still "out there" ...

http://www.albany.edu/~msz03/nesug/combined.zip

The above zip file also contains a macro for a SOUNDEX alternative named NYSIIS (developed by the New York State Identification and Intelligence System) ... SOUNDEX is not good for names with lots of vowels since the first rule in the SOUNDEX routine is ...

Retain the first letter in the argument and discard the following letters: A E H I O U W Y

NYSIIS rules are on Wikipedia (what isn't) ... https://en.wikipedia.org/wiki/New_York_State_Identification_and_Intelligence_System

Fugue · Posted 07-30-2015 03:01 PM

My reply is not intended to be a comprehensive overview of fuzzy matching. There are lots of SAS samples on the internet.

Soundex was designed by genealogists to deal with differences in spelling of names. It doesn't handle numbers well (for example, 1 is not interpreted as ONE, both 112 and 1122 are simply treated as numbers so the soundex function will miss the similarity entirely). It doesn't expand abbreviations (AVE is not expanded to AVENUE, FT ST JOHN is not interpreted as FORT SAINT JOHN), or vice versa.

So, soundex will work reasonably well on some variables (e.g. City or State), but less well on variables such as Street. However, some data cleansing may be required prior to attempting a match (e.g. ensuring State names are all abbreviated or vice versa).

There are other SAS Functions that can help with fuzzy matching, and that may be superior to soundex depending on the situation. (Example: http://blogs.sas.com/content/sgf/2015/01/27/how-to-perform-a-fuzzy-match-using-sas-functions/)

WGE914 · Posted 07-30-2015 03:36 PM

Thanks for the additional help. I've successfully joined the two data sets using the =* function, and created a score for the match on city name using the COMPLEV function:

data COMPARE; set ADDRESS_ID;
compscore =  max(1-(complev(city1, city2)/length(city1)), 0);
format compscore percent7.;
run;

I will use matches >=90% (the =* joined some cases which were not valid such as SAN FRANCISCO and SAN BERNARDINO, so a score of sorts is necessary).

After I ran my union query, I am left with duplicate rows (should be unique by ID and Code variables), after I did two comparisons: STATE CODE-CITY-ZIP, and STATE CODE - CITY (=*). The varaible I created to show the type of match (Match_type) has the values: CITY AND ZIP, CITY, NO MATCH. How do I remove the duplicate rows on ID AND Code, so that I don't have duplicate rows with different values (e.g. CITY AND ZIP, CITY)? If two records exist with ID = 50 and Code = FGHJ67, one of which has CITY AND ZIP and one which has CITY for the variable MATCH_TYPE, I want to retain the row which has CITY AND ZIP and remove the other.

MikeZdeb · Posted 07-30-2015 03:58 PM

Hi. I'm sure that there are lot of solutions, but I'd assign a numeric value to CITY AND ZIP (lvl = 2) and a lower value to CITY (lvl =1). Then you could use PROC SQL to keep only those observations with the highest vale of LVL ...

data test;

input id :$2. code :$6. lvl;

datalines;

50 FGHJ67 1

30 ABCD99 2

50 FGHJ67 2

30 ABCD99 1

44 ZXCV88 1

99 VCXZ66 2

;

title "MAX LVL WITHIN GROUPS";

proc sql;

select * from test

group by id, code

having lvl = max(lvl);

quit;

MAX LVL WITHIN GROUPS

id code lvl

30 ABCD99 2

44 ZXCV88 1

50 FGHJ67 2

99 VCXZ66 2