topic Re: Finding non-exact matches in one dataset in SAS Procedures

Finding non-exact matches in one dataset

Suzanne099 — Thu, 09 Aug 2012 16:43:54 GMT

Hello Everyone,

I am trying to find non-exact matches within the same dataset. For example, Susan Smith and Sue Smith, would be a match, etc. I have a unique ID variable for every observation, and I want to keep that and create a list of potential matches. Does anyone have any ideas?

Thank you.

Re: Finding non-exact matches in one dataset

Reeza — Thu, 09 Aug 2012 16:54:04 GMT

You can try the suggestions on this list:

Re: Finding non-exact matches in one dataset

art297 — Thu, 09 Aug 2012 16:54:30 GMT

Take a look at: http://www.sconsig.com/sastips/tip00000.htm

Re: Finding non-exact matches in one dataset

GeoffreyBrent — Fri, 10 Aug 2012 01:42:57 GMT

If you want a Rolls-Royce solution, there are some good commercial packages out there. But you can get a long way with a little bit of SQL and some knowledge of SOUNDEX and edit distance functions (COMPLEV, COMPGED).

SOUNDEX converts a character string to an expression that gives a rough idea of what it sounds like: vowels are omitted and similar-sounding consonants are lumped together. COMPLEV tells you how many single-character edits it takes to convert one string into another.

For instance, SOUNDEX("John") = J5. SOUNDEX("Johann") also equals J5, and SOUNDEX("Susan") = S25. Using COMPLEV to compare the SOUNDEX values tells us that "John" and "Johann" are very similar (score 0), but "John" and "Susan" are less similar (score 2).

COMPGED is a more sophisticated version of COMPLEV that accounts for the sorts of errors that are most commonly made: e.g. "Simon"->"Simmon" is more likely than "Simon"->"Simkon".

You can use these functions (and others of your choice) to generate a score for possible matches. Each pair of observations ends up with a score, and you use a cutoff to determine which ones should be considered as possible matches. Here's an example:

%let soundweight=20;

%let gedisweight=0.1;

%let cutoff=75;

data have;

id=_n_;

input firstname $ lastname $;

firstname=upcase(firstname); /* COMPGED is case-sensitive */

lastname=upcase(lastname);

firstnamesound=soundex(firstname);

lastnamesound=soundex(lastname);

cards;

Susan Smith

John Smith

Sue Smith

Johann Schmidt

Sue Jones

Sam Snell

Joe Johnson

Jae Johnston

;

run;

proc sql;

create table want as select

a.id as id1, b.id as id2,

a.firstname as firstname1, b.firstname as firstname2,

a.lastname as lastname1, b.lastname as lastname2,

a.firstnamesound as firstnamesound1, b.firstnamesound as firstnamesound2,

/* The next few variables don't need to be here - they are recalculated separately in the join condition below.

I've included them here so you can see what these intermediate functions look like before they're combined to

generate the overall match score. */

complev(a.firstnamesound,b.firstnamesound) as firstnamesoundscore,

complev(a.lastnamesound,b.lastnamesound) as lastnamesoundscore,

compged(a.firstname,b.firstname) as firstnameeditscore,

compged(a.lastname,b.lastname) as lastnameeditscore,

&soundweight*(calculated firstnamesoundscore + calculated lastnamesoundscore)

+&gedisweight*(calculated firstnameeditscore + calculated lastnameeditscore) as matchscore

from have as a inner join have as b on (a.id < b.id /* prevents duplicates and self-matches */ AND

(

&soundweight*(complev(a.firstnamesound,b.firstnamesound)+complev(a.lastnamesound,b.lastnamesound))

+&gedisweight*(compged(a.firstname,b.firstname)+compged(a.lastname,b.lastname))

)

<&cutoff);

quit;

You may want to play around with the weights and the scoring function, especially if you have other data that could be used to enhance the match. Raising the cutoff will increase the likelihood of accepting a match, so you'll get more false positives but fewer false negatives. Lowering it has the reverse effect.

Re: Finding non-exact matches in one dataset

Patrick — Fri, 10 Aug 2012 10:33:21 GMT

Check if your site has at least part of the DataFlux modules licensed (eg. part of it comes with DI Studio). If so then you could use the dqmatch() function within SAS which let's you create match codes for fuzzy matching.

If you have even more DataFlux licensed then it might be worth to also standardize your data - eg.in a way that ' Susan Smith', 'Sue Smith', 'Smith, Susan', Susan G. Smith' all are recognised as the same person and you tranform them all to "Susan Smith".

Re: Finding non-exact matches in one dataset

DaveHaans — Tue, 11 Sep 2012 20:10:20 GMT

I found this NESUG 2007 paper really helpful; the COMPGED function used here compares every observation of a string variable to all the others, as in Geoffrey's solution, above:

http://www.nesug.org/proceedings/nesug07/ap/ap23.pdf

The examples worked perfectly for me (I used the data step as I'm less familiar with PROC SQL). Good luck!