08-09-2012 12:43 PM
I am trying to find non-exact matches within the same dataset. For example, Susan Smith and Sue Smith, would be a match, etc. I have a unique ID variable for every observation, and I want to keep that and create a list of potential matches. Does anyone have any ideas?
08-09-2012 09:42 PM
If you want a Rolls-Royce solution, there are some good commercial packages out there. But you can get a long way with a little bit of SQL and some knowledge of SOUNDEX and edit distance functions (COMPLEV, COMPGED).
SOUNDEX converts a character string to an expression that gives a rough idea of what it sounds like: vowels are omitted and similar-sounding consonants are lumped together. COMPLEV tells you how many single-character edits it takes to convert one string into another.
For instance, SOUNDEX("John") = J5. SOUNDEX("Johann") also equals J5, and SOUNDEX("Susan") = S25. Using COMPLEV to compare the SOUNDEX values tells us that "John" and "Johann" are very similar (score 0), but "John" and "Susan" are less similar (score 2).
COMPGED is a more sophisticated version of COMPLEV that accounts for the sorts of errors that are most commonly made: e.g. "Simon"->"Simmon" is more likely than "Simon"->"Simkon".
You can use these functions (and others of your choice) to generate a score for possible matches. Each pair of observations ends up with a score, and you use a cutoff to determine which ones should be considered as possible matches. Here's an example:
input firstname $ lastname $;
firstname=upcase(firstname); /* COMPGED is case-sensitive */
create table want as select
a.id as id1, b.id as id2,
a.firstname as firstname1, b.firstname as firstname2,
a.lastname as lastname1, b.lastname as lastname2,
a.firstnamesound as firstnamesound1, b.firstnamesound as firstnamesound2,
/* The next few variables don't need to be here - they are recalculated separately in the join condition below.
I've included them here so you can see what these intermediate functions look like before they're combined to
generate the overall match score. */
complev(a.firstnamesound,b.firstnamesound) as firstnamesoundscore,
complev(a.lastnamesound,b.lastnamesound) as lastnamesoundscore,
compged(a.firstname,b.firstname) as firstnameeditscore,
compged(a.lastname,b.lastname) as lastnameeditscore,
&soundweight*(calculated firstnamesoundscore + calculated lastnamesoundscore)
+&gedisweight*(calculated firstnameeditscore + calculated lastnameeditscore) as matchscore
from have as a inner join have as b on (a.id < b.id /* prevents duplicates and self-matches */ AND
You may want to play around with the weights and the scoring function, especially if you have other data that could be used to enhance the match. Raising the cutoff will increase the likelihood of accepting a match, so you'll get more false positives but fewer false negatives. Lowering it has the reverse effect.
08-10-2012 06:33 AM
Check if your site has at least part of the DataFlux modules licensed (eg. part of it comes with DI Studio). If so then you could use the dqmatch() function within SAS which let's you create match codes for fuzzy matching.
If you have even more DataFlux licensed then it might be worth to also standardize your data - eg.in a way that ' Susan Smith', 'Sue Smith', 'Smith, Susan', Susan G. Smith' all are recognised as the same person and you tranform them all to "Susan Smith".
09-11-2012 04:10 PM
I found this NESUG 2007 paper really helpful; the COMPGED function used here compares every observation of a string variable to all the others, as in Geoffrey's solution, above:
The examples worked perfectly for me (I used the data step as I'm less familiar with PROC SQL). Good luck!