BookmarkSubscribeRSS Feed
Suzanne099
Calcite | Level 5

Hello Everyone,

I am trying to find non-exact matches within the same dataset.   For example, Susan Smith and Sue Smith, would be a match, etc.    I have a unique ID variable for every observation, and I want to keep that and create a list of potential matches.   Does anyone have any ideas?

Thank you.

5 REPLIES 5
Reeza
Super User

You can try the suggestions on this list:

GeoffreyBrent
Calcite | Level 5

If you want a Rolls-Royce solution, there are some good commercial packages out there. But you can get a long way with a little bit of SQL and some knowledge of SOUNDEX and edit distance functions (COMPLEV, COMPGED).

SOUNDEX converts a character string to an expression that gives a rough idea of what it sounds like: vowels are omitted and similar-sounding consonants are lumped together. COMPLEV tells you how many single-character edits it takes to convert one string into another.

For instance, SOUNDEX("John") = J5. SOUNDEX("Johann") also equals J5, and SOUNDEX("Susan") = S25. Using COMPLEV to compare the SOUNDEX values tells us that "John" and "Johann" are very similar (score 0), but "John" and "Susan" are less similar (score 2).

COMPGED is a more sophisticated version of COMPLEV that accounts for the sorts of errors that are most commonly made: e.g. "Simon"->"Simmon" is more likely than "Simon"->"Simkon".

You can use these functions (and others of your choice) to generate a score for possible matches. Each pair of observations ends up with a score, and you use a cutoff to determine which ones should be considered as possible matches. Here's an example:

%let soundweight=20;

%let gedisweight=0.1;

%let cutoff=75;

data have;

id=_n_;

input firstname $ lastname $;

firstname=upcase(firstname); /* COMPGED is case-sensitive */

lastname=upcase(lastname);

firstnamesound=soundex(firstname);

lastnamesound=soundex(lastname);

cards;

Susan Smith

John Smith

Sue Smith

Johann Schmidt

Sue Jones

Sam Snell

Joe Johnson

Jae Johnston

;

run;

proc sql;

create table want as select

a.id as id1, b.id as id2,

a.firstname as firstname1, b.firstname as firstname2,

a.lastname as lastname1, b.lastname as lastname2,

a.firstnamesound as firstnamesound1, b.firstnamesound as firstnamesound2,

/* The next few variables don't need to be here - they are recalculated separately in the join condition below.

I've included them here so you can see what these intermediate functions look like before they're combined to

generate the overall match score. */

complev(a.firstnamesound,b.firstnamesound) as firstnamesoundscore,

complev(a.lastnamesound,b.lastnamesound) as lastnamesoundscore,

compged(a.firstname,b.firstname) as firstnameeditscore,

compged(a.lastname,b.lastname) as lastnameeditscore,

&soundweight*(calculated firstnamesoundscore + calculated lastnamesoundscore)

+&gedisweight*(calculated firstnameeditscore + calculated lastnameeditscore) as matchscore

from have as a inner join have as b on (a.id < b.id /* prevents duplicates and self-matches */ AND

(

&soundweight*(complev(a.firstnamesound,b.firstnamesound)+complev(a.lastnamesound,b.lastnamesound))

+&gedisweight*(compged(a.firstname,b.firstname)+compged(a.lastname,b.lastname))

)

<&cutoff);

quit;

You may want to play around with the weights and the scoring function, especially if you have other data that could be used to enhance the match. Raising the cutoff will increase the likelihood of accepting a match, so you'll get more false positives but fewer false negatives. Lowering it has the reverse effect.

Patrick
Opal | Level 21

Check if your site has at least part of the DataFlux modules licensed (eg. part of it comes with DI Studio). If so then you could use the dqmatch() function within SAS which let's you create match codes for fuzzy matching.

If you have even more DataFlux licensed then it might be worth to also standardize your data - eg.in a way that ' Susan Smith', 'Sue Smith', 'Smith, Susan', Susan G. Smith' all are recognised as the same person and you tranform them all to "Susan Smith".

DaveHaans
Calcite | Level 5

I found this NESUG 2007 paper really helpful; the COMPGED function used here compares every observation of a string variable to all the others, as in Geoffrey's solution, above:

http://www.nesug.org/proceedings/nesug07/ap/ap23.pdf

The examples worked perfectly for me (I used the data step as I'm less familiar with PROC SQL). Good luck!

sas-innovate-2024.png

Today is the last day to save with the early bird rate! Register today for just $695 - $100 off the standard rate.

 

Plus, pre-conference courses and tutorials are filling up fast and are always a sellout. Register today to reserve your seat.

 

Register now!

What is Bayesian Analysis?

Learn the difference between classical and Bayesian statistical approaches and see a few PROC examples to perform Bayesian analysis in this video.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 5 replies
  • 1905 views
  • 6 likes
  • 6 in conversation