Hello,
Does anyone have any fancy code for performing a fuzzy match by name AND date of birth? I want to perform a one-to-one match of two datasets by name and DOB, and account for errors like "John Doe" and "John Snoe" and DOBs that are only slightly off like 3/4/1987 compared to 3/5/1987 or 3/5/1986. I also have e-mails so I could potentially do a fuzzy match on that as well...
I have tried using "compged" but it is not so great for DOB (I used a cutoff of gedscore=40 for DOB) - see code below:
**Merge and keep fuzzy matches**; %let maxscore=201; %let maxscore2=40; proc sql; create table matched_appts3 as select conf.* , sched.*, compged(conf.fname,sched.fname_sch,&maxscore,'iL' ) as gedscore, compged(conf.conf_dob, sched.sch_dob,&maxscore2, 'iL') as gedscore2 from confirm2 as conf, scheduled2 as sched where calculated gedscore < &maxscore and calculated gedscore2 < &maxscore2 order by calculated gedscore, calculated gedscore2; quit;
For stuff like this I often use LINKPLUS from https://www.cdc.gov/cancer/npcr/tools/registryplus/lp.htm
This is free and fairly easy to use though will want you to make text files.
The software will give you probability of matching. So you can set thresholds. The more fields you have , such as maybe location will get better matches.
Thanks! Unfortunately I am working with HIPAA-protected data so I won't be able to upload it to an outside software.
@claremc wrote:
Thanks! Unfortunately I am working with HIPAA-protected data so I won't be able to upload it to an outside software.
The reason I suggested LinkPlus, in addition to @Reeza's excellent summary, is that I have done such matching by code for smallish sets of data, only a couple hundred records. It takes several manual steps such as
1) identify exact matches
2) Move the exact matches (from both candidate sets) to a "matched" data set
3) use a fuzzy match one variable like name while requiring exact match on multiple other variables such as DOB, Gender, Location
4) Manually decide which of the possible matches are "real"
5) move the matches to the "matched" set
6) fuzzy match using fewer exact match variables
7) move the matches to the matched se
(repeat 6 and 7 with fewer variables as required matches)
until
😎 fuzzy match on just name (or whatever)
9) if any matches move them to the "matched" set.
Decide what to do with the left overs.
In my case my report is only for the matches as the incidence of matching is expected to be 5% or less.
But when I had a contractor change data systems with entirely new "unique" identifiers for 1000's of people where I had to link to data from the old system (and they could not do that) I turned to Link plus. And identified people with changed DOB, Gender, name spellings, race, and/or ethnicity. Which we sent back for confirmation and many changes to the database. Which would have taken me days if not weeks instead of the couple of hours involved with creating the input files for LinkPlus and setting up the options.
How many people are involved here? The more you have the more rigorous you need to be with your matching methodologies.
Also you should apply a range a different techniques in order of accuracy:
1. Exact name and Exact DOB
2. Fuzzy name and exact DOB
3. Fuzzy name and fuzzy DOB
Then afterwards choose the most accurate match.
You can also Google LinkKing, a software package to do fuzzy linkages.
SAS Data Quality also does this too but it would only be available if your company has it licensed.
Are you ready for the spotlight? We're accepting content ideas for SAS Innovate 2025 to be held May 6-9 in Orlando, FL. The call is open until September 25. Read more here about why you should contribute and what is in it for you!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.