- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
Hello,
Does anyone have any fancy code for performing a fuzzy match by name AND date of birth? I want to perform a one-to-one match of two datasets by name and DOB, and account for errors like "John Doe" and "John Snoe" and DOBs that are only slightly off like 3/4/1987 compared to 3/5/1987 or 3/5/1986. I also have e-mails so I could potentially do a fuzzy match on that as well...
I have tried using "compged" but it is not so great for DOB (I used a cutoff of gedscore=40 for DOB) - see code below:
**Merge and keep fuzzy matches**; %let maxscore=201; %let maxscore2=40; proc sql; create table matched_appts3 as select conf.* , sched.*, compged(conf.fname,sched.fname_sch,&maxscore,'iL' ) as gedscore, compged(conf.conf_dob, sched.sch_dob,&maxscore2, 'iL') as gedscore2 from confirm2 as conf, scheduled2 as sched where calculated gedscore < &maxscore and calculated gedscore2 < &maxscore2 order by calculated gedscore, calculated gedscore2; quit;
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
For stuff like this I often use LINKPLUS from https://www.cdc.gov/cancer/npcr/tools/registryplus/lp.htm
This is free and fairly easy to use though will want you to make text files.
The software will give you probability of matching. So you can set thresholds. The more fields you have , such as maybe location will get better matches.
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
Thanks! Unfortunately I am working with HIPAA-protected data so I won't be able to upload it to an outside software.
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
Maybe there's a misunderstanding, for LinkPlus you would need to download the software, install it and run it on your system. Your data will not leave your networks.
LinkKing allows you to download the SAS code, though I vaguely recall requiring administration privileges as well to get it working. HIPAA doesn't apply here, your companies rules for installing software are your company limits but this doesn't involve HIPAA.
If you're doing it just with Base SAS, which can be done, it's definitely cumbersome and tedious though.
Here's a starter reference if you're going to do it brute force.
https://communities.sas.com/t5/SAS-Procedures/Name-matching/td-p/82780
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
@claremc wrote:
Thanks! Unfortunately I am working with HIPAA-protected data so I won't be able to upload it to an outside software.
The reason I suggested LinkPlus, in addition to @Reeza's excellent summary, is that I have done such matching by code for smallish sets of data, only a couple hundred records. It takes several manual steps such as
1) identify exact matches
2) Move the exact matches (from both candidate sets) to a "matched" data set
3) use a fuzzy match one variable like name while requiring exact match on multiple other variables such as DOB, Gender, Location
4) Manually decide which of the possible matches are "real"
5) move the matches to the "matched" set
6) fuzzy match using fewer exact match variables
7) move the matches to the matched se
(repeat 6 and 7 with fewer variables as required matches)
until
😎 fuzzy match on just name (or whatever)
9) if any matches move them to the "matched" set.
Decide what to do with the left overs.
In my case my report is only for the matches as the incidence of matching is expected to be 5% or less.
But when I had a contractor change data systems with entirely new "unique" identifiers for 1000's of people where I had to link to data from the old system (and they could not do that) I turned to Link plus. And identified people with changed DOB, Gender, name spellings, race, and/or ethnicity. Which we sent back for confirmation and many changes to the database. Which would have taken me days if not weeks instead of the couple of hours involved with creating the input files for LinkPlus and setting up the options.
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
How many people are involved here? The more you have the more rigorous you need to be with your matching methodologies.
Also you should apply a range a different techniques in order of accuracy:
1. Exact name and Exact DOB
2. Fuzzy name and exact DOB
3. Fuzzy name and fuzzy DOB
Then afterwards choose the most accurate match.
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
You can also Google LinkKing, a software package to do fuzzy linkages.
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
SAS Data Quality also does this too but it would only be available if your company has it licensed.