BookmarkSubscribeRSS Feed
claremc
Obsidian | Level 7

Hello, 

 

Does anyone have any fancy code for performing a fuzzy match by name AND date of birth? I want to perform a one-to-one match of two datasets by name and DOB, and account for errors like "John Doe" and "John Snoe" and DOBs that are only slightly off like 3/4/1987 compared to 3/5/1987 or 3/5/1986. I also have e-mails so I could potentially do a fuzzy match on that as well...

 

I have tried using "compged" but it is not so great for DOB (I used a cutoff of gedscore=40 for DOB) - see code below: 

**Merge and keep fuzzy matches**; 
%let maxscore=201; 
%let maxscore2=40;
proc sql;
create table matched_appts3 as
select conf.* , sched.*,
compged(conf.fname,sched.fname_sch,&maxscore,'iL' ) as gedscore, 
compged(conf.conf_dob, sched.sch_dob,&maxscore2, 'iL') as gedscore2
from confirm2 as conf, scheduled2 as sched
where calculated gedscore < &maxscore and 
calculated gedscore2 < &maxscore2 
order by calculated gedscore, calculated gedscore2;
quit;
7 REPLIES 7
ballardw
Super User

For stuff like this I often use LINKPLUS from https://www.cdc.gov/cancer/npcr/tools/registryplus/lp.htm

This is free and fairly easy to use though will want you to make text files.

 

The software will give you probability of matching. So you can set thresholds. The more fields you have , such as maybe location will get better matches.

 

 

claremc
Obsidian | Level 7

Thanks! Unfortunately I am working with HIPAA-protected data so I won't be able to upload it to an outside software. 

Reeza
Super User
CDC data all falls under HIPAA and it's a CDC software application.

Maybe there's a misunderstanding, for LinkPlus you would need to download the software, install it and run it on your system. Your data will not leave your networks.
LinkKing allows you to download the SAS code, though I vaguely recall requiring administration privileges as well to get it working. HIPAA doesn't apply here, your companies rules for installing software are your company limits but this doesn't involve HIPAA.

If you're doing it just with Base SAS, which can be done, it's definitely cumbersome and tedious though.
Here's a starter reference if you're going to do it brute force.
https://communities.sas.com/t5/SAS-Procedures/Name-matching/td-p/82780
ballardw
Super User

@claremc wrote:

Thanks! Unfortunately I am working with HIPAA-protected data so I won't be able to upload it to an outside software. 


The reason I suggested LinkPlus, in addition to @Reeza's excellent summary, is that I have done such matching by code for smallish sets of data, only a couple hundred records. It takes several manual steps such as

1) identify exact matches

2) Move the exact matches (from both candidate sets) to a "matched" data set

3) use a fuzzy match one variable like name while requiring exact match on multiple other variables such as DOB, Gender, Location

4) Manually decide which of the possible matches are "real"

5) move the matches to the "matched" set

6) fuzzy match using fewer exact match variables

7) move the matches to the matched se

(repeat 6 and 7 with fewer variables as required matches)

until

😎 fuzzy match on just name (or whatever)

9) if any matches move them to the "matched" set.

Decide what to do with the left overs.

 

In my case my report is only for the matches as the incidence of matching is expected to be 5% or less.

 

But when I had a contractor change data systems with entirely new "unique" identifiers for 1000's of people where I had to link to data from the old system (and they could not do that) I turned to Link plus. And identified people with changed DOB, Gender, name spellings, race, and/or ethnicity. Which we sent back for confirmation and many changes to the database. Which would have taken me days if not weeks instead of the couple of hours involved with creating the input files for LinkPlus and setting up the options.

SASKiwi
PROC Star

How many people are involved here? The more you have the more rigorous you need to be with your matching methodologies.

 

Also you should apply a range a different techniques in order of accuracy:

1. Exact name and Exact DOB

2. Fuzzy name and exact DOB

3. Fuzzy name and fuzzy DOB

 

Then afterwards choose the most accurate match. 

Reeza
Super User

You can also Google LinkKing, a software package to do fuzzy linkages.

 

SASKiwi
PROC Star

SAS Data Quality also does this too but it would only be available if your company has it licensed.

sas-innovate-2024.png

Available on demand!

Missed SAS Innovate Las Vegas? Watch all the action for free! View the keynotes, general sessions and 22 breakouts on demand.

 

Register now!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 7 replies
  • 1395 views
  • 2 likes
  • 4 in conversation