New Contributor
Posts: 4

# Fuzzy Matching of a single dataset

Hello all,

I am attempting to "fuzzy match" first names and birthdays of individual patients with similar last names in a single dataset with over 14,000 records to determine if each record represents an unique patient.  I am aware that the spedis function is most likely my best bet but I have not seen a syntax example of how to do this with a single dataset and with a required primary conditional assumption.  In essence, I would like to create a program that would first find all records with the same last name and then determine which of these individuals have similar first names or birthdays that most likely constitute a misspelling or error in data coding.

I would appreciate any and all suggestions or ideas and thank you in advance for your assistance.

Sincerely,

Liz

Super User
Posts: 23,678

## Re: Fuzzy Matching of a single dataset

It doesn't matter if you're working with a single data set or multiple, you can treat your single data set as two by merging with itself either via a sql join or a data step merge.

Super User
Posts: 13,507

## Re: Fuzzy Matching of a single dataset

You should probably provide a few example records, fictional are quite all right and the expected final outcome.

Good luck with the date entry part though.

I tend to use the COMPGED function as it worked better for me.

A reflexive join might be a start:

Proc sql;

create table examine as

select a.*, b.firstname as bfname, b.dob as bdob,

compged(cats(a.firstname,a.lastname), cats(b.firstname,b.lastname),"IL" as LexicalDif,

(a.dob - b.dob) as datedif

from dataset as a inner join dataset as b on upcase(a.lastname)=b.lastname

order by a.lastname,a.firstname,a.dob, LexicalDif, datedif;

quit;

Lexicaldif=0 and datedif=0 is strong evidence of same record. NOTE: you better get at least one for every record since it will duplicate. It might help if there is a Record identifier in the data set. If so add to the join criteria: AND a.recordid ne b.recordid

New Contributor
Posts: 4

## Re: Fuzzy Matching of a single dataset

Thanks to you both.

ballardw, there is no unique ID in the dataset unfortunately.  Here is a fictional account of my data and what I am looking for more specifically.  My table has 4 variables overall as shown below and currently looks like (A).  There seems to be 5 patients admitted to hospital J with the last name of Burns in 2012 but on careful inspection obs 1 and 4 are likely to be the same person but the spelling of the first name differs.  As my dataset has over 14,000 records, I would need syntax that would mark many of these records for me as I cannot visually inspect the entire dataset in the time permitted.  When all is said and done, I would like to have a dataset like that of (B).

A:

obs     Last Name          First Name              Date of Birth     Number_of_Admissions

1      Burns                    Daisy                         5/3/2008                    16

2      Burns                    Dana                         1/1/1978                      2

3      Burns                    Daniel                        9/8/1998                      8

4      Burns                    Daysi                         5/3/2008                    4

5      Burns                    Dwayne                     5/3/2008                    20

B:

obs     Last Name          First Name              Date of Birth     Number_of_Admissions

1      Burns                    Daisy                         5/3/2008                    20

2      Burns                    Dana                         1/1/1978                      2

3      Burns                    Daniel                        9/8/1998                      8

4      Burns                    Dwayne                     5/3/2008                    20

Discussion stats
• 3 replies
• 296 views
• 0 likes
• 3 in conversation