Re: Fuzzy Matching of a single dataset

epiala · Posted 04-20-2015 11:08 AM

Hello all,

I am attempting to "fuzzy match" first names and birthdays of individual patients with similar last names in a single dataset with over 14,000 records to determine if each record represents an unique patient. I am aware that the spedis function is most likely my best bet but I have not seen a syntax example of how to do this with a single dataset and with a required primary conditional assumption. In essence, I would like to create a program that would first find all records with the same last name and then determine which of these individuals have similar first names or birthdays that most likely constitute a misspelling or error in data coding.

I would appreciate any and all suggestions or ideas and thank you in advance for your assistance.

Sincerely,

Liz

Reeza · Posted 04-20-2015 11:45 AM

It doesn't matter if you're working with a single data set or multiple, you can treat your single data set as two by merging with itself either via a sql join or a data step merge.

ballardw · Posted 04-20-2015 11:57 AM

You should probably provide a few example records, fictional are quite all right and the expected final outcome.

Good luck with the date entry part though.

I tend to use the COMPGED function as it worked better for me.

A reflexive join might be a start:

Proc sql;

create table examine as

select a.*, b.firstname as bfname, b.dob as bdob,

compged(cats(a.firstname,a.lastname), cats(b.firstname,b.lastname),"IL" as LexicalDif,

(a.dob - b.dob) as datedif

from dataset as a inner join dataset as b on upcase(a.lastname)=b.lastname

order by a.lastname,a.firstname,a.dob, LexicalDif, datedif;

quit;

Lexicaldif=0 and datedif=0 is strong evidence of same record. NOTE: you better get at least one for every record since it will duplicate. It might help if there is a Record identifier in the data set. If so add to the join criteria: AND a.recordid ne b.recordid

epiala · Posted 04-20-2015 03:12 PM

Thanks to you both.

ballardw, there is no unique ID in the dataset unfortunately. Here is a fictional account of my data and what I am looking for more specifically. My table has 4 variables overall as shown below and currently looks like (A). There seems to be 5 patients admitted to hospital J with the last name of Burns in 2012 but on careful inspection obs 1 and 4 are likely to be the same person but the spelling of the first name differs. As my dataset has over 14,000 records, I would need syntax that would mark many of these records for me as I cannot visually inspect the entire dataset in the time permitted. When all is said and done, I would like to have a dataset like that of (B).

A:

obs Last Name First Name Date of Birth Number_of_Admissions

1 Burns Daisy 5/3/2008 16

2 Burns Dana 1/1/1978 2

3 Burns Daniel 9/8/1998 8

4 Burns Daysi 5/3/2008 4

5 Burns Dwayne 5/3/2008 20

B:

obs Last Name First Name Date of Birth Number_of_Admissions

1 Burns Daisy 5/3/2008 20

2 Burns Dana 1/1/1978 2

3 Burns Daniel 9/8/1998 8

4 Burns Dwayne 5/3/2008 20

Fuzzy Matching of a single dataset