Hello all,
I am attempting to "fuzzy match" first names and birthdays of individual patients with similar last names in a single dataset with over 14,000 records to determine if each record represents an unique patient. I am aware that the spedis function is most likely my best bet but I have not seen a syntax example of how to do this with a single dataset and with a required primary conditional assumption. In essence, I would like to create a program that would first find all records with the same last name and then determine which of these individuals have similar first names or birthdays that most likely constitute a misspelling or error in data coding.
I would appreciate any and all suggestions or ideas and thank you in advance for your assistance.
Sincerely,
Liz
It doesn't matter if you're working with a single data set or multiple, you can treat your single data set as two by merging with itself either via a sql join or a data step merge.
You should probably provide a few example records, fictional are quite all right and the expected final outcome.
Good luck with the date entry part though.
I tend to use the COMPGED function as it worked better for me.
A reflexive join might be a start:
Proc sql;
create table examine as
select a.*, b.firstname as bfname, b.dob as bdob,
compged(cats(a.firstname,a.lastname), cats(b.firstname,b.lastname),"IL" as LexicalDif,
(a.dob - b.dob) as datedif
from dataset as a inner join dataset as b on upcase(a.lastname)=b.lastname
order by a.lastname,a.firstname,a.dob, LexicalDif, datedif;
quit;
Lexicaldif=0 and datedif=0 is strong evidence of same record. NOTE: you better get at least one for every record since it will duplicate. It might help if there is a Record identifier in the data set. If so add to the join criteria: AND a.recordid ne b.recordid
Thanks to you both.
ballardw, there is no unique ID in the dataset unfortunately. Here is a fictional account of my data and what I am looking for more specifically. My table has 4 variables overall as shown below and currently looks like (A). There seems to be 5 patients admitted to hospital J with the last name of Burns in 2012 but on careful inspection obs 1 and 4 are likely to be the same person but the spelling of the first name differs. As my dataset has over 14,000 records, I would need syntax that would mark many of these records for me as I cannot visually inspect the entire dataset in the time permitted. When all is said and done, I would like to have a dataset like that of (B).
A:
obs Last Name First Name Date of Birth Number_of_Admissions
1 Burns Daisy 5/3/2008 16
2 Burns Dana 1/1/1978 2
3 Burns Daniel 9/8/1998 8
4 Burns Daysi 5/3/2008 4
5 Burns Dwayne 5/3/2008 20
B:
obs Last Name First Name Date of Birth Number_of_Admissions
1 Burns Daisy 5/3/2008 20
2 Burns Dana 1/1/1978 2
3 Burns Daniel 9/8/1998 8
4 Burns Dwayne 5/3/2008 20
Don't miss out on SAS Innovate - Register now for the FREE Livestream!
Can't make it to Vegas? No problem! Watch our general sessions LIVE or on-demand starting April 17th. Hear from SAS execs, best-selling author Adam Grant, Hot Ones host Sean Evans, top tech journalist Kara Swisher, AI expert Cassie Kozyrkov, and the mind-blowing dance crew iLuminate! Plus, get access to over 20 breakout sessions.
Learn the difference between classical and Bayesian statistical approaches and see a few PROC examples to perform Bayesian analysis in this video.
Find more tutorials on the SAS Users YouTube channel.