BookmarkSubscribeRSS Feed
epiala
Calcite | Level 5

Hello all,

I am attempting to "fuzzy match" first names and birthdays of individual patients with similar last names in a single dataset with over 14,000 records to determine if each record represents an unique patient.  I am aware that the spedis function is most likely my best bet but I have not seen a syntax example of how to do this with a single dataset and with a required primary conditional assumption.  In essence, I would like to create a program that would first find all records with the same last name and then determine which of these individuals have similar first names or birthdays that most likely constitute a misspelling or error in data coding.

I would appreciate any and all suggestions or ideas and thank you in advance for your assistance.

Sincerely,

Liz

3 REPLIES 3
Reeza
Super User

It doesn't matter if you're working with a single data set or multiple, you can treat your single data set as two by merging with itself either via a sql join or a data step merge.

ballardw
Super User

You should probably provide a few example records, fictional are quite all right and the expected final outcome.

Good luck with the date entry part though.

I tend to use the COMPGED function as it worked better for me.

A reflexive join might be a start:

Proc sql;

     create table examine as

     select a.*, b.firstname as bfname, b.dob as bdob,

          compged(cats(a.firstname,a.lastname), cats(b.firstname,b.lastname),"IL" as LexicalDif,

          (a.dob - b.dob) as datedif

     from dataset as a inner join dataset as b on upcase(a.lastname)=b.lastname

     order by a.lastname,a.firstname,a.dob, LexicalDif, datedif;

quit;

Lexicaldif=0 and datedif=0 is strong evidence of same record. NOTE: you better get at least one for every record since it will duplicate. It might help if there is a Record identifier in the data set. If so add to the join criteria: AND a.recordid ne b.recordid

epiala
Calcite | Level 5

Thanks to you both.

ballardw, there is no unique ID in the dataset unfortunately.  Here is a fictional account of my data and what I am looking for more specifically.  My table has 4 variables overall as shown below and currently looks like (A).  There seems to be 5 patients admitted to hospital J with the last name of Burns in 2012 but on careful inspection obs 1 and 4 are likely to be the same person but the spelling of the first name differs.  As my dataset has over 14,000 records, I would need syntax that would mark many of these records for me as I cannot visually inspect the entire dataset in the time permitted.  When all is said and done, I would like to have a dataset like that of (B).

A:

obs     Last Name          First Name              Date of Birth     Number_of_Admissions

  1      Burns                    Daisy                         5/3/2008                    16

  2      Burns                    Dana                         1/1/1978                      2

  3      Burns                    Daniel                        9/8/1998                      8  

  4      Burns                    Daysi                         5/3/2008                    4

  5      Burns                    Dwayne                     5/3/2008                    20

B:

obs     Last Name          First Name              Date of Birth     Number_of_Admissions

  1      Burns                    Daisy                         5/3/2008                    20

  2      Burns                    Dana                         1/1/1978                      2

  3      Burns                    Daniel                        9/8/1998                      8  

  4      Burns                    Dwayne                     5/3/2008                    20

What is Bayesian Analysis?

Learn the difference between classical and Bayesian statistical approaches and see a few PROC examples to perform Bayesian analysis in this video.

Find more tutorials on the SAS Users YouTube channel.

SAS Training: Just a Click Away

 Ready to level-up your skills? Choose your own adventure.

Browse our catalog!

Discussion stats
  • 3 replies
  • 1709 views
  • 0 likes
  • 3 in conversation