08-01-2016 03:15 PM
I'm working with a prenatal care dataset from a developing country. The data were abstracted from paper charts. Patient visits were recorded on a visit-by-visit basis, so there's no "patient file" per se. In this country, it's okay if names are spelled slightly differently, as long as it's in the ballpark phonetically. That means I now have the task of trying to link all these patient prenatal care records longitudinally while not having a consistent identifier. Here's what I do have:
For identifiers, I have their first names, middle names, last names, village, age (in years, no birthdate), last menstrual period date (LMP), expected delivery date (EDD), and parity. The problem is that no one identifier is consistently right. How do I sort these patients out and assign them a subject ID?
Thanks so much for your advice.
08-01-2016 03:45 PM
If you have access to the SAS text mining tools, I think they include some additional options for this kind of matching.
If you are restricted to Base SAS, there is a function, SOUNDEX, that encodes a string so that similar-sounding words can be compared. However, the SAS documentation specifically notes that the algorithm is English-biased, so names from other languages may not match well.
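To make the idea of a phonetic code concrete, here is a minimal sketch of classic American Soundex, written in Python purely for illustration. It is not SAS's SOUNDEX function, whose output can differ in detail (for example in code length), and the function name `soundex` is just a label chosen for this example.

```python
# Classic American Soundex, for illustration only (not SAS's exact variant).
CODES = {}
for letters, digit in (("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                       ("L", "4"), ("MN", "5"), ("R", "6")):
    for ch in letters:
        CODES[ch] = digit


def soundex(name):
    """Return a 4-character phonetic code: first letter plus three digits."""
    letters = [c for c in name.upper() if c.isalpha()]
    if not letters:
        return ""
    out = letters[0]
    prev = CODES.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "HW":
            # H and W are skipped and do not separate equal codes.
            continue
        code = CODES.get(ch, "")  # vowels and unknown letters get no code
        if code and code != prev:
            out += code
        prev = code  # a vowel resets prev, so a repeated code is kept
    return (out + "000")[:4]


# Two spellings that sound alike receive the same code.
print(soundex("Robert") == soundex("Rupert"))
```

Names that share a code are only candidates for a match; the other identifiers still have to agree before two records are treated as the same patient.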
If I were tackling this problem I would begin by identifying those individuals whose information appears exactly the same more than once.
I would assign them a base ID. Then I would use some of the other functions to look for records that are similar to those individuals. For example, start with records where the first and last names are the same but the village is different. SPEDIS, COMPGED and COMPLEV give you several different approaches for finding "similar". Mark each record identified as a match with the appropriate identifier value. Then look at records where the last name and village are identical and only the first name varies. Each step should leave fewer unmatched records to look at.
Repeat until the only records left are single name-village combinations, then start comparing those against each other in a similar fashion.
I do have a project where I have to match names with additional information of gender and birth date. Luckily I don't have to look at more than several hundred at a time.
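The first two passes described above could be sketched roughly as follows. This is Python for illustration rather than SAS: a plain Levenshtein distance stands in for COMPLEV, the field names `first`, `last` and `village`, the threshold `max_dist`, and the sample names are all made up, and only one of the fuzzy passes (same last name and village, first name varies) is shown.

```python
from collections import Counter


def levenshtein(a, b):
    """Edit distance: number of insertions, deletions and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]


def assign_ids(records, max_dist=1):
    """Pass 1: exact repeats get a base ID. Pass 2: near-misses join them."""
    keys = [tuple(r[k].strip().lower() for k in ("first", "last", "village"))
            for r in records]
    counts = Counter(keys)
    base = {}
    for k in keys:
        if counts[k] > 1 and k not in base:
            base[k] = len(base) + 1
    ids = [base.get(k) for k in keys]
    for i, (first, last, village) in enumerate(keys):
        if ids[i] is not None:
            continue
        for (b_first, b_last, b_village), b_id in base.items():
            same_place = (b_last, b_village) == (last, village)
            if same_place and levenshtein(b_first, first) <= max_dist:
                ids[i] = b_id
                break
    return ids  # None marks records left for the next, looser pass


# Made-up records, for illustration only.
visits = [
    {"first": "Amina", "last": "Okoro", "village": "Riverside"},
    {"first": "Amina", "last": "Okoro", "village": "Riverside"},
    {"first": "Aminah", "last": "Okoro", "village": "Riverside"},
    {"first": "Grace", "last": "Okoro", "village": "Riverside"},
]
print(assign_ids(visits))
```

Records that come back as `None` are the ones still unmatched, so each later pass has a smaller pile to work through. The threshold would need tuning against records that have been checked by hand.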
08-01-2016 04:56 PM
Some kind of approximation ...
You have 5 variables to compare to make a decision. If 3 of them match, select those records and save them in one data set. Records matching on 2 go to a second data set, and those matching on 1 go to a third. Then try some of the SAS fuzzy-match functions on the saved data sets; each one will be a smaller job.
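A rough sketch of that bucketing, in Python for illustration rather than SAS. The choice of five fields in `FIELDS` is hypothetical, "3 match" is read here as "3 or more", and comparing every pair of records is only practical for modest data sizes.

```python
from itertools import combinations

# Hypothetical choice of five comparison fields.
FIELDS = ("first", "last", "village", "lmp", "edd")


def match_count(a, b, fields=FIELDS):
    """Number of fields on which two records agree exactly."""
    return sum(a[f] == b[f] for f in fields)


def bucket_pairs(records, fields=FIELDS):
    """Sort record pairs into buckets by how many fields agree."""
    buckets = {3: [], 2: [], 1: []}
    for i, j in combinations(range(len(records)), 2):
        n = match_count(records[i], records[j], fields)
        if n >= 3:
            buckets[3].append((i, j))  # strong candidates
        elif n >= 1:
            buckets[n].append((i, j))  # weaker candidates
        # pairs with no agreeing field are dropped
    return buckets
```

Each bucket holds pairs of row indexes. The fuzzy comparisons then only need to run on the fields that disagree within each pair.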