- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
I have two datasets that I need to match based on names. However, the names can be really messy.
I have two main issues.
First, for example,
Dataset A Dataset B
Name Name
De la Rosa Maria Rosa
De la Maria Rosa
Maria de la Rosa Maria de Rosa
Maria Laura Maria
Maria Gabriel Rosa Gabriel
The standard that I consider true match will be De la Rosa and Rosa, De la Rosa and Maria Rosa, Maria de la Rosa and Maria de Rosa, and Maria Laura and Maria. And Maria Gabriel Rosa would be a match with either Maria, Gabriel or Rosa. But Maria Laura and Maria Rosa will not be considered as a match.
So basically if a name has more than one part with meanings, eg. Maria Rosa (de la is not considered as meaningful), it would be a match with names of either of the individual part, eg. Maria. But it would not be a match with another name with two meaningful parts, like Maria Laura.
What I think may help is to get ride of strings like "de", " de la" (I have a bunch of others with no meanings in Hispanic names) and create new names like:
Dataset A Dataset B
name new_name name new_name
De la Rosa Rosa Maria Rosa Maria Rosa
De la Maria Maria Rosa Rosa
Maria de la Rosa Maria Rosa Maria de Rosa Maria Rosa
Maria Laura Maria Laura Maria Maria
Maria Gabriel Rosa Maria Gabriel Rosa Gabriel Gabriel
I think i can use prxchange to do that.
However, I still don't know how I can make sure all the matched are found, considering there are names with equal and more than two meaningful parts (I have names up to four meaningful parts).
The second issue is that there are a bunch of noises in the names, eg. Do not use, Duplicate, Gone, no existing file.
They do not always stand alone. It can be "Maria Do not use", "Jose (Gone)", [OLD]John. I'm thinking to use macro to get rid of them and add newly found ones to the macro for future years of data match. I wrote this:
%LET NOISE = "DO NOT USE","GONE", "DUPLICATE";
If find (name, "NOISE")>0, then do;
new_name=prxchange("s/(&NOISE)"/ /,name);
On top of that is that I have really large files with tens of thousands of records. I'm just really stuck.
Thank you for any suggestion.
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
Generalized edit distance: The COMPGED() function is made for this type of thing. But this changes nothing, this fuzzy math is still difficult.
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
Do you have any other data in the two data sets in common such as date of birth, residence (city, state/province, postal code), telephone number or such?
If you have additional information such as above you may want to investigate Link Plus from the CDC. Website:
https://www.cdc.gov/cancer/npcr/tools/registryplus/lp.htm
This is a probabilistic matching program that will use multiple pieces of information and provide a probability of a match between two records. It will report the probability that two records match so you might be able to reduce manual coding a bit and only spend time with the problem cases. I've never used it this tool with only names but have used it to identify when a data base was migrated and determined individuals whose names were spelled differently or had changes in date of birth, race, ethnicity or sex when the data was transferred between data systems. One thing is that the software is free. It does want text files with column headers (SAS makes those pretty easy with Proc Export) and you provide the rules for which fields to compare between two files. Example one file might have date of birth information in a column DOB and the other BirthDate. So you tell the program those columns are to be compared.
If you have tens of thousands of cases and only names I would call this a "job security" problem because there is so much garbage you are going to have to deal with. You haven't even addresses things like Robert, Bob and Bobby or similar nickname patterns.
One problem I have with eliminating practically any give word like "Old" from these fields are some of the actual names. Consider celebrity children names like Moon Unit Zappa, or anglicized Native American names. There are also parents out there that go a long way to get interesting names for children, Though matches on those name may be more obvious it might also mean you don't want to just throw out words.
A proactive approach would be to go the source of the data and say they have to watch their data entry standards a bit more closely.
I say this from dealing with things like Address data for geocoding that had addresses with text like "see the woman in the building behind the main house"
You may also want to make sure you aren't dealing with multiple character sets such as accented letters before attempting this.