topic Re: Matching data with messy names in SAS Programming

Matching data with messy names

GingerJJ — Thu, 19 Aug 2021 19:52:35 GMT

I have two datasets that I need to match based on names. However, the names can be really messy.

I have two main issues.

First, for example,

Dataset A Dataset B

Name Name

De la Rosa Maria Rosa

De la Maria Rosa

Maria de la Rosa Maria de Rosa

Maria Laura Maria

Maria Gabriel Rosa Gabriel

The standard that I consider true match will be De la Rosa and Rosa, De la Rosa and Maria Rosa, Maria de la Rosa and Maria de Rosa, and Maria Laura and Maria. And Maria Gabriel Rosa would be a match with either Maria, Gabriel or Rosa. But Maria Laura and Maria Rosa will not be considered as a match.

So basically if a name has more than one part with meanings, eg. Maria Rosa (de la is not considered as meaningful), it would be a match with names of either of the individual part, eg. Maria. But it would not be a match with another name with two meaningful parts, like Maria Laura.

What I think may help is to get ride of strings like "de", " de la" (I have a bunch of others with no meanings in Hispanic names) and create new names like:

Dataset A Dataset B

name new_name name new_name

De la Rosa Rosa Maria Rosa Maria Rosa

De la Maria Maria Rosa Rosa

Maria de la Rosa Maria Rosa Maria de Rosa Maria Rosa

Maria Laura Maria Laura Maria Maria

Maria Gabriel Rosa Maria Gabriel Rosa Gabriel Gabriel

I think i can use prxchange to do that.

However, I still don't know how I can make sure all the matched are found, considering there are names with equal and more than two meaningful parts (I have names up to four meaningful parts).

The second issue is that there are a bunch of noises in the names, eg. Do not use, Duplicate, Gone, no existing file.

They do not always stand alone. It can be "Maria Do not use", "Jose (Gone)", [OLD]John. I'm thinking to use macro to get rid of them and add newly found ones to the macro for future years of data match. I wrote this:

%LET NOISE = "DO NOT USE","GONE", "DUPLICATE";

If find (name, "NOISE")>0, then do;

new_name=prxchange("s/(&NOISE)"/ /,name);

On top of that is that I have really large files with tens of thousands of records. I'm just really stuck.

Thank you for any suggestion.

Re: Matching data with messy names

PhilC — Thu, 19 Aug 2021 19:57:36 GMT

Generalized edit distance: The COMPGED() function is made for this type of thing. But this changes nothing, this fuzzy math is still difficult.

SAS Help Center: COMPGED Function

Re: Matching data with messy names

ballardw — Thu, 19 Aug 2021 21:02:26 GMT

Do you have any other data in the two data sets in common such as date of birth, residence (city, state/province, postal code), telephone number or such?

If you have additional information such as above you may want to investigate Link Plus from the CDC. Website:

https://www.cdc.gov/cancer/npcr/tools/registryplus/lp.htm

This is a probabilistic matching program that will use multiple pieces of information and provide a probability of a match between two records. It will report the probability that two records match so you might be able to reduce manual coding a bit and only spend time with the problem cases. I've never used it this tool with only names but have used it to identify when a data base was migrated and determined individuals whose names were spelled differently or had changes in date of birth, race, ethnicity or sex when the data was transferred between data systems. One thing is that the software is free. It does want text files with column headers (SAS makes those pretty easy with Proc Export) and you provide the rules for which fields to compare between two files. Example one file might have date of birth information in a column DOB and the other BirthDate. So you tell the program those columns are to be compared.

If you have tens of thousands of cases and only names I would call this a "job security" problem because there is so much garbage you are going to have to deal with. You haven't even addresses things like Robert, Bob and Bobby or similar nickname patterns.

One problem I have with eliminating practically any give word like "Old" from these fields are some of the actual names. Consider celebrity children names like Moon Unit Zappa, or anglicized Native American names. There are also parents out there that go a long way to get interesting names for children, Though matches on those name may be more obvious it might also mean you don't want to just throw out words.

A proactive approach would be to go the source of the data and say they have to watch their data entry standards a bit more closely.

I say this from dealing with things like Address data for geocoding that had addresses with text like "see the woman in the building behind the main house"

You may also want to make sure you aren't dealing with multiple character sets such as accented letters before attempting this.