SAS Programming

DATA Step, Macro, Functions and more
BookmarkSubscribeRSS Feed
GingerJJ
Obsidian | Level 7

I have two datasets that I need to match based on names. However, the names can be really messy.

I have two main issues.

First, for example, 

Dataset A                 Dataset B

Name                        Name

De la Rosa               Maria Rosa

De la Maria              Rosa

Maria de la Rosa     Maria de Rosa

Maria Laura             Maria

Maria Gabriel Rosa   Gabriel

 

The standard that I consider true match will be De la Rosa and Rosa, De la Rosa and Maria Rosa, Maria de la Rosa and Maria de Rosa, and Maria Laura and Maria. And Maria Gabriel Rosa would be a match with either Maria, Gabriel or Rosa. But Maria Laura and Maria Rosa will not be considered as a match.

 

So basically if a name has more than one part with meanings, eg. Maria Rosa (de la is not considered as meaningful), it would be a match with names of either of the individual part, eg. Maria. But it would not be a match with another name with two meaningful parts, like Maria Laura

 

What I think may help is to get ride of strings like "de", " de la" (I have a bunch of others with no meanings in Hispanic names) and create new names like:

Dataset A                                  Dataset B

name                    new_name           name                 new_name

De la Rosa             Rosa                    Maria Rosa       Maria Rosa

De la Maria            Maria                  Rosa                   Rosa

Maria de la Rosa    Maria Rosa        Maria de Rosa    Maria Rosa

Maria Laura            Maria Laura        Maria                  Maria

Maria Gabriel Rosa Maria Gabriel Rosa   Gabriel        Gabriel

 

I think i can use prxchange to do that.

However, I still don't know how I can make sure all the matched are found, considering there are names with equal and more than two meaningful parts (I have names up to four meaningful parts).

 

The second issue is that there are a bunch of noises in the names, eg. Do not use, Duplicate, Gone, no existing file.

They do not always stand alone. It can be "Maria Do not use", "Jose (Gone)", [OLD]John. I'm thinking to use macro to get rid of them and add newly found ones to the macro for future years of data match. I wrote this:

%LET NOISE = "DO NOT USE","GONE", "DUPLICATE";

If find (name, "NOISE")>0, then do;

new_name=prxchange("s/(&NOISE)"/ /,name);

 

On top of that is that I have really large files with tens of thousands of records. I'm just really stuck. 

Thank you for any suggestion.

 

2 REPLIES 2
PhilC
Rhodochrosite | Level 12

Generalized edit distance:  The COMPGED() function is made for this type of thing.  But this changes nothing, this fuzzy math is still difficult.

 

SAS Help Center: COMPGED Function

ballardw
Super User

Do you have any other data in the two data sets in common such as date of birth, residence (city, state/province, postal code), telephone number or such?

 

If you have additional information such as above you may want to investigate Link Plus from the CDC. Website:

https://www.cdc.gov/cancer/npcr/tools/registryplus/lp.htm

 

This is a probabilistic matching program that will use multiple pieces of information and provide a probability of a match between two records. It will report the probability that two records match so you might be able to reduce manual coding a bit and only spend time with the problem cases. I've never used it this tool with only names but have used it to identify when a data base was migrated and determined individuals whose names were spelled differently or had changes in date of birth, race, ethnicity or sex when the data was transferred between data systems. One thing is that the software is free. It does want text files with column headers (SAS makes those pretty easy with Proc Export) and you provide the rules for which fields to compare between two files. Example one file might have date of birth information in a column DOB and the other BirthDate. So you tell the program those columns are to be compared.

 

If you have tens of thousands of cases and only names I would call this a "job security" problem because there is so much garbage you are going to have to deal with. You haven't even addresses things like Robert, Bob and Bobby or similar nickname patterns.

 

One problem I have with eliminating practically any give word like "Old" from these fields are some of the actual names. Consider celebrity children names like Moon Unit Zappa, or anglicized Native American names. There are also parents out there that go a long way to get interesting names for children, Though matches on those name may be more obvious it might also mean you don't want to just throw out words.

 

A proactive approach would be to go the source of the data and say they have to watch their data entry standards a bit more closely.

 

I say this from dealing with things like Address data for geocoding that had addresses with text like "see the woman in the building behind the main house"

 

You may also want to make sure you aren't dealing with multiple character sets such as accented letters before attempting this.

sas-innovate-white.png

Our biggest data and AI event of the year.

Don’t miss the livestream kicking off May 7. It’s free. It’s easy. And it’s the best seat in the house.

Join us virtually with our complimentary SAS Innovate Digital Pass. Watch live or on-demand in multiple languages, with translations available to help you get the most out of every session.

 

Register now!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

SAS Training: Just a Click Away

 Ready to level-up your skills? Choose your own adventure.

Browse our catalog!

Discussion stats
  • 2 replies
  • 1844 views
  • 0 likes
  • 3 in conversation