I have two datasets that I need to match on names. However, the names can be really messy, and I have two main issues.

First, for example:

| Dataset A: Name    | Dataset B: Name |
|--------------------|-----------------|
| De la Rosa         | Maria Rosa      |
| De la Maria        | Rosa            |
| Maria de la Rosa   | Maria de Rosa   |
| Maria Laura        | Maria           |
| Maria Gabriel Rosa | Gabriel         |

The pairs I consider true matches are De la Rosa and Rosa, De la Rosa and Maria Rosa, Maria de la Rosa and Maria de Rosa, and Maria Laura and Maria. Maria Gabriel Rosa would be a match with any of Maria, Gabriel, or Rosa. But Maria Laura and Maria Rosa would not be considered a match.

So basically, if a name has more than one meaningful part, e.g. Maria Rosa ("de la" is not considered meaningful), it matches a name that consists of either individual part, e.g. Maria. But it does not match another name with two meaningful parts, like Maria Laura.

What I think may help is to get rid of strings like "de" and "de la" (I have a bunch of others that carry no meaning in Hispanic names) and create new names like this:

| Dataset A: name    | new_name           | Dataset B: name | new_name   |
|--------------------|--------------------|-----------------|------------|
| De la Rosa         | Rosa               | Maria Rosa      | Maria Rosa |
| De la Maria        | Maria              | Rosa            | Rosa       |
| Maria de la Rosa   | Maria Rosa         | Maria de Rosa   | Maria Rosa |
| Maria Laura        | Maria Laura        | Maria           | Maria      |
| Maria Gabriel Rosa | Maria Gabriel Rosa | Gabriel         | Gabriel    |

I think I can use prxchange to do that. However, I still don't know how to make sure all the matches are found, considering there are names with two or more meaningful parts (I have names with up to four meaningful parts).

The second issue is that there is a lot of noise in the names, e.g. "Do not use", "Duplicate", "Gone", "no existing file". These do not always stand alone. A name can be "Maria Do not use", "Jose (Gone)", or "[OLD]John". I'm thinking of using a macro variable to get rid of them, and adding newly found ones to it for future years of data matching.
I wrote this:

    /* noise phrases as a regex alternation, not a quoted list */
    %let noise = DO NOT USE|GONE|DUPLICATE;
    /* -1 replaces every occurrence; /i ignores case; names without noise pass through unchanged */
    new_name = prxchange("s/\b(&noise)\b/ /i", -1, name);

On top of that, I have really large files with tens of thousands of records. I'm just really stuck. Thank you for any suggestions.
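To make the idea concrete, here is a minimal sketch of the two steps described above: strip noise and meaningless particles with prxchange, then match on the remaining parts. It is only an illustration under stated assumptions, not a tested solution. The dataset names `a`, `b`, `a_clean`, `b_clean`, `matches`, the macro `%clean_names`, and the variable `n_parts` are all hypothetical. The particle list contains only the "de" and "de la" examples and would need the other particles added. The matching rule is also an assumption: two names match if their cleaned forms are identical, or if one of them has a single meaningful part that appears as a whole word in the other. Two multi-part names that only partly overlap (e.g. Maria Gabriel Rosa and Maria Rosa) are treated as non-matches here, since the post does not say how they should be handled.

```sas
/* Hypothetical sample data; replace with the real datasets */
data a;
  infile datalines truncover;
  input name $char40.;
  datalines;
De la Rosa
De la Maria
Maria de la Rosa
Maria Laura
Maria Gabriel Rosa
Jose (Gone)
;
run;

data b;
  infile datalines truncover;
  input name $char40.;
  datalines;
Maria Rosa
Rosa
Maria de Rosa
Maria
Gabriel
[OLD]Jose
;
run;

/* Regex alternations; append new phrases here in future years.  */
/* Longer phrases go first so DE LA is tried before DE.          */
%let noise     = DO NOT USE|NO EXISTING FILE|DUPLICATE|GONE|OLD;
%let particles = DE LA|DE;

%macro clean_names(in=, out=);
  data &out;
    set &in;
    length new_name $40;
    new_name = upcase(name);
    /* drop noise phrases, whole words only; -1 = replace all occurrences */
    new_name = prxchange("s/\b(&noise)\b/ /", -1, new_name);
    /* drop particles that carry no meaning */
    new_name = prxchange("s/\b(&particles)\b/ /", -1, new_name);
    /* turn leftover brackets into blanks, then trim and collapse blanks */
    new_name = compbl(strip(prxchange("s/[()\[\]]/ /", -1, new_name)));
    /* number of meaningful parts left in the name */
    n_parts  = countw(new_name, ' ');
  run;
%mend clean_names;

%clean_names(in=a, out=a_clean);
%clean_names(in=b, out=b_clean);

proc sql;
  create table matches as
  select a.name as name_a, b.name as name_b
  from a_clean as a, b_clean as b
  where a.n_parts > 0 and b.n_parts > 0   /* skip names that were only noise */
    and (   a.new_name = b.new_name
         or (a.n_parts = 1 and findw(b.new_name, strip(a.new_name), ' ') > 0)
         or (b.n_parts = 1 and findw(a.new_name, strip(b.new_name), ' ') > 0));
quit;
```

Because the join condition uses OR and a function call, PROC SQL will most likely compare every row of one table with every row of the other, which may be slow on tens of thousands of records. If that becomes a problem, one option is to write each meaningful part to its own row and join on equality of the part instead.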