I have two datasets that I am working with that need to be merged. The potential problem is that neither dataset has what I would call an obvious choice for a sort / by variable. These are a REALD data set of individuals who are identified by REALD measures and Medicaid program data set (like TANF and SNAP) I will only give a fictionalized couple of rows from each:
Orpheus dataset (data set with respondents to disability questions on
survey)
FirstNm LastNm DOB Sex MiddleNm City
James Eastwood 09/19/75 M A Medford
Alishay Connell 03/06/91 F Eugene
ONE dataset (Medicaid program)
FirstNm LastNm DOB Sex MiddleNm City
Daniel Hart 07/18/80 M Patrick Baker City
Jade 01/02/88 F Portland
There are missing values (like LastNm) in the second field above. Also the raw some of the raw data has extraneous entries like commas and
single and double quotes especially around names.
I have looked over some SAS articles LexJansen.com but not quite what I was hoping to find. What I'd like to find is a straightforward way to merge these two datasets that have some problem data entries and missing values.
What would be the best method to achieve a merge of these datasets?
Concatenating two fields then sorting and using the concatenated variable to perform the merge? or Adding some type of indicator variable to each dataset to serve the sorting and merging by requirements?
Your thoughts and help are much valued, thank you.
wklierman
I've done a little of this type of "fuzzy" matching where you don't have a "hard" identifier like SSN or HIC number using the COMPGED function. You kind of have to experiment and do some trial and error. No matter what you do, there are going to be some false matches and also some that should be matched but are not.
This paper might be helpful: https://blogs.sas.com/content/sgf/2015/01/27/how-to-perform-a-fuzzy-match-using-sas-functions/
Jim
I've done a little of this type of "fuzzy" matching where you don't have a "hard" identifier like SSN or HIC number using the COMPGED function. You kind of have to experiment and do some trial and error. No matter what you do, there are going to be some false matches and also some that should be matched but are not.
This paper might be helpful: https://blogs.sas.com/content/sgf/2015/01/27/how-to-perform-a-fuzzy-match-using-sas-functions/
Jim
Thank you for the link to the paper - looks to be helpful.
Also I appreciate your iterative approach to exploring this challenge. That iterative approach is very closely related to the response from SASkiwi.
Thank you for your help. I really appreciate it.
wklierman
You may get more answers, if you
I've found using multiple matching strategies is a good approach. Start by using exact matches between ALL common variables, although I'd uppercase all of the character variables first. Then I'd look at what is left and then match on first name, surname, DOB and sex and see how that works out, then maybe surname, DOB and sex. You then choose the match with the highest accuracy - most variables matching. Leave fuzzy matches to last and only apply them to the remaining unmatched data.
By chipping away at the problem starting with the most accurate matching strategy and working your way down to the least accurate you can often make surprising progress.
I really like the iterative approach. That sounds like it could provide a very complete approach - one that I can share with the researchers who eventually will use the data.
Thank you.
wklierman
Join us for SAS Innovate 2025, our biggest and most exciting global event of the year, in Orlando, FL, from May 6-9. Sign up by March 14 for just $795.
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.