Solved: Merging two datasets with no obvious identifier/s

wlierman · Posted 06-08-2021 10:17 PM

I have two datasets that I am working with that need to be merged. The potential problem is that neither dataset has what I would call an obvious choice for a sort / by variable. These are a REALD data set of individuals who are identified by REALD measures and Medicaid program data set (like TANF and SNAP) I will only give a fictionalized couple of rows from each:

Orpheus dataset (data set with respondents to disability questions on

survey)

FirstNm LastNm DOB Sex MiddleNm City

James Eastwood 09/19/75 M A Medford

Alishay Connell 03/06/91 F Eugene

ONE dataset (Medicaid program)

FirstNm LastNm DOB Sex MiddleNm City

Daniel Hart 07/18/80 M Patrick Baker City

Jade 01/02/88 F Portland

There are missing values (like LastNm) in the second field above. Also the raw some of the raw data has extraneous entries like commas and

single and double quotes especially around names.

I have looked over some SAS articles LexJansen.com but not quite what I was hoping to find. What I'd like to find is a straightforward way to merge these two datasets that have some problem data entries and missing values.

What would be the best method to achieve a merge of these datasets?

Concatenating two fields then sorting and using the concatenated variable to perform the merge? or Adding some type of indicator variable to each dataset to serve the sorting and merging by requirements?

Your thoughts and help are much valued, thank you.

wklierman

jimbarbour · Posted 06-08-2021 10:35 PM

I've done a little of this type of "fuzzy" matching where you don't have a "hard" identifier like SSN or HIC number using the COMPGED function. You kind of have to experiment and do some trial and error. No matter what you do, there are going to be some false matches and also some that should be matched but are not.

This paper might be helpful: https://blogs.sas.com/content/sgf/2015/01/27/how-to-perform-a-fuzzy-match-using-sas-functions/

Jim

View solution in original post

jimbarbour · Posted 06-08-2021 10:35 PM

I've done a little of this type of "fuzzy" matching where you don't have a "hard" identifier like SSN or HIC number using the COMPGED function. You kind of have to experiment and do some trial and error. No matter what you do, there are going to be some false matches and also some that should be matched but are not.

This paper might be helpful: https://blogs.sas.com/content/sgf/2015/01/27/how-to-perform-a-fuzzy-match-using-sas-functions/

Jim

wlierman · Posted 06-09-2021 01:49 AM

Thank you for the link to the paper - looks to be helpful.

Also I appreciate your iterative approach to exploring this challenge. That iterative approach is very closely related to the response from SASkiwi.

Thank you for your help. I really appreciate it.

wklierman

andreas_lds · Posted 06-09-2021 01:14 AM

You may get more answers, if you

post data in usable form (dataset with datalines, no attachments)
and show what you expect as result

SASKiwi · Posted 06-09-2021 01:24 AM

I've found using multiple matching strategies is a good approach. Start by using exact matches between ALL common variables, although I'd uppercase all of the character variables first. Then I'd look at what is left and then match on first name, surname, DOB and sex and see how that works out, then maybe surname, DOB and sex. You then choose the match with the highest accuracy - most variables matching. Leave fuzzy matches to last and only apply them to the remaining unmatched data.

By chipping away at the problem starting with the most accurate matching strategy and working your way down to the least accurate you can often make surprising progress.

wlierman · Posted 06-09-2021 01:40 AM

I really like the iterative approach. That sounds like it could provide a very complete approach - one that I can share with the researchers who eventually will use the data.

Thank you.

wklierman

Merging two datasets with no obvious identifier/s

Re: Merging two datasets with no obvious identifier/s

Re: Merging two datasets with no obvious identifier/s

Re: Merging two datasets with no obvious identifier/s

Re: Merging two datasets with no obvious identifier/s

Re: Merging two datasets with no obvious identifier/s

Re: Merging two datasets with no obvious identifier/s

Classroom Training Available!