BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
wlierman
Lapis Lazuli | Level 10

I have two datasets that I am working with that need to be merged.  The potential problem is that neither dataset has what I would call an obvious choice for a sort / by variable.  These are a REALD data set of individuals who are identified  by REALD measures and Medicaid program data set (like TANF and SNAP) I will only give a fictionalized couple of rows from each:

 

Orpheus dataset (data set with respondents to disability questions on

                              survey)

 

FirstNm        LastNm          DOB      Sex    MiddleNm      City

James         Eastwood     09/19/75    M            A           Medford

Alishay          Connell       03/06/91    F                          Eugene

 

 

ONE dataset (Medicaid program)

 

FirstNm        LastNm          DOB       Sex    MiddleNm      City

Daniel          Hart              07/18/80     M         Patrick      Baker City

Jade                                 01/02/88     F                           Portland

 

There are missing values (like LastNm) in the second field above.  Also the raw some of the raw data has extraneous entries like commas and

single and double quotes especially around names.

 

I have looked over some SAS articles LexJansen.com but not quite what I was hoping to find.  What I'd like to find is a straightforward way to merge these two datasets that have some problem data entries and missing values.

 

What would be the best method to achieve a merge of these datasets?

Concatenating two fields then sorting and using the concatenated variable to perform the merge?  or Adding some type of indicator variable to each dataset to serve the sorting and merging by requirements?

 

Your thoughts and help are much valued, thank you.

wklierman

1 ACCEPTED SOLUTION

Accepted Solutions
jimbarbour
Meteorite | Level 14

I've done a little of this type of "fuzzy" matching where you don't have a "hard" identifier like SSN or HIC number using the COMPGED function.  You kind of have to experiment and do some trial and error.  No matter what you do, there are going to be some false matches and also some that should be matched but are not.

 

This paper might be helpful:  https://blogs.sas.com/content/sgf/2015/01/27/how-to-perform-a-fuzzy-match-using-sas-functions/

 

Jim

View solution in original post

5 REPLIES 5
jimbarbour
Meteorite | Level 14

I've done a little of this type of "fuzzy" matching where you don't have a "hard" identifier like SSN or HIC number using the COMPGED function.  You kind of have to experiment and do some trial and error.  No matter what you do, there are going to be some false matches and also some that should be matched but are not.

 

This paper might be helpful:  https://blogs.sas.com/content/sgf/2015/01/27/how-to-perform-a-fuzzy-match-using-sas-functions/

 

Jim

wlierman
Lapis Lazuli | Level 10

Thank you for the link to the paper - looks to be helpful.

Also I appreciate your iterative approach to exploring this challenge. That iterative approach is very closely related to the response from SASkiwi.

 

Thank you for your help. I really appreciate it.

 

wklierman

andreas_lds
Jade | Level 19

You may get more answers, if you

  • post data in usable form (dataset with datalines, no attachments)
  • and show what you expect as result
SASKiwi
PROC Star

I've found using multiple matching strategies is a good approach. Start by using exact matches between ALL common variables, although I'd uppercase all of the character variables first. Then I'd look at what is left and then match on first name, surname, DOB and sex and see how that works out, then maybe surname, DOB and sex. You then choose the match with the highest accuracy - most variables matching. Leave fuzzy matches to last and only apply them to the remaining unmatched data. 

 

By chipping away at the problem starting with the most accurate matching strategy and working your way down to the least accurate you can often make surprising progress.

wlierman
Lapis Lazuli | Level 10

I really like the iterative approach.  That sounds like it could provide a very complete approach - one that I can share with the researchers who eventually will use the data.

 

Thank you.

 

wklierman

sas-innovate-2024.png

Available on demand!

Missed SAS Innovate Las Vegas? Watch all the action for free! View the keynotes, general sessions and 22 breakouts on demand.

 

Register now!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 5 replies
  • 669 views
  • 4 likes
  • 4 in conversation