BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
Kiko
Fluorite | Level 6

Hello

 

I am trying to merge Data A (approx 90,000 obs and 50 variables) and Data B (approx 600 obs and 20 variables) and my merged dataset should have everyone from Data B.  There is no unique var that links the two datasets so I am trying to create one. Theoretically, everyone in Data B should be in Data A. The outcome of interest is in Data B (outcome=0 or 1) and most of the info I need for my analysis are in Data A. I've tried to create a unique var using different combinations, but cannot seem to get it right the way I want. For instance, I've tried something like:

 

firsttname x lastname x 1st line of address x dob (txt): Too conservative as some of the obs in Data A had first name as 'man' or 'resident' instead of the actual names. Last name seems like a good var to use as they were relatively consistent in both datasets.  

 

lastname x dob (txt): This returned some incorrect matching because the merged dataset had more than 700 records.

 

I am curious if there are any good ways to get this right. The sample size is relatively small so if nothing works I could probably go through them manually, say use the second combo then try to find the ones that do not actually match the records in Data B. Does anybody has any suggestions or tips on this?

 

Thank you!

 

 

1 ACCEPTED SOLUTION

Accepted Solutions
SASKiwi
PROC Star

I would start by matching using the most reliable combinations of columns. For example last name, first name, DOB. I would then remove the exact matches produced by this combination, then try matching by other less reliable combinations. By progressively building up a set of matching rules that go from most reliable to least reliable you can get to the point of deciding what the cutoff will be for acceptable matches versus unacceptable matches.

 

The very nature of your data suggests that any matching strategy is not going to be 100% reliable. The question is what is an acceptable level of reliability? Perhaps somewhere between 95% and 99%?  

 

View solution in original post

2 REPLIES 2
Shmuel
Garnet | Level 18

You haven't given any information about the variables in the two datasets.

In such case, where there are no common variables, probably there are some common information that

within some decoding can help or give hints how to match observations of the two datasets.

You mentioned firstname, lastname, address - how is this data presented in variables in each dataset?

 

It may help if you subset manually few (up to 10) observations from each datset, where some of them

can match manually, and mark them for test and check.

SASKiwi
PROC Star

I would start by matching using the most reliable combinations of columns. For example last name, first name, DOB. I would then remove the exact matches produced by this combination, then try matching by other less reliable combinations. By progressively building up a set of matching rules that go from most reliable to least reliable you can get to the point of deciding what the cutoff will be for acceptable matches versus unacceptable matches.

 

The very nature of your data suggests that any matching strategy is not going to be 100% reliable. The question is what is an acceptable level of reliability? Perhaps somewhere between 95% and 99%?  

 

sas-innovate-2024.png

Join us for SAS Innovate April 16-19 at the Aria in Las Vegas. Bring the team and save big with our group pricing for a limited time only.

Pre-conference courses and tutorials are filling up fast and are always a sellout. Register today to reserve your seat.

 

Register now!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 2 replies
  • 762 views
  • 1 like
  • 3 in conversation