2 weeks ago
I am trying to merge Data A (approx 90,000 obs and 50 variables) and Data B (approx 600 obs and 20 variables) and my merged dataset should have everyone from Data B. There is no unique var that links the two datasets so I am trying to create one. Theoretically, everyone in Data B should be in Data A. The outcome of interest is in Data B (outcome=0 or 1) and most of the info I need for my analysis are in Data A. I've tried to create a unique var using different combinations, but cannot seem to get it right the way I want. For instance, I've tried something like:
firsttname x lastname x 1st line of address x dob (txt): Too conservative as some of the obs in Data A had first name as 'man' or 'resident' instead of the actual names. Last name seems like a good var to use as they were relatively consistent in both datasets.
lastname x dob (txt): This returned some incorrect matching because the merged dataset had more than 700 records.
I am curious if there are any good ways to get this right. The sample size is relatively small so if nothing works I could probably go through them manually, say use the second combo then try to find the ones that do not actually match the records in Data B. Does anybody has any suggestions or tips on this?
2 weeks ago
You haven't given any information about the variables in the two datasets.
In such case, where there are no common variables, probably there are some common information that
within some decoding can help or give hints how to match observations of the two datasets.
You mentioned firstname, lastname, address - how is this data presented in variables in each dataset?
It may help if you subset manually few (up to 10) observations from each datset, where some of them
can match manually, and mark them for test and check.
2 weeks ago
I would start by matching using the most reliable combinations of columns. For example last name, first name, DOB. I would then remove the exact matches produced by this combination, then try matching by other less reliable combinations. By progressively building up a set of matching rules that go from most reliable to least reliable you can get to the point of deciding what the cutoff will be for acceptable matches versus unacceptable matches.
The very nature of your data suggests that any matching strategy is not going to be 100% reliable. The question is what is an acceptable level of reliability? Perhaps somewhere between 95% and 99%?