04-07-2017 02:06 AM - edited 04-07-2017 02:10 AM
I have two datasets, each of about 20000 observations, and 10 variables, including first name, last name, birth year month day, etc. Now I want to match the two dataset and output a table with all the overlapping entries (meaning if one person appears in both data set, it's a match. (allowing a small extent of name spelling error)) My approach on this problem is to join the two sets by cartesian product, so that I'll get about 4*10^10 entries. So each entry will have first name twice (one from entry x in dataset 1, and one from entry y in dataset 2) and every variable twice. Then I'll create a new variable and use compged command to measure the "relative distance" of first_name_1 with first_name_2, plus last_name_1 with last_name_2, so that I keep all the entries with low "relative distance." However, this only worked for smaller datasets, and for sets of this size it cannot process fast enough. Is there a better way to match these two datasets? Thanks in advance!
04-07-2017 11:30 AM
One thing may be to reduce the size of your data. If you have the same names in the data multiple times then reduce to one observation per name. Depending on your data this may be more effective if set the case for all letters to upper or lower. Then you can be matching BOB SMITH to BOB SMITH instead of Bob SMITH to Bob Smith to BOB Smith to Bob SmIth (depending on quality of initial entry).
One option might be to look into additional tools. https://www.cdc.gov/cancer/npcr/tools/registryplus/lp_tech_info.htm has a link to a FREE program developed by CDC to do probabilistic matches for data similar to yours.
Providing the text files for input is fairly simple from SAS. The program also looks at other data that you may have available such as address or other common field that may help to match.