I have two datasets, each with about 20,000 observations and 10 variables, including first name, last name, and birth year, month, and day. I want to match the two datasets and output a table of all the overlapping entries: if a person appears in both datasets, that counts as a match, allowing for small spelling errors in the names.

My approach is to join the two datasets by Cartesian product, which gives about 4*10^8 entries (20,000 x 20,000). Each entry then holds every variable twice, for example first name once from entry x in dataset 1 and once from entry y in dataset 2. I then create a new variable that uses the COMPGED function to measure the "relative distance" between first_name_1 and first_name_2, plus the distance between last_name_1 and last_name_2, and I keep only the entries where this combined distance is low.

However, this only worked for smaller datasets. For datasets of this size, it does not finish in a reasonable time. Is there a better way to match these two datasets? Thanks in advance!
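For reference, the approach described above can be sketched roughly as follows. This is only a minimal sketch under assumptions: the dataset names (ds1, ds2), the column names (first_name, last_name), the name_dist variable, and the cutoff of 200 are all hypothetical placeholders, not the actual names or threshold from my data.

```sas
/* Sketch of the cross-join approach described above.            */
/* ds1, ds2, the column names, and the cutoff 200 are            */
/* hypothetical placeholders.                                    */
proc sql;
  create table matches as
  select a.first_name as first_name_1,
         b.first_name as first_name_2,
         a.last_name  as last_name_1,
         b.last_name  as last_name_2,
         /* combined generalized edit distance over both names */
         compged(a.first_name, b.first_name)
           + compged(a.last_name, b.last_name) as name_dist
  /* no join condition: every row of ds1 is paired with every row of ds2 */
  from ds1 as a, ds2 as b
  /* keep only the pairs whose combined distance is small */
  where calculated name_dist <= 200;
quit;
```

Because there is no join condition, COMPGED is evaluated twice for every one of the roughly 4*10^8 pairs, which is where the run time goes.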