stevennevets

‎04-14-2017

Data matching problem

‎04-07-2017 02:06 AM

Posted Data matching problem on SAS Studio. ‎04-07-2017 02:06 AM

I have two datasets, each with about 20,000 observations and 10 variables, including first name, last name, and birth year, month, and day. I want to match the two datasets and output a table of all overlapping entries: if one person appears in both datasets, that counts as a match, allowing for small spelling errors in the names.

My approach is to join the two sets by Cartesian product, which gives about 4*10^8 entries (20,000 x 20,000). Each entry then carries every variable twice, for example first name once from entry x in dataset 1 and once from entry y in dataset 2. I then create a new variable, using the COMPGED function to measure the "relative distance" between first_name_1 and first_name_2 plus the distance between last_name_1 and last_name_2, and keep all entries with a low "relative distance".

However, this only worked for smaller datasets; for sets of this size it is too slow to finish. Is there a better way to match these two datasets? Thanks in advance!
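
For readers who want to see the approach described in the post spelled out, here is a minimal sketch in Python rather than SAS. It is not the poster's code. It assumes records are dictionaries with hypothetical `first_name` and `last_name` keys, uses a plain Levenshtein distance as a stand-in for COMPGED (which is a generalized edit distance with its own cost table, so the scores will differ), and uses an arbitrary threshold of 2.

```python
from itertools import product


def levenshtein(a: str, b: str) -> int:
    """Plain edit distance: insert, delete, and substitute each cost 1.
    Stand-in for COMPGED, which uses different per-operation costs."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete from a
                            curr[j - 1] + 1,      # insert into a
                            prev[j - 1] + cost))  # substitute or keep
        prev = curr
    return prev[-1]


def match_by_cross_join(set1, set2, max_distance=2):
    """Compare every record in set1 with every record in set2 and keep
    the pairs whose combined first- and last-name distance is small."""
    matches = []
    for x, y in product(set1, set2):
        distance = (levenshtein(x["first_name"], y["first_name"])
                    + levenshtein(x["last_name"], y["last_name"]))
        if distance <= max_distance:
            matches.append((x, y, distance))
    return matches


# Hypothetical sample records, invented for illustration only.
set1 = [{"first_name": "Steven", "last_name": "Smith"},
        {"first_name": "Maria", "last_name": "Lopez"}]
set2 = [{"first_name": "Stephen", "last_name": "Smith"},
        {"first_name": "John", "last_name": "Brown"}]

matches = match_by_cross_join(set1, set2)

# Number of comparisons the same loop would make on the full data.
pairs_at_full_size = 20_000 * 20_000
```

The loop body runs once per pair, so the work grows with the product of the two dataset sizes. That is why the method is fine on small inputs and slow at 20,000 rows per side.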
