- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
Hello,
I need some help and I'm not sure where to begin. I have two datasets:
Dataset One
last_name | first name | DOB | address | City | State | Zip |
Billy | Bob | 8/15/1979 | 702 Park Avenue St. | Wilmington | NC | 57896 |
Will Jr. | King | 5/11/1993 | 55556 Martin Lane | Charlotte | NC | 12589 |
And dataset two
last_name | first name | DOB | address | City | State | Zip |
Billy | Bob | 8/15/1979 | 702 Park Avenue St. | Wilmington | NC | 57896 |
Will | King | 5/11/1993 | 55556 Martin Ln. | Charlotte | NC | 12589 |
I would like to match on dob, and the first 9 characters of the address since the Street could be spelled Street or St. etc.
What I am trying to achieve is to find duplicates in the two datasets where the name maybe different by matching on dob and street address.
For instance, Will, King is in the both datasets but dataset one has him listed as Will Jr., King. Then I would like to output all instances where
the DOB and address match but there is a possibility that the name maybe different.
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
If you want to match on the first 8 characters you can use SUBSTR() to create new variables for the address and merge them. But if you want to fuzzy match both names and addresses it's not an easy process to be honest. Entity resolution is computationally intensive.
If you have access to SAS Data Flux it does one of the best jobs I've seen. Otherwise, the source code for this package is available but they're all fairly complex code sets. https://support.sas.com/resources/papers/proceedings/proceedings/sugi30/020-30.pdf
General idea though : first do exact matches and remove those samples.
Then do exact on name/substr of address and remove these and then essentially iterate with all your algorithms. Removing the matched cases is important as these comparisons are computationally intensive as mentioned as you need to basically compare every value to every other value, so if you have 100 records in one data set you're matching to 100 in another. If you can limit the matches by say Birthdate first, that can really increase the efficiency of your process.
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
Just some thoughts ...
Assuming the following data is in your second table:
last_name | first name | DOB | address | City | State | Zip |
Billy | Bob | 8/15/1979 | 702 Park Avenue St. | Wilmington | DE | 77580 |
Would you consider this Bob Billy the same as the Bob Billy in your first table?
If not start by comparing last_name, first_name, dob, city and zip (i think "state" can be ignored, because the state is already identified by zip, clean data assumed). For all matches found, compare the cleaned address. With "cleaned" i mean: fixing problems like "street" sometimes written as "ST." or "St." or "Street" or "street". There will be many more similar problems, i don't know us-addresses good enough to provide sufficient input on this.
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
How many rows are in each of your tables? How many distinct addresses and birth dates do you have - both separately and in combination? If you get large numbers of people with 1 Jan birth dates you know it will be unreliable.
If your row counts are low then a limited tactical solution might work. If you have millions of rows then a more systemic and robust solution is required. SAS Data Quality contains tools specifically for this purpose but if you don't already have it you can only really justify it for large projects.