Hello,
I need some help and I'm not sure where to begin. I have two datasets:
Dataset One
last_name | first name | DOB | address | City | State | Zip |
Billy | Bob | 8/15/1979 | 702 Park Avenue St. | Wilmington | NC | 57896 |
Will Jr. | King | 5/11/1993 | 55556 Martin Lane | Charlotte | NC | 12589 |
And dataset two
last_name | first name | DOB | address | City | State | Zip |
Billy | Bob | 8/15/1979 | 702 Park Avenue St. | Wilmington | NC | 57896 |
Will | King | 5/11/1993 | 55556 Martin Ln. | Charlotte | NC | 12589 |
I would like to match on dob, and the first 9 characters of the address since the Street could be spelled Street or St. etc.
What I am trying to achieve is to find duplicates in the two datasets where the name maybe different by matching on dob and street address.
For instance, Will, King is in the both datasets but dataset one has him listed as Will Jr., King. Then I would like to output all instances where
the DOB and address match but there is a possibility that the name maybe different.
Just some thoughts ...
Assuming the following data is in your second table:
last_name | first name | DOB | address | City | State | Zip |
Billy | Bob | 8/15/1979 | 702 Park Avenue St. | Wilmington | DE | 77580 |
Would you consider this Bob Billy the same as the Bob Billy in your first table?
If not start by comparing last_name, first_name, dob, city and zip (i think "state" can be ignored, because the state is already identified by zip, clean data assumed). For all matches found, compare the cleaned address. With "cleaned" i mean: fixing problems like "street" sometimes written as "ST." or "St." or "Street" or "street". There will be many more similar problems, i don't know us-addresses good enough to provide sufficient input on this.
How many rows are in each of your tables? How many distinct addresses and birth dates do you have - both separately and in combination? If you get large numbers of people with 1 Jan birth dates you know it will be unreliable.
If your row counts are low then a limited tactical solution might work. If you have millions of rows then a more systemic and robust solution is required. SAS Data Quality contains tools specifically for this purpose but if you don't already have it you can only really justify it for large projects.
Are you ready for the spotlight? We're accepting content ideas for SAS Innovate 2025 to be held May 6-9 in Orlando, FL. The call is open until September 25. Read more here about why you should contribute and what is in it for you!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.