SAS Programming

luvscandy27 · Posted 12-20-2021 05:51 PM

Hello,

I need some help and I'm not sure where to begin. I have two datasets:

Dataset One

last_name	first name	DOB	address	City	State	Zip
Billy	Bob	8/15/1979	702 Park Avenue St.	Wilmington	NC	57896
Will Jr.	King	5/11/1993	55556 Martin Lane	Charlotte	NC	12589

And dataset two

last_name	first name	DOB	address	City	State	Zip
Billy	Bob	8/15/1979	702 Park Avenue St.	Wilmington	NC	57896
Will	King	5/11/1993	55556 Martin Ln.	Charlotte	NC	12589

I would like to match on dob, and the first 9 characters of the address since the Street could be spelled Street or St. etc.

What I am trying to achieve is to find duplicates in the two datasets where the name maybe different by matching on dob and street address.

For instance, Will, King is in the both datasets but dataset one has him listed as Will Jr., King. Then I would like to output all instances where

the DOB and address match but there is a possibility that the name maybe different.

Reeza · Posted 12-20-2021 06:17 PM

Fuzzy matching is a very computationally intensive calculation. How are you merging, SQL or data step?

If you want to match on the first 8 characters you can use SUBSTR() to create new variables for the address and merge them. But if you want to fuzzy match both names and addresses it's not an easy process to be honest. Entity resolution is computationally intensive.

If you have access to SAS Data Flux it does one of the best jobs I've seen. Otherwise, the source code for this package is available but they're all fairly complex code sets. https://support.sas.com/resources/papers/proceedings/proceedings/sugi30/020-30.pdf

General idea though : first do exact matches and remove those samples.
Then do exact on name/substr of address and remove these and then essentially iterate with all your algorithms. Removing the matched cases is important as these comparisons are computationally intensive as mentioned as you need to basically compare every value to every other value, so if you have 100 records in one data set you're matching to 100 in another. If you can limit the matches by say Birthdate first, that can really increase the efficiency of your process.

andreas_lds · Posted 12-21-2021 03:04 AM

Just some thoughts ...

Assuming the following data is in your second table:

last_name	first name	DOB	address	City	State	Zip
Billy	Bob	8/15/1979	702 Park Avenue St.	Wilmington	DE	77580

Would you consider this Bob Billy the same as the Bob Billy in your first table?

If not start by comparing last_name, first_name, dob, city and zip (i think "state" can be ignored, because the state is already identified by zip, clean data assumed). For all matches found, compare the cleaned address. With "cleaned" i mean: fixing problems like "street" sometimes written as "ST." or "St." or "Street" or "street". There will be many more similar problems, i don't know us-addresses good enough to provide sufficient input on this.

SASKiwi · Posted 12-21-2021 04:20 PM

How many rows are in each of your tables? How many distinct addresses and birth dates do you have - both separately and in combination? If you get large numbers of people with 1 Jan birth dates you know it will be unreliable.

If your row counts are low then a limited tactical solution might work. If you have millions of rows then a more systemic and robust solution is required. SAS Data Quality contains tools specifically for this purpose but if you don't already have it you can only really justify it for large projects.

SAS Programming

Match on address and DOB to find Duplicates

Re: Match on address and DOB to find Duplicates

Re: Match on address and DOB to find Duplicates

Re: Match on address and DOB to find Duplicates

Follow Us

What is...

SAS Programming

Match on address and DOB to find Duplicates

Re: Match on address and DOB to find Duplicates

Re: Match on address and DOB to find Duplicates

Re: Match on address and DOB to find Duplicates

Our biggest data and AI event of the year.

SAS Training: Just a Click Away

Follow Us

What is...