BookmarkSubscribeRSS Feed
luvscandy27
Quartz | Level 8

Hello, 

I need some help and I'm not sure where to begin. I have two datasets:

Dataset One

last_name  first name  DOB address  City State  Zip 
Billy  Bob  8/15/1979 702 Park Avenue St.  Wilmington  NC  57896
Will Jr. King  5/11/1993 55556 Martin Lane  Charlotte  NC  12589

 

And dataset two

last_name  first name  DOB address  City State  Zip 
Billy  Bob  8/15/1979 702 Park Avenue St.  Wilmington  NC  57896
Will  King  5/11/1993 55556 Martin Ln.  Charlotte  NC  12589

 

I would like to match on dob, and the first 9 characters of the address since the Street could be spelled Street or St. etc. 

What I am trying to achieve is to find duplicates in the two datasets where the name maybe different by matching on dob and street address.

For instance, Will, King is in the both datasets but dataset one has him listed as Will Jr., King. Then I would like to output all instances where 

the DOB and address match but there is a possibility that the name maybe different. 

3 REPLIES 3
Reeza
Super User
Fuzzy matching is a very computationally intensive calculation. How are you merging, SQL or data step?

If you want to match on the first 8 characters you can use SUBSTR() to create new variables for the address and merge them. But if you want to fuzzy match both names and addresses it's not an easy process to be honest. Entity resolution is computationally intensive.

If you have access to SAS Data Flux it does one of the best jobs I've seen. Otherwise, the source code for this package is available but they're all fairly complex code sets. https://support.sas.com/resources/papers/proceedings/proceedings/sugi30/020-30.pdf

General idea though : first do exact matches and remove those samples.
Then do exact on name/substr of address and remove these and then essentially iterate with all your algorithms. Removing the matched cases is important as these comparisons are computationally intensive as mentioned as you need to basically compare every value to every other value, so if you have 100 records in one data set you're matching to 100 in another. If you can limit the matches by say Birthdate first, that can really increase the efficiency of your process.
andreas_lds
Jade | Level 19

Just some thoughts ...

Assuming the following data is in your second table:

last_name  first name  DOB address  City State  Zip 
Billy  Bob  8/15/1979 702 Park Avenue St.  Wilmington  DE 77580

Would you consider this Bob Billy the same as the Bob Billy in your first table?

If not start by comparing last_name, first_name, dob, city and zip (i think "state" can be ignored, because the state is already identified by zip, clean data assumed). For all matches found, compare the cleaned address. With "cleaned" i mean: fixing problems like "street" sometimes written as "ST." or "St." or "Street" or "street". There will be many more similar problems, i don't know us-addresses good enough to provide sufficient input on this.

SASKiwi
PROC Star

How many rows are in each of your tables? How many distinct addresses and birth dates do you have - both separately and in combination? If you get large numbers of people with 1 Jan birth dates you know it will be unreliable.

 

If your row counts are low then a limited tactical solution might work. If you have millions of rows then a more systemic and robust solution is required. SAS Data Quality contains tools specifically for this purpose but if you don't already have it you can only really justify it for large projects.

Ready to join fellow brilliant minds for the SAS Hackathon?

Build your skills. Make connections. Enjoy creative freedom. Maybe change the world. Registration is now open through August 30th. Visit the SAS Hackathon homepage.

Register today!
How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 3 replies
  • 560 views
  • 0 likes
  • 4 in conversation