BookmarkSubscribeRSS Feed
stevennevets
Calcite | Level 5

I have two datasets, each of about 20000 observations, and 10 variables, including first name, last name, birth year month day, etc. Now I want to match the two dataset and output a table with all the overlapping entries (meaning if one person appears in both data set, it's a match. (allowing a small extent of name spelling error)) My approach on this problem is to join the two sets by cartesian product, so that I'll get about 4*10^10 entries. So each entry will have first name twice (one from entry x in dataset 1, and one from entry y in dataset 2) and every variable twice. Then I'll create a new variable and use compged command to measure the "relative distance" of first_name_1 with first_name_2, plus last_name_1 with last_name_2, so that I keep all the entries with low "relative distance." However, this only worked for smaller datasets, and for sets of this size it cannot process fast enough. Is there a better way to match these two datasets? Thanks in advance! 

2 REPLIES 2
ballardw
Super User

One thing may be to reduce the size of your data. If you have the same names in the data multiple times then reduce to one observation per name. Depending on your data this may be more effective if set the case for all letters to upper or lower. Then you can be matching BOB SMITH to BOB SMITH instead of Bob SMITH to Bob Smith to BOB Smith to Bob SmIth (depending on quality of initial entry).

 

 

One option might be to look into additional tools. https://www.cdc.gov/cancer/npcr/tools/registryplus/lp_tech_info.htm has a link to a FREE program developed by CDC to do probabilistic matches for data similar to yours.

Providing the text files for input is fairly simple from SAS. The program also looks at other data that you may have available such as address or other common field that may help to match.

sas-innovate-2024.png

Join us for SAS Innovate April 16-19 at the Aria in Las Vegas. Bring the team and save big with our group pricing for a limited time only.

Pre-conference courses and tutorials are filling up fast and are always a sellout. Register today to reserve your seat.

 

Register now!

SAS Enterprise Guide vs. SAS Studio

What’s the difference between SAS Enterprise Guide and SAS Studio? How are they similar? Just ask SAS’ Danny Modlin.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 2 replies
  • 808 views
  • 0 likes
  • 3 in conversation