BookmarkSubscribeRSS Feed
alw12194
Calcite | Level 5

Hello, 

I am matching 2 relatively large datasets (both over 30,000 observations) based on three variables: facility name, facility state, and facility zip. The facility names are not always a perfect match between the 2 sets. For example:

Set 1

Saint Joseph's

Care at Boardwalk and Center for Health

Hospital at Mercy

Set 2

St Joseph's Hospital

Care at Boardwalk

Mercy

 

As you can see, the names may have different spellings (Saint vs St) or include/exclude certain words (Care at Boardwalk and Center for Health vs Care at Boardwalk, Hospital at Mercy vs Mercy). There is no particular rhyme or reason for the differences in name so using a substring function to extract certain words would not be ideal. Using a simple proc sort and match-merge does match the ones with exact matches but I still have 20,000 observations that are not matching. Are there any functions that may be particularly useful for this kind of problem? I would greatly appreciate any help in getting SAS to do the matching for me. 

6 REPLIES 6
PeterClemmensen
Tourmaline | Level 20

Can you post a small sample of your two data sets to illutrate your problem? Makes it much easier to provide usable code.

alw12194
Calcite | Level 5

Hi there, 

Unfortunately I cannot post my data as it is sensitive information. Instead, I have created two datasets that represent the common problems I am seeing in my data (different spellings, including/excluding words). Sample_main is the dataset that I have currently. Sample_matches is the dataset I have been given that I am supposed to match to sample_main. I am matching on the three variables shown in the dataset. 

PeterClemmensen
Tourmaline | Level 20

Ok. So your data looks something like this. Please explain what you want to do from here?

 

data work.sample_main;
  infile datalines dsd truncover;
  input facility_name:$42. facility_state:$2. facility_zip:32.;
  label facility_name="facility_name" facility_state="facility_state" facility_zip="facility_zip";
datalines4;
saint joseph,wa,90909
mercy,ga,80808
aftercare center at boardwalk and hospital,wv,30303
north hospital - filament,al,50505
partners at work/70707,,
;;;;


data work.samples_matches;
  infile datalines dsd truncover;
  input facility_name:$29. facility_state:$2. facility_zip:32.;
  label facility_name="facility_name" facility_state="facility_state" facility_zip="facility_zip";
datalines4;
st. joseph's hospital,wa,90909
mercy center,ga,80808
aftercare center at boardwalk,wv,30303
north hospital,al,50505
partners,,70707
;;;;
Patrick
Opal | Level 21

30,000 rows are not that much. What happens if you just merge by facility_state and facility_zip? How many rows where this key combination isn't unique do you get?

If there aren't too many "duplicates" then some "manual" coding for the remaining cases might be the quickest way to get you what you want.

ballardw
Super User

I might suggest a visit to https://www.cdc.gov/cancer/npcr/tools/registryplus/lp.htm

There is CDC developed tool for matching data probabilisticly based on common fields. The software is free to download and use. Basic input is text files and you point the software to "match" columns. The output will give a match probability along with the compared values. You can set a threshold for that probability.

 

Not perfect but the price is right and is pretty robust plus gets around you having to try to setup hundreds of rules or attempted data "fixes".

 

gamotte
Rhodochrosite | Level 12

Hello,

 

I would proceed this way :

1/ Standardize names in both datasets (case, suppress special characters, ...)

2/ merge rows of the datasets if one standardized name contains the other

3/ Flag unmatched rows and matches for which names differ and examine flagged rows to see if some additional criterion

could improve the results

4/ Repeat from 1 with an enriched standardization marco until the remaining flagged rows can be handled manually.

 

Of course, if the differences between names in both datasets are important this can be a huge task.

 

%macro stdz(name);
    prxchange("s/[^\w]//",-1, tranwrd(upcase(&name.),"SAINT","ST"))
%mend;

data main;
    id=_N_;
    set sample_main;
run;

proc sql;
    CREATE TABLE matches AS
    SELECT main.id, match.*
    FROM samples_matches match
    LEFT JOIN main
    ON (find(%stdz(main.facility_name), %stdz(match.facility_name))
     OR find(%stdz(match.facility_name), %stdz(main.facility_name)))
    AND main.facility_state=match.facility_state
    AND main.facility_zip=match.facility_zip 
    ORDER BY main.id
    ;
quit;

data want;
    merge main (in=inmain) matches (rename=(facility_name=match_name) in=inmatch);
    by id;

    FLAG_MAIN_ONLY=not inmatch;
    FLAG_MATCH_ONLY=not inmain;
    FLAG_FUZZY_MATCH=inmain and inmatch and facility_name ne match_name;
run;

Ready to join fellow brilliant minds for the SAS Hackathon?

Build your skills. Make connections. Enjoy creative freedom. Maybe change the world. Registration is now open through August 30th. Visit the SAS Hackathon homepage.

Register today!
How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 6 replies
  • 5809 views
  • 1 like
  • 5 in conversation