Hello,
I am matching 2 relatively large datasets (both over 30,000 observations) based on three variables: facility name, facility state, and facility zip. The facility names are not always a perfect match between the 2 sets. For example:
Set 1
Saint Joseph's
Care at Boardwalk and Center for Health
Hospital at Mercy
Set 2
St Joseph's Hospital
Care at Boardwalk
Mercy
As you can see, the names may have different spellings (Saint vs St) or include/exclude certain words (Care at Boardwalk and Center for Health vs Care at Boardwalk, Hospital at Mercy vs Mercy). There is no particular rhyme or reason for the differences in name so using a substring function to extract certain words would not be ideal. Using a simple proc sort and match-merge does match the ones with exact matches but I still have 20,000 observations that are not matching. Are there any functions that may be particularly useful for this kind of problem? I would greatly appreciate any help in getting SAS to do the matching for me.
Can you post a small sample of your two data sets to illutrate your problem? Makes it much easier to provide usable code.
Hi there,
Unfortunately I cannot post my data as it is sensitive information. Instead, I have created two datasets that represent the common problems I am seeing in my data (different spellings, including/excluding words). Sample_main is the dataset that I have currently. Sample_matches is the dataset I have been given that I am supposed to match to sample_main. I am matching on the three variables shown in the dataset.
Ok. So your data looks something like this. Please explain what you want to do from here?
data work.sample_main;
infile datalines dsd truncover;
input facility_name:$42. facility_state:$2. facility_zip:32.;
label facility_name="facility_name" facility_state="facility_state" facility_zip="facility_zip";
datalines4;
saint joseph,wa,90909
mercy,ga,80808
aftercare center at boardwalk and hospital,wv,30303
north hospital - filament,al,50505
partners at work/70707,,
;;;;
data work.samples_matches;
infile datalines dsd truncover;
input facility_name:$29. facility_state:$2. facility_zip:32.;
label facility_name="facility_name" facility_state="facility_state" facility_zip="facility_zip";
datalines4;
st. joseph's hospital,wa,90909
mercy center,ga,80808
aftercare center at boardwalk,wv,30303
north hospital,al,50505
partners,,70707
;;;;
30,000 rows are not that much. What happens if you just merge by facility_state and facility_zip? How many rows where this key combination isn't unique do you get?
If there aren't too many "duplicates" then some "manual" coding for the remaining cases might be the quickest way to get you what you want.
I might suggest a visit to https://www.cdc.gov/cancer/npcr/tools/registryplus/lp.htm
There is CDC developed tool for matching data probabilisticly based on common fields. The software is free to download and use. Basic input is text files and you point the software to "match" columns. The output will give a match probability along with the compared values. You can set a threshold for that probability.
Not perfect but the price is right and is pretty robust plus gets around you having to try to setup hundreds of rules or attempted data "fixes".
Hello,
I would proceed this way :
1/ Standardize names in both datasets (case, suppress special characters, ...)
2/ merge rows of the datasets if one standardized name contains the other
3/ Flag unmatched rows and matches for which names differ and examine flagged rows to see if some additional criterion
could improve the results
4/ Repeat from 1 with an enriched standardization marco until the remaining flagged rows can be handled manually.
Of course, if the differences between names in both datasets are important this can be a huge task.
%macro stdz(name);
prxchange("s/[^\w]//",-1, tranwrd(upcase(&name.),"SAINT","ST"))
%mend;
data main;
id=_N_;
set sample_main;
run;
proc sql;
CREATE TABLE matches AS
SELECT main.id, match.*
FROM samples_matches match
LEFT JOIN main
ON (find(%stdz(main.facility_name), %stdz(match.facility_name))
OR find(%stdz(match.facility_name), %stdz(main.facility_name)))
AND main.facility_state=match.facility_state
AND main.facility_zip=match.facility_zip
ORDER BY main.id
;
quit;
data want;
merge main (in=inmain) matches (rename=(facility_name=match_name) in=inmatch);
by id;
FLAG_MAIN_ONLY=not inmatch;
FLAG_MATCH_ONLY=not inmain;
FLAG_FUZZY_MATCH=inmain and inmatch and facility_name ne match_name;
run;
Registration is now open for SAS Innovate 2025 , our biggest and most exciting global event of the year! Join us in Orlando, FL, May 6-9.
Sign up by Dec. 31 to get the 2024 rate of just $495.
Register now!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.