Re: Matching/Merging Observations with Similar but Different Values

alw12194 · Posted 12-28-2020 01:09 PM

Hello,

I am matching 2 relatively large datasets (both over 30,000 observations) based on three variables: facility name, facility state, and facility zip. The facility names are not always a perfect match between the 2 sets. For example:

Set 1

Saint Joseph's

Care at Boardwalk and Center for Health

Hospital at Mercy

Set 2

St Joseph's Hospital

Care at Boardwalk

Mercy

As you can see, the names may have different spellings (Saint vs St) or include/exclude certain words (Care at Boardwalk and Center for Health vs Care at Boardwalk, Hospital at Mercy vs Mercy). There is no particular rhyme or reason for the differences in name so using a substring function to extract certain words would not be ideal. Using a simple proc sort and match-merge does match the ones with exact matches but I still have 20,000 observations that are not matching. Are there any functions that may be particularly useful for this kind of problem? I would greatly appreciate any help in getting SAS to do the matching for me.

PeterClemmensen · Posted 12-29-2020 03:42 AM

Can you post a small sample of your two data sets to illutrate your problem? Makes it much easier to provide usable code.

The DATA to DATA Step Macro
Blog: SASnrd

alw12194 · Posted 12-29-2020 12:46 PM

Hi there,

Unfortunately I cannot post my data as it is sensitive information. Instead, I have created two datasets that represent the common problems I am seeing in my data (different spellings, including/excluding words). Sample_main is the dataset that I have currently. Sample_matches is the dataset I have been given that I am supposed to match to sample_main. I am matching on the three variables shown in the dataset.

PeterClemmensen · Posted 12-30-2020 02:42 AM

Ok. So your data looks something like this. Please explain what you want to do from here?

data work.sample_main;
  infile datalines dsd truncover;
  input facility_name:$42. facility_state:$2. facility_zip:32.;
  label facility_name="facility_name" facility_state="facility_state" facility_zip="facility_zip";
datalines4;
saint joseph,wa,90909
mercy,ga,80808
aftercare center at boardwalk and hospital,wv,30303
north hospital - filament,al,50505
partners at work/70707,,
;;;;


data work.samples_matches;
  infile datalines dsd truncover;
  input facility_name:$29. facility_state:$2. facility_zip:32.;
  label facility_name="facility_name" facility_state="facility_state" facility_zip="facility_zip";
datalines4;
st. joseph's hospital,wa,90909
mercy center,ga,80808
aftercare center at boardwalk,wv,30303
north hospital,al,50505
partners,,70707
;;;;

The DATA to DATA Step Macro
Blog: SASnrd

Patrick · Posted 12-30-2020 08:21 PM

30,000 rows are not that much. What happens if you just merge by facility_state and facility_zip? How many rows where this key combination isn't unique do you get?

If there aren't too many "duplicates" then some "manual" coding for the remaining cases might be the quickest way to get you what you want.

ballardw · Posted 12-29-2020 11:41 AM

I might suggest a visit to https://www.cdc.gov/cancer/npcr/tools/registryplus/lp.htm

There is CDC developed tool for matching data probabilisticly based on common fields. The software is free to download and use. Basic input is text files and you point the software to "match" columns. The output will give a match probability along with the compared values. You can set a threshold for that probability.

Not perfect but the price is right and is pretty robust plus gets around you having to try to setup hundreds of rules or attempted data "fixes".

gamotte · Posted 12-30-2020 11:21 AM

Hello,

I would proceed this way :

1/ Standardize names in both datasets (case, suppress special characters, ...)

2/ merge rows of the datasets if one standardized name contains the other

3/ Flag unmatched rows and matches for which names differ and examine flagged rows to see if some additional criterion

could improve the results

4/ Repeat from 1 with an enriched standardization marco until the remaining flagged rows can be handled manually.

Of course, if the differences between names in both datasets are important this can be a huge task.

%macro stdz(name);
    prxchange("s/[^\w]//",-1, tranwrd(upcase(&name.),"SAINT","ST"))
%mend;

data main;
    id=_N_;
    set sample_main;
run;

proc sql;
    CREATE TABLE matches AS
    SELECT main.id, match.*
    FROM samples_matches match
    LEFT JOIN main
    ON (find(%stdz(main.facility_name), %stdz(match.facility_name))
     OR find(%stdz(match.facility_name), %stdz(main.facility_name)))
    AND main.facility_state=match.facility_state
    AND main.facility_zip=match.facility_zip 
    ORDER BY main.id
    ;
quit;

data want;
    merge main (in=inmain) matches (rename=(facility_name=match_name) in=inmatch);
    by id;

    FLAG_MAIN_ONLY=not inmatch;
    FLAG_MATCH_ONLY=not inmain;
    FLAG_FUZZY_MATCH=inmain and inmatch and facility_name ne match_name;
run;

Matching/Merging Observations with Similar but Different Values