<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: Matching/Merging Observations with Similar but Different Values in SAS Programming</title>
    <link>https://communities.sas.com/t5/SAS-Programming/Matching-Merging-Observations-with-Similar-but-Different-Values/m-p/708604#M217792</link>
    <description>&lt;P&gt;Hi there,&amp;nbsp;&lt;/P&gt;&lt;P&gt;Unfortunately I cannot post my data as it is sensitive information. Instead, I have created two datasets that represent the common problems I am seeing in my data (different spellings, including/excluding words). Sample_main is the dataset that I have currently. Sample_matches is the dataset I have been given that I am supposed to match to sample_main. I am matching on the three variables shown in the dataset.&amp;nbsp;&lt;/P&gt;</description>
    <pubDate>Tue, 29 Dec 2020 17:46:30 GMT</pubDate>
    <dc:creator>alw12194</dc:creator>
    <dc:date>2020-12-29T17:46:30Z</dc:date>
    <item>
      <title>Matching/Merging Observations with Similar but Different Values</title>
      <link>https://communities.sas.com/t5/SAS-Programming/Matching-Merging-Observations-with-Similar-but-Different-Values/m-p/708456#M217721</link>
      <description>&lt;P&gt;Hello,&amp;nbsp;&lt;/P&gt;&lt;P&gt;I am matching 2 relatively large datasets (both over 30,000 observations) based on three variables: facility name, facility state, and facility zip. The facility names are not always a perfect match between the 2 sets. For example:&lt;/P&gt;&lt;P&gt;&lt;U&gt;Set 1&lt;/U&gt;&lt;/P&gt;&lt;P&gt;Saint Joseph's&lt;/P&gt;&lt;P&gt;Care at Boardwalk and Center for Health&lt;/P&gt;&lt;P&gt;Hospital at Mercy&lt;/P&gt;&lt;P&gt;&lt;U&gt;Set 2&lt;/U&gt;&lt;/P&gt;&lt;P&gt;St Joseph's Hospital&lt;/P&gt;&lt;P&gt;Care at Boardwalk&lt;/P&gt;&lt;P&gt;Mercy&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;As you can see, the names may have different spellings (Saint vs St) or include/exclude certain words (Care at Boardwalk and Center for Health vs Care at Boardwalk, Hospital at Mercy vs Mercy). There is no particular rhyme or reason for the differences in name so using a substring function to extract certain words would not be ideal. Using a simple proc sort and match-merge does match the ones with exact matches but I still have 20,000 observations that are not matching. Are there any functions that may be particularly useful for this kind of problem? I would greatly appreciate any help in getting SAS to do the matching for me.&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Mon, 28 Dec 2020 18:09:48 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/Matching-Merging-Observations-with-Similar-but-Different-Values/m-p/708456#M217721</guid>
      <dc:creator>alw12194</dc:creator>
      <dc:date>2020-12-28T18:09:48Z</dc:date>
    </item>
    <item>
      <title>Re: Matching/Merging Observations with Similar but Different Values</title>
      <link>https://communities.sas.com/t5/SAS-Programming/Matching-Merging-Observations-with-Similar-but-Different-Values/m-p/708529#M217762</link>
      <description>&lt;P&gt;Can you post a small sample of your two data sets to illutrate your problem? Makes it much easier to provide usable code.&lt;/P&gt;</description>
      <pubDate>Tue, 29 Dec 2020 08:42:01 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/Matching-Merging-Observations-with-Similar-but-Different-Values/m-p/708529#M217762</guid>
      <dc:creator>PeterClemmensen</dc:creator>
      <dc:date>2020-12-29T08:42:01Z</dc:date>
    </item>
    <item>
      <title>Re: Matching/Merging Observations with Similar but Different Values</title>
      <link>https://communities.sas.com/t5/SAS-Programming/Matching-Merging-Observations-with-Similar-but-Different-Values/m-p/708590#M217786</link>
      <description>&lt;P&gt;I might suggest a visit to &lt;A href="https://www.cdc.gov/cancer/npcr/tools/registryplus/lp.htm" target="_blank"&gt;https://www.cdc.gov/cancer/npcr/tools/registryplus/lp.htm&lt;/A&gt;&lt;/P&gt;
&lt;P&gt;There is CDC developed tool for matching data probabilisticly based on common fields. The software is free to download and use. Basic input is text files and you point the software to "match" columns. The output will give a match probability along with the compared values. You can set a threshold for that probability.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;Not perfect but the price is right and is pretty robust plus gets around you having to try to setup hundreds of rules or attempted data "fixes".&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Tue, 29 Dec 2020 16:41:09 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/Matching-Merging-Observations-with-Similar-but-Different-Values/m-p/708590#M217786</guid>
      <dc:creator>ballardw</dc:creator>
      <dc:date>2020-12-29T16:41:09Z</dc:date>
    </item>
    <item>
      <title>Re: Matching/Merging Observations with Similar but Different Values</title>
      <link>https://communities.sas.com/t5/SAS-Programming/Matching-Merging-Observations-with-Similar-but-Different-Values/m-p/708604#M217792</link>
      <description>&lt;P&gt;Hi there,&amp;nbsp;&lt;/P&gt;&lt;P&gt;Unfortunately I cannot post my data as it is sensitive information. Instead, I have created two datasets that represent the common problems I am seeing in my data (different spellings, including/excluding words). Sample_main is the dataset that I have currently. Sample_matches is the dataset I have been given that I am supposed to match to sample_main. I am matching on the three variables shown in the dataset.&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Tue, 29 Dec 2020 17:46:30 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/Matching-Merging-Observations-with-Similar-but-Different-Values/m-p/708604#M217792</guid>
      <dc:creator>alw12194</dc:creator>
      <dc:date>2020-12-29T17:46:30Z</dc:date>
    </item>
    <item>
      <title>Re: Matching/Merging Observations with Similar but Different Values</title>
      <link>https://communities.sas.com/t5/SAS-Programming/Matching-Merging-Observations-with-Similar-but-Different-Values/m-p/708719#M217841</link>
      <description>&lt;P&gt;Ok. So your data looks something like this. Please explain what you want to do from here?&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;PRE&gt;&lt;CODE class=" language-sas"&gt;data work.sample_main;
  infile datalines dsd truncover;
  input facility_name:$42. facility_state:$2. facility_zip:32.;
  label facility_name="facility_name" facility_state="facility_state" facility_zip="facility_zip";
datalines4;
saint joseph,wa,90909
mercy,ga,80808
aftercare center at boardwalk and hospital,wv,30303
north hospital - filament,al,50505
partners at work/70707,,
;;;;


data work.samples_matches;
  infile datalines dsd truncover;
  input facility_name:$29. facility_state:$2. facility_zip:32.;
  label facility_name="facility_name" facility_state="facility_state" facility_zip="facility_zip";
datalines4;
st. joseph's hospital,wa,90909
mercy center,ga,80808
aftercare center at boardwalk,wv,30303
north hospital,al,50505
partners,,70707
;;;;&lt;/CODE&gt;&lt;/PRE&gt;</description>
      <pubDate>Wed, 30 Dec 2020 07:42:06 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/Matching-Merging-Observations-with-Similar-but-Different-Values/m-p/708719#M217841</guid>
      <dc:creator>PeterClemmensen</dc:creator>
      <dc:date>2020-12-30T07:42:06Z</dc:date>
    </item>
    <item>
      <title>Re: Matching/Merging Observations with Similar but Different Values</title>
      <link>https://communities.sas.com/t5/SAS-Programming/Matching-Merging-Observations-with-Similar-but-Different-Values/m-p/708783#M217866</link>
      <description>&lt;P&gt;Hello,&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;I would proceed this way :&lt;/P&gt;
&lt;P&gt;1/ Standardize names in both datasets (case, suppress special characters, ...)&lt;/P&gt;
&lt;P&gt;2/ merge rows of the datasets if one standardized name contains the other&lt;/P&gt;
&lt;P&gt;3/ Flag unmatched rows and matches for which names differ and examine flagged rows to see if some additional criterion&lt;/P&gt;
&lt;P&gt;could improve the results&lt;/P&gt;
&lt;P&gt;4/ Repeat from 1 with an enriched standardization marco until the remaining flagged rows can be handled manually.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;Of course, if the differences between names in both datasets are important this can be a huge task.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;PRE&gt;&lt;CODE class=" language-sas"&gt;%macro stdz(name);
    prxchange("s/[^\w]//",-1, tranwrd(upcase(&amp;amp;name.),"SAINT","ST"))
%mend;

data main;
    id=_N_;
    set sample_main;
run;

proc sql;
    CREATE TABLE matches AS
    SELECT main.id, match.*
    FROM samples_matches match
    LEFT JOIN main
    ON (find(%stdz(main.facility_name), %stdz(match.facility_name))
     OR find(%stdz(match.facility_name), %stdz(main.facility_name)))
    AND main.facility_state=match.facility_state
    AND main.facility_zip=match.facility_zip 
    ORDER BY main.id
    ;
quit;

data want;
    merge main (in=inmain) matches (rename=(facility_name=match_name) in=inmatch);
    by id;

    FLAG_MAIN_ONLY=not inmatch;
    FLAG_MATCH_ONLY=not inmain;
    FLAG_FUZZY_MATCH=inmain and inmatch and facility_name ne match_name;
run;

&lt;/CODE&gt;&lt;/PRE&gt;</description>
      <pubDate>Wed, 30 Dec 2020 16:21:17 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/Matching-Merging-Observations-with-Similar-but-Different-Values/m-p/708783#M217866</guid>
      <dc:creator>gamotte</dc:creator>
      <dc:date>2020-12-30T16:21:17Z</dc:date>
    </item>
    <item>
      <title>Re: Matching/Merging Observations with Similar but Different Values</title>
      <link>https://communities.sas.com/t5/SAS-Programming/Matching-Merging-Observations-with-Similar-but-Different-Values/m-p/708843#M217874</link>
      <description>&lt;P&gt;30,000 rows are not that much. What happens if you just merge by&amp;nbsp;facility_state and facility_zip? How many rows where this key combination isn't unique do you get?&lt;/P&gt;
&lt;P&gt;If there aren't too many "duplicates" then some "manual" coding for the remaining cases might be the quickest way to get you what you want.&lt;/P&gt;</description>
      <pubDate>Thu, 31 Dec 2020 01:21:01 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/Matching-Merging-Observations-with-Similar-but-Different-Values/m-p/708843#M217874</guid>
      <dc:creator>Patrick</dc:creator>
      <dc:date>2020-12-31T01:21:01Z</dc:date>
    </item>
  </channel>
</rss>

