<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: Fuzzy match 2 datasets in SAS Data Management</title>
    <link>https://communities.sas.com/t5/SAS-Data-Management/Fuzzy-match-2-datasets/m-p/466274#M14547</link>
    <description>&lt;P&gt;You're doing about 1.7 billion combinations of records (22,000 × 77,000). Given that this is disk-intensive, I would absolutely expect it to take quite a while. I wouldn't say it's "too many" records, but it is a very intensive process. When you do something like this, I suggest testing with a few dozen records first to get the logic right.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;Tom&lt;/P&gt;</description>
    <pubDate>Wed, 30 May 2018 22:24:17 GMT</pubDate>
    <dc:creator>TomKari</dc:creator>
    <dc:date>2018-05-30T22:24:17Z</dc:date>
    <item>
      <title>Fuzzy match 2 datasets</title>
      <link>https://communities.sas.com/t5/SAS-Data-Management/Fuzzy-match-2-datasets/m-p/466269#M14546</link>
      <description>&lt;P&gt;I have the following script for a fuzzy match. Dataset groupA has about 22,000 records and groupB has about 77,000 records. The program is still running after 30 minutes, so I wonder whether that's because the datasets have too many records or because there is something wrong with my script.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;data group_AandB;&lt;BR /&gt; set groupA;&lt;BR /&gt; tmp_carf_name=soundex(Company_Name);&lt;BR /&gt; tmp_carf_address=soundex(Address_1);&lt;BR /&gt; tmp_carf_city=soundex(City);&lt;BR /&gt; tmp_carf_state=soundex(State);&lt;BR /&gt; do i=1 to nobs;&lt;BR /&gt; set groupB(rename=(ADDRESS_1=ADDRESS CITY=CITY1)) point=i nobs=nobs;&lt;BR /&gt; tmp_pdr_name=soundex(ORG_NAME_ACTUAL);&lt;BR /&gt; tmp_pdr_address=soundex(ADDRESS);&lt;BR /&gt; tmp_pdr_city=soundex(CITY1);&lt;BR /&gt; tmp_pdr_state=soundex(STATE_CODE);&lt;/P&gt;
&lt;P&gt;dif1=compged(tmp_carf_name, tmp_pdr_name);&lt;BR /&gt; dif2=compged(tmp_carf_address, tmp_pdr_address);&lt;BR /&gt; dif3=compged(tmp_carf_city, tmp_pdr_city);&lt;BR /&gt; dif4=compged(tmp_carf_state, tmp_pdr_state);&lt;BR /&gt; &lt;BR /&gt; if dif1&amp;lt;=100 and dif2&amp;lt;=100 and dif3&amp;lt;=100 and dif4&amp;lt;=1 then output;&lt;BR /&gt; end;&lt;BR /&gt; drop tmp_carf_name tmp_pdr_name tmp_carf_address tmp_pdr_address&lt;BR /&gt; tmp_carf_city tmp_pdr_city tmp_carf_state tmp_pdr_state&lt;BR /&gt; dif1 dif2 dif3;&lt;BR /&gt;run;&lt;/P&gt;</description>
      <pubDate>Wed, 30 May 2018 22:01:19 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Data-Management/Fuzzy-match-2-datasets/m-p/466269#M14546</guid>
      <dc:creator>ernie86</dc:creator>
      <dc:date>2018-05-30T22:01:19Z</dc:date>
    </item>
    <item>
      <title>Re: Fuzzy match 2 datasets</title>
      <link>https://communities.sas.com/t5/SAS-Data-Management/Fuzzy-match-2-datasets/m-p/466274#M14547</link>
      <description>&lt;P&gt;You're doing about 1.7 billion combinations of records (22,000 × 77,000). Given that this is disk-intensive, I would absolutely expect it to take quite a while. I wouldn't say it's "too many" records, but it is a very intensive process. When you do something like this, I suggest testing with a few dozen records first to get the logic right.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
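&lt;P&gt;One way to set up such a small test, as a sketch only: the OBS= data set option limits how many observations are read. The data set names groupA and groupB come from the question; testA, testB, and the cutoff of 50 are arbitrary choices for illustration.&lt;/P&gt;
&lt;P&gt;/* Keep only the first 50 observations of each data set (50 is an arbitrary test size) */&lt;BR /&gt; data testA;&lt;BR /&gt; set groupA(obs=50);&lt;BR /&gt; run;&lt;BR /&gt; data testB;&lt;BR /&gt; set groupB(obs=50);&lt;BR /&gt; run;&lt;/P&gt;
&lt;P&gt;Pointing the matching step at testA and testB instead of groupA and groupB gives 50 × 50 = 2,500 comparisons, which should make it quick to check the logic before running the full job.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;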
&lt;P&gt;Tom&lt;/P&gt;</description>
      <pubDate>Wed, 30 May 2018 22:24:17 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Data-Management/Fuzzy-match-2-datasets/m-p/466274#M14547</guid>
      <dc:creator>TomKari</dc:creator>
      <dc:date>2018-05-30T22:24:17Z</dc:date>
    </item>
    <item>
      <title>Re: Fuzzy match 2 datasets</title>
      <link>https://communities.sas.com/t5/SAS-Data-Management/Fuzzy-match-2-datasets/m-p/466297#M14549</link>
      <description>&lt;P&gt;If your objective is to find some sort of "closest match", I would suggest that you do not output every combination.&amp;nbsp; Remember, if you output 1.7B observations, your next step has to process all of them.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;Instead, remove the OUTPUT statement.&amp;nbsp; In its place, calculate the "distance" and whenever the distance is closer than the previous "best" distance, replace a set of variables so that those variables hold the best match found so far.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
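&lt;P&gt;A minimal sketch of that pattern, under a loud assumption: purely for illustration, "distance" here is COMPGED on the company names alone, which is not a recommendation for the real formula. The names best_match, dist, best_dist, best_obs, and best_name, and the length of 200, are hypothetical; groupA, groupB, Company_Name, and ORG_NAME_ACTUAL come from the question.&lt;/P&gt;
&lt;P&gt;data best_match;&lt;BR /&gt; set groupA;&lt;BR /&gt; length best_name $200; /* assumed length, adjust to the data */&lt;BR /&gt; best_dist = .; /* start each groupA record with no best match yet */&lt;BR /&gt; best_obs = .;&lt;BR /&gt; do i = 1 to nobs;&lt;BR /&gt; set groupB(keep=ORG_NAME_ACTUAL) point=i nobs=nobs;&lt;BR /&gt; /* assumed distance formula: edit distance on the names only */&lt;BR /&gt; dist = compged(Company_Name, ORG_NAME_ACTUAL);&lt;BR /&gt; /* a missing value sorts below every number, so test for it explicitly */&lt;BR /&gt; if missing(best_dist) or dist &amp;lt; best_dist then do;&lt;BR /&gt; best_dist = dist;&lt;BR /&gt; best_obs = i; /* observation number of the match in groupB */&lt;BR /&gt; best_name = ORG_NAME_ACTUAL;&lt;BR /&gt; end;&lt;BR /&gt; end;&lt;BR /&gt; drop dist ORG_NAME_ACTUAL;&lt;BR /&gt; /* no OUTPUT statement anywhere in the step */&lt;BR /&gt;run;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;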
&lt;P&gt;Without an OUTPUT statement, that best match will be part of the observation when the looping ends, and will be output automatically.&lt;/P&gt;</description>
      <pubDate>Thu, 31 May 2018 01:19:08 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Data-Management/Fuzzy-match-2-datasets/m-p/466297#M14549</guid>
      <dc:creator>Astounding</dc:creator>
      <dc:date>2018-05-31T01:19:08Z</dc:date>
    </item>
    <item>
      <title>Re: Fuzzy match 2 datasets</title>
      <link>https://communities.sas.com/t5/SAS-Data-Management/Fuzzy-match-2-datasets/m-p/466310#M14550</link>
      <description>&lt;UL&gt;
&lt;LI&gt;The soundex values for groupB should be precalculated in a previous step instead of being recalculated for every obs from groupA. Give them lengths identical to the original strings.&lt;/LI&gt;
&lt;LI&gt;compged should&amp;nbsp;specify a cutoff argument.&lt;/LI&gt;
&lt;LI&gt;In your distance test, you could calculate the second compged (dif2) only if the first one (dif1) passes its threshold of 100, and so on for the other distances.&lt;/LI&gt;
&lt;LI&gt;Another approach that could save a lot of comparisons would compare only cases where the first letters of all 4 fields match.&lt;/LI&gt;
&lt;/UL&gt;</description>
      <pubDate>Thu, 31 May 2018 05:01:03 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Data-Management/Fuzzy-match-2-datasets/m-p/466310#M14550</guid>
      <dc:creator>PGStats</dc:creator>
      <dc:date>2018-05-31T05:01:03Z</dc:date>
    </item>
    <item>
      <title>Re: Fuzzy match 2 datasets</title>
      <link>https://communities.sas.com/t5/SAS-Data-Management/Fuzzy-match-2-datasets/m-p/466329#M14551</link>
      <description>&lt;P&gt;&lt;a href="https://communities.sas.com/t5/user/viewprofilepage/user-id/50296"&gt;@ernie86&lt;/a&gt;&lt;/P&gt;
&lt;P&gt;If you've got the SAS Data Quality Server license (or any license which gives you access to DataFlux and DF... SAS data step functions), then I would first create match codes for your address data and then run all additional logic only for records with the same match codes.&lt;/P&gt;</description>
      <pubDate>Thu, 31 May 2018 06:26:41 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Data-Management/Fuzzy-match-2-datasets/m-p/466329#M14551</guid>
      <dc:creator>Patrick</dc:creator>
      <dc:date>2018-05-31T06:26:41Z</dc:date>
    </item>
    <item>
      <title>Re: Fuzzy match 2 datasets</title>
      <link>https://communities.sas.com/t5/SAS-Data-Management/Fuzzy-match-2-datasets/m-p/466505#M14569</link>
      <description>&lt;P&gt;Yes, I'm trying to get the closest match. Can you tell me how to tweak my code to compare the distances and get the best distance?&lt;/P&gt;</description>
      <pubDate>Thu, 31 May 2018 15:13:26 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Data-Management/Fuzzy-match-2-datasets/m-p/466505#M14569</guid>
      <dc:creator>ernie86</dc:creator>
      <dc:date>2018-05-31T15:13:26Z</dc:date>
    </item>
    <item>
      <title>Re: Fuzzy match 2 datasets</title>
      <link>https://communities.sas.com/t5/SAS-Data-Management/Fuzzy-match-2-datasets/m-p/466554#M14570</link>
      <description>&lt;P&gt;Not really, no.&amp;nbsp; That part is up to you, determining which is the closest.&amp;nbsp; However, I can help.&amp;nbsp; Once you have decided upon a formula to measure the distance, I can show you how to save just 22,000 observations (each with its closest match) instead of 1.7B observations.&lt;/P&gt;</description>
      <pubDate>Thu, 31 May 2018 16:56:44 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Data-Management/Fuzzy-match-2-datasets/m-p/466554#M14570</guid>
      <dc:creator>Astounding</dc:creator>
      <dc:date>2018-05-31T16:56:44Z</dc:date>
    </item>
  </channel>
</rss>

