<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: Match datasets based on the likelihood of strings in SAS Programming</title>
    <link>https://communities.sas.com/t5/SAS-Programming/Match-datasets-based-on-the-likelihood-of-strings/m-p/663815#M198236</link>
    <description>I am unsure what's unclear in my reply.  Sorry.&lt;BR /&gt;&lt;BR /&gt;&amp;gt; Do you know how to match all obs in B to each obs in A?&lt;BR /&gt;Unsure what this means either. &lt;BR /&gt;</description>
    <pubDate>Sun, 21 Jun 2020 08:44:15 GMT</pubDate>
    <dc:creator>ChrisNZ</dc:creator>
    <dc:date>2020-06-21T08:44:15Z</dc:date>
    <item>
      <title>Match datasets based on the likelihood of strings</title>
      <link>https://communities.sas.com/t5/SAS-Programming/Match-datasets-based-on-the-likelihood-of-strings/m-p/663778#M198217</link>
      <description>&lt;P&gt;I have 2 datasets, A and B, containing company names. B contains the correct names, whereas A contains slightly wrong names. How can I ask SAS to match obs in A with similar obs in B? An example would be:&lt;/P&gt;
&lt;P&gt;- match "AGL ENRGY LTD" in A to&amp;nbsp; "AGL ENERGY LTD" in B; or&lt;/P&gt;
&lt;P&gt;- match "AMER CAP" in A to "AMERICAN CAPITAL" in B; or&lt;/P&gt;
&lt;P&gt;- match "EMPIRE CO' in A to "EMPIRE COMPANY" in B&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;I have been manually finding abbreviations and changing them to their full forms, such as CO to COMPANY or CORP to CORPORATION, but there are still obs with letters missing from the name. One way I can think of is to match all obs in B to each obs in A, then use COMPGED or COMPLEV to get a similarity score and keep the pair with the best score. However, this would create a very large dataset. And how do I match all obs in B to each obs in A?&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Sun, 21 Jun 2020 01:04:39 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/Match-datasets-based-on-the-likelihood-of-strings/m-p/663778#M198217</guid>
      <dc:creator>somebody</dc:creator>
      <dc:date>2020-06-21T01:04:39Z</dc:date>
    </item>
    <item>
      <title>Re: Match datasets based on the likelihood of strings</title>
      <link>https://communities.sas.com/t5/SAS-Programming/Match-datasets-based-on-the-likelihood-of-strings/m-p/663783#M198220</link>
      <description>&lt;P&gt;Functions such as COMPGED answer your needs perfectly, but they are expensive to run.&lt;/P&gt;
&lt;P&gt;You are right to clean your data before using them: LTD/LIMITED, etc.&lt;/P&gt;
&lt;P&gt;This is an iterative process.&lt;/P&gt;
&lt;P&gt;First try to &lt;U&gt;also&lt;/U&gt; match on something else. For example&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&amp;nbsp; where first(NAME1)=first(NAME2) and compged(NAME1,NAME2) &amp;lt; &lt;EM&gt;some small value&lt;/EM&gt;&amp;nbsp;&lt;/P&gt;
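&lt;P&gt;As a rough sketch only, assuming the two tables are called A and B, that the names are held in NAME1 and NAME2 respectively, and that 200 is a placeholder cutoff to tune on your own data, such a join could look like this:&lt;/P&gt;
&lt;PRE&gt;&lt;CODE class=" language-sas"&gt;proc sql;
  /* Assumed names: A.NAME1 holds the wrong names, B.NAME2 the correct ones */
  create table MATCHED as
  select ta.NAME1
       , tb.NAME2
       , compged(ta.NAME1, tb.NAME2) as SCORE   /* lower = more similar */
  from A as ta, B as tb
  where first(ta.NAME1) = first(tb.NAME2)       /* the extra, cheaper criterion */
    and calculated SCORE le 200                 /* placeholder cutoff, not a recommendation */
  order by ta.NAME1, SCORE;
quit;&lt;/CODE&gt;&lt;/PRE&gt;
&lt;P&gt;The first-letter test will miss names whose first letter is itself wrong, which is one reason to relax it in later passes.&lt;/P&gt;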
&lt;P&gt;As you match more and more, you can loosen the criteria on the reduced volume of unmatched names.&lt;/P&gt;</description>
      <pubDate>Sun, 21 Jun 2020 01:50:35 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/Match-datasets-based-on-the-likelihood-of-strings/m-p/663783#M198220</guid>
      <dc:creator>ChrisNZ</dc:creator>
      <dc:date>2020-06-21T01:50:35Z</dc:date>
    </item>
    <item>
      <title>Re: Match datasets based on the likelihood of strings</title>
      <link>https://communities.sas.com/t5/SAS-Programming/Match-datasets-based-on-the-likelihood-of-strings/m-p/663785#M198222</link>
      <description>&lt;P&gt;I have been matching on the first 3 words of the names, then the first 2, then the first 1. But if there is an error in the first word, the names don't match.&lt;/P&gt;
&lt;P&gt;Do you know how to match all obs in B to each obs in A?&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Sun, 21 Jun 2020 01:58:51 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/Match-datasets-based-on-the-likelihood-of-strings/m-p/663785#M198222</guid>
      <dc:creator>somebody</dc:creator>
      <dc:date>2020-06-21T01:58:51Z</dc:date>
    </item>
    <item>
      <title>Re: Match datasets based on the likelihood of strings</title>
      <link>https://communities.sas.com/t5/SAS-Programming/Match-datasets-based-on-the-likelihood-of-strings/m-p/663787#M198224</link>
      <description>You can use COMPGED as Chris suggested, or one of the other functions in that family, such as SPEDIS. You can create a separate scoring set to do the mapping.</description>
      <pubDate>Sun, 21 Jun 2020 02:17:53 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/Match-datasets-based-on-the-likelihood-of-strings/m-p/663787#M198224</guid>
      <dc:creator>smantha</dc:creator>
      <dc:date>2020-06-21T02:17:53Z</dc:date>
    </item>
    <item>
      <title>Re: Match datasets based on the likelihood of strings</title>
      <link>https://communities.sas.com/t5/SAS-Programming/Match-datasets-based-on-the-likelihood-of-strings/m-p/663814#M198235</link>
      <description>&lt;P&gt;On top of what others suggested: Do you have the &lt;A href="https://go.documentation.sas.com/api/docsets/dqclref/9.4_3.4/content/dqclref.pdf" target="_self"&gt;SAS Data Quality Server&lt;/A&gt; licensed? If so then this would allow you to standardize company names and then join over the standardized names.&lt;/P&gt;</description>
      <pubDate>Sun, 21 Jun 2020 08:47:49 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/Match-datasets-based-on-the-likelihood-of-strings/m-p/663814#M198235</guid>
      <dc:creator>Patrick</dc:creator>
      <dc:date>2020-06-21T08:47:49Z</dc:date>
    </item>
    <item>
      <title>Re: Match datasets based on the likelihood of strings</title>
      <link>https://communities.sas.com/t5/SAS-Programming/Match-datasets-based-on-the-likelihood-of-strings/m-p/663815#M198236</link>
      <description>I am unsure what's unclear in my reply.  Sorry.&lt;BR /&gt;&lt;BR /&gt;&amp;gt; Do you know how to match all obs in B to each obs in A?&lt;BR /&gt;Unsure what this means either. &lt;BR /&gt;</description>
      <pubDate>Sun, 21 Jun 2020 08:44:15 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/Match-datasets-based-on-the-likelihood-of-strings/m-p/663815#M198236</guid>
      <dc:creator>ChrisNZ</dc:creator>
      <dc:date>2020-06-21T08:44:15Z</dc:date>
    </item>
    <item>
      <title>Re: Match datasets based on the likelihood of strings</title>
      <link>https://communities.sas.com/t5/SAS-Programming/Match-datasets-based-on-the-likelihood-of-strings/m-p/663895#M198283</link>
      <description>&lt;P&gt;Do you know how to check if my SAS has the licence?&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Mon, 22 Jun 2020 01:08:32 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/Match-datasets-based-on-the-likelihood-of-strings/m-p/663895#M198283</guid>
      <dc:creator>somebody</dc:creator>
      <dc:date>2020-06-22T01:08:32Z</dc:date>
    </item>
    <item>
      <title>Re: Match datasets based on the likelihood of strings</title>
      <link>https://communities.sas.com/t5/SAS-Programming/Match-datasets-based-on-the-likelihood-of-strings/m-p/663896#M198284</link>
      <description>&lt;P&gt;I would like to create a new dataset that has all observations in B for every observation in A. For example, if A has 5 observations and B has 10 observations, then the new merged dataset would have 50 observations. How can I perform this merge?&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Mon, 22 Jun 2020 01:10:12 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/Match-datasets-based-on-the-likelihood-of-strings/m-p/663896#M198284</guid>
      <dc:creator>somebody</dc:creator>
      <dc:date>2020-06-22T01:10:12Z</dc:date>
    </item>
    <item>
      <title>Re: Match datasets based on the likelihood of strings</title>
      <link>https://communities.sas.com/t5/SAS-Programming/Match-datasets-based-on-the-likelihood-of-strings/m-p/663918#M198298</link>
      <description>&lt;P&gt;&lt;SPAN&gt;&amp;gt;&amp;nbsp; if A has 5 observation and B has 10 obs, then the new merged dataset would have 50 observations&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;Use:&amp;nbsp;&amp;nbsp;&lt;FONT face="courier new,courier"&gt; from TABLE1, TABLE2&amp;nbsp;&lt;/FONT&gt;&amp;nbsp; without a&amp;nbsp; &amp;nbsp;&lt;FONT face="courier new,courier"&gt;where&amp;nbsp;&lt;/FONT&gt; clause to create such a join.&lt;/P&gt;
&lt;P&gt;That's called a Cartesian join. Why would you do that?&lt;/P&gt;
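&lt;P&gt;For reference, a minimal sketch, assuming the tables are called A and B and the name columns NAME1 and NAME2 (placeholder names):&lt;/P&gt;
&lt;PRE&gt;&lt;CODE class=" language-sas"&gt;proc sql;
  /* No WHERE clause: each row of A is paired with every row of B */
  create table ALL_PAIRS as
  select ta.NAME1, tb.NAME2
  from A as ta, B as tb;
quit;&lt;/CODE&gt;&lt;/PRE&gt;
&lt;P&gt;With 5 rows in A and 10 in B this gives 50 rows, so the output grows very quickly with real data.&lt;/P&gt;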
&lt;P&gt;Your current method is correct:&lt;/P&gt;
&lt;P&gt;1. Standardise the data&lt;/P&gt;
&lt;P&gt;2. Join on increasingly loose criteria. Only try to match the unmatched data.&lt;BR /&gt;&amp;nbsp; - Straight equality&lt;/P&gt;
&lt;P&gt;&amp;nbsp; - Almost equal (this can be many steps)&lt;/P&gt;
&lt;P&gt;&amp;nbsp; - Not quite the same (this can be many steps)&lt;/P&gt;
&lt;P&gt;&amp;nbsp; - Quite different (this can be many steps)&lt;/P&gt;
&lt;P&gt;&amp;nbsp; Keep track of what criterion was used when you achieve a match.&lt;/P&gt;
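&lt;P&gt;A minimal sketch of the first two passes, assuming tables A and B with name columns NAME1 and NAME2, and a COMPGED cutoff of 100 chosen purely for illustration:&lt;/P&gt;
&lt;PRE&gt;&lt;CODE class=" language-sas"&gt;proc sql;
  /* Pass 1: straight equality */
  create table MATCH1 as
  select ta.NAME1, tb.NAME2, 'exact' as CRITERION length=20
  from A as ta inner join B as tb
    on ta.NAME1 = tb.NAME2;

  /* Names in A still unmatched after pass 1 */
  create table LEFT1 as
  select NAME1
  from A
  where NAME1 not in (select NAME1 from MATCH1);

  /* Pass 2: almost equal, tried on the unmatched names only */
  create table MATCH2 as
  select ta.NAME1, tb.NAME2, 'compged le 100' as CRITERION length=20
  from LEFT1 as ta, B as tb
  where first(ta.NAME1) = first(tb.NAME2)
    and compged(ta.NAME1, tb.NAME2) le 100;   /* illustrative cutoff */
quit;&lt;/CODE&gt;&lt;/PRE&gt;
&lt;P&gt;Further passes would repeat the second step on whatever is still unmatched, with a looser test and a different CRITERION label each time. A fuzzy pass can return more than one candidate per name, so the candidates still need reviewing.&lt;/P&gt;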
&lt;P&gt;&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Mon, 22 Jun 2020 06:20:50 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/Match-datasets-based-on-the-likelihood-of-strings/m-p/663918#M198298</guid>
      <dc:creator>ChrisNZ</dc:creator>
      <dc:date>2020-06-22T06:20:50Z</dc:date>
    </item>
  </channel>
</rss>

