<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: Cluster similar strings into group in SAS Programming</title>
    <link>https://communities.sas.com/t5/SAS-Programming/Cluster-similar-strings-into-group/m-p/738831#M230522</link>
    <description>Or try cosine similarity:&lt;BR /&gt;&lt;BR /&gt;&lt;A href="https://blogs.sas.com/content/iml/2019/09/05/cosine-similarity-recommendations.html" target="_blank"&gt;https://blogs.sas.com/content/iml/2019/09/05/cosine-similarity-recommendations.html&lt;/A&gt;&lt;BR /&gt;&lt;A href="https://blogs.sas.com/content/iml/2019/09/03/cosine-similarity.html" target="_blank"&gt;https://blogs.sas.com/content/iml/2019/09/03/cosine-similarity.html&lt;/A&gt;</description>
    <pubDate>Tue, 04 May 2021 10:15:46 GMT</pubDate>
    <dc:creator>Ksharp</dc:creator>
    <dc:date>2021-05-04T10:15:46Z</dc:date>
    <item>
      <title>Cluster similar strings into group</title>
      <link>https://communities.sas.com/t5/SAS-Programming/Cluster-similar-strings-into-group/m-p/738752#M230490</link>
      <description>&lt;P&gt;Hi,&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;I am working on something like the following.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;The original dataset looks like this:&lt;/P&gt;
&lt;TABLE width="444"&gt;
&lt;TBODY&gt;
&lt;TR&gt;
&lt;TD width="191"&gt;Group&lt;/TD&gt;
&lt;TD width="253"&gt;Name&lt;/TD&gt;
&lt;/TR&gt;
&lt;TR&gt;
&lt;TD&gt;A&lt;/TD&gt;
&lt;TD&gt;Cox &amp;amp; Wilson, PC, CPA&lt;/TD&gt;
&lt;/TR&gt;
&lt;TR&gt;
&lt;TD&gt;A&lt;/TD&gt;
&lt;TD&gt;Cox &amp;amp; Wilson, PC, CPAs&lt;/TD&gt;
&lt;/TR&gt;
&lt;TR&gt;
&lt;TD&gt;A&lt;/TD&gt;
&lt;TD&gt;Cox &amp;amp; Wilson, PC, CPAs&lt;/TD&gt;
&lt;/TR&gt;
&lt;TR&gt;
&lt;TD&gt;A&lt;/TD&gt;
&lt;TD&gt;Jonathan D. Liner&lt;/TD&gt;
&lt;/TR&gt;
&lt;TR&gt;
&lt;TD&gt;A&lt;/TD&gt;
&lt;TD&gt;Jonathan Liner&lt;/TD&gt;
&lt;/TR&gt;
&lt;TR&gt;
&lt;TD&gt;B&lt;/TD&gt;
&lt;TD&gt;Memphis Light, Gas &amp;amp; Water&lt;/TD&gt;
&lt;/TR&gt;
&lt;TR&gt;
&lt;TD&gt;B&lt;/TD&gt;
&lt;TD&gt;Memphis Light, Gas &amp;amp; Water&lt;/TD&gt;
&lt;/TR&gt;
&lt;TR&gt;
&lt;TD&gt;B&lt;/TD&gt;
&lt;TD&gt;Memphis, Light, Gas and Water&lt;/TD&gt;
&lt;/TR&gt;
&lt;TR&gt;
&lt;TD&gt;C&lt;/TD&gt;
&lt;TD&gt;Homer Electric Assn&lt;/TD&gt;
&lt;/TR&gt;
&lt;TR&gt;
&lt;TD&gt;C&lt;/TD&gt;
&lt;TD&gt;Homer Electric Association&lt;/TD&gt;
&lt;/TR&gt;
&lt;TR&gt;
&lt;TD&gt;C&lt;/TD&gt;
&lt;TD&gt;Homer Electric Association&lt;/TD&gt;
&lt;/TR&gt;
&lt;/TBODY&gt;
&lt;/TABLE&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;What I want to do is categorize similar names within each group. The desired dataset looks like this:&lt;/P&gt;
&lt;TABLE width="632"&gt;
&lt;TBODY&gt;
&lt;TR&gt;
&lt;TD width="191"&gt;Group&lt;/TD&gt;
&lt;TD width="253"&gt;Name&lt;/TD&gt;
&lt;TD width="188"&gt;NameGroup&lt;/TD&gt;
&lt;/TR&gt;
&lt;TR&gt;
&lt;TD&gt;A&lt;/TD&gt;
&lt;TD&gt;Cox &amp;amp; Wilson, PC, CPA&lt;/TD&gt;
&lt;TD&gt;1&lt;/TD&gt;
&lt;/TR&gt;
&lt;TR&gt;
&lt;TD&gt;A&lt;/TD&gt;
&lt;TD&gt;Cox &amp;amp; Wilson, PC, CPAs&lt;/TD&gt;
&lt;TD&gt;1&lt;/TD&gt;
&lt;/TR&gt;
&lt;TR&gt;
&lt;TD&gt;A&lt;/TD&gt;
&lt;TD&gt;Cox &amp;amp; Wilson, PC, CPAs&lt;/TD&gt;
&lt;TD&gt;1&lt;/TD&gt;
&lt;/TR&gt;
&lt;TR&gt;
&lt;TD&gt;A&lt;/TD&gt;
&lt;TD&gt;Jonathan D. Liner&lt;/TD&gt;
&lt;TD&gt;2&lt;/TD&gt;
&lt;/TR&gt;
&lt;TR&gt;
&lt;TD&gt;A&lt;/TD&gt;
&lt;TD&gt;Jonathan Liner&lt;/TD&gt;
&lt;TD&gt;2&lt;/TD&gt;
&lt;/TR&gt;
&lt;TR&gt;
&lt;TD&gt;B&lt;/TD&gt;
&lt;TD&gt;Memphis Light, Gas &amp;amp; Water&lt;/TD&gt;
&lt;TD&gt;3&lt;/TD&gt;
&lt;/TR&gt;
&lt;TR&gt;
&lt;TD&gt;B&lt;/TD&gt;
&lt;TD&gt;Memphis Light, Gas &amp;amp; Water&lt;/TD&gt;
&lt;TD&gt;3&lt;/TD&gt;
&lt;/TR&gt;
&lt;TR&gt;
&lt;TD&gt;B&lt;/TD&gt;
&lt;TD&gt;Memphis, Light, Gas and Water&lt;/TD&gt;
&lt;TD&gt;3&lt;/TD&gt;
&lt;/TR&gt;
&lt;TR&gt;
&lt;TD&gt;C&lt;/TD&gt;
&lt;TD&gt;Homer Electric Assn&lt;/TD&gt;
&lt;TD&gt;4&lt;/TD&gt;
&lt;/TR&gt;
&lt;TR&gt;
&lt;TD&gt;C&lt;/TD&gt;
&lt;TD&gt;Homer Electric Association&lt;/TD&gt;
&lt;TD&gt;4&lt;/TD&gt;
&lt;/TR&gt;
&lt;TR&gt;
&lt;TD&gt;C&lt;/TD&gt;
&lt;TD&gt;Homer Electric Association&lt;/TD&gt;
&lt;TD&gt;4&lt;/TD&gt;
&lt;/TR&gt;
&lt;/TBODY&gt;
&lt;/TABLE&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;My initial thought is to compute the string distance (&lt;SPAN&gt;Levenshtein&lt;/SPAN&gt;) between each pair of names and then apply a clustering method. However, I see two difficulties:&lt;/P&gt;
&lt;P&gt;1. I have a lot of observations, so computing the distance for every pair, even only within each group, can be time-consuming.&lt;/P&gt;
&lt;P&gt;2. The clustering method I am familiar with is K-means, which requires specifying the number of clusters in advance. Are there clustering methods that work better for this problem?&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;Is there a more convenient way to do this?&lt;/P&gt;
&lt;P&gt;It would be great if someone could help.&lt;/P&gt;</description>
      <pubDate>Tue, 04 May 2021 00:34:38 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/Cluster-similar-strings-into-group/m-p/738752#M230490</guid>
      <dc:creator>daradanye</dc:creator>
      <dc:date>2021-05-04T00:34:38Z</dc:date>
    </item>
    <item>
      <title>Re: Cluster similar strings into group</title>
      <link>https://communities.sas.com/t5/SAS-Programming/Cluster-similar-strings-into-group/m-p/738791#M230507</link>
      <description>&lt;P&gt;If you have a lot of observations, any approach will be time-consuming.&lt;/P&gt;
&lt;P&gt;The idea of comparing pairs is interesting, but I would start by cleaning the strings and then use complev() on the cleaned name variable.&lt;/P&gt;
&lt;P&gt;Cleaning = compress() + lowcase() + compbl()&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;Here is an idea:&lt;/P&gt;
&lt;PRE&gt;&lt;CODE class=" language-sas"&gt;data have;
   length Group $ 1 Name $ 100;
   input Group Name &amp;amp;;

   datalines;
A  Cox &amp;amp; Wilson, PC, CPA
A  Cox &amp;amp; Wilson, PC, CPAs
A  Cox &amp;amp; Wilson, PC, CPAs
A  Jonathan D. Liner
A  Jonathan Liner
B  Memphis Light, Gas &amp;amp; Water
B  Memphis Light, Gas &amp;amp; Water
B  Memphis, Light, Gas and Water
C  Homer Electric Assn
C  Homer Electric Association
C  Homer Electric Association
;

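/* Normalize the names so trivial differences do not inflate the distance */
data cleaned;
   set have;
   
   length NameCompare $ 100;
   
   /* compress(..., 'p') strips punctuation, compbl() collapses repeated blanks, lowcase() ignores case */
   NameCompare = lowcase(compbl(compress(Name,, 'p')));
run;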
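
/* A possible extra step (an assumption, not part of the original idea):
   the next step compares each name only with the row before it, so
   sorting by Group and the cleaned name helps place similar names
   next to each other. It also satisfies the BY Group requirement.
   Alphabetical order is only a heuristic: names that differ near the
   start of the string can still end up far apart. */
proc sort data=cleaned;
   by Group NameCompare;
run;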


data want;
   set cleaned;
   by Group;
   
   length 
      NameGroup 8
      LastName $ 100
      Distance 8
   ;
   
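   /* NameGroup is a running counter that carries over from row to row */
   retain NameGroup 0;
      
   /* lag() must run on every row, so call it outside the if/else */
   LastName = lag(NameCompare);
   
   if first.Group then do;
      /* a new Group always starts a new NameGroup */
      NameGroup = NameGroup + 1;
      LastName = ' ';      
   end;
   else do;
      /* edit distance to the previous row only, so similar names must be adjacent */
      Distance = complev(NameCompare, LastName);
      
      /* finding the right threshold value will be interesting */
      if Distance &amp;gt; 10 then do;
         NameGroup = NameGroup + 1;
      end;
   end;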
run;&lt;/CODE&gt;&lt;/PRE&gt;</description>
      <pubDate>Tue, 04 May 2021 06:49:13 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/Cluster-similar-strings-into-group/m-p/738791#M230507</guid>
      <dc:creator>andreas_lds</dc:creator>
      <dc:date>2021-05-04T06:49:13Z</dc:date>
    </item>
    <item>
      <title>Re: Cluster similar strings into group</title>
      <link>https://communities.sas.com/t5/SAS-Programming/Cluster-similar-strings-into-group/m-p/738831#M230522</link>
      <description>Or try cosine similarity:&lt;BR /&gt;&lt;BR /&gt;&lt;A href="https://blogs.sas.com/content/iml/2019/09/05/cosine-similarity-recommendations.html" target="_blank"&gt;https://blogs.sas.com/content/iml/2019/09/05/cosine-similarity-recommendations.html&lt;/A&gt;&lt;BR /&gt;&lt;A href="https://blogs.sas.com/content/iml/2019/09/03/cosine-similarity.html" target="_blank"&gt;https://blogs.sas.com/content/iml/2019/09/03/cosine-similarity.html&lt;/A&gt;</description>
      <pubDate>Tue, 04 May 2021 10:15:46 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/Cluster-similar-strings-into-group/m-p/738831#M230522</guid>
      <dc:creator>Ksharp</dc:creator>
      <dc:date>2021-05-04T10:15:46Z</dc:date>
    </item>
  </channel>
</rss>

