<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: Search Project List for Similar Names (Fuzzy match and Search) in SAS Programming</title>
    <link>https://communities.sas.com/t5/SAS-Programming/Search-Project-List-for-Similar-Names-Fuzzy-match-and-Search/m-p/799465#M314373</link>
    <description>&lt;P&gt;I might suggest describing "similar" in some more detail. Since your example happens to list 4 types of cookies, I might say that they are all similar.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;Without a more concrete example, I might look to see if there are common contractions or spelling differences and try standardizing those.&lt;/P&gt;
&lt;P&gt;I used to deal with some of our local department of education data and had to match schools from year to year. Amazingly, even when the name didn't change, the spelling in the available records would: "HS", "High School", "High", and "Sr High", just for a short list. Creating a new name that replaced all of those with a standard spelling such as "HS" improved the match rates.&lt;/P&gt;
&lt;P&gt;Order of changes can matter since some of the data would have "Junior High" so replacing "High" alone would be wrong.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;With standard spellings in both you might improve the "exact" match rate as &lt;a href="https://communities.sas.com/t5/user/viewprofilepage/user-id/13879"&gt;@Reeza&lt;/a&gt; suggested and then have a much smaller fuzzy match pool to play with.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;You might also look into COMPLEV, as it supposedly executes quicker than COMPGED.&lt;/P&gt;</description>
    <pubDate>Tue, 01 Mar 2022 23:16:00 GMT</pubDate>
    <dc:creator>ballardw</dc:creator>
    <dc:date>2022-03-01T23:16:00Z</dc:date>
    <item>
      <title>Search Project List for Similar Names (Fuzzy match and Search)</title>
      <link>https://communities.sas.com/t5/SAS-Programming/Search-Project-List-for-Similar-Names-Fuzzy-match-and-Search/m-p/799447#M314359</link>
      <description>&lt;P&gt;Hello - I've checked the forum for an answer to my question but no luck so far.&amp;nbsp; Thank you all in advance for any help you can provide and my apologies if a post exists already that I missed.&amp;nbsp; I have a list of project names, and I want to search for&amp;nbsp;&lt;EM&gt;similar&lt;/EM&gt; project names in the same dataset.&amp;nbsp; My current method involves using a macro and a CALL EXECUTE with the list of projects, and a proc append to keep only those records with close matches.&amp;nbsp; I've put simplified code below and have a simplified HAVE table.&amp;nbsp; I have BASE and STAT SAS.&amp;nbsp; While my current method does work, it takes a long time to execute (100k unique projects being searched).&amp;nbsp; Does anyone have any recommendations to create more efficient code?&amp;nbsp; Current method takes HOURS to run.&amp;nbsp; From the searches I've done, HASH tables might be an option, or some kind of indexing, but I'm not sure how to create that properly.&amp;nbsp; Thank you for your help!&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&lt;FONT face="arial,helvetica,sans-serif"&gt;&lt;STRONG&gt;&lt;FONT color="#3366FF"&gt;Simplified Current Code:&lt;/FONT&gt;&lt;/STRONG&gt;&lt;/FONT&gt;&lt;/P&gt;
&lt;P&gt;&lt;FONT face="arial,helvetica,sans-serif"&gt;&lt;STRONG&gt;&lt;FONT color="#3366FF"&gt;%MACRO MNAME (PROJECT);&lt;/FONT&gt;&lt;/STRONG&gt;&lt;/FONT&gt;&lt;/P&gt;
&lt;P&gt;&lt;FONT face="arial,helvetica,sans-serif"&gt;&lt;STRONG&gt;&lt;FONT color="#3366FF"&gt;DATA TEST;&lt;/FONT&gt;&lt;/STRONG&gt;&lt;/FONT&gt;&lt;BR /&gt;&lt;FONT face="arial,helvetica,sans-serif"&gt;&lt;STRONG&gt;&lt;FONT color="#3366FF"&gt;SET FULL_LIST;&lt;/FONT&gt;&lt;/STRONG&gt;&lt;/FONT&gt;&lt;BR /&gt;&lt;FONT face="arial,helvetica,sans-serif"&gt;&lt;STRONG&gt;&lt;FONT color="#3366FF"&gt;FUZZY = COMPGED("&amp;amp;project.",PROJ_NAME);&lt;/FONT&gt;&lt;/STRONG&gt;&lt;/FONT&gt;&lt;BR /&gt;&lt;FONT face="arial,helvetica,sans-serif"&gt;&lt;STRONG&gt;&lt;FONT color="#3366FF"&gt;IF FUZZY &amp;lt; 1000;&lt;/FONT&gt;&lt;/STRONG&gt;&lt;/FONT&gt;&lt;BR /&gt;&lt;FONT face="arial,helvetica,sans-serif"&gt;&lt;STRONG&gt;&lt;FONT color="#3366FF"&gt;RUN;&lt;/FONT&gt;&lt;/STRONG&gt;&lt;/FONT&gt;&lt;/P&gt;
&lt;P&gt;&lt;FONT face="arial,helvetica,sans-serif"&gt;&lt;STRONG&gt;&lt;FONT color="#3366FF"&gt;%MEND;&lt;/FONT&gt;&lt;/STRONG&gt;&lt;/FONT&gt;&lt;/P&gt;
&lt;P&gt;&lt;FONT face="arial,helvetica,sans-serif"&gt;&lt;STRONG&gt;&lt;FONT color="#3366FF"&gt;DATA _NULL_;&lt;/FONT&gt;&lt;/STRONG&gt;&lt;/FONT&gt;&lt;BR /&gt;&lt;FONT face="arial,helvetica,sans-serif"&gt;&lt;STRONG&gt;&lt;FONT color="#3366FF"&gt;SET full_list;&lt;/FONT&gt;&lt;/STRONG&gt;&lt;/FONT&gt;&lt;BR /&gt;&lt;FONT face="arial,helvetica,sans-serif"&gt;&lt;STRONG&gt;&lt;FONT color="#3366FF"&gt;CALL EXECUTE ('%MNAME ('||TRIM(proj_name)||');');&lt;/FONT&gt;&lt;/STRONG&gt;&lt;/FONT&gt;&lt;BR /&gt;&lt;FONT face="arial,helvetica,sans-serif"&gt;&lt;STRONG&gt;&lt;FONT color="#3366FF"&gt;RUN;&lt;/FONT&gt;&lt;/STRONG&gt;&lt;/FONT&gt;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&lt;STRONG&gt;HAVE:&lt;/STRONG&gt;&lt;/P&gt;
&lt;P&gt;data have;&lt;BR /&gt;infile datalines dlm ='09'x;&lt;BR /&gt;input company $ proj_name $25.;&lt;BR /&gt;datalines;&lt;BR /&gt;A CHOC CHIP RECIPE&lt;BR /&gt;B CHOCOLATE CHIP RECIPE&lt;BR /&gt;C SUGAR&lt;BR /&gt;D OATMEAL RAISIN&lt;BR /&gt;;&lt;BR /&gt;RUN;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;WANT:&amp;nbsp;&lt;/P&gt;
&lt;P&gt;Lines 1 / 2 are similar&lt;/P&gt;
&lt;P&gt;C - no project matches&lt;/P&gt;
&lt;P&gt;D - no project matches&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Tue, 01 Mar 2022 20:31:33 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/Search-Project-List-for-Similar-Names-Fuzzy-match-and-Search/m-p/799447#M314359</guid>
      <dc:creator>adornodj</dc:creator>
      <dc:date>2022-03-01T20:31:33Z</dc:date>
    </item>
    <item>
      <title>Re: Search Project List for Similar Names (Fuzzy match and Search)</title>
      <link>https://communities.sas.com/t5/SAS-Programming/Search-Project-List-for-Similar-Names-Fuzzy-match-and-Search/m-p/799451#M314362</link>
      <description>When you do fuzzy matching you compare every observation against every other observation. &lt;BR /&gt;That's 100,000*100,000 comparisons or 1*10^10 or about 10 billion if my math is right (it may not be). Point being - fuzzy matching is time-intensive. &lt;BR /&gt;&lt;BR /&gt;Usually the recommendation is to first do an exact match and remove those records to simplify the analysis. &lt;BR /&gt;Then go into fuzzy matching. &lt;BR /&gt;&lt;BR /&gt;FYI - do you have SAS DQ studio? It does a good job with this type of problem.</description>
      <pubDate>Tue, 01 Mar 2022 20:48:54 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/Search-Project-List-for-Similar-Names-Fuzzy-match-and-Search/m-p/799451#M314362</guid>
      <dc:creator>Reeza</dc:creator>
      <dc:date>2022-03-01T20:48:54Z</dc:date>
    </item>
    <item>
      <title>Re: Search Project List for Similar Names (Fuzzy match and Search)</title>
      <link>https://communities.sas.com/t5/SAS-Programming/Search-Project-List-for-Similar-Names-Fuzzy-match-and-Search/m-p/799452#M314363</link>
      <description>&lt;P&gt;Unfortunately I do not have DQ Studio.&amp;nbsp; Yes, this would be an insane number of comparisons.&amp;nbsp; Before posting I added a bit of code that removes records I've already searched from the population, but that removes only 1 record from my search list at a time.&amp;nbsp; Not the most efficient either.&lt;/P&gt;</description>
      <pubDate>Tue, 01 Mar 2022 21:00:58 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/Search-Project-List-for-Similar-Names-Fuzzy-match-and-Search/m-p/799452#M314363</guid>
      <dc:creator>adornodj</dc:creator>
      <dc:date>2022-03-01T21:00:58Z</dc:date>
    </item>
    <item>
      <title>Re: Search Project List for Similar Names (Fuzzy match and Search)</title>
      <link>https://communities.sas.com/t5/SAS-Programming/Search-Project-List-for-Similar-Names-Fuzzy-match-and-Search/m-p/799459#M314368</link>
      <description>&lt;P&gt;Would the SQL version run any faster?:&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;PRE&gt;&lt;CODE class=" language-sas"&gt;data have;
input company $ proj_name $25.;
datalines;
A CHOC_CHIP_RECIPE
B CHOCOLATE_CHIP_RECIPE
C SUGAR
D OATMEAL_RAISIN
;
RUN;

proc sql;
	select a.proj_name as list1, b.proj_name as list2, compged(a.proj_name, b.proj_name) as fuzzy_match
	from have a, have b
	where a.proj_name &amp;lt;&amp;gt; b.proj_name and compged(a.proj_name, b.proj_name) &amp;lt;600;
quit;&lt;/CODE&gt;&lt;/PRE&gt;
&lt;P&gt;Output is&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;TABLE width="463"&gt;
&lt;TBODY&gt;
&lt;TR&gt;
&lt;TD width="186"&gt;list1&lt;/TD&gt;
&lt;TD width="186"&gt;list2&lt;/TD&gt;
&lt;TD width="91"&gt;fuzzy_match&lt;/TD&gt;
&lt;/TR&gt;
&lt;TR&gt;
&lt;TD&gt;CHOC_CHIP_RECIPE&lt;/TD&gt;
&lt;TD&gt;CHOCOLATE_CHIP_RECIPE&lt;/TD&gt;
&lt;TD&gt;500&lt;/TD&gt;
&lt;/TR&gt;
&lt;TR&gt;
&lt;TD&gt;CHOCOLATE_CHIP_RECIPE&lt;/TD&gt;
&lt;TD&gt;CHOC_CHIP_RECIPE&lt;/TD&gt;
&lt;TD&gt;500&lt;/TD&gt;
&lt;/TR&gt;
&lt;/TBODY&gt;
&lt;/TABLE&gt;</description>
      <pubDate>Tue, 01 Mar 2022 21:54:17 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/Search-Project-List-for-Similar-Names-Fuzzy-match-and-Search/m-p/799459#M314368</guid>
      <dc:creator>HB</dc:creator>
      <dc:date>2022-03-01T21:54:17Z</dc:date>
    </item>
    <item>
      <title>Re: Search Project List for Similar Names (Fuzzy match and Search)</title>
      <link>https://communities.sas.com/t5/SAS-Programming/Search-Project-List-for-Similar-Names-Fuzzy-match-and-Search/m-p/799460#M314369</link>
      <description>SQL will likely run out of memory with 100,000 rows and a cross join.</description>
      <pubDate>Tue, 01 Mar 2022 21:55:12 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/Search-Project-List-for-Similar-Names-Fuzzy-match-and-Search/m-p/799460#M314369</guid>
      <dc:creator>Reeza</dc:creator>
      <dc:date>2022-03-01T21:55:12Z</dc:date>
    </item>
    <item>
      <title>Re: Search Project List for Similar Names (Fuzzy match and Search)</title>
      <link>https://communities.sas.com/t5/SAS-Programming/Search-Project-List-for-Similar-Names-Fuzzy-match-and-Search/m-p/799463#M314371</link>
      <description>&lt;a href="https://communities.sas.com/t5/user/viewprofilepage/user-id/13879"&gt;@Reeza&lt;/a&gt;  Probably.</description>
      <pubDate>Tue, 01 Mar 2022 22:35:20 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/Search-Project-List-for-Similar-Names-Fuzzy-match-and-Search/m-p/799463#M314371</guid>
      <dc:creator>HB</dc:creator>
      <dc:date>2022-03-01T22:35:20Z</dc:date>
    </item>
    <item>
      <title>Re: Search Project List for Similar Names (Fuzzy match and Search)</title>
      <link>https://communities.sas.com/t5/SAS-Programming/Search-Project-List-for-Similar-Names-Fuzzy-match-and-Search/m-p/799465#M314373</link>
      <description>&lt;P&gt;I might suggest describing "similar" in some more detail. Since your example happens to list 4 types of cookies, I might say that they are all similar.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;Without a more concrete example, I might look to see if there are common contractions or spelling differences and try standardizing those.&lt;/P&gt;
&lt;P&gt;I used to deal with some of our local department of education data and had to match schools from year to year. Amazingly, even when the name didn't change, the spelling in the available records would: "HS", "High School", "High", and "Sr High", just for a short list. Creating a new name that replaced all of those with a standard spelling such as "HS" improved the match rates.&lt;/P&gt;
&lt;P&gt;Order of changes can matter since some of the data would have "Junior High" so replacing "High" alone would be wrong.&lt;/P&gt;
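&lt;P&gt;A minimal sketch of that kind of ordered standardization. The data set SCHOOLS, the variable NAME, the sample school names, and the "JHS" abbreviation are all made up for illustration; only the "HS" variants come from the description above. The junior high rules run first and the bare "High" rule runs last, so "Junior High" is never turned into "Junior HS":&lt;/P&gt;
&lt;PRE&gt;&lt;CODE class=" language-sas"&gt;/* Hypothetical sample data: the school names are invented for this sketch */
data schools;
   infile datalines truncover;
   input name $50.;
   datalines;
Lincoln High School
Lincoln HS
Lincoln Sr High
Lincoln High
Lincoln Junior High
;
run;

data schools_std;
   set schools;
   length std_name $50;
   /* Upper-case and collapse repeated blanks so the rules match consistently */
   std_name = compbl(upcase(name));
   /* Junior high first, so the bare HIGH rule below cannot touch it */
   std_name = tranwrd(std_name, 'JUNIOR HIGH SCHOOL', 'JHS');
   std_name = tranwrd(std_name, 'JUNIOR HIGH', 'JHS');
   /* Longer phrases before shorter ones */
   std_name = tranwrd(std_name, 'SR HIGH SCHOOL', 'HS');
   std_name = tranwrd(std_name, 'HIGH SCHOOL', 'HS');
   std_name = tranwrd(std_name, 'SR HIGH', 'HS');
   /* TRANWRD matches substrings, so this rule would also change HIGHLAND; check your data */
   std_name = tranwrd(std_name, 'HIGH', 'HS');
run;&lt;/CODE&gt;&lt;/PRE&gt;
&lt;P&gt;With this sample, the four high school spellings should all end up as LINCOLN HS while the junior high row stays distinct. Real data will likely need more rules than these.&lt;/P&gt;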
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;With standard spellings in both you might improve the "exact" match rate as &lt;a href="https://communities.sas.com/t5/user/viewprofilepage/user-id/13879"&gt;@Reeza&lt;/a&gt; suggested and then have a much smaller fuzzy match pool to play with.&lt;/P&gt;
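&lt;P&gt;One way to sketch that two-step flow, reusing the sample names posted earlier in the thread. The extra row E is an assumption added here so there is an exact duplicate to collapse, the table names DISTINCT_NAMES and FUZZY_PAIRS are hypothetical, and the cutoff of 600 is simply borrowed from the SQL example above, not a recommendation. This only illustrates the logic; as already noted in the thread, a cross join over 100,000 rows may not fit in memory.&lt;/P&gt;
&lt;PRE&gt;&lt;CODE class=" language-sas"&gt;/* Sample data from earlier in the thread, plus a hypothetical duplicate (row E) */
data have;
   input company $ proj_name :$25.;
   datalines;
A CHOC_CHIP_RECIPE
B CHOCOLATE_CHIP_RECIPE
C SUGAR
D OATMEAL_RAISIN
E SUGAR
;
run;

proc sql;
   /* Step 1: exact matches collapse to one row per distinct name */
   create table distinct_names as
   select distinct proj_name
   from have;

   /* Step 2: fuzzy-compare only the distinct names; a.proj_name &amp;lt; b.proj_name
      keeps each unordered pair once and skips self-comparisons */
   create table fuzzy_pairs as
   select a.proj_name as name1,
          b.proj_name as name2,
          compged(a.proj_name, b.proj_name) as ged
   from distinct_names as a, distinct_names as b
   where a.proj_name &amp;lt; b.proj_name
     and compged(a.proj_name, b.proj_name) &amp;lt; 600;
quit;&lt;/CODE&gt;&lt;/PRE&gt;
&lt;P&gt;Keeping each pair once roughly halves the number of comparisons, and every exact duplicate removed in step 1 shrinks the pool further.&lt;/P&gt;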
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;You might also look into COMPLEV, as it supposedly executes quicker than COMPGED.&lt;/P&gt;</description>
      <pubDate>Tue, 01 Mar 2022 23:16:00 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/Search-Project-List-for-Similar-Names-Fuzzy-match-and-Search/m-p/799465#M314373</guid>
      <dc:creator>ballardw</dc:creator>
      <dc:date>2022-03-01T23:16:00Z</dc:date>
    </item>
    <item>
      <title>Re: Search Project List for Similar Names (Fuzzy match and Search)</title>
      <link>https://communities.sas.com/t5/SAS-Programming/Search-Project-List-for-Similar-Names-Fuzzy-match-and-Search/m-p/799481#M314379</link>
      <description>&lt;P&gt;I want to thank those who have posted so far.&amp;nbsp; It's been a helpful discussion.&amp;nbsp; I'm going to look further into breaking the population down to reduce the size of the search and see how much that helps.&lt;/P&gt;</description>
      <pubDate>Wed, 02 Mar 2022 03:12:09 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/Search-Project-List-for-Similar-Names-Fuzzy-match-and-Search/m-p/799481#M314379</guid>
      <dc:creator>adornodj</dc:creator>
      <dc:date>2022-03-02T03:12:09Z</dc:date>
    </item>
  </channel>
</rss>

