<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: Find possible matches between two lists: SOUNDEX or something else? in SAS Programming</title>
    <link>https://communities.sas.com/t5/SAS-Programming/Find-possible-matches-between-two-lists-SOUNDEX-or-something/m-p/887475#M350623</link>
    <description>Thanks, you've given me some good ideas.</description>
    <pubDate>Wed, 02 Aug 2023 12:05:20 GMT</pubDate>
    <dc:creator>Doug_in_STL</dc:creator>
    <dc:date>2023-08-02T12:05:20Z</dc:date>
    <item>
      <title>Find possible matches between two lists: SOUNDEX or something else?</title>
      <link>https://communities.sas.com/t5/SAS-Programming/Find-possible-matches-between-two-lists-SOUNDEX-or-something/m-p/887347#M350583</link>
      <description>&lt;P&gt;SAS 9.4(TS1M8)&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;I've got lists of clinical laboratory tests from two different sources in datasets ListA and ListB. I need to evaluate each entry in ListA and find probable/possible/likely matches from ListB.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Here's a very simple example, one obs in each dataset.&lt;/P&gt;&lt;P&gt;dsn=ListA&lt;/P&gt;&lt;P&gt;varname=TestNameListA&lt;/P&gt;&lt;P&gt;COMPLETE CBC W/AUTO DIFF WBC&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;dsn=ListB&lt;/P&gt;&lt;P&gt;varname=TestNameListB&lt;/P&gt;&lt;P&gt;CBC with Differential&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Visually and intuitively, I know these are both the same. How can SAS tell me these are the same or have a high probability of being the same?&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;I've tried SOUNDEX, but I'm sure what to do with the results.&lt;/P&gt;&lt;P&gt;word=soundex('COMPLETE CBC W/AUTO DIFF WBC');&lt;BR /&gt;put 'SoundEx1=' word;&lt;BR /&gt;word=soundex('CBC with Differential');&lt;BR /&gt;put 'SoundEx2=' word;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;results:&lt;/P&gt;&lt;P&gt;SoundEx1=C5143212333112&lt;BR /&gt;SoundEx2=C123316534&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;I can see the substring "1233" is in common between the two, but does this tell me anything?&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Is there a better function to use?&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Thanks!&lt;/P&gt;</description>
      <pubDate>Tue, 01 Aug 2023 17:06:12 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/Find-possible-matches-between-two-lists-SOUNDEX-or-something/m-p/887347#M350583</guid>
      <dc:creator>Doug_in_STL</dc:creator>
      <dc:date>2023-08-01T17:06:12Z</dc:date>
    </item>
    <item>
      <title>Re: Find possible matches between two lists: SOUNDEX or something else?</title>
      <link>https://communities.sas.com/t5/SAS-Programming/Find-possible-matches-between-two-lists-SOUNDEX-or-something/m-p/887378#M350585</link>
      <description>&lt;P&gt;Any approach attempting to match strings of considerably different length is going to have some difficulty just because of the encoding.&lt;/P&gt;
&lt;P&gt;Soundex for your problem may have another weekness. From the documention:&lt;/P&gt;
&lt;BLOCKQUOTE&gt;
&lt;DIV class="xis-refDictEntry"&gt;
&lt;DIV class="xis-details"&gt;
&lt;DIV id="p0my7cb1fvoekun1uu50h5bqkiqk" class="xis-subTopic"&gt;
&lt;DIV id="p0uexrf5wtsq7ln1rycds3vpxa82" class="xis-topicContent"&gt;
&lt;DIV id="p0hxndqruheb7xn1t2nox2lyc11a" class="xis-paragraph"&gt;Note that the &lt;FONT style="background-color: #fcdec0;"&gt;SOUNDEX&lt;/FONT&gt; algorithm is English-biased and is less useful for languages other than English.&lt;/DIV&gt;
&lt;/DIV&gt;
&lt;/DIV&gt;
&lt;/DIV&gt;
&lt;/DIV&gt;
&lt;/BLOCKQUOTE&gt;
&lt;P&gt;While the language may be English abbreviations and likely not-common words appearing in clinical data may be treated by the algorithm as "not exactly English"&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;I might look to COMPGED and COMPLEV which basically return a numeric value related to spelling changes that is larger the more changes are necessary to get to the other string. There options to allow ignoring case in comparisons.&lt;/P&gt;
&lt;P&gt;Smaller values are more likely to "match". So you can sort by the score returned and get a feel for how well any of this is doing.&lt;/P&gt;
&lt;P&gt;Maybe:&lt;/P&gt;
&lt;PRE&gt;proc sql;
   create table score as
   select a.TestNameListA, b.TestNameListB
         ,compged( a.TestNameListA, b.TestNameListB) as score
   from (select distinct TestNameListA from ListA)
       ,(select distinct TestNameListB from ListB)
   order by score
   ;
quit;&lt;/PRE&gt;
&lt;P&gt;Look in the options for Compged to see which may be applicable. Case insensitive may be likely with lots of acronyms.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;You might be better off looking for Key values first though. You may have ideas what to look for.&amp;nbsp; Other variables may also give you more information to use on even making the comparisons. I have a project where I have see if children from two sets are "matches". I have last name (usually), first name, date of birth and gender. So first I look for complete matches on all 4 variables. Then match lastname, date of birth and gender and compare the spelling of First name.&amp;nbsp; That turns up differences like "Marie" and "Maria" or "Mark" and "Marc" as likely&amp;nbsp; matches (low compged scores) but "Xavier" and "Robert" not so much.&lt;/P&gt;
&lt;P&gt;Then after identifying those as matches go on to look at Firstname Lastname combinations (James Smyth vs Jim Smith) for example with matching dates of birth and gender.&lt;/P&gt;
&lt;P&gt;Then look at matching dates of birth alone and name.&lt;/P&gt;
&lt;P&gt;Then matching gender only.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;But this only involves a couple 100 records. This is labor intensive.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Tue, 01 Aug 2023 19:38:21 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/Find-possible-matches-between-two-lists-SOUNDEX-or-something/m-p/887378#M350585</guid>
      <dc:creator>ballardw</dc:creator>
      <dc:date>2023-08-01T19:38:21Z</dc:date>
    </item>
    <item>
      <title>Re: Find possible matches between two lists: SOUNDEX or something else?</title>
      <link>https://communities.sas.com/t5/SAS-Programming/Find-possible-matches-between-two-lists-SOUNDEX-or-something/m-p/887475#M350623</link>
      <description>Thanks, you've given me some good ideas.</description>
      <pubDate>Wed, 02 Aug 2023 12:05:20 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/Find-possible-matches-between-two-lists-SOUNDEX-or-something/m-p/887475#M350623</guid>
      <dc:creator>Doug_in_STL</dc:creator>
      <dc:date>2023-08-02T12:05:20Z</dc:date>
    </item>
  </channel>
</rss>

