<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: Finding non-exact matches in one dataset in SAS Procedures</title>
    <link>https://communities.sas.com/t5/SAS-Procedures/Finding-non-exact-matches-in-one-dataset/m-p/89011#M25353</link>
    <description>&lt;HTML&gt;&lt;HEAD&gt;&lt;/HEAD&gt;&lt;BODY&gt;&lt;P&gt;Take a look at: &lt;A href="http://www.sconsig.com/sastips/tip00000.htm" title="http://www.sconsig.com/sastips/tip00000.htm"&gt;http://www.sconsig.com/sastips/tip00000.htm&lt;/A&gt;&lt;/P&gt;&lt;/BODY&gt;&lt;/HTML&gt;</description>
    <pubDate>Thu, 09 Aug 2012 16:54:30 GMT</pubDate>
    <dc:creator>art297</dc:creator>
    <dc:date>2012-08-09T16:54:30Z</dc:date>
    <item>
      <title>Finding non-exact matches in one dataset</title>
      <link>https://communities.sas.com/t5/SAS-Procedures/Finding-non-exact-matches-in-one-dataset/m-p/89009#M25351</link>
      <description>&lt;HTML&gt;&lt;HEAD&gt;&lt;/HEAD&gt;&lt;BODY&gt;&lt;P&gt;Hello Everyone,&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;I am trying to find non-exact matches within the same dataset.&amp;nbsp;&amp;nbsp; For example, Susan Smith and Sue Smith, would be a match, etc.&amp;nbsp;&amp;nbsp;&amp;nbsp; I have a unique ID variable for every observation, and I want to keep that and create a list of potential matches.&amp;nbsp;&amp;nbsp; Does anyone have any ideas?&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;Thank you.&lt;/P&gt;&lt;/BODY&gt;&lt;/HTML&gt;</description>
      <pubDate>Thu, 09 Aug 2012 16:43:54 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Procedures/Finding-non-exact-matches-in-one-dataset/m-p/89009#M25351</guid>
      <dc:creator>Suzanne099</dc:creator>
      <dc:date>2012-08-09T16:43:54Z</dc:date>
    </item>
    <item>
      <title>Re: Finding non-exact matches in one dataset</title>
      <link>https://communities.sas.com/t5/SAS-Procedures/Finding-non-exact-matches-in-one-dataset/m-p/89010#M25352</link>
      <description>&lt;HTML&gt;&lt;HEAD&gt;&lt;/HEAD&gt;&lt;BODY&gt;&lt;P&gt;You can try the suggestions on this list:&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;&lt;A __default_attr="133940" __jive_macro_name="message" class="jive_macro jive_macro_message" href="https://communities.sas.com/"&gt;&lt;/A&gt;&lt;/P&gt;&lt;/BODY&gt;&lt;/HTML&gt;</description>
      <pubDate>Thu, 09 Aug 2012 16:54:04 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Procedures/Finding-non-exact-matches-in-one-dataset/m-p/89010#M25352</guid>
      <dc:creator>Reeza</dc:creator>
      <dc:date>2012-08-09T16:54:04Z</dc:date>
    </item>
    <item>
      <title>Re: Finding non-exact matches in one dataset</title>
      <link>https://communities.sas.com/t5/SAS-Procedures/Finding-non-exact-matches-in-one-dataset/m-p/89011#M25353</link>
      <description>&lt;HTML&gt;&lt;HEAD&gt;&lt;/HEAD&gt;&lt;BODY&gt;&lt;P&gt;Take a look at: &lt;A href="http://www.sconsig.com/sastips/tip00000.htm" title="http://www.sconsig.com/sastips/tip00000.htm"&gt;http://www.sconsig.com/sastips/tip00000.htm&lt;/A&gt;&lt;/P&gt;&lt;/BODY&gt;&lt;/HTML&gt;</description>
      <pubDate>Thu, 09 Aug 2012 16:54:30 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Procedures/Finding-non-exact-matches-in-one-dataset/m-p/89011#M25353</guid>
      <dc:creator>art297</dc:creator>
      <dc:date>2012-08-09T16:54:30Z</dc:date>
    </item>
    <item>
      <title>Re: Finding non-exact matches in one dataset</title>
      <link>https://communities.sas.com/t5/SAS-Procedures/Finding-non-exact-matches-in-one-dataset/m-p/89012#M25354</link>
      <description>&lt;HTML&gt;&lt;HEAD&gt;&lt;/HEAD&gt;&lt;BODY&gt;&lt;P&gt;If you want a Rolls-Royce solution, there are some good commercial packages out there. But you can get a long way with a little bit of SQL and some knowledge of SOUNDEX and edit distance functions (COMPLEV, COMPGED).&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;SOUNDEX converts a character string to an expression that gives a rough idea of what it sounds like: vowels are omitted and similar-sounding consonants are lumped together. COMPLEV tells you how many single-character edits it takes to convert one string into another.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;For instance, SOUNDEX("John") = J5. SOUNDEX("Johann") also equals J5, and SOUNDEX("Susan") = S25. Using COMPLEV to compare the SOUNDEX values tells us that "John" and "Johann" are very similar (score 0), but "John" and "Susan" are less similar (score 2).&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;COMPGED is a more sophisticated version of COMPLEV that accounts for the sorts of errors that are most commonly made: e.g. "Simon"-&amp;gt;"Simmon" is more likely than "Simon"-&amp;gt;"Simkon".&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;You can use these functions (and others of your choice) to generate a score for possible matches. Each pair of observations ends up with a score, and you use a cutoff to determine which ones should be considered as possible matches. Here's an example:&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;%let soundweight=20;&lt;/P&gt;&lt;P&gt;%let gedisweight=0.1;&lt;/P&gt;&lt;P&gt;%let cutoff=75;&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;data have;&lt;/P&gt;&lt;P&gt;id=_n_;&lt;/P&gt;&lt;P&gt;input firstname $ lastname $;&lt;/P&gt;&lt;P&gt;firstname=upcase(firstname); /* COMPGED is case-sensitive */&lt;/P&gt;&lt;P&gt;lastname=upcase(lastname);&lt;/P&gt;&lt;P&gt;firstnamesound=soundex(firstname);&lt;/P&gt;&lt;P&gt;lastnamesound=soundex(lastname);&lt;/P&gt;&lt;P&gt;cards;&lt;/P&gt;&lt;P&gt;Susan Smith&lt;/P&gt;&lt;P&gt;John Smith&lt;/P&gt;&lt;P&gt;Sue Smith&lt;/P&gt;&lt;P&gt;Johann Schmidt&lt;/P&gt;&lt;P&gt;Sue Jones&lt;/P&gt;&lt;P&gt;Sam Snell&lt;/P&gt;&lt;P&gt;Joe Johnson&lt;/P&gt;&lt;P&gt;Jae Johnston&lt;/P&gt;&lt;P&gt;;&lt;/P&gt;&lt;P&gt;run;&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;proc sql;&lt;/P&gt;&lt;P&gt;create table want as select&lt;/P&gt;&lt;P&gt;a.id as id1, b.id as id2,&lt;/P&gt;&lt;P&gt;a.firstname as firstname1, b.firstname as firstname2,&lt;/P&gt;&lt;P&gt;a.lastname as lastname1, b.lastname as lastname2,&lt;/P&gt;&lt;P&gt;a.firstnamesound as firstnamesound1, b.firstnamesound as firstnamesound2,&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;/* The next few variables don't need to be here - they are recalculated separately in the join condition below.&lt;/P&gt;&lt;P&gt;I've included them here so you can see what these intermediate functions look like before they're combined to&lt;/P&gt;&lt;P&gt;generate the overall match score. */&lt;/P&gt;&lt;P&gt;complev(a.firstnamesound,b.firstnamesound) as firstnamesoundscore,&lt;/P&gt;&lt;P&gt;complev(a.lastnamesound,b.lastnamesound) as lastnamesoundscore,&lt;/P&gt;&lt;P&gt;compged(a.firstname,b.firstname) as firstnameeditscore,&lt;/P&gt;&lt;P&gt;compged(a.lastname,b.lastname) as lastnameeditscore,&lt;/P&gt;&lt;P&gt;&amp;amp;soundweight*(calculated firstnamesoundscore + calculated lastnamesoundscore)&lt;/P&gt;&lt;P&gt;+&amp;amp;gedisweight*(calculated firstnameeditscore + calculated lastnameeditscore) as matchscore&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;from have as a inner join have as b on (a.id &amp;lt; b.id /* prevents duplicates and self-matches */ AND&lt;/P&gt;&lt;P&gt;(&lt;/P&gt;&lt;P&gt;&amp;amp;soundweight*(complev(a.firstnamesound,b.firstnamesound)+complev(a.lastnamesound,b.lastnamesound))&lt;/P&gt;&lt;P&gt;+&amp;amp;gedisweight*(compged(a.firstname,b.firstname)+compged(a.lastname,b.lastname))&lt;/P&gt;&lt;P&gt;)&lt;/P&gt;&lt;P&gt;&amp;lt;&amp;amp;cutoff);&lt;/P&gt;&lt;P&gt;quit;&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;You may want to play around with the weights and the scoring function, especially if you have other data that could be used to enhance the match. Raising the cutoff will increase the likelihood of accepting a match, so you'll get more false positives but fewer false negatives. Lowering it has the reverse effect.&lt;/P&gt;&lt;/BODY&gt;&lt;/HTML&gt;</description>
      <pubDate>Fri, 10 Aug 2012 01:42:57 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Procedures/Finding-non-exact-matches-in-one-dataset/m-p/89012#M25354</guid>
      <dc:creator>GeoffreyBrent</dc:creator>
      <dc:date>2012-08-10T01:42:57Z</dc:date>
    </item>
    <item>
      <title>Re: Finding non-exact matches in one dataset</title>
      <link>https://communities.sas.com/t5/SAS-Procedures/Finding-non-exact-matches-in-one-dataset/m-p/89013#M25355</link>
      <description>&lt;HTML&gt;&lt;HEAD&gt;&lt;/HEAD&gt;&lt;BODY&gt;&lt;P&gt;Check if your site has at least part of the DataFlux modules licensed (eg. part of it comes with DI Studio). If so then you could use the dqmatch() function within SAS which let's you create match codes for fuzzy matching.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;If you have even more DataFlux licensed then it might be worth to also standardize your data - eg.in a way that ' Susan Smith', 'Sue Smith', 'Smith, Susan', Susan G. Smith' all are recognised as the same person and you tranform them all to "Susan Smith".&lt;/P&gt;&lt;/BODY&gt;&lt;/HTML&gt;</description>
      <pubDate>Fri, 10 Aug 2012 10:33:21 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Procedures/Finding-non-exact-matches-in-one-dataset/m-p/89013#M25355</guid>
      <dc:creator>Patrick</dc:creator>
      <dc:date>2012-08-10T10:33:21Z</dc:date>
    </item>
    <item>
      <title>Re: Finding non-exact matches in one dataset</title>
      <link>https://communities.sas.com/t5/SAS-Procedures/Finding-non-exact-matches-in-one-dataset/m-p/89014#M25356</link>
      <description>&lt;HTML&gt;&lt;HEAD&gt;&lt;/HEAD&gt;&lt;BODY&gt;&lt;P&gt;I found this NESUG 2007 paper really helpful; the COMPGED function used here compares every observation of a string variable to all the others, as in Geoffrey's solution, above:&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;&lt;A href="http://www.nesug.org/proceedings/nesug07/ap/ap23.pdf" title="http://www.nesug.org/proceedings/nesug07/ap/ap23.pdf"&gt;http://www.nesug.org/proceedings/nesug07/ap/ap23.pdf&lt;/A&gt;&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;The examples worked perfectly for me (I used the data step as I'm less familiar with PROC SQL). Good luck!&lt;/P&gt;&lt;/BODY&gt;&lt;/HTML&gt;</description>
      <pubDate>Tue, 11 Sep 2012 20:10:20 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Procedures/Finding-non-exact-matches-in-one-dataset/m-p/89014#M25356</guid>
      <dc:creator>DaveHaans</dc:creator>
      <dc:date>2012-09-11T20:10:20Z</dc:date>
    </item>
  </channel>
</rss>

