<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: Find possible duplicate in one dataset in SAS Programming</title>
    <link>https://communities.sas.com/t5/SAS-Programming/Find-possible-duplicate-in-one-dataset/m-p/393891#M94892</link>
    <description>&lt;P&gt;Combining everything into one field may introduce more problems than you think.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;When I have a project like this, I use a free tool developed by the CDC called LinkPlus, which is a probabilistic matching program available at &lt;A href="https://www.cdc.gov/cancer/npcr/tools/registryplus/lp_tech_info.htm" target="_blank"&gt;https://www.cdc.gov/cancer/npcr/tools/registryplus/lp_tech_info.htm&lt;/A&gt;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;The program returns a probability of a match and indicators of which records it may match.&lt;/P&gt;</description>
    <pubDate>Thu, 07 Sep 2017 14:35:19 GMT</pubDate>
    <dc:creator>ballardw</dc:creator>
    <dc:date>2017-09-07T14:35:19Z</dc:date>
    <item>
      <title>Find possible duplicate in one dataset</title>
      <link>https://communities.sas.com/t5/SAS-Programming/Find-possible-duplicate-in-one-dataset/m-p/393851#M94885</link>
      <description>&lt;P&gt;Hello everyone, I have a database with a lot of rubbish in it. We will soon change databases, and I'd like to clean the data as much as possible before the migration.&lt;BR /&gt;I'm trying to identify possible duplicate records within one dataset that are not exact matches but are similar.&lt;BR /&gt;For example:&lt;BR /&gt;Record 1 - Name=John, Surname=Doe, Address=Fake Street&lt;BR /&gt;Record 2 - Name=Jonh, Surname=Doe Joe, Address=F. Street&lt;BR /&gt;&lt;BR /&gt;My idea is to create a single string from name, surname, and address without spaces (for example, johndoefakestreet), compare each record with every other record in the same&amp;nbsp;dataset (approximately 800k records) using the COMPGED function, and keep only the record with the smallest value in order to identify possible duplicates (which I know are present).&lt;BR /&gt;&lt;BR /&gt;I don't know how to perform this operation, or whether there is an easier way to do it.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;I'm using SAS 9.4. I hope it's clear what I'm trying to do.&lt;BR /&gt;&lt;BR /&gt;Thanks!&lt;/P&gt;</description>
      <pubDate>Thu, 07 Sep 2017 13:01:14 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/Find-possible-duplicate-in-one-dataset/m-p/393851#M94885</guid>
      <dc:creator>gspila</dc:creator>
      <dc:date>2017-09-07T13:01:14Z</dc:date>
    </item>
    <item>
      <title>Re: Find possible duplicate in one dataset</title>
      <link>https://communities.sas.com/t5/SAS-Programming/Find-possible-duplicate-in-one-dataset/m-p/393853#M94886</link>
      <description>&lt;P&gt;Do a PROC FREQ on your data; this way you will get a list of distinct values and how many times each value appears. You can then use that output to see where cleaning can be done. Iterate that process, and as the list shrinks it should go quicker.&lt;/P&gt;</description>
      <pubDate>Thu, 07 Sep 2017 13:12:05 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/Find-possible-duplicate-in-one-dataset/m-p/393853#M94886</guid>
      <dc:creator>RW9</dc:creator>
      <dc:date>2017-09-07T13:12:05Z</dc:date>
    </item>
    <item>
      <title>Re: Find possible duplicate in one dataset</title>
      <link>https://communities.sas.com/t5/SAS-Programming/Find-possible-duplicate-in-one-dataset/m-p/393891#M94892</link>
      <description>&lt;P&gt;Combining everything into one field may introduce more problems than you think.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;When I have a project like this, I use a free tool developed by the CDC called LinkPlus, which is a probabilistic matching program available at &lt;A href="https://www.cdc.gov/cancer/npcr/tools/registryplus/lp_tech_info.htm" target="_blank"&gt;https://www.cdc.gov/cancer/npcr/tools/registryplus/lp_tech_info.htm&lt;/A&gt;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;The program returns a probability of a match and indicators of which records it may match.&lt;/P&gt;</description>
      <pubDate>Thu, 07 Sep 2017 14:35:19 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/Find-possible-duplicate-in-one-dataset/m-p/393891#M94892</guid>
      <dc:creator>ballardw</dc:creator>
      <dc:date>2017-09-07T14:35:19Z</dc:date>
    </item>
    <item>
      <title>Re: Find possible duplicate in one dataset</title>
      <link>https://communities.sas.com/t5/SAS-Programming/Find-possible-duplicate-in-one-dataset/m-p/393903#M94895</link>
      <description>&lt;P&gt;I'll second&amp;nbsp;&lt;a href="https://communities.sas.com/t5/user/viewprofilepage/user-id/13884"&gt;@ballardw&lt;/a&gt;'s suggestion.&lt;/P&gt;
&lt;P&gt;Otherwise for fuzzy matches look at:&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;The SPEDIS and COMPGED functions, as well as the post linked below, which has a good SQL example of doing this type of multiple match with a semi-brute-force method.&lt;/P&gt;
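&lt;P&gt;As a rough illustration of the semi-brute-force idea (a sketch only, not the code from the linked post), something like the following could work. The dataset name PEOPLE, the unique ID variable, the BLOCK key, and the cutoff of 300 are all assumptions made up for this example. It compares name, surname, and address separately with COMPGED instead of as one combined string, and only pairs records that share the same first letters, since a full 800k-by-800k comparison is unlikely to finish in reasonable time.&lt;/P&gt;

```sas
/* Hypothetical input: WORK.PEOPLE with a unique numeric ID and the
   character variables NAME, SURNAME and ADDRESS. */
data people_clean;
  set people;
  length name_c surname_c address_c $100 block $2;
  /* Upper-case and strip blanks and punctuation before comparing;
     the 'ps' modifiers add punctuation and space characters to the
     list of characters that COMPRESS removes. */
  name_c    = compress(upcase(name), , 'ps');
  surname_c = compress(upcase(surname), , 'ps');
  address_c = compress(upcase(address), , 'ps');
  /* Assumed blocking key: first letter of surname plus first letter of
     name. Pairs that differ in these letters are never compared. */
  block = cats(substr(surname_c, 1, 1), substr(name_c, 1, 1));
run;

proc sql;
  create table possible_dups as
  select a.id as id_a,
         b.id as id_b,
         /* Sum of generalized edit distances, one per field */
         compged(a.name_c, b.name_c)
           + compged(a.surname_c, b.surname_c)
           + compged(a.address_c, b.address_c) as total_cost
  from people_clean as a
       inner join people_clean as b
       on a.block = b.block
      and a.id lt b.id           /* each pair once, no self-matches */
  where calculated total_cost le 300   /* arbitrary cutoff, tune it */
  order by total_cost;
quit;
```

&lt;P&gt;Review the lowest-cost pairs in POSSIBLE_DUPS first and adjust the cutoff and the blocking key to your data. A coarse key like this one misses duplicates whose first letters differ, so it is a trade-off between run time and coverage.&lt;/P&gt;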
&lt;P&gt;&lt;A href="https://communities.sas.com/t5/SAS-Procedures/Name-matching/td-p/82780" target="_blank"&gt;https://communities.sas.com/t5/SAS-Procedures/Name-matching/td-p/82780&lt;/A&gt;&lt;/P&gt;</description>
      <pubDate>Thu, 07 Sep 2017 14:52:53 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/Find-possible-duplicate-in-one-dataset/m-p/393903#M94895</guid>
      <dc:creator>Reeza</dc:creator>
      <dc:date>2017-09-07T14:52:53Z</dc:date>
    </item>
  </channel>
</rss>

