<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: Finding and merging duplicates in a dataset in SAS Procedures</title>
    <link>https://communities.sas.com/t5/SAS-Procedures/Finding-and-merging-duplicates-in-a-dataset/m-p/110124#M30581</link>
    <description>&lt;HTML&gt;&lt;HEAD&gt;&lt;/HEAD&gt;&lt;BODY&gt;&lt;P&gt;Actually, they are hard to find. You have to clean the data first, for example with the tranwrd function. Be aware that tranwrd is case-sensitive and replaces every occurrence of a substring, not just whole words: tranwrd(address,'st','Street') also changes the 'st' in 'West', and tranwrd(address,'St','Street') turns an already expanded 'Street' into 'Streetreet'. Matching whole words with prxchange avoids both problems.&lt;/P&gt;&lt;P&gt;Example:&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;/* replace st, st., St, St. and street as whole words, ignoring case */&lt;/P&gt;&lt;P&gt;address = prxchange('s/\bst(reet)?\b\.?/Street/i', -1, address);&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;You can put several such statements, one per street type, in one data step. Alternatively, you can break the address into parts. You will also find other street types and abbreviations, such as avenue, road or rd... I went through this for a dataset with 4 million records: sort the addresses by street name and suburb name, and the variants become easier to spot.&lt;/P&gt;&lt;/BODY&gt;&lt;/HTML&gt;</description>
    <pubDate>Fri, 11 Oct 2013 01:43:38 GMT</pubDate>
    <dc:creator>Mit</dc:creator>
    <dc:date>2013-10-11T01:43:38Z</dc:date>
    <item>
      <title>Finding and merging duplicates in a dataset</title>
      <link>https://communities.sas.com/t5/SAS-Procedures/Finding-and-merging-duplicates-in-a-dataset/m-p/110123#M30580</link>
      <description>&lt;HTML&gt;&lt;HEAD&gt;&lt;/HEAD&gt;&lt;BODY&gt;&lt;P&gt;Hi there&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;I am working with a dataset which contains addresses of businesses. There are many non-exact duplicates because the same address is written in different ways (e.g. marsden street, marsden st). Is there a way to combine these addresses and find the duplicates?&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;Thanks&lt;/P&gt;&lt;/BODY&gt;&lt;/HTML&gt;</description>
      <pubDate>Fri, 11 Oct 2013 01:29:44 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Procedures/Finding-and-merging-duplicates-in-a-dataset/m-p/110123#M30580</guid>
      <dc:creator>KingJ</dc:creator>
      <dc:date>2013-10-11T01:29:44Z</dc:date>
    </item>
    <item>
      <title>Re: Finding and merging duplicates in a dataset</title>
      <link>https://communities.sas.com/t5/SAS-Procedures/Finding-and-merging-duplicates-in-a-dataset/m-p/110124#M30581</link>
      <description>&lt;HTML&gt;&lt;HEAD&gt;&lt;/HEAD&gt;&lt;BODY&gt;&lt;P&gt;Actually, they are hard to find. You have to clean the data first, for example with the tranwrd function. Be aware that tranwrd is case-sensitive and replaces every occurrence of a substring, not just whole words: tranwrd(address,'st','Street') also changes the 'st' in 'West', and tranwrd(address,'St','Street') turns an already expanded 'Street' into 'Streetreet'. Matching whole words with prxchange avoids both problems.&lt;/P&gt;&lt;P&gt;Example:&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;/* replace st, st., St, St. and street as whole words, ignoring case */&lt;/P&gt;&lt;P&gt;address = prxchange('s/\bst(reet)?\b\.?/Street/i', -1, address);&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;You can put several such statements, one per street type, in one data step. Alternatively, you can break the address into parts. You will also find other street types and abbreviations, such as avenue, road or rd... I went through this for a dataset with 4 million records: sort the addresses by street name and suburb name, and the variants become easier to spot.&lt;/P&gt;&lt;/BODY&gt;&lt;/HTML&gt;</description>
      <pubDate>Fri, 11 Oct 2013 01:43:38 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Procedures/Finding-and-merging-duplicates-in-a-dataset/m-p/110124#M30581</guid>
      <dc:creator>Mit</dc:creator>
      <dc:date>2013-10-11T01:43:38Z</dc:date>
    </item>
    <item>
      <title>Re: Finding and merging duplicates in a dataset</title>
      <link>https://communities.sas.com/t5/SAS-Procedures/Finding-and-merging-duplicates-in-a-dataset/m-p/110125#M30582</link>
      <description>&lt;HTML&gt;&lt;HEAD&gt;&lt;/HEAD&gt;&lt;BODY&gt;&lt;P&gt;Thanks Mit&lt;/P&gt;&lt;/BODY&gt;&lt;/HTML&gt;</description>
      <pubDate>Fri, 11 Oct 2013 02:35:21 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Procedures/Finding-and-merging-duplicates-in-a-dataset/m-p/110125#M30582</guid>
      <dc:creator>KingJ</dc:creator>
      <dc:date>2013-10-11T02:35:21Z</dc:date>
    </item>
    <item>
      <title>Re: Finding and merging duplicates in a dataset</title>
      <link>https://communities.sas.com/t5/SAS-Procedures/Finding-and-merging-duplicates-in-a-dataset/m-p/110126#M30583</link>
      <description>&lt;HTML&gt;&lt;HEAD&gt;&lt;/HEAD&gt;&lt;BODY&gt;&lt;P&gt;This is a long shot, but if you have a GIS product, either SAS or another, that does geocoding from addresses, I would give that a try. Because your situation is pretty typical, many geocoding applications know how to handle variations in many address components. It could also give you a list of the addresses it could not geocode, which would identify the really creative spellings.&lt;/P&gt;&lt;/BODY&gt;&lt;/HTML&gt;</description>
      <pubDate>Fri, 11 Oct 2013 14:59:09 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Procedures/Finding-and-merging-duplicates-in-a-dataset/m-p/110126#M30583</guid>
      <dc:creator>ballardw</dc:creator>
      <dc:date>2013-10-11T14:59:09Z</dc:date>
    </item>
    <item>
      <title>Re: Finding and merging duplicates in a dataset</title>
      <link>https://communities.sas.com/t5/SAS-Procedures/Finding-and-merging-duplicates-in-a-dataset/m-p/110127#M30584</link>
      <description>&lt;HTML&gt;&lt;HEAD&gt;&lt;/HEAD&gt;&lt;BODY&gt;&lt;P&gt;Standardizing addresses is a typical task for which you would use DataFlux.&lt;/P&gt;&lt;P&gt;You can code this yourself, as Mit proposes, but it will be a lot of work and the result will never be as good as what comes almost "out-of-the-box" with DataFlux.&lt;/P&gt;&lt;/BODY&gt;&lt;/HTML&gt;</description>
      <pubDate>Sat, 12 Oct 2013 01:20:27 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Procedures/Finding-and-merging-duplicates-in-a-dataset/m-p/110127#M30584</guid>
      <dc:creator>Patrick</dc:creator>
      <dc:date>2013-10-12T01:20:27Z</dc:date>
    </item>
    <item>
      <title>Re: Finding and merging duplicates in a dataset</title>
      <link>https://communities.sas.com/t5/SAS-Procedures/Finding-and-merging-duplicates-in-a-dataset/m-p/110128#M30585</link>
      <description>&lt;HTML&gt;&lt;HEAD&gt;&lt;/HEAD&gt;&lt;BODY&gt;&lt;P&gt;I agree with ballardw. I used MapInfo a long time ago. But the problem is that this will only match to the correct address. If the addresses are misspelled, there is no way around cleaning the data. So I followed these steps:&lt;/P&gt;&lt;P&gt;1. Sort and summarise by street and suburb.&lt;/P&gt;&lt;P&gt;2. Match with MapInfo.&lt;/P&gt;&lt;P&gt;3. Find the addresses that did not match, then clean them.&lt;/P&gt;&lt;/BODY&gt;&lt;/HTML&gt;</description>
      <pubDate>Sun, 13 Oct 2013 22:39:47 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Procedures/Finding-and-merging-duplicates-in-a-dataset/m-p/110128#M30585</guid>
      <dc:creator>Mit</dc:creator>
      <dc:date>2013-10-13T22:39:47Z</dc:date>
    </item>
  </channel>
</rss>

