<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Efficient append but finding duplicates in SAS Programming</title>
    <link>https://communities.sas.com/t5/SAS-Programming/Efficient-append-but-finding-duplicates/m-p/701026#M214609</link>
    <description>&lt;P&gt;I have a large dataset (21M) and a small one (1000). I would love to use proc append with an indexed dataset. It is very fast but, if it finds duplicates, it has no option to report which observations were duplicated. Is there an easy way to find the dups? This may be more of a job for proc sql, but append is attractive because of its speed.&lt;/P&gt;</description>
    <pubDate>Mon, 23 Nov 2020 20:59:23 GMT</pubDate>
    <dc:creator>AlanC</dc:creator>
    <dc:date>2020-11-23T20:59:23Z</dc:date>
    <item>
      <title>Efficient append but finding duplicates</title>
      <link>https://communities.sas.com/t5/SAS-Programming/Efficient-append-but-finding-duplicates/m-p/701026#M214609</link>
      <description>&lt;P&gt;I have a large dataset (21M) and a small one (1000). I would love to use proc append with an indexed dataset. It is very fast but, if it finds duplicates, it has no option to report which observations were duplicated. Is there an easy way to find the dups? This may be more of a job for proc sql, but append is attractive because of its speed.&lt;/P&gt;</description>
      <pubDate>Mon, 23 Nov 2020 20:59:23 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/Efficient-append-but-finding-duplicates/m-p/701026#M214609</guid>
      <dc:creator>AlanC</dc:creator>
      <dc:date>2020-11-23T20:59:23Z</dc:date>
    </item>
    <item>
      <title>Re: Efficient append but finding duplicates</title>
      <link>https://communities.sas.com/t5/SAS-Programming/Efficient-append-but-finding-duplicates/m-p/701041#M214616</link>
      <description>&lt;P&gt;Show us your code please?&lt;/P&gt;</description>
      <pubDate>Mon, 23 Nov 2020 20:31:16 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/Efficient-append-but-finding-duplicates/m-p/701041#M214616</guid>
      <dc:creator>PeterClemmensen</dc:creator>
      <dc:date>2020-11-23T20:31:16Z</dc:date>
    </item>
    <item>
      <title>Re: Efficient append but finding duplicates</title>
      <link>https://communities.sas.com/t5/SAS-Programming/Efficient-append-but-finding-duplicates/m-p/701044#M214617</link>
      <description>&lt;P&gt;What do you want the identification of "dupes" to look like?&lt;/P&gt;
&lt;P&gt;Do the dupes all originate in the appended data, or might a record already exist in the base data so that the append creates the dupe?&lt;/P&gt;</description>
      <pubDate>Mon, 23 Nov 2020 20:44:25 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/Efficient-append-but-finding-duplicates/m-p/701044#M214617</guid>
      <dc:creator>ballardw</dc:creator>
      <dc:date>2020-11-23T20:44:25Z</dc:date>
    </item>
    <item>
      <title>Re: Efficient append but finding duplicates</title>
      <link>https://communities.sas.com/t5/SAS-Programming/Efficient-append-but-finding-duplicates/m-p/701049#M214620</link>
      <description>&lt;P&gt;The 2 datasets are identical, with 3 vars in their index. I am thinking just a simple where clause will probably be fast enough. I don't want to do a sort/merge.&lt;/P&gt;</description>
      <pubDate>Mon, 23 Nov 2020 20:49:41 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/Efficient-append-but-finding-duplicates/m-p/701049#M214620</guid>
      <dc:creator>AlanC</dc:creator>
      <dc:date>2020-11-23T20:49:41Z</dc:date>
    </item>
    <item>
      <title>Re: Efficient append but finding duplicates</title>
      <link>https://communities.sas.com/t5/SAS-Programming/Efficient-append-but-finding-duplicates/m-p/701055#M214622</link>
      <description>&lt;PRE&gt;proc append base=BigData data=SmallData force;
run;&lt;/PRE&gt;
&lt;P&gt;There is no option to catch dups if they are on a unique index. Hence, append may not be the best choice here.&lt;/P&gt;</description>
      <pubDate>Mon, 23 Nov 2020 20:58:44 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/Efficient-append-but-finding-duplicates/m-p/701055#M214622</guid>
      <dc:creator>AlanC</dc:creator>
      <dc:date>2020-11-23T20:58:44Z</dc:date>
    </item>
    <item>
      <title>Re: Efficient append but finding duplicates</title>
      <link>https://communities.sas.com/t5/SAS-Programming/Efficient-append-but-finding-duplicates/m-p/701057#M214624</link>
      <description>&lt;P&gt;proc append is fast because it does not process data. As soon as you start processing data, then of course speed drops.&lt;/P&gt;
&lt;P&gt;If both tables are indexed, it seems to me that a SQL inner join, done before appending and with the small table named first, would be the fastest way to identify the duplicates.&lt;/P&gt;
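&lt;P&gt;A minimal sketch of such an inner join, run before the append. The variable names key1, key2 and key3 are hypothetical placeholders for the three indexed variables, which the thread does not name; BigData and SmallData are the table names from the proc append call earlier in the thread. Whether the index is actually used depends on the data and the SAS version, so treat this as a starting point rather than a tuned solution:&lt;/P&gt;

```sas
/* List the SmallData rows whose key already exists in BigData. */
/* key1-key3 are assumed names for the three indexed variables. */
proc sql;
  create table work.dups as
  select s.*
  from SmallData as s
       inner join BigData as b
         on  s.key1 = b.key1
         and s.key2 = b.key2
         and s.key3 = b.key3;
quit;
```

&lt;P&gt;work.dups then holds the SmallData observations whose keys are already present in BigData.&lt;/P&gt;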
&lt;P&gt;Note that 20m rows is a small table, and the manipulations you describe should take no time at all.&lt;/P&gt;</description>
      <pubDate>Mon, 23 Nov 2020 21:02:07 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/Efficient-append-but-finding-duplicates/m-p/701057#M214624</guid>
      <dc:creator>ChrisNZ</dc:creator>
      <dc:date>2020-11-23T21:02:07Z</dc:date>
    </item>
    <item>
      <title>Re: Efficient append but finding duplicates</title>
      <link>https://communities.sas.com/t5/SAS-Programming/Efficient-append-but-finding-duplicates/m-p/701058#M214625</link>
      <description>I think I found what I want, which is the intersect keyword. I don't use SQL much but I wanted to avoid a sort merge. I am surprised append does not have a way of spitting out dups like sort.</description>
      <pubDate>Mon, 23 Nov 2020 21:03:59 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/Efficient-append-but-finding-duplicates/m-p/701058#M214625</guid>
      <dc:creator>AlanC</dc:creator>
      <dc:date>2020-11-23T21:03:59Z</dc:date>
    </item>
    <item>
      <title>Re: Efficient append but finding duplicates</title>
      <link>https://communities.sas.com/t5/SAS-Programming/Efficient-append-but-finding-duplicates/m-p/701064#M214627</link>
      <description>&lt;BLOCKQUOTE&gt;&lt;HR /&gt;&lt;a href="https://communities.sas.com/t5/user/viewprofilepage/user-id/13798"&gt;@AlanC&lt;/a&gt;&amp;nbsp;wrote:&lt;BR /&gt;I think I found what I want, which is the intersect keyword. I don't use SQL much but I wanted to avoid a sort merge. I am surprised append does not have a way of spitting out dups like sort.&lt;HR /&gt;&lt;/BLOCKQUOTE&gt;
&lt;P&gt;SQL can do a lot of background "sorting" even though you don't specify it explicitly.&lt;/P&gt;</description>
      <pubDate>Mon, 23 Nov 2020 21:09:37 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/Efficient-append-but-finding-duplicates/m-p/701064#M214627</guid>
      <dc:creator>ballardw</dc:creator>
      <dc:date>2020-11-23T21:09:37Z</dc:date>
    </item>
    <item>
      <title>Re: Efficient append but finding duplicates</title>
      <link>https://communities.sas.com/t5/SAS-Programming/Efficient-append-but-finding-duplicates/m-p/701065#M214628</link>
      <description>&lt;P&gt;An &lt;EM&gt;intersect&lt;/EM&gt; will not use the indexes, as far as I know. An &lt;EM&gt;inner join&lt;/EM&gt; will. If you want speed, use the indexes.&lt;/P&gt;
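&lt;P&gt;For comparison, the intersect form under discussion might look like the sketch below. key1, key2 and key3 are hypothetical names for the three indexed variables, which the thread does not give; SmallData and BigData are the table names used earlier in the thread. It returns the distinct key combinations present in both tables:&lt;/P&gt;

```sas
/* Keys that occur in both tables; key1-key3 are assumed names. */
proc sql;
  create table work.dup_keys as
  select key1, key2, key3 from SmallData
  intersect
  select key1, key2, key3 from BigData;
quit;
```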
&lt;P&gt;proc append does not read the data: observations are added in bulk. Hence its speed, and hence its limitations.&lt;/P&gt;</description>
      <pubDate>Mon, 23 Nov 2020 21:10:24 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/Efficient-append-but-finding-duplicates/m-p/701065#M214628</guid>
      <dc:creator>ChrisNZ</dc:creator>
      <dc:date>2020-11-23T21:10:24Z</dc:date>
    </item>
  </channel>
</rss>

