<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: finding duplicate records within time frames. in SAS Programming</title>
    <link>https://communities.sas.com/t5/SAS-Programming/finding-duplicate-records-within-time-frames/m-p/79445#M256531</link>
    <description>&lt;HTML&gt;&lt;HEAD&gt;&lt;/HEAD&gt;&lt;BODY&gt;&lt;P&gt;That is incredibly fast as well, thank you very much.&amp;nbsp; &lt;/P&gt;&lt;/BODY&gt;&lt;/HTML&gt;</description>
    <pubDate>Thu, 27 Sep 2012 19:52:54 GMT</pubDate>
    <dc:creator>Steelers_In_DC</dc:creator>
    <dc:date>2012-09-27T19:52:54Z</dc:date>
    <item>
      <title>finding duplicate records within time frames.</title>
      <link>https://communities.sas.com/t5/SAS-Programming/finding-duplicate-records-within-time-frames/m-p/79441#M256527</link>
      <description>&lt;HTML&gt;&lt;HEAD&gt;&lt;/HEAD&gt;&lt;BODY&gt;&lt;P&gt;I'm back with another one today: a simple problem that a large dataset is making difficult.&amp;nbsp; I have an estimated 100 million records, a section of which I'll post below.&amp;nbsp; The first column is the customer account, the second represents a date.&amp;nbsp; I'd like to find out whether any customer account number appears more than once on the same date.&amp;nbsp; I tried PROC FREQ, but the dataset is too large.&amp;nbsp; Is retain&amp;nbsp; first.customer a solution?&amp;nbsp; I've been trying several things but coming up with nothing good:&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;TABLE&gt;&lt;TBODY&gt;&lt;TR&gt;&lt;TD&gt;196607192&lt;/TD&gt;&lt;TD&gt;40499&lt;/TD&gt;&lt;/TR&gt;&lt;TR&gt;&lt;TD&gt;196567409&lt;/TD&gt;&lt;TD&gt;40499&lt;/TD&gt;&lt;/TR&gt;&lt;TR&gt;&lt;TD&gt;196384699&lt;/TD&gt;&lt;TD&gt;40499&lt;/TD&gt;&lt;/TR&gt;&lt;TR&gt;&lt;TD&gt;196654152&lt;/TD&gt;&lt;TD&gt;40499&lt;/TD&gt;&lt;/TR&gt;&lt;TR&gt;&lt;TD&gt;196524444&lt;/TD&gt;&lt;TD&gt;40499&lt;/TD&gt;&lt;/TR&gt;&lt;TR&gt;&lt;TD&gt;196370474&lt;/TD&gt;&lt;TD&gt;40499&lt;/TD&gt;&lt;/TR&gt;&lt;TR&gt;&lt;TD&gt;196580659&lt;/TD&gt;&lt;TD&gt;40499&lt;/TD&gt;&lt;/TR&gt;&lt;/TBODY&gt;&lt;/TABLE&gt;&lt;/BODY&gt;&lt;/HTML&gt;</description>
      <pubDate>Thu, 27 Sep 2012 19:17:09 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/finding-duplicate-records-within-time-frames/m-p/79441#M256527</guid>
      <dc:creator>Steelers_In_DC</dc:creator>
      <dc:date>2012-09-27T19:17:09Z</dc:date>
    </item>
    <item>
      <title>Re: finding duplicate records within time frames.</title>
      <link>https://communities.sas.com/t5/SAS-Programming/finding-duplicate-records-within-time-frames/m-p/79442#M256528</link>
      <description>&lt;HTML&gt;&lt;HEAD&gt;&lt;/HEAD&gt;&lt;BODY&gt;&lt;P&gt;In addition, the dataset covers 20 months.&lt;/P&gt;&lt;/BODY&gt;&lt;/HTML&gt;</description>
      <pubDate>Thu, 27 Sep 2012 19:17:50 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/finding-duplicate-records-within-time-frames/m-p/79442#M256528</guid>
      <dc:creator>Steelers_In_DC</dc:creator>
      <dc:date>2012-09-27T19:17:50Z</dc:date>
    </item>
    <item>
      <title>Re: finding duplicate records within time frames.</title>
      <link>https://communities.sas.com/t5/SAS-Programming/finding-duplicate-records-within-time-frames/m-p/79443#M256529</link>
      <description>&lt;HTML&gt;&lt;HEAD&gt;&lt;/HEAD&gt;&lt;BODY&gt;&lt;P&gt;One last thing, here was my initial idea:&lt;/P&gt;&lt;P&gt;proc sql;&lt;/P&gt;&lt;P&gt;create table cus_hsd as&lt;/P&gt;&lt;P&gt;select ver_start_day_key, cust_acct_key&lt;/P&gt;&lt;P&gt;from prd_sas.sas_cus_hsd_source_master&lt;/P&gt;&lt;P&gt;order by ver_start_day_key, cust_acct_key;&lt;/P&gt;&lt;P&gt;quit;&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;proc freq data=cus_hsd;&lt;/P&gt;&lt;P&gt;table ver_start_day_key * cust_acct_key / nocol nopercent norow noprint out=cus_hsd_freq;&lt;/P&gt;&lt;P&gt;run;&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;data cus_hsd_final;&lt;/P&gt;&lt;P&gt;set cus_hsd_freq;&lt;/P&gt;&lt;P&gt;where count &amp;gt; 1;&lt;/P&gt;&lt;P&gt;run;&lt;/P&gt;&lt;/BODY&gt;&lt;/HTML&gt;</description>
      <pubDate>Thu, 27 Sep 2012 19:28:52 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/finding-duplicate-records-within-time-frames/m-p/79443#M256529</guid>
      <dc:creator>Steelers_In_DC</dc:creator>
      <dc:date>2012-09-27T19:28:52Z</dc:date>
    </item>
    <item>
      <title>Re: finding duplicate records within time frames.</title>
      <link>https://communities.sas.com/t5/SAS-Programming/finding-duplicate-records-within-time-frames/m-p/79444#M256530</link>
      <description>&lt;HTML&gt;&lt;HEAD&gt;&lt;/HEAD&gt;&lt;BODY&gt;&lt;P&gt;If your PROC SQL step is working, this would be a better way to follow up:&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;data cus_hsd_final;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&amp;nbsp; set cus_hsd;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&amp;nbsp; by ver_start_day_key cust_acct_key;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&amp;nbsp; count + 1;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&amp;nbsp; if last.cust_acct_key;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&amp;nbsp; if count &amp;gt; 1 then output;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&amp;nbsp; count=0;&lt;/P&gt;&lt;P&gt;run;&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;No PROC FREQ needed, as the DATA step can count.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;Good luck.&lt;/P&gt;&lt;/BODY&gt;&lt;/HTML&gt;</description>
      <pubDate>Thu, 27 Sep 2012 19:47:22 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/finding-duplicate-records-within-time-frames/m-p/79444#M256530</guid>
      <dc:creator>Astounding</dc:creator>
      <dc:date>2012-09-27T19:47:22Z</dc:date>
    </item>
    <item>
      <title>Re: finding duplicate records within time frames.</title>
      <link>https://communities.sas.com/t5/SAS-Programming/finding-duplicate-records-within-time-frames/m-p/79445#M256531</link>
      <description>&lt;HTML&gt;&lt;HEAD&gt;&lt;/HEAD&gt;&lt;BODY&gt;&lt;P&gt;That is incredibly fast as well, thank you very much.&amp;nbsp; &lt;/P&gt;&lt;/BODY&gt;&lt;/HTML&gt;</description>
      <pubDate>Thu, 27 Sep 2012 19:52:54 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/finding-duplicate-records-within-time-frames/m-p/79445#M256531</guid>
      <dc:creator>Steelers_In_DC</dc:creator>
      <dc:date>2012-09-27T19:52:54Z</dc:date>
    </item>
    <item>
      <title>Re: finding duplicate records within time frames.</title>
      <link>https://communities.sas.com/t5/SAS-Programming/finding-duplicate-records-within-time-frames/m-p/79446#M256532</link>
      <description>&lt;HTML&gt;&lt;HEAD&gt;&lt;/HEAD&gt;&lt;BODY&gt;&lt;P&gt;If you wouldn't mind, can you take a moment to explain?&amp;nbsp; I understand what the count &amp;gt; 1 is doing, but not why the count=0 is necessary.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;Thanks,&lt;/P&gt;&lt;/BODY&gt;&lt;/HTML&gt;</description>
      <pubDate>Thu, 27 Sep 2012 19:57:04 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/finding-duplicate-records-within-time-frames/m-p/79446#M256532</guid>
      <dc:creator>Steelers_In_DC</dc:creator>
      <dc:date>2012-09-27T19:57:04Z</dc:date>
    </item>
    <item>
      <title>Re: finding duplicate records within time frames.</title>
      <link>https://communities.sas.com/t5/SAS-Programming/finding-duplicate-records-within-time-frames/m-p/79447#M256533</link>
      <description>&lt;HTML&gt;&lt;HEAD&gt;&lt;/HEAD&gt;&lt;BODY&gt;&lt;P&gt;count=0 is setting COUNT back to 0, because the DATA step is about to begin processing the first record for the next customer.&amp;nbsp; Each block of records (day + customer = a block) should begin with count=0.&amp;nbsp; It's positioned at the bottom of the DATA step just to gain a small bit of speed, since you are dealing with 100M records.&amp;nbsp; This DATA step would have worked equally well, but would have needed to check for the beginning of each block:&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;data cus_hsd_final;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&amp;nbsp; set cus_hsd;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&amp;nbsp; by ver_start_day_key cust_acct_key;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&amp;nbsp; if first.cust_acct_key then count=1;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&amp;nbsp; else count + 1;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&amp;nbsp; if last.cust_acct_key and count &amp;gt; 1;&lt;/P&gt;&lt;P&gt;run;&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;However, it would have been a little bit slower because of the need to check for first.cust_acct_key.&lt;/P&gt;&lt;/BODY&gt;&lt;/HTML&gt;</description>
      <pubDate>Thu, 27 Sep 2012 20:07:00 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/finding-duplicate-records-within-time-frames/m-p/79447#M256533</guid>
      <dc:creator>Astounding</dc:creator>
      <dc:date>2012-09-27T20:07:00Z</dc:date>
    </item>
    <item>
      <title>Re: finding duplicate records within time frames.</title>
      <link>https://communities.sas.com/t5/SAS-Programming/finding-duplicate-records-within-time-frames/m-p/79448#M256534</link>
      <description>&lt;HTML&gt;&lt;HEAD&gt;&lt;/HEAD&gt;&lt;BODY&gt;&lt;P&gt;Awesome, thanks so much.&lt;/P&gt;&lt;/BODY&gt;&lt;/HTML&gt;</description>
      <pubDate>Thu, 27 Sep 2012 20:11:23 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/finding-duplicate-records-within-time-frames/m-p/79448#M256534</guid>
      <dc:creator>Steelers_In_DC</dc:creator>
      <dc:date>2012-09-27T20:11:23Z</dc:date>
    </item>
  </channel>
</rss>

