<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: Identify duplicates by datastep and proc sql in SAS Programming</title>
    <link>https://communities.sas.com/t5/SAS-Programming/Identify-duplicates-by-datastep-and-proc-sql/m-p/601026#M173805</link>
    <description>&lt;P&gt;Why not do counting in a PROC that SAS created specifically to do counting? And then you have very little programming to do.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;PRE&gt;&lt;CODE class=" language-sas"&gt;proc freq data=have;
    tables id*date/list /* optional, to create a SAS data set add this: out=counts */;
run;&lt;/CODE&gt;&lt;/PRE&gt;
&lt;P&gt;By the way, I don't see how the row&amp;nbsp;&amp;nbsp;&lt;SPAN&gt;2 1/2/2005 shows up in your output with a count of 4.&lt;/SPAN&gt;&lt;/P&gt;</description>
    <pubDate>Fri, 01 Nov 2019 17:23:23 GMT</pubDate>
    <dc:creator>PaigeMiller</dc:creator>
    <dc:date>2019-11-01T17:23:23Z</dc:date>
    <item>
      <title>Identify duplicates by datastep and proc sql</title>
      <link>https://communities.sas.com/t5/SAS-Programming/Identify-duplicates-by-datastep-and-proc-sql/m-p/601014#M173797</link>
      <description>&lt;P&gt;Hi Everyone,&lt;/P&gt;&lt;P&gt;I'm trying to identify duplicates by dataset and sql to compare performance. Following is the dataset:&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;data have ;&lt;BR /&gt;input id $ date $;&lt;BR /&gt;cards;&lt;BR /&gt;1 1/1/2001&lt;BR /&gt;1 1/1/2001&lt;BR /&gt;1 1/2/2002&lt;BR /&gt;1 1/2/2002&lt;BR /&gt;1 1/2/2003&lt;BR /&gt;1 1/2/2003&lt;BR /&gt;2 1/1/2005&lt;BR /&gt;2 1/1/2005&lt;BR /&gt;2 1/1/2005&lt;BR /&gt;2 1/2/2005&lt;BR /&gt;2 1/2/2005&lt;BR /&gt;2 1/3/2006&lt;BR /&gt;2 1/3/2006&lt;BR /&gt;;&lt;BR /&gt;run ;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;I hope to create programs to have a counter that defines any repeats by id and date. The output is like the following:&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;id&amp;nbsp;&amp;nbsp; date&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; count&lt;/P&gt;&lt;P&gt;1 1/1/2001&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; 1&lt;BR /&gt;1 1/1/2001&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; 2&lt;BR /&gt;1 1/2/2002&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; 1&lt;BR /&gt;1 1/2/2002&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; 2&lt;BR /&gt;1 1/2/2003&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; 1&lt;BR /&gt;1 1/2/2003&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; 2&lt;BR /&gt;2 1/1/2005&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; 1&lt;BR /&gt;2 1/1/2005&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; 2&lt;BR /&gt;2 1/1/2005&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; 3&lt;BR /&gt;2 1/2/2005&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; 1&lt;BR /&gt;2 1/2/2005&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; 2&lt;BR /&gt;2 1/3/2006&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; 1&lt;BR /&gt;2 1/3/2006&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; 2&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;How to create the count in both datastep and proc sql?&lt;/P&gt;&lt;P&gt;Appreciate any help!&lt;/P&gt;</description>
      <pubDate>Fri, 01 Nov 2019 18:05:47 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/Identify-duplicates-by-datastep-and-proc-sql/m-p/601014#M173797</guid>
      <dc:creator>lydiawawa</dc:creator>
      <dc:date>2019-11-01T18:05:47Z</dc:date>
    </item>
    <item>
      <title>Re: Identify duplicates by datastep and proc sql</title>
      <link>https://communities.sas.com/t5/SAS-Programming/Identify-duplicates-by-datastep-and-proc-sql/m-p/601026#M173805</link>
      <description>&lt;P&gt;Why not do counting in a PROC that SAS created specifically to do counting? And then you have very little programming to do.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;PRE&gt;&lt;CODE class=" language-sas"&gt;proc freq data=have;
    tables id*date/list /* optional, to create a SAS data set add this: out=counts */;
run;&lt;/CODE&gt;&lt;/PRE&gt;
&lt;P&gt;By the way, I don't see how the row&amp;nbsp;&amp;nbsp;&lt;SPAN&gt;2 1/2/2005 shows up in your output with a count of 4.&lt;/SPAN&gt;&lt;/P&gt;</description>
      <pubDate>Fri, 01 Nov 2019 17:23:23 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/Identify-duplicates-by-datastep-and-proc-sql/m-p/601026#M173805</guid>
      <dc:creator>PaigeMiller</dc:creator>
      <dc:date>2019-11-01T17:23:23Z</dc:date>
    </item>
    <item>
      <title>Re: Identify duplicates by datastep and proc sql</title>
      <link>https://communities.sas.com/t5/SAS-Programming/Identify-duplicates-by-datastep-and-proc-sql/m-p/601033#M173810</link>
      <description>Thank you for spotting the error I made, just made the edit. Proc freq won't be able to generate the count variable. It will give me the total repeats only, but I want the count to record 1st appearance,1st repeat, 2rd repeat. So it is going to 1, 2, 3...</description>
      <pubDate>Fri, 01 Nov 2019 18:10:33 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/Identify-duplicates-by-datastep-and-proc-sql/m-p/601033#M173810</guid>
      <dc:creator>lydiawawa</dc:creator>
      <dc:date>2019-11-01T18:10:33Z</dc:date>
    </item>
    <item>
      <title>Re: Identify duplicates by datastep and proc sql</title>
      <link>https://communities.sas.com/t5/SAS-Programming/Identify-duplicates-by-datastep-and-proc-sql/m-p/601037#M173811</link>
      <description>&lt;P&gt;Proc SQL is not well suited for such sequential processing. Try this:&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;PRE&gt;&lt;CODE class=" language-sas"&gt;data want;
do count = 1 by 1 until (last.date);
	set have; by id date;
	output;
	end;
run;&lt;/CODE&gt;&lt;/PRE&gt;</description>
      <pubDate>Fri, 01 Nov 2019 18:19:00 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/Identify-duplicates-by-datastep-and-proc-sql/m-p/601037#M173811</guid>
      <dc:creator>PGStats</dc:creator>
      <dc:date>2019-11-01T18:19:00Z</dc:date>
    </item>
    <item>
      <title>Re: Identify duplicates by datastep and proc sql</title>
      <link>https://communities.sas.com/t5/SAS-Programming/Identify-duplicates-by-datastep-and-proc-sql/m-p/601038#M173812</link>
      <description>&lt;P&gt;&lt;a href="https://communities.sas.com/t5/user/viewprofilepage/user-id/30435"&gt;@lydiawawa&lt;/a&gt;:&lt;/P&gt;
&lt;P&gt;In the DATA step:&lt;/P&gt;
&lt;PRE&gt;&lt;CODE class=" language-sas"&gt;data have ;                                                                                                                             
input id $ date $ ;                                                                                                                     
cards;                                                                                                                                  
1 1/1/2001                                                                                                                              
1 1/1/2001                                                                                                                              
1 1/2/2002                                                                                                                              
1 1/2/2002                                                                                                                              
1 1/2/2003                                                                                                                              
1 1/2/2003                                                                                                                              
2 1/1/2005                                                                                                                              
2 1/1/2005                                                                                                                              
2 1/1/2005                                                                                                                              
2 1/2/2005                                                                                                                              
2 1/2/2005                                                                                                                              
2 1/3/2006                                                                                                                              
2 1/3/2006                                                                                                                              
;                                                                                                                                       
run ;                                                                                                                                   
                                                                                                                                        
data want ;                                                                                                                             
  if _n_ = 1 then do ;                                                                                                                  
    dcl hash h () ;                                                                                                                     
    h.definekey ("id", "date") ;                                                                                                        
    h.definedata ("count") ;                                                                                                            
    h.definedone () ;                                                                                                                   
  end ;                                                                                                                                 
  set have ;                                                                                                                            
  if h.find() ne 0 then count = 1 ;                                                                                                     
  else                  count + 1 ;                                                                                                     
  h.replace() ;                                                                                                                         
run ;                                 
&lt;/CODE&gt;&lt;/PRE&gt;
&lt;P&gt;Note that the program makes no assumption that the input data are sorted. It works for unsorted data as well.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;Kind regards&lt;/P&gt;
&lt;P&gt;Paul D.&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Fri, 01 Nov 2019 18:19:57 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/Identify-duplicates-by-datastep-and-proc-sql/m-p/601038#M173812</guid>
      <dc:creator>hashman</dc:creator>
      <dc:date>2019-11-01T18:19:57Z</dc:date>
    </item>
    <item>
      <title>Re: Identify duplicates by datastep and proc sql</title>
      <link>https://communities.sas.com/t5/SAS-Programming/Identify-duplicates-by-datastep-and-proc-sql/m-p/601040#M173813</link>
      <description>I believe I would need to learn Hash table if my datasets are about billion rows? It will be the only substitution for sorting and merging for large datasets? I thought proc sql would somehow improve performance in comparing to datastep, but the difference are very small from my observation so far.</description>
      <pubDate>Fri, 01 Nov 2019 18:23:32 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/Identify-duplicates-by-datastep-and-proc-sql/m-p/601040#M173813</guid>
      <dc:creator>lydiawawa</dc:creator>
      <dc:date>2019-11-01T18:23:32Z</dc:date>
    </item>
    <item>
      <title>Re: Identify duplicates by datastep and proc sql</title>
      <link>https://communities.sas.com/t5/SAS-Programming/Identify-duplicates-by-datastep-and-proc-sql/m-p/601042#M173814</link>
      <description>most my data requires sequential selection.</description>
      <pubDate>Fri, 01 Nov 2019 18:25:09 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/Identify-duplicates-by-datastep-and-proc-sql/m-p/601042#M173814</guid>
      <dc:creator>lydiawawa</dc:creator>
      <dc:date>2019-11-01T18:25:09Z</dc:date>
    </item>
    <item>
      <title>Re: Identify duplicates by datastep and proc sql</title>
      <link>https://communities.sas.com/t5/SAS-Programming/Identify-duplicates-by-datastep-and-proc-sql/m-p/601046#M173815</link>
      <description>&lt;P&gt;If your data are sorted by ID/DATE, then:&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;PRE&gt;&lt;CODE class=" language-sas"&gt;data have ;
input id $ date $;
cards;
1 1/1/2001
1 1/1/2001
1 1/2/2002
1 1/2/2002
1 1/2/2003
1 1/2/2003
2 1/1/2005
2 1/1/2005
2 1/1/2005
2 1/2/2005
2 1/2/2005
2 1/3/2006
2 1/3/2006
run ;

data want;
  set have;
  by id date;
  count+1;
  if first.date then count=1;
run;&lt;/CODE&gt;&lt;/PRE&gt;
&lt;P&gt;If your data are not sorted by ID/DATE, you could&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&amp;nbsp; proc sort data=have; by id date; run;&lt;/P&gt;
&lt;P&gt;and then apply the program above.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Fri, 01 Nov 2019 18:31:18 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/Identify-duplicates-by-datastep-and-proc-sql/m-p/601046#M173815</guid>
      <dc:creator>mkeintz</dc:creator>
      <dc:date>2019-11-01T18:31:18Z</dc:date>
    </item>
    <item>
      <title>Re: Identify duplicates by datastep and proc sql</title>
      <link>https://communities.sas.com/t5/SAS-Programming/Identify-duplicates-by-datastep-and-proc-sql/m-p/601075#M173819</link>
      <description>&lt;P&gt;You can accomplish this with SQL Passthrough using the ROW_NUMBER() function in SQL. Whether it would be faster than a DATASTEP is unknown to me though.&lt;/P&gt;</description>
      <pubDate>Fri, 01 Nov 2019 19:56:48 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/Identify-duplicates-by-datastep-and-proc-sql/m-p/601075#M173819</guid>
      <dc:creator>Krueger</dc:creator>
      <dc:date>2019-11-01T19:56:48Z</dc:date>
    </item>
    <item>
      <title>Re: Identify duplicates by datastep and proc sql</title>
      <link>https://communities.sas.com/t5/SAS-Programming/Identify-duplicates-by-datastep-and-proc-sql/m-p/601079#M173820</link>
      <description>&lt;P&gt;&lt;a href="https://communities.sas.com/t5/user/viewprofilepage/user-id/30435"&gt;@lydiawawa&lt;/a&gt;:&lt;/P&gt;
&lt;P&gt;There exists a fairly widely spread illusion that one needs to know how to use the hash object only when one has a lot of data to process. Surely under many circumstances hash tables can speed things up quite a bit - for example, by making it unnecessary to sort large files when data need to be combined or aggregated. However, the main strength of the hash object in general is that it is an extremely flexible and convenient tool for dynamic programming, frequently lending itself to accomplishing in one step and/or single pass what otherwise would require several and doing it using simpler and more straightforward logic to boot.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;The task you've posted in this thread is a good illustration. Imagine that your data are unsorted and look at the hash program doing what you want:&lt;/P&gt;
&lt;PRE&gt;&lt;CODE class=" language-sas"&gt;data have_unsorted ;                                                                                                                    
input id $ date $ ;                                                                                                                     
cards ;                                                                                                                                 
2  1/3/2006                                                                                                                             
2  1/1/2005                                                                                                                             
2  1/1/2005                                                                                                                             
2  1/1/2005                                                                                                                             
2  1/2/2005                                                                                                                             
1  1/1/2001                                                                                                                             
1  1/2/2002                                                                                                                             
2  1/2/2005                                                                                                                             
2  1/3/2006                                                                                                                             
1  1/2/2002                                                                                                                             
1  1/2/2003                                                                                                                             
1  1/2/2003                                                                                                                             
1  1/1/2001                                                                                                                             
;                                                                                                                                       
run ;                                                                                                                                   
                                                                                                                                        
data want_unsorted ;                                                                                                                    
  if _n_ = 1 then do ;                                                                                                                  
    dcl hash h () ;                                                                                                                     
    h.definekey ("id", "date") ;                                                                                                        
    h.definedata ("count") ;                                                                                                            
    h.definedone () ;                                                                                                                   
  end ;                                                                                                                                 
  set have_unsorted ;                                                                                                                   
  if h.find() ne 0 then count = 1 ;                                                                                                     
  else                  count + 1 ;                                                                                                     
  h.replace() ;                                                                                                                         
run ;          
&lt;/CODE&gt;&lt;/PRE&gt;
&lt;P&gt;If you're unfamiliar with the hash object, it may look Greek to you, and yet its logic it exceedingly simple. Namely, for each record:&lt;/P&gt;
&lt;OL&gt;
&lt;LI&gt;If a key-value (id,date) is not in the table yet, assign count=1 and store it in the table. Output count=1.&lt;/LI&gt;
&lt;LI&gt;Otherwise, look in the table and see what count is there for this (id,date). Add 1 to that count value and store the result back in the table for this (id,date) overwriting the previous value of count there. Output the new value of count.&lt;/LI&gt;
&lt;/OL&gt;
&lt;P&gt;It's that simple. The hash table just keeps track of all previous counts for every (id,date) key-value encountered thus far. And because it automatically grows by 1 item every time a new (id,date) is seen, there's no need to pre-process the input to size it up at compile time - as it would be necessary, for example, if an array were used as the count-tracking table instead. Furthermore, when you search the table for the current record's (id,date) value to find what the previous value of count has been, this act of lookup takes the same time regardless of how many items have been stored in the table (say, ten or a million), as this is one of the hash object's properties. If you're interested, a brief compendium on things of this nature can be found here (penned by&amp;nbsp;&lt;a href="https://communities.sas.com/t5/user/viewprofilepage/user-id/13569"&gt;@DonH&lt;/a&gt;&amp;nbsp;and yours truly):&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&lt;A href="http://support.sas.com/resources/papers/proceedings17/0821-2017.pdf" target="_self"&gt;http://support.sas.com/resources/papers/proceedings17/0821-2017.pdf&lt;/A&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;Kind regards&lt;/P&gt;
&lt;P&gt;Paul Dorfman&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Fri, 01 Nov 2019 20:14:24 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/Identify-duplicates-by-datastep-and-proc-sql/m-p/601079#M173820</guid>
      <dc:creator>hashman</dc:creator>
      <dc:date>2019-11-01T20:14:24Z</dc:date>
    </item>
  </channel>
</rss>

