<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: How to find duplicate based on entire record using Data step in SAS Programming</title>
    <link>https://communities.sas.com/t5/SAS-Programming/How-to-find-duplicate-based-on-entire-record-using-Data-step/m-p/561212#M157065</link>
    <description>&lt;P&gt;You need to sort.&amp;nbsp; You might be able to avoid sorting if the data was small and could all be put into a hash.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;So let's assume you have already run this step to sort by all of the variables, so that duplicates are next to each other.&lt;/P&gt;
&lt;PRE&gt;&lt;CODE class=" language-sas"&gt;proc sort data=have;
  by _all_;
run;&lt;/CODE&gt;&lt;/PRE&gt;
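&lt;P&gt;Now you want to read the data and detect duplicate rows.&amp;nbsp; To detect duplicates you just need to test the FIRST. and LAST. variables for the last BY variable. The trick is to repeat one of the variables at the end of the BY list, after _ALL_, so that there is a variable name to use with FIRST. and LAST.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;Try this example using SASHELP.CLASS.&amp;nbsp; I put AGE before _ALL_ so that the second data step shows that duplicates can be found by using the FIRST. and LAST. variables. SASHELP.CLASS has no fully duplicated rows, so the first data step keeps nothing, while the second, which groups by AGE alone, keeps every student who shares an age with someone else.&lt;/P&gt;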
&lt;PRE&gt;&lt;CODE class=" language-sas"&gt;proc sort data=sashelp.class out=class;
  by age _all_;
run;

data test;
  set class;
  by age _all_ age;
  if not (first.age and last.age);
run;

data test2;
  set class;
  by age ;
  if not (first.age and last.age);
run;
&lt;/CODE&gt;&lt;/PRE&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;</description>
    <pubDate>Thu, 23 May 2019 17:08:28 GMT</pubDate>
    <dc:creator>Tom</dc:creator>
    <dc:date>2019-05-23T17:08:28Z</dc:date>
    <item>
      <title>How to find duplicate based on entire record using Data step</title>
      <link>https://communities.sas.com/t5/SAS-Programming/How-to-find-duplicate-based-on-entire-record-using-Data-step/m-p/561208#M157062</link>
      <description>&lt;P&gt;Hi,&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;I want to find duplicates based on the entire record of 55 columns. I have done it using SAS procedures such as PROC SORT with the NODUP / NOUNIQUEKEY options, but I am wondering whether we can accomplish this using DATA step BY-group processing.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Can we use _ALL_, _NUMERIC_, etc. in DATA step BY processing, instead of typing in all 55 columns? I have used them in the BY statement in procedures.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Or is there any other way in DATA step processing to get the duplicate records (the entire record) without any specific key field?&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Thanks.&lt;/P&gt;</description>
      <pubDate>Thu, 23 May 2019 16:44:35 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/How-to-find-duplicate-based-on-entire-record-using-Data-step/m-p/561208#M157062</guid>
      <dc:creator>meenakshim</dc:creator>
      <dc:date>2019-05-23T16:44:35Z</dc:date>
    </item>
    <item>
      <title>Re: How to find duplicate based on entire record using Data step</title>
      <link>https://communities.sas.com/t5/SAS-Programming/How-to-find-duplicate-based-on-entire-record-using-Data-step/m-p/561212#M157065</link>
      <description>&lt;P&gt;You need to sort.&amp;nbsp; You might be able to avoid sorting if the data was small and could all be put into a hash.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;So let's assume you have already run this step to sort by all of the variables, so that duplicates are next to each other.&lt;/P&gt;
&lt;PRE&gt;&lt;CODE class=" language-sas"&gt;proc sort data=have;
  by _all_;
run;&lt;/CODE&gt;&lt;/PRE&gt;
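&lt;P&gt;Now you want to read the data and detect duplicate rows.&amp;nbsp; To detect duplicates you just need to test the FIRST. and LAST. variables for the last BY variable. The trick is to repeat one of the variables at the end of the BY list, after _ALL_, so that there is a variable name to use with FIRST. and LAST.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;Try this example using SASHELP.CLASS.&amp;nbsp; I put AGE before _ALL_ so that the second data step shows that duplicates can be found by using the FIRST. and LAST. variables. SASHELP.CLASS has no fully duplicated rows, so the first data step keeps nothing, while the second, which groups by AGE alone, keeps every student who shares an age with someone else.&lt;/P&gt;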
&lt;PRE&gt;&lt;CODE class=" language-sas"&gt;proc sort data=sashelp.class out=class;
  by age _all_;
run;

data test;
  set class;
  by age _all_ age;
  if not (first.age and last.age);
run;

data test2;
  set class;
  by age ;
  if not (first.age and last.age);
run;
&lt;/CODE&gt;&lt;/PRE&gt;
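&lt;P&gt;As a small sketch of the same trick on data that really does contain repeated rows (the data set HAVE, its two variables and its values are made up for illustration, not taken from the question):&lt;/P&gt;
&lt;PRE&gt;&lt;CODE class=" language-sas"&gt;/* Hypothetical sample data: the row "Alice 13" appears twice */
data have;
  input name $ age;
  datalines;
Alice 13
Bob 14
Alice 13
Carol 14
;
run;

/* Sort by every variable so identical rows become adjacent */
proc sort data=have;
  by _all_;
run;

/* Repeat AGE after _ALL_ so FIRST.AGE and LAST.AGE describe the whole record */
data dups;
  set have;
  by _all_ age;
  if not (first.age and last.age);  /* keep rows that are not alone in their group */
run;&lt;/CODE&gt;&lt;/PRE&gt;
&lt;P&gt;Here DUPS should keep both copies of the repeated row and drop the two rows that occur only once.&lt;/P&gt;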
&lt;P&gt;&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Thu, 23 May 2019 17:08:28 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/How-to-find-duplicate-based-on-entire-record-using-Data-step/m-p/561212#M157065</guid>
      <dc:creator>Tom</dc:creator>
      <dc:date>2019-05-23T17:08:28Z</dc:date>
    </item>
    <item>
      <title>Re: How to find duplicate based on entire record using Data step</title>
      <link>https://communities.sas.com/t5/SAS-Programming/How-to-find-duplicate-based-on-entire-record-using-Data-step/m-p/561213#M157066</link>
      <description>&lt;P&gt;It's definitely possible to do this in a datastep, but it likely wouldn't be as efficient as just using proc sort (which was built exactly for this type of job). You'd probably want to sort the dataset first by ALL the fields, then use a first-DOT-&amp;lt;Last field in BY statement&amp;gt; (or last-DOT of the same field). I recommend not using a datastep for this, but please let me know what you find out!&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;I'd be interested in:&lt;/P&gt;&lt;P&gt;1. Processing time using proc sort with a NODUP/ other option&lt;/P&gt;&lt;P&gt;2. Processing time using a datastep&lt;/P&gt;&lt;P&gt;3. Whatever the generalized approach for all fields is with the datastep. EDIT: Thanks&amp;nbsp;&lt;a href="https://communities.sas.com/t5/user/viewprofilepage/user-id/159"&gt;@Tom&lt;/a&gt;!! - neat trick adding another field to the end of _all_!&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Thu, 23 May 2019 17:13:37 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/How-to-find-duplicate-based-on-entire-record-using-Data-step/m-p/561213#M157066</guid>
      <dc:creator>noling</dc:creator>
      <dc:date>2019-05-23T17:13:37Z</dc:date>
    </item>
    <item>
      <title>Re: How to find duplicate based on entire record using Data step</title>
      <link>https://communities.sas.com/t5/SAS-Programming/How-to-find-duplicate-based-on-entire-record-using-Data-step/m-p/561229#M157067</link>
      <description>&lt;P&gt;Here is what a hash object approach might look like.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;I create 20 million rows of data with 10 variables for demonstration purposes. Then I create a data set of duplicates, first with the PROC SORT DUPOUT= method and then with the DATA step hash method.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;PRE&gt;&lt;CODE class=" language-sas"&gt;data testdata1(drop=i j);
   call streaminit(123);
   array vars var1-var10;
   do i=1 to 20e6;
      do j=1 to dim(vars);
         vars[j]=rand('integer', 1, 10);
      end;
      output;
   end;
run;

/* Run Time: 1 min 17 sec  */
proc sort data=testdata1 dupout=test1 nodupkey;
   by _ALL_;
run;

data testdata2(drop=i j);
   call streaminit(123);
   array vars var1-var10;
   do i=1 to 20e6;
      do j=1 to dim(vars);
         vars[j]=rand('integer', 1, 10);
      end;
      output;
   end;
run;

/* Run Time: 28 sec */
data test2;
   if _N_ = 1 then do;
      declare hash h(hashexp:20);
      h.defineKey('var1', 'var2', 'var3', 'var4', 'var5',
                  'var6', 'var7', 'var8', 'var9', 'var10');
      h.defineDone();
   end;
 
   set testdata2;

   if h.check() ne 0 then h.add();
   else output;
run;&lt;/CODE&gt;&lt;/PRE&gt;
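&lt;P&gt;To see on a small scale which rows the hash step writes, here is a minimal sketch (the data set SMALL and its values are invented for illustration). Only the second and later occurrences of a key combination should be output, which mirrors DUPOUT= with NODUPKEY; the first occurrence of each row is not included. No prior sort is needed.&lt;/P&gt;
&lt;PRE&gt;&lt;CODE class=" language-sas"&gt;/* Hypothetical sample data: (1,2) occurs three times, (3,4) twice */
data small;
  input var1 var2;
  datalines;
1 2
3 4
1 2
5 6
3 4
1 2
;
run;

data small_dups;
   if _N_ = 1 then do;
      declare hash h();
      h.defineKey('var1', 'var2');  /* every variable is part of the key */
      h.defineDone();
   end;

   set small;

   if h.check() ne 0 then h.add();  /* first time this combination is seen: remember it */
   else output;                     /* seen before: write the row as a duplicate */
run;&lt;/CODE&gt;&lt;/PRE&gt;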
&lt;P&gt;&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Thu, 23 May 2019 17:45:26 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/How-to-find-duplicate-based-on-entire-record-using-Data-step/m-p/561229#M157067</guid>
      <dc:creator>PeterClemmensen</dc:creator>
      <dc:date>2019-05-23T17:45:26Z</dc:date>
    </item>
    <item>
      <title>Re: How to find duplicate based on entire record using Data step</title>
      <link>https://communities.sas.com/t5/SAS-Programming/How-to-find-duplicate-based-on-entire-record-using-Data-step/m-p/561239#M157076</link>
      <description>&lt;P&gt;&lt;a href="https://communities.sas.com/t5/user/viewprofilepage/user-id/31304"&gt;@PeterClemmensen&lt;/a&gt;: Thanks for sharing the run times. So, the DATA step using a hash object was significantly faster than PROC SORT -- on &lt;EM&gt;your&lt;/EM&gt; computer. This is interesting because it was vice versa on mine (24 vs. 16 s), thus illustrating that general recommendations regarding performance are problematic even if the data are identical.&lt;/P&gt;</description>
      <pubDate>Thu, 23 May 2019 19:20:57 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/How-to-find-duplicate-based-on-entire-record-using-Data-step/m-p/561239#M157076</guid>
      <dc:creator>FreelanceReinh</dc:creator>
      <dc:date>2019-05-23T19:20:57Z</dc:date>
    </item>
    <item>
      <title>Re: How to find duplicate based on entire record using Data step</title>
      <link>https://communities.sas.com/t5/SAS-Programming/How-to-find-duplicate-based-on-entire-record-using-Data-step/m-p/561245#M157081</link>
      <description>&lt;P&gt;&lt;a href="https://communities.sas.com/t5/user/viewprofilepage/user-id/32733"&gt;@FreelanceReinh&lt;/a&gt;,&amp;nbsp;thank you. And I agree completely &lt;span class="lia-unicode-emoji" title=":slightly_smiling_face:"&gt;🙂&lt;/span&gt;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;I do not recommend one over the other, only illustrating how it can be done with a hash object instead of PROC SORT.&lt;/P&gt;</description>
      <pubDate>Thu, 23 May 2019 20:18:22 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/How-to-find-duplicate-based-on-entire-record-using-Data-step/m-p/561245#M157081</guid>
      <dc:creator>PeterClemmensen</dc:creator>
      <dc:date>2019-05-23T20:18:22Z</dc:date>
    </item>
    <item>
      <title>Re: How to find duplicate based on entire record using Data step</title>
      <link>https://communities.sas.com/t5/SAS-Programming/How-to-find-duplicate-based-on-entire-record-using-Data-step/m-p/561249#M157084</link>
      <description>&lt;BLOCKQUOTE&gt;&lt;HR /&gt;&lt;a href="https://communities.sas.com/t5/user/viewprofilepage/user-id/31304"&gt;@PeterClemmensen&lt;/a&gt;&amp;nbsp;wrote:&lt;BR /&gt;
&lt;P&gt;(...)&lt;/P&gt;
&lt;P&gt;I do not recommend one over the other, only illustrating how it can be done with a hash object instead of PROC SORT.&lt;/P&gt;
&lt;HR /&gt;&lt;/BLOCKQUOTE&gt;
&lt;P&gt;Sure. Sorry for the ambiguity. My remark about recommendations was not related to your run time comparison.&lt;/P&gt;</description>
      <pubDate>Thu, 23 May 2019 21:51:39 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/How-to-find-duplicate-based-on-entire-record-using-Data-step/m-p/561249#M157084</guid>
      <dc:creator>FreelanceReinh</dc:creator>
      <dc:date>2019-05-23T21:51:39Z</dc:date>
    </item>
    <item>
      <title>Re: How to find duplicate based on entire record using Data step</title>
      <link>https://communities.sas.com/t5/SAS-Programming/How-to-find-duplicate-based-on-entire-record-using-Data-step/m-p/561370#M157138</link>
      <description>&lt;P&gt;A single PROC SORT is sufficient:&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;PRE class=" language-sas"&gt;&lt;CODE class="  language-sas"&gt;&lt;SPAN class="token procnames"&gt;proc&lt;/SPAN&gt; &lt;SPAN class="token procnames"&gt;sort&lt;/SPAN&gt; &lt;SPAN class="token procnames"&gt;data&lt;/SPAN&gt;&lt;SPAN class="token operator"&gt;=&lt;/SPAN&gt;have out=want nouniquekey &lt;SPAN class="token punctuation"&gt;;&lt;/SPAN&gt;
  &lt;SPAN class="token statement"&gt;by&lt;/SPAN&gt; &lt;SPAN class="token keyword"&gt;_all_&lt;/SPAN&gt;&lt;SPAN class="token punctuation"&gt;;&lt;/SPAN&gt;
&lt;SPAN class="token procnames"&gt;run&lt;/SPAN&gt;&lt;SPAN class="token punctuation"&gt;;&lt;/SPAN&gt;&lt;/CODE&gt;&lt;/PRE&gt;</description>
      <pubDate>Fri, 24 May 2019 12:45:18 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/How-to-find-duplicate-based-on-entire-record-using-Data-step/m-p/561370#M157138</guid>
      <dc:creator>Ksharp</dc:creator>
      <dc:date>2019-05-24T12:45:18Z</dc:date>
    </item>
    <item>
      <title>Re: How to find duplicate based on entire record using Data step</title>
      <link>https://communities.sas.com/t5/SAS-Programming/How-to-find-duplicate-based-on-entire-record-using-Data-step/m-p/561404#M157156</link>
      <description>&lt;P&gt;&lt;a href="https://communities.sas.com/t5/user/viewprofilepage/user-id/254321"&gt;@meenakshim&lt;/a&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;From a coding perspective, using &lt;EM&gt;Proc Sort ... nodupkey; by _all_; run;&lt;/EM&gt; is by far the simplest approach. If you consult the documentation for Proc Sort, you'll also find that it provides options to collect duplicates in a separate data set.&lt;/P&gt;</description>
      <pubDate>Fri, 24 May 2019 14:14:03 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/How-to-find-duplicate-based-on-entire-record-using-Data-step/m-p/561404#M157156</guid>
      <dc:creator>Patrick</dc:creator>
      <dc:date>2019-05-24T14:14:03Z</dc:date>
    </item>
    <item>
      <title>Re: How to find duplicate based on entire record using Data step</title>
      <link>https://communities.sas.com/t5/SAS-Programming/How-to-find-duplicate-based-on-entire-record-using-Data-step/m-p/561438#M157166</link>
      <description>&lt;P&gt;This method is new to me and I need to explore it more, but it's interesting.&lt;/P&gt;&lt;P&gt;I can see that only the duplicate records are output, as with PROC SORT.&lt;/P&gt;</description>
      <pubDate>Fri, 24 May 2019 15:20:43 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/How-to-find-duplicate-based-on-entire-record-using-Data-step/m-p/561438#M157166</guid>
      <dc:creator>meenakshim</dc:creator>
      <dc:date>2019-05-24T15:20:43Z</dc:date>
    </item>
    <item>
      <title>Re: How to find duplicate based on entire record using Data step</title>
      <link>https://communities.sas.com/t5/SAS-Programming/How-to-find-duplicate-based-on-entire-record-using-Data-step/m-p/561445#M157168</link>
      <description>Yes.  In groups with more than one record (i.e., duplicates in this case), FIRST.lastbyvar will be true only on the first record and LAST.lastbyvar will be true only on the last record. A unique record will be both the first and the last of its group.&lt;BR /&gt;The reason to use the data step method instead of just letting PROC SORT do the job is when you want something different: you can write your own logic to control exactly what happens.&lt;BR /&gt;</description>
      <pubDate>Fri, 24 May 2019 15:44:43 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/How-to-find-duplicate-based-on-entire-record-using-Data-step/m-p/561445#M157168</guid>
      <dc:creator>Tom</dc:creator>
      <dc:date>2019-05-24T15:44:43Z</dc:date>
    </item>
    <item>
      <title>Re: How to find duplicate based on entire record using Data step</title>
      <link>https://communities.sas.com/t5/SAS-Programming/How-to-find-duplicate-based-on-entire-record-using-Data-step/m-p/561447#M157170</link>
      <description>&lt;P&gt;I tried this before, but I couldn't figure out what to give as the FIRST. and LAST. BY variable to represent the whole record. As you said, adding another variable to the end made it work as expected. Thank you for the input.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;My class data set is a copy of SASHELP.CLASS with 4 duplicates added.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;PRE&gt;&lt;CODE class=" language-sas"&gt;proc sort data=class;
  by _all_;
run;

data uniqueclass notuniqueclass;
  set class;
  by _all_ age;
  if (first.age and last.age) then output uniqueclass;
  if not (first.age and last.age) then output notuniqueclass;
run;&lt;/CODE&gt;&lt;/PRE&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Thanks&lt;/P&gt;&lt;P&gt;meenakshi&lt;/P&gt;</description>
      <pubDate>Fri, 24 May 2019 15:48:57 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/How-to-find-duplicate-based-on-entire-record-using-Data-step/m-p/561447#M157170</guid>
      <dc:creator>meenakshim</dc:creator>
      <dc:date>2019-05-24T15:48:57Z</dc:date>
    </item>
  </channel>
</rss>

