<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: How to check for dataset duplicate observations and then remove them in SAS Programming</title>
    <link>https://communities.sas.com/t5/SAS-Programming/How-to-check-for-dataset-duplicate-observations-and-then-remove/m-p/590401#M168970</link>
    <description>&lt;P&gt;1. A check to see whether there are duplicate observations in the dataset.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;PRE&gt;&lt;CODE class=" language-sas"&gt;
data _null_;
if 0 then set have;  /* copy the variable layout of HAVE without reading any rows */
 declare hash h(dataset:'have',duplicate:'error');  /* loading HAVE stops with an error if a duplicate key turns up */
 h.definekey(all:'y');  /* every variable is part of the key */
 h.definedone();
stop;
run;
&lt;/CODE&gt;&lt;/PRE&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;2. If there are duplicates, the two different methods (see below) are used to remove them from the 185,000 observations in the dataset.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;PRE&gt;&lt;CODE class=" language-sas"&gt;
data _null_;
if 0 then set have;
 declare hash h(dataset:'have',ordered:'y');  /* ORDERED:'y' writes the output in ascending key order */
 h.definekey('account_id');  /* one row per ACCOUNT_ID; the first occurrence wins */
 h.definedata(all:'y');
 h.definedone();

 h.output(dataset:'want1');
stop;
run;&lt;/CODE&gt;&lt;/PRE&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;PRE&gt;&lt;CODE class=" language-sas"&gt;
data _null_;
if 0 then set have;
 declare hash h(dataset:'have',ordered:'y');
 h.definekey(all:'y');  /* key on both variables, so only full-row duplicates are dropped */
 h.definedata(all:'y');
 h.definedone();

 h.output(dataset:'want2');
stop;
run;&lt;/CODE&gt;&lt;/PRE&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;</description>
    <pubDate>Fri, 20 Sep 2019 13:45:55 GMT</pubDate>
    <dc:creator>Ksharp</dc:creator>
    <dc:date>2019-09-20T13:45:55Z</dc:date>
    <item>
      <title>How to check for dataset duplicate observations and then remove them</title>
      <link>https://communities.sas.com/t5/SAS-Programming/How-to-check-for-dataset-duplicate-observations-and-then-remove/m-p/590044#M168835</link>
      <description>&lt;P&gt;Hi,&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;I have a dataset with 185,000 observations and just two variables - accountID (numeric, format is 12.) and month (numeric). Would it be possible to provide me with code so that:&lt;/P&gt;&lt;P&gt;1. There is a check to see whether there are duplicate observations in the dataset.&lt;/P&gt;&lt;P&gt;2. If there are duplicates, the two different methods (see below) are used to remove them from the 185,000 observations in the dataset.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;&lt;U&gt;Partial example of the 185,000 observations:&lt;/U&gt;&lt;/P&gt;&lt;PRE&gt;account ID    Month
    1         201808
    2         201808
    3         201805
    4         201903
    5         201907
    2         201808
    2         201809&lt;/PRE&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;If there are duplicates, &lt;U&gt;two different new datasets&lt;/U&gt; (i.e. two pieces of code, one for each scenario listed below) need to be created:&lt;/P&gt;&lt;P&gt;1. &lt;U&gt;Unique account IDs only&lt;/U&gt;, even if the same account ID appears with a different month:&lt;/P&gt;&lt;PRE&gt;account ID    Month
    1         201808
    2         201808
    3         201805
    4         201903
    5         201907&lt;/PRE&gt;&lt;P&gt;(Only one row per account ID is kept, and the dataset is kept in ascending order of account ID.)&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;2. Rows are &lt;U&gt;only&lt;/U&gt; removed when account ID &lt;U&gt;AND&lt;/U&gt; month are the &lt;U&gt;same&lt;/U&gt;, so that it creates something like:&lt;/P&gt;&lt;PRE&gt;account ID    Month
    1         201808
    2         201808
    2         201809
    3         201805
    4         201903
    5         201907&lt;/PRE&gt;&lt;P&gt;(&lt;STRIKE&gt;2    201808&lt;/STRIKE&gt; has been removed, as it is a duplicate on both account ID AND month. The dataset is kept in ascending order of account ID and month.)&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Thu, 19 Sep 2019 14:48:25 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/How-to-check-for-dataset-duplicate-observations-and-then-remove/m-p/590044#M168835</guid>
      <dc:creator>jeremy4</dc:creator>
      <dc:date>2019-09-19T14:48:25Z</dc:date>
    </item>
    <item>
      <title>Re: How to check for dataset duplicate observations and then remove them</title>
      <link>https://communities.sas.com/t5/SAS-Programming/How-to-check-for-dataset-duplicate-observations-and-then-remove/m-p/590049#M168837</link>
      <description>&lt;P&gt;Code 1 to produce unique accounts&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;PRE&gt;&lt;CODE class=" language-sas"&gt;proc sql;
    create table want as select distinct account_id  from have;
quit;&lt;/CODE&gt;&lt;/PRE&gt;
&lt;P&gt;Code 2 to produce unique account/month combinations&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;PRE&gt;&lt;CODE class=" language-sas"&gt;proc sql;
    create table want as select distinct account_id,month from have;
quit;&lt;/CODE&gt;&lt;/PRE&gt;</description>
      <pubDate>Thu, 19 Sep 2019 14:55:12 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/How-to-check-for-dataset-duplicate-observations-and-then-remove/m-p/590049#M168837</guid>
      <dc:creator>PaigeMiller</dc:creator>
      <dc:date>2019-09-19T14:55:12Z</dc:date>
    </item>
    <item>
      <title>Re: How to check for dataset duplicate observations and then remove them</title>
      <link>https://communities.sas.com/t5/SAS-Programming/How-to-check-for-dataset-duplicate-observations-and-then-remove/m-p/590055#M168841</link>
      <description>Thanks a lot. Is there a quick way to do a preliminary check for unique records when you have hundreds of thousands of observations, or would you use proc sql to select distinct observations and then compare the counts in the two datasets/tables?</description>
      <pubDate>Thu, 19 Sep 2019 15:00:50 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/How-to-check-for-dataset-duplicate-observations-and-then-remove/m-p/590055#M168841</guid>
      <dc:creator>jeremy4</dc:creator>
      <dc:date>2019-09-19T15:00:50Z</dc:date>
    </item>
    <item>
      <title>Re: How to check for dataset duplicate observations and then remove them</title>
      <link>https://communities.sas.com/t5/SAS-Programming/How-to-check-for-dataset-duplicate-observations-and-then-remove/m-p/590066#M168845</link>
      <description>&lt;BLOCKQUOTE&gt;&lt;HR /&gt;&lt;a href="https://communities.sas.com/t5/user/viewprofilepage/user-id/266226"&gt;@jeremy4&lt;/a&gt;&amp;nbsp;wrote:&lt;BR /&gt;Thanks a lot. Is there a quick way to do a preliminary check for unique records when you have hundreds of thousands of observations, or would you use proc sql to select distinct observations and then compare the counts in the two datasets/tables?&lt;HR /&gt;&lt;/BLOCKQUOTE&gt;
&lt;P&gt;I have to say that I am not quite sure exactly what you are getting at, but there are also options on Proc Sort, such as NOUNIQUEKEY coupled with UNIQUEOUT=dataset-name, that will send all records with unique sort-key values (your account and month, for example) to a separate output data set from the default sort output.&lt;/P&gt;
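&lt;P&gt;As a rough sketch (assuming the dataset is named HAVE with variables ACCOUNT_ID and MONTH - adjust the names to your data), the UNIQUEOUT= idea looks like this:&lt;/P&gt;
&lt;PRE&gt;&lt;CODE class=" language-sas"&gt;/* Rows whose ACCOUNT_ID/MONTH key occurs more than once go to DUPS; */
/* rows whose key occurs exactly once go to SINGLES.                 */
proc sort data=have out=dups nouniquekey uniqueout=singles;
  by account_id month;
run;&lt;/CODE&gt;&lt;/PRE&gt;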
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;There are also NODUPKEY and DUPOUT=dataset-name, which route the duplicate observations found during the sort to a separate data set. DUPOUT and UNIQUEOUT cannot be used in the same Proc Sort call.&lt;/P&gt;</description>
      <pubDate>Thu, 19 Sep 2019 15:24:38 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/How-to-check-for-dataset-duplicate-observations-and-then-remove/m-p/590066#M168845</guid>
      <dc:creator>ballardw</dc:creator>
      <dc:date>2019-09-19T15:24:38Z</dc:date>
    </item>
    <item>
      <title>Re: How to check for dataset duplicate observations and then remove them</title>
      <link>https://communities.sas.com/t5/SAS-Programming/How-to-check-for-dataset-duplicate-observations-and-then-remove/m-p/590083#M168849</link>
      <description>&lt;PRE&gt;&lt;CODE class=" language-sas"&gt;data nodup_key_equivalent ;
 if _n_ = 1 then do ;
   dcl hash h() ;
   h.definekey ('account_id') ;
   h.definedone() ;
 end ;
 set have ;
 if h.check() ne 0 ;  /* output a row only the first time its ACCOUNT_ID is seen */
 h.add() ;
run ;&lt;/CODE&gt;&lt;/PRE&gt;</description>
      <pubDate>Thu, 19 Sep 2019 15:50:33 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/How-to-check-for-dataset-duplicate-observations-and-then-remove/m-p/590083#M168849</guid>
      <dc:creator>novinosrin</dc:creator>
      <dc:date>2019-09-19T15:50:33Z</dc:date>
    </item>
    <item>
      <title>Re: How to check for dataset duplicate observations and then remove them</title>
      <link>https://communities.sas.com/t5/SAS-Programming/How-to-check-for-dataset-duplicate-observations-and-then-remove/m-p/590088#M168850</link>
      <description>&lt;PRE&gt;&lt;CODE class=" language-sas"&gt;data have;
input account_id month;
datalines;
1 201808
2 201808
3 201805
4 201903
5 201907
2 201808
2 201809
;
run;

proc sort data=have;
by account_id month;
run;

data want1;
set have;
	by account_id;
	if first.account_id;	/* keep only the first row for each ACCOUNT_ID */
run;

data want2;
set have;
	by account_id month;
	if first.month;	/* keep only the first row for each ACCOUNT_ID/MONTH pair */
run;&lt;/CODE&gt;&lt;/PRE&gt;&lt;P&gt;The subsetting IF on FIRST. avoids flag logic, which would leave observations that are neither first nor last within a key unflagged when a key occurs more than twice. You could run a proc freq on the variables to get an idea with the larger dataset.&lt;/P&gt;</description>
      <pubDate>Thu, 19 Sep 2019 16:00:33 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/How-to-check-for-dataset-duplicate-observations-and-then-remove/m-p/590088#M168850</guid>
      <dc:creator>maguiremq</dc:creator>
      <dc:date>2019-09-19T16:00:33Z</dc:date>
    </item>
    <item>
      <title>Re: How to check for dataset duplicate observations and then remove them</title>
      <link>https://communities.sas.com/t5/SAS-Programming/How-to-check-for-dataset-duplicate-observations-and-then-remove/m-p/590116#M168858</link>
      <description>&lt;P&gt;&lt;a href="https://communities.sas.com/t5/user/viewprofilepage/user-id/266226"&gt;@jeremy4&lt;/a&gt;:&lt;/P&gt;
&lt;P&gt;To "just check" whether the file has dupes or not, you need to read the file until one dupe is found. Hence, potentially you may need to read the entire file (if the first dupe is in the last record). Since your file (185k obs) is extremely small, you can just read the whole file anyway using proc SORT. If you don't want to write any output data set at this point (I guess that would fall into the category of "just checking"), _NULL_ out the output data set:&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;PRE&gt;&lt;CODE class=" language-sas"&gt;data have ;                                  
  input accountID Month ;                    
  cards ;                                    
1  201808                                    
2  201808                                    
3  201805                                    
4  201903                                    
5  201907                                    
2  201808                                    
2  201809                                    
;                                            
run ;                                        
                                             
proc sort nodupkey data = have out = _null_ ;
  by accountID ;                             
run ;                                        
                                             
proc sort nodupkey data = have out = _null_ ;
  by accountID month ;                       
run ;                                        
&lt;/CODE&gt;&lt;/PRE&gt;
&lt;P&gt;The SAS log will tell you if you have dupes and how many. For your particular sample, it reports for the first and second sort, respectively:&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;PRE&gt;NOTE: 2 observations with duplicate key values were deleted.
NOTE: 1 observations with duplicate key values were deleted.
&lt;/PRE&gt;
&lt;P&gt;Note that "deleted" doesn't mean that the dupes are deleted from the &lt;EM&gt;input&lt;/EM&gt; file. Rather, it means they would be excluded from the output if it were written out (nothing is written out because of OUT=_NULL_ - you're "just checking").&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;To generate the unduplicated output file you want, just plug the data set names you want instead of _NULL_:&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;PRE&gt;&lt;CODE class=" language-sas"&gt;proc sort equals nodupkey data = have out = out_one ;                                                                                                                                                                                                           
  by accountID ;                                                                                                                                                                                                                                                
run ;                                                                                                                                                                                                                                                           
                                                                                                                                                                                                                                                                
proc sort equals nodupkey data = have out = out_two ;                                                                                                                                                                                                           
  by accountID month ;                                                                                                                                                                                                                                          
run ;                                      
&lt;/CODE&gt;&lt;/PRE&gt;
&lt;P&gt;Now if your input file were extremely large, not to mention also wide (i.e. having many more variables than the accountID and month), "just checking" using proc SORT to read the whole file can prove rather wasteful, especially if the first dupe is located near the beginning of the input file. In this case, checking for dupes using the hash object may be more efficient since (a) you don't need to sort the input file and (b) the step is stopped as soon as duplicate [accountID,Month] key is detected:&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;PRE&gt;&lt;CODE class=" language-sas"&gt;data _null_ ;                                                              
  dcl hash a () ;                                                          
  a.definekey ("accountID") ;                                              
  a.definedone () ;                                                        
  dcl hash am () ;                                                         
  am.definekey ("accountID", "month") ;                                    
  am.definedone () ;                                                       
  do _n_ = 1 by 1 until (eof) ;                                            
    set have (keep = accountID Month) end = eof ;                          
    if a.check() = 0 then a_dup = 1 ;                                      
    else a.add() ;                                                         
    if am.check() = 0 then am_dup = 1 ;                                    
    else am.add() ;                                                        
    if a_dup and am_dup then do ;                                          
      put "NOTE: Dupes both by accountID and [accountID,Month] detected." ;
      stop ;                                                               
    end ;                                                                  
  end ;                                                                    
  if a_dup then do ;                                                       
    put "NOTE: Dupes by accountID detected." ;                             
    if am_dup then put "NOTE: Dupes by [accountID,Month] detected." ;      
  end ;                                                                    
  else put "NOTE: No dupes detected." ;                                    
  stop ;                                                                   
run ;                                                                      
&lt;/CODE&gt;&lt;/PRE&gt;
&lt;P&gt;This program can be expanded in a number of ways, for example, to:&lt;/P&gt;
&lt;UL&gt;
&lt;LI&gt;report on the total number of dupes by either key&lt;/LI&gt;
&lt;LI&gt;generate unduplicated files in both the original key order and sorted order&lt;/LI&gt;
&lt;LI&gt;generate files containing the eliminated duplicate records in the original key order and sorted order&lt;/LI&gt;
&lt;/UL&gt;
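&lt;P&gt;For instance, the first bullet could be sketched by swapping the early-stop logic for counters (an untested variant of the step above, using the same variable names):&lt;/P&gt;
&lt;PRE&gt;&lt;CODE class=" language-sas"&gt;data _null_ ;
  dcl hash a () ;
  a.definekey ("accountID") ;
  a.definedone () ;
  dcl hash am () ;
  am.definekey ("accountID", "month") ;
  am.definedone () ;
  do until (eof) ;
    set have (keep = accountID Month) end = eof ;
    if a.check() = 0 then a_dups + 1 ;    /* key seen before: count a dupe    */
    else a.add() ;                        /* first sighting: register the key */
    if am.check() = 0 then am_dups + 1 ;
    else am.add() ;
  end ;
  put "NOTE: " a_dups= am_dups= ;
run ;&lt;/CODE&gt;&lt;/PRE&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;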
&lt;P&gt;(Note that doing any of the above requires that the entire input file be read.) There are more nifty things in the same vein that the hash object can be used for. If interested, here's a shameless plug for the "hash book"&amp;nbsp;&lt;a href="https://communities.sas.com/t5/user/viewprofilepage/user-id/13569"&gt;@DonH&lt;/a&gt;&amp;nbsp;and I have put together:&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&lt;A href="https://support.sas.com/en/books/authors/paul-dorfman.html" target="_self"&gt;https://support.sas.com/en/books/authors/paul-dorfman.html&lt;/A&gt;&amp;nbsp;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;Kind regards&lt;/P&gt;
&lt;P&gt;Paul D.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Thu, 19 Sep 2019 16:58:58 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/How-to-check-for-dataset-duplicate-observations-and-then-remove/m-p/590116#M168858</guid>
      <dc:creator>hashman</dc:creator>
      <dc:date>2019-09-19T16:58:58Z</dc:date>
    </item>
    <item>
      <title>Re: How to check for dataset duplicate observations and then remove them</title>
      <link>https://communities.sas.com/t5/SAS-Programming/How-to-check-for-dataset-duplicate-observations-and-then-remove/m-p/590401#M168970</link>
      <description>&lt;P&gt;1. A check to see whether there are duplicate observations in the dataset.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;PRE&gt;&lt;CODE class=" language-sas"&gt;
data _null_;
if 0 then set have;  /* copy the variable layout of HAVE without reading any rows */
 declare hash h(dataset:'have',duplicate:'error');  /* loading HAVE stops with an error if a duplicate key turns up */
 h.definekey(all:'y');  /* every variable is part of the key */
 h.definedone();
stop;
run;
&lt;/CODE&gt;&lt;/PRE&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;2. If there are duplicates, the two different methods (see below) are used to remove them from the 185,000 observations in the dataset.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;PRE&gt;&lt;CODE class=" language-sas"&gt;
data _null_;
if 0 then set have;
 declare hash h(dataset:'have',ordered:'y');  /* ORDERED:'y' writes the output in ascending key order */
 h.definekey('account_id');  /* one row per ACCOUNT_ID; the first occurrence wins */
 h.definedata(all:'y');
 h.definedone();

 h.output(dataset:'want1');
stop;
run;&lt;/CODE&gt;&lt;/PRE&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;PRE&gt;&lt;CODE class=" language-sas"&gt;
data _null_;
if 0 then set have;
 declare hash h(dataset:'have',ordered:'y');
 h.definekey(all:'y');  /* key on both variables, so only full-row duplicates are dropped */
 h.definedata(all:'y');
 h.definedone();

 h.output(dataset:'want2');
stop;
run;&lt;/CODE&gt;&lt;/PRE&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Fri, 20 Sep 2019 13:45:55 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/How-to-check-for-dataset-duplicate-observations-and-then-remove/m-p/590401#M168970</guid>
      <dc:creator>Ksharp</dc:creator>
      <dc:date>2019-09-20T13:45:55Z</dc:date>
    </item>
  </channel>
</rss>

