jeremy4
Quartz | Level 8

Hi,

 

I have a dataset with 185,000 observations and just two variables: accountID (numeric, format 12.) and month (numeric). Would it be possible to provide me with code that does the following:

1. Checks whether there are duplicate observations in the dataset.

2. If there are duplicates, removes them using the two different methods (see below) based on the 185,000 observations in the dataset.

 

Partial example of the 185,000 observations:

account ID            Month

    1                       201808

    2                       201808

    3                       201805

    4                       201903

    5                       201907

    2                       201808

    2                       201809

 

If there are duplicates, there are two different new datasets (i.e. two pieces of code to get each scenario listed below) that need to be created:

1. Unique account IDs only: keep one observation per account ID, even when the same account ID appears with different months.

account ID            Month

    1                       201808

    2                       201808

    3                       201805

    4                       201903

    5                       201907

(Only one observation per account ID is kept, and the dataset is kept in ascending order of account ID.)

 

2. Observations are removed only when BOTH account ID AND month are the same, so that it creates something like:

account ID            Month

    1                       201808

    2                       201808

    2                       201809

    3                       201805

    4                       201903

    5                       201907

   

(The duplicate "2 / 201808" observation has been removed because it matches on both account ID AND month. The dataset is kept in ascending order of account ID and month.)

 


7 REPLIES
PaigeMiller
Diamond | Level 26

Code 1 to produce unique accounts

 

proc sql;
    create table want as select distinct account_id  from have;
quit;

Code 2 to produce unique account/month combinations

 

proc sql;
    create table want as select distinct account_id,month from have;
quit;
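
If the ascending order must be guaranteed rather than relied on as a side effect of DISTINCT, an ORDER BY clause can be added (a sketch):

proc sql;
    create table want as select distinct account_id,month from have
    order by account_id,month;
quit;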
--
Paige Miller
jeremy4
Quartz | Level 8
Thanks a lot, is there a quick way to do a preliminary check for unique records when you have hundreds of thousands of observations, or would you use proc sql to select distinct observations and then compare the counts in both datasets/tables?
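
(A one-pass version of that count comparison might look like this; a sketch, assuming the table is named have with variables accountID and month:)

proc sql;
    select count(*)                                    as n_obs,
           count(distinct accountID)                   as n_accounts,
           count(distinct catx('|', accountID, month)) as n_acct_month
    from have;
quit;

If n_obs equals n_acct_month, there are no duplicate accountID/month pairs to remove.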
ballardw
Super User

@jeremy4 wrote:
Thanks a lot, is there a quick way to do a preliminary check for unique records when you have hundreds of thousands of observations, or would you use proc sql to select distinct observations and then compare the counts in both datasets/tables?

I have to say that I am not quite sure exactly what you are getting at, but there are also options on Proc Sort such as NOUNIQUEKEY coupled with UNIQUEOUT=datasetname that will send all records with unique sort key variables (your account and month, for example) to a separate output data set from the default sort output set.

 

There are also NODUPKEY and DUPOUT=datasetname, which sends the duplicate observations removed by the sort to a separate data set. DUPOUT= and UNIQUEOUT= cannot be used in the same Proc Sort call.
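
(A minimal sketch of both option pairs, assuming the input data set is named have; the output data set names are made up:)

/* Keep only keys that occur more than once; unique-key records go to UNIQUEOUT= */
proc sort data=have out=multi_key nouniquekey uniqueout=unique_recs;
    by accountID month;
run;

/* Remove duplicate-key records; the removed records go to DUPOUT= */
proc sort data=have out=dedup nodupkey dupout=dup_recs;
    by accountID month;
run;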

novinosrin
Tourmaline | Level 20
data nodup_key_equivalent ;
 if _n_ = 1 then do ;
   dcl hash h() ;                /* instantiate the hash once */
   h.definekey ('account_id') ;
   h.definedone() ;
 end ;
 set have ;
 if h.check() ne 0 ;             /* subsetting IF: keep only the first occurrence of each account_id */
 h.add() ;
run ;
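
The same pattern keyed on both variables gives the account/month dedup (a sketch):

data nodup_both ;
 if _n_ = 1 then do ;
   dcl hash h() ;
   h.definekey ('account_id', 'month') ;   /* key on the pair */
   h.definedone() ;
 end ;
 set have ;
 if h.check() ne 0 ;   /* only the first occurrence of each pair survives */
 h.add() ;
run ;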
maguiremq
SAS Super FREQ
data have;
input account_id month;
datalines;
1 201808
2 201808
3 201805
4 201903
5 201907
2 201808
2 201809
;
run;

proc sort data=have;
by account_id month;
run;

data want1 (drop=keep);
set have;
	by account_id;
	if first.account_id then keep = 1;
	else keep = 0;
	if keep = 1;
run;

data want2 (drop=keep);
set have;
	by account_id month;
	if first.month then keep = 1;
	else keep = 0;
	if keep = 1;
run;

The FIRST. automatic variable flags the first observation in each BY group, so any number of duplicates within a group is handled. You could run a proc freq on the variables to get an idea of the duplicate counts in the larger dataset.
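
(A quick sketch of that PROC FREQ check; the output data set name key_counts is made up:)

proc freq data=have noprint;
	tables account_id*month / out=key_counts;
run;

proc print data=key_counts;
	where count > 1;   /* key pairs appearing more than once are duplicates */
run;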

hashman
Ammonite | Level 13

@jeremy4:

To "just check" whether the file has dupes or not, you need to read the file until one dupe is found. Hence, potentially you may need to read the entire file (if the first dupe is in the last record). Since your file (185k obs) is extremely small, you can just read the whole file anyway using proc SORT. If you don't want to write any output data set at this point (I guess that would fall into the category of "just checking"), _NULL_ out the output data set:

 

data have ;
  input accountID Month ;
  cards ;
1  201808
2  201808
3  201805
4  201903
5  201907
2  201808
2  201809
;
run ;

proc sort nodupkey data = have out = _null_ ;
  by accountID ;
run ;

proc sort nodupkey data = have out = _null_ ;
  by accountID month ;
run ;

The SAS log will tell you if you have dupes and how many. For your particular sample, it reports for the first and second sort, respectively:

 

 

NOTE: 2 observations with duplicate key values were deleted.
NOTE: 1 observations with duplicate key values were deleted.

Note that "deleted" doesn't mean that the dupes are deleted from the input file. Rather, it means they would be excluded from the output if it were written out (nothing is written out because of OUT=_NULL_ - you're "just checking"). 

 

 

To generate the unduplicated output file you want, just plug the data set names you want instead of _NULL_:

 

proc sort equals nodupkey data = have out = out_one ;                                                                                                                                                                                                           
  by accountID ;                                                                                                                                                                                                                                                
run ;                                                                                                                                                                                                                                                           
                                                                                                                                                                                                                                                                
proc sort equals nodupkey data = have out = out_two ;                                                                                                                                                                                                           
  by accountID month ;                                                                                                                                                                                                                                          
run ;                                      

Now if your input file were extremely large, not to mention also wide (i.e. having many more variables than accountID and month), "just checking" by using proc SORT to read the whole file can prove rather wasteful, especially if the first dupe is located near the beginning of the input file. In this case, checking for dupes using the hash object may be more efficient since (a) you don't need to sort the input file and (b) the step stops as soon as a duplicate [accountID,Month] key is detected:

 

 

data _null_ ;
  dcl hash a () ;
  a.definekey ("accountID") ;
  a.definedone () ;
  dcl hash am () ;
  am.definekey ("accountID", "month") ;
  am.definedone () ;
  do _n_ = 1 by 1 until (eof) ;
    set have (keep = accountID Month) end = eof ;
    if a.check() = 0 then a_dup = 1 ;
    else a.add() ;
    if am.check() = 0 then am_dup = 1 ;
    else am.add() ;
    if a_dup and am_dup then do ;
      put "NOTE: Dupes both by accountID and [accountID,Month] detected." ;
      stop ;
    end ;
  end ;
  if a_dup then do ;
    put "NOTE: Dupes by accountID detected." ;
    if am_dup then put "NOTE: Dupes by [accountID,Month] detected." ;
  end ;
  else put "NOTE: No dupes detected." ;
  stop ;
run ;

This program can be expanded in a number of ways, for example, to:

  • report on the total number of dupes by either key (see the sketch after this list)
  • generate unduplicated files in both the original key order and sorted order
  • generate files containing the eliminated duplicate records in the original key order and sorted order
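
For instance, counting the total number of dupes by either key might be sketched like so (same input assumptions as above; the counter names are made up):

data _null_ ;
  dcl hash a () ;
  a.definekey ("accountID") ;
  a.definedone () ;
  dcl hash am () ;
  am.definekey ("accountID", "month") ;
  am.definedone () ;
  do until (eof) ;
    set have (keep = accountID Month) end = eof ;
    if a.check() = 0 then a_dups + 1 ;    /* key already in the table: a dupe by accountID */
    else a.add() ;
    if am.check() = 0 then am_dups + 1 ;  /* a dupe by the [accountID,Month] pair */
    else am.add() ;
  end ;
  put "NOTE: " a_dups= "dupe(s) by accountID, " am_dups= "dupe(s) by [accountID,Month]." ;
  stop ;
run ;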

(Note that doing any of the above requires that the entire input file be read.) There are more nifty things in the same vein the hash object can be used for. If interested, here's a shameless plug for the "hash book" @DonH and I have put together:

 

https://support.sas.com/en/books/authors/paul-dorfman.html  

 

Kind regards

Paul D.


Ksharp
Super User

1. Checks whether there are duplicate observations in the dataset.

data _null_;
if 0 then set have;                                  /* pull variable definitions from HAVE at compile time */
 declare hash h(dataset:'have',duplicate:'error');   /* loading HAVE throws an ERROR if a duplicate key is met */
 h.definekey(all:'y');                               /* key on every variable, i.e. a full-record duplicate check */
 h.definedone();
stop;
run;

2. If there are duplicates, removes them using the two different methods (see below) based on the 185,000 observations in the dataset.

data _null_;
if 0 then set have;
 declare hash h(dataset:'have',ordered:'y');   /* ORDERED:'Y' writes output in ascending key order */
 h.definekey('account_id');                    /* later duplicates of a key are skipped on load */
 h.definedata(all:'y');
 h.definedone();

 h.output(dataset:'want1');                    /* one row per account_id */
stop;
run;

data _null_;
if 0 then set have;
 declare hash h(dataset:'have',ordered:'y');
 h.definekey(all:'y');                         /* key on account_id AND month */
 h.definedata(all:'y');
 h.definedone();

 h.output(dataset:'want2');                    /* one row per account_id/month pair */
stop;
run;
