Re: What is the best way to identify identical duplicate records in la...

abhityagi · Posted 11-21-2019 03:01 PM

Hi Team,

I am looking for a best approach to identify complete duplicate records(redundancy) in large dataset. I can not use proc sort since it will impact the performance of the code. Is there anyone can share a sample of code except proc sort for identifying duplicate records in a dataset. It will be highly appreciated.

Thanks,

Abhishek

Krueger · Posted 11-21-2019 03:06 PM

Here's a good resource that goes over different methods. You can determine which one, if any is best for your situation. https://support.sas.com/resources/papers/proceedings17/0188-2017.pdf

Kurt_Bremser · Posted 11-21-2019 03:22 PM

All methods to find duplicates require a sorting process. The only question is how fast you can do it.

The fastest method would be the use of a hash object, but that requires that all unique keys fit in memory; likewise the summary method with class mentioned in the paper linked by @Krueger.

How "large" is your dataset (physical file size, observation size, number of observations)?

Maxims of Maximally Efficient SAS Programmers
How to convert datasets to data steps
The macro for direct download as ZIP
How to post code
Please vote for Provide Sequential Search Capability for Hash Objects
How to deal with locked files on UNIX

hashman · Posted 11-21-2019 11:42 PM

@Kurt_Bremser: "All methods to find duplicates require a sorting process."

How about this:

allocate a lookup table
read the next record from the file being deduped
if the key is not in the table, output; else go to #2

I don't see any sorting involved in this process - unless you re-sort the lookup table after every insertion. But the latter would mean that you've picked a wrong lookup table structure. Using a direct-addressed table, such as an open-addressed hash, key-indexed, or bitmap table, requires no sorting whatsoever. Note that I didn't list the SAS hash object just for the sake of purity, as at some angle, using an order is involved when a key is inserted into / searched in one of the underlying AVL trees. That notwithstanding, when it is used for deduping, no sorting of the input data is needed, and its internal "sorting" (i.e. traversing the tree) is extremely fast.

However, it doesn't have to be the hash object, as organizing a true open-addressed hash table, contrary to common perception, isn't difficult at all. For example, this one is based on the simplest collision resolution policy (by linear probing):

data have ;                                                                                                                                                                                                                                                     
  do _n_ = 1 to 1.5 * n ;                                                                                                                                                                                                                                       
    _iorc_ = ceil (ranuni (1) * n) ;                                                                                                                                                                                                                            
    set sashelp.class nobs = n point = _iorc_ ;                                                                                                                                                                                                                 
    output ;                                                                                                                                                                                                                                                    
  end ;                                                                                                                                                                                                                                                         
  stop ;                                                                                                                                                                                                                                                        
run ;                                                                                                                                                                                                                                                           
                                                                                                                                                                                                                                                                
%let h = 101 ; * prime number > 2*nobs in HAVE, can be auto-computed ;                                                                                                                                                                                                          
                                                                                                                                                                                                                                                                
data want (drop = ___:) ;                                                                                                                                                                                                                                       
  array h [&h] $16 _temporary_ ;                                                                                                                                                                                                                                
  set have ;                                                                                                                                                                                                                                                    
  array nn _numeric_ ;                                                                                                                                                                                                                                          
  array cc _char_ ;                                                                                                                                                                                                                                             
  ___m = put (md5 (catx (":", of nn[*], of cc[*])), $16.) ;                                                                                                                                                                                                     
  do _n_ = 1 + mod (input (___m, pib6.), &h) by 1 until (h[_n_] = ___m or cmiss (h[_n_])) ;                                                                                                                                                                     
    if _n_ > &h then _n_ = 1 ;                                                                                                                                                                                                                                  
  end ;                                                                                                                                                                                                                                                         
  if cmiss (h[_n_]) ;                                                                                                                                                                                                                                           
  h[_n_] = ___m ;                                                                                                                                                                                                                                               
run ;

The array H above is a true open-addressed hash table, and if you look at its content at the end of the step, you'll see that it's the farthest thing from anything "sorted" one could ever imagine.

Kind regards

Paul D.

Kurt_Bremser · Posted 11-22-2019 07:31 AM

I maybe should have added "efficient". I just expanded your dataset to a reasonable number to get some processing time:

data have ;                                                                                                                                                                                                                                                     
  do _n_ = 1 to 20000 * n ;                                                                                                                                                                                                                                       
    _iorc_ = ceil (ranuni (1) * n) ;                                                                                                                                                                                                                            
    set sashelp.class nobs = n point = _iorc_ ;                                                                                                                                                                                                                 
    output ;                                                                                                                                                                                                                                                    
  end ;                                                                                                                                                                                                                                                         
  stop ;                                                                                                                                                                                                                                                        
run ;

And ran three different methods against it:

%let h = 760007 ; * prime number > 2*nobs in HAVE, can be auto-computed ;                                                                                                                                                                                                          
                                                                                                                                                                                                                                                                
data want1 (drop = ___:) ;                                                                                                                                                                                                                                       
  array h [&h] $16 _temporary_ ;                                                                                                                                                                                                                                
  set have ;                                                                                                                                                                                                                                                    
  array nn _numeric_ ;                                                                                                                                                                                                                                          
  array cc _char_ ;  
  ___m = put (md5 (catx (":", of nn[*], of cc[*])), $16.) ;
  do _n_ = 1 + mod (input (___m, pib6.), &h) by 1 until (h[_n_] = ___m or cmiss (h[_n_])) ;                                                                                                                                                                     
    if _n_ > &h then _n_ = 1 ; 
  end ;
  if cmiss (h[_n_]) ;                                                                                                                                                                                                                                           
  h[_n_] = ___m ;
run ;  

proc sort data=have out=want2 noduprec;
by name;
run;

data want3 (drop = ___:) ;                                                                                                                                                                                                                                       
  if _n_ = 1 then do ;                                                                                                                                                                                                                                          
    dcl hash h () ;                                                                                                                                                                                                                                             
    h.definekey ("___m") ;                                                                                                                                                                                                                                      
    h.definedone () ;                                                                                                                                                                                                                                           
  end ;                                                                                                                                                                                                                                                         
  set have ;                                                                                                                                                                                                                                                    
  array nn _numeric_ ;                                                                                                                                                                                                                                          
  array cc _char_ ;                                                                                                                                                                                                                                             
  ___m = put (md5 (catx (":", of nn[*], of cc[*])), $16.) ;                                                                                                                                                                                                     
  if h.check() ne 0 ;                                                                                                                                                                                                                                           
  h.add() ;                                                                                                                                                                                                                                                     
run ;

On SAS UE on a MacBook Pro with a 2,6 GHz Intel Core i7.

The real surprise was the performance of proc sort, which beat the others handily by a factor of 3, and that using a hash object did not really outperform the "manual" method:

 73         data want1 (drop = ___:) ;
 74           array h [&h] $16 _temporary_ ;
 75           set have ;
 76           array nn _numeric_ ;
 77           array cc _char_ ;
 78           ___m = put (md5 (catx (":", of nn[*], of cc[*])), $16.) ;
 79           do _n_ = 1 + mod (input (___m, pib6.), &h) by 1 until (h[_n_] = ___m or cmiss (h[_n_])) ;
 80             if _n_ > &h then _n_ = 1 ;
 81           end ;
 82           if cmiss (h[_n_]) ;
 83           h[_n_] = ___m ;
 84         run ;
 
 NOTE: There were 380000 observations read from the data set WORK.HAVE.
 NOTE: The data set WORK.WANT1 has 19 observations and 5 variables.
 NOTE:  Verwendet wurde: DATA statement - (Gesamtverarbeitungszeit):
       real time           0.53 seconds
       cpu time            0.52 seconds
       
 
 85         
 86         proc sort data=have out=want2 noduprec;
 87         by name;
 88         run;
 
 NOTE: There were 380000 observations read from the data set WORK.HAVE.
 NOTE: 379981 duplicate observations were deleted.
 NOTE: The data set WORK.WANT2 has 19 observations and 5 variables.
 NOTE:  Verwendet wurde: PROZEDUR SORT - (Gesamtverarbeitungszeit):
       real time           0.16 seconds
       cpu time            0.24 seconds
       
 
 89         
 90         data want3 (drop = ___:) ;
 91           if _n_ = 1 then do ;
 92             dcl hash h () ;
 93             h.definekey ("___m") ;
 94             h.definedone () ;
 95           end ;
 96           set have ;
 97           array nn _numeric_ ;
 98           array cc _char_ ;
 99           ___m = put (md5 (catx (":", of nn[*], of cc[*])), $16.) ;
 100          if h.check() ne 0 ;
 101          h.add() ;
 102        run ;
 
 NOTE: There were 380000 observations read from the data set WORK.HAVE.
 NOTE: The data set WORK.WANT3 has 19 observations and 5 variables.
 NOTE:  Verwendet wurde: DATA statement - (Gesamtverarbeitungszeit):
       real time           0.48 seconds
       cpu time            0.48 seconds

It seems that's because sort uses multiple cores (CPU time > real time).

So, for the OP, it comes down to Maxim 4.

Maxims of Maximally Efficient SAS Programmers
How to convert datasets to data steps
The macro for direct download as ZIP
How to post code
Please vote for Provide Sequential Search Capability for Hash Objects
How to deal with locked files on UNIX

hashman · Posted 11-22-2019 01:15 PM

@Kurt_Bremser:

I'm not actually surprised.

1. As far as SORT goes, it may be a "reasonable number" but it's nowhere near the area where SORT starts reaching its usual bottlenecks. With 5 short variables, SORT will likely outperform anything else well into the 10 million obs and then some. And I don't think it will even have to make use of multi-threading because with the composite key cardinality of 15 it just keeps all its internal bookkeeping in memory. In fact, it will try making use of memory all the way up to SORTSIZE for its bookkeeping, and only after a certain threshold has been reached the limit is reached, it will start offloading parts of it to utility files in the SORTWORK space. That's why nowadays, when memories are huge even on small hardware, you see SORT grabbing so much RAM in order to speed things up. This is in sharp contrast with the days of yore, when the memories were scant (just 20 years ago on the mainframe it was normal to see REGION set to 256K), and the biggest advantage of sorting (and the reason why so much effort had been dedicated to optimizing its algorithms) was that in the end, it could get the job done using very little memory (albeit at the expense of processing time and using bookkeeping on disk and even tapes), while a fast lookup table approach that would require keeping all distinct keys in memory would fail due to the latter's paucity.

In my real-world experience, SORT used for deduping starts sputtering somewhere around 50 GB and more. At one org I consulted for, there were up to 10 extracts, between 50 to 100 GB each, from SQL Server that needed to be deduped as the first ETL cleansing step. It was taking SORT 1.5 to 2 hours of real time apiece, not to mention constant nagging problems with the SORTWORK space (on Linux), and as a result, the ETL would simply fail to fit into the off-peak time window during which it had to run to make the data available in the morning. The problem was solved by keeping the keys only plus the RID and sort-deduping only those; then using the RIDs to mark the duplicate records in the original extracts for deletion. Since it resulted in reducing each dedup process to no more than 3 minutes, it enthused me to described the method here:

https://support.sas.com/content/dam/SAS/support/en/sas-global-forum-proceedings/2018/2426-2018.pdf

In the present OP's case, when all the variables form the composite dedup key, the concept can be still applied by combining it with using MD5 to create a single $16 key variable to reduce the sorting workload. Only in this case the number of dupes greatly exceeds the number of the records to keep, so it's better to rewrite HAVE into WANT by using the RIDs corresponding to the distinct records rather than using the RIDs corresponding to the dupes to mark the records in HAVE for deletion. For example:

data have ;                                                                                                                             
  do _n_ = 1 to 1.5 * n ;                                                                                                               
    _iorc_ = ceil (ranuni (1) * n) ;                                                                                                    
    set sashelp.class nobs = n point = _iorc_ ;                                                                                         
    output ;                                                                                                                            
  end ;                                                                                                                                 
  stop ;                                                                                                                                
run ;                                                                                                                                   
                                                                                                                                        
data v (keep = ___:)  / view = v ;                                                                                                      
  set have ;                                                                                                                            
  array nn _numeric_ ;                                                                                                                  
  array cc _char_ ;                                                                                                                     
  ___m = put (md5 (catx (":", of nn[*], of cc[*])), $16.) ;                                                                             
  ___rid = _n_ ;                                                                                                                        
run ;                                                                                                                                   
                                                                                                                                        
proc sort nodupkey data = v out = nodup (keep = ___rid) ;                                                                               
  by ___m ;                                                                                                                             
run ;                                                                                                                                   
                                                                                                                                        
data want ;                                                                                                                             
  if _n_ = 1 then do ;                                                                                                                  
    dcl hash h (dataset:"nodup") ;                                                                                                      
    h.definekey (all:"y") ;                                                                                                             
    h.definedone () ;                                                                                                                   
  end ;                                                                                                                                 
  set have curobs = ___rid ;                                                                                                            
  if h.check() = 0 ;                                                                                                                    
run ;

2. The hand-coded hash outperforms the hash object in this case because the array hash table is extremely sparse: 760007 slots for 15 distinct keys, so most likely there are no collisions at all, and the DO loop terminates after the first iteration. It means that effectively, it is as fast as key-indexing, except for the pretty high cost of computing the hash function (though it's mitigated by the fact that MD5, the costliest part if it, has to be computed, anyway).

Also, array-based hashing always outperforms the hash object by a healthy margin, both in terms of time and memory footprint, if the key is a single integer variable - or a combination of keys that can be mapped to a single integer one-to-one very inexpensively. This is because in this case, the hash function is nothing but MOD, so it's very fast, which makes the entire algorithm very fast as well. Memory-wise, the shortest hash object item, even if there's a single numeric key, is 48 bytes, while one array item for such a key is 8 bytes. Therefore, even if we make the array 3 times sparser vis-a-vis the number of keys stored there, its memory footprint is still 1/2 of the hash object table.

Furthermore, such an open-addressed hash table is faster than the hash object for data aggregation, and the reason is simple: With the former, we can add something directly to the array item, while with the latter, we first need to copy the value from a hash into the PDV, add something to it, and then replace it in the item with the new value.

Thanks for the nice discussion.

Kind regards

Paul D.

hashman · Posted 11-21-2019 10:48 PM

@abhityagi:

First, let's concoct duplicate data:

data have ;                                                                                                                                                                                                                                                     
  do _n_ = 1 to 1.5 * n ;                                                                                                                                                                                                                                       
    _iorc_ = ceil (ranuni (1) * n) ;                                                                                                                                                                                                                            
    set sashelp.class nobs = n point = _iorc_ ;                                                                                                                                                                                                                 
    output ;                                                                                                                                                                                                                                                    
  end ;                                                                                                                                                                                                                                                         
  stop ;                                                                                                                                                                                                                                                        
run ;

That's how it looks like:

Name       Sex    Age    Height    Weight                                                                                                                                                                                                                       
-----------------------------------------                                                                                                                                                                                                                       
Carol       F      14     62.8      102.5                                                                                                                                                                                                                       
William     M      15     66.5      112.0                                                                                                                                                                                                                       
Janet       F      15     62.5      112.5                                                                                                                                                                                                                       
Henry       M      14     63.5      102.5                                                                                                                                                                                                                       
Thomas      M      11     57.5       85.0                                                                                                                                                                                                                       
William     M      15     66.5      112.0                                                                                                                                                                                                                       
Joyce       F      11     51.3       50.5                                                                                                                                                                                                                       
Joyce       F      11     51.3       50.5                                                                                                                                                                                                                       
Alfred      M      14     69.0      112.5                                                                                                                                                                                                                       
Alice       F      13     56.5       84.0                                                                                                                                                                                                                       
Robert      M      12     64.8      128.0                                                                                                                                                                                                                       
John        M      12     59.0       99.5                                                                                                                                                                                                                       
Ronald      M      15     67.0      133.0                                                                                                                                                                                                                       
Alice       F      13     56.5       84.0                                                                                                                                                                                                                       
William     M      15     66.5      112.0                                                                                                                                                                                                                       
James       M      12     57.3       83.0                                                                                                                                                                                                                       
James       M      12     57.3       83.0                                                                                                                                                                                                                       
Mary        F      15     66.5      112.0                                                                                                                                                                                                                       
William     M      15     66.5      112.0                                                                                                                                                                                                                       
Henry       M      14     63.5      102.5                                                                                                                                                                                                                       
Mary        F      15     66.5      112.0                                                                                                                                                                                                                       
Janet       F      15     62.5      112.5                                                                                                                                                                                                                       
Joyce       F      11     51.3       50.5                                                                                                                                                                                                                       
James       M      12     57.3       83.0                                                                                                                                                                                                                       
John        M      12     59.0       99.5                                                                                                                                                                                                                       
Ronald      M      15     67.0      133.0                                                                                                                                                                                                                       
Louise      F      12     56.3       77.0                                                                                                                                                                                                                       
Judy        F      14     64.3       90.0

Now let's dedup it without any sorting whatsoever:

data want (drop = ___:) ;                                                                                                                                                                                                                                       
  if _n_ = 1 then do ;                                                                                                                                                                                                                                          
    dcl hash h () ;                                                                                                                                                                                                                                             
    h.definekey ("___m") ;                                                                                                                                                                                                                                      
    h.definedone () ;                                                                                                                                                                                                                                           
  end ;                                                                                                                                                                                                                                                         
  set have ;                                                                                                                                                                                                                                                    
  array nn _numeric_ ;                                                                                                                                                                                                                                          
  array cc _char_ ;                                                                                                                                                                                                                                             
  ___m = put (md5 (catx (":", of nn[*], of cc[*])), $16.) ;                                                                                                                                                                                                     
  if h.check() ne 0 ;                                                                                                                                                                                                                                           
  h.add() ;                                                                                                                                                                                                                                                     
run ;

The output looks like this:

Name       Sex    Age    Height    Weight                                                                                                                                                                                                                       
-----------------------------------------                                                                                                                                                                                                                       
Carol       F      14     62.8      102.5                                                                                                                                                                                                                       
William     M      15     66.5      112.0                                                                                                                                                                                                                       
Janet       F      15     62.5      112.5                                                                                                                                                                                                                       
Henry       M      14     63.5      102.5                                                                                                                                                                                                                       
Thomas      M      11     57.5       85.0                                                                                                                                                                                                                       
Joyce       F      11     51.3       50.5                                                                                                                                                                                                                       
Alfred      M      14     69.0      112.5                                                                                                                                                                                                                       
Alice       F      13     56.5       84.0                                                                                                                                                                                                                       
Robert      M      12     64.8      128.0                                                                                                                                                                                                                       
John        M      12     59.0       99.5                                                                                                                                                                                                                       
Ronald      M      15     67.0      133.0                                                                                                                                                                                                                       
James       M      12     57.3       83.0                                                                                                                                                                                                                       
Mary        F      15     66.5      112.0                                                                                                                                                                                                                       
Louise      F      12     56.3       77.0                                                                                                                                                                                                                       
Judy        F      14     64.3       90.0

Now, you can do it much simpler without involving the MD5 function at all:

data want ;                                                                                                                                                                                                                                                     
  if _n_ = 1 then do ;                                                                                                                                                                                                                                          
    dcl hash h (dataset:"have(obs=0)") ;                                                                                                                                                                                                                        
    h.definekey (all:"y") ;                                                                                                                                                                                                                                     
    h.definedone () ;                                                                                                                                                                                                                                           
  end ;                                                                                                                                                                                                                                                         
  set have ;                                                                                                                                                                                                                                                    
  if h.check() ne 0 ;                                                                                                                                                                                                                                           
  h.add() ;                                                                                                                                                                                                                                                     
run ;

However, if you have a very long record, the length of the composite key defined to the hash object can result in a very long hash item and overwhelm your memory resources. Using MD5 ensures that the hash item length will never exceed 64 bytes regardless of how many variables you have in HAVE and/or how long they are. The only caveat is that they may fail to fit into the CATX $200 buffer, but in this case you can recode as:

data want (drop = ___:) ;                                                                                                                                                                                                                                       
  if _n_ = 1 then do ;                                                                                                                                                                                                                                          
    dcl hash h () ;                                                                                                                                                                                                                                             
    h.definekey ("___m") ;                                                                                                                                                                                                                                      
    h.definedone () ;                                                                                                                                                                                                                                           
    s = h.item_size ;                                                                                                                                                                                                                                           
    put s= ;                                                                                                                                                                                                                                                    
  end ;                                                                                                                                                                                                                                                         
  set have ;                                                                                                                                                                                                                                                    
  array nn _numeric_ ;                                                                                                                                                                                                                                          
  array cc _char_ ;                                                                                                                                                                                                                                             
  ___c = put (catx (":", of nn[*], of cc[*]), $32767.) ;                                                                                                                                                                                                        
  ___m = put (md5 (___c), $16.) ;                                                                                                                                                                                                                               
  if h.check() ne 0 ;                                                                                                                                                                                                                                           
  h.add() ;                                                                                                                                                                                                                                                     
run ;

Kind regards

Paul D.

What is the best way to identify identical duplicate records in large dataset

Re: What is the best way to identify identical duplicate records in large dataset

Re: What is the best way to identify identical duplicate records in large dataset

Re: What is the best way to identify identical duplicate records in large dataset

Re: What is the best way to identify identical duplicate records in large dataset

Re: What is the best way to identify identical duplicate records in large dataset

Re: What is the best way to identify identical duplicate records in large dataset

Classroom Training Available!