lydiawawa
Lapis Lazuli | Level 10

Hi,

I would like to remove repeated chunks of records, such as in the following (the /*remove*/ markers just flag the chunk to drop; they are not part of the data):

data have;
input name $    identifier ;
cards;
mary     1
mary     2
mary     2
mary     4
mary     5
mary     7
mary     6
adam     2
adam     3
adam     3
adam     7
/*remove*/
mary     1
mary     2
mary     2
mary     4
mary     5
mary     7
mary     6
/*remove*/
adam     8
mary     1
mary     2
mary     3
mary     4
mary     5
mary     7
mary     6
adam    9
mary     1
mary     2
mary     3
;

After removing the repeated chunk of Mary's records and the duplicates, I should have the sorted records:

 

mary     1
mary     2
mary     4
mary     5
mary     6
mary     7
adam    2 
adam    3 
adam    7
adam    8
mary     1
mary     2
mary     3
mary     4
mary     5
mary     6
mary     7
adam    9
mary     1
mary     2
mary     3

 

The second chunk of Mary's records has the same values in the same order as the first chunk, so it should be removed. The original data has billions of records. Could this be done with a hash object, to avoid sorting the raw dataset? If not, a data step would also suffice.

 

A row counter (like _N_) would have to be created to maintain the original record order, I believe.
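Something like the sketch below is what I have in mind for numbering the rows (just an illustration; the dataset and variable names are made up):

data have_numbered;
  set have;
  row_seq + 1;   /* sum statement: retained, incrementing row counter */
run;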

 

Thank you!

 

 


14 REPLIES
Patrick
Opal | Level 21

A hash is what I would be going for, but with billions of records the question is whether you've got sufficient memory.

If there isn't sufficient memory, then I can't think of a way that doesn't require sorting.

 

To estimate the memory requirements:

17 MB + (number of rows with distinct keys) * (sum of the lengths of the variables in the hash) * 1.1 (just adding 10%)
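A back-of-the-envelope sketch of that estimate (the distinct-key count and variable lengths below are made-up placeholders, not measurements):

data _null_;
  distinct_keys = 1e8;   /* assumed number of distinct keys       */
  var_lengths   = 16;    /* assumed sum of hash variable lengths  */
  est_gb = (17*1024**2 + distinct_keys * var_lengths * 1.1) / 1024**3;
  put 'Estimated hash memory: ' est_gb 8.2 ' GB';
run;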

 

@hashman 

How would one have to estimate hash memory requirements properly?

lydiawawa
Lapis Lazuli | Level 10

Thank you for the suggestion. When you referred to the sum of the lengths of the variables, do you mean the lengths of the variable names? For example, would variables x and y have a total length of 2?

I think sorting may work in this scenario if I only select certain variables. Based on experience, the wider the dataset, the slower the sorting.

 

@hashman 

Yes, hashman has saved me on many occasions lol...

Patrick
Opal | Level 21

Not the names, but the actual variable lengths, which you could get via PROC CONTENTS.
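For instance, something along these lines would total the variable lengths (a sketch, assuming the table is WORK.HAVE as in the examples below):

proc contents data=have out=_lengths(keep=name length) noprint;   /* one row per variable, with its storage length */
run;

proc sql;   /* add up the lengths of the variables you plan to define to the hash */
  select sum(length) as total_length
  from _lengths;
quit;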

 

Yes, any I/O-heavy operation will speed up when you reduce the data volume.

You could use something like below. Unfortunately, it still requires several passes through the data, plus sorting.

data have;
  input name $ identifier;
  cards;
mary 1
mary 2
mary 2
mary 4
mary 5
mary 7
mary 6
adam 2
adam 3
adam 3
adam 7
mary 1
mary 2
mary 2
mary 4
mary 5
mary 7
mary 6
;

data inter;
  set have;
  rowid+1;   /* remember the original row order */
run;

proc sort data=inter out=inter nodupkey equals;
  by name identifier;   /* EQUALS keeps the first occurrence of each key in original order */
run;
proc sort data=inter out=want;
  by rowid;   /* restore the original order */
run;

proc print data=want;
run;

 

lydiawawa
Lapis Lazuli | Level 10

Sorry Patrick, I was not very clear about what I want the output to be, as @Tom pointed out. I have revised the original post.

Patrick
Opal | Level 21

@lydiawawa 

Here is an approach using your revised data. Whether it's going to work for you will depend on the available memory and how many distinct groups you have in your actual data.

data have;
  input name $ identifier;
  cards;
mary 1
mary 2
mary 2
mary 4
mary 5
mary 7
mary 6
adam 2
adam 3
adam 3
adam 7
mary 1
mary 2
mary 2
mary 4
mary 5
mary 7
mary 6
adam 8
mary 1
mary 2
mary 3
mary 4
mary 5
mary 7
mary 6
adam 9
mary 1
mary 2
mary 3
;

/* create table keeplist with group_id's to be kept */
data _null_;
  if _n_=1 then
    do;
      length digest $16 group_id 8;
      dcl hash h1();
      h1.defineKey('digest');
      h1.defineData('group_id');
      h1.defineDone();
    end;

  set have(keep=name identifier) end=last;
  by name notsorted;
  retain digest;
  if first.name then 
    do;
      group_id+1;
      call missing(digest);
    end;
  /* chain the MD5 so the digest fingerprints the whole chunk: its values and their order */
  digest=md5(catx('|',digest,name,identifier));
  /* REF adds the digest only if it is not yet in the table, so keeplist keeps the first group_id per distinct chunk */
  if last.name then h1.ref();

  if last then h1.output(dataset:'keeplist');
run;

/* data want: 
    - keep only first identical group within a name
        - repeated identifier values relevant
        - sort order relevant
    - remove dups within group
*/
data want;
  if _n_=1 then
    do;
      length group_id 8;
      dcl hash h1(dataset:'keeplist');
      h1.defineKey('group_id');
      h1.defineDone();
    end;

  set have(keep=name identifier);
  by notsorted name identifier;
  if first.name then group_id+1;         /* same group numbering as in the step above */
  if not first.identifier then delete;   /* drop adjacent duplicate identifiers within the chunk */
  if h1.check()=0 then output;           /* keep only groups whose group_id is in keeplist */
run;

proc print data=want;
run;
lydiawawa
Lapis Lazuli | Level 10
Hash is so much faster than any of the other methods. This is a life saver!
hashman
Ammonite | Level 13

@Patrick:

By defining the hash variables and their PDV host variables, looking at the hash object's ITEM_SIZE attribute, and then multiplying the latter by the maximum number of hash items you reckon you will end up with. To determine the item size in the case of @lydiawawa's keys:

data _null_ ;
  dcl hash h () ;
  h.definekey ("name", "id") ;
  h.definedata ("_dummy") ;
  h.definedone () ;
  _dummy = "" ;
  size = h.item_size ;
  put size= ;
  stop ;
  set have ;   /* never executed: compile-time only, to give NAME and ID their attributes */
run ;

@ChrisNZ has written a fantastic, comprehensive macro routine based on ITEM_SIZE that tells all kinds of things about the hash object's appetite for memory, based on the size and number of variables defined to the key and data portions. He might be willing to share it with you if you ask nicely in private ;).

 

Kind regards

Paul D.   

ChrisNZ
Tourmaline | Level 20

@hashman Here is the macro (I added more information for 32-bit platforms, compared to the version you have).

Anything there that diverges from your experience of SAS hash tables?

hashman
Ammonite | Level 13

@ChrisNZ: Have read through it and detected nothing that would deviate from my knowledge. Thanks! 

Tom
Super User

Are you just looking to get the distinct observations?  Why not just sort the dataset?

proc sort data=have out=want nodupkey;
  by name identifier;
run;

The sort procedure has been highly optimized to work as fast as possible.

 

Or do you really want to treat varying-length groups of observations as the unit to be compared, so that MARY+2,MARY+3,MARY+5 is different from MARY+2,MARY+3,MARY+4?

lydiawawa
Lapis Lazuli | Level 10

Thank you for helping me explain the situation better. Yes, I would like to take the length of the grouped observations into account in this scenario. As in your example, those two chunks should not be considered duplicates. I will edit my original post.

hashman
Ammonite | Level 13

@lydiawawa: A short answer is: Depends on the data.

A longer one:

If your input file is sorted with respect to the first occurrences of the key values, just dedup it by hashing. The fewer distinct key values you have in the file, the less memory-taxing it will be, because by the end of the process the hash table will contain every distinct key value.

If not, and the dupes are numerous but the distinct keys are not, first dedup by hashing and then sort the result, in that sequence. This way, you neither overtax the memory nor end up sorting a long file.

Example:

data have ;                                                                                                                             
  input name $ id ;                                                                                                                     
cards;                                                                                                                                  
mary   1                                                                                                                                
mary   2                                                                                                                                
mary   2                                                                                                                                
mary   4                                                                                                                                
mary   5                                                                                                                                
mary   7                                                                                                                                
mary   6                                                                                                                                
adam   2                                                                                                                                
adam   3                                                                                                                                
adam   3                                                                                                                                
adam   7                                                                                                                                
mary   1                                                                                                                                
mary   2                                                                                                                                
mary   2                                                                                                                                
mary   4                                                                                                                                
mary   5                                                                                                                                
mary   7                                                                                                                                
mary   6                                                                                                                                
;                                                                                                                                       
run ;                                                                                                                                   
                                                                                                                                        
data want (drop = _dummy) ;
  if _n_ = 1 then do ;
    dcl hash h () ;
    h.definekey ("name", "id") ;
    h.definedata ("_dummy") ;
    h.definedone () ;
    _dummy = "" ;
  end ;
  set have ;
  if h.check() ne 0 ;   /* subsetting IF: keep only key values not yet seen */
  h.add() ;             /* register the key so later duplicates are dropped */
run ;
                                                                                                                                        
proc sort ;                                                                                                                             
  by name id ;                                                                                                                          
run ;                  

The $1 _dummy is placed in the data portion because (a) you can't have a hash table without the data portion (at least one hash variable must be there) and (b) $1 is the shortest you can do. If you omit DEFINEDATA, the variables defined in DEFINEKEY will be auto-placed in the data portion - in this case, with the summary length of 16 bytes per item.

 

The $1 _dummy trick is used with an eye on the situation where your composite key is rather long. Otherwise, it may not win you any memory prizes. For example, in your particular situation, omitting DEFINEDATA will result in a total system length of the hash variables of 8*4=32 bytes, which on a 64-bit platform results in a hash item length of 64 bytes - and it is 64 bytes for any total key+data portion length of 2-32 bytes (it cannot be 1, as one key at $1 and one data at $1 are already 2 bytes). One byte more, and the item length becomes 80.
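To see those item lengths on your own platform, you can compare ITEM_SIZE with and without the $1 _dummy (just a sketch; the exact figures depend on the platform's bitness):

data _null_ ;
  /* keyed on NAME and ID with no DEFINEDATA: the key variables are auto-copied into the data portion */
  dcl hash h1 () ;
  h1.definekey ("name", "id") ;
  h1.definedone () ;
  /* keyed on NAME and ID with a $1 _dummy as the only data variable */
  dcl hash h2 () ;
  h2.definekey ("name", "id") ;
  h2.definedata ("_dummy") ;
  h2.definedone () ;
  _dummy = "" ;
  size1 = h1.item_size ;
  size2 = h2.item_size ;
  put size1= size2= ;
  stop ;
  set have ;   /* compile-time only: supplies NAME and ID */
run ;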

 

If you have a composite key exceeding 32 bytes in total, you can, regardless of its total length, reduce the key portion to 16 bytes by MD5-ing the key. The step below won't result in any memory footprint reduction since your composite key is only 16 bytes to start with; it just shows the principle.

data want (drop = _:) ;                                                                                                                 
  if _n_ = 1 then do ;                                                                                                                  
    dcl hash h () ;                                                                                                                     
    h.definekey ("_md5") ;                                                                                                              
    h.definedata ("_dummy") ;                                                                                                           
    h.definedone () ;                                                                                                                   
    _dummy = "" ;                                                                                                                       
  end ;                                                                                                                                 
  set have ;                                                                                                                            
  _md5 = put (md5 (catx (":", name, id)), $16.) ;                                                                                       
  if h.check() ne 0 ;                                                                                                                   
  h.add() ;                                                                                                                             
run ;          

Kind regards

Paul D. 

lydiawawa
Lapis Lazuli | Level 10
Hi @hashman, I made a correction to my original post. I did not emphasize that the length of the chunk is also part of the de-duplication consideration, as pointed out by @Tom. Thank you for the suggestion; I think it is a good workflow that I should adopt for a large dataset.
lydiawawa
Lapis Lazuli | Level 10
I hope there is an option to accept two answers. Thank you so much for providing the direction. This is my first time dealing with a national-level large dataset, and it has been a struggle so far.

