Fastest way to count number of duplicates

lambdavu · Posted 09-02-2017 03:37 PM

Hi everyone. This is my first post, I will try my best to be as descriptive as possible, if anything is not clear please let me know!

The problem is as follows: I need to cycle through a number of datasets and find the number of duplicates in each of them, given a combination of variable which should provide a unique key. These datasets are available as SAS Views and not actual tables.

The looping part is laid out and works, meaning I can loop through all the tables and select the correct set of keys for each with some macro variables. Some of this tables are big, though (in the tens of millions of records and hundreds of column) so that the PROC SQL that currently counts the number of duplicates works like this:

data have /view=have;
input key1 key2 $ data;
datalines;
1 A 93489
1 A 93849
2 B 98798
2 B 78788
3 A 39489
3 B 93112
1 B 90039
2 A 45789
;
run;

proc sql noprint;
   create table dups as
   select key1, key2, count(*) as num
   from have
   group by key1, key2
   having num > 1;
quit;

It works fine, but for some big tables it's pretty slow. I therefore wonder if there is a better way, also given the marvelous things i read about hash tables being very efficient for merging and sorting. Do you think hash tables are the right way to go to find a more efficient implementation? Can you point me in the right direction, if this is the case, with some reference to a paper?

Thanks!

Reeza · Posted 09-02-2017 04:00 PM

What version of SAS do you have?

PROC SORT itself has a bunch of duplicate record options that may allow you to extract records.

I'm assuming none of your tables are indexed?

/*This example demonstrates how to get a list of duplicates based on the key variables.
KEY variables are placed in the BY statement. 
Any 'duplicates' according to the KEY variables are kept in the OUT= dataset
*/

*Generate sample data with duplicate KEY;
data class;
    set sashelp.class sashelp.class(obs=5);
run;

*Use the NOUNIQUEKEYS to get rid of unique records and keep duplicates;
proc sort data=class nouniquekeys out=duplicates;
    by name;
run;

If you're interested in hash methods, I recommend searching the specific topic of interest at lexjansen.com. It's a repository of SAS papers that are user written over time and go back pretty far. It has a lot of detailed examples and some may be very close to what you want. A good intro paper on hashes with worked examples is this one:

http://support.sas.com/resources/papers/proceedings11/168-2011.pdf

And one that covers hash tables and duplicates:

http://support.sas.com/resources/papers/proceedings16/10200-2016.pdf

Hope that points you in the correct direction.

Patrick · Posted 09-02-2017 08:39 PM

@lambdavu

Look at views as encapsulated code which only gets executed once you access the view. This means that the runtimes you see are not only for your SQL but also for the view which gets executed as part of the SQL where you use the view.

It's possible that a significant part of the time you observe gets consumed by executing the view.

If you want to find out how much time the view consumes then implement something like:

data _null_;
  set myview;
run;

As for your question: SAS hashes are great when you need to implement lookup tables as it allows you to load the lookup table into memory and then lookup values without the need to sort the potentially big base table.

To detect duplicates you will need to sort the data. SAS hashes allow you to sort in-memory but you still need to first load into the hash and then unload from the hash (=full read/write operation) which are single threaded processes.

A SAS sort on the other hand can execute multi-threaded. Both Proc SQL and Proc SORT use in the background the same sort mechanism.

I've been curious how runtimes for multiple coding options actually look like and it appears that your current coding version seems to be pretty good.

Here what I've done plus the log with the runtimes.

options fullstimer msglevel=i;
data have;
  do key1=1000 to 1 by -1;
    do key2=10000 to 1 by -1;
      output;
      if ceil(ranuni(1)*100000)=1 then output;
    end;
  end;
  stop;
run;

/** option 1: single data step using hash */
data want_opt1(keep=key1 key2);
  if 0 then set have(keep=key1 key2);
  dcl hash h1(multidata:'y', ordered:'y', hashexp:10, dataset:'have(keep=key1 key2)');
  h1.defineKey('key1','key2');
  h1.defineData('key1','key2');
  h1.defineDone();
  dcl hiter hh1('h1');

   /* Iterate through the hash object and output data values */
  length r 8;
  rc = hh1.first();
  do while (rc = 0);
    h1.check();
    h1.has_next(result:r);
    if r>0 then output;
    rc = hh1.next();
  end;
  stop;
run;

/** option 2: proc sort and data step **/
proc sort data=have(keep=key1 key2) out=inter;
  by key1 key2;
run;

data want_opt2;
  set inter;
  by key1 key2;
  if not (first.key2 and last.key2);
run;

/** option 3: Proc SQL with Group By and Having **/
proc sql noprint;
   create table want_opt3 as   
   select key1, key2, count(*) as num
   from have
   group by key1, key2
   having num > 1;
quit;


/** option 4: Proc SQL sorted view and data step **/
proc sql noprint;
   create view inter2 as   
   select key1, key2
   from have
   order by key1, key2
   ;
quit;

data want_opt4;
  set inter2;
  by key1 key2;
  if not (first.key2 and last.key2);
run;

/** option 5: Proc Sort only **/
proc sort data=have(keep=key1 key2) out=want_opt5 nounikeys threads noequals;
  by key1 key2;
run;


/** option 6: Sort in hash, ds to detemine duplicates **/
data _null_;
  if 0 then set have(keep=key1 key2);
  dcl hash h1(multidata:'y', ordered:'y', dataset:'have(keep=key1 key2)');
  h1.defineKey('key1','key2');
  h1.defineData('key1','key2');
  h1.defineDone();

  h1.output(dataset:'inter');
  stop;
run;

data want_opt6;
  set inter;
  by key1 key2;
  if not (first.key2 and last.key2);
run;


/** clean-up **/
proc datasets lib=work nolist nowarn memtype=(data view);
  delete inter:;
  run;
quit;

27         options fullstimer msglevel=i;
28         data have;
29           do key1=1000 to 1 by -1;
30             do key2=10000 to 1 by -1;
31               output;
32               if ceil(ranuni(1)*100000)=1 then output;
33             end;
34           end;
35           stop;
36         run;

NOTE: The data set WORK.HAVE has 10000111 observations and 2 variables.
NOTE: DATA statement used (Total process time):
      real time           0.58 seconds
      user cpu time       0.46 seconds
      system cpu time     0.11 seconds
      memory              754.90k
      OS Memory           21596.00k
      Timestamp           03/09/2017 10:37:34 AM
      Step Count                        127  Switch Count  54
      

37         
38         /** option 1: single data step using hash */
39         data want_opt1(keep=key1 key2);
40           if 0 then set have(keep=key1 key2);
41           dcl hash h1(multidata:'y', ordered:'y', hashexp:10, dataset:'have(keep=key1 key2)');
42           h1.defineKey('key1','key2');
43           h1.defineData('key1','key2');
44           h1.defineDone();
45           dcl hiter hh1('h1');
2                                                          The SAS System                            08:59 Sunday, September 3, 2017

46         
47            /* Iterate through the hash object and output data values */
48           length r 8;
49           rc = hh1.first();
50           do while (rc = 0);
51             h1.check();
52             h1.has_next(result:r);
53             if r>0 then output;
54             rc = hh1.next();
55           end;
56           stop;
57         run;

NOTE: There were 10000111 observations read from the data set WORK.HAVE.
NOTE: The data set WORK.WANT_OPT1 has 222 observations and 2 variables.
NOTE: DATA statement used (Total process time):
      real time           12.56 seconds
      user cpu time       12.01 seconds
      system cpu time     0.49 seconds
      memory              1219839.42k
      OS Memory           1239956.00k
      Timestamp           03/09/2017 10:37:47 AM
      Step Count                        128  Switch Count  108
      

58         
59         /** option 2: proc sort and data step **/
60         proc sort data=have(keep=key1 key2) out=inter;
61           by key1 key2;
62         run;

NOTE: There were 10000111 observations read from the data set WORK.HAVE.
NOTE: SAS threaded sort was used.
NOTE: The data set WORK.INTER has 10000111 observations and 2 variables.
NOTE: PROCEDURE SORT used (Total process time):
      real time           1.31 seconds
      user cpu time       3.18 seconds
      system cpu time     0.42 seconds
      memory              632457.53k
      OS Memory           653232.00k
      Timestamp           03/09/2017 10:37:48 AM
      Step Count                        129  Switch Count  34
      

63         
64         data want_opt2;
65           set inter;
66           by key1 key2;
67           if not (first.key2 and last.key2);
68         run;

NOTE: There were 10000111 observations read from the data set WORK.INTER.
NOTE: The data set WORK.WANT_OPT2 has 222 observations and 2 variables.
NOTE: DATA statement used (Total process time):
      real time           1.03 seconds
      user cpu time       1.01 seconds
      system cpu time     0.01 seconds
      memory              967.40k
3                                                          The SAS System                            08:59 Sunday, September 3, 2017

      OS Memory           21596.00k
      Timestamp           03/09/2017 10:37:49 AM
      Step Count                        130  Switch Count  44
      

69         
70         /** option 3: Proc SQL with Group By and Having **/
71         proc sql noprint;
72            create table want_opt3 as
73            select key1, key2, count(*) as num
74            from have
75            group by key1, key2
76            having num > 1;
NOTE: SAS threaded sort was used.
NOTE: Table WORK.WANT_OPT3 created, with 111 rows and 3 columns.

77         quit;
NOTE: PROCEDURE SQL used (Total process time):
      real time           2.55 seconds
      user cpu time       4.68 seconds
      system cpu time     0.29 seconds
      memory              484576.03k
      OS Memory           504632.00k
      Timestamp           03/09/2017 10:37:52 AM
      Step Count                        131  Switch Count  48
      

78         
79         
80         /** option 4: Proc SQL sorted view and data step **/
81         proc sql noprint;
82            create view inter2 as
83            select key1, key2
84            from have
85            order by key1, key2
86            ;
NOTE: SQL view WORK.INTER2 has been defined.
87         quit;
NOTE: PROCEDURE SQL used (Total process time):
      real time           0.00 seconds
      user cpu time       0.00 seconds
      system cpu time     0.00 seconds
      memory              49.59k
      OS Memory           21852.00k
      Timestamp           03/09/2017 10:37:52 AM
      Step Count                        132  Switch Count  44
      

88         
89         data want_opt4;
90           set inter2;
91           by key1 key2;
92           if not (first.key2 and last.key2);
93         run;

NOTE: SAS threaded sort was used.
NOTE: There were 10000111 observations read from the data set WORK.HAVE.
NOTE: There were 10000111 observations read from the data set WORK.INTER2.
4                                                          The SAS System                            08:59 Sunday, September 3, 2017

NOTE: The data set WORK.WANT_OPT4 has 222 observations and 2 variables.
NOTE: DATA statement used (Total process time):
      real time           3.06 seconds
      user cpu time       4.64 seconds
      system cpu time     0.49 seconds
      memory              474563.07k
      OS Memory           495032.00k
      Timestamp           03/09/2017 10:37:55 AM
      Step Count                        133  Switch Count  48
      

94         
95         /** option 5: Proc Sort only **/
96         proc sort data=have(keep=key1 key2) out=want_opt5 nounikeys threads noequals;
97           by key1 key2;
98         run;

NOTE: SAS threaded sort was used.
NOTE: 9999889 observations with unique key values were deleted.
NOTE: The data set WORK.WANT_OPT5 has 222 observations and 2 variables.
NOTE: PROCEDURE SORT used (Total process time):
      real time           1.41 seconds
      user cpu time       3.21 seconds
      system cpu time     0.29 seconds
      memory              475164.96k
      OS Memory           496852.00k
      Timestamp           03/09/2017 10:37:56 AM
      Step Count                        134  Switch Count  35
      

99         
100        
101        /** option 6: Sort in hash, ds to detemine duplicates **/
102        data _null_;
103          if 0 then set have(keep=key1 key2);
104          dcl hash h1(multidata:'y', ordered:'y', dataset:'have(keep=key1 key2)');
105          h1.defineKey('key1','key2');
106          h1.defineData('key1','key2');
107          h1.defineDone();
108        
109          h1.output(dataset:'inter');
110          stop;
111        run;

NOTE: There were 10000111 observations read from the data set WORK.HAVE.
NOTE: The data set WORK.INTER has 10000111 observations and 2 variables.
NOTE: DATA statement used (Total process time):
      real time           6.14 seconds
      user cpu time       5.64 seconds
      system cpu time     0.48 seconds
      memory              1219705.82k
      OS Memory           1240980.00k
      Timestamp           03/09/2017 10:38:02 AM
      Step Count                        135  Switch Count  68
      

112        
113        data want_opt6;
5                                                          The SAS System                            08:59 Sunday, September 3, 2017

114          set inter;
115          by key1 key2;
116          if not (first.key2 and last.key2);
117        run;

NOTE: There were 10000111 observations read from the data set WORK.INTER.
NOTE: The data set WORK.WANT_OPT6 has 222 observations and 2 variables.
NOTE: DATA statement used (Total process time):
      real time           1.19 seconds
      user cpu time       1.12 seconds
      system cpu time     0.01 seconds
      memory              1076.18k
      OS Memory           22620.00k
      Timestamp           03/09/2017 10:38:04 AM
      Step Count                        136  Switch Count  44
      

118        
119        
120        /** clean-up **/
121        proc datasets lib=work nolist nowarn memtype=(data view);
122          delete inter:;
123          run;

NOTE: Deleting WORK.INTER (memtype=DATA).
NOTE: Deleting WORK.INTER2 (memtype=VIEW).
124        quit;

KachiM · Posted 09-02-2017 11:24 PM

What is the length of KEY2, the string Variable and the number of distinct KEY2? These two will determine another fastest method.

lambdavu · Posted 09-03-2017 01:15 AM

@Reza

I am running sas 9.3. Did not know about PROC SORT NOUNIKEYS, looks promising, thanks

@KachiM

It depends on the table, in general the keys could be more than 2 and their length and number of unique values varies. Thank you all, anyway! I will test the PROC SORT option as soon as I get access to the data again, together with the THREADS options. Is there any way to ascertain wether the PROCs are actually running multi-threaded? I don't remember seeing any info about threads in my sas log...

lambdavu · Posted 09-04-2017 08:31 AM

I noticed a wierd behaviour, looking at my output but also at the attempt reported above: the count of duplicate observations different between the proc SQL (111) and the PROC SORT - DATA step based methods (222). The difference is due to the fact that in the first case we are counting the number of duplicate keys and in the second the number of duplicate observations. To get the same number, we should then perform a sum of all the counts in a separate query. I thought I should note it here for posterity...

s_lassen · Posted 10-27-2017 07:38 AM

@lambdavu:

A single PROC SORT step seems a bit faster than your SQL solution:

proc sort data=have(keep=key1 key2) out=_null_ dupout=dups nodupkey;
  by key1 key2;
run;

After which WORK.DUPS will hold at least one of each duplicate key pair (the first value is written to the OUT= data set, which we drop in this case).

If you want the count as in your SQL code, you can add one to the number of records per key:

data want;
  do count=2 by 1 until(last.key2);
    set dups;
    by key1 key2;
    end;
run;

Fastest way to count number of duplicates

Re: Fastest way to count number of duplicates

Re: Fastest way to count number of duplicates

Re: Fastest way to count number of duplicates

Re: Fastest way to count number of duplicates

Re: Fastest way to count number of duplicates

Re: Fastest way to count number of duplicates

Classroom Training Available!