- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
Hi everyone. This is my first post, I will try my best to be as descriptive as possible, if anything is not clear please let me know!
The problem is as follows: I need to cycle through a number of datasets and find the number of duplicates in each of them, given a combination of variable which should provide a unique key. These datasets are available as SAS Views and not actual tables.
The looping part is laid out and works, meaning I can loop through all the tables and select the correct set of keys for each with some macro variables. Some of this tables are big, though (in the tens of millions of records and hundreds of column) so that the PROC SQL that currently counts the number of duplicates works like this:
data have /view=have;
input key1 key2 $ data;
datalines;
1 A 93489
1 A 93849
2 B 98798
2 B 78788
3 A 39489
3 B 93112
1 B 90039
2 A 45789
;
run;
proc sql noprint;
create table dups as select key1, key2, count(*) as num
from have
group by key1, key2
having num > 1;
quit;
It works fine, but for some big tables it's pretty slow. I therefore wonder if there is a better way, also given the marvelous things i read about hash tables being very efficient for merging and sorting. Do you think hash tables are the right way to go to find a more efficient implementation? Can you point me in the right direction, if this is the case, with some reference to a paper?
Thanks!
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
What version of SAS do you have?
PROC SORT itself has a bunch of duplicate record options that may allow you to extract records.
I'm assuming none of your tables are indexed?
/*This example demonstrates how to get a list of duplicates based on the key variables.
KEY variables are placed in the BY statement.
Any 'duplicates' according to the KEY variables are kept in the OUT= dataset
*/
*Generate sample data with duplicate KEY;
data class;
set sashelp.class sashelp.class(obs=5);
run;
*Use the NOUNIQUEKEYS to get rid of unique records and keep duplicates;
proc sort data=class nouniquekeys out=duplicates;
by name;
run;
If you're interested in hash methods, I recommend searching the specific topic of interest at lexjansen.com. It's a repository of SAS papers that are user written over time and go back pretty far. It has a lot of detailed examples and some may be very close to what you want. A good intro paper on hashes with worked examples is this one:
http://support.sas.com/resources/papers/proceedings11/168-2011.pdf
And one that covers hash tables and duplicates:
http://support.sas.com/resources/papers/proceedings16/10200-2016.pdf
Hope that points you in the correct direction.
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
Look at views as encapsulated code which only gets executed once you access the view. This means that the runtimes you see are not only for your SQL but also for the view which gets executed as part of the SQL where you use the view.
It's possible that a significant part of the time you observe gets consumed by executing the view.
If you want to find out how much time the view consumes then implement something like:
data _null_;
set myview;
run;
As for your question: SAS hashes are great when you need to implement lookup tables as it allows you to load the lookup table into memory and then lookup values without the need to sort the potentially big base table.
To detect duplicates you will need to sort the data. SAS hashes allow you to sort in-memory but you still need to first load into the hash and then unload from the hash (=full read/write operation) which are single threaded processes.
A SAS sort on the other hand can execute multi-threaded. Both Proc SQL and Proc SORT use in the background the same sort mechanism.
I've been curious how runtimes for multiple coding options actually look like and it appears that your current coding version seems to be pretty good.
Here what I've done plus the log with the runtimes.
options fullstimer msglevel=i;
data have;
do key1=1000 to 1 by -1;
do key2=10000 to 1 by -1;
output;
if ceil(ranuni(1)*100000)=1 then output;
end;
end;
stop;
run;
/** option 1: single data step using hash */
data want_opt1(keep=key1 key2);
if 0 then set have(keep=key1 key2);
dcl hash h1(multidata:'y', ordered:'y', hashexp:10, dataset:'have(keep=key1 key2)');
h1.defineKey('key1','key2');
h1.defineData('key1','key2');
h1.defineDone();
dcl hiter hh1('h1');
/* Iterate through the hash object and output data values */
length r 8;
rc = hh1.first();
do while (rc = 0);
h1.check();
h1.has_next(result:r);
if r>0 then output;
rc = hh1.next();
end;
stop;
run;
/** option 2: proc sort and data step **/
proc sort data=have(keep=key1 key2) out=inter;
by key1 key2;
run;
data want_opt2;
set inter;
by key1 key2;
if not (first.key2 and last.key2);
run;
/** option 3: Proc SQL with Group By and Having **/
proc sql noprint;
create table want_opt3 as
select key1, key2, count(*) as num
from have
group by key1, key2
having num > 1;
quit;
/** option 4: Proc SQL sorted view and data step **/
proc sql noprint;
create view inter2 as
select key1, key2
from have
order by key1, key2
;
quit;
data want_opt4;
set inter2;
by key1 key2;
if not (first.key2 and last.key2);
run;
/** option 5: Proc Sort only **/
proc sort data=have(keep=key1 key2) out=want_opt5 nounikeys threads noequals;
by key1 key2;
run;
/** option 6: Sort in hash, ds to detemine duplicates **/
data _null_;
if 0 then set have(keep=key1 key2);
dcl hash h1(multidata:'y', ordered:'y', dataset:'have(keep=key1 key2)');
h1.defineKey('key1','key2');
h1.defineData('key1','key2');
h1.defineDone();
h1.output(dataset:'inter');
stop;
run;
data want_opt6;
set inter;
by key1 key2;
if not (first.key2 and last.key2);
run;
/** clean-up **/
proc datasets lib=work nolist nowarn memtype=(data view);
delete inter:;
run;
quit;
27 options fullstimer msglevel=i; 28 data have; 29 do key1=1000 to 1 by -1; 30 do key2=10000 to 1 by -1; 31 output; 32 if ceil(ranuni(1)*100000)=1 then output; 33 end; 34 end; 35 stop; 36 run; NOTE: The data set WORK.HAVE has 10000111 observations and 2 variables. NOTE: DATA statement used (Total process time): real time 0.58 seconds user cpu time 0.46 seconds system cpu time 0.11 seconds memory 754.90k OS Memory 21596.00k Timestamp 03/09/2017 10:37:34 AM Step Count 127 Switch Count 54 37 38 /** option 1: single data step using hash */ 39 data want_opt1(keep=key1 key2); 40 if 0 then set have(keep=key1 key2); 41 dcl hash h1(multidata:'y', ordered:'y', hashexp:10, dataset:'have(keep=key1 key2)'); 42 h1.defineKey('key1','key2'); 43 h1.defineData('key1','key2'); 44 h1.defineDone(); 45 dcl hiter hh1('h1'); 2 The SAS System 08:59 Sunday, September 3, 2017 46 47 /* Iterate through the hash object and output data values */ 48 length r 8; 49 rc = hh1.first(); 50 do while (rc = 0); 51 h1.check(); 52 h1.has_next(result:r); 53 if r>0 then output; 54 rc = hh1.next(); 55 end; 56 stop; 57 run; NOTE: There were 10000111 observations read from the data set WORK.HAVE. NOTE: The data set WORK.WANT_OPT1 has 222 observations and 2 variables. NOTE: DATA statement used (Total process time): real time 12.56 seconds user cpu time 12.01 seconds system cpu time 0.49 seconds memory 1219839.42k OS Memory 1239956.00k Timestamp 03/09/2017 10:37:47 AM Step Count 128 Switch Count 108 58 59 /** option 2: proc sort and data step **/ 60 proc sort data=have(keep=key1 key2) out=inter; 61 by key1 key2; 62 run; NOTE: There were 10000111 observations read from the data set WORK.HAVE. NOTE: SAS threaded sort was used. NOTE: The data set WORK.INTER has 10000111 observations and 2 variables. NOTE: PROCEDURE SORT used (Total process time): real time 1.31 seconds user cpu time 3.18 seconds system cpu time 0.42 seconds memory 632457.53k OS Memory 653232.00k Timestamp 03/09/2017 10:37:48 AM Step Count 129 Switch Count 34 63 64 data want_opt2; 65 set inter; 66 by key1 key2; 67 if not (first.key2 and last.key2); 68 run; NOTE: There were 10000111 observations read from the data set WORK.INTER. NOTE: The data set WORK.WANT_OPT2 has 222 observations and 2 variables. NOTE: DATA statement used (Total process time): real time 1.03 seconds user cpu time 1.01 seconds system cpu time 0.01 seconds memory 967.40k 3 The SAS System 08:59 Sunday, September 3, 2017 OS Memory 21596.00k Timestamp 03/09/2017 10:37:49 AM Step Count 130 Switch Count 44 69 70 /** option 3: Proc SQL with Group By and Having **/ 71 proc sql noprint; 72 create table want_opt3 as 73 select key1, key2, count(*) as num 74 from have 75 group by key1, key2 76 having num > 1; NOTE: SAS threaded sort was used. NOTE: Table WORK.WANT_OPT3 created, with 111 rows and 3 columns. 77 quit; NOTE: PROCEDURE SQL used (Total process time): real time 2.55 seconds user cpu time 4.68 seconds system cpu time 0.29 seconds memory 484576.03k OS Memory 504632.00k Timestamp 03/09/2017 10:37:52 AM Step Count 131 Switch Count 48 78 79 80 /** option 4: Proc SQL sorted view and data step **/ 81 proc sql noprint; 82 create view inter2 as 83 select key1, key2 84 from have 85 order by key1, key2 86 ; NOTE: SQL view WORK.INTER2 has been defined. 87 quit; NOTE: PROCEDURE SQL used (Total process time): real time 0.00 seconds user cpu time 0.00 seconds system cpu time 0.00 seconds memory 49.59k OS Memory 21852.00k Timestamp 03/09/2017 10:37:52 AM Step Count 132 Switch Count 44 88 89 data want_opt4; 90 set inter2; 91 by key1 key2; 92 if not (first.key2 and last.key2); 93 run; NOTE: SAS threaded sort was used. NOTE: There were 10000111 observations read from the data set WORK.HAVE. NOTE: There were 10000111 observations read from the data set WORK.INTER2. 4 The SAS System 08:59 Sunday, September 3, 2017 NOTE: The data set WORK.WANT_OPT4 has 222 observations and 2 variables. NOTE: DATA statement used (Total process time): real time 3.06 seconds user cpu time 4.64 seconds system cpu time 0.49 seconds memory 474563.07k OS Memory 495032.00k Timestamp 03/09/2017 10:37:55 AM Step Count 133 Switch Count 48 94 95 /** option 5: Proc Sort only **/ 96 proc sort data=have(keep=key1 key2) out=want_opt5 nounikeys threads noequals; 97 by key1 key2; 98 run; NOTE: SAS threaded sort was used. NOTE: 9999889 observations with unique key values were deleted. NOTE: The data set WORK.WANT_OPT5 has 222 observations and 2 variables. NOTE: PROCEDURE SORT used (Total process time): real time 1.41 seconds user cpu time 3.21 seconds system cpu time 0.29 seconds memory 475164.96k OS Memory 496852.00k Timestamp 03/09/2017 10:37:56 AM Step Count 134 Switch Count 35 99 100 101 /** option 6: Sort in hash, ds to detemine duplicates **/ 102 data _null_; 103 if 0 then set have(keep=key1 key2); 104 dcl hash h1(multidata:'y', ordered:'y', dataset:'have(keep=key1 key2)'); 105 h1.defineKey('key1','key2'); 106 h1.defineData('key1','key2'); 107 h1.defineDone(); 108 109 h1.output(dataset:'inter'); 110 stop; 111 run; NOTE: There were 10000111 observations read from the data set WORK.HAVE. NOTE: The data set WORK.INTER has 10000111 observations and 2 variables. NOTE: DATA statement used (Total process time): real time 6.14 seconds user cpu time 5.64 seconds system cpu time 0.48 seconds memory 1219705.82k OS Memory 1240980.00k Timestamp 03/09/2017 10:38:02 AM Step Count 135 Switch Count 68 112 113 data want_opt6; 5 The SAS System 08:59 Sunday, September 3, 2017 114 set inter; 115 by key1 key2; 116 if not (first.key2 and last.key2); 117 run; NOTE: There were 10000111 observations read from the data set WORK.INTER. NOTE: The data set WORK.WANT_OPT6 has 222 observations and 2 variables. NOTE: DATA statement used (Total process time): real time 1.19 seconds user cpu time 1.12 seconds system cpu time 0.01 seconds memory 1076.18k OS Memory 22620.00k Timestamp 03/09/2017 10:38:04 AM Step Count 136 Switch Count 44 118 119 120 /** clean-up **/ 121 proc datasets lib=work nolist nowarn memtype=(data view); 122 delete inter:; 123 run; NOTE: Deleting WORK.INTER (memtype=DATA). NOTE: Deleting WORK.INTER2 (memtype=VIEW). 124 quit;
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
What is the length of KEY2, the string Variable and the number of distinct KEY2? These two will determine another fastest method.
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
I am running sas 9.3. Did not know about PROC SORT NOUNIKEYS, looks promising, thanks
It depends on the table, in general the keys could be more than 2 and their length and number of unique values varies. Thank you all, anyway! I will test the PROC SORT option as soon as I get access to the data again, together with the THREADS options. Is there any way to ascertain wether the PROCs are actually running multi-threaded? I don't remember seeing any info about threads in my sas log...
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
I noticed a wierd behaviour, looking at my output but also at the attempt reported above: the count of duplicate observations different between the proc SQL (111) and the PROC SORT - DATA step based methods (222). The difference is due to the fact that in the first case we are counting the number of duplicate keys and in the second the number of duplicate observations. To get the same number, we should then perform a sum of all the counts in a separate query. I thought I should note it here for posterity...
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
A single PROC SORT step seems a bit faster than your SQL solution:
proc sort data=have(keep=key1 key2) out=_null_ dupout=dups nodupkey; by key1 key2; run;
After which WORK.DUPS will hold at least one of each duplicate key pair (the first value is written to the OUT= data set, which we drop in this case).
If you want the count as in your SQL code, you can add one to the number of records per key:
data want; do count=2 by 1 until(last.key2); set dups; by key1 key2; end; run;