Hi everyone. This is my first post; I will try my best to be as descriptive as possible, so if anything is not clear please let me know!

The problem is as follows: I need to cycle through a number of datasets and find the number of duplicates in each of them, given a combination of variables that should provide a unique key. These datasets are available as SAS views, not actual tables. The looping part is laid out and works, meaning I can loop through all the tables and select the correct set of keys for each with some macro variables. Some of these tables are big, though (tens of millions of records and hundreds of columns), so the PROC SQL step that currently counts the duplicates matters. It works like this:

data have /view=have;
input key1 key2 $ data;
datalines;
1 A 93489
1 A 93849
2 B 98798
2 B 78788
3 A 39489
3 B 93112
1 B 90039
2 A 45789
;
run;
proc sql noprint;
create table dups as select key1, key2, count(*) as num
from have
group by key1, key2
having num > 1;
quit;

It works fine, but for some big tables it is pretty slow. I therefore wonder if there is a better way, also given the marvelous things I have read about hash tables being very efficient for merging and sorting. Do you think hash tables are the right way to go for a more efficient implementation? If so, can you point me in the right direction, perhaps with a reference to a paper? Thanks!
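For reference, here is roughly what a hash-object version of the duplicate count could look like. This is only a sketch, not tested against real data: it assumes the key variables are named key1 and key2 as in the example above, and it reads the view once, tallying each key combination in a hash object, then walks the hash and outputs only the keys that occur more than once. Memory use grows with the number of distinct keys rather than the number of rows, which is what can make this faster than a GROUP BY on a wide view.

```sas
data dups(keep=key1 key2 num);
  length num 8;
  /* one hash entry per distinct key combination */
  declare hash h();
  h.defineKey('key1', 'key2');
  h.defineData('key1', 'key2', 'num');
  h.defineDone();

  /* single pass over the view: bump the count for each key pair */
  do until (eof);
    set have end=eof;
    if h.find() = 0 then num = num + 1;  /* key seen before */
    else num = 1;                        /* first occurrence */
    h.replace();
  end;

  /* second phase: iterate the hash, keep only duplicated keys */
  declare hiter it('h');
  rc = it.first();
  do while (rc = 0);
    if num > 1 then output;
    rc = it.next();
  end;
  stop;
run;
```

Two caveats if you try something like this: the output order from the hash iterator is undefined (add an ordered:'yes' argument to the hash declaration, or sort afterwards, if order matters), and with hundreds of columns it helps to narrow the incoming row with a KEEP= option on the SET statement so only the key variables are moved through the PDV.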