@Kurt_Bremser:
I'm not actually surprised.
1. As far as SORT goes, it may be a "reasonable number", but it's nowhere near the area where SORT starts hitting its usual bottlenecks. With 5 short variables, SORT will likely outperform anything else well past 10 million obs and then some. And I don't think it will even have to make use of multi-threading, because with a composite key cardinality of 15 it just keeps all its internal bookkeeping in memory. In fact, it will try to use memory for its bookkeeping all the way up to SORTSIZE, and only after that limit is reached will it start offloading parts of it to utility files in the SORTWORK space. That's why nowadays, when memory is huge even on small hardware, you see SORT grabbing so much RAM in order to speed things up. This is in sharp contrast with the days of yore, when memory was scant (just 20 years ago on the mainframe it was normal to see REGION set to 256K), and the biggest advantage of sorting (and the reason so much effort had been dedicated to optimizing its algorithms) was that, in the end, it could get the job done using very little memory (albeit at the expense of processing time and with its bookkeeping kept on disk or even tape), while a fast lookup table approach that would require keeping all distinct keys in memory would fail for lack of it.
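For reference, the knob in question is the SORTSIZE system (or PROC SORT statement) option; a purely illustrative setting (the 4G value and the key variables K1-K5 are made up):
options sortsize = 4G ;   /* let SORT keep its bookkeeping in memory up to 4 GB */
proc sort data = have ;   /* it spills to SORTWORK utility files only past that */
  by k1 k2 k3 k4 k5 ;     /* hypothetical 5-variable composite key              */
run ;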
In my real-world experience, SORT used for deduping starts sputtering somewhere around 50 GB and up. At one org I consulted for, there were up to 10 extracts, between 50 and 100 GB each, from SQL Server that needed to be deduped as the first ETL cleansing step. It was taking SORT 1.5 to 2 hours of real time apiece, not to mention constant nagging problems with the SORTWORK space (on Linux), and as a result, the ETL would simply fail to fit into the off-peak time window during which it had to run to make the data available in the morning. The problem was solved by keeping only the keys plus the RID and sort-deduping just those, then using the RIDs to mark the duplicate records in the original extracts for deletion. Since it reduced each dedup process to no more than 3 minutes, it enthused me enough to describe the method here:
https://support.sas.com/content/dam/SAS/support/en/sas-global-forum-proceedings/2018/2426-2018.pdf
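Schematically, the idea looks like this (a sketch only, not the paper's exact code: K1 and K2 stand in for whatever the real dedup key is, and instead of marking records for deletion in place, the dupes are simply filtered out into WANT):
data keys (keep = k1 k2 rid) / view = keys ;
  set have ;
  rid = _n_ ;
run ;
proc sort nodupkey data = keys out = keyset dupout = dupes (keep = rid) ;
  by k1 k2 ;             /* KEYSET gets the distinct keys, DUPES gets the RIDs of the extra copies */
run ;
data want ;
  if _n_ = 1 then do ;
    dcl hash h (dataset: "dupes") ;
    h.definekey ("rid") ;
    h.definedone () ;
  end ;
  set have curobs = rid ;
  if h.check () ne 0 ;   /* keep only the records whose RIDs are not among the dupes */
run ;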
In the present OP's case, where all the variables form the composite dedup key, the concept can still be applied by combining it with MD5 to create a single $16 key variable and thus reduce the sorting workload. In this case, though, the number of dupes greatly exceeds the number of records to keep, so it's better to rewrite HAVE into WANT using the RIDs corresponding to the distinct records rather than using the RIDs corresponding to the dupes to mark the records in HAVE for deletion. For example:
/* Sample HAVE: 1.5 * N rows drawn at random (with replacement) from sashelp.class, so dupes are guaranteed */
data have ;
  do _n_ = 1 to 1.5 * n ;
    _iorc_ = ceil (ranuni (1) * n) ;
    set sashelp.class nobs = n point = _iorc_ ;
    output ;
  end ;
  stop ;
run ;
/* View V: an MD5 signature of the whole record plus its RID (observation number in HAVE) */
data v (keep = ___:) / view = v ;
  set have ;
  array nn _numeric_ ;
  array cc _char_ ;
  ___m = put (md5 (catx (":", of nn[*], of cc[*])), $16.) ;
  ___rid = _n_ ;
run ;
proc sort nodupkey data = v out = nodup (keep = ___rid) ;
  by ___m ;
run ;
/* Keep only the records whose RIDs correspond to the distinct signatures */
data want ;
  if _n_ = 1 then do ;
    dcl hash h (dataset:"nodup") ;
    h.definekey (all:"y") ;
    h.definedone () ;
  end ;
  set have curobs = ___rid ;
  if h.check() = 0 ;
run ;
2. The hand-coded hash outperforms the hash object in this case because the array hash table is extremely sparse: 760007 slots for 15 distinct keys, so most likely there are no collisions at all, and the DO loop terminates after the first iteration. It means that, effectively, it is as fast as key-indexing, except for the rather high cost of computing the hash function (though that is mitigated by the fact that MD5, the costliest part of it, has to be computed anyway).
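For illustration, such an array-based, open-addressing dedup might look roughly like this (a sketch only, not the code posted upthread; the probing scheme, the PIB6. hash, and the names are my assumptions, and the result goes to WANT2 to keep it apart from the step above):
data want2 (drop = ___:) ;
  array hk [0:760006] $ 16 _temporary_ ;             /* sparse table of $16 key slots   */
  set have ;
  array nn _numeric_ ;
  array cc _char_ ;
  length ___m $ 16 ;
  ___m = md5 (catx (":", of nn[*], of cc[*])) ;      /* MD5 signature of the record     */
  ___h = mod (input (___m, pib6.), 760007) ;         /* hash = MOD of its first 6 bytes */
  do while (hk[___h] ne " " and hk[___h] ne ___m) ;  /* linear probing on collision     */
    ___h = mod (___h + 1, 760007) ;
  end ;
  if hk[___h] = " " then do ;                        /* empty slot: first occurrence    */
    hk[___h] = ___m ;
    output ;
  end ;
run ;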
Also, array-based hashing always outperforms the hash object by a healthy margin, both in terms of time and memory footprint, if the key is a single integer variable - or a combination of keys that can be mapped to a single integer one-to-one very inexpensively. This is because in this case the hash function is nothing but MOD, so it's very fast, which makes the entire algorithm very fast as well. Memory-wise, the shortest hash object item, even with a single numeric key, is 48 bytes, while one array item for such a key is 8 bytes. Therefore, even if we make the array 3 times sparser vis-a-vis the number of keys stored in it (i.e. 24 bytes per key stored), its memory footprint is still half that of the hash object table.
Furthermore, such an open-addressed hash table is faster than the hash object for data aggregation, and the reason is simple: with the former, we can add directly to the array item, while with the latter, we first need to copy the value from the hash item into the PDV, add to it there, and then replace the item's value with the new one.
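To make the contrast concrete, here's a minimal sketch under assumed names: a hypothetical data set TRANS with a non-missing, nonnegative integer key ID and a numeric AMOUNT to be summed, and an arbitrary prime table size that comfortably exceeds the number of distinct keys.
%let hsize = 999983 ;                                /* table size (assumed)            */
data totals (keep = id total) ;
  array hk [0:%eval(&hsize - 1)] _temporary_ ;       /* key slots                       */
  array hs [0:%eval(&hsize - 1)] _temporary_ ;       /* running sums, one per key slot  */
  do until (eof) ;
    set trans end = eof ;
    _k = mod (id, &hsize) ;                          /* hash function = MOD             */
    do while (hk[_k] ne . and hk[_k] ne id) ;        /* linear probing on collision     */
      _k = mod (_k + 1, &hsize) ;
    end ;
    hk[_k] = id ;
    hs[_k] = sum (hs[_k], amount) ;                  /* add straight into the slot      */
  end ;
  do _k = lbound (hk) to hbound (hk) ;               /* unload the aggregates           */
    if hk[_k] ne . then do ;
      id    = hk[_k] ;
      total = hs[_k] ;
      output ;
    end ;
  end ;
  stop ;
run ;
/* The same aggregation with the hash object: the running total has to be
   fetched into the PDV with FIND, updated there, and written back with REPLACE. */
data _null_ ;
  dcl hash h () ;
  h.definekey ("id") ;
  h.definedata ("id", "total") ;
  h.definedone () ;
  do until (eof) ;
    set trans end = eof ;
    if h.find () ne 0 then total = . ;               /* copy the item's data into PDV   */
    total = sum (total, amount) ;                    /* update in the PDV               */
    h.replace () ;                                   /* write it back into the item     */
  end ;
  h.output (dataset: "totals_hash") ;
  stop ;
run ;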
Thanks for the nice discussion.
Kind regards
Paul D.