11-05-2013 04:09 PM
I've been working with hash objects and reading about the topic for quite a while now but I have a math background and not an IT one and I fail to see how the following contradiction could've been managed handled.
HASHEXP allows the user to set the number of hash buckets and the efficiency of a hash table comes largely from distributing the keys uniformely over the hash buckets as finding the bucket is O(1) whereas collision handling is O(log_2(N/#buckets)) where N.
Now according to Paul Dorfman's
The natural approach to a hash function would be the division function which is said have good properties (in terms of uniformity of distribution over buckets) if the number of buckets is a prime number distant from a power of 2. At the same time, HASHEXP allocates a power of 2 buckets (2^hashexp). I couldn't find any documentation on what specific underlying hash function was implemented by SAS but it feels contradictory to specify a power of 2 number of hash buckets if I wish to get an O(1) lookup table. Don't get me wrong, since it is all in-memory, even an o(log_2(N)) binary tree where badluck put everything in the same bucket and the binary tree collision has to be ran through for every record is probably going to save me time over the I/O intensive sorts but I am going to do a small presentation about the hash tool to colleagues in a few weeks and while I expect the mass never to have used it as we normally don't have access to internet at our workstations (joy of NSOs) and thus many colleges seem to have lagged behind over the new SAS features but at the same time, still, I expect to have a few people in the audience with far more knowledge than me on the IT/SAS intelligence side and wish at least not to look like a fool if specific questions about the hash function come up.