Solved: Hash object internal hash function

PeterClemmensen · Posted 11-07-2018 05:58 PM

I am working my way through the great book Data Management Solutions Using SAS® Hash Table Operations: A Business Intelligence Case Study by @DonH and @hashman. First of all, thank you both for a great book.

I have a question regarding the internal hash function of the SAS hash object. On page 13 it says that the internal hash function working for the hash object behind the scenes is different from MD5. Can anyone explain further what happens when keys are distributed across AVL trees? Is a commonly known algorithm used?

I tagged the two authors in this post since they probably have something clever to say. However, please do not hesitate to reply if you have knowledge on the topic 🙂

Regards, Peter

The DATA to DATA Step Macro
Blog: SASnrd

hashman · Posted 11-09-2018 11:51 PM

Hello, Peter,

First, thank you for your kind words.

Don (aka @DonH) has already conveyed what we meant on page 13 by saying that the internal hash object function is "different".

We simply don't know what it is - the developers haven't shared this knowledge with us - but, based on our rather vast practice with the hash object, we know that it satisfies the two necessary and sufficient prerequisites of a hash function "good" enough to support the hash insert/search algorithm:

- it's fast

- it distributes the keys uniformly across the 2**hashexp trees regardless of the input keys' distribution

As to "what happens when keys are distributed across AVL trees", it's fairly simple:

All the keys sent by the hash function to the same tree are inserted into it in ascending key order, with the AVL algorithm rebalancing the tree at every insertion, so that it is neither too fat nor too skinny irrespective of the distribution of the input keys. That guarantees that when a tree holding N keys is searched, the search time always scales as O(log2(N)) and never degenerates into a linear, i.e. O(N), search.

It's easy to demonstrate that each AVL tree is in ascending key order by creating an object instance with a single tree (coding HASHEXP:0) and looking at its content, for example:

data _null_ ;
  dcl hash h (hashexp:0) ;              /* 2**0 = 1 tree holds every key */
  h.defineKey ("k") ;
  h.defineDone () ;
  do k = 9, 1, 8, 2, 7, 3, 6, 4, 5 ;    /* keys arrive in no particular order */
    h.add() ;
  end ;
  h.output (dataset:"HASH") ;           /* write the table content to data set HASH */
run ;

Looking at the data set HASH, you'll see that the table is in ascending order even though ORDERED:"A" is not coded: the order of K in HASH is 1 2 3 4 5 6 7 8 9. If you code HASHEXP:1 (i.e. create 2 trees), the result will be 2 3 4 5 6 7 1 8 9, with the plausible conclusion that one tree gets 2 3 4 5 6 7 and the other gets 1 8 9. The larger HASHEXP is, the more randomly disordered (to the extent of the randomness of the hash function) the keys will appear in the table.
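For anyone who wants to repeat the two-tree experiment described above, here is a minimal sketch. It is the same step with HASHEXP:1; the data set name HASH1 and the PROC PRINT step are additions for illustration only. Because the internal hash function is undocumented, the order you see may not match 2 3 4 5 6 7 1 8 9 on every SAS release or platform.

```sas
/* Same keys as in the HASHEXP:0 example, but with 2**1 = 2 internal trees. */
data _null_ ;
  dcl hash h (hashexp:1) ;
  h.defineKey ("k") ;
  h.defineDone () ;
  do k = 9, 1, 8, 2, 7, 3, 6, 4, 5 ;
    h.add() ;
  end ;
  /* HASH1 is an illustrative name, not taken from the original post. */
  h.output (dataset:"HASH1") ;
run ;

/* List K in the order the OUTPUT method wrote it. */
proc print data=HASH1 noobs ;
run ;
```

If the listing shows two ascending runs back to back, that is consistent with each tree being dumped in its own key order, one tree after the other.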

On the other hand, if HASHEXP > 0 and ORDERED:"A"|"D" is specified, the key streams from the different trees, each intrinsically ordered, are simply match-merged into a single ordered stream when extracted by either a hash iterator or the OUTPUT method. Since this process is blazingly fast by nature, specifying the ORDERED:"A"|"D" argument tag causes no palpable performance degradation: extracting the keys from the trees in order and merely stacking them up is not appreciably faster than extracting them in order and match-merging them into the output stream.
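The ordered case can be tried the same way. The sketch below assumes HASHEXP:4 (16 trees) purely as an example value, and the names HI and HASH_ORDERED are illustrative. With ORDERED:"A" coded, both the iterator walk and the OUTPUT method should return the keys in ascending order, however they were spread across the trees.

```sas
data _null_ ;
  /* 2**4 = 16 trees; ORDERED:"A" requests ascending key order on extraction. */
  dcl hash h (hashexp:4, ordered:"A") ;
  h.defineKey ("k") ;
  h.defineDone () ;
  dcl hiter hi ("h") ;                  /* iterator linked to table H */
  do k = 9, 1, 8, 2, 7, 3, 6, 4, 5 ;
    h.add() ;
  end ;
  /* Walk the table with the iterator; NEXT returns 0 while items remain. */
  do while (hi.next() = 0) ;
    put k= ;
  end ;
  /* Write the same content to a data set via the OUTPUT method. */
  h.output (dataset:"HASH_ORDERED") ;
run ;
```

Comparing the log lines with the data set HASH_ORDERED is a quick way to confirm that both extraction routes deliver the same single ordered stream.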

Best regards

Paul D.

DonH · Posted 11-09-2018 07:48 AM

This is more a question for Paul (aka @hashman) than me. But he is unable to post right now, so here is a brief summary he provided to me:

The page 13 phrase "Though the hash function working for the hash object behind the scenes is different" refers not to MD5 per se but to "one example of a decent hash function" from the previous sentence, i.e. to the entire expression used for the variable TN. I don't know what kind of hash function SAS uses in the hash object, but it's definitely not the same as the expression for TN. It may or may not include MD5 as part of it, though I doubt it does - most likely, they just make use of some function already available in a C library.

We would both also like to use this question to reinforce the fact that the hash object and the hash function are very different entities. Again quoting Paul:

Of course, for a hash object to work, an internal hash function is a prerequisite (except for the case of hashexp:0, when all the keys fall into the same bucket since it's the only one available). However, this function is an entity quite different from the explicit hash functions surfaced by SAS for anyone to use, such as MD5, SHA256, SHA512, and so on.

A hash function is used internally by the hash object to assign the bucket/branch in the internal tree structure.
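To make the distinction concrete, here is a minimal sketch that uses both entities side by side. The variable names and the key value "abc" are made up for illustration, and SHA256 is assumed to be available in the SAS release at hand. The explicit functions are called like any other function and return a digest, whereas the hash object never exposes which function it applies to the key internally.

```sas
data _null_ ;
  length key $ 8 digest_md5 $ 16 digest_sha $ 32 ;
  key = "abc" ;

  /* Explicit hash functions: the caller receives a binary digest. */
  digest_md5 = md5 (trim (key)) ;
  digest_sha = sha256 (trim (key)) ;
  put digest_md5= $hex32. / digest_sha= $hex64. ;

  /* Hash object: the key is hashed behind the scenes to pick a tree; */
  /* no digest is surfaced, only the result of the table operation.   */
  dcl hash h () ;
  h.defineKey ("key") ;
  h.defineDone () ;
  rc = h.add() ;
  rc = h.check() ;                      /* 0 means the key was found */
  put rc= ;
run ;
```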

Hope this helps clarify the issue.

PeterClemmensen · Posted 11-21-2018 03:03 AM

Thank you both. Makes more sense to me now.

And again, thank you for a great book 🙂

hashman · Posted 11-21-2018 03:17 AM

We greatly appreciate your taking your time to read it.

Paul D.
