Clearing Up Some Hash Object Mysteries - A Trialogue

4 Likes

Preface

The German slang word for jargon and gobbledygook is Parteichinesisch. The literal translation of the word is Party Chinese. It is what uninitiated lay people used to call what was to them an incomprehensible language spoken and written by members of a German political party. The three people who took part in the conversation below @DonH @hashman and @jsattler made a good faith effort to use simple English when talking about a technical subject. They hoped that by doing so, the subject matter might become more accessible than it is sometimes in SAS conference papers and documentation.

Person A – I’d like to understand some things about hash objects in greater depth and am glad we can all chat. Let me start by saying that I do have at least a basic working understanding of hash objects and object-oriented programming used in SAS. For example, I know they are in-memory structures created and manipulated in a DATA step. The manipulation techniques include what is called “methods”, "attributes", and "operators". Each performs a specific detailed manipulation of the table and what’s in it.

Person B - Hash object methods remind me of SAS functions, small tools that manipulate variables in specific ways. There a more than 600 functions in SAS.

Person A – That’s right. And you might also want to think about other similarities between hash objects and other things in SAS that are probably familiar to you. For example, SAS data structures called arrays are created in a DATA step and kept in memory. They too, like hash tables, are transient structures that vanish when the DATA step’s processing finishes.

Person C – Yes, that’s true. But arrays are fixed-size structures predefined at compile time. A hash table can grow and shrink dynamically at run time. That’s an important difference between arrays and hash tables.

Person A – I’d like to say one more thing about arrays. When I teach SAS to new programmers, I try to make it easy for them to digest what seems to be strange. Arrays are temporary groupings of variables, they are told in the course. That terminology sounds like jargon. I tell students that they already know about temporary groupings of variables. The VAR statement in the PRINT procedure is an example. But let’s come back to the subject of methods. I think there are a lot of them.

Person B – Definitely, I took a look at some recent documentation and counted about 30 methods and operators. I’ll put a list of them at the end of our conversation.

Person A – What we notice is that SAS hash objects use a type of programming called "object-oriented." This language fits comfortably into SAS as a 4GL. The programming statements are short and powerful and allow complicated applications to be built with little code. But let’s be honest - object-oriented programming is an acquired taste for many people, and not everyone likes to learn it.

Person B – Have you ever worked with SAS/AF? It also uses objects and methods. Specific methods are used for particular objects like list boxes. It isn’t easy to learn when and how to use these methods, but once you get up the learning curve, SAS/AF makes it possible to build very powerful end user applications.

Person C – True enough. But spending a lot of time learning how SAS/AF and Screen Control Language (now called SAS Component Language) work is not a necessary building block for anyone trying to learn about hash objects. At best, pointing out that there is “nothing new under the sun” can make us feel comfortable that there are antecedents in SAS to hash object programming. Some people might possibly be curious about other uses of object-oriented programming in SAS.

Person A – I think we’re all on the same wavelength about hash objects in SAS in general. However, as I said when we started to chat, I’d like to dig a little deeper into hash objects and see what’s going on internally. SAS programmers know that working with a 4GL can be done better if one has an understanding of internals.

Person B – Can you give me an example?

Person A – Sure. Consider the two-phases of code processing in SAS, compiling and executing. Data, for a SAS program, is like gas in a car engine. Before the engine can burn the gas and run, it has to have everything in place and ready to go – the ignition system, the fuel system, the exhaust, etc. Only then will the gas get burned. A SAS program has to be compiled first before any data gets fed into it. Compiling puts all the pieces together – it checks the code syntax, sets up structures to hold data, and translates SAS code into machine language.

Person B – So what? Why does a SAS programmer benefit from this “under the hood” knowledge?

Person C – Let me jump in here. I’ll give you a simple illustration. When a program is compiled, it does so one line of code at a time starting at the first line. It’s possible to put statements in the wrong line order in your code. Take variable length. If you start your code with an assignment statement like City=’New York’, the literal value sets the length of the city variable to be 8. If you put a length statement after it, to accommodate longer city names, it’s too late. Their values will be truncated because the literal value has already called the shots.

Person A – Right. If you understand the two phases of processing in SAS, you’ll be in a better position to write good code. What I’d like to know is how hash objects work “under the hood.” Can one be a better “hasher” if one understands internals?

Person C – That’s a good question. Let’s hold off answering it until we are clear about some things. We’re really getting ahead of ourselves. Let’s go back to basics. First, language can be confusing. We need to have a clear understanding of terminology. Let’s define hashing first. Then we can discuss other things like hash functions and more.

Person A – I agree! Let’s get back to basics. The word “hash” comes from the French word “hacher” meaning to cut up into little pieces. We all know about corned beef hash and how you make it by chopping some meat together with other things like onions and potatoes.

Person B – So what are we cutting up with our program?

Person C – Our data. We take some data and distribute it into a small number of containers or buckets. A large amount of data is cut into pieces.

Person A – Really? That’s it?

Person C – No, that’s just for openers. It’s a high-level description of what hashing is about. If we really want to understand this, we should focus on how the cutting up and distribution works. Remember, there is no such thing as magic in data processing.

Person A – I think I’m getting it. Internally, as data gets fed into our hash object, it gets cut into pieces. There is something like a quality control person on a conveyor belt who sorts the things on the belt into good products and rejects.

Person C – Yes, that’s a pretty good analogy, except that with our hash object the data get sorted into buckets and nothing gets rejected - provided that the data are clean.

Person B – So what kind of conveyor belt “sorting” is being applied here?

Person C – Great question. In the SAS implementation of hashing, there is an internal function to do that. The function returns an integer value to a key and stores the key and associated data in a bucket numbered by that integer.

Person A – So what happens if we want to match data with what’s stored in the buckets?

Person C – The key-value of data being read in gets the same function treatment and is assigned the same bucket number. That makes it easy to locate a matching bucket and the data in it. The keys in each bucket of the hash table are inserted and stored in ascending order. Searching is fast because it uses what is called the “binary search” algorithm. In simple English, that means it divides the keys in half and goes to the middle of the keys in the bucket to see whether the value it’s trying to match is above or below the middle of the sorted keys. It repeats this dividing to and subdividing into upper and lower parts until it finds the corresponding key or determines that it's not in the bucket. And if it's not in the bucket, it is not in the entire table, either, because thanks to the hash function, any given key can reside in one and only one bucket only.

Person A – Ah so! It reminds me of SAS indexes.

Person B – Going back to the conveyor belt analogy, we’re dealing here with something like eggs that can be sorted into small, medium, large and jumbo buckets. A small egg is always put into the small bucket because it is always identified as small. The bucket number assignment function works the same way. It won’t mis-classify a small egg as a large one. Of course, the category “small” covers many different small measurements. If we want to find a particular egg size, we go to that bucket and start doing our binary search.

Person A – When I looked at the documentation, I saw that there are hash functions in SAS. For example, there is something called MD5. Another is called SHA256. Are these used in hash object programming for assigning buckets and for searching?

Person C – No, those functions have a different purpose and convert data into strings with obscure values. The little pieces are the converted string values at the byte level.

Person A – If they are obscure, what the heck good are they?

Person C – Well, they are good for enciphering some data. This belongs under the heading of cryptology. I think that such functions could help make data secure. That’s a completely different reason to use a hash function than what we’re chatting about.

Person A – This is all beginning to make sense. But I’m still curious about something else that goes on internally. If the items are added to the buckets dynamically, as data is being read into the hash table, does the bucket structure need to be managed in the process to make searching through it efficient? What keeps the bucket structure from being out of proportion and unbalanced, so that some searches might be slower than others? In other words, if the number of items in a bucket increases as data gets fed in, does the bucket structure adjust itself as it’s being built?

Person C – Cool question. Of course there is self-adjustment. Here we get into the field of data processing theories. What is the theoretical basis or formula used to balance and re-balance the bucket structure dynamically?

Person C - The theory behind this dynamic structure was invented by two Russian researchers – Georgiy Adelson-Velsky and Evgenii Landis. The structure built according to this theory is called an "AVL tree" and named after the first initials of their last names. Their goal was to retain a balance between the left and right sides of the tree. Makes sense. Imagine the result if you put all of your decorations only on one side of your Christmas tree as you decorate it.

Person B – Adelson-Velsky went on to become a creator of computerized chess programs. His work evolved into Kaissa (named after the goddess of chess), the world’s first computer chess champion. I guess you can call people who apply mathematical theory to programming “data scientists”, although nowadays that term means something else. If anyone listening to our conversation is comfortable with higher level math, a Google search will show some of the formulas used for self-balancing trees.

Person A – Thank you both for explaining many things about the internals of the SAS hash object. There is another expression I’ve read that I don’t fully understand: “Collisions”.

Person B – As we’ve seen so far, hashing means assigning many keys to a relatively small number of buckets. In addition, it performs dynamic tree balancing as we’ve also seen. The third thing it does is to resolve “collisions” within a particular bucket. In other words, hashing performs multitasking, the same way that compiling does. Let’s take a look at “collisions” and how they are resolved.

Person C – Let’s say that three key values and their associated data get mapped to the same bucket. A search for the bucket number doesn’t do enough for us. We need to micromanage the search at the bucket location and perform the binary search within the bucket to find if our search key is there. That adds work to the matching process, but not enough to be a problem. Overall, thanks to the binary search, key matching is very efficient.

Person A – I’m not totally happy with the word “collision", although I do understand now what it means. It’s more like overcrowding. Two passengers on an airplane somehow get tickets with the same seat number. It can happen. The onboard crew has a little extra work to do before everyone gets seated, but this is a small job and won’t delay the departure time. Most of the passengers don’t “collide” and the boarding process is efficient.

Person B – I read a SUGI paper that deals with collisions in detail. If you’re interested in the details of collision management, be sure to check it out.

Person B – Is there anything else we need to chat about?

Person C – Yes. We still haven’t talked about how understanding internals can help make us better programmers. What kinds of mistakes can we make with hash objects and how can we diagnose the problems causing the mistakes and correct them?

Person A – I would guess that there are different kinds of mistakes like what we normally have in SAS. I imagine there are mistakes with syntax and mistakes with logic.

Person C – Correct. And as with the SAS experience we have already, we know that logical errors aren’t always easy to understand. In SAS, we have debugging techniques for logical errors (e.g., the PUTLOG statement and interactive debugger), but I’m not sure we have anything equivalent with hash objects.

Person B – Let’s take this one step at a time. Syntax errors are easier to diagnose and we can get error messages in the log when there is a problem with hash object syntax. So let’s talk about syntax errors first to get started.

Person A – Are syntax errors in hash object programming caused by the hash object syntax itself or by the relationship between the hash object code and the SAS compiler? What do we need to understand about internals to understand either one?

Person C – One of the errors related to the hash object by itself could be caused by the amount of memory available to process the data. If the process runs out of memory, everything comes to a stop. In this case, you would need to allocate more memory to the job. This is a very simple internal thing to understand – the hash object expands dynamically, provided that it has room to do so. I think we can stretch our definition of “syntax” to include how system options are set.

Person B – Another relatively simple thing to remember is that hash objects vanish at the end of the DATA step. That means you can’t reference one in a new DATA step if you used it before in a preceding DATA step. This and the memory error show that “understanding internals” can mean something you can learn quickly.

Person A – I’m wondering about timing. If the hash object is activated at run time, how can the SAS compiler check its syntax?

Person C – You’re catching on. Yes, the hash object working at run time and the compiler working before run time can be a problem. At the very least, SAS may display errors in the log to indicate that a hash variable is "undeclared". To fix this, you must make sure that for every hash variable defined to the hash object (which happens at run time) a variable with the same name has been created in the Program Data Vector at compile time. If you add a statement or two to your code to ensure that, you’ll turn off the errors. It doesn’t seem very elegant to program this way, but if you can live with it, I guess it’s OK.

Person A – Can we talk a little about logical errors using hash objects?

Person C – Sure, but let me say in advance that analyzing them isn’t always easy. For example, we can't use the PUTLOG statement to "see" what's in the table. However, at any given point at run time we can dump the data content of the table to a SAS data set and examine it there - even in the process of using the interactive debugger. The communities article SAS Hash Object Debugging Tips provides some examples of what it’s like to analyze logical errors in hash object logic.

Note by transcriber of the conversation: At this point, all three people decided it was time to stop chatting and go out for a beer. They returned the next day and added some comments to the transcript of their conversation.

Some Afterthoughts and Comments

Person B – I think I know a way to explain self-balancing trees visually. Let’s use a do-it-yourself approach. We can create a bunch of buckets dynamically and see what happens. Click on the link above. Then enter a number into the Insert box and push the Insert button. Do this a few times. You’ll see how the structure gets built as you insert numbers. In technical “parteichinesisch language”, the structure you’re building is called a “Tree.” As you add more numbers, you’ll see how its structure changes. After you have built the tree, you can see what happens when you want to find one of the numbers using the Find box and button.

A Listing of Methods, Attributes and Objects (which may or may not be complete)

Methods
- ADD
- CHECK
- CLEAR
- DEFINEDATA
- DEFINEDONE
- DEFINEKEY
- DELETE
- EQUALS
- FIND
- FIND_NEXT
- FIND_PREV
- FIRST
- HAS_NEXT
- HA_PREV
- LAST
- NEXT
- OUTPUT
- PREV
- REF
- REMOVE
- REMOVEDUP
- REPLACE
- REPLACEDUP
- SETCUR
- SUM
- SUMDUP
Attributes
- ITEM_SIZE
- NUM_ITEMS
Objects and Statements
- DECLARE Statement
- Hash
- Hash Iterator
- _NEW_ Operator

Person A - It might be helpful to see methods classified the same way functions can be put into categories (numeric, character, statistical and so on.) The two item attributes remind me of variable information functions in SAS, like vname and vtype. They don’t modify a variable but return descriptive information about it.

Person C - One such grouping can be found in chapters 2 and 3 of the SAS Press book Data Management Solutions Using SAS® Hash Table Operations: A Business Intelligence Case Study.