Baking With Arrays Versus Cooking With Hash: In-Memory Lookup Techniques

1 Like

Watch this Ask the Expert session to learn the difference between these two in-memory techniques.

You will learn to:

Write code to use arrays and hash objects and easily spot the differences between the two.
Improve your programming efficiency using arrays and hash objects.
Execute complex data manipulation tasks with fewer lines of code.
Use efficient and convenient in-memory techniques for quick data storage and retrieval.
Save on input/output, one of the most expensive computing resources, by utilizing arrays and hash objects.

The questions from the Q&A segment held at the end of the webinar are listed below and the slides from the webinar are attached.

Q&A

I use LARGE _Temporary_ arrays, like 1 M or more. Do you have any tips or suggestions for speed?

By using _temporary_ arrays, that in itself is likely to speed things up as opposed to having to look up information in the PDV. You can also use the Keep= dataset option to limit the number of variables reading into the hash object.

Variable "code" is in your Foods dataset. But "code" is also used as an iterator?

Code is used as the index column while the array is being populated. Once the array is populated, the section of code that is doing the lookup is using the value of code from the foods data set as the subscript into the array.

How do you specify more than one variable in the hash data part?

In the definedata method, just separate each of the variables with a comma.

Is there a difference between declaring the hash before the set statement, versus declaring after the set statement?

The hash object needs to be created (declared) and populated before you can use the find method to look up a row in the hash object.

What is call missing(country) in the hash definition?

Call missing(country) explicitly assigns a missing value to the country variable. Without it, you'll get a message in the log indicating that country is uninitialized.

How do you know the key "code" variable must match with the "code" key from the hash?

When you invoke the find method, the default is to use the variable (code) that's defined as the key to do the lookup.

Which is better (quicker) HASH or FORMAT matching?

Hash matching is generally considered faster than Format matching. Formats use a binary search and that splitting can be very slow as tables grow in size. Here’s a link to more information about speed comparisons.

Could HASH object use more than one DATA Field, or you'll need multiple HASHes to define multiple data fields?

A hash object can have multiple data fields.

I have gotten an ERROR when using an array with both implicitly and explicitly. How are you avoiding that?

With explicit arrays, the index specification (either an asterisks or number of elements in the array) must be explicitly defined in the array statement after the array name in curly parentheses ({}).In an implicit array, an index specification indicating the number of array elements is not required. If an error message appears, it would be good to check that the explicit array has been explicitly provided the number of elements.

In CAS, is there still a size limit on the Hash object? If yes, what's the threshold?

CAS tables can exceed physical memory size because of the way CAS transparently handles things using the CAS backing store. But that applies only to CAS tables.

What are the benefits of using arrays/hashes versus joins in Proc SQL?

Arrays and hash objects are in-memory lookup techniques where you gain the advantage of IO, moving data from external storage to memory and then doing the processing in memory. Contrast that to SQL joins, data step merges, if- then-else conditional processing statements where I/O expense is incurred as we go to disk each time to perform a table lookup. So, the advantage or arrays/hashes would be that the processing is done in-memory.

What would be the benefit of using a hash as opposed to an SQL merge?

I believe you meant a SQL join. (Terminology wise the data step performs a merge, SQL performs a Join). This question is similar to the previous one comparing arrays/hashes with SQL merges. The if then conditional processing is performed in-disk. Lookup techniques like arrays, hash objects are performed in-memory. They load the data into memory. Once the entire hash is in-memory, the analyst can go and look up items, find a match, take action on matching rows and perform different tasks with the data in-memory.

What could be another use case for HASH besides Lookup applications?

With the hash iterator object called a hiter object, we could go in and order the data. If you need a top ten report or a bottom 10 report, you could use the hiter object to move forwards/and or backwards in the hash object. SAS programmers can also see the hash object as an alternative to any proc-sql-join or data-step merge. In addition to Lookup applications, Hash tables offer swift and convenient access to data using keys, can efficiently manage dynamic and heterogeneous data, reduce space complexity by avoiding duplicate elements, and support operations such as, counting the number of items in a table, removing items, deleting duplicates. Etc.

I got confused as to which dataset the find method was referring to, can you explain again?

The FIND method is using the value of the variable code in the PDV(, which is the key value). The key variable is called code. The find method is going to take the key value of code in the PDV and try to find a match in the hash object. Where was the hash object built? The hash object was built by loading the country data set. What is the code value sitting in the PDV? The code value sitting in the PDV is coming in from the Foods data set. We're taking the code of foods and we're looking up that code in the hash object to grab the value of country from the Hash.

When do you recommend using traditional join/merge techniques over hash/arrays?

You're looking at two different techniques here. Looking at the traditional techniques, we call them in-disk techniques and then we're calling these in-memory techniques that we shared today. Sometimes, you'll find if you have very large table that you're going to load into hash that needs a lot of memory. If you have memory restrictions, then you can go back and you can use our in-disk techniques. This concept is quite complex because we are talking computing resources and then we start to talk more about what resources are important to me. Is it the storage? Is it memory? Is it CPU? Is it I/O? The answer to this question eventually boils down to it depends- what is important to you, what kind of memory do you have. Those are some of the questions and the answers to this question that will inform the decision you make. Do you have enough memory? OK, let's go use hash object. Don't have enough memory? OK, I'll probably use that in-disk technique.

May SQL queries that do not have indexes by certain data elements and will take long time to process be replaced with hashing array usage technique?

Certainly. That's certainly a possibility if you don't have your indexing in your SQL query and it's taking a very long time to process. I would consider as you attempt, and this is a general suggestion for everybody. I would have you consider the concept of benchmarking. Now benchmarking is a topic that everybody talks about. But there are so many items related to best practices and benchmarking. Because with benchmarking, you want to make sure that you open your SAS session, run that code using SQL on the same data, and then close up your SAS session. Of course, make sure you have options fullstimer turned on so you can get that resource statistics in the log. Make a note of the resources used for this run of your code, then exit your SAS session. Open up a fresh SAS session, look at the same code, the same data, and run your hash object code. The technique I want you to consider is that you need to do benchmarking with the two techniques in separate SAS sessions. Why is that? Because if you've seen the concept of car racing, where a race car goes ahead and the second car gets pulled in like a drag using the momentum of the first car with barely any effort from the 2^nd car. That's basically what's happening. If in a SAS session I execute a SQL query. Right after that, I execute a hash object. What happens is that SAS is going to load modules into memory. So for your execution of the 2^nd technique, it has the potential to be faster. It's not really indicative of the technique being truly faster. So, ensure that you open up separate SAS sessions to benchmark two techniques. Other things on benchmarking which are really important, benchmark technique at the same time of the day. Be sure that you're not benchmarking the SQL technique at 9:00 AM and a hash object at noon, because at different times of the day there might be network bandwidth that play into the final resource statistics. Ideally utilize best practices non benchmarking. That will help each of you decide which techniques should I use based on my resources, my company needs, my data size, etc.

Can we use hash with large datasets without losing time performance?

I would have you go back to the link where this user cut 90% of the processing time using hash object, so you have a real live user experience that you can use.

Since arrays/hashing uses memory, is there a limit to the size of data that can be processed?

It basically comes down to your organization's memory restrictions. How much memory room do you have? What other applications are running? A short and simple way is to look at your task manager. Do a control-alt-delete, look at your task manager, see what applications are using memory. And perhaps dedicate your hash object memory to everything that your computer has. But closing down a bunch of applications that consume huge memory would be a good start. If you're watching a video, for example, ON24 platform, these online platforms could consume memory. So just do a little check on your side to see what's available and then you should be able to work with them.

Can definekey be multiple variables?

Yes, they can be used with multiple key variables.

Why was call missing(country) necessary?

It's related to the length statement. Recall one important thing about the hash. The key variables and the data variables need to be variables in the program data vector. The length statement guaranteed that country is set up in the program data vector as character with the length of 16. But there's no way for SAS to initialize the variable to missing since the length statement is only compile time. It's not an executable statement. SAS doesn't have instructions to set country to missing, so you're going to get a warning. A note in the log saying variable country has not been initialized and therefore we have to set CALL MISSING country. We have to set that so we are giving SAS instruction that country is set to missing. But I overcame that. If you look at the code below, you’ll see a nifty little trick to remove the extra work of 1. Proc contents, 2. Length statement, 3. Call missing by simply adding an IF 0

data cooking;

if _N_=1 then do;

if 0 then set bakecook.country

(keep=code Country);

declare hash C(dataset:"bakecook.country (where=(code=&ccode))");

C.definekey('code');

C.definedata('country');

C.definedone();

end;

set foods.'codefooddesc$'n;

If c.find()=0;

run;

If your data is in a database like Teradata, how do you build a hash table with 'in database' syntax?

So this is something that I'll be sharing at a different as the expert later in the year. But a short answer to that is database connections can be either implicit or explicit pass through. Essentially, if we can point to the database with the LIBNAME statement, then we can actually reference it much like it's a SAS table and we can work with all of the items in the hash object with standard data steps code.

Is there a nice site where you can recommend, Has objects is next language I need to crack in my language list?

The slide presentation has a list of handy resources at the end which can be used as reference, in addition to the link below.

https://support.sas.com/resources/papers/proceedings20/4415-2020.pdf

Recommended Resources

Better Hashing

Value Of Arrays

Table Lookups: From IF-THEN To Key-Indexing

Table Lookup Techniques: From the Basics To The Innovative

Different Ways of Combining SAS Tables

Moving from SAS®9 to SAS® Viya®

Move to Viya Board

Please see additional resources in the attached slide deck.

Want more tips? Be sure to subscribe to the Ask the Expert board to receive follow up Q&A, slides and recordings from other SAS Ask the Expert webinars.

Baking With Arrays Versus Cooking With Hash: In-Memory Lookup Techniques

SAS Innovate 2025: Call for Content

Click image to register for webinar

Classroom Training Available!