Solved: Re: Table lookups, Hash Object vs Array, vs proc format

ifb10 · Posted 04-24-2023 09:26 AM

Hi everyone,

I have a rather general question: Is there a case where one would use an Array over a Hash Object in SAS for table lookups? And if yes, how is it beneficial? Finally, does anybody have an overview (mindmap) where the most important functionalities (lookup, sorting, processing etc.) are linked to SAS methods, functions and procedures?

For a starter like myself it is not always intuitive to decide when to use what.

Thank you!

yabwon · Posted 04-25-2023 03:11 AM

For this discussion it is necessary to call for two people who knows a lot about the subject: Paul Dorfman ( @hashman ) and Don Henderson ( @DonH )

Few good materials describing some of discussed her topics you could refer to BASUG ( @Quentin ) web page. Especially those 2 videos my be valuable:

Paul's & Don's "Hashing Using Arrays" (https://www.basug.org/videos?wix-vod-video-id=1364e1a03c2c480d988c08b57874298c&wix-vod-comp-id=comp-...)

and

Russ Lavery "An Animated Guide to SAS Hashing: From Your First Hash to Hash of Hashes" (https://www.basug.org/videos?wix-vod-video-id=0c8db09279004466a8af42567908a019&wix-vod-comp-id=comp-...)

In the first Paul & Don explain the road from arrays to hash tables (Paul is the man who was using hash tables in SAS before SAS had hash tables), in the second Russ does a zero-to-hero introduction to hash tables and also compares their performance with formats.

When I'm doing my SAS classes, I'm introducing those concepts (among others) to my students in the following order:

1) arrays (especially temporary ones, because of their general importance in programming and implementing various algorithms),

2) formats (because of their usability [not only for look-ups] and robustness), and

3) hash tables (because they are dynamic data structures and can be "modified" in the execution time, which gives millions of new possibilities, and what makes them my favourite).

If we are talking about arrays and lookup they are extremely fast when you are doing integers lookups, basically because you can do the lookup in constant time.

Arrays also cooperate with the IN operator which makes them useful: IF variable IN array THEN ...

As @Kurt_Bremser wrote, arrays of dates easily fits into memory (even an array of seconds from 2023 fits):

1    %let s = %sysevalf("1jan1700"d);
2    %let e = %sysevalf("31dec9999"d);
3    %put &=s &=e;
S=-94962 E=2936547
4
5    data _null_;
6      array x[&s.:&e.] _temporary_;
7
8      d=dim(x);
9      put d=;
10     do _N_ = &s. to &e.;
11       x[_N_] = 17;
12     end;
13   run;

d=3031510
NOTE: DATA statement used (Total process time):
      real time           0.01 seconds
      user cpu time       0.00 seconds
      system cpu time     0.03 seconds
      memory              24261.62k
      OS Memory           36736.00k
14
15   %let s = %sysevalf("1jan2023 00:00:00"dt);
16   %let e = %sysevalf("31dec2023 23:59:59"dt);
17   %put &=s &=e;
S=1988150400 E=2019686399
18
19   data _null_;
20     array x[&s.:&e.] _temporary_;
21
22     d=dim(x);
23     put d=;
24     do _N_ = &s. to &e.;
25       x[_N_] = 42;
26     end;
27   run;

d=31536000
NOTE: DATA statement used (Total process time):
      real time           0.17 seconds
      user cpu time       0.12 seconds
      system cpu time     0.06 seconds
      memory              246953.62k
      OS Memory           259428.00k

So creating and array of dates with 1's for birthdays for XX and XXI century for session with 8GB of RAM is like nothing.

In situations when data are counted in up to hundreds of thousands you wont even notice any difference in performance. And when the data are big, the every time it is "trys-and-errors".

As @SASKiwi wrote, don't forget that SQL joins and Data Step merges works perfect for look-ups and often are 1) faster to implement and 2) understood by majority (even non SAS programmers).

_______________
Polish SAS Users Group: www.polsug.com and communities.sas.com/polsug

"SAS Packages: the way to share" at SGF2020 Proceedings (the latest version), GitHub Repository, and YouTube Video.
Hands-on-Workshop: "Share your code with SAS Packages"
"My First SAS Package: A How-To" at SGF2021 Proceedings

SAS Ballot Ideas: one: SPF in SAS, two, and three
SAS Documentation

View solution in original post

ballardw · Posted 04-24-2023 10:36 AM

In my opinion the array option might be preferred if the list of values is relatively small and subject to frequent change. Since hash is going to require a data set it may be easier to update an array definition than to create a new data set.

Your subject line includes for Proc Format but the question comment doesn't.

Hash and arrays are likely to be adding variables. Formats can be applied to variables that already exist if the need is a single source change and may be more flexible because changing the format in the body of a report or analysis procedure can be significantly quicker than adding or modifying variables in large data sets plus formats can be applied to many variables with minimal code. Hash or an array is going to require multiple uses of the values. Formats also, for the same reasons, lend themselves a bit more to "what if" scenarios. As in checking what if this boundary value is change by 2 units. Just modify the format and run the analysis/report. Of course the limitation of a format is that it only applies to a single variable.

SASKiwi · Posted 04-24-2023 05:53 PM

All things being equal, choose the lookup technique you are most familiar with. If you are familiar with more than one, choose the one that fits best with your use case. It appears hash objects get the most visibility at least recently on the Community. However these are definitely an advanced SAS programming technique. There's little point using techniques you don't fully understand, and even more importantly, techniques your team members don't understand. For that reason I use hashes sparingly, typically when it is the most effective and efficient solution. Personally I think lookup formats are underrated. They are a lot simpler to code and in many cases are as fast as hashes. I've got DATA steps doing over 50 lookup formats - one statement each. Imagine the code 50 hashes in a DATA would require....

@ballardw - Multi-variable format lookups are possible by concatenating the variables into one lookup string, then using that to construct and do the lookup. But of course you can only return single values unless you concatenate those as well.

EDIT: Don't forget lookups can also be done in SQL and DATA step joins.

Reeza · Posted 04-24-2023 06:14 PM

Pick two and master them. Everyone has their own opinions on which is better, but the best one is the one you can implement the fastest. These days programming time is more expensive than compute times by far.

Kurt_Bremser · Posted 04-25-2023 01:13 AM

A temporary array provides the fastest lookup. It needs a clearly defined, discrete, known beforehand index which is or can be converted to an integer. E.g. I like to use a temporary array for lookups based on dates Arrays spanning from 1900-01-01 to 2099-01-01 will easily fit into memory. But you need a separate array for every value you need to look up (e.g. one for counts, one for prices).

Hash objects are also fast, do not need that you know the range of keys beforehand, but need to be coded (like the array) into every step when used. Hash objects allow the retrieval of several values with one method call.

Formats are slowest. But they need the least coding effort by the "end user", and can be maintained in a central location. And you also need a separate format for every value group, like with arrays.

Maxims of Maximally Efficient SAS Programmers
How to convert datasets to data steps
The macro for direct download as ZIP
How to post code
Please vote for Provide Sequential Search Capability for Hash Objects
How to deal with locked files on UNIX

yabwon · Posted 04-25-2023 03:11 AM