Hi everyone,
I have a rather general question: Is there a case where one would use an Array over a Hash Object in SAS for table lookups? And if yes, how is it beneficial? Finally, does anybody have an overview (mindmap) where the most important functionalities (lookup, sorting, processing etc.) are linked to SAS methods, functions and procedures?
For a starter like myself it is not always intuitive to decide when to use what.
Thank you!
For this discussion it is necessary to call for two people who knows a lot about the subject: Paul Dorfman ( @hashman ) and Don Henderson ( @DonH )
Few good materials describing some of discussed her topics you could refer to BASUG ( @Quentin ) web page. Especially those 2 videos my be valuable:
Paul's & Don's "Hashing Using Arrays" (https://www.basug.org/videos?wix-vod-video-id=1364e1a03c2c480d988c08b57874298c&wix-vod-comp-id=comp-...)
and
Russ Lavery "An Animated Guide to SAS Hashing: From Your First Hash to Hash of Hashes" (https://www.basug.org/videos?wix-vod-video-id=0c8db09279004466a8af42567908a019&wix-vod-comp-id=comp-...)
In the first Paul & Don explain the road from arrays to hash tables (Paul is the man who was using hash tables in SAS before SAS had hash tables), in the second Russ does a zero-to-hero introduction to hash tables and also compares their performance with formats.
When I'm doing my SAS classes, I'm introducing those concepts (among others) to my students in the following order:
1) arrays (especially temporary ones, because of their general importance in programming and implementing various algorithms),
2) formats (because of their usability [not only for look-ups] and robustness), and
3) hash tables (because they are dynamic data structures and can be "modified" in the execution time, which gives millions of new possibilities, and what makes them my favourite).
If we are talking about arrays and lookup they are extremely fast when you are doing integers lookups, basically because you can do the lookup in constant time.
Arrays also cooperate with the IN operator which makes them useful: IF variable IN array THEN ...
As @Kurt_Bremser wrote, arrays of dates easily fits into memory (even an array of seconds from 2023 fits):
1 %let s = %sysevalf("1jan1700"d);
2 %let e = %sysevalf("31dec9999"d);
3 %put &=s &=e;
S=-94962 E=2936547
4
5 data _null_;
6 array x[&s.:&e.] _temporary_;
7
8 d=dim(x);
9 put d=;
10 do _N_ = &s. to &e.;
11 x[_N_] = 17;
12 end;
13 run;
d=3031510
NOTE: DATA statement used (Total process time):
real time 0.01 seconds
user cpu time 0.00 seconds
system cpu time 0.03 seconds
memory 24261.62k
OS Memory 36736.00k
14
15 %let s = %sysevalf("1jan2023 00:00:00"dt);
16 %let e = %sysevalf("31dec2023 23:59:59"dt);
17 %put &=s &=e;
S=1988150400 E=2019686399
18
19 data _null_;
20 array x[&s.:&e.] _temporary_;
21
22 d=dim(x);
23 put d=;
24 do _N_ = &s. to &e.;
25 x[_N_] = 42;
26 end;
27 run;
d=31536000
NOTE: DATA statement used (Total process time):
real time 0.17 seconds
user cpu time 0.12 seconds
system cpu time 0.06 seconds
memory 246953.62k
OS Memory 259428.00k
So creating and array of dates with 1's for birthdays for XX and XXI century for session with 8GB of RAM is like nothing.
In situations when data are counted in up to hundreds of thousands you wont even notice any difference in performance. And when the data are big, the every time it is "trys-and-errors".
As @SASKiwi wrote, don't forget that SQL joins and Data Step merges works perfect for look-ups and often are 1) faster to implement and 2) understood by majority (even non SAS programmers).
In my opinion the array option might be preferred if the list of values is relatively small and subject to frequent change. Since hash is going to require a data set it may be easier to update an array definition than to create a new data set.
Your subject line includes for Proc Format but the question comment doesn't.
Hash and arrays are likely to be adding variables. Formats can be applied to variables that already exist if the need is a single source change and may be more flexible because changing the format in the body of a report or analysis procedure can be significantly quicker than adding or modifying variables in large data sets plus formats can be applied to many variables with minimal code. Hash or an array is going to require multiple uses of the values. Formats also, for the same reasons, lend themselves a bit more to "what if" scenarios. As in checking what if this boundary value is change by 2 units. Just modify the format and run the analysis/report. Of course the limitation of a format is that it only applies to a single variable.
All things being equal, choose the lookup technique you are most familiar with. If you are familiar with more than one, choose the one that fits best with your use case. It appears hash objects get the most visibility at least recently on the Community. However these are definitely an advanced SAS programming technique. There's little point using techniques you don't fully understand, and even more importantly, techniques your team members don't understand. For that reason I use hashes sparingly, typically when it is the most effective and efficient solution. Personally I think lookup formats are underrated. They are a lot simpler to code and in many cases are as fast as hashes. I've got DATA steps doing over 50 lookup formats - one statement each. Imagine the code 50 hashes in a DATA would require....
@ballardw - Multi-variable format lookups are possible by concatenating the variables into one lookup string, then using that to construct and do the lookup. But of course you can only return single values unless you concatenate those as well.
EDIT: Don't forget lookups can also be done in SQL and DATA step joins.
A temporary array provides the fastest lookup. It needs a clearly defined, discrete, known beforehand index which is or can be converted to an integer. E.g. I like to use a temporary array for lookups based on dates Arrays spanning from 1900-01-01 to 2099-01-01 will easily fit into memory. But you need a separate array for every value you need to look up (e.g. one for counts, one for prices).
Hash objects are also fast, do not need that you know the range of keys beforehand, but need to be coded (like the array) into every step when used. Hash objects allow the retrieval of several values with one method call.
Formats are slowest. But they need the least coding effort by the "end user", and can be maintained in a central location. And you also need a separate format for every value group, like with arrays.
For this discussion it is necessary to call for two people who knows a lot about the subject: Paul Dorfman ( @hashman ) and Don Henderson ( @DonH )
Few good materials describing some of discussed her topics you could refer to BASUG ( @Quentin ) web page. Especially those 2 videos my be valuable:
Paul's & Don's "Hashing Using Arrays" (https://www.basug.org/videos?wix-vod-video-id=1364e1a03c2c480d988c08b57874298c&wix-vod-comp-id=comp-...)
and
Russ Lavery "An Animated Guide to SAS Hashing: From Your First Hash to Hash of Hashes" (https://www.basug.org/videos?wix-vod-video-id=0c8db09279004466a8af42567908a019&wix-vod-comp-id=comp-...)
In the first Paul & Don explain the road from arrays to hash tables (Paul is the man who was using hash tables in SAS before SAS had hash tables), in the second Russ does a zero-to-hero introduction to hash tables and also compares their performance with formats.
When I'm doing my SAS classes, I'm introducing those concepts (among others) to my students in the following order:
1) arrays (especially temporary ones, because of their general importance in programming and implementing various algorithms),
2) formats (because of their usability [not only for look-ups] and robustness), and
3) hash tables (because they are dynamic data structures and can be "modified" in the execution time, which gives millions of new possibilities, and what makes them my favourite).
If we are talking about arrays and lookup they are extremely fast when you are doing integers lookups, basically because you can do the lookup in constant time.
Arrays also cooperate with the IN operator which makes them useful: IF variable IN array THEN ...
As @Kurt_Bremser wrote, arrays of dates easily fits into memory (even an array of seconds from 2023 fits):
1 %let s = %sysevalf("1jan1700"d);
2 %let e = %sysevalf("31dec9999"d);
3 %put &=s &=e;
S=-94962 E=2936547
4
5 data _null_;
6 array x[&s.:&e.] _temporary_;
7
8 d=dim(x);
9 put d=;
10 do _N_ = &s. to &e.;
11 x[_N_] = 17;
12 end;
13 run;
d=3031510
NOTE: DATA statement used (Total process time):
real time 0.01 seconds
user cpu time 0.00 seconds
system cpu time 0.03 seconds
memory 24261.62k
OS Memory 36736.00k
14
15 %let s = %sysevalf("1jan2023 00:00:00"dt);
16 %let e = %sysevalf("31dec2023 23:59:59"dt);
17 %put &=s &=e;
S=1988150400 E=2019686399
18
19 data _null_;
20 array x[&s.:&e.] _temporary_;
21
22 d=dim(x);
23 put d=;
24 do _N_ = &s. to &e.;
25 x[_N_] = 42;
26 end;
27 run;
d=31536000
NOTE: DATA statement used (Total process time):
real time 0.17 seconds
user cpu time 0.12 seconds
system cpu time 0.06 seconds
memory 246953.62k
OS Memory 259428.00k
So creating and array of dates with 1's for birthdays for XX and XXI century for session with 8GB of RAM is like nothing.
In situations when data are counted in up to hundreds of thousands you wont even notice any difference in performance. And when the data are big, the every time it is "trys-and-errors".
As @SASKiwi wrote, don't forget that SQL joins and Data Step merges works perfect for look-ups and often are 1) faster to implement and 2) understood by majority (even non SAS programmers).
Thank you for the detailed answer!
Thanks for the kudos to me and Paul (aka @hashman).
I think the choice should depend on the nature of the lookup. Some rough guidelines I have used:
And I would like to add that a SQL join or a DATA Step merge can sometimes be a quite effective solution.
Bottom line: there is really no generic answer to this question.
Hey, Don: In your reply, the bottom line is what I agree with most. In light of which, under a number of scenarios, using arrays can be the simplest, fastest, and most memory-efficient solution - both in terms of table lookup and aggregation.
Friendly regards,
Paul D.
Registration is now open for SAS Innovate 2025 , our biggest and most exciting global event of the year! Join us in Orlando, FL, May 6-9.
Sign up by Dec. 31 to get the 2024 rate of just $495.
Register now!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.