BookmarkSubscribeRSS Feed
☑ This topic is solved. Need further help from the community? Please sign in and ask a new question.
ifb10
Obsidian | Level 7

Hi everyone,

 

I have a rather general question: Is there a case where one would use an Array over a Hash Object in SAS for table lookups? And if yes, how is it beneficial? Finally, does anybody have an overview (mindmap) where the most important functionalities (lookup, sorting, processing etc.)  are linked to SAS methods, functions and procedures? 

 

For a starter like myself it is not always intuitive to decide when to use what.

 

Thank you!

1 ACCEPTED SOLUTION

Accepted Solutions
yabwon
Onyx | Level 15

For this discussion it is necessary to call for two people who knows a lot about the subject: Paul Dorfman ( @hashman ) and Don Henderson ( @DonH )

 

Few good materials describing some of discussed her topics you could refer to BASUG ( @Quentin ) web page. Especially those 2 videos my be valuable:

Paul's & Don's "Hashing Using Arrays" (https://www.basug.org/videos?wix-vod-video-id=1364e1a03c2c480d988c08b57874298c&wix-vod-comp-id=comp-...)

and 

Russ Lavery "An Animated Guide to SAS Hashing: From Your First Hash to Hash of Hashes" (https://www.basug.org/videos?wix-vod-video-id=0c8db09279004466a8af42567908a019&wix-vod-comp-id=comp-...)

In the first Paul & Don explain the road from arrays to hash tables (Paul is the man who was using hash tables in SAS before SAS had hash tables), in the second Russ does a zero-to-hero introduction to hash tables and also compares their performance with formats.

 

When I'm doing my SAS classes, I'm introducing those concepts (among others) to my students in the following order:

1) arrays (especially temporary ones, because of their general importance in programming and implementing various algorithms),

2) formats (because of their usability [not only for look-ups] and robustness), and

3) hash tables (because they are dynamic data structures and can be "modified" in the execution time, which gives millions of new possibilities, and what makes them my favourite).

 

If we are talking about arrays and lookup they are extremely fast when you are doing integers lookups, basically because you can do the lookup in constant time.

Arrays also cooperate with the IN operator which makes them useful: IF variable IN array THEN ...

As @Kurt_Bremser wrote, arrays of dates easily fits into memory (even an array of seconds from 2023 fits):

1    %let s = %sysevalf("1jan1700"d);
2    %let e = %sysevalf("31dec9999"d);
3    %put &=s &=e;
S=-94962 E=2936547
4
5    data _null_;
6      array x[&s.:&e.] _temporary_;
7
8      d=dim(x);
9      put d=;
10     do _N_ = &s. to &e.;
11       x[_N_] = 17;
12     end;
13   run;

d=3031510
NOTE: DATA statement used (Total process time):
      real time           0.01 seconds
      user cpu time       0.00 seconds
      system cpu time     0.03 seconds
      memory              24261.62k
      OS Memory           36736.00k
14
15   %let s = %sysevalf("1jan2023 00:00:00"dt);
16   %let e = %sysevalf("31dec2023 23:59:59"dt);
17   %put &=s &=e;
S=1988150400 E=2019686399
18
19   data _null_;
20     array x[&s.:&e.] _temporary_;
21
22     d=dim(x);
23     put d=;
24     do _N_ = &s. to &e.;
25       x[_N_] = 42;
26     end;
27   run;

d=31536000
NOTE: DATA statement used (Total process time):
      real time           0.17 seconds
      user cpu time       0.12 seconds
      system cpu time     0.06 seconds
      memory              246953.62k
      OS Memory           259428.00k

So creating and array of dates with 1's for birthdays for XX and XXI century for session with 8GB of RAM is like nothing.

 

In situations when data are counted in up to hundreds of thousands you wont even notice any difference in performance. And when the data are big, the every time it is "trys-and-errors". 

 

As @SASKiwi wrote, don't forget that SQL joins and Data Step merges works perfect for look-ups and often are 1) faster to implement and 2) understood by majority (even non SAS programmers). 

 

_______________
Polish SAS Users Group: www.polsug.com and communities.sas.com/polsug

"SAS Packages: the way to share" at SGF2020 Proceedings (the latest version), GitHub Repository, and YouTube Video.
Hands-on-Workshop: "Share your code with SAS Packages"
"My First SAS Package: A How-To" at SGF2021 Proceedings

SAS Ballot Ideas: one: SPF in SAS, two, and three
SAS Documentation



View solution in original post

8 REPLIES 8
ballardw
Super User

In my opinion the array option might be preferred if the list of values is relatively small and subject to frequent change. Since hash is going to require a data set it may be easier to update an array definition than to create a new data set.

 

Your subject line includes for Proc Format but the question comment doesn't.

Hash and arrays are likely to be adding variables. Formats can be applied to variables that already exist if the need is a single source change and may be more flexible because changing the format in the body of a report or analysis procedure can be significantly quicker than adding or modifying variables in large data sets plus formats can be applied to many variables with minimal code. Hash or an array is going to require multiple uses of the values. Formats also, for the same reasons, lend themselves a bit more to "what if" scenarios. As in checking what if this boundary value is change by 2 units. Just modify the format and run the analysis/report. Of course the limitation of a format is that it only applies to a single variable.

SASKiwi
PROC Star

All things being equal, choose the lookup technique you are most familiar with. If you are familiar with more than one, choose the one that fits best with your use case. It appears hash objects get the most visibility at least recently on the Community. However these are definitely an advanced SAS programming technique. There's little point using techniques you don't fully understand, and even more importantly, techniques your team members don't understand. For that reason I use hashes sparingly, typically when it is the most effective and efficient solution. Personally I think lookup formats are underrated. They are a lot simpler to code and in many cases are as fast as hashes. I've got DATA steps doing over 50 lookup formats - one statement each. Imagine the code 50 hashes in a DATA would require....

 

@ballardw - Multi-variable format lookups are possible by concatenating the variables into one lookup string, then using that to construct and do the lookup. But of course you can only return single values unless you concatenate those as well.

 

EDIT: Don't forget lookups can also be done in SQL and DATA step joins.

Reeza
Super User
Pick two and master them. Everyone has their own opinions on which is better, but the best one is the one you can implement the fastest. These days programming time is more expensive than compute times by far.
Kurt_Bremser
Super User

A temporary array provides the fastest lookup. It needs a clearly defined, discrete, known beforehand index which is or can be converted to an integer. E.g. I like to use a temporary array for lookups based on dates  Arrays spanning from 1900-01-01 to 2099-01-01 will easily fit into memory. But you need a separate array for every value you need to look up (e.g. one for counts, one for prices).

 

Hash objects are also fast, do not need that you know the range of keys beforehand, but need to be coded (like the array) into every step when used. Hash objects allow the retrieval of several values with one method call.

 

Formats are slowest. But they need the least coding effort by the "end user", and can be maintained in a central location. And you also need a separate format for every value group, like with arrays.

yabwon
Onyx | Level 15

For this discussion it is necessary to call for two people who knows a lot about the subject: Paul Dorfman ( @hashman ) and Don Henderson ( @DonH )

 

Few good materials describing some of discussed her topics you could refer to BASUG ( @Quentin ) web page. Especially those 2 videos my be valuable:

Paul's & Don's "Hashing Using Arrays" (https://www.basug.org/videos?wix-vod-video-id=1364e1a03c2c480d988c08b57874298c&wix-vod-comp-id=comp-...)

and 

Russ Lavery "An Animated Guide to SAS Hashing: From Your First Hash to Hash of Hashes" (https://www.basug.org/videos?wix-vod-video-id=0c8db09279004466a8af42567908a019&wix-vod-comp-id=comp-...)

In the first Paul & Don explain the road from arrays to hash tables (Paul is the man who was using hash tables in SAS before SAS had hash tables), in the second Russ does a zero-to-hero introduction to hash tables and also compares their performance with formats.

 

When I'm doing my SAS classes, I'm introducing those concepts (among others) to my students in the following order:

1) arrays (especially temporary ones, because of their general importance in programming and implementing various algorithms),

2) formats (because of their usability [not only for look-ups] and robustness), and

3) hash tables (because they are dynamic data structures and can be "modified" in the execution time, which gives millions of new possibilities, and what makes them my favourite).

 

If we are talking about arrays and lookup they are extremely fast when you are doing integers lookups, basically because you can do the lookup in constant time.

Arrays also cooperate with the IN operator which makes them useful: IF variable IN array THEN ...

As @Kurt_Bremser wrote, arrays of dates easily fits into memory (even an array of seconds from 2023 fits):

1    %let s = %sysevalf("1jan1700"d);
2    %let e = %sysevalf("31dec9999"d);
3    %put &=s &=e;
S=-94962 E=2936547
4
5    data _null_;
6      array x[&s.:&e.] _temporary_;
7
8      d=dim(x);
9      put d=;
10     do _N_ = &s. to &e.;
11       x[_N_] = 17;
12     end;
13   run;

d=3031510
NOTE: DATA statement used (Total process time):
      real time           0.01 seconds
      user cpu time       0.00 seconds
      system cpu time     0.03 seconds
      memory              24261.62k
      OS Memory           36736.00k
14
15   %let s = %sysevalf("1jan2023 00:00:00"dt);
16   %let e = %sysevalf("31dec2023 23:59:59"dt);
17   %put &=s &=e;
S=1988150400 E=2019686399
18
19   data _null_;
20     array x[&s.:&e.] _temporary_;
21
22     d=dim(x);
23     put d=;
24     do _N_ = &s. to &e.;
25       x[_N_] = 42;
26     end;
27   run;

d=31536000
NOTE: DATA statement used (Total process time):
      real time           0.17 seconds
      user cpu time       0.12 seconds
      system cpu time     0.06 seconds
      memory              246953.62k
      OS Memory           259428.00k

So creating and array of dates with 1's for birthdays for XX and XXI century for session with 8GB of RAM is like nothing.

 

In situations when data are counted in up to hundreds of thousands you wont even notice any difference in performance. And when the data are big, the every time it is "trys-and-errors". 

 

As @SASKiwi wrote, don't forget that SQL joins and Data Step merges works perfect for look-ups and often are 1) faster to implement and 2) understood by majority (even non SAS programmers). 

 

_______________
Polish SAS Users Group: www.polsug.com and communities.sas.com/polsug

"SAS Packages: the way to share" at SGF2020 Proceedings (the latest version), GitHub Repository, and YouTube Video.
Hands-on-Workshop: "Share your code with SAS Packages"
"My First SAS Package: A How-To" at SGF2021 Proceedings

SAS Ballot Ideas: one: SPF in SAS, two, and three
SAS Documentation



ifb10
Obsidian | Level 7

Thank you for the detailed answer! 

DonH
Lapis Lazuli | Level 10

Thanks for the kudos to me and Paul (aka @hashman).

I think the choice should depend on the nature of the lookup. Some rough guidelines I have used:

 

  • Formats are incredibly simple to use if you are only looking a single value. Especially of the relationship between the lookup key and the value to be looked up is relatively static.
  • I've not used ARRAYs much. They have some serious restrictions. The only time I consider them is when the key to the lookup is an integer than can easily be used AND the number of values is small enough that I can conveniently type them in. If the list is long or comes from a data set, loading array is harder IMO that using a CNTLIN data set to create a forrmat, or even to load the data set into a hash table.
  • Hash tables are particularly appropriate IMO when you are looking up multiple values or the lookup is a fuzzy one.

And I would like to add that a SQL join or a DATA Step merge can sometimes be a quite effective solution.

Bottom line: there is really no generic answer to this question.

hashman
Ammonite | Level 13

Hey, Don: In your reply, the bottom line is what I agree with most. In light of which, under a number of scenarios, using arrays can be the simplest, fastest, and most memory-efficient solution - both in terms of table lookup and aggregation.

Friendly regards,

Paul D.     

 

sas-innovate-2024.png

Available on demand!

Missed SAS Innovate Las Vegas? Watch all the action for free! View the keynotes, general sessions and 22 breakouts on demand.

 

Register now!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 8 replies
  • 1427 views
  • 17 likes
  • 8 in conversation