Re: Persisting Hash Tables Across Data Step Boundaries

jimbarbour · Posted 01-07-2020 01:20 PM

I read with interest Art Carpenter's paper from WUSS 2018 on how to get a hash table to persist across multiple data steps. I also attended Art's presentation at WUSS 2018. In the paper, Art describes how one can load a hash table in the beginning of a SAS job and then use that same has table in multiple data steps without having to re-declare or re-load the hash table. Link: https://www.lexjansen.com/wuss/2018/41_Final_Paper_PDF.pdf

Art outlines four basic steps:

1. Design/create the hash table (easy, almost not a step).

2. Embed the hash table declaration and calls (FIND, ADD, etc.) in a SAS user defined function compiled by Proc FCMP.

3. Embed the function call inside a macro.

4. Use the RESOLVE function inside each DATA step to call the macro that calls the function.

A bit convoluted, but in the end, depending on how many DATA steps one is executing and how long it takes to load the hash table, a potential performance gain may be realized.

There's just one problem: In my case, running SAS 9.4 M6 on a Windows Server 2016 Standard machine, my hash table does not persist across step boundaries. I have to re-declare and reload the hash table for each data step. Everything works, the hash table, the compiled function, the macro, and the RESOLVE, but the hash table must be re-declared and re-loaded for each DATA step.

It appears that somehow using the hash table in a DATA step is forcing a re-declaration/reload. If I instanciate the the hash table at a point early in my overall SAS job, I can access the hash table downstream -- without re-declaring/reloading -- even after multiple intermediate SORT, DATA, and SQL steps as long as no DATA step accesses the hash table. However, if I run a DATA step that accesses the hash table, suddenly, the hash table is gone in all downstream steps -- steps where it had been working before. In other words, if I have steps A, B, C, and D, I can declare and load the hash table and use it in all four steps so long as I don't call the hash table from within any DATA step. However, if I access the hash table in, say, a DATA step in step B, then suddenly the hash table is no longer available in steps C and D, even though it was a) previously available and b) there were no code changes to steps C and D.

Has anyone else run into this? Any ideas on any workaround(s)? The code is a bit lengthy, but I'm happy to post it if it were helpful.

Thank you,

Jim

Kurt_Bremser · Posted 01-08-2020 04:39 AM

From the paper:

When requests were made in two different DATA steps the same savings were not realized across DATA steps. This
confirms that the hash table was not saved in
memory across the DATA step boundary. The
first call to the GETSTATS routine within a
DATA step causes the hash table to be
reloaded.

So there's no advantage for this method when a hash object needs to be used in more than one data step.

There's a caveat when using the method as described in the paper: as soon as the macro is used, the hash persists in memory, and you may run into memory problems in other steps:

data big (keep=name age height weight rand);
 length name $15;
 set sashelp.class(rename=(name=oldname));
 do i = 1 to 1000000;
 rand = ranuni(123456789);
 name=cats(oldname,i);
 output big;
 end;
 run;
proc sort data=big;
 by rand;
 run;

proc fcmp outlib=work.functions.hash;
 subroutine GetStats(name $, height,weight,age) ;
 outargs height, weight, age;
 declare hash class(dataset:'work.big');
 rc=class.definekey('name');
 rc=class.definedata('name','height','weight', 'age');
 rc=class.definedone();
 rc=class.find();
 endsub;
 run;

options cmplib=(work.functions);

data lookup1;
 retain height weight age .;
 name='Alfred23456';
 call getstats(name,height,weight,age);
 output;
 name='Alice98765';
 call getstats(name,height,weight,age);
 output;
 run;


%macro callgetstats(name=Alfred1);
 %local name height weight age ;
 %let height=.;
 %let weight=.;
 %let age=.;
 %syscall getstats(name,height,weight,age);
 %put &=name &=height &=weight &=age;
%mend callgetstats;
%callgetstats(name=Barbara98765)

data lookup2;
 retain height weight age .;
 name='Alfred23456';
 call getstats(name,height,weight,age);
 output;
 name='Alice98765';
 call getstats(name,height,weight,age);
 output;
 run;

I took the code from the paper, but added one order of magnitude to the dataset to get some real processing time and memory load.

This is the log:

24         data big (keep=name age height weight rand);
25          length name $15;
26          set sashelp.class(rename=(name=oldname));
27          do i = 1 to 1000000;
28          rand = ranuni(123456789);
29          name=cats(oldname,i);
30          output big;
31          end;
32          run;

NOTE: There were 19 observations read from the data set SASHELP.CLASS.
NOTE: The data set WORK.BIG has 19000000 observations and 5 variables.
NOTE:  Verwendet wurde: DATA statement - (Gesamtverarbeitungszeit):
      real time           6.67 seconds
      cpu time            3.68 seconds
      

33         proc sort data=big;
34          by rand;
35          run;

NOTE: There were 19000000 observations read from the data set WORK.BIG.
NOTE: The data set WORK.BIG has 19000000 observations and 5 variables.
NOTE:  Verwendet wurde: PROZEDUR SORT - (Gesamtverarbeitungszeit):
      real time           12.15 seconds
      cpu time            11.23 seconds
      

36         
37         proc fcmp outlib=work.functions.hash;
38          subroutine GetStats(name $, height,weight,age) ;
39          outargs height, weight, age;
40          declare hash class(dataset:'work.big');
41          rc=class.definekey('name');
42          rc=class.definedata('name','height','weight', 'age');
43          rc=class.definedone();
44          rc=class.find();
45          endsub;
46          run;

NOTE: Function GetStats saved to work.functions.hash.
NOTE:  Verwendet wurde: PROZEDUR FCMP - (Gesamtverarbeitungszeit):
      real time           0.01 seconds
      cpu time            0.00 seconds
      

47         
48         options cmplib=(work.functions);
49         
50         data lookup1;
51          retain height weight age .;
52          name='Alfred23456';
53          call getstats(name,height,weight,age);
54          output;
55          name='Alice98765';
56          call getstats(name,height,weight,age);
57          output;
58          run;

NOTE: The data set WORK.LOOKUP1 has 2 observations and 4 variables.
NOTE:  Verwendet wurde: DATA statement - (Gesamtverarbeitungszeit):
      real time           37.87 seconds
      cpu time            20.43 seconds
      

59         
60         
61         %macro callgetstats(name=Alfred1);
62          %local name height weight age ;
63          %let height=.;
64          %let weight=.;
65          %let age=.;
66          %syscall getstats(name,height,weight,age);
67          %put &=name &=height &=weight &=age;
68         %mend callgetstats;
69         %callgetstats(name=Barbara98765)
NAME=Barbara98765 HEIGHT=65.3 WEIGHT=98 AGE=13
70         
71         data lookup2;
72          retain height weight age .;
73          name='Alfred23456';
74          call getstats(name,height,weight,age);
75          output;
76          name='Alice98765';
77          call getstats(name,height,weight,age);
78          output;
79          run;

ERROR: Unable to add all rows to HASH object 'class'.
ERROR: Error reported in function 'DEFINEDONE_HASH_' in statement number 5 at line 9 column 3.
       The statement was:
    0     (9:3)      rc = DEFINEDONE_HASH_(  )
NOTE: The data set WORK.LOOKUP2 has 2 observations and 4 variables.
NOTE:  Verwendet wurde: DATA statement - (Gesamtverarbeitungszeit):
      real time           2.63 seconds
      cpu time            1.41 seconds

Since the macro version DOES persist in memory, it prevented the reuse of the hash in the second data step. So one should look for a method to clear the macro hash object.

Running the two data steps without the macro call in between had no problems, but also no advantage. Both steps took the same time.

Maxims of Maximally Efficient SAS Programmers
How to convert datasets to data steps
The macro for direct download as ZIP
How to post code
Please vote for Provide Sequential Search Capability for Hash Objects
How to deal with locked files on UNIX

jimbarbour · Posted 01-10-2020 04:09 PM

Kurt,

Thank you for your replies. That is a good point that someone coming after me might not understand the subtleties of a hash table embedded within a function executed by a macro called at execution time via a RESOLVE().

If I can, I'd still like to get my code working. This technique would be nice to have in reserve. It could provide significant performance benefits in some cases.

In my case, I'm gathering data from 20 separate disk drives. Each time I encounter a unique user ID, I need to do a Windows Domain look up to get their full name. Windows Domain look ups are "expensive" in terms of time. Each of the 20 drives serves the same pool of users. By using a single common hash table, I perform Windows Domain look ups only once per User ID -- all on the first drive. All 19 of the other drives require no Domain look ups because the information is available in the common hash table. The job should therefore run quite a bit faster.

I'll report back if I get any good results.

Thank you again,

Jim

Kurt_Bremser · Posted 01-08-2020 04:50 AM

After reading the paper completely (including the part with the resolve() function), I now think that your use of the hash in the data step caused a similar problem as I found in my previous post, but in your case SAS decided to clear the hash object on its own to make space for your data step. Or some other effect caused the object to be cleared out of memory when a data step that uses it ends, as the object now somehow becomes part of the data step's memory segment and is cleared along with that at the end of data step execution.

While the method(s) described in the paper open the option of persistent hash objects, I'd be very cautious in using such methods "in the wild"; the unwary coder who does not know the inner workings could be completely baffled by the suddenly arising side effects.

Maxims of Maximally Efficient SAS Programmers
How to convert datasets to data steps
The macro for direct download as ZIP
How to post code
Please vote for Provide Sequential Search Capability for Hash Objects
How to deal with locked files on UNIX

Persisting Hash Tables Across Data Step Boundaries