Re: Two copies of the Same Hash Table Simultaneously in Memory -- How ...

jimbarbour · Posted 08-31-2018 06:00 PM

During a run of several steps, I wind up with two copies of the exact same hash table simultaneously in memory. I'd like to just load once and have only one copy of the hash table in memory.

Here's what I'm doing:

1. I compile a subroutine using Proc FCMP. In that subroutine, I declare the hash table and code a find(). I save the compiled subroutine to User.Functions.Hash.

2. Inside a macro, I %SYSCALL the subroutine. I've set my cmplib = User.Functions. I find() three of the records. The first find() takes a minute or so; the hash table is being loaded into memory. The second and third find take less than a second each; the hash table is already loaded. So far, so good.

3. I now execute a DATA step. Again, I've set my cmplib = User.Functions. I find() three of the records -- and the first find() takes a minute or so -- which indicates that the hash table is being loaded (again) into memory. Sadness.

4. Inside a second macro with a different name, I %SYSCALL the subroutine again. This time the first find() takes milliseconds, indicating that the hash table is still in memory.

So, while step 3 is executing, I already have a hash table with millions of rows in it loaded to memory, but SAS loads a second version of the hash table, identical to the first in every respect (they were both created with the same compiled subroutine from the same User.Functions library). I can see that the first hash table remains loaded in memory because subsequent macro calls to the compiled function execute nearly instantaneously, showing that no re-load is going on.

SAS appears to have some way to tell, when called by the second macro, that there's already a hash table in memory. However, in the DATA step, SAS can't seem to tell.

QUESTION: How can I tell SAS "use the hash table already loaded into memory; don't load another copy of the data?"

See code, below. The second macro is essentially the same as the first but with a different name. I'm running on an AIX box with SAS 9.4 M4.

Jim

**	Compile a User Defined Subroutine.	**;
	Proc fcmp outlib=User.Functions.Hash; 
		subroutine GetStats(name $, height,weight,age) ;
			outargs height, weight, age;
			declare hash User.class(dataset:'User.big');
			rc=User.class.definekey('name');
			rc=User.class.definedata('name','height','weight', 'age');
			rc=User.class.definedone();
			rc=User.class.find();
		endsub;
	run;

**	SYSCALL the Compiled Subroutine.	**;
	options cmplib=(User.functions);

	%macro callsub1(nam=Alfred, n=1);
		%local name height weight age;

		%let height=.;
		%let weight=.;
		%let age=.;
		%let strt=%sysfunc(datetime());
		%LET	Current_Time	=	%SYSFUNC( TIME(), TIME8.);
		%PUT	NOTE-  ;
		%PUT	NOTE:  Starting Macro at &Current_Time;

		%do i = 1 %to &n;
			%let rand = %sysevalf(%sysfunc(ranuni(987654))*100000,ceil);
			%let name = &nam&rand;
			%syscall getstats(name,height,weight,age);
			/* process utilizing returned values goes here */
			%LET	Current_Time	=	%SYSFUNC( TIME(), TIME8.);
			%put NOTE- &=name &=age	&=Current_Time;
		%end;

		%let stop=%sysfunc(datetime());
		%let time= %sysevalf(&stop-&strt);

		%put NOTE- &=n &=height &=weight &=age &=time;

		%LET	Current_Time	=	%SYSFUNC( TIME(), TIME8.);
		%PUT	NOTE:  Stopping Macro at &Current_Time;
		%PUT	NOTE-  ;
	%mend callsub1;

	%callsub1(nam=Barbara, n=3)

**	Run a Look Up DATA step that calls the compiled subroutine.	**;
&Lookup_Data1	%MACRO	Lookup_Data1_Mac();
	options cmplib=(User.functions);

	data	User.lookup1;
		retain	height	weight	age	.;
		DROP	_:;

		_Time	=	PUT(TIME(), TIME8.);
		PUTLOG	"NOTE:  Starting Lookup1.     "	_Time=;

		name='Alfred23456';
		call	getstats(name,height,weight,age);
		_Time	=	PUT(TIME(), TIME8.);
		PUTLOG	"NOTE-  Lookup1 #1 completed. "	_Time=;
		output;

		name='Alice98765';
		call getstats(name,height,weight,age);
		_Time	=	PUT(TIME(), TIME8.);
		PUTLOG	"NOTE-  Lookup1 #2 completed. "	_Time=;
		output;

		name='Barbara98765';
		call getstats(name,height,weight,age);
		_Time	=	PUT(TIME(), TIME8.);
		PUTLOG	"NOTE-  Lookup1 #3 completed. "	_Time=;
		output;
	run;

ballardw · Posted 08-31-2018 06:27 PM

This is more of a comment than any suggestion for a solution. I suspect that you are running into the behavior of the Operating System or possibly the SAS memory caching system. The cache may be holding "recently used data" in memory in case it is needed again. But if enough memory is needed for other data then that cache space is used. So sometimes the data stays in memory with the results of quicker access and sometimes it needs to reload.

I wonder how you know you "already have a hash table with millions of rows in it loaded to memory, but SAS loads a second version of the hash table"? This sounds more like a guess than knowledge based on a symptom: changing run times for similar code.

Since you say that your hash table has "millions of rows" that sounds very likely that the memory is used for something else in between uses.

Have you consider indexing the data on that field if you are searching for it frequently?

jimbarbour · Posted 08-31-2018 06:43 PM

How do I know if a hash table is or is not in memory?

If I do a find() and the hash table is not loaded into memory, there's a noticeable delay.

If I do a find() and the hash table is loaded into memory, there response is millisecond.

The pattern of delay vs. millisecond response time is completely consistent, 100% of the time.

For example, the very first time the compiled subroutine is called, the hash table cannot have yet been loaded. There is absolute certainty at this juncture. The noticeable delay occurs. Subsequent calls, no matter how many I do, have millisecond response time so long as they are either in a macro or are any call other than the first call in a DATA step.

Is there another explanation for what I'm observing?

By the way, these are test programs that do nothing but call the hash table subroutine. They are intended as a proof of concept and are doing no other work at this time.

Jim

Quentin · Posted 08-31-2018 09:52 PM

I was going to point you toward Art Carpenter's SGF 2018 Paper "Using Memory Resident Hash Tables to Manage Your Lookup Control Files" https://www.sas.com/content/dam/SAS/support/en/sas-global-forum-proceedings/2018/2399-2018.pdf but in reviewing that paper, I see you have already read it, as your code is taken from the paper.

In that paper, he attempts to make a memory-resident hash table that can be used across DATA steps, but unfortunately was not able to do so. He notes that when a hash table is loaded via the macro language (%SYSCALL ) it is stored in memory for the life of the SAS session. So it can be re-used by later macro calls. But across DATA steps, a hash table cannot be re-used, it must be re-built for each DATA step. He writes:

Unlike in the DATA step where the hash table is only maintained within memory within a DATA step, once the routine is called using %SYSCALL the hash table remains in memory throughout the remainder of the SAS session.

So hash tables can persist across macro calls and be re-used without being re-built. But they cannot persist across DATA steps. Similarly, I would expect there is no way to tell a DATA step to use a hash table that was previously built by a macro. It's a shame.

BASUG is hosting free webinars Next up: Don Henderson presenting on using hash functions (not hash tables!) to segment data on June 12. Register now at the Boston Area SAS Users Group event page: https://www.basug.org/events.

jimbarbour · Posted 08-31-2018 11:28 PM

@Quentin, yes, I read Art Carpenter's paper and was intrigued. I've had the same thought on several projects: that it would be really handy to be able to persist a hash table across data steps.

I tried one thing that he didn't mention: Spawning a synchronous sub thread from within the DATA step that instantiates the hash table. I reasoned that there must be some "destructor" method that executes when the DATA step ends. By spawning the sub-thread synchronously, I prevent the instanciating DATA step from ending. I can demonstrate that the original hash table stays resident in memory the entire time the sub-thread is executing, but I could not figure out a way to have the sub-thread reference the already existing hash table. Instead, there were always two identical hash tables simultaneously in memory. I was hoping against hope that someone here might have an angle I hadn't thought of -- although if Art Carpenter isn't aware of a means of referencing an existing hash table, then there probably isn't one.

I'm aware of at least four ways to have a hash table persist in memory while other DATA steps are run, but I have not been able to find a way to reference the original hash table from those other DATA steps.

A DOSUBL executed from within the instanciating DATA step.
A synchronous sub-thread launched from within the instanciating DATA step.
A macro call to a compiled function that instanciates a hash table prior to the running of a DATA step
A separate program that instanciates a hash table and then sleeps (to prevent the instanciating DATA step from ending) while another program is run.

The one thing I haven't tried is a CASLIB.

There is a way to do it. In method #3 in my list, above, subsequent macro calls to the compiled subroutine "know" that a hash table already exists. By what means is this accomplished? That I do not know but would like to.

Jim

KachiM · Posted 09-01-2018 02:06 AM

Jim,

It is surprising to find duplicate records without hash's ADD/REPLACE Methods. Can you show few records of USER.BIG to replicate your results?

DATASP

jimbarbour · Posted 09-01-2018 10:13 AM

@KachiM, I'm not aware of any duplicate records. What is duplicated is the entire hash table. At certain points in my run, there are two copies of the entire hash table resident in memory.

Jim

KachiM · Posted 09-01-2018 10:42 AM

Jim,

Not duplicate records. I meant duplicate items in the hash table.

Thanks

jimbarbour · Posted 09-01-2018 11:16 AM

There are no duplicate items in any one hash table. There are two separate hash tables. The two separate hash tables are identical. Each of the two identical hash tables has the exact same contents as the other -- although I could vary that by deleting and re-defining the dataset USER.BIG between instanciations.

It might go like this:

1. Compile Subroutine

2. Create first version of USER.BIG with, say, 5 million records

3. Call the compiled subroutine via a Macro. We would now have hash table #1 with 5 million rows.

4. Delete USER.BIG

5. Create a new version of USER.BIG with, say 2 million records.

6. Run a DATA step that calls the compiled subroutine. We would now have hash table #2 with 2 million rows. Hash table #1 with 5 million rows would still be in memory. In other words, we would have two hash tables simultaneously resident in memory, both created by the compiled subroutine. The first hash table would have 5 million rows and would be reflective of the first version of USER.BIG. The second hash table would have 2 million rows and would be reflective of the second version of USER.BIG.

7. When the DATA step ends, hash table #2 is deleted.

8. Call the compiled subroutine via a Macro. The response time will be millisecond, indicating that the FIND() searched a hash table already resident in memory. The row count of this hash table will be 5 million, indicating that it is the original (i.e. first) hash table loaded as a result of step 3, above.

Jim

KachiM · Posted 09-01-2018 11:34 AM

Jim,

Use of Macro must be responsible for producing two hash tables. I have no idea how that happens and how to avoid. If you use the Data Step there is no such issue as long as you do not cross the Data Step boundary. It takes time first to load the items from the Data Set into the hash table. Then FIND method doesn't take much time.

When you want to use the FCMP function again, hash table definition is loaded from the CMPLIB library but the hash table has to be rebuilt again from a given Data Set.

If you are not keen to use Macro Function, try Data Step.

I have attempted this:

proc fcmp outlib = work.cmpds.lib;
   subroutine GetVars(Name $, height, weight, age);
   outargs height, weight, age;
   file log;      
      declare hash h(dataset:'class');
      rc = h.definekey('Name');
      rc = h.definedata('height','weight','age');
      rc = h.definedone();
      
      rc = h.find();
      if rc ^= 0 then put 'ERROR: Not Found ' Name;
   endsub;
quit;

data class;
   set sashelp.class;
run;

options cmplib = work.cmpds;

data _null_;
   length Nam $8;

   do Nam = 'Judy','Mary','William','Alice','Paul';
      call missing(Height, weight, Age);
      call GetVars(Nam,height,weight,Age);
      put Nam = height = weight = age =;
   end;
run;

HTH

DATASP

jimbarbour · Posted 09-01-2018 12:16 PM

Well, there's the irony of the situation. I was looking for a way to have multiple data steps access a single hash table.

Instead, I've found a way to have multiple hash tables resident in memory during a single data step.

Jim

P.S. As I think about it, one could have any number of hash tables simultaneously resident in memory simply by by spawning synchronous sub-threads. Thread1 spawns synchronous Thread 2 which spawns synchronous Thread 3, and so on, each thread referencing the same hash table -- but because of the limitations of SAS, another copy of the hash table has to be loaded into memory for each thread. One would have as many hash tables simultaneously as one had threads.

I really do with there were a way to tell SAS to point to a particular hash table in memory.

KachiM · Posted 09-01-2018 12:28 PM

Jim,

What is the utility of having several copies of the same hash table? Is it not a waste of memory? FCMP Hash is designed to work with a function and the hash table is wiped off when the Data Step ends. FCMP Hash, as far as I know, cannot hold the hash table to serve several Data Steps. If you find a way then it will be good for programmers.

regards,

DATASP

jimbarbour · Posted 09-01-2018 01:07 PM

What is the utility of having several copies of the same hash table?

None whatever, and that's exactly my point.

FCMP Hash, as far as I know, cannot hold the hash table to serve several Data Steps. If you find a way then it will be good for programmers. [emphasis added]

That was my goal as was Art Carpenter's. Neither of us found a way to do it. I suspect SAS Institute has to make some changes in order for a hash table to persist across data step boundaries and be addressable. I can get the hash table to persist; I just can't point to it.

Jim

KachiM · Posted 09-01-2018 01:50 PM

Jim,

Hope SI fixes it soon.

In the meanwhile:

If you desire to use the hash table (loaded once) and want it to be used as Table Lookup with several Data Sets I can suggest a way to do it ( in one Data Step). If you want to do Lookup as well as adding newer items to the hash table, that can also be done within one Data Step.

Thanks,

DATASP

jimbarbour · Posted 09-01-2018 06:32 PM

We have a series of jobs that pull off designated records by key. The idea would be to load the hash table and then have the different jobs all use the same hash table for the key lookup. This does not not appear to be possible, at least for now. I'm not sure how much more development will occur on the SAS language now that Viya has been introduced.

Jim

Two copies of the Same Hash Table Simultaneously in Memory -- How to Avoid?

Re: Two copies of the Same Hash Table Simultaneously in Memory -- How to Avoid?

Re: Two copies of the Same Hash Table Simultaneously in Memory -- How to Avoid?

Re: Two copies of the Same Hash Table Simultaneously in Memory -- How to Avoid?

Re: Two copies of the Same Hash Table Simultaneously in Memory -- How to Avoid?

Re: Two copies of the Same Hash Table Simultaneously in Memory -- How to Avoid?

Re: Two copies of the Same Hash Table Simultaneously in Memory -- How to Avoid?

Re: Two copies of the Same Hash Table Simultaneously in Memory -- How to Avoid?

Re: Two copies of the Same Hash Table Simultaneously in Memory -- How to Avoid?

Re: Two copies of the Same Hash Table Simultaneously in Memory -- How to Avoid?

Re: Two copies of the Same Hash Table Simultaneously in Memory -- How to Avoid?

Re: Two copies of the Same Hash Table Simultaneously in Memory -- How to Avoid?

Re: Two copies of the Same Hash Table Simultaneously in Memory -- How to Avoid?

Re: Two copies of the Same Hash Table Simultaneously in Memory -- How to Avoid?

Re: Two copies of the Same Hash Table Simultaneously in Memory -- How to Avoid?

Classroom Training Available!