ScottBass
Rhodochrosite | Level 12

Hi,

 

I'm trying to improve the performance of a long running job.  It's using 4 multivalued hash objects to essentially build a Cartesian product with a large fact table.  The job already has to run with a -memsize option due to the hash objects.

 

After the Cartesian product is built, there are calculations that are CPU bound, that I believe would benefit from DS2 threading.

 

My question is, if I'm running DS2 threads, can those threads "share" a common hash object?  Or, do these hash objects have to be loaded for each thread?

 

In pseudocode:

 

 

proc ds2;

thread mythread / overwrite=yes;
   dcl double rc;
   method run();
      set mydata;
      rc = h.reset_dup();
      do while (h.do_over() = 0);
         /* calculations */
         if whatever then output;
      end;
   end;
endthread;
run;

data out;
   dcl thread mythread t;
   dcl package hash h();   /* the hash I'd like the threads to "share" */
   method init();
      /* load hash h */
   end;
   method run();
      set from t;
   end;
enddata;
run;

quit;

 

 

That's just rough, quickly thrown together pseudocode from someone new to DS2.  But hopefully the idea of what I want to do is clear?

 

Thanks,

Scott


Please post your question as a self-contained data step in the form of "have" (source) and "want" (desired results).
I won't contribute to your post if I can't cut-and-paste your syntactically correct code into SAS.
10 REPLIES
jcbell
Obsidian | Level 7

I have the same question. Does anyone have an idea on this one?

SASJedi
SAS Super FREQ
The short answer is no: threads can't share a hash object, because threads don't share resources.

It helps to think of a thread program as a stand-alone DATA program. Each thread has its own PDV, CPU and all the resources required to completely process an individual row of data. If you think of it that way, it becomes clear that a thread's instance of a HASH object is its own, and that instance can't be shared with other threads.
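
To make that concrete, here is a minimal sketch (the table and variable names are placeholders, not anything from your job) of what that means in practice: the hash package is declared and loaded inside the thread program, so each thread builds and holds its own copy of the lookup table.

proc ds2;

thread lookup_thread / overwrite=yes;
   /* each thread instantiates and loads its own copy of the hash */
   dcl package hash h();
   dcl double key_var lookup_val;

   method init();
      h.defineKey('key_var');
      h.defineData('lookup_val');
      h.dataset('work.dim_table');   /* this load is repeated once per thread */
      h.defineDone();
   end;

   method run();
      set work.fact_table;
      if h.find() = 0 then output;
   end;
endthread;

data work.result / overwrite=yes;
   dcl thread lookup_thread t;
   method run();
      set from t threads=4;
   end;
enddata;
run;
quit;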

I hope this helps! All the best,
Mark
Check out my Jedi SAS Tricks for SAS Users
jcbell
Obsidian | Level 7

Ok - Thanks.

 

I think I misspoke.  I believe you can use threads with a hash table; Peter Eberhardt talks about using a threaded DS2 step with a hash table in his book.  The issue I'm running into is the threads locking the lookup table.

 

ERROR: BASE driver, Table RECREATE_OVERRIDES_UPLOAD_HASH is already in use and cannot be locked. If you opened the file with LOCKTABLE=EXCLUSIVE, try the LOCKTABLE=SHARE option

 

I'm wondering if/how I can tell the hash object to open its source table in SHARE mode.  Or, can I somehow reference _threadid_ and dynamically load a different hash table (work.foo__threadid_) for each thread?

 

Perhaps using _NEW_ to create a new HASH instance for each thread?

 

Curious.

 

Thanks,

 

John

jcbell
Obsidian | Level 7

This may help someone...

 

It took a bit of fiddling, but this works pretty slick.

 

This uses DS2 with threads and a hash table.  

 

proc ds2;
thread override / overwrite=yes;
   dcl double count;
   dcl bigint thisThread;
   dcl char(4096) classvartohash;
   dcl char(64) hashlookup;
   dcl package hash overridehash();
   dcl package hiter hi('overridehash');
   dcl char(128) scenario_name;
   dcl char(128) scenario_type;
   dcl char(32) OVERRIDE_VARIABLE;
   dcl double precision OVERRIDE_VALUE_NUM;

   %do hashvars = 1 %to &numhashes;
      dcl char(64) hash_&hashvars;
   %end;

   drop count;

   method init();
      if 0 then set val.instvals (locktable=share);
      overridehash.definekey('hashlookup');
      overridehash.data([SCENARIO_NAME SCENARIO_TYPE SCENARIO_DATE OVERRIDE_VARIABLE OVERRIDE_VALUE_NUM]);
      /* LOCKTABLE=SHARE lets every thread read the same WORK table while loading its own copy of the hash */
      overridehash.dataset('{select * from work.recreate_overrides_upload_hash2 {options LOCKTABLE=SHARE} }');
      overridehash.definedone();
   end;

   method run();
      dcl double rc;
      set val.instvals;
      %do hashvars = 1 %to &numhashes;
         classvartohash = catt(&&classvartohash&hashvars);
         hash_&hashvars = SHA256HEX(classvartohash, 0);
         hashlookup = hash_&hashvars;
         rc = overridehash.find();
         if (rc = 0) then output;
      %end;
      thisThread = _threadid_;
      count+1;
   end;

   method term();
      put '**Thread' _threadid_ 'processed' count 'rows:';
   end;
endthread;
run;
quit;


proc ds2;
data work.override_instid (keep=(instid scenario_name scenario_type hashlookup
                                 %do hashvars = 1 %to &numhashes; hash_&hashvars %end;
                                 SCENARIO_DATE OVERRIDE_VARIABLE OVERRIDE_VALUE_NUM));
   dcl thread override frac;
   method run();
      set from frac threads=&override_threads;
   end;
enddata;
run;
quit;
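
One note if you lift this as a template: the %do/%end loops only resolve inside a %macro definition, so both PROC DS2 steps above are assumed to run inside a macro that has &numhashes, &override_threads and the &&classvartohash&n macro variables set up beforehand.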

ScottBass
Rhodochrosite | Level 12

Hi @jcbell,

 

Thanks for the useful code sample!  I'm sure I'll use it as a template for future work.

 

Questions:

 

1) If your SAS memsize=4GB, your hash consumed 1GB, and you ran with 4+ threads, I assume your approach would abort with an out-of-memory exception?

 

2) If your hash objects take a long time to load, I assume that long time to load would occur for each thread?

 

Regardless, if those constraints don't apply, then your code example is very useful for utilizing hash objects within threads.  

 

(I also ran into the LOCKTABLE issue.  Why a SELECT statement, which is obviously a read-only operation, doesn't open the table in SHARE mode is beyond me.)


ScottBass
Rhodochrosite | Level 12

Hi,

 

Further to SASJedi's reply...

 

After my initial post, I opened a track with SAS TS with the same question.  Of course, their response was the same as SASJedi's, although they did comment that it sounded like a good enhancement request.

 

On a more global level (i.e. not just restricted to DS2), I wish that SAS supported the concept of a global hash object:  a hash object that would persist beyond the current data step.

 

In my previous job, we had a number of very large nightly ETL jobs.  Many of the fact loads had to source the same dimension tables into hash objects in order to add the surrogate keys.  It could take 10-15 mins for each job just to load the same hash objects.  Wouldn't it be great if we could load the hash objects once, process the fact jobs, then dispose of the hash objects?
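
For anyone who hasn't seen it, the pattern was roughly the classic surrogate-key lookup below (a sketch only; the libraries, tables and variable names are made up), repeated near the top of every fact job:

data dw.fact_sales_sk;
   /* pull the lookup variables into the PDV without reading any rows */
   if 0 then set dw.dim_customer (keep=customer_natural_key customer_sk);

   if _n_ = 1 then do;
      /* this load is the 10-15 minute hit, paid again in every fact job */
      declare hash dim (dataset:'dw.dim_customer');
      dim.defineKey('customer_natural_key');
      dim.defineData('customer_sk');
      dim.defineDone();
   end;

   set staging.fact_sales;
   if dim.find() ne 0 then customer_sk = .;
run;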

 

To that end, I also wish that SAS supported an "always on" server, analogous to many RDBMS's.  This could potentially support a hash object global to the entire server.  In that scenario, I would load the global hash object, run the ETL, then close the hash objects.  Garbage collection could be an issue - perhaps load the hash objects with a timeout value, where they delete themselves if not touched within the timeout period?

 

Of course, this is blue sky, and would require huge architectural changes to SAS.  Unless LASR supports this concept?

 

In my particular problem, the one that triggered this post, there was no way I could load one hash for each DS2 thread; it would have exceeded the memory capacity of the entire server, let alone the SAS memsize option.  Had I pursued DS2 threading further, I would have investigated using index key lookup for each thread:

 

http://support.sas.com/documentation/cdl/en/lestmtsref/68024/HTML/default/viewer.htm#p00hxg3x8lwivcn..., example 7.
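
Roughly this pattern, sketched here as a base DATA step with made-up names (the dimension table needs an index on the key variable; whether the same KEY= lookup is usable inside a DS2 thread's SET statement is something I never got as far as testing):

/* dw.dim_customer must be indexed on customer_natural_key */
data work.fact_with_sk;
   set staging.fact_sales;
   set dw.dim_customer key=customer_natural_key / unique;
   if _iorc_ ne 0 then do;
      _error_ = 0;        /* clear the error flag set by the failed keyed read */
      customer_sk = .;    /* no match in the dimension */
   end;
run;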

 

 


ChrisNZ
Tourmaline | Level 20

>  Many of the fact loads had to source the same dimension tables into hash objects in order to add the surrogate keys.  It could take 10-15 mins for each job just to load the same hash objects.  

 

In the short term (i.e. until your proposal is implemented; mmmm, make that medium term then...), loading your tables onto a RAM disk would help ensure the data can be accessed quickly by all jobs.
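
Something along these lines, assuming a RAM-backed file system is available (the path and dimension names below are only examples, e.g. tmpfs on Linux):

/* point a library at a RAM-backed location and refresh the dimension copies once per night */
libname ramdim "/dev/shm/dimtables";

proc copy in=dw out=ramdim;
   select dim_customer dim_product;
run;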

Smijoss
Fluorite | Level 6
%let Conn_String = %nrbquote(%str('DRIVER=BASE;CATALOG=BASE;schema=(name=work;primarypath="%sysfunc(pathname(work))");locktable=share;'));

%put Conn_String=&Conn_String.;
proc ds2 ds2accel=yes conn=%unquote(&Conn_String) nolibs;
	
	thread work.thread_smijo / overwrite=yes;
		declare package hash smijo(8,'WORK.QUERY_FOR_SNACKS');
		declare double SUM_of_QtySold Sales; 
		declare double  cntr;

		method init();
			dcl smallint _rc1;
			_rc1 = smijo.defineKey('Product');
			_rc1 = smijo.defineKey('count5');
			_rc1 = smijo.defineData('SUM_of_QtySold');
			_rc1 = smijo.defineDone();
		end;

		method run();
			dcl smallint _rc1;
			set temp123;

			Sales = (QtySold * Price) ** 3;
			Sales = if sales < 5 then  5 else  Sales ;

			do cntr = 1 to 20000 ; 
				sales = sales + cntr * cntr;
			end;
			
			_rc1 = smijo.find();
		end;

	endthread;

	data smijo / overwrite = yes;
		declare thread work.thread_smijo tsmijo;

		method run();
			set from tsmijo threads=4;
		end;
	enddata;
	run;
quit;
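
(The locktable=share in the connection string is presumably what avoids the "already in use and cannot be locked" error mentioned earlier in this thread, since every thread loads its hash from the same WORK table.)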
ScottBass
Rhodochrosite | Level 12

@Smijoss wrote:
%let Conn_String = %nrbquote(%str('DRIVER=BASE;CATALOG=BASE;schema=(name=work;primarypath="%sysfunc(pathname(work))");locktable=share;'));

%put Conn_String=&Conn_String.;
proc ds2 ds2accel=yes conn=%unquote(&Conn_String) nolibs;
	
	thread work.thread_smijo / overwrite=yes;
		declare package hash smijo(8,'WORK.QUERY_FOR_SNACKS');
		declare double SUM_of_QtySold Sales; 
		declare double  cntr;

		method init();
			dcl smallint _rc1;
			_rc1 = smijo.defineKey('Product');
			_rc1 = smijo.defineKey('count5');
			_rc1 = smijo.defineData('SUM_of_QtySold');
			_rc1 = smijo.defineDone();
		end;

		method run();
			dcl smallint _rc1;
			set temp123;

			Sales = (QtySold * Price) ** 3;
			Sales = if sales < 5 then  5 else  Sales ;

			do cntr = 1 to 20000 ; 
				sales = sales + cntr * cntr;
			end;
			
			_rc1 = smijo.find();
		end;

	endthread;

	data smijo / overwrite = yes;
		declare thread work.thread_smijo tsmijo;

		method run();
			set from tsmijo threads=4;
		end;
	enddata;
	run;
quit;

 

Hi,

 

Thanks for this.  It's been a long time since I posted this question, and I'm now working with another client.  I haven't had time to test your code.

 

However, at first glance, wouldn't this create and load one hash per thread, instead of a "global" hash object shared among all the threads?  If so, this approach would have exceeded the memory constraints of the server at my previous client, as well as redundantly loading the same data in, say, 8 threads.


Smijoss
Fluorite | Level 6

Yes, you are right.  It is a waste of resources; I would only take this approach if my hash object were relatively small.

I hope SAS comes out with a global hash object soon 🙂

 

