How to further improve performance of this 1 to 1 hash table join

Frequent Contributor
Posts: 127

How to further improve performance of this 1 to 1 hash table join

Hello,

 

I would like to run this code as fast as possible. It currently runs in ~30 secs (for 1/10th of the data I will need in production).

 

I already used these optimization techniques :

 

- use hash tables

- remove all put statements --> the EG session crashes without this

- define the lengths of all variables to their absolute minimum for the 2 tables --> gained ~5s

- reduce lengths defined in the data step --> gained ~5s

- store the macro --> gained ~1s

- tried (compress=yes) --> did worse

- tried sasfile open-close statement --> didn't help

- bufsize=16k (default was 4k) --> gained ~1s

- BUFNO=max --> didn't help

 

--> total gain: ~25% faster

 

But maybe there is something else to do, perhaps related to my programming (maybe the macro, maybe the way the loops are built, or anything I haven't thought of).

 

Is there anything else I could try to improve speed?

 

(The code runs on a remote BI server from EG 7.1, on Windows, with SAS 9.3.)

 

 


/* ----------------------------------------------------------------------
   DURATION SEVERITY STATE-PROPORTION
   --------------------------------------------------------------------*/

%let propertyList = Duration Severity Proportion;

sasfile work.STATED.data open;

%macro dummy_jhkjlj() / store;
data DALY1(drop=value category1 category2);

	length
		Germ $10
		AgeGroupDALY $10 
		AgeGroupSPMA $10 
		Gender $1
		Category1 $50 
		Category2 $50 
		value 8 
		;

	*** make link to hash table ;
	if _n_=1 then do;

		***modelvalues ----------------;
		declare hash h1(dataset:'modelData_completed');
		h1.definekey ('Germ', 'Category1', 'Category2', 'AgeGroupDALY', 'Gender') ;
		h1.definedata('Value');
		h1.definedone();
		call missing(Germ, Value, Category1, Category2);
		* e.g. rc=h1.find(KEY:Germ, KEY:"State", KEY:"property", KEY:AgeGroupDALY, KEY:Gender);

	end;

	set STATED;
	*put "°°°°° _n_=" _n_;


	%do i=1 %to %sysfunc(countw(&propertyList)); *e.g. Severity Duration Probability;
		%let property = %scan(&propertyList, &i);
		*put "-- &i &property";

		*lookup;
		call missing(Value);
		rc=h1.find(KEY:Germ, KEY:State, KEY:"&property", KEY:AgeGroupDALY, KEY:Gender);

		
		*** DURATION ------------------;
		if index(LOWCASE("&property"),'duration') ge 1 then do;
			* IF lookup fail ;
			if rc ^= 0 then value = .;

			* IF FATAL then take GBD2010;
			if find(state,'fatal','i') ge 1 then do;
				* lookup with format, could also work with join;
				value = input(put(cats(Gender,'_',AgeGroupDALY),Gender_AG2GBD2010_.),10.5);
			end;

			* IF DURATION=-1 then take the SPMA Life Expectancy ;
			else if value=-1 then do;
				value = input(put(cats(Gender,'_',AgeGroupSpma),Gender_AG2SPMA2013_.),10.5);
			end;

		end;


		*** SEVERITY ------------------;
		else if find("&property",'severity','i') ge 1 then do;
			*IF lookup fail ;
			if rc ^= 0 then value = .;

			* IF FATAL;
			if index(LOWCASE(state),'fatal') ge 1 then do;
				value=1;
			end;
		end;


		*** PROPORTION ------------------;
		else if find("&property",'proportion','i') ge 1 then do;
			*value=555;
			if rc ^= 0 then value = 1;
			*IF no value;
			else if value = . then value = 1;
		end;

		&property = value;

	%end;


	germState = cats(germ,'-',state);
	drop rc;
	
	where (not (Gender='M' AND substr(State,1,2) eq 'w_'))
		AND (not (Gender='F' AND substr(State,1,2) eq 'm_'));
run;
%mend dummy_jhkjlj;

%dummy_jhkjlj;

sasfile work.STATED close;

 

Here is some info about the datasets:

libname  memname              filesize   filesizeMB  nobs     nvar  compress  pcompress  bufsize  obslen  npage  maxvar
WORK     STATED               360202240  343.52      7782000  5     NO        0          16384    46      21984  12
WORK     MODELDATA_COMPLETED  2342912    2.23        26192    12    NO        0          8192     88      285    26
Super User
Posts: 7,762

Re: How to further improve performance of this 1 to 1 hash table join

Run the whole thing with options fullstimer to see if you're CPU- or I/O-bound.
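As a minimal sketch, turning FULLSTIMER on is a one-liner; after the step runs, the log note tells you where the time goes:

```
options fullstimer;

/* After the step runs, compare in the log:
   real time ~= user cpu + system cpu  --> CPU-bound (tune the code)
   real time >> user cpu + system cpu  --> I/O-bound (tune buffers, disks) */
```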

---------------------------------------------------------------------------------------------
Maxims of Maximally Efficient SAS Programmers
Frequent Contributor
Posts: 127

Re: How to further improve performance of this 1 to 1 hash table join

Posted in reply to KurtBremser
NOTE: There were 26192 observations read from the data set WORK.MODELDATA_COMPLETED.
NOTE: There were 7249000 observations read from the data set WORK.STATED.
      WHERE ((Gender not = 'M') or (SUBSTR(State, 1, 2) not = 'w_')) and ((Gender not = 'F') or 
      (SUBSTR(State, 1, 2) not = 'm_'));
NOTE: The data set WORK.DALY1 has 7249000 observations and 9 variables.
NOTE: DATA statement used (Total process time):
      real time           28.29 seconds
      user cpu time       22.16 seconds
      system cpu time     6.10 seconds
      memory              4911.75k
      OS Memory           497540.00k
      Timestamp           12/13/2016 01:30:09 PM
Super User
Posts: 7,762

Re: How to further improve performance of this 1 to 1 hash table join

You're completely CPU-bound, so I/O related settings are no longer an issue.

 

I see that you are comparing macro variables with literals in your if conditions, also using data step functions there.

This

if index(LOWCASE("&property"),'duration') ge 1 then do;

causes unnecessary CPU use during data step execution, and is totally independent of any values in the data. Therefore replace it with a macro condition, so the code is conditionally created (as opposed to conditionally executed) depending on where in the %do loop you are.

The same goes for

else if find("&property",'severity','i') ge 1 then do;

and

else if find("&property",'proportion','i') ge 1 then do;

One of the tools of performance tuning is the avoidance of unnecessarily repeating condition evaluations, as the evaluation of a condition is among the most CPU-costly operations on the microcode level.
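A sketch of what that rewrite could look like, built on the poster's own &propertyList loop (untested; the %if branches are evaluated once at macro execution time, so no per-row "&property" test remains in the generated data step):

```
%do i=1 %to %sysfunc(countw(&propertyList));
	%let property = %scan(&propertyList, &i);

	call missing(Value);
	rc=h1.find(KEY:Germ, KEY:State, KEY:"&property", KEY:AgeGroupDALY, KEY:Gender);

	%if %lowcase(&property) = duration %then %do;
		/* duration-specific statements only */
	%end;
	%else %if %lowcase(&property) = severity %then %do;
		/* severity-specific statements only */
	%end;
	%else %if %lowcase(&property) = proportion %then %do;
		/* proportion-specific statements only */
	%end;

	&property = value;
%end;
```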

---------------------------------------------------------------------------------------------
Maxims of Maximally Efficient SAS Programmers
Frequent Contributor
Posts: 127

Re: How to further improve performance of this 1 to 1 hash table join

Posted in reply to KurtBremser

Removed the %do loop and the if statements which were probably not necessary.

 

Thanks! I gained 4s more with this.

(I need to verify it does exactly the same but I believe it does.)

 


%macro dummy_jhkjlj() / store;
data DALY1(drop=value category1 category2);

	length
		Germ $10
		AgeGroupDALY $10
		AgeGroupSPMA $10
		Gender $1
		Category1 $30
		Category2 $30
		value 8 
		;

	*** make link to hash table ;
	if _n_=1 then do;

		***modelvalues ----------------;
		declare hash h1(dataset:'modelData_completed');
		h1.definekey ('Germ', 'Category1', 'Category2', 'AgeGroupDALY', 'Gender') ;
		h1.definedata('Value');
		h1.definedone();
		call missing(Germ, Value, Category1, Category2);
		* e.g. rc=h1.find(KEY:Germ, KEY:"State", KEY:"property", KEY:AgeGroupDALY, KEY:Gender);

	end;

	set STATED;

		
	*** DURATION ------------------;
		call missing(Value);
		rc=h1.find(KEY:Germ, KEY:State, KEY:"Duration", KEY:AgeGroupDALY, KEY:Gender);

			* IF lookup fail (mostly infection_1 state);
			if rc ^= 0 then value = .;

			* IF FATAL then take GBD2010;
			if find(state,'fatal','i') ge 1 then do;
				* lookup with format, could also work with join;
				value = input(put(cats(Gender,'_',AgeGroupDALY),Gender_AG2GBD2010_.),10.5);
			end;

			* IF DURATION=-1 then take the SPMA Life Expectancy ;
			else if value=-1 then do;
				value = input(put(cats(Gender,'_',AgeGroupSpma),Gender_AG2SPMA2013_.),10.5);
			end;

		Duration = value;



	*** SEVERITY ------------------;
		call missing(Value);
		rc=h1.find(KEY:Germ, KEY:State, KEY:"Severity", KEY:AgeGroupDALY, KEY:Gender);

			* IF lookup fail (mostly when duration=0, state=infection_1 or asymptomatic_1);
			if rc ^= 0 then value = .;

			* IF FATAL;
			if index(LOWCASE(state),'fatal') ge 1 then do;
				value=1;
			end;

		Severity = value;


	*** PROPORTION ------------------;
		call missing(Value);
		rc=h1.find(KEY:Germ, KEY:State, KEY:"Proportion", KEY:AgeGroupDALY, KEY:Gender);
			
			* IF lookup fail;
			if rc ^= 0 then value = 1;
			* IF no value;
			else if value = . then value = 1;

		Proportion = value;


		* IF RANDPERT;
		* IF RANDUNIFORM;

	germState = cats(germ,'-',state);

	drop rc;
	
	where (not (Gender='M' AND substr(State,1,2) eq 'w_'))
		AND (not (Gender='F' AND substr(State,1,2) eq 'm_'));
run;
%mend dummy_jhkjlj;

 

Frequent Contributor
Posts: 127

Re: How to further improve performance of this 1 to 1 hash table join

Posted in reply to KurtBremser
NOTE: There were 26192 observations read from the data set WORK.MODELDATA_COMPLETED.
NOTE: There were 7249000 observations read from the data set WORK.STATED.
      WHERE ((Gender not = 'M') or (not (State=:'w_'))) and ((Gender not = 'F') or (not 
      (State=:'m_')));
NOTE: The data set WORK.DALY1 has 7249000 observations and 9 variables.
NOTE: DATA statement used (Total process time):
      real time           24.24 seconds
      user cpu time       17.89 seconds
      system cpu time     6.33 seconds
      memory              4730.79k
      OS Memory           497504.00k
      Timestamp           12/13/2016 02:55:16 PM
Super User
Posts: 7,762

Re: How to further improve performance of this 1 to 1 hash table join

I guess that there's not much more you can do.

 

I also think that running it against the whole data should end up with considerably less than 10 minutes runtime, and that is tolerable IMO.

 

 

---------------------------------------------------------------------------------------------
Maxims of Maximally Efficient SAS Programmers
Super User
Posts: 5,498

Re: How to further improve performance of this 1 to 1 hash table join

It's a small piece, but there are two SUBSTR functions you can get rid of.  In the code:

 

substr(state, 1, 2) = 'w_'

 

A faster version:

 

state =: 'w_'

 

Same applies to the 'm_' comparison as well.

Valued Guide
Posts: 505

Re: How to further improve performance of this 1 to 1 hash table join

Posted in reply to Astounding

A HASH cannot be multi-threaded or multi-tasked. I am having a little difficulty understanding the details of your solution. But it looks like a 70,000,000-row, nine-variable dataset is small and you are compute-bound. Compute-bound solutions do very well on an inexpensive workstation like my Dell T7400 ($600) with 64 GB RAM. I would look into pegging 8 cores at 100%; the elapsed time should be 1/8th of a single hash solution.

 

Keep the 8 pieces in views; having 8 pieces is usually better than one dataset.

 

I find a SQL hash with a huge bufsize (loads the hash table in memory) is sometimes faster than a hand-coded HASH.

 

A 70-million-row dataset with 9 8-byte variables is only 5 GB, so theoretically I could run 10 tasks simultaneously on my $600 workstation.

 

I tend to use a server when a problem needs a single permanent or temporary table > 1 TB (big data) or, in your case, more than 16 cores (big computation).

Frequent Contributor
Posts: 127

Re: How to further improve performance of this 1 to 1 hash table join

Posted in reply to rogerjdeangelis
The BI server I'm working on is not very powerful; that's simply where SAS is installed on our systems. We don't use local SAS, even though my laptop is probably more powerful than this server. For example, it uses only 1 of the 4 cores when this is running (probably a licence limitation). That limits the possibilities.
Super User
Posts: 7,762

Re: How to further improve performance of this 1 to 1 hash table join

Not all step types in SAS lend themselves to multithreading.

Note that http://support.sas.com/documentation/cdl/en/lrcon/69852/HTML/default/viewer.htm#n0czb9vxe72693n1lom0... makes no mention of the data step at all; only some explicitly named procedures and I/O engines are multithreaded.

 

So you might consider to split your large dataset so you can run several data steps in parallel.
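One hedged sketch of how such a split could be run in parallel with MP CONNECT — this assumes a SAS/CONNECT licence (which the poster's server may not have), that STATED has already been split into STATED1 and STATED2, and the task and dataset names are made up:

```
/* assumes STATED has been split into STATED1 and STATED2 beforehand */
options autosignon=yes sascmd='!sascmd';

rsubmit task1 wait=no inheritlib=(work=pwork);
	data pwork.daly1_a; set pwork.stated1; /* ...hash lookup code... */ run;
endrsubmit;

rsubmit task2 wait=no inheritlib=(work=pwork);
	data pwork.daly1_b; set pwork.stated2; /* ...hash lookup code... */ run;
endrsubmit;

waitfor _all_ task1 task2;
signoff _all_;
```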

---------------------------------------------------------------------------------------------
Maxims of Maximally Efficient SAS Programmers
Trusted Advisor
Posts: 1,018

Re: How to further improve performance of this 1 to 1 hash table join

Many of your sections have:

 

  1. Unnecessary reset of value to missing when the find has failed.  It's unnecessary because you set value to missing just before the find method, and the find method will only change the contents of VALUE when it is successful.

  2. You have a lot of single statements embedded in IF X ... THEN DO groups.  I would simplify and just make those statements into the THEN clause of an IF statement.

For example, you could change this:

 

*** DURATION ------------------;
		call missing(Value);
		rc=h1.find(KEY:Germ, KEY:State, KEY:"Duration", KEY:AgeGroupDALY, KEY:Gender);

			* IF lookup fail (mostly infection_1 state);
			if rc ^= 0 then value = .;

			* IF FATAL then take GBD2010;
			if find(state,'fatal','i') ge 1 then do;
				* lookup with format, could also work with join;
				value = input(put(cats(Gender,'_',AgeGroupDALY),Gender_AG2GBD2010_.),10.5);
			end;

			* IF DURATION=-1 then take the SPMA Life Expectancy ;
			else if value=-1 then do;
				value = input(put(cats(Gender,'_',AgeGroupSpma),Gender_AG2SPMA2013_.),10.5);
			end;

		Duration = value;

 

to this:

 


*** DURATION ------------------;
		call missing(Value);
		rc=h1.find(KEY:Germ, KEY:State, KEY:"Duration", KEY:AgeGroupDALY, KEY:Gender);

			* IF FATAL then take GBD2010;
			if find(state,'fatal','i') ge 1 then value = input(put(cats(Gender,'_',AgeGroupDALY),Gender_AG2GBD2010_.),10.5);
			else if value=-1 then value = input(put(cats(Gender,'_',AgeGroupSpma),Gender_AG2SPMA2013_.),10.5);

		Duration = value;

 

 

 

 I don't know if it will save much, but it can't hurt, and the code is more compact.

 

MK

Super User
Posts: 10,020

Re: How to further improve performance of this 1 to 1 hash table join

	declare hash h1(dataset:'modelData_completed');

---->

	declare hash h1(dataset:'modelData_completed', hashexp:20);
Frequent Contributor
Posts: 127

Re: How to further improve performance of this 1 to 1 hash table join

Thanks to all for your replies, they help a lot and every small detail is interesting!

 

I have another question : 

This is only 1 step of a huge EGP (enterprise guide project) composed of ~60 steps (automatic EG tasks + manually programmed steps)

 

the whole program creates 46 datasets in the WORK and maybe 4 of them will need to be multiplied 10k times (I do my tests here with 1000x)

 

Is there any benefit in dropping temporary datasets here?

It seems I'm CPU-bound for this particular step, but I notice that when I multiply the records by 10k for a few datasets, the rest of the process is much, much slower. Is dropping big tables a solution, or is there something else to do?

 

 

Super User
Posts: 7,762

Re: How to further improve performance of this 1 to 1 hash table join

Dropping tables is necessary when you run out of space or filesystems run into > 90% usage, as some filesystem architectures will slow down when the allocation tables and filespace are heavily fragmented.

Once you detect I/O problems (real time significantly higher than user+system CPU), consider speeding up your WORK (SSD!), physically separating UTILLOC from WORK, and setting up physically separate libraries so that at a given moment one group of disks mainly reads, and another writes.

SSDs alleviate most of these problems, as they don't have latencies caused by R/W head placement and waiting for a disk rotation to finish.
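For reference, WORK, UTILLOC, and memory are invocation-time settings, set in the SAS config file or on the command line rather than in a program; the paths below are made-up examples only:

```
/* in sasv9.cfg or on the SAS command line -- example paths only */
-WORK    "D:\fast_ssd\saswork"
-UTILLOC "E:\other_disk\sasutil"
-MEMSIZE 4G
```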

 

Adding RAM so that the system can do more caching for you will also always be helpful, but the system in question should be able to efficiently use it. Unless you are already there, consider migrating to a UNIX system for your server. Windows is crap, IMHO.

 

Our DWH server (supporting ~ 150 EG users and several thousand per web interface) is a venerable IBM p520 with 2(!) physical POWER5 cores and 32G of RAM. AIX lives on a different planet than Windows, design-wise.

 

---------------------------------------------------------------------------------------------
Maxims of Maximally Efficient SAS Programmers