Re: Optimization of proc sql to sas hashing

Anshul2 · Posted 02-27-2024 07:18 AM

I am attempting to merge two datasets. The first dataset, named "Maturity_Gap_Master_2," contains 5 million records, while the second dataset, named "want," contains 500 records. The merge operation is based on a condition specified below:

"""
proc sql;
create table output as
select *
from Maturity_Gap_Master_2 a
left join
want b
on a.ACC_NEXT_FIXING_REM_TENOR_IRRBB >= b.Range_Start
and a.ACC_NEXT_FIXING_REM_TENOR_IRRBB <= b.Range_end;
"""
However, the above query takes at least 10 minutes to execute. To optimize the process, I attempted using SAS hashing:
"""

data output;
if _n_=1 then do;
if 0 then set want;
declare hash h (dataset:'want');
h.definekey('Range_start','Range_End');
h.definedata(all:'Y');
h.definedone();
declare hiter hi ('h');
end;

set MATURITY_GAP_MASTER_2;

rc=hi.last();
start_loop = 1;
do while(start_loop);
if ACC_NEXT_FIXING_REM_TENOR_IRRBB >= Range_Start and ACC_NEXT_FIXING_REM_TENOR_IRRBB <= Range_end and rc=0 then start_loop = 0;
else rc=hi.prev();
end;
run;

"""

Despite this optimization attempt, the execution time has not decreased. Could anyone provide assistance in optimizing the query further?

AhmedAl_Attar · Posted 02-27-2024 07:44 AM

@Anshul2

I would recommend expanding the 500 records data set into a slightly larger table, by explicitly specifying the values of the ranges.

i.e. Instead of just having Range_Start & Range_End columns, Add a third column of the actual value, then try to do exact match from your 5 M large table to the new expanded lookup (want) table.

This way, it's a one time range expansion, rather than 5 million times, as you have been doing via your Left Join and Hash Join.

Hope this helps,

Ahmed

AhmedAl_Attar · Posted 02-27-2024 07:57 AM

@Anshul2

Here is an example code

data want;
	length Range_Start Range_End 8;
	Range_Start = '01Jan2023'd; Range_End = '31Jan2023'd; OUTPUT;
	Range_Start = '01Feb2023'd; Range_End = '15Feb2023'd; OUTPUT;
	Range_Start = '01Jun2023'd; Range_End = '30Jul2023'd; OUTPUT;
	FORMAT Range_Start Range_End date9.;
RUN;
data want_exp(KEEP=date);
	Set want;
	do i=Range_Start to Range_End;
		date = i;
		output;
	end;
	FORMAT date date9.;
run;

Use the "want_exp" table in your joins (SQL/Hash) with your 5 Millions record using equi joins (=) rather than Between (>= & <=)

LinusH · Posted 02-27-2024 07:46 AM

For all optimization efforts, you need to know what the bottleneck is.

Attach the following to your code:

options msglevel=i stimer;

For PROC SQL, you can get more information how the internal optimizer work by:

PROC SQL _method _tree;

Data never sleeps

FreelanceReinh · Posted 02-27-2024 08:42 AM

Hello @Anshul2,

If it's not feasible to implement AhmedAl_Attar 's great suggestion of using an equijoin (see also Re: How to improve Efficiency for another example) because the number of potential ACC_NEXT_FIXING_REM_TENOR_IRRBB values in those ranges is too large, you could consider defining one or more formats.

Here's an example with non-overlapping ranges (1 through 10, >10 through 100, >100 through 1000) and only a few (2) "satellite" variables in dataset WANT:

/* Create sample data for demonstration */

data want;
input Range_Start Range_End var1 var2 $;
cards;
1     10 123 first
10   100 456 second
100 1000 789 third
;

/* Create a format assigning concatenated values to the ranges */

data wantfmt;
retain fmtname 'wantfmt';
set want(rename=(range_start=start range_end=end)) end=last;
length label $30;
label=catx('|',var1,var2);
output;
if last then do;
  hlo='O';
  label=' ';
  output;
end;
run;

proc format cntlin=wantfmt;
run;

/* Perform the intended left join by applying the format */

data final(drop=_c);
set Maturity_Gap_Master_2;
length _c $30;
_c=put(ACC_NEXT_FIXING_REM_TENOR_IRRBB,wantfmt.);
var1=input(scan(_c,1,'|'),32.);
length var2 $8;
var2=scan(_c,2,'|');
run;

mkeintz · Posted 02-27-2024 11:54 PM

Questions:

Are the range_start/range_end time spans in dataset WANT mutually exclusive?
What is the granularity of ACC_NEXT_FIXING_REM_TENOR_IRRBB? I.e. are they date values? Time values? Something else?
If the answer to question 1 ("mutually exclusive") is yes, then are there gaps between the ranges?

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

Patrick · Posted 02-28-2024 03:34 AM

@Anshul2 Is this a many:many join meaning that more than one row from your table want could meet the condition for a row from table Maturity_Gap_Master_2?

To give us some indication about the data volume can you please share the output of a proc contents over your result table output that your SQL creates?

proc contents data=work.output;
run;

What I'm most interested in is below highlighted info.

Is the number of observation the same or higher than what you've got in your source table Maturity_Gap_Master_2?

Sajid01 · Posted 02-28-2024 10:18 AM

Hello @Anshul2
The SAS SQL optimizer selects the best option for join.
In your case I am sure it selects the hash join algorithm. You can verify running the Proc SQL with _method_ option.
Your log will show that it is using hash join I am sure it will.
I see you are using "select *". Review if you actually need all columns. You can focus on other approaches to review your query.

Optimization of proc sql to sas hashing