I am attempting to merge two datasets. The first dataset, named "Maturity_Gap_Master_2," contains 5 million records, while the second dataset, named "want," contains 500 records. The merge operation is based on a condition specified below:
"""
proc sql;
create table output as
select *
from Maturity_Gap_Master_2 a
left join
want b
on a.ACC_NEXT_FIXING_REM_TENOR_IRRBB >= b.Range_Start
and a.ACC_NEXT_FIXING_REM_TENOR_IRRBB <= b.Range_end;
"""
However, the above query takes at least 10 minutes to execute. To optimize the process, I attempted using SAS hashing:
"""
data output;
if _n_=1 then do;
if 0 then set want;
declare hash h (dataset:'want');
h.definekey('Range_start','Range_End');
h.definedata(all:'Y');
h.definedone();
declare hiter hi ('h');
end;
set MATURITY_GAP_MASTER_2;
rc=hi.last();
start_loop = 1;
do while(start_loop);
if ACC_NEXT_FIXING_REM_TENOR_IRRBB >= Range_Start and ACC_NEXT_FIXING_REM_TENOR_IRRBB <= Range_end and rc=0 then start_loop = 0;
else rc=hi.prev();
end;
run;
"""
Despite this optimization attempt, the execution time has not decreased. Could anyone provide assistance in optimizing the query further?
I would recommend expanding the 500 records data set into a slightly larger table, by explicitly specifying the values of the ranges.
i.e. Instead of just having Range_Start & Range_End columns, Add a third column of the actual value, then try to do exact match from your 5 M large table to the new expanded lookup (want) table.
This way, it's a one time range expansion, rather than 5 million times, as you have been doing via your Left Join and Hash Join.
Hope this helps,
Ahmed
Here is an example code
data want;
length Range_Start Range_End 8;
Range_Start = '01Jan2023'd; Range_End = '31Jan2023'd; OUTPUT;
Range_Start = '01Feb2023'd; Range_End = '15Feb2023'd; OUTPUT;
Range_Start = '01Jun2023'd; Range_End = '30Jul2023'd; OUTPUT;
FORMAT Range_Start Range_End date9.;
RUN;
data want_exp(KEEP=date);
Set want;
do i=Range_Start to Range_End;
date = i;
output;
end;
FORMAT date date9.;
run;
Use the "want_exp" table in your joins (SQL/Hash) with your 5 Millions record using equi joins (=) rather than Between (>= & <=)
For all optimization efforts, you need to know what the bottleneck is.
Attach the following to your code:
options msglevel=i stimer;
For PROC SQL, you can get more information how the internal optimizer work by:
PROC SQL _method _tree;
Hello @Anshul2,
If it's not feasible to implement AhmedAl_Attar 's great suggestion of using an equijoin (see also Re: How to improve Efficiency for another example) because the number of potential ACC_NEXT_FIXING_REM_TENOR_IRRBB values in those ranges is too large, you could consider defining one or more formats.
Here's an example with non-overlapping ranges (1 through 10, >10 through 100, >100 through 1000) and only a few (2) "satellite" variables in dataset WANT:
/* Create sample data for demonstration */
data want;
input Range_Start Range_End var1 var2 $;
cards;
1 10 123 first
10 100 456 second
100 1000 789 third
;
/* Create a format assigning concatenated values to the ranges */
data wantfmt;
retain fmtname 'wantfmt';
set want(rename=(range_start=start range_end=end)) end=last;
length label $30;
label=catx('|',var1,var2);
output;
if last then do;
hlo='O';
label=' ';
output;
end;
run;
proc format cntlin=wantfmt;
run;
/* Perform the intended left join by applying the format */
data final(drop=_c);
set Maturity_Gap_Master_2;
length _c $30;
_c=put(ACC_NEXT_FIXING_REM_TENOR_IRRBB,wantfmt.);
var1=input(scan(_c,1,'|'),32.);
length var2 $8;
var2=scan(_c,2,'|');
run;
Questions:
@Anshul2 Is this a many:many join meaning that more than one row from your table want could meet the condition for a row from table Maturity_Gap_Master_2?
To give us some indication about the data volume can you please share the output of a proc contents over your result table output that your SQL creates?
proc contents data=work.output;
run;
What I'm most interested in is below highlighted info.
Is the number of observation the same or higher than what you've got in your source table Maturity_Gap_Master_2?
Hello @Anshul2
The SAS SQL optimizer selects the best option for join.
In your case I am sure it selects the hash join algorithm. You can verify running the Proc SQL with _method_ option.
Your log will show that it is using hash join I am sure it will.
I see you are using "select *". Review if you actually need all columns. You can focus on other approaches to review your query.
Join us for SAS Innovate 2025, our biggest and most exciting global event of the year, in Orlando, FL, from May 6-9. Sign up by March 14 for just $795.
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.