I am attempting to merge two datasets. The first dataset, named "Maturity_Gap_Master_2," contains 5 million records, while the second dataset, named "want," contains 500 records. The merge operation is based on a condition specified below:
"""
proc sql;
create table output as
select *
from Maturity_Gap_Master_2 a
left join
want b
on a.ACC_NEXT_FIXING_REM_TENOR_IRRBB >= b.Range_Start
and a.ACC_NEXT_FIXING_REM_TENOR_IRRBB <= b.Range_end;
"""
However, the above query takes at least 10 minutes to execute. To optimize the process, I attempted using SAS hashing:
"""
data output;
if _n_=1 then do;
if 0 then set want;
declare hash h (dataset:'want');
h.definekey('Range_start','Range_End');
h.definedata(all:'Y');
h.definedone();
declare hiter hi ('h');
end;
set MATURITY_GAP_MASTER_2;
rc=hi.last();
start_loop = 1;
do while(start_loop);
if ACC_NEXT_FIXING_REM_TENOR_IRRBB >= Range_Start and ACC_NEXT_FIXING_REM_TENOR_IRRBB <= Range_end and rc=0 then start_loop = 0;
else rc=hi.prev();
end;
run;
"""
Despite this optimization attempt, the execution time has not decreased. Could anyone provide assistance in optimizing the query further?
I would recommend expanding the 500 records data set into a slightly larger table, by explicitly specifying the values of the ranges.
i.e. Instead of just having Range_Start & Range_End columns, Add a third column of the actual value, then try to do exact match from your 5 M large table to the new expanded lookup (want) table.
This way, it's a one time range expansion, rather than 5 million times, as you have been doing via your Left Join and Hash Join.
Hope this helps,
Ahmed
Here is an example code
data want;
length Range_Start Range_End 8;
Range_Start = '01Jan2023'd; Range_End = '31Jan2023'd; OUTPUT;
Range_Start = '01Feb2023'd; Range_End = '15Feb2023'd; OUTPUT;
Range_Start = '01Jun2023'd; Range_End = '30Jul2023'd; OUTPUT;
FORMAT Range_Start Range_End date9.;
RUN;
data want_exp(KEEP=date);
Set want;
do i=Range_Start to Range_End;
date = i;
output;
end;
FORMAT date date9.;
run;
Use the "want_exp" table in your joins (SQL/Hash) with your 5 Millions record using equi joins (=) rather than Between (>= & <=)
For all optimization efforts, you need to know what the bottleneck is.
Attach the following to your code:
options msglevel=i stimer;
For PROC SQL, you can get more information how the internal optimizer work by:
PROC SQL _method _tree;
Hello @Anshul2,
If it's not feasible to implement AhmedAl_Attar 's great suggestion of using an equijoin (see also Re: How to improve Efficiency for another example) because the number of potential ACC_NEXT_FIXING_REM_TENOR_IRRBB values in those ranges is too large, you could consider defining one or more formats.
Here's an example with non-overlapping ranges (1 through 10, >10 through 100, >100 through 1000) and only a few (2) "satellite" variables in dataset WANT:
/* Create sample data for demonstration */
data want;
input Range_Start Range_End var1 var2 $;
cards;
1 10 123 first
10 100 456 second
100 1000 789 third
;
/* Create a format assigning concatenated values to the ranges */
data wantfmt;
retain fmtname 'wantfmt';
set want(rename=(range_start=start range_end=end)) end=last;
length label $30;
label=catx('|',var1,var2);
output;
if last then do;
hlo='O';
label=' ';
output;
end;
run;
proc format cntlin=wantfmt;
run;
/* Perform the intended left join by applying the format */
data final(drop=_c);
set Maturity_Gap_Master_2;
length _c $30;
_c=put(ACC_NEXT_FIXING_REM_TENOR_IRRBB,wantfmt.);
var1=input(scan(_c,1,'|'),32.);
length var2 $8;
var2=scan(_c,2,'|');
run;
Questions:
@Anshul2 Is this a many:many join meaning that more than one row from your table want could meet the condition for a row from table Maturity_Gap_Master_2?
To give us some indication about the data volume can you please share the output of a proc contents over your result table output that your SQL creates?
proc contents data=work.output;
run;
What I'm most interested in is below highlighted info.
Is the number of observation the same or higher than what you've got in your source table Maturity_Gap_Master_2?
Hello @Anshul2
The SAS SQL optimizer selects the best option for join.
In your case I am sure it selects the hash join algorithm. You can verify running the Proc SQL with _method_ option.
Your log will show that it is using hash join I am sure it will.
I see you are using "select *". Review if you actually need all columns. You can focus on other approaches to review your query.
April 27 – 30 | Gaylord Texan | Grapevine, Texas
Walk in ready to learn. Walk out ready to deliver. This is the data and AI conference you can't afford to miss.
Register now and lock in 2025 pricing—just $495!
Still thinking about your presentation idea? The submission deadline has been extended to Friday, Nov. 14, at 11:59 p.m. ET.
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.