BookmarkSubscribeRSS Feed
Anshul2
SAS Employee

I am attempting to merge two datasets. The first dataset, named "Maturity_Gap_Master_2," contains 5 million records, while the second dataset, named "want," contains 500 records. The merge operation is based on a condition specified below:

"""
proc sql;
create table output as
select *
from Maturity_Gap_Master_2 a
left join
want b
on a.ACC_NEXT_FIXING_REM_TENOR_IRRBB >= b.Range_Start
and a.ACC_NEXT_FIXING_REM_TENOR_IRRBB <= b.Range_end;
"""
However, the above query takes at least 10 minutes to execute. To optimize the process, I attempted using SAS hashing:
"""

data output;
if _n_=1 then do;
if 0 then set want;
declare hash h (dataset:'want');
h.definekey('Range_start','Range_End');
h.definedata(all:'Y');
h.definedone();
declare hiter hi ('h');
end;

set MATURITY_GAP_MASTER_2;

rc=hi.last();
start_loop = 1;
do while(start_loop);
if ACC_NEXT_FIXING_REM_TENOR_IRRBB >= Range_Start and ACC_NEXT_FIXING_REM_TENOR_IRRBB <= Range_end and rc=0 then start_loop = 0;
else rc=hi.prev();
end;
run;

"""

 

Despite this optimization attempt, the execution time has not decreased. Could anyone provide assistance in optimizing the query further?

 

7 REPLIES 7
AhmedAl_Attar
Rhodochrosite | Level 12

@Anshul2 

I would recommend expanding the 500 records data set into a slightly larger table, by explicitly specifying the values of the ranges.

i.e. Instead of just having Range_Start & Range_End columns, Add a third column of the actual value, then try to do exact match from your 5 M large table to the new expanded lookup (want) table.

 

This way, it's a one time range expansion, rather than 5 million times, as you have been doing via your Left Join and Hash Join.

Hope this helps,

Ahmed

AhmedAl_Attar
Rhodochrosite | Level 12

@Anshul2 

Here is an example code

data want;
	length Range_Start Range_End 8;
	Range_Start = '01Jan2023'd; Range_End = '31Jan2023'd; OUTPUT;
	Range_Start = '01Feb2023'd; Range_End = '15Feb2023'd; OUTPUT;
	Range_Start = '01Jun2023'd; Range_End = '30Jul2023'd; OUTPUT;
	FORMAT Range_Start Range_End date9.;
RUN;
data want_exp(KEEP=date);
	Set want;
	do i=Range_Start to Range_End;
		date = i;
		output;
	end;
	FORMAT date date9.;
run;

Use the "want_exp" table in your joins (SQL/Hash) with your 5 Millions record using equi joins (=) rather than Between (>= & <=)

LinusH
Tourmaline | Level 20

For all optimization efforts, you need to know what the bottleneck is.

Attach the following to your code:

options msglevel=i stimer;

For PROC SQL, you can get more information how the internal optimizer work by:

PROC SQL _method _tree;
Data never sleeps
FreelanceReinh
Jade | Level 19

Hello @Anshul2,

 

If it's not feasible to implement AhmedAl_Attar 's great suggestion of using an equijoin (see also Re: How to improve Efficiency for another example) because the number of potential ACC_NEXT_FIXING_REM_TENOR_IRRBB values in those ranges is too large, you could consider defining one or more formats.

 

Here's an example with non-overlapping ranges (1 through 10, >10 through 100, >100 through 1000) and only a few (2) "satellite" variables in dataset WANT:

/* Create sample data for demonstration */

data want;
input Range_Start Range_End var1 var2 $;
cards;
1     10 123 first
10   100 456 second
100 1000 789 third
;

/* Create a format assigning concatenated values to the ranges */

data wantfmt;
retain fmtname 'wantfmt';
set want(rename=(range_start=start range_end=end)) end=last;
length label $30;
label=catx('|',var1,var2);
output;
if last then do;
  hlo='O';
  label=' ';
  output;
end;
run;

proc format cntlin=wantfmt;
run;

/* Perform the intended left join by applying the format */

data final(drop=_c);
set Maturity_Gap_Master_2;
length _c $30;
_c=put(ACC_NEXT_FIXING_REM_TENOR_IRRBB,wantfmt.);
var1=input(scan(_c,1,'|'),32.);
length var2 $8;
var2=scan(_c,2,'|');
run;
mkeintz
PROC Star

Questions:

  1. Are the range_start/range_end time spans in dataset WANT mutually exclusive?

  2. What is the granularity of ACC_NEXT_FIXING_REM_TENOR_IRRBB?   I.e. are they date values?  Time values?   Something else?

  3. If the answer to question 1 ("mutually exclusive") is yes, then are there gaps between the ranges?
--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------
Patrick
Opal | Level 21

@Anshul2 Is this a many:many join meaning that more than one row from your table want could meet the condition for a row from table Maturity_Gap_Master_2?

To give us some indication about the data volume can you please share the output of a proc contents over your result table output that your SQL creates?

proc contents data=work.output;
run;

What I'm most interested in is below highlighted info. 

Patrick_0-1709109180996.png

Is the number of observation the same or higher than what  you've got in your source table Maturity_Gap_Master_2?

 

Sajid01
Meteorite | Level 14

Hello @Anshul2 
The SAS SQL optimizer selects the best option for join.
In your case I am sure it selects the hash join algorithm. You can verify running the Proc SQL with _method_ option.
Your log will show that it is using hash join I am sure it will.
I see you are using "select *". Review if you actually need all columns. You can focus on other approaches to review your query.

sas-innovate-2024.png

Available on demand!

Missed SAS Innovate Las Vegas? Watch all the action for free! View the keynotes, general sessions and 22 breakouts on demand.

 

Register now!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 7 replies
  • 583 views
  • 2 likes
  • 7 in conversation