Hi everyone,
Is there any way to optimize the following proc sql? The main issue is that the mylib.main is a huge dataset with more than 25 million observations and it takes the code forever to run. The mylib.tbillrate has about 94,000 observations.
proc sql;
create table mylib.want as
select distinct a.*,b.MaturityDate, b.date, b.treasuryyield, abs(intck("month",a.maturity,b.MaturityDate)) as diff
from mylib.main a, mylib.tbillrate b
where abs(intck('day',a.offering_date,b.date))<=29
group by a.permco, a.offering_date
having abs(a.maturity-b.MaturityDate) eq min(abs(a.maturity-b.MaturityDate))
;
quit;
Thank you!
Is the DISTINCT necessary?
And you don't need the INTCK in the where. As said elsewhere, a simple subtraction will suffice there.
The best optimization of SQL code is usually the replacement with PROC and DATA steps. To give hints for that, we need usable example data in DATA steps with DATALINES, and the expected outcome.
I removed Distinct and intck but did not make a difference.
The first thing the SQL will execute is the implicit join condition (the where clause). With this where clause you've got a many:many relationship and though the resulting table will become HUGE given that you already start with 25M rows in the one table.
To demonstrate what I'm talking about below your code with only ten rows per table - which due to the many:many relationship already results in a 100 rows. ....and it's this table that SAS then needs to write to temporary disk space and to sort for the Group By ...and it then also needs to sort again for the DISTINCT. No wonder this job runs forever.
libname mylib "%sysfunc(pathname(work))";
data mylib.main;
format offering_date date9.;
do offering_date='01jan2023'd to '10jan2023'd;
output;
end;
run;
data mylib.tbillrate;
format date date9.;
do date='01jan2023'd to '10jan2023'd;
output;
end;
run;
proc sql _method;
select count(*) as n_rows
from mylib.main a, mylib.tbillrate b
where abs(intck('day',a.offering_date,b.date))<=29
;
quit;
First thing you need to change is your join condition. Is the join over these two dates really the only thing you can do or are there other columns you could add that will reduce the intermediary result set?
What _method will show you in the SAS log:
https://support.sas.com/resources/papers/proceedings13/200-2013.pdf
@AmirSari wrote:
That's exactly what is happening, but the condition is necessary and I can't change it. I was able to significantly reduce the sample size by grouping variables and now I can run the code in a few minutes.
Thank you!
Then it sounds like you could create some sort of group id and add it to your join condition. ....or just use all the variables that define the group in the join condition.
Are your datasets particularly the one with 25 million rows local to the machine i.e. sas datasets on the SAS server?
If not then use SQL pass through to execute this on the database server rather than doing they way you are doing. Processing should take place where the data resides.
In your method the entire data will have to come to the SAS server from the external database and then execute on the SAS server. The network speed and controls play a role and the query may take a long time to run.
If the same query is executed on the database server, then what you will be getting to SAS server is the result and the process will be faster.
Another advantage of running queries on the database server is the availability of Common table expressions in most RDBMS systems which allow data to be filtered before join.
A part from already proposed suggestions, a few ideas:
- How much memory does job consume? (related to available memory to the SAS session)? (options fullstimer;). If you use it all, try to expand MEMSIZE and SORTSIZE to make use of all available physical memory of your host.
- Not 100% sure of your logic, but is it possible to "blow" up your lookup-table time 30? For each date you have a 30 day span that is a potential match, right? So create a set of roes that represent each possible match date for your date. Then you'll get a equi join, which you can optimize further, perhaps with indexes or try to get a hash join (BUFFERSIZE/UBUFSIZE) if possible/necessary.
SAS Innovate 2025 is scheduled for May 6-9 in Orlando, FL. Sign up to be first to learn about the agenda and registration!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.