AmirSari
Quartz | Level 8

Hi everyone,

 

Is there any way to optimize the following PROC SQL? The main issue is that mylib.main is a huge dataset with more than 25 million observations, and the code takes forever to run. The mylib.tbillrate dataset has about 94,000 observations.

proc sql;
create table mylib.want as
  select distinct a.*,b.MaturityDate, b.date, b.treasuryyield, abs(intck("month",a.maturity,b.MaturityDate)) as diff
  from mylib.main a, mylib.tbillrate b
  where abs(intck('day',a.offering_date,b.date))<=29
  group by a.permco, a.offering_date
  having abs(a.maturity-b.MaturityDate) eq min(abs(a.maturity-b.MaturityDate))
;
quit;

Thank you!

7 REPLIES
Kurt_Bremser
Super User

Is the DISTINCT necessary?

And you don't need the INTCK in the WHERE clause. As said elsewhere, a simple subtraction of the two date values will suffice there.
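
As a rough sketch (untested, using the columns from the original query), the subtraction version could look like this; the DISTINCT is also dropped, pending the question above:

/* Sketch: same query as posted, but the INTCK call in the WHERE clause is   */
/* replaced by a plain subtraction. SAS date values are integer day counts,  */
/* so abs(a.offering_date - b.date) <= 29 expresses the same 29-day window.  */
proc sql;
create table mylib.want as
  select a.*, b.MaturityDate, b.date, b.treasuryyield,
         abs(intck('month', a.maturity, b.MaturityDate)) as diff
  from mylib.main a, mylib.tbillrate b
  where abs(a.offering_date - b.date) <= 29
  group by a.permco, a.offering_date
  having abs(a.maturity - b.MaturityDate) = min(abs(a.maturity - b.MaturityDate))
;
quit;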

 

The best optimization of SQL code is usually to replace it with PROC and DATA steps. To give hints for that, we need usable example data in DATA steps with DATALINES, and the expected outcome.

AmirSari
Quartz | Level 8

I removed the DISTINCT and the INTCK, but it did not make a difference.

Patrick
Opal | Level 21

The first thing the SQL will execute is the implicit join condition (the WHERE clause). With this WHERE clause you've got a many:many relationship, and therefore the resulting table will become HUGE given that you already start with 25M rows in one of the tables.

 

To demonstrate what I'm talking about, below is your code with only ten rows per table, which due to the many:many relationship already results in 100 rows. It's this table that SAS then needs to write to temporary disk space and sort for the GROUP BY, and then it also needs to sort again for the DISTINCT. No wonder this job runs forever.

libname mylib "%sysfunc(pathname(work))";

data mylib.main;
  format offering_date date9.;
  do offering_date='01jan2023'd to '10jan2023'd;
    output;
  end;
run; 

data mylib.tbillrate;
  format date date9.;
  do date='01jan2023'd to '10jan2023'd;
    output;
  end;
run; 

proc sql _method;
  select count(*) as n_rows
  from mylib.main a, mylib.tbillrate b
  where abs(intck('day',a.offering_date,b.date))<=29
;
quit;

[Screenshot: query result showing n_rows = 100]

The first thing you need to change is your join condition. Is the join over these two dates really the only thing you can do, or are there other columns you could add that would reduce the intermediate result set?

 

What _method will show you in the SAS log: 

[Screenshots: _METHOD output from the SAS log]

https://support.sas.com/resources/papers/proceedings13/200-2013.pdf 

AmirSari
Quartz | Level 8
That's exactly what is happening, but the condition is necessary and I can't change it. I was able to significantly reduce the sample size by grouping variables and now I can run the code in a few minutes.
Thank you!
Patrick
Opal | Level 21

@AmirSari wrote:
That's exactly what is happening, but the condition is necessary and I can't change it. I was able to significantly reduce the sample size by grouping variables and now I can run the code in a few minutes.
Thank you!

Then it sounds like you could create some sort of group ID and add it to your join condition, or just use all the variables that define the group in the join condition.
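
As a hedged sketch (the column group_id is purely hypothetical and stands for whatever identifies your groups in both tables):

/* Sketch only: assumes a hypothetical column GROUP_ID exists in (or has been */
/* added to) both tables and defines the groups mentioned above. The equality */
/* condition lets SQL restrict the many:many join to matching groups instead  */
/* of crossing every row of MAIN with every row of TBILLRATE.                 */
proc sql;
create table mylib.want as
  select a.*, b.MaturityDate, b.date, b.treasuryyield,
         abs(intck('month', a.maturity, b.MaturityDate)) as diff
  from mylib.main a, mylib.tbillrate b
  where a.group_id = b.group_id
    and abs(a.offering_date - b.date) <= 29
  group by a.permco, a.offering_date
  having abs(a.maturity - b.MaturityDate) = min(abs(a.maturity - b.MaturityDate))
;
quit;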

Sajid01
Meteorite | Level 14

Are your datasets, particularly the one with 25 million rows, local to the machine, i.e. SAS datasets on the SAS server?
If not, then use SQL pass-through to execute this on the database server rather than the way you are doing it. Processing should take place where the data resides.
With your method, the entire dataset has to come from the external database to the SAS server, and the query then executes on the SAS server. Network speed and controls play a role, and the query may take a long time to run.
If the same query is executed on the database server, then what you get back on the SAS server is just the result, and the process will be faster.
Another advantage of running queries on the database server is the availability of common table expressions in most RDBMS systems, which allow data to be filtered before the join.
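
A generic sketch of explicit pass-through (the Oracle engine, connection details, and column names are hypothetical placeholders; the whole inner query, including any grouping logic, would run on the database, and only the result comes back to SAS):

/* Sketch only: hypothetical Oracle connection - adjust the engine, credentials, */
/* and table/column names to your environment. Everything inside the             */
/* parentheses is sent to and executed by the database server.                   */
proc sql;
  connect to oracle (user=myuser password=mypass path='mydb');
  create table mylib.want as
  select * from connection to oracle
  ( /* native database SQL goes here, e.g. the date-window join of MAIN and TBILLRATE */
    select a.*, b.maturitydate, b.rate_date, b.treasuryyield
    from main a
    join tbillrate b
      on abs(a.offering_date - b.rate_date) <= 29
  );
  disconnect from oracle;
quit;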

LinusH
Tourmaline | Level 20

Apart from the already proposed suggestions, a few ideas:

- How much memory does the job consume (relative to the memory available to the SAS session)? Check with options fullstimer;. If you use it all, try increasing MEMSIZE and SORTSIZE to make use of all the available physical memory of your host.

- Not 100% sure of your logic, but is it possible to "blow up" your lookup table by the size of the date window? For each rate date, every date within +/-29 days is a potential match, right? So create a set of rows that represent each possible match date for your date (see the sketch below). Then you'll get an equi-join, which you can optimize further, perhaps with indexes, or try to get a hash join (BUFFERSIZE/UBUFSIZE) if possible/necessary.
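
A rough sketch of the expanded lookup table (untested; with a +/-29 day window each of the ~94,000 tbillrate rows becomes 59 rows, roughly 5.5M in total, which is still small compared to the 25M-row main table):

/* Sketch: expand each TBILLRATE row to one row per date in the +/-29 day      */
/* window, so the join on OFFERING_DATE becomes a simple equality (equi-join). */
data tbill_expanded;
  set mylib.tbillrate;
  do match_date = date - 29 to date + 29;
    output;
  end;
  format match_date date9.;
run;

proc sql;
create table mylib.want as
  select a.*, b.MaturityDate, b.date, b.treasuryyield,
         abs(intck('month', a.maturity, b.MaturityDate)) as diff
  from mylib.main a
    inner join tbill_expanded b
    on a.offering_date = b.match_date
  group by a.permco, a.offering_date
  having abs(a.maturity - b.MaturityDate) = min(abs(a.maturity - b.MaturityDate))
;
quit;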

Data never sleeps

