AmirSari
Quartz | Level 8

Hi everyone,

 

Is there any way to optimize the following PROC SQL? The main issue is that mylib.main is a huge dataset with more than 25 million observations, and the code takes forever to run. mylib.tbillrate has about 94,000 observations.

proc sql;
create table mylib.want as
  select distinct a.*,b.MaturityDate, b.date, b.treasuryyield, abs(intck("month",a.maturity,b.MaturityDate)) as diff
  from mylib.main a, mylib.tbillrate b
  where abs(intck('day',a.offering_date,b.date))<=29
  group by a.permco, a.offering_date
  having abs(a.maturity-b.MaturityDate) eq min(abs(a.maturity-b.MaturityDate))
;
quit;

Thank you!

7 REPLIES
Kurt_Bremser
Super User

Is the DISTINCT necessary?

And you don't need the INTCK in the WHERE clause. As said elsewhere, a simple subtraction will suffice there.
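
For illustration, here is the posted query with the INTCK call in the WHERE clause replaced by a plain date subtraction (SAS dates are stored as day counts, so the difference of two date values is already a number of days), and with DISTINCT dropped on the assumption that it is not needed:

proc sql;
create table mylib.want as
  select a.*, b.MaturityDate, b.date, b.treasuryyield,
         abs(intck("month", a.maturity, b.MaturityDate)) as diff
  from mylib.main a, mylib.tbillrate b
  where abs(a.offering_date - b.date) <= 29   /* day difference via subtraction */
  group by a.permco, a.offering_date
  having abs(a.maturity - b.MaturityDate) eq min(abs(a.maturity - b.MaturityDate))
;
quit;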

 

The best optimization of SQL code is usually to replace it with PROC and DATA steps. To give hints for that, we need usable example data in DATA steps with DATALINES, and the expected outcome.

AmirSari
Quartz | Level 8

I removed DISTINCT and INTCK, but it did not make a difference.

Patrick
Opal | Level 21

The first thing the SQL will execute is the implicit join condition (the WHERE clause). With this WHERE clause you've got a many-to-many relationship, and so the resulting table will become HUGE given that you already start with 25M rows in one of the tables.

 

To demonstrate what I'm talking about, below is your code with only ten rows per table, which due to the many-to-many relationship already results in 100 rows. It's this intermediate table that SAS then needs to write to temporary disk space and sort for the GROUP BY, and then sort again for the DISTINCT. No wonder this job runs forever.

libname mylib "%sysfunc(pathname(work))";

/* Ten rows with consecutive offering dates */
data mylib.main;
  format offering_date date9.;
  do offering_date='01jan2023'd to '10jan2023'd;
    output;
  end;
run;

/* Ten rows with consecutive T-bill dates */
data mylib.tbillrate;
  format date date9.;
  do date='01jan2023'd to '10jan2023'd;
    output;
  end;
run;

/* _METHOD prints the chosen SQL execution plan to the log */
proc sql _method;
  select count(*) as n_rows
  from mylib.main a, mylib.tbillrate b
  where abs(intck('day',a.offering_date,b.date))<=29
;
quit;

[Screenshot: query result showing n_rows = 100]

The first thing you need to change is your join condition. Is the join over these two dates really the only thing you can do, or are there other columns you could add that would reduce the intermediary result set?

 

What _method will show you in the SAS log: 

[Screenshots: the _METHOD output in the SAS log]

https://support.sas.com/resources/papers/proceedings13/200-2013.pdf 

AmirSari
Quartz | Level 8
That's exactly what is happening, but the condition is necessary and I can't change it. I was able to significantly reduce the sample size by grouping variables and now I can run the code in a few minutes.
Thank you!
Patrick
Opal | Level 21

@AmirSari wrote:
That's exactly what is happening, but the condition is necessary and I can't change it. I was able to significantly reduce the sample size by grouping variables and now I can run the code in a few minutes.
Thank you!

Then it sounds like you could create some sort of group ID and add it to your join condition, or just use all the variables that define the group in the join condition.
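
Purely as an illustration, and assuming both tables carried a common grouping column (here a hypothetical group_id), adding it to the join condition confines the many-to-many explosion to rows within the same group:

proc sql;
create table mylib.want as
  select a.*, b.MaturityDate, b.date, b.treasuryyield,
         abs(intck("month", a.maturity, b.MaturityDate)) as diff
  from mylib.main a, mylib.tbillrate b
  where a.group_id = b.group_id              /* hypothetical shared group column */
    and abs(a.offering_date - b.date) <= 29
  group by a.permco, a.offering_date
  having abs(a.maturity - b.MaturityDate) eq min(abs(a.maturity - b.MaturityDate))
;
quit;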

Sajid01
Meteorite | Level 14

Are your datasets, particularly the one with 25 million rows, local to the machine, i.e. SAS datasets on the SAS server?
If not, then use SQL pass-through to execute this on the database server rather than doing it the way you are. Processing should take place where the data resides.
With your method, the entire dataset has to come from the external database to the SAS server, and the query then executes on the SAS server. Network speed and controls play a role, and the query may take a long time to run.
If the same query is executed on the database server, then all the SAS server receives is the result, and the process will be faster.
Another advantage of running queries on the database server is the availability of common table expressions in most RDBMS systems, which allow data to be filtered before the join.
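
For illustration, a minimal sketch of explicit pass-through, assuming the two tables actually live in a database reachable via a SAS/ACCESS ODBC connection; the connection options and table names are placeholders, and the inner query must be written in the target database's own SQL dialect:

proc sql;
  /* Connection details are placeholders */
  connect to odbc as db (datasrc=mydsn user=myuser password=XXXXXXXX);

  create table mylib.want as
  select * from connection to db (
    /* This inner query runs entirely on the database server;
       the date arithmetic shown is only indicative and depends on the DBMS */
    select a.*, b.MaturityDate, b.date, b.treasuryyield
    from main a
    join tbillrate b
      on abs(a.offering_date - b.date) <= 29
  );

  disconnect from db;
quit;

Only the final result set then crosses the network back to the SAS server.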

LinusH
Tourmaline | Level 20

Apart from the already proposed suggestions, a few ideas:

- How much memory does the job consume (relative to the memory available to the SAS session)? Check with options fullstimer;. If you use it all, try to expand MEMSIZE and SORTSIZE to make use of all available physical memory on your host.

- Not 100% sure of your logic, but is it possible to "blow up" your lookup table roughly 30 times? For each date you have a 30-day span that is a potential match, right? So create a set of rows that represents each possible match date for your date. Then you'll get an equi-join, which you can optimize further, perhaps with indexes, or try to get a hash join (BUFFERSIZE/UBUFSIZE) if possible/necessary. A sketch of this idea follows below.
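
As a sketch of the second idea, assuming the ±29-day window implied by the original WHERE clause (the expanded table and the match_date variable are names introduced here for illustration):

/* One row per T-bill date and possible matching offering date (±29 days) */
data work.tbillrate_expanded;
  set mylib.tbillrate;
  do match_date = date - 29 to date + 29;
    output;
  end;
  format match_date date9.;
run;

proc sql;
create table mylib.want as
  select a.*, b.MaturityDate, b.date, b.treasuryyield,
         abs(intck("month", a.maturity, b.MaturityDate)) as diff
  from mylib.main a
       inner join work.tbillrate_expanded b
       on a.offering_date = b.match_date   /* equi-join instead of a range condition */
  group by a.permco, a.offering_date
  having abs(a.maturity - b.MaturityDate) eq min(abs(a.maturity - b.MaturityDate))
;
quit;

With an equality condition, PROC SQL can use an index or a hash join instead of evaluating the range condition against every row pair.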

Data never sleeps
