proc sql optimisation

speaky · Posted 03-28-2012 09:34 AM

Hi,

I'm trying to optimise my proc sql to be as efficient as it can be.

I'm matching a large dataset onto a performance file, but the rows selected with change each month depending on the outcome window.

Proc sql is processed in the following order (I think)

proc sql;

select ...5

from ...1

where ...2

group by ...3

having ...4

order by ...;6

quit;

For efficiency, would it be better to have a where statement in the from line? Would this get processed first?

Thanks,

C

art297 · Posted 03-28-2012 09:48 AM

I don't think it would make much, if any, difference. At least it hasn't, consistently, on any of my own tests.

Haikuo · Posted 03-28-2012 09:53 AM

Not sure I completely understand your question, but here is my 2 cents:

1. 'where' clause is processed even before 'select' clause in term of processing.

2. Sometimes you can't change the writting orders, and in this case, it does not matter anyway.

3. SQL is very efficient if you do it completely natively, say within Oracle by pass-through method.

4. if your datasets are SAS tables, you could explore other options, such as data step merge, or with index, or hash(), it can only be determined on case by case bases.

A good summary paper on 9 ways of joining datasets:

http://www.nesug.org/Proceedings/nesug09/po/po13.pdf

Regards,

Haikuo

speaky · Posted 03-28-2012 11:45 AM

Thanks for your replies! :O)

I'm doing a left join onto a performance table and restricting by outcome month, but the performance file will get very large over time.

My question was around the location of the where clause, at present it's in the where select at the bottom, but I wondered if anyone knew for efficiency if it was better to put this after the table reference in the from statement?

The code below worked using ODBC, but when I run it in enterprise guide it gives a syntax error for the and line - is this because there is more than one variable in the select statement?

At the moment, the datasets are very small, but when we switch portfolio, optimising the code below is key!

proc sql;

create table cj as

select A.*, B.*

from &mthy. a left join perform.&perf_file. b

on a.cs_id = b.cs_id

and (cs_id, arrears) IN

(select cs_id, MAX(arrears) as flag_3

from perform.&perf_file. b

where '01DEC2011'D <= B.outcome_month <= '28FEB2012'D

group by B.cs_id)

;QUIT;