topic improve proc sql join efficiency in SAS Programming

improve proc sql join efficiency

pensarchem — Sun, 02 Aug 2020 07:51:00 GMT

Hi everyone,

Previous, when using SAS, we can have pass through to run the query in Oracle, this process is quite smooth.

However, due to a special nature of safe environment, I have to run all the codes in SAS only. It runs incredibly slow now, even for a small set for data b.

Here is the code, and I will put some labels.

I did some searches, but frankly not sure which will be the best approach like index( in this case will > than 34% unique labels) /hash?).

I tried:

1) data step to extract all related data, then join; take too long and too much spaces;

2) use subquery, not by join. use filter as ---- label in (select distinct label in att.list_low5_&c_year), run quite slow as well.

Any recommendation from advanced users? thank you very much.

proc sql;
create table att.total_list_&p_year as
select distinct a.label, num, datepart(dt) as dt
from hynp.flat a (********************* data in Oracle, not sortable, 80 millions records each year)
inner join att.list_low5_&c_year b (******************* records from thousands to up to 4millions, this macro run ten circles)
on a.label = b.label
where
dt >= &s_date and
dt <='31Mar2019'd
and prov = ..
and tp = ..
and pec = ..
and pnum ^= ..
and cd < ..
and vcd not in (.....)
and vloc not in (....)
and tm not between ... and ...
and tm not in (...)

and num not in (...)

order by a.label,dt desc, num;
quit;

Re: improve proc sql join efficiency

Kurt_Bremser — Sun, 02 Aug 2020 09:19:40 GMT

From which table originate all the variables without an alias?

Does &s_date change frequently?

Do you have any kind of dynamics in the other parts of the WHERE?

Re: improve proc sql join efficiency

pensarchem — Sun, 02 Aug 2020 15:47:48 GMT

1, hynp.flat is the source table; so label, dt, num are all from the source table.

att.list_low5_&c_year b is a label cohort generated from hynp.flat;

2, &s_date does not change and is a certain date for each macro;

3, No, these filters are set, there is no dynamic in WHERE;

Re: improve proc sql join efficiency

Kurt_Bremser — Sun, 02 Aug 2020 20:09:19 GMT

Then you should use a hash for the inner join:

data att.total_list_&p_year;
set hynp.flat;
where /* your where condition */;
if _n_ = 1
then do;
  declare hash l (dataset:"att.list_low5_&c_year (keep=label)");
  l.definekey("label");
  l.definedone();
end;
if l.check() = 0;
run;

Re: improve proc sql join efficiency

Patrick — Sun, 02 Aug 2020 23:09:18 GMT

However, due to a special nature of safe environment, I have to run all the codes in SAS only

Can you explain this a bit further? You're already connecting to Oracle so why exactly can't you push some of the processing to Oracle?

If you have to pull all the data first into SAS before reducing volumes then that's already going to impact quite a bit on performance even if your 2nd table would only contain a few rows.

Re: improve proc sql join efficiency

PGStats — Mon, 03 Aug 2020 01:28:38 GMT

There must be something preventing SAS/SQL from asking Oracle to preprocess your query. I suspect it is the use of function datepart() .

Try adding SQL_FUNCTIONS=ALL to your libname definition of hynp. This should allow the datetime function to be passed to Oracle for processing.

If that doesn't work, write datepart(a.dt) as dt in the select clause and use datetime constants in your where clause (e.g. .... a.dt <='31Mar2019:00:00:00'dt).

Re: improve proc sql join efficiency

ChrisNZ — Mon, 03 Aug 2020 05:15:21 GMT

Please add this line

options sastrace=',,,db' sastraceloc=saslog nostsuffix;

and show us the full log.

Re: improve proc sql join efficiency

pensarchem — Mon, 03 Aug 2020 05:58:47 GMT

thanks. In the process of testing code now, I am dealing an error message. Will get back to you.

Re: improve proc sql join efficiency

pensarchem — Mon, 03 Aug 2020 06:02:29 GMT

Yes, currently I run the code on a lower security level, so I can still connect to oracle.

However, we do have a special safe environment and all the data were imported manually from various resources. The data will be imported through .gz files into sas and run locally only under SAS..

So I am testing my code now and will move to safe environment after everything runs smoothly..I hope this clarifies.

Re: improve proc sql join efficiency

pensarchem — Mon, 03 Aug 2020 06:04:15 GMT

thanks for your input, I answered this above. We normally do run the code in Oracle and the process is quite quick..

Re: improve proc sql join efficiency

Patrick — Mon, 03 Aug 2020 06:37:08 GMT

And WHY exactly can't you run it in Oracle this time? If you have first to transfer all the data into SAS before subsetting then no wonder it takes much longer. At least try and use the proposed hash table approach so the big data set doesn't need sorting as well on the SAS side before reducing the data volumes.

Re: improve proc sql join efficiency

Kurt_Bremser — Mon, 03 Aug 2020 07:40:22 GMT

@pensarchem wrote:

thanks. In the process of testing code now, I am dealing an error message. Will get back to you.

Two thoughts:

The argument about the datepart() function is valid. Using a SAS-only function will force SAS to unload data before applying the function. OTOH, using it in a data step you simply unload the datetime value as is, which should not pose a performance problem. Just make sure that you do not use SAS-specific elements in the WHERE condition, as SAS will send that to the DBMS for execution there, if it can.
Whenever you use a connection to a DBMS, SAS will implicitly run SQL code there. Securitywise, there is no difference to using explicit pass-through: code will be sent from SAS to the DBMS. If the "safe environment" is not hardened against any mishap induced from outside (through proper permissions, performance limits, etc), then it is not safe at all. If there are any doubts, you can only have code (that was checked by the DBA) run in the DBMS to unload to a flat file, and have that sent to SAS. This is, BTW, the method we use.