Solved: improve proc sql join efficiency

pensarchem · Posted 08-02-2020 03:48 AM

Hi everyone,

Previous, when using SAS, we can have pass through to run the query in Oracle, this process is quite smooth.

However, due to a special nature of safe environment, I have to run all the codes in SAS only. It runs incredibly slow now, even for a small set for data b.

Here is the code, and I will put some labels.

I did some searches, but frankly not sure which will be the best approach like index( in this case will > than 34% unique labels) /hash?).

I tried:

1) data step to extract all related data, then join; take too long and too much spaces;

2) use subquery, not by join. use filter as ---- label in (select distinct label in att.list_low5_&c_year), run quite slow as well.

Any recommendation from advanced users? thank you very much.

proc sql;
create table att.total_list_&p_year as
select distinct a.label, num, datepart(dt) as dt
from hynp.flat a (********************* data in Oracle, not sortable, 80 millions records each year)
inner join att.list_low5_&c_year b (******************* records from thousands to up to 4millions, this macro run ten circles)
on a.label = b.label
where
dt >= &s_date and
dt <='31Mar2019'd
and prov = ..
and tp = ..
and pec = ..
and pnum ^= ..
and cd < ..
and vcd not in (.....)
and vloc not in (....)
and tm not between ... and ...
and tm not in (...)

and num not in (...)

order by a.label,dt desc, num;
quit;

Kurt_Bremser · Posted 08-02-2020 04:09 PM

Then you should use a hash for the inner join:

data att.total_list_&p_year;
set hynp.flat;
where /* your where condition */;
if _n_ = 1
then do;
  declare hash l (dataset:"att.list_low5_&c_year (keep=label)");
  l.definekey("label");
  l.definedone();
end;
if l.check() = 0;
run;

Maxims of Maximally Efficient SAS Programmers
How to convert datasets to data steps
The macro for direct download as ZIP
How to post code
Please vote for Provide Sequential Search Capability for Hash Objects
How to deal with locked files on UNIX

View solution in original post

Kurt_Bremser · Posted 08-02-2020 05:19 AM

From which table originate all the variables without an alias?

Does &s_date change frequently?

Do you have any kind of dynamics in the other parts of the WHERE?

Maxims of Maximally Efficient SAS Programmers
How to convert datasets to data steps
The macro for direct download as ZIP
How to post code
Please vote for Provide Sequential Search Capability for Hash Objects
How to deal with locked files on UNIX

pensarchem · Posted 08-02-2020 11:47 AM

1, hynp.flat is the source table; so label, dt, num are all from the source table.

att.list_low5_&c_year b is a label cohort generated from hynp.flat;

2, &s_date does not change and is a certain date for each macro;

3, No, these filters are set, there is no dynamic in WHERE;

Kurt_Bremser · Posted 08-02-2020 04:09 PM

Then you should use a hash for the inner join:

data att.total_list_&p_year;
set hynp.flat;
where /* your where condition */;
if _n_ = 1
then do;
  declare hash l (dataset:"att.list_low5_&c_year (keep=label)");
  l.definekey("label");
  l.definedone();
end;
if l.check() = 0;
run;

Maxims of Maximally Efficient SAS Programmers
How to convert datasets to data steps
The macro for direct download as ZIP
How to post code
Please vote for Provide Sequential Search Capability for Hash Objects
How to deal with locked files on UNIX

pensarchem · Posted 08-03-2020 01:58 AM

thanks. In the process of testing code now, I am dealing an error message. Will get back to you.

Kurt_Bremser · Posted 08-03-2020 03:40 AM

@pensarchem wrote:

thanks. In the process of testing code now, I am dealing an error message. Will get back to you.

Two thoughts:

The argument about the datepart() function is valid. Using a SAS-only function will force SAS to unload data before applying the function. OTOH, using it in a data step you simply unload the datetime value as is, which should not pose a performance problem. Just make sure that you do not use SAS-specific elements in the WHERE condition, as SAS will send that to the DBMS for execution there, if it can.
Whenever you use a connection to a DBMS, SAS will implicitly run SQL code there. Securitywise, there is no difference to using explicit pass-through: code will be sent from SAS to the DBMS. If the "safe environment" is not hardened against any mishap induced from outside (through proper permissions, performance limits, etc), then it is not safe at all. If there are any doubts, you can only have code (that was checked by the DBA) run in the DBMS to unload to a flat file, and have that sent to SAS. This is, BTW, the method we use.

Maxims of Maximally Efficient SAS Programmers
How to convert datasets to data steps
The macro for direct download as ZIP
How to post code
Please vote for Provide Sequential Search Capability for Hash Objects
How to deal with locked files on UNIX

Patrick · Posted 08-02-2020 07:09 PM

However, due to a special nature of safe environment, I have to run all the codes in SAS only

Can you explain this a bit further? You're already connecting to Oracle so why exactly can't you push some of the processing to Oracle?

If you have to pull all the data first into SAS before reducing volumes then that's already going to impact quite a bit on performance even if your 2nd table would only contain a few rows.

pensarchem · Posted 08-03-2020 02:02 AM

Yes, currently I run the code on a lower security level, so I can still connect to oracle.

However, we do have a special safe environment and all the data were imported manually from various resources. The data will be imported through .gz files into sas and run locally only under SAS..

So I am testing my code now and will move to safe environment after everything runs smoothly..I hope this clarifies.

PGStats · Posted 08-02-2020 09:28 PM

There must be something preventing SAS/SQL from asking Oracle to preprocess your query. I suspect it is the use of function datepart() .

Try adding SQL_FUNCTIONS=ALL to your libname definition of hynp. This should allow the datetime function to be passed to Oracle for processing.

If that doesn't work, write datepart(a.dt) as dt in the select clause and use datetime constants in your where clause (e.g. .... a.dt <='31Mar2019:00:00:00'dt).

PG

pensarchem · Posted 08-03-2020 02:04 AM

thanks for your input, I answered this above. We normally do run the code in Oracle and the process is quite quick..

Patrick · Posted 08-03-2020 02:36 AM

And WHY exactly can't you run it in Oracle this time? If you have first to transfer all the data into SAS before subsetting then no wonder it takes much longer. At least try and use the proposed hash table approach so the big data set doesn't need sorting as well on the SAS side before reducing the data volumes.

ChrisNZ · Posted 08-03-2020 01:15 AM

Please add this line

options sastrace=',,,db' sastraceloc=saslog nostsuffix;

and show us the full log.

High-Performance SAS Coding - Third Edition

improve proc sql join efficiency

Re: improve proc sql join efficiency

Re: improve proc sql join efficiency

Re: improve proc sql join efficiency

Re: improve proc sql join efficiency

Re: improve proc sql join efficiency

Re: improve proc sql join efficiency

Re: improve proc sql join efficiency

Re: improve proc sql join efficiency

Re: improve proc sql join efficiency

Re: improve proc sql join efficiency

Re: improve proc sql join efficiency

Re: improve proc sql join efficiency

SAS Innovate 2025: Call for Content

Classroom Training Available!