Hello experts,
Please, I want to optimise the sas sql query, for the hundred of variables and more than 3 000 000 observations.
when I do like below, but , it will take more than 10 minutes
proc sql;
create table toto as
select
v1 as w1,
v2,
case
...when
...
else
as v3,
/* many cases conditions of case are used */...
...
..,
vn
from ( select * from dat1 where y1='O' and f1 ne 'E') as M1 left join beta as M2
on M1.id=M2.id
left join omega as M3
on M2.tr=M3.tr;
quit;
I think using merge with dataset ? will be more faster ?
Assuming tables Beta and Omega don't have high volumes, a hash lookup could perform quite well.
data toto;
set dat1(where=(y1='O' and f1 ne 'E'));
if _n_=1 then
do;
...define hashes for Beta
end;
...hash lookups
... all your case statements reformulated as if..then..else or as Select ...When ..
run;
You don't give much details about those CASE expressions. First try to remove the subquery like this
....
from
dat1 as M1 left join
beta as M2 on M1.id=M2.id left join
omega as M3 on M2.tr=M3.tr
where M1.y1='O' and M1.f1 ne 'E';
and if column tr exists in table dat1, you should use on M1.tr=M3.tr instead of on M2.tr=M3.tr.
Question about the subquery removal.
Is there any benefit to using a data set where parameter (as per the program fragment below), as opposed to having the "where m1.y1='0' and m1.f1 NE 'E'" sql clause? In the DATA step world, this can often improve performance by outsourcing the filtering process to the data access engine. Perhaps proc sql knows to outsource when feasible.
....
from
dat1 (where=(y1='0'and f1 ^= 'E')) as M1 left join
beta as M2 on M1.id=M2.id left join
omega as M3 on M2.tr=M3.tr
;
Using the where clause as a data set option or as a SQL clause should make no difference.
It is a matter of preference, sticking to the standards if this matters, and legibility.
I like the option when possible as it makes the contribution of each data set more obvious.
Data steps sometimes do behave differently depending on whether a clause or an option is supplied, but they shouldn't, and this behaviour is a defect.
Depending on the CASE statements, a format could be helpful.
Sounds to me like your data modelling is a failure to start with:
" for the hundred of variables"
Have never seen a good working system with so many variables, your select will be huge. Doesn't matter how well you write code, if you input data is rubbish then the code will not work well.
@ll: Thank you all for your answer.@ll: Thank you all for your answer.
To pricise some points
> clause when in sas sql, the case, I use is like this:
case
when v1 > 12 and W1=‘O’ then 1
when V2 >= 0 and W2=‘F’ then 2
.....
when vn < 0 or Hn not in (‘1’,‘2’) then .
else 100
end as w_n
I use it with left join.
I want to have a real time cpu <1 minutes, now I have 10 minutes.
Thank you again
@ChrisNZ: Thank you for your answer.
as I said before it is about 10 minutes
Feel free to not supply the information requested and not get the help you want.
Are you ready for the spotlight? We're accepting content ideas for SAS Innovate 2025 to be held May 6-9 in Orlando, FL. The call is open until September 25. Read more here about why you should contribute and what is in it for you!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.