BookmarkSubscribeRSS Feed

Following from thread

https://communities.sas.com/t5/SAS-Programming/Optimize-SAS-Code/m-p/532716#M145990

When SAS sees this code:

 

data CLASS(index=(AGE)); 
  set SASHELP.CLASS;
run;
  
proc sql; 
  select * from CLASS 
  where AGE=12 or (AGE=11 & NAME='Alice');
quit; 

it decides to not use indexes (because the OR operator is used).

It would of course be beneficial that only ages 11 and 12 be selected using an index, and then the NAME filter applied.

 

9 Comments
ChrisNZ
Tourmaline | Level 20

Note that a previous request for better utilisation of indexes was already made.

yabwon
Onyx | Level 15

Hi @ChrisNZ 

 

if you are interested in solving that problem in Base SAS you can check out my quick-tip talk at SAS GF2019 in April,

 

If you can't be there let me know on priv-message, I'll share a preview-article.

 

All the best

Bart

ScottBass
Rhodochrosite | Level 12

Perhaps related?

 

I once had to do a fact + 5 dim table join.  All the data sets were indexed on the appropriate keys.  All the tables were "large".

 

In pseudo code:

 

 

proc sql;
   create table final as
   select *
   from fact
   left join dim1
   on fact.key1=dim1.key1
   left join dim2
   on fact.key2=dim2.key2
   left join dim3
   on fact.key3=dim3.key3
   left join dim4
   on fact.key4=dim4.key4
   left join dim5
   on fact.key5=dim5.key5
   where fact.somevar between x and y
   ;
quit;

The query characteristics were the where clause greatly filtered the fact table - something like 50M rows to 5000 rows, and did so very quickly due to the index on somevar.

 

The dim tables ranged in size from 5M to 30M rows.  

 

So, after applying the where clause to the fact table, I needed to do 5000 lookups against the 5 indexed dim tables.

 

The data volumes precluded a hash object approach.

 

The 6 table join completed confused the SQL Server query optimizer, which resorted to sorting each table, resulting in horrid performance.

 

The workaround was to join the tables "one at a time".  Using this workaround, PROC SQL was happy to use the indexes, and the join was many orders of magnitude faster.

 

I tried IDXNAME= and IDXWHERE= options, made no difference.

 

Oh yeah, and all the tables were SPDE tables.

 

Apologies if this isn't exactly related to the original suggestion, although it certainly fits under the suggestion title

 

ChrisNZ
Tourmaline | Level 20

It's hard to replicate without data, but one would expect proc sql to use indexes regardless of whether there are two or more tables. Have you tried using proc sql's _method and magic options?

ScottBass
Rhodochrosite | Level 12

_method yes, magic no.  I tried everything I knew to do.  It might have had something to do with SPDE tables?  But as soon as I added the third table, SQL ignored the indexes and started sorting.

ChrisNZ
Tourmaline | Level 20

MAGIC does not have a suggestion value for using an index. 😞

The behaviour has nothing to do with SPDE it seems, more to do with the left join.

An inner join uses one index while a left join sorts.

This happens even with 2 tables in this test.

Weird.

 

 

libname W spde "%sysfunc(pathname(WORK))";

data W.T1( rename=(I=A J=W) index=(A))

     W.T2( rename=(I=B J=X) index=(B) )

     W.T3( rename=(I=C J=Y) index=(C) )

     W.T4( rename=(I=D J=Z) index=(D) )

       T1( rename=(I=A J=W) index=(A) )

       T2( rename=(I=B J=X) index=(B) )

       T3( rename=(I=C J=Y) index=(C) )

       T4( rename=(I=D J=Z) index=(D) );

  do I=1 to 1e6;

    J=I;

    output;

  end;

run;                        

options msglevel=I;

                 

%* BASE Engine - inner join - 2 tables;

proc sql _method ;

  create table T as select *

  from T1 inner join T2 on T1.A=T2.B

  where 100000 < T1.W < 105000;

quit;

 

%* BASE Engine - left join - 2 tables;

proc sql _method ;

  create table T as select *

  from T1 left join T2 on T1.A=T2.B

  where 100000 < T1.W < 105000;

quit;

 

 

%* SPDE Engine - inner join - 2 tables;

proc sql _method ;

  create table W.T as select *

  from T1 inner join W.T2 on T1.A=T2.B

  where 100000 < T1.W < 105000;

quit;

 

%* SPDE Engine - left join - 2 tables;

proc sql _method ;

  create table W.T as select *

  from T1 left join W.T2 on T1.A=T2.B

  where 100000 < T1.W < 105000;

quit;

 

 

%* BASE Engine - inner join - 4 tables;

proc sql _method ;

  create table T as select *

  from T1 inner join T2 on T1.A=T2.B

          inner join T3 on T1.A=T3.C

          inner join T4 on T1.A=T4.D

  where 100000 < T1.W < 105000;

quit;

 

 

 

%* BASE Engine - left join - 4 tables;

proc sql _method ;

  create table T as select *

  from T1 left join T2 on T1.A=T2.B

          left join T3 on T1.A=T3.C

          left join T4 on T1.A=T4.D

  where 100000 < T1.W < 105000;

quit;

 

 

 

 

 

 

 

%* SPDE Engine - inner join - 4 tables;

proc sql _method ;

  create table W.T as select *

  from T1 inner join W.T2 on T1.A=T2.B

          inner join W.T3 on T1.A=T3.C

          inner join W.T4 on T1.A=T4.D

  where 100000 < T1.W < 105000;

quit;

 

 

 

%* SPDE Engine - left join - 4 tables;

proc sql _method ;

  create table W.T as select *

  from T1 left join W.T2 on T1.A=T2.B

          left join W.T3 on T1.A=T3.C

          left join W.T4 on T1.A=T4.D

  where 100000 < T1.W < 105000;

quit;

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

42         %* BASE Engine - inner join - 2 tables;

INFO: Index B of SQL table WORK.T2 selected for SQL WHERE clause (join) optimization.

      sqxcrta

          sqxjndx

              sqxsrc( WORK.T1 )

              sqxsrc( WORK.T2 )

 

48         %* BASE Engine - left join - 2 tables;

      sqxcrta

          sqxjm

              sqxsort

                  sqxsrc( WORK.T2 )

              sqxsort

                  sqxsrc( WORK.T1 )

 

54         %* SPDE Engine - inner join - 2 tables;

INFO: Index B of SQL table W.T2 selected for SQL WHERE clause (join) optimization.

      sqxcrta

          sqxjndx

              sqxsrc( WORK.T1 )

              sqxsrc( W.T2 )

 

60         %* SPDE Engine - left join - 2 tables;

      sqxcrta

          sqxjm

              sqxsort

                  sqxsrc( W.T2 )

              sqxsort

                  sqxsrc( WORK.T1 )

 

66         %* BASE Engine - inner join - 4 tables;

INFO: Index B of SQL table WORK.T2 selected for SQL WHERE clause (join) optimization.

      sqxcrta

          sqxjhsh

              sqxjhsh

                  sqxjndx

                      sqxsrc( WORK.T1 )

                      sqxsrc( WORK.T2 )

                  sqxsrc( WORK.T3 )

              sqxsrc( WORK.T4 )

 

74         %* BASE Engine - left join - 4 tables;

      sqxcrta

          sqxjm

              sqxsort

                  sqxsrc( WORK.T4 )

              sqxjm

                  sqxsort

                      sqxsrc( WORK.T3 )

                  sqxjm

                      sqxsort

                          sqxsrc( WORK.T2 )

                      sqxsort

                          sqxsrc( WORK.T1 )

 

82         %* SPDE Engine - inner join - 4 tables;

INFO: Index B of SQL table W.T2 selected for SQL WHERE clause (join) optimization.

      sqxcrta

          sqxjhsh

              sqxjhsh

                  sqxjndx

                      sqxsrc( WORK.T1 )

                      sqxsrc( W.T2 )

                  sqxsrc( W.T3 )

              sqxsrc( W.T4 )

 

90         %* SPDE Engine - left join - 4 tables;

      sqxcrta

          sqxjm

              sqxsort

                  sqxsrc( W.T4 )

              sqxjm

                  sqxsort

                      sqxsrc( W.T3 )

                  sqxjm

                      sqxsort

                          sqxsrc( W.T2 )

                      sqxsort

                          sqxsrc( WORK.T1 )

       

 

 

You can use all indexes if you separate the joins *and* you use inner joins.


%* SPDE Engine - left join - 4 tables; 
proc sql _method ; 
  create table W.T as select X1.*
  from            (select * from T1 inner join W.T2 on T1.A=T2.B where 100000 < T1.W < 105000 ) as X1
       inner join (select * from T1 inner join W.T3 on T1.A=T3.C where 100000 < T1.W < 105000 ) as X2 on X1.A=X2.A
       inner join (select * from T1 inner join W.T4 on T1.A=T4.D where 100000 < T1.W < 105000 ) as X3 on X1.A=X3.A
  ;
quit;

INFO: Index D of SQL table W.T4 selected for SQL WHERE clause (join) optimization.
INFO: Index C of SQL table W.T3 selected for SQL WHERE clause (join) optimization.
INFO: Index B of SQL table W.T2 selected for SQL WHERE clause (join) optimization.

 

That's suboptimal for sure in this case. I guess SAS decides on the join method before subsetting the fact table, so isn't aware of the volumes involved.

It would be nice if there was at least a way to force the join method.

Another way with so few values might be something like this:

 

proc sql noprint; 
  select A into :a_list separated by ',' from W.T1 where 100000 < T1.W < 105000;
  create table W.T as select * 
  from T1 left join (select * from W.T2 where T2.B in(&a_list.) ) T2 on T1.A=T2.B 
          left join (select * from W.T3 where T3.C in(&a_list.) ) T3 on T1.A=T3.C 
          left join (select * from W.T4 where T4.D in(&a_list.) ) T4 on T1.A=T4.D 
  where 100000 < T1.W < 105000;
quit;
ScottBass
Rhodochrosite | Level 12

I'm pretty sure I needed left joins - the dim tables *should* have had a match for all fact SK's but I couldn't risk deleting fact rows.  This was about 4-5 years ago.

 

I feel like I hijacked your SASWare Ballot Idea.  Sorry about that.  

 

P.S.:  Nice examples for SAS QA and/or R&D to run with if they feel this is incorrect behavior 😉

ChrisNZ
Tourmaline | Level 20

That's fine, Scott, I don't expect much to happen with the ballot entries 🙂

yabwon
Onyx | Level 15

Hi Chris,

 

I know it's been a while since you posted this, but maybe workaround from here:

https://www.sas.com/content/dam/SAS/support/en/sas-global-forum-proceedings/2019/3722-2019.pdf

can help you with OR condition it the WHERE clause

 

All the best

Bart