DATA Step, Macro, Functions and more

Why doesn't PROC SQL use an index for multiple left joins?

Reply
Super Contributor
Posts: 376

Why doesn't PROC SQL use an index for multiple left joins?

Hi,

Apologies for the length of this SAS log below; I couldn't easily create a simpler example. 

There are two SQL joins in this SAS log:

* The first join uses multiple LEFT JOINs in the one SQL step.  In this example, PROC SQL uses an index for only one of the joins, and uses a sort for the remaining joins.  It takes 25 seconds to execute.

* The second join is similar to the first, but uses a single LEFT JOIN with multiple SQL steps.  In this example, PROC SQL uses an index for all of the joins.  It takes 1.5 seconds to execute.

See questions inline with the log output below:

1897  * sql join ;

1898  proc sql _method;

1899    create table workspde.final3 as

1900    select

1901      %seplist(covno suffix,prefix=t1.)

1902      ,%seplist(title givenname surname dob sex statusInd,prefix=t2.)

1903      ,%seplist(addressLine1 addressLine2 addressLine3 addressLine4 location state postcode,prefix=t3.)

1904      ,%seplist(currentRecord contactType contactDesc phonefaxemail,prefix=t4.)

1905    from

1906      workspde.source t1

1907    left join

1908      claimdm.memberdim (keep=covno suffix currentRecord title givenname surname dob sex statusInd) t2

1909    on

1910      t1.covno=t2.covno and t1.suffix=t2.suffix

1911    left join

1912      claimdm.postaldim (keep=covno currentRecord addressLine: location state postcode) t3

1913    on

1914      t1.covno=t3.covno

1915    left join

1916      claimdm.contactdim (keep=covno suffix currentRecord contactType contactDesc phonefaxemail) t4

1917    on

1918      t1.covno=t4.covno and t1.suffix=t4.suffix

1919    where

1920      t2.currentRecord="Y" and t3.currentRecord="Y" and t4.currentRecord="Y"

1921    ;

INFO: Index primary of SQL table CLAIMDM.MEMBERDIM (alias = T2) selected for SQL WHERE clause (outer join) optimization.

NOTE: SQL execution methods chosen are:

      sqxcrta

          sqxfil

              sqxjm

                  sqxsort

                      sqxsrc( CLAIMDM.CONTACTDIM(alias = T4) )

                  sqxsort

                      sqxfil

                          sqxjm

                              sqxsort

                                  sqxsrc( CLAIMDM.POSTALDIM(alias = T3) )

                              sqxsort

                                  sqxfil

                                      sqxjndx

                                          sqxsrc( WORKSPDE.SOURCE(alias = T1) )

                                          sqxsrc( CLAIMDM.MEMBERDIM(alias = T2) )

NOTE: SAS threaded sort was used.

NOTE: Compressing data set WORKSPDE.FINAL3 decreased size by 85.05 percent.

NOTE: Table WORKSPDE.FINAL3 created, with 4712 rows and 19 columns.

1922  quit;

NOTE: PROCEDURE SQL used (Total process time):

      real time        25.24 seconds

      user cpu time       14.96 seconds

      system cpu time     11.66 seconds

      memory              274318.92k

      OS Memory           299728.00k

      Timestamp            2/01/2014 03:16:20 PM

Here is my understanding of the SQL _method output:

* SQL uses an index to join WORKSPDE.SOURCE (the transaction table) and CLAIMDM.MEMBERDIM (the dimension lookup table)  (sqxjndx)

* It writes this to an intermediate, utility file (sqxfil)

* It then sorts this file (sqxsort)

* It sorts the remaining dimension lookup tables and intermediate files to create the final output using match merge (sqxjm). 


It's the sort of the large dimension tables that causes the performance hit.

1923

1924  * what if we split the SQL join into pieces to avoid the sorting? ;

1927  proc sql _method;

1928    create table workspde.temp1 as

1929    select

1930      %seplist(covno suffix,prefix=t1.)

1931      ,%seplist(title givenname surname dob sex statusInd,prefix=t2.)

1932    from

1933      workspde.source t1

1934    left join

1935      claimdm.memberdim (keep=covno suffix currentRecord title givenname surname dob sex statusInd) t2

1936    on

1937      t1.covno=t2.covno and t1.suffix=t2.suffix

1938    where

1939      t2.currentRecord="Y"

1940    ;

INFO: Index primary of SQL table CLAIMDM.MEMBERDIM (alias = T2) selected for SQL WHERE clause (outer join) optimization.

NOTE: SQL execution methods chosen are:

      sqxcrta

          sqxfil

              sqxjndx

                  sqxsrc( WORKSPDE.SOURCE(alias = T1) )

                  sqxsrc( CLAIMDM.MEMBERDIM(alias = T2) )

NOTE: Compressing data set WORKSPDE.TEMP1 decreased size by 73.73 percent.

NOTE: Table WORKSPDE.TEMP1 created, with 10935 rows and 8 columns.

1941    create table workspde.temp2 as

1942    select

1943      t1.*

1944      ,%seplist(addressLine1 addressLine2 addressLine3 addressLine4 location state postcode,prefix=t3.)

1945    from

1946      workspde.temp1 t1

1947    left join

1948      claimdm.postaldim (keep=covno currentRecord addressLine: location state postcode) t3

1949    on

1950      t1.covno=t3.covno

1951    where

1952      t3.currentRecord="Y"

1953    ;

INFO: Index covNo of SQL table CLAIMDM.POSTALDIM (alias = T3) selected for SQL WHERE clause (outer join) optimization.

NOTE: SQL execution methods chosen are:

      sqxcrta

          sqxfil

              sqxjndx

                  sqxsrc( WORKSPDE.TEMP1(alias = T1) )

                  sqxsrc( CLAIMDM.POSTALDIM(alias = T3) )

NOTE: Compressing data set WORKSPDE.TEMP2 decreased size by 82.20 percent.

NOTE: Table WORKSPDE.TEMP2 created, with 10935 rows and 15 columns.

1954    create table workspde.final4 as

1955    select

1956      t1.*

1957      ,%seplist(currentRecord contactType contactDesc phonefaxemail,prefix=t4.)

1958    from

1959      workspde.temp2 t1

1960    left join

1961      claimdm.contactdim (keep=covno suffix currentRecord contactType contactDesc phonefaxemail) t4

1962    on

1963      t1.covno=t4.covno and t1.suffix=t4.suffix

1964    where

1965      t4.currentRecord="Y"

1966    ;

INFO: Index member of SQL table CLAIMDM.CONTACTDIM (alias = T4) selected for SQL WHERE clause (outer join) optimization.

NOTE: SQL execution methods chosen are:

      sqxcrta

          sqxfil

              sqxjndx

                  sqxsrc( WORKSPDE.TEMP2(alias = T1) )

                  sqxsrc( CLAIMDM.CONTACTDIM(alias = T4) )

NOTE: Compressing data set WORKSPDE.FINAL4 decreased size by 85.04 percent.

NOTE: Table WORKSPDE.FINAL4 created, with 4712 rows and 19 columns.

1967  quit;

NOTE: PROCEDURE SQL used (Total process time):

      real time        1.51 seconds

      user cpu time       1.26 seconds

      system cpu time     0.26 seconds

      memory              2629.60k

      OS Memory           29632.00k

      Timestamp            2/01/2014 03:16:22 PM

I know the intermediate, utility files created by SQL are not indexed, but then neither are my "explicit utility files" created in the second example.  My question is:  why is PROC SQL smart enough to use indexes in the second example, but not the first?  I've investigated the IDXNAME option but it doesn't apply in this scenario.

Thanks,

Scott

Super User
Posts: 5,260

Re: Why doesn't PROC SQL use an index for multiple left joins?

Great example. It pictures quite clear something that often can be described as a gut feeling - Proc SQL is good/great on optimizing simple queries. More complicated cases needs to be tested and developed by the user/programmer.

as for the question ("Why.."), the answer is probably that this kind of intelligence is not implemented in the query planer of Pro SQL.

This question is addressed to the developer at SAS, so I hope they will come forward on this issue.

Data never sleeps
Ask a Question
Discussion stats
  • 1 reply
  • 531 views
  • 0 likes
  • 2 in conversation