Why doesn't PROC SQL use an index for multiple left joins?

ScottBass · Posted 01-01-2014 11:58 PM

Hi,

Apologies for the length of this SAS log below; I couldn't easily create a simpler example.

There are two SQL joins in this SAS log:

* The first join uses multiple LEFT JOINs in the one SQL step. In this example, PROC SQL uses an index for only one of the joins, and uses a sort for the remaining joins. It takes 25 seconds to execute.

* The second join is similar to the first, but uses a single LEFT JOIN with multiple SQL steps. In this example, PROC SQL uses an index for all of the joins. It takes 1.5 seconds to execute.

See questions inline with the log output below:

1897 * sql join ;

1898 proc sql _method;

1899 create table workspde.final3 as

1900 select

1901 %seplist(covno suffix,prefix=t1.)

1902 ,%seplist(title givenname surname dob sex statusInd,prefix=t2.)

1903 ,%seplist(addressLine1 addressLine2 addressLine3 addressLine4 location state postcode,prefix=t3.)

1904 ,%seplist(currentRecord contactType contactDesc phonefaxemail,prefix=t4.)

1905 from

1906 workspde.source t1

1907 left join

1908 claimdm.memberdim (keep=covno suffix currentRecord title givenname surname dob sex statusInd) t2

1909 on

1910 t1.covno=t2.covno and t1.suffix=t2.suffix

1911 left join

1912 claimdm.postaldim (keep=covno currentRecord addressLine: location state postcode) t3

1913 on

1914 t1.covno=t3.covno

1915 left join

1916 claimdm.contactdim (keep=covno suffix currentRecord contactType contactDesc phonefaxemail) t4

1917 on

1918 t1.covno=t4.covno and t1.suffix=t4.suffix

1919 where

1920 t2.currentRecord="Y" and t3.currentRecord="Y" and t4.currentRecord="Y"

1921 ;

INFO: Index primary of SQL table CLAIMDM.MEMBERDIM (alias = T2) selected for SQL WHERE clause (outer join) optimization.

NOTE: SQL execution methods chosen are:

sqxcrta

sqxfil

sqxjm

sqxsort

sqxsrc( CLAIMDM.CONTACTDIM(alias = T4) )

sqxsort

sqxfil

sqxjm

sqxsort

sqxsrc( CLAIMDM.POSTALDIM(alias = T3) )

sqxsort

sqxfil

sqxjndx

sqxsrc( WORKSPDE.SOURCE(alias = T1) )

sqxsrc( CLAIMDM.MEMBERDIM(alias = T2) )

NOTE: SAS threaded sort was used.

NOTE: Compressing data set WORKSPDE.FINAL3 decreased size by 85.05 percent.

NOTE: Table WORKSPDE.FINAL3 created, with 4712 rows and 19 columns.

1922 quit;

NOTE: PROCEDURE SQL used (Total process time):

real time 25.24 seconds

user cpu time 14.96 seconds

system cpu time 11.66 seconds

memory 274318.92k

OS Memory 299728.00k

Timestamp 2/01/2014 03:16:20 PM

Here is my understanding of the SQL _method output:

* SQL uses an index to join WORKSPDE.SOURCE (the transaction table) and CLAIMDM.MEMBERDIM (the dimension lookup table) (sqxjndx)

* It writes this to an intermediate, utility file (sqxfil)

* It then sorts this file (sqxsort)

* It sorts the remaining dimension lookup tables and intermediate files to create the final output using match merge (sqxjm).

It's the sort of the large dimension tables that causes the performance hit.

1923

1924 * what if we split the SQL join into pieces to avoid the sorting? ;

1927 proc sql _method;

1928 create table workspde.temp1 as

1929 select

1930 %seplist(covno suffix,prefix=t1.)

1931 ,%seplist(title givenname surname dob sex statusInd,prefix=t2.)

1932 from

1933 workspde.source t1

1934 left join

1935 claimdm.memberdim (keep=covno suffix currentRecord title givenname surname dob sex statusInd) t2

1936 on

1937 t1.covno=t2.covno and t1.suffix=t2.suffix

1938 where

1939 t2.currentRecord="Y"

1940 ;

INFO: Index primary of SQL table CLAIMDM.MEMBERDIM (alias = T2) selected for SQL WHERE clause (outer join) optimization.

NOTE: SQL execution methods chosen are:

sqxcrta

sqxfil

sqxjndx

sqxsrc( WORKSPDE.SOURCE(alias = T1) )

sqxsrc( CLAIMDM.MEMBERDIM(alias = T2) )

NOTE: Compressing data set WORKSPDE.TEMP1 decreased size by 73.73 percent.

NOTE: Table WORKSPDE.TEMP1 created, with 10935 rows and 8 columns.

1941 create table workspde.temp2 as

1942 select

1943 t1.*

1944 ,%seplist(addressLine1 addressLine2 addressLine3 addressLine4 location state postcode,prefix=t3.)

1945 from

1946 workspde.temp1 t1

1947 left join

1948 claimdm.postaldim (keep=covno currentRecord addressLine: location state postcode) t3

1949 on

1950 t1.covno=t3.covno

1951 where

1952 t3.currentRecord="Y"

1953 ;

INFO: Index covNo of SQL table CLAIMDM.POSTALDIM (alias = T3) selected for SQL WHERE clause (outer join) optimization.

NOTE: SQL execution methods chosen are:

sqxcrta

sqxfil

sqxjndx

sqxsrc( WORKSPDE.TEMP1(alias = T1) )

sqxsrc( CLAIMDM.POSTALDIM(alias = T3) )

NOTE: Compressing data set WORKSPDE.TEMP2 decreased size by 82.20 percent.

NOTE: Table WORKSPDE.TEMP2 created, with 10935 rows and 15 columns.

1954 create table workspde.final4 as

1955 select

1956 t1.*

1957 ,%seplist(currentRecord contactType contactDesc phonefaxemail,prefix=t4.)

1958 from

1959 workspde.temp2 t1

1960 left join

1961 claimdm.contactdim (keep=covno suffix currentRecord contactType contactDesc phonefaxemail) t4

1962 on

1963 t1.covno=t4.covno and t1.suffix=t4.suffix

1964 where

1965 t4.currentRecord="Y"

1966 ;

INFO: Index member of SQL table CLAIMDM.CONTACTDIM (alias = T4) selected for SQL WHERE clause (outer join) optimization.

NOTE: SQL execution methods chosen are:

sqxcrta

sqxfil

sqxjndx

sqxsrc( WORKSPDE.TEMP2(alias = T1) )

sqxsrc( CLAIMDM.CONTACTDIM(alias = T4) )

NOTE: Compressing data set WORKSPDE.FINAL4 decreased size by 85.04 percent.

NOTE: Table WORKSPDE.FINAL4 created, with 4712 rows and 19 columns.

1967 quit;

NOTE: PROCEDURE SQL used (Total process time):

real time 1.51 seconds

user cpu time 1.26 seconds

system cpu time 0.26 seconds

memory 2629.60k

OS Memory 29632.00k

Timestamp 2/01/2014 03:16:22 PM

I know the intermediate, utility files created by SQL are not indexed, but then neither are my "explicit utility files" created in the second example. My question is: why is PROC SQL smart enough to use indexes in the second example, but not the first? I've investigated the IDXNAME option but it doesn't apply in this scenario.

Thanks,

Scott

Please post your question as a self-contained data step in the form of "have" (source) and "want" (desired results).
I won't contribute to your post if I can't cut-and-paste your syntactically correct code into SAS.

LinusH · Posted 01-02-2014 07:06 AM

Great example. It pictures quite clear something that often can be described as a gut feeling - Proc SQL is good/great on optimizing simple queries. More complicated cases needs to be tested and developed by the user/programmer.

as for the question ("Why.."), the answer is probably that this kind of intelligence is not implemented in the query planer of Pro SQL.

This question is addressed to the developer at SAS, so I hope they will come forward on this issue.

Data never sleeps

Why doesn't PROC SQL use an index for multiple left joins?

Re: Why doesn't PROC SQL use an index for multiple left joins?

Click image to register for webinar

Classroom Training Available!