Hi,
Apologies for the length of this SAS log below; I couldn't easily create a simpler example.
There are two SQL joins in this SAS log:
* The first join uses multiple LEFT JOINs in the one SQL step. In this example, PROC SQL uses an index for only one of the joins, and uses a sort for the remaining joins. It takes 25 seconds to execute.
* The second join is similar to the first, but uses a single LEFT JOIN with multiple SQL steps. In this example, PROC SQL uses an index for all of the joins. It takes 1.5 seconds to execute.
See questions inline with the log output below:
1897 * sql join ;
1898 proc sql _method;
1899 create table workspde.final3 as
1900 select
1901 %seplist(covno suffix,prefix=t1.)
1902 ,%seplist(title givenname surname dob sex statusInd,prefix=t2.)
1903 ,%seplist(addressLine1 addressLine2 addressLine3 addressLine4 location state postcode,prefix=t3.)
1904 ,%seplist(currentRecord contactType contactDesc phonefaxemail,prefix=t4.)
1905 from
1906 workspde.source t1
1907 left join
1908 claimdm.memberdim (keep=covno suffix currentRecord title givenname surname dob sex statusInd) t2
1909 on
1910 t1.covno=t2.covno and t1.suffix=t2.suffix
1911 left join
1912 claimdm.postaldim (keep=covno currentRecord addressLine: location state postcode) t3
1913 on
1914 t1.covno=t3.covno
1915 left join
1916 claimdm.contactdim (keep=covno suffix currentRecord contactType contactDesc phonefaxemail) t4
1917 on
1918 t1.covno=t4.covno and t1.suffix=t4.suffix
1919 where
1920 t2.currentRecord="Y" and t3.currentRecord="Y" and t4.currentRecord="Y"
1921 ;
INFO: Index primary of SQL table CLAIMDM.MEMBERDIM (alias = T2) selected for SQL WHERE clause (outer join) optimization.
NOTE: SQL execution methods chosen are:
sqxcrta
sqxfil
sqxjm
sqxsort
sqxsrc( CLAIMDM.CONTACTDIM(alias = T4) )
sqxsort
sqxfil
sqxjm
sqxsort
sqxsrc( CLAIMDM.POSTALDIM(alias = T3) )
sqxsort
sqxfil
sqxjndx
sqxsrc( WORKSPDE.SOURCE(alias = T1) )
sqxsrc( CLAIMDM.MEMBERDIM(alias = T2) )
NOTE: SAS threaded sort was used.
NOTE: Compressing data set WORKSPDE.FINAL3 decreased size by 85.05 percent.
NOTE: Table WORKSPDE.FINAL3 created, with 4712 rows and 19 columns.
1922 quit;
NOTE: PROCEDURE SQL used (Total process time):
real time 25.24 seconds
user cpu time 14.96 seconds
system cpu time 11.66 seconds
memory 274318.92k
OS Memory 299728.00k
Timestamp 2/01/2014 03:16:20 PM
Here is my understanding of the SQL _method output:
* SQL uses an index to join WORKSPDE.SOURCE (the transaction table) and CLAIMDM.MEMBERDIM (the dimension lookup table) (sqxjndx)
* It writes this to an intermediate, utility file (sqxfil)
* It then sorts this file (sqxsort)
* It sorts the remaining dimension lookup tables and intermediate files to create the final output using match merge (sqxjm).
It's the sort of the large dimension tables that causes the performance hit.
1923
1924 * what if we split the SQL join into pieces to avoid the sorting? ;
1927 proc sql _method;
1928 create table workspde.temp1 as
1929 select
1930 %seplist(covno suffix,prefix=t1.)
1931 ,%seplist(title givenname surname dob sex statusInd,prefix=t2.)
1932 from
1933 workspde.source t1
1934 left join
1935 claimdm.memberdim (keep=covno suffix currentRecord title givenname surname dob sex statusInd) t2
1936 on
1937 t1.covno=t2.covno and t1.suffix=t2.suffix
1938 where
1939 t2.currentRecord="Y"
1940 ;
INFO: Index primary of SQL table CLAIMDM.MEMBERDIM (alias = T2) selected for SQL WHERE clause (outer join) optimization.
NOTE: SQL execution methods chosen are:
sqxcrta
sqxfil
sqxjndx
sqxsrc( WORKSPDE.SOURCE(alias = T1) )
sqxsrc( CLAIMDM.MEMBERDIM(alias = T2) )
NOTE: Compressing data set WORKSPDE.TEMP1 decreased size by 73.73 percent.
NOTE: Table WORKSPDE.TEMP1 created, with 10935 rows and 8 columns.
1941 create table workspde.temp2 as
1942 select
1943 t1.*
1944 ,%seplist(addressLine1 addressLine2 addressLine3 addressLine4 location state postcode,prefix=t3.)
1945 from
1946 workspde.temp1 t1
1947 left join
1948 claimdm.postaldim (keep=covno currentRecord addressLine: location state postcode) t3
1949 on
1950 t1.covno=t3.covno
1951 where
1952 t3.currentRecord="Y"
1953 ;
INFO: Index covNo of SQL table CLAIMDM.POSTALDIM (alias = T3) selected for SQL WHERE clause (outer join) optimization.
NOTE: SQL execution methods chosen are:
sqxcrta
sqxfil
sqxjndx
sqxsrc( WORKSPDE.TEMP1(alias = T1) )
sqxsrc( CLAIMDM.POSTALDIM(alias = T3) )
NOTE: Compressing data set WORKSPDE.TEMP2 decreased size by 82.20 percent.
NOTE: Table WORKSPDE.TEMP2 created, with 10935 rows and 15 columns.
1954 create table workspde.final4 as
1955 select
1956 t1.*
1957 ,%seplist(currentRecord contactType contactDesc phonefaxemail,prefix=t4.)
1958 from
1959 workspde.temp2 t1
1960 left join
1961 claimdm.contactdim (keep=covno suffix currentRecord contactType contactDesc phonefaxemail) t4
1962 on
1963 t1.covno=t4.covno and t1.suffix=t4.suffix
1964 where
1965 t4.currentRecord="Y"
1966 ;
INFO: Index member of SQL table CLAIMDM.CONTACTDIM (alias = T4) selected for SQL WHERE clause (outer join) optimization.
NOTE: SQL execution methods chosen are:
sqxcrta
sqxfil
sqxjndx
sqxsrc( WORKSPDE.TEMP2(alias = T1) )
sqxsrc( CLAIMDM.CONTACTDIM(alias = T4) )
NOTE: Compressing data set WORKSPDE.FINAL4 decreased size by 85.04 percent.
NOTE: Table WORKSPDE.FINAL4 created, with 4712 rows and 19 columns.
1967 quit;
NOTE: PROCEDURE SQL used (Total process time):
real time 1.51 seconds
user cpu time 1.26 seconds
system cpu time 0.26 seconds
memory 2629.60k
OS Memory 29632.00k
Timestamp 2/01/2014 03:16:22 PM
I know the intermediate, utility files created by SQL are not indexed, but then neither are my "explicit utility files" created in the second example. My question is: why is PROC SQL smart enough to use indexes in the second example, but not the first? I've investigated the IDXNAME option but it doesn't apply in this scenario.
Thanks,
Scott
Great example. It pictures quite clear something that often can be described as a gut feeling - Proc SQL is good/great on optimizing simple queries. More complicated cases needs to be tested and developed by the user/programmer.
as for the question ("Why.."), the answer is probably that this kind of intelligence is not implemented in the query planer of Pro SQL.
This question is addressed to the developer at SAS, so I hope they will come forward on this issue.
Don't miss out on SAS Innovate - Register now for the FREE Livestream!
Can't make it to Vegas? No problem! Watch our general sessions LIVE or on-demand starting April 17th. Hear from SAS execs, best-selling author Adam Grant, Hot Ones host Sean Evans, top tech journalist Kara Swisher, AI expert Cassie Kozyrkov, and the mind-blowing dance crew iLuminate! Plus, get access to over 20 breakout sessions.
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.