Re: Performance Issue with PROC SQL queries that use sub-select in WHE...

fsdfsd · Posted 12-07-2018 11:13 AM

This question has been discussed before but I can't find that it was ever resolved:

The issue: PROC SQL seems to have some sort of bug that causes performance degradation when a WHERE clause makes use of a sub-select query when reading from an indexed dataset.

Suppose you've got a big dataset that's indexed on one field (e.g., CustomerID). You need to retrieve a record for a particular customer ID (e.g., 42). The query below will run lightning fast:

SELECT * FROM BigDataSet WHERE CustomerID IN (42);

But if you replace the number 42 with a subquery that resolves to this same value of 42, the query runs much more slowly. Important: this occurs even if the subquery is reading from a tiny dataset. Example:

DATA TinyDataSet;
CustomerID = 42;
RUN;

PROC SQL;
SELECT *

FROM BigDataSet

WHERE CustomerID IN (SELECT CustomerID From TinyDataSet);
QUIT;

The attached code, which generates a large play dataset, demonstrates the issue. You really only notice this speed difference if you're working with big data (e.g., 50,000,000+ observations).

Has anyone ever figured out why this is the case? Indexing is a really useful feature when working with big data, but it seems the use of a sub-query in the WHERE clause causes SAS to get mixed up and perhaps not make use of the index. I also work with SQL Server and Oracle databases and have not encountered this issue on those platforms.

You can run the attached code and you'll immediately see the issue. For example, the query with the hard-coded WHERE clauses takes 0.01 seconds to run on my Windows 10 Dell desktop, while the query with the sub-select takes nearly 5 seconds.

Thanks in advance for any insights you can share!

*------------------------------------------------------------------*;
data MyBigDataset (index = (CustomerID)); 
    do CustomerID = 1 to 50000000;
        output;
    end;
run;
*------------------------------------------------------------------*;
*FAST QUERY;
proc sql;
    select * from MyBigDataset where CustomerID in (42);
quit;
*------------------------------------------------------------------*;
*SLOW QUERY;
data MySubsetOfIDs;
    CustomerID = 42;
run;
 
proc sql;
    select  * 
    from    MyBigDataset 
    where   CustomerID in (select CustomerID from MySubsetOfIDs);
quit;
*------------------------------------------------------------------*;

PGStats · Posted 12-07-2018 11:53 AM

Yet,

proc sql;
    select  MyBigDataset.* 
    from 
        MyBigDataset inner join 
        MySubsetOfIDs 
            on MyBigDataset.CustomerID = MySubsetOfIDs.CustomerID;
quit;

proc sql;
    select  MyBigDataset.* 
    from 
        MyBigDataset inner join 
        (select CustomerID from MySubsetOfIDs) as b 
            on MyBigDataset.CustomerID = b.CustomerID;
quit;

are very fast!

PG

fsdfsd · Posted 12-07-2018 12:51 PM

Yes, thank you, the INNER JOIN does indeed run very quickly. So SAS clearly makes efficient use of the dataset's index when doing the join. But apparently it does not make efficient use of the index when you use a sub-query in a WHERE clause.

Because sub-queries like the one in my example are used so frequently, it would be great to know why this is the case.

Thanks again for your response.

Tom · Posted 12-07-2018 11:53 AM

That does not sound surprising.

Did you try just converting to an inner join?

What happens if you add a similar index to the small dataset?

fsdfsd · Posted 12-07-2018 01:11 PM

Thanks for your response.

Yes, I did try adding an index to the small dataset used in the sub-query. The result was that it quadrupled the time required for the query to run (and yes, this a very strange result, but you can try it with the sample of code I provided. Just change the statement data MySubsetOfIDs to data MySubsetOfIDs (index=(CustomerID)). This suggests a bug in SAS's implementation of simple indexes.

And yes, converting to an INNER JOIN gets around the issue, but the question remains as to why sub-queries (which are used so very often in SQL) should so negatively impact query performance in big data applications.

Thanks again.

Tom · Posted 12-07-2018 01:31 PM

Since you told SQL to process the full large dataset and check whether each row is in the smaller dataset to optimize that query SAS would need to know how essentially to convert it to an inner join.
To process the inner join it can see that it just needs to read the small dataset and check each id against the index of the big dataset.
Unlike a RDBMS where SQL is the ONLY language SAS does not need to invest as much resources into optimizing SQL queries. If you want to optimize the query write your own DATA step.

fsdfsd · Posted 12-07-2018 02:07 PM

Thanks for your reply. Not sure if SAS developers would agree with your assessment. They've invested much effort into supporting subqueries (for example: https://go.documentation.sas.com/?docsetId=casfedsql&docsetTarget=n07d1sue0iwb3xn1dgye2wae60z7.htm&d...).

All that aside, if we go back to my simple code example, consider two subquery approaches:

1. proc sql;
select CustomerID into :ID_list separated by "," from MySubsetOfIDs;
select * from MyBigDataset where CustomerID in (&ID_List);
quit;

2. proc sql;
select * from MyBigDataset where CustomerID in

(select CustomerID from MySubsetOfIDs);
quit;

My question reduces to this: Why should the code in #1 run much faster than the code in #2? (which is indeed the case).

Tom · Posted 12-07-2018 02:42 PM

The format of those queries is totally different. In one you have a hard coded list of values. So passing those values to the index is an obvious way to process them. In the other you do not necessarily know when compiling the query what values to pass to the index or even how many values need to be passed. When you format it as an inner join then it is easier to see that you should go through the small dataset and check for matches in the large dataset. I suspect that there are SQL implementations that can tell that the sub-query and inner join are actually equivalent. Apparently SAS cannot.

ballardw · Posted 12-07-2018 03:55 PM

@fsdfsd wrote:

Thanks for your reply. Not sure if SAS developers would agree with your assessment. They've invested much effort into supporting subqueries (for example: https://go.documentation.sas.com/?docsetId=casfedsql&docsetTarget=n07d1sue0iwb3xn1dgye2wae60z7.htm&d...).

All that aside, if we go back to my simple code example, consider two subquery approaches:

1. proc sql;
select CustomerID into :ID_list separated by "," from MySubsetOfIDs;
select * from MyBigDataset where CustomerID in (&ID_List);
quit;

2. proc sql;
select * from MyBigDataset where CustomerID in

(select CustomerID from MySubsetOfIDs);
quit;

My question reduces to this: Why should the code in #1 run much faster than the code in #2? (which is indeed the case).

My guess is that the basic subquery is executed for each record of MYBigDataset. Since a sub query might include additional subqueries I think that sort of makes sense.

Try placing the values of custormerid from Mysubsetofids into a single macro variable and use that instead of the subquery. I suspect you'll see a significant different in time as well since the comparison will be to a fixed list of items.

fsdfsd · Posted 12-07-2018 05:12 PM

Yes, thanks, I do know about the macro variable approach ( see msg #7 in this thread. In that messsage, I asked why that approach runs so much faster than a sub-query). And I'm pretty certain that a subquery is not executed every time a record is read from the large dataset. That would be terribly inefficient and would be radically different from every other SQL implementation I've worked with.

Thanks.

PGStats · Posted 12-07-2018 04:13 PM

Your link is for FedSql doc. Have you tested the situation there?

PG

Tom · Posted 12-07-2018 03:05 PM

Here are a number of scenarios so you can compare the performances.

284  data MyBigDataset (index = (CustomerID))  NoIndex;
285      do CustomerID = 1 to 50000000;
286          output;
287      end;
288  run;

NOTE: The data set WORK.MYBIGDATASET has 50000000 observations and 1 variables.
INFO: Multiple concurrent threads will be used to create the index.
NOTE: Simple index CustomerID has been defined.
NOTE: The data set WORK.NOINDEX has 50000000 observations and 1 variables.
NOTE: DATA statement used (Total process time):
      real time           13.86 seconds
      cpu time            24.27 seconds


289
290  proc sort data=noindex out=sorted;
291    by customerid ;
292  run;

NOTE: There were 50000000 observations read from the data set WORK.NOINDEX.
NOTE: SAS threaded sort was used.
NOTE: The data set WORK.SORTED has 50000000 observations and 1 variables.
NOTE: PROCEDURE SORT used (Total process time):
      real time           5.69 seconds
      cpu time            16.54 seconds


293
294  data MySubsetOfIDs;
295      CustomerID = 42;
296  run;

NOTE: The data set WORK.MYSUBSETOFIDS has 1 observations and 1 variables.
NOTE: DATA statement used (Total process time):
      real time           0.01 seconds
      cpu time            0.00 seconds


297
298  proc sql;
299    create table test0 as
300      select  *
301        from MyBigDataset a
302        where a.CustomerID in (42)
303    ;
INFO: Index CustomerID selected for WHERE clause optimization.
NOTE: Table WORK.TEST0 created, with 1 rows and 1 columns.

304  quit;
NOTE: PROCEDURE SQL used (Total process time):
      real time           0.17 seconds
      cpu time            0.00 seconds


305
306  proc sql;
307    create table test1 as
308      select  *
309        from MyBigDataset a
310        where a.CustomerID in (select CustomerID from MySubsetOfIDs)
311    ;
NOTE: Table WORK.TEST1 created, with 1 rows and 1 columns.

312  quit;
NOTE: PROCEDURE SQL used (Total process time):
      real time           6.26 seconds
      cpu time            6.08 seconds


313
314  proc sql;
315    create table test2 as
316      select  a.*
317        from MyBigDataset a
318        inner join MySubsetOfIDs b
319        on a.customerid = b.customerid
320    ;
INFO: Index CustomerID of SQL table WORK.MYBIGDATASET (alias = A) selected for SQL WHERE clause
      (join) optimization.
NOTE: Table WORK.TEST2 created, with 1 rows and 1 columns.

321  quit;
NOTE: PROCEDURE SQL used (Total process time):
      real time           0.23 seconds
      cpu time            0.00 seconds


322
323  proc sql;
324    create table test0 as
325      select  *
326        from NoIndex a
327        where a.CustomerID in (42)
328    ;
NOTE: Table WORK.TEST0 created, with 1 rows and 1 columns.

329  quit;
NOTE: PROCEDURE SQL used (Total process time):
      real time           1.10 seconds
      cpu time            1.05 seconds


330
331  proc sql;
332    create table test4 as
333      select  *
334        from NoIndex a
335        where a.CustomerID in (select CustomerID from MySubsetOfIDs)
336    ;
NOTE: Table WORK.TEST4 created, with 1 rows and 1 columns.

337  quit;
NOTE: PROCEDURE SQL used (Total process time):
      real time           6.23 seconds
      cpu time            6.03 seconds


338
339  proc sql;
340    create table test5 as
341      select  a.*
342        from NoIndex a
343        inner join MySubsetOfIDs b
344        on a.customerid = b.customerid
345    ;
NOTE: Table WORK.TEST5 created, with 1 rows and 1 columns.

346  quit;
NOTE: PROCEDURE SQL used (Total process time):
      real time           3.51 seconds
      cpu time            3.33 seconds


347
348
349  proc sql;
350    create table test6 as
351      select  *
352        from Sorted a
353        where a.CustomerID in (42)
354    ;
NOTE: Table WORK.TEST6 created, with 1 rows and 1 columns.

355  quit;
NOTE: PROCEDURE SQL used (Total process time):
      real time           0.18 seconds
      cpu time            0.00 seconds


356
357  proc sql;
358    create table test7 as
359      select  *
360        from Sorted a
361        where a.CustomerID in (select CustomerID from MySubsetOfIDs)
362    ;
NOTE: Table WORK.TEST7 created, with 1 rows and 1 columns.

363  quit;
NOTE: PROCEDURE SQL used (Total process time):
      real time           6.21 seconds
      cpu time            6.10 seconds


364
365  proc sql;
366    create table test8 as
367      select  a.*
368        from Sorted a
369        inner join MySubsetOfIDs b
370        on a.customerid = b.customerid
371    ;
NOTE: Table WORK.TEST8 created, with 1 rows and 1 columns.

372  quit;
NOTE: PROCEDURE SQL used (Total process time):
      real time           0.13 seconds
      cpu time            0.00 seconds


373  data test9;
374   merge sorted MySubsetOfIDs(in=in1);
375   by customerid;
376   if in1;
377  run;

NOTE: There were 50000000 observations read from the data set WORK.SORTED.
NOTE: There were 1 observations read from the data set WORK.MYSUBSETOFIDS.
NOTE: The data set WORK.TEST9 has 1 observations and 1 variables.
NOTE: DATA statement used (Total process time):
      real time           4.30 seconds
      cpu time            4.30 seconds


378  data test10;
379   merge MYBIGDATASET MySubsetOfIDs(in=in1);
380   by customerid;
381   if in1;
382  run;

INFO: Index CustomerID selected for BY clause processing.
NOTE: There were 50000000 observations read from the data set WORK.MYBIGDATASET.
NOTE: There were 1 observations read from the data set WORK.MYSUBSETOFIDS.
NOTE: The data set WORK.TEST10 has 1 observations and 1 variables.
NOTE: DATA statement used (Total process time):
      real time           17.69 seconds
      cpu time            17.66 seconds


383  data test11;
384    set MySubsetOfIDs ;
385    set MYBIGDATASET key=customerid;
386    if not _iorc_ then output;
387    _error_=0;
388  run;

NOTE: There were 1 observations read from the data set WORK.MYSUBSETOFIDS.
NOTE: The data set WORK.TEST11 has 1 observations and 1 variables.
NOTE: DATA statement used (Total process time):
      real time           0.23 seconds
      cpu time            0.01 seconds


389  data test12;
390    set MYBIGDATASET ;
391    where customerid in (42);
INFO: Index CustomerID selected for WHERE clause optimization.
392  run;

NOTE: There were 1 observations read from the data set WORK.MYBIGDATASET.
      WHERE customerid=42;
NOTE: The data set WORK.TEST12 has 1 observations and 1 variables.
NOTE: DATA statement used (Total process time):
      real time           0.09 seconds
      cpu time            0.00 seconds

SASKiwi · Posted 12-07-2018 05:02 PM

Great post everyone. If I were to summarise the findings it would be that SQL compilers in different products don't necessarily perform equally. There will be some types of queries that will perform better in one product than another. This doesn't come as a surprise to me and most likely to the other posters as well. There are many occasions when I've had to modify my query approach to improve performance, regardless of the SQL language / product I was using.

@fsdfsd - I'd suggest you track your sub-query performance issue to SAS Tech Support and add their response to this post. That way we will all keep learning new stuff together which to me is the great benefit of SAS Community involvement!

fsdfsd · Posted 12-07-2018 05:13 PM

I will submit this as a question to SAS's support team and post their response here. Thanks.

Performance Issue with PROC SQL queries that use sub-select in WHERE clause

Re: Performance Issue with PROC SQL queries that use sub-select in WHERE clause

Re: Performance Issue with PROC SQL queries that use sub-select in WHERE clause

Re: Performance Issue with PROC SQL queries that use sub-select in WHERE clause

Re: Performance Issue with PROC SQL queries that use sub-select in WHERE clause

Re: Performance Issue with PROC SQL queries that use sub-select in WHERE clause

Re: Performance Issue with PROC SQL queries that use sub-select in WHERE clause

Re: Performance Issue with PROC SQL queries that use sub-select in WHERE clause

Re: Performance Issue with PROC SQL queries that use sub-select in WHERE clause

Re: Performance Issue with PROC SQL queries that use sub-select in WHERE clause

Re: Performance Issue with PROC SQL queries that use sub-select in WHERE clause

Re: Performance Issue with PROC SQL queries that use sub-select in WHERE clause

Re: Performance Issue with PROC SQL queries that use sub-select in WHERE clause

Re: Performance Issue with PROC SQL queries that use sub-select in WHERE clause

Classroom Training Available!