About fsdfsd

fsdfsd · ‎12-07-2018

I will submit this as a question to SAS's support team and post their response here. Thanks.

fsdfsd · ‎12-07-2018

Yes, thanks, I do know about the macro variable approach (see msg #7 in this thread; in that message, I asked why that approach runs so much faster than a sub-query). And I'm pretty certain that a subquery is not executed every time a record is read from the large dataset. That would be terribly inefficient and radically different from every other SQL implementation I've worked with. Thanks.

fsdfsd · ‎12-07-2018

Thanks for your reply. Not sure if SAS developers would agree with your assessment. They've invested much effort into supporting subqueries (for example: https://go.documentation.sas.com/?docsetId=casfedsql&docsetTarget=n07d1sue0iwb3xn1dgye2wae60z7.htm&docsetVersion=3.2&locale=en). All that aside, if we go back to my simple code example, consider two subquery approaches:

1.
proc sql;
  select CustomerID into :ID_list separated by "," from MySubsetOfIDs;
  select * from MyBigDataset where CustomerID in (&ID_List);
quit;

2.
proc sql;
  select * from MyBigDataset where CustomerID in (select CustomerID from MySubsetOfIDs);
quit;

My question reduces to this: Why should the code in #1 run much faster than the code in #2? (which is indeed the case).

fsdfsd · ‎12-07-2018

Thanks for your response. Yes, I did try adding an index to the small dataset used in the sub-query. The result was that it quadrupled the time required for the query to run. (And yes, this is a very strange result, but you can try it with the sample code I provided: just change the statement data MySubsetOfIDs to data MySubsetOfIDs (index=(CustomerID)).) This suggests a bug in SAS's implementation of simple indexes. And yes, converting to an INNER JOIN gets around the issue, but the question remains as to why sub-queries (which are used so very often in SQL) should so negatively impact query performance in big data applications. Thanks again.

fsdfsd · ‎12-07-2018

Yes, thank you, the INNER JOIN does indeed run very quickly. So SAS clearly makes efficient use of the dataset's index when doing the join. But apparently it does not make efficient use of the index when you use a sub-query in a WHERE clause. Because sub-queries like the one in my example are used so frequently, it would be great to know why this is the case. Thanks again for your response.
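For readers following along, here is a sketch of what the INNER JOIN form might look like. The exact join used in the thread was not posted, so this is an assumption. It reuses the dataset names from the sample code in this thread (MyBigDataset and MySubsetOfIDs); the aliases b and s are illustrative only.

```sas
/* Assumption: MyBigDataset (indexed on CustomerID) and MySubsetOfIDs */
/* already exist, as created by the sample code in this thread.       */
proc sql;
  select b.*
  from MyBigDataset as b
       inner join MySubsetOfIDs as s
       on b.CustomerID = s.CustomerID;
quit;
```

One caveat: unlike IN, a join returns one row per match, so this matches the sub-query result only if CustomerID is unique in MySubsetOfIDs.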

fsdfsd · ‎12-07-2018

This question has been discussed before but I can't find that it was ever resolved.

The issue: PROC SQL seems to have some sort of bug that causes performance degradation when a WHERE clause makes use of a sub-select query when reading from an indexed dataset.

Suppose you've got a big dataset that's indexed on one field (e.g., CustomerID). You need to retrieve a record for a particular customer ID (e.g., 42). The query below will run lightning fast:

SELECT * FROM BigDataSet WHERE CustomerID IN (42);

But if you replace the number 42 with a subquery that resolves to this same value of 42, the query runs much more slowly. Important: this occurs even if the subquery is reading from a tiny dataset. Example:

DATA TinyDataSet;
  CustomerID = 42;
RUN;

PROC SQL;
  SELECT * FROM BigDataSet WHERE CustomerID IN (SELECT CustomerID FROM TinyDataSet);
QUIT;

The attached code, which generates a large play dataset, demonstrates the issue. You really only notice this speed difference if you're working with big data (e.g., 50,000,000+ observations).

Has anyone ever figured out why this is the case? Indexing is a really useful feature when working with big data, but it seems the use of a sub-query in the WHERE clause causes SAS to get mixed up and perhaps not make use of the index. I also work with SQL Server and Oracle databases and have not encountered this issue on those platforms.

You can run the attached code and you'll immediately see the issue. For example, the query with the hard-coded WHERE clause takes 0.01 seconds to run on my Windows 10 Dell desktop, while the query with the sub-select takes nearly 5 seconds. Thanks in advance for any insights you can share!
*------------------------------------------------------------------*;
data MyBigDataset (index = (CustomerID));
  do CustomerID = 1 to 50000000;
    output;
  end;
run;
*------------------------------------------------------------------*;
*FAST QUERY;
proc sql;
  select * from MyBigDataset where CustomerID in (42);
quit;
*------------------------------------------------------------------*;
*SLOW QUERY;
data MySubsetOfIDs;
  CustomerID = 42;
run;

proc sql;
  select * from MyBigDataset where CustomerID in (select CustomerID from MySubsetOfIDs);
quit;
*------------------------------------------------------------------*;
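One way to check whether the index is actually being used in each case is to ask SAS to report it in the log. The sketch below is not part of the original attachment. It assumes the two datasets above have already been created, and uses the MSGLEVEL=I and FULLSTIMER system options together with the PROC SQL _METHOD option. What appears in the log may vary by SAS release.

```sas
/* Assumption: MyBigDataset and MySubsetOfIDs exist as created above. */
/* MSGLEVEL=I asks SAS to write informational notes, including index  */
/* usage, to the log. FULLSTIMER adds detailed timing per step.       */
options msglevel=i fullstimer;

/* _METHOD writes the execution methods chosen by the PROC SQL */
/* optimizer to the log for each query.                        */
proc sql _method;
  /* hard-coded value */
  select * from MyBigDataset where CustomerID in (42);

  /* sub-select */
  select * from MyBigDataset
  where CustomerID in (select CustomerID from MySubsetOfIDs);
quit;
```

Comparing the log notes for the two queries should show whether the index on CustomerID is selected for the hard-coded version but not for the sub-select version.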

Date Last Visited	‎11-20-2021 08:58 PM

Re: Performance Issue with PROC SQL queries that use sub-select in WHE...

Performance Issue with PROC SQL queries that use sub-select in WHERE c...
