This question has been discussed before but I can't find that it was ever resolved: The issue: PROC SQL seems to have some sort of bug that causes performance degradation when a WHERE clause makes use of a sub-select query when reading from an indexed dataset. Suppose you've got a big dataset that's indexed on one field (e.g., CustomerID). You need to retrieve a record for a particular customer ID (e.g., 42). The query below will run lightning fast: SELECT * FROM BigDataSet WHERE CustomerID IN (42); But if you replace the number 42 with a subquery that resolves to this same value of 42, the query runs much more slowly. Important: this occurs even if the subquery is reading from a tiny dataset. Example: DATA TinyDataSet; CustomerID = 42; RUN; PROC SQL; SELECT * FROM BigDataSet WHERE CustomerID IN (SELECT CustomerID From TinyDataSet); QUIT; The attached code, which generates a large play dataset, demonstrates the issue. You really only notice this speed difference if you're working with big data (e.g., 50,000,000+ observations). Has anyone ever figured out why this is the case? Indexing is a really useful feature when working with big data, but it seems the use of a sub-query in the WHERE clause causes SAS to get mixed up and perhaps not make use of the index. I also work with SQL Server and Oracle databases and have not encountered this issue on those platforms. You can run the attached code and you'll immediately see the issue. For example, the query with the hard-coded WHERE clauses takes 0.01 seconds to run on my Windows 10 Dell desktop, while the query with the sub-select takes nearly 5 seconds. Thanks in advance for any insights you can share! *------------------------------------------------------------------*;
data MyBigDataset (index = (CustomerID));
do CustomerID = 1 to 50000000;
output;
end;
run;
*------------------------------------------------------------------*;
*FAST QUERY;
proc sql;
select * from MyBigDataset where CustomerID in (42);
quit;
*------------------------------------------------------------------*;
*SLOW QUERY;
data MySubsetOfIDs;
CustomerID = 42;
run;
proc sql;
select *
from MyBigDataset
where CustomerID in (select CustomerID from MySubsetOfIDs);
quit;
*------------------------------------------------------------------*;
... View more