This question has been discussed before but I can't find that it was ever resolved:
The issue: PROC SQL seems to have some sort of bug that causes performance degradation when a WHERE clause makes use of a sub-select query when reading from an indexed dataset.
Suppose you've got a big dataset that's indexed on one field (e.g., CustomerID). You need to retrieve a record for a particular customer ID (e.g., 42). The query below will run lightning fast:
SELECT * FROM BigDataSet WHERE CustomerID IN (42);
But if you replace the number 42 with a subquery that resolves to this same value of 42, the query runs much more slowly. Important: this occurs even if the subquery is reading from a tiny dataset. Example:
DATA TinyDataSet;
CustomerID = 42;
RUN;
PROC SQL;
SELECT *
FROM BigDataSet
WHERE CustomerID IN (SELECT CustomerID From TinyDataSet);
QUIT;
The attached code, which generates a large play dataset, demonstrates the issue. You really only notice this speed difference if you're working with big data (e.g., 50,000,000+ observations).
Has anyone ever figured out why this is the case? Indexing is a really useful feature when working with big data, but it seems the use of a sub-query in the WHERE clause causes SAS to get mixed up and perhaps not make use of the index. I also work with SQL Server and Oracle databases and have not encountered this issue on those platforms.
You can run the attached code and you'll immediately see the issue. For example, the query with the hard-coded WHERE clauses takes 0.01 seconds to run on my Windows 10 Dell desktop, while the query with the sub-select takes nearly 5 seconds.
Thanks in advance for any insights you can share!
*------------------------------------------------------------------*;
data MyBigDataset (index = (CustomerID));
do CustomerID = 1 to 50000000;
output;
end;
run;
*------------------------------------------------------------------*;
*FAST QUERY;
proc sql;
select * from MyBigDataset where CustomerID in (42);
quit;
*------------------------------------------------------------------*;
*SLOW QUERY;
data MySubsetOfIDs;
CustomerID = 42;
run;
proc sql;
select *
from MyBigDataset
where CustomerID in (select CustomerID from MySubsetOfIDs);
quit;
*------------------------------------------------------------------*;
Yet,
proc sql;
select MyBigDataset.*
from
MyBigDataset inner join
MySubsetOfIDs
on MyBigDataset.CustomerID = MySubsetOfIDs.CustomerID;
quit;
proc sql;
select MyBigDataset.*
from
MyBigDataset inner join
(select CustomerID from MySubsetOfIDs) as b
on MyBigDataset.CustomerID = b.CustomerID;
quit;
are very fast!
Yes, thank you, the INNER JOIN does indeed run very quickly. So SAS clearly makes efficient use of the dataset's index when doing the join. But apparently it does not make efficient use of the index when you use a sub-query in a WHERE clause.
Because sub-queries like the one in my example are used so frequently, it would be great to know why this is the case.
Thanks again for your response.
That does not sound surprising.
Did you try just converting to an inner join?
What happens if you add a similar index to the small dataset?
Thanks for your response.
Yes, I did try adding an index to the small dataset used in the sub-query. The result was that it quadrupled the time required for the query to run (and yes, this a very strange result, but you can try it with the sample of code I provided. Just change the statement data MySubsetOfIDs to data MySubsetOfIDs (index=(CustomerID)). This suggests a bug in SAS's implementation of simple indexes.
And yes, converting to an INNER JOIN gets around the issue, but the question remains as to why sub-queries (which are used so very often in SQL) should so negatively impact query performance in big data applications.
Thanks again.
Thanks for your reply. Not sure if SAS developers would agree with your assessment. They've invested much effort into supporting subqueries (for example: https://go.documentation.sas.com/?docsetId=casfedsql&docsetTarget=n07d1sue0iwb3xn1dgye2wae60z7.htm&d...).
All that aside, if we go back to my simple code example, consider two subquery approaches:
1. proc sql;
select CustomerID into :ID_list separated by "," from MySubsetOfIDs;
select * from MyBigDataset where CustomerID in (&ID_List);
quit;
2. proc sql;
select * from MyBigDataset where CustomerID in
(select CustomerID from MySubsetOfIDs);
quit;
My question reduces to this: Why should the code in #1 run much faster than the code in #2? (which is indeed the case).
The format of those queries is totally different. In one you have a hard coded list of values. So passing those values to the index is an obvious way to process them. In the other you do not necessarily know when compiling the query what values to pass to the index or even how many values need to be passed. When you format it as an inner join then it is easier to see that you should go through the small dataset and check for matches in the large dataset. I suspect that there are SQL implementations that can tell that the sub-query and inner join are actually equivalent. Apparently SAS cannot.
@fsdfsd wrote:
Thanks for your reply. Not sure if SAS developers would agree with your assessment. They've invested much effort into supporting subqueries (for example: https://go.documentation.sas.com/?docsetId=casfedsql&docsetTarget=n07d1sue0iwb3xn1dgye2wae60z7.htm&d...).
All that aside, if we go back to my simple code example, consider two subquery approaches:
1. proc sql;
select CustomerID into :ID_list separated by "," from MySubsetOfIDs;
select * from MyBigDataset where CustomerID in (&ID_List);
quit;
2. proc sql;
select * from MyBigDataset where CustomerID in(select CustomerID from MySubsetOfIDs);
quit;
My question reduces to this: Why should the code in #1 run much faster than the code in #2? (which is indeed the case).
My guess is that the basic subquery is executed for each record of MYBigDataset. Since a sub query might include additional subqueries I think that sort of makes sense.
Try placing the values of custormerid from Mysubsetofids into a single macro variable and use that instead of the subquery. I suspect you'll see a significant different in time as well since the comparison will be to a fixed list of items.
Yes, thanks, I do know about the macro variable approach ( see msg #7 in this thread. In that messsage, I asked why that approach runs so much faster than a sub-query). And I'm pretty certain that a subquery is not executed every time a record is read from the large dataset. That would be terribly inefficient and would be radically different from every other SQL implementation I've worked with.
Thanks.
Your link is for FedSql doc. Have you tested the situation there?
Here are a number of scenarios so you can compare the performances.
284 data MyBigDataset (index = (CustomerID)) NoIndex; 285 do CustomerID = 1 to 50000000; 286 output; 287 end; 288 run; NOTE: The data set WORK.MYBIGDATASET has 50000000 observations and 1 variables. INFO: Multiple concurrent threads will be used to create the index. NOTE: Simple index CustomerID has been defined. NOTE: The data set WORK.NOINDEX has 50000000 observations and 1 variables. NOTE: DATA statement used (Total process time): real time 13.86 seconds cpu time 24.27 seconds 289 290 proc sort data=noindex out=sorted; 291 by customerid ; 292 run; NOTE: There were 50000000 observations read from the data set WORK.NOINDEX. NOTE: SAS threaded sort was used. NOTE: The data set WORK.SORTED has 50000000 observations and 1 variables. NOTE: PROCEDURE SORT used (Total process time): real time 5.69 seconds cpu time 16.54 seconds 293 294 data MySubsetOfIDs; 295 CustomerID = 42; 296 run; NOTE: The data set WORK.MYSUBSETOFIDS has 1 observations and 1 variables. NOTE: DATA statement used (Total process time): real time 0.01 seconds cpu time 0.00 seconds 297 298 proc sql; 299 create table test0 as 300 select * 301 from MyBigDataset a 302 where a.CustomerID in (42) 303 ; INFO: Index CustomerID selected for WHERE clause optimization. NOTE: Table WORK.TEST0 created, with 1 rows and 1 columns. 304 quit; NOTE: PROCEDURE SQL used (Total process time): real time 0.17 seconds cpu time 0.00 seconds 305 306 proc sql; 307 create table test1 as 308 select * 309 from MyBigDataset a 310 where a.CustomerID in (select CustomerID from MySubsetOfIDs) 311 ; NOTE: Table WORK.TEST1 created, with 1 rows and 1 columns. 312 quit; NOTE: PROCEDURE SQL used (Total process time): real time 6.26 seconds cpu time 6.08 seconds 313 314 proc sql; 315 create table test2 as 316 select a.* 317 from MyBigDataset a 318 inner join MySubsetOfIDs b 319 on a.customerid = b.customerid 320 ; INFO: Index CustomerID of SQL table WORK.MYBIGDATASET (alias = A) selected for SQL WHERE clause (join) optimization. NOTE: Table WORK.TEST2 created, with 1 rows and 1 columns. 321 quit; NOTE: PROCEDURE SQL used (Total process time): real time 0.23 seconds cpu time 0.00 seconds 322 323 proc sql; 324 create table test0 as 325 select * 326 from NoIndex a 327 where a.CustomerID in (42) 328 ; NOTE: Table WORK.TEST0 created, with 1 rows and 1 columns. 329 quit; NOTE: PROCEDURE SQL used (Total process time): real time 1.10 seconds cpu time 1.05 seconds 330 331 proc sql; 332 create table test4 as 333 select * 334 from NoIndex a 335 where a.CustomerID in (select CustomerID from MySubsetOfIDs) 336 ; NOTE: Table WORK.TEST4 created, with 1 rows and 1 columns. 337 quit; NOTE: PROCEDURE SQL used (Total process time): real time 6.23 seconds cpu time 6.03 seconds 338 339 proc sql; 340 create table test5 as 341 select a.* 342 from NoIndex a 343 inner join MySubsetOfIDs b 344 on a.customerid = b.customerid 345 ; NOTE: Table WORK.TEST5 created, with 1 rows and 1 columns. 346 quit; NOTE: PROCEDURE SQL used (Total process time): real time 3.51 seconds cpu time 3.33 seconds 347 348 349 proc sql; 350 create table test6 as 351 select * 352 from Sorted a 353 where a.CustomerID in (42) 354 ; NOTE: Table WORK.TEST6 created, with 1 rows and 1 columns. 355 quit; NOTE: PROCEDURE SQL used (Total process time): real time 0.18 seconds cpu time 0.00 seconds 356 357 proc sql; 358 create table test7 as 359 select * 360 from Sorted a 361 where a.CustomerID in (select CustomerID from MySubsetOfIDs) 362 ; NOTE: Table WORK.TEST7 created, with 1 rows and 1 columns. 363 quit; NOTE: PROCEDURE SQL used (Total process time): real time 6.21 seconds cpu time 6.10 seconds 364 365 proc sql; 366 create table test8 as 367 select a.* 368 from Sorted a 369 inner join MySubsetOfIDs b 370 on a.customerid = b.customerid 371 ; NOTE: Table WORK.TEST8 created, with 1 rows and 1 columns. 372 quit; NOTE: PROCEDURE SQL used (Total process time): real time 0.13 seconds cpu time 0.00 seconds 373 data test9; 374 merge sorted MySubsetOfIDs(in=in1); 375 by customerid; 376 if in1; 377 run; NOTE: There were 50000000 observations read from the data set WORK.SORTED. NOTE: There were 1 observations read from the data set WORK.MYSUBSETOFIDS. NOTE: The data set WORK.TEST9 has 1 observations and 1 variables. NOTE: DATA statement used (Total process time): real time 4.30 seconds cpu time 4.30 seconds 378 data test10; 379 merge MYBIGDATASET MySubsetOfIDs(in=in1); 380 by customerid; 381 if in1; 382 run; INFO: Index CustomerID selected for BY clause processing. NOTE: There were 50000000 observations read from the data set WORK.MYBIGDATASET. NOTE: There were 1 observations read from the data set WORK.MYSUBSETOFIDS. NOTE: The data set WORK.TEST10 has 1 observations and 1 variables. NOTE: DATA statement used (Total process time): real time 17.69 seconds cpu time 17.66 seconds 383 data test11; 384 set MySubsetOfIDs ; 385 set MYBIGDATASET key=customerid; 386 if not _iorc_ then output; 387 _error_=0; 388 run; NOTE: There were 1 observations read from the data set WORK.MYSUBSETOFIDS. NOTE: The data set WORK.TEST11 has 1 observations and 1 variables. NOTE: DATA statement used (Total process time): real time 0.23 seconds cpu time 0.01 seconds 389 data test12; 390 set MYBIGDATASET ; 391 where customerid in (42); INFO: Index CustomerID selected for WHERE clause optimization. 392 run; NOTE: There were 1 observations read from the data set WORK.MYBIGDATASET. WHERE customerid=42; NOTE: The data set WORK.TEST12 has 1 observations and 1 variables. NOTE: DATA statement used (Total process time): real time 0.09 seconds cpu time 0.00 seconds
Great post everyone. If I were to summarise the findings it would be that SQL compilers in different products don't necessarily perform equally. There will be some types of queries that will perform better in one product than another. This doesn't come as a surprise to me and most likely to the other posters as well. There are many occasions when I've had to modify my query approach to improve performance, regardless of the SQL language / product I was using.
@fsdfsd - I'd suggest you track your sub-query performance issue to SAS Tech Support and add their response to this post. That way we will all keep learning new stuff together which to me is the great benefit of SAS Community involvement!
I will submit this as a question to SAS's support team and post their response here. Thanks.
Are you ready for the spotlight? We're accepting content ideas for SAS Innovate 2025 to be held May 6-9 in Orlando, FL. The call is open until September 25. Read more here about why you should contribute and what is in it for you!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.