04-12-2017 03:39 PM
I have two datasets - Big (100 million rows) and small (200,000 rows). The Big dataset is indexed by a unique key.
Twice in sequence, I run a left join from the small dataset to the big one, which looks like this:
proc sql; select * from a left join b on a.key=b.key; quit;
The first time, the query runs for 20-30 minutes, but the second time takes only 2-3 minutes. Why does this happen? Is it possible to do something to reduce the runtime of the first run as well?
04-12-2017 04:29 PM
The big dataset really does need that much time.
Below is some information that might help.
There are a number of common reasons for slow-running queries and updates:
Slow network communication.
Inadequate memory in the server computer, or not enough memory available.
Lack of useful statistics.
Lack of useful indexes.
Lack of useful indexed views.
Lack of useful data striping.
Lack of useful partitioning.
04-12-2017 04:33 PM
The most likely reason the second join is faster is that the data has been cached by your server hardware/OS. You could verify this by swapping the order of the two runs, if possible, and watching for the same effect.
A faster solution would be to use a SAS format built from the small table to do the data lookup. That way you avoid joining entirely.
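A minimal sketch of that format-based lookup, assuming the small table is called SMALL with a character key KEY and a lookup column VALUE, and the big table is BIG (adjust the dataset and variable names to your data; for a numeric key you would drop the $ and set TYPE='N'):

```sas
/* 1. Build a CNTLIN dataset from the small table and create a format.
      HLO='O' adds an OTHER range so unmatched keys return blank,
      mimicking a left join. */
data fmt;
    set small(rename=(key=start value=label)) end=last;
    retain fmtname '$keyfmt' type 'C';
    output;
    if last then do;
        hlo = 'O';
        label = ' ';
        output;
    end;
run;

proc format cntlin=fmt;
run;

/* 2. Read the big table once and look up the value via the format. */
data joined;
    set big;
    value = put(key, $keyfmt.);
run;
```

This replaces the SQL join with a single sequential pass over the big table, which avoids the sort/hash work of the join and is typically much faster when one side is small enough to fit in a format.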
04-12-2017 04:35 PM
Your "small" dataset was probably found in the filesystem cache the second time. This is normal behaviour: the OS speeds up access to frequently used files by keeping them in memory.
04-12-2017 11:16 PM