Re: Proc sql join performance with Index

set_all__ · Posted 04-12-2017 03:39 PM

I have two datasets - Big (100 million rows) and small (200,000 rows). The Big dataset is indexed by a unique key.

I have two queries in sequence doing a left join from the small to the big dataset which looks like this:

proc sql;
select *
from a left join b
on a.key=b.key;
quit;

The first time the query runs for 20-30 minutes, but the second time takes only 2-3 minutes. Why does this happen? Is it possible to do something to reduce the runtime for the first query as well?

Bal23 · Posted 04-12-2017 04:29 PM

The big dataset does need that much time to run.

Below is some information might help.

http://analytics.ncsu.edu/sesug/2011/BB08.Lafler.pdf

There are a number of common reasons for slow-running queries and updates:

Slow network communication.
Inadequate memory in the server computer, or not enough memory available .
Lack of useful statistics
Lack of useful indexes.
Lack of useful indexed views.
Lack of useful data striping.
Lack of useful partitioning.

SASKiwi · Posted 04-12-2017 04:33 PM

The most likely reason for the faster second join is that the data has been cached by your server hardware / OS. You could prove this by swapping the queries around, if possible, to see a similar effect.

A faster solution would be to use a SAS format built from the small table to do the data lookup. That way you avoid joining entirely.

jklaverstijn · Posted 04-12-2017 04:35 PM

Your "small" dataset was probably found in the filesystem cache the second time. A normal behaviour that speeds up access to frequently used files by keeping them in memory.

LinusH · Posted 04-12-2017 11:16 PM

Before jumping to any conclusions you need more information.
Add
Options msglevel =i fullstimer;
This will give more information about basic processing and systems constraints.

Proc SQL _method;
Will give you the SQL plan.

Data never sleeps

Proc sql join performance with Index