DATA Step, Macro, Functions and more

Proc sql join performance with Index

Reply
Occasional Contributor
Posts: 8

Proc sql join performance with Index

I have two datasets - Big (100 million rows) and small (200,000 rows). The Big dataset is indexed by a unique key.

 

I have two queries in sequence doing a left join from the small to the big dataset which looks like this:

 

 

proc sql;
select *
from a left join b
on a.key=b.key;
quit;

The first time the query runs for 20-30 minutes, but the second time takes only 2-3 minutes. Why does this happen? Is it possible to do something to reduce the runtime for the first query as well?

Super Contributor
Posts: 345

Re: Proc sql join performance with Index

Posted in reply to set_all__

The big dataset does need that much time to run.

 

Below is some information might help.

 

http://analytics.ncsu.edu/sesug/2011/BB08.Lafler.pdf

 

 

There are a number of common reasons for slow-running queries and updates:

  • Slow network communication.

  • Inadequate memory in the server computer, or not enough memory available .

  • Lack of useful statistics

  • Lack of useful indexes.

  • Lack of useful indexed views.

  • Lack of useful data striping.

  • Lack of useful partitioning.

Super User
Posts: 3,257

Re: Proc sql join performance with Index

Posted in reply to set_all__

The most likely reason for the faster second join is that the data has been cached by your server hardware / OS. You could prove this by swapping the queries around, if possible, to see a similar effect.

 

A faster solution would be to use a SAS format built from the small table to do the data lookup. That way you avoid joining entirely. 

Super Contributor
Posts: 441

Re: Proc sql join performance with Index

Posted in reply to set_all__

Your "small" dataset was probably found in the filesystem cache the second time. A normal behaviour that speeds up access to frequently used files by keeping them in memory.

Super User
Posts: 5,434

Re: Proc sql join performance with Index

Posted in reply to set_all__
Before jumping to any conclusions you need more information.
Add
Options msglevel =i fullstimer;
This will give more information about basic processing and systems constraints.

Proc SQL _method;
Will give you the SQL plan.
Data never sleeps
Ask a Question
Discussion stats
  • 4 replies
  • 467 views
  • 0 likes
  • 5 in conversation