SAS(R) 9.3 Language Reference: Concepts, Second Edition

Doyleuk · Posted 01-17-2013 10:33 AM

Hi,

Im looking to understand why a merge statement is not utilising an index in a merge statement.

I have a dataset (3300 obs) which has been sorted by key field "Account_ID" and then merging to master file (89382152 obs) which has index on Account_ID already created.

Data Temp;

Merge Data1 (in=a)

Master (in=b);

by account_Id;

if a;

run;

The merge statement is taking rough 1 hour.

I then run an SQL statement :

Proc Sql;

Create Table Temp as

select *

from Data1 as a

left join

master as b

on a.account_id = b.account_id;

quit;

I have included the options msglevel=I at the beginning of my code and only the SQL statement usilises the pre defined index. Why doesnt merge utilise it?

Any help would be greatly appreciated in understanding why this occurs and potential resolution.

Regards

Daniel

LinusH · Posted 01-17-2013 11:09 AM

Since it's just doesn't, ever.

Is there an isue of using SQL instead?

Data never sleeps

Doyleuk · Posted 01-17-2013 01:22 PM

The server administrators are not keen on us using SQL as the utility files that are built when using SQL can use quite a lot of resources.

I need to discuss furhter with them and find out how much is uses when using datasets which have indices.

RobPatelsky · Posted 01-17-2013 02:02 PM

Create a small subset of the master dataset.

The set statement has an POINT= option to use direct (and not sequential) access against the Master dataset. See http://support.sas.com/documentation/cdl/en/lestmtsref/63323/HTML/default/viewer.htm#p00hxg3x8lwivcn...

This feature will use the index you constructed.

RobPatelsky · Posted 01-17-2013 11:53 AM

A BY statement does not use an index in these situations if the data file is physically stored in sorted order based on the variables specified in the BY statement.

This is in the SAS documentation.

Doyleuk · Posted 01-17-2013 01:18 PM

Can you supply the link as I've tried searching and cannot find what I am looking for.

Is there any options that can be forced upon it to use an index.

Astounding · Posted 01-17-2013 01:36 PM

Yes you can force MERGE to use the index, but why would you want to do that? Your program will just take forever to run that way, and will produce the same result. The default action is to check whether the data is sorted, and if it is then rely on the sorted order. If it's not, then use the index. Why would you want to change that behavior? If you want to test, try:

data temp;

merge Data1 (in=a)

Master (in=b sortedby=some_other_variable_in_master);

by account_id;

if a;

run;

Retrieving an entire data set using an index will take a long time ... Ian Whitlock once wrote a related paper titled something like "Why Did This Code Take 24 Hours to Run?"

Good luck.

RobPatelsky · Posted 01-17-2013 01:37 PM

Documentation link is http://support.sas.com/documentation/cdl/en/lrcon/65287/HTML/default/viewer.htm#n06cy7dznbx6gen1q9ma...

This is documented in

SAS(R) 9.3 Language Reference: Concepts, Second Edition

section is:

Understanding SAS Indexes

towards the bottom of this section is

Using an Index for BY Processing

Tom · Posted 01-17-2013 03:18 PM

Use the KEY= option to lookup in the master database.

data want;

set data1 ;

set master key=account_id;

if _error_ then do;

* no match found, might need to use call missing to clear variables retreived from MASTER ;

end;

* Reset _ERROR_ to prevent notes in SAS log ;

_error_=0;

run;

RobPatelsky · Posted 01-17-2013 03:27 PM

I stand corrected by Tom - key is the correct SET option to use.

Doyleuk · Posted 01-24-2013 09:55 AM

Many Thanks to all those who replied.

Tom using the key option. This looks to be what I am after. However I notice using this it will only pull out the a single observation from the master table. Is there a what to pull out multiple obs from this dataset.

If there is an account_id with 5 obs in the master, Can I pull out all 5 obs and match that to the one observation from the data1 dataset.

Cheers

LinusH · Posted 01-24-2013 10:32 AM

Well let's sort this out a bit. Indexes are mainly used for to purposes:

Subseting (including join optimization)
Sorting

Generally, sorting is NOT recommended for large tables, as stated by Astounding. It will indeed take very long time.

So, most merging/joining needs to in some way have the input tables sorted in some way.

SQL handles this by sorting chunks of the input data, and sometimes it the SQL planners can do this smart enough the whole tables doesn't need sorting. And in some cases indexes can help optimize choosing which chunks should be read.

So, you are likely to have te data sorted some time during the process, and SQL have a spill file for this. Can't see taht this is worse than a temporary sorted table in saswork, or a other SAS files swapped to saswork. You could minimize this spill file by increasing the MEMSIZE and SORTSIZE to utilize as much as possible of RAM (and minimizing the need for spilling/swapping data to disk).

About KEY=. It's intended use is to do a single record look-up. Maybe you can tweak this by some odd data step programming, but I shouldn't recommend it.

Data never sleeps

Tom · Posted 01-24-2013 10:45 AM

Just put it in a DO loop.

data want;

set data1 ;

do until (_error_) ;

set master key=account_id;

if not _error_ then output;

else do;

* no match found, might need to use call missing to clear variables retreived from MASTER ;

end;

* Reset _ERROR_ to prevent notes in SAS log ;

_error_=0;

run;

Astounding · Posted 01-24-2013 12:39 PM

Daniel,

It sounds like your original MERGE step gets you the right answer, but takes a while to run. You are hoping that using an index would speed things up.

If that's an accurate picture of what you are trying to do, there are some possibilities that don't involve an index. Answers to these questions would help:

Does Data1 ever contain multiple records for the same account_id?

Is every account_id in Data1 guaranteed to appear in Master?

For some approaches: If account_id is character, what is its length?

Is Master both sorted by account_id and has an index on account_id?

Does this sort of merge occur many times for the same version of Master?

Depending on your answers, significant savings may be possible.

DaveBirch · Posted 01-24-2013 09:17 PM

Daniel,

As stated by Astounding, you can force a merge statement to use an index. However, the merge statement always attempts to read every observation in each dataset. As a rule of thumb, this takes about 5 times as long as a simple sequential read. Consider the following:

data temp;

merge Data1 (in=A)

Master (in=B sortedby=_null_);

by account_id;

if not A;

run;

Clearly, the value of account_id in Data1 is not used to determine which record(s) to read in Master. It's just a sequential match/merge.

So, if you want to use the value of account_id in Data1 to determine which record(s) to read in Master, then Tom's answer is the correct solution. Or, using SQL - you might try:

options fullstimer;

proc sql;

create table temp as

select select A.*, B.* /* better to explicitly list desired variables */

from Data1 as A

inner join Master(sortedby=_null_) as B

on A.account_id eq B.account_id;

quit;

to test whether the server administrators concerns about resources are valid. (I doubt it.)

Merge Versus SQL - Utilising Index

Re: Merge Versus SQL - Utilising Index

Re: Merge Versus SQL - Utilising Index

Re: Merge Versus SQL - Utilising Index

Re: Merge Versus SQL - Utilising Index

Re: Merge Versus SQL - Utilising Index

Re: Merge Versus SQL - Utilising Index

Re: Merge Versus SQL - Utilising Index

SAS(R) 9.3 Language Reference: Concepts, Second Edition

Understanding SAS Indexes

Using an Index for BY Processing

Re: Merge Versus SQL - Utilising Index

Re: Merge Versus SQL - Utilising Index

Re: Merge Versus SQL - Utilising Index

Re: Merge Versus SQL - Utilising Index

Re: Merge Versus SQL - Utilising Index

Re: Merge Versus SQL - Utilising Index

Re: Merge Versus SQL - Utilising Index

Catch up on SAS Innovate 2026

SAS(R) 9.3 Language Reference: Concepts, Second Edition

Understanding SAS Indexes

Using an Index for BY Processing

Catch up on SAS Innovate 2026

SAS Training: Just a Click Away