BookmarkSubscribeRSS Feed
Doyleuk
Calcite | Level 5

Hi,

Im looking to understand why a merge statement is not utilising an index in a merge statement.

I have a dataset (3300 obs) which has been sorted by key field "Account_ID" and then merging to master file (89382152 obs) which has index on Account_ID already created.

Data Temp;

Merge Data1 (in=a)

          Master (in=b);

     by account_Id;

if a;

run;

The merge statement is taking rough 1 hour.

I then run an SQL statement :

Proc Sql;

Create Table Temp as

select *

from Data1 as a

          left join

        master as b

on a.account_id = b.account_id;

quit;

I have included the options msglevel=I at the beginning of my code and only the SQL statement usilises the pre defined index.  Why doesnt merge utilise it?

Any help would be greatly appreciated in understanding why this occurs and potential resolution.

Regards

Daniel

15 REPLIES 15
LinusH
Tourmaline | Level 20

Since it's just doesn't, ever.

Is there an isue of using SQL instead?

Data never sleeps
Doyleuk
Calcite | Level 5


The server administrators are not keen on us using SQL as the utility files that are built when using SQL can use quite a lot of resources.

I need to discuss furhter with them and find out how much is uses when using datasets which have indices.

RobPatelsky
Calcite | Level 5

Create a small subset of the master dataset. 

The set statement has an POINT= option to use direct (and not sequential) access against the Master dataset. See http://support.sas.com/documentation/cdl/en/lestmtsref/63323/HTML/default/viewer.htm#p00hxg3x8lwivcn...

This feature will use the index you constructed.

RobPatelsky
Calcite | Level 5

A  BY statement does not use an index in these situations if the data file is physically stored in sorted order based on the variables specified in the BY statement.

This is in the SAS documentation.

Doyleuk
Calcite | Level 5

Can you supply the link as I've tried searching and cannot find what I am looking for.

Is there any options that can be forced upon it to use an index.

Astounding
PROC Star

Yes you can force MERGE to use the index, but why would you want to do that?  Your program will just take forever to run that way, and will produce the same result.  The default action is to check whether the data is sorted, and if it is then rely on the sorted order.  If it's not, then use the index.  Why would you want to change that behavior?  If you want to test, try:

data temp;

   merge Data1 (in=a)

             Master (in=b sortedby=some_other_variable_in_master);

   by account_id;

   if a;

run;

Retrieving an entire data set using an index will take a long time ... Ian Whitlock once wrote a related paper titled something like "Why Did This Code Take 24 Hours to Run?"

Good luck.

RobPatelsky
Calcite | Level 5

Documentation link is http://support.sas.com/documentation/cdl/en/lrcon/65287/HTML/default/viewer.htm#n06cy7dznbx6gen1q9ma...

This is documented in

SAS(R) 9.3 Language Reference: Concepts, Second Edition

section is:

Understanding SAS Indexes

towards the bottom of this section is

Using an Index for BY Processing

Tom
Super User Tom
Super User

Use the KEY= option to lookup in the master database.

data want;

  set data1 ;

  set master key=account_id;

  if _error_ then do;

     * no match found, might need to use call missing to clear variables retreived from MASTER ;

  end;

   * Reset _ERROR_ to prevent notes in SAS log ;

  _error_=0;

run;

RobPatelsky
Calcite | Level 5

I stand corrected by Tom - key is the correct SET option to use. 

Doyleuk
Calcite | Level 5

Many Thanks to all those who replied.

Tom using the key option. This looks to be what I am after. However I notice using this it will only pull out the a single observation from the master table.  Is there a what to pull out multiple obs from this dataset.

If there is an account_id with 5 obs in the master, Can I pull out all 5 obs and match that to the one observation from the data1 dataset.

Cheers

LinusH
Tourmaline | Level 20

Well let's sort this out a bit. Indexes are mainly used for to purposes:

  • Subseting (including join optimization)
  • Sorting

Generally, sorting is NOT recommended for large tables, as stated by Astounding. It will indeed take very long time.

So, most merging/joining needs to in some way have the input tables sorted in some way.

SQL handles this by sorting chunks of the input data, and sometimes it the SQL planners can do this smart enough the whole tables doesn't need sorting. And in some cases indexes can help optimize choosing which chunks should be read.

So, you are likely to have te data sorted some time during the process, and SQL have a spill file for this. Can't see taht this is worse than a temporary sorted table in saswork, or a other SAS files swapped to saswork. You could minimize this spill file by increasing the MEMSIZE and SORTSIZE to utilize as much as possible of RAM (and minimizing the need for spilling/swapping data to disk).

About KEY=. It's intended use is to do a single record look-up. Maybe you can tweak this by some odd data step programming, but I shouldn't recommend it.

Data never sleeps
Tom
Super User Tom
Super User

Just put it in a DO loop.

data want;

  set data1 ;

  do until (_error_) ;

    set master key=account_id;

    if not _error_ then output;

    else do;

     * no match found, might need to use call missing to clear variables retreived from MASTER ;

    end;

  end;

   * Reset _ERROR_ to prevent notes in SAS log ;

  _error_=0;

run;

Astounding
PROC Star

Daniel,

It sounds like your original MERGE step gets you the right answer, but takes a while to run.  You are hoping that using an index would speed things up.

If that's an accurate picture of what you are trying to do, there are some possibilities that don't involve an index.  Answers to these questions would help:

Does Data1 ever contain multiple records for the same account_id?

Is every account_id in Data1 guaranteed to appear in Master?

For some approaches:  If account_id is character, what is its length?

Is Master both sorted by account_id and has an index on account_id?

Does this sort of merge occur many times for the same version of Master?

Depending on your answers, significant savings may be possible.

DaveBirch
Obsidian | Level 7

Daniel,

As stated by Astounding, you can force a merge statement to use an index.  However, the merge statement always attempts to read every observation in each dataset.  As a rule of thumb, this takes about 5 times as long as a simple sequential read.  Consider the following:

data temp;

   merge Data1 (in=A)

             Master (in=B sortedby=_null_);

   by account_id;

   if not A;

run;

Clearly, the value of account_id in Data1 is not used to determine which record(s) to read in Master.  It's just a sequential match/merge.

So, if you want to use the value of account_id in Data1 to determine which record(s) to read in Master, then Tom's answer is the correct solution.  Or, using SQL - you might try:

options fullstimer;

proc sql;

  create table temp as

  select select A.*, B.*  /* better to explicitly list desired variables */

  from Data1 as A

  inner join Master(sortedby=_null_) as B

    on A.account_id eq B.account_id;

quit;

to test whether the server administrators concerns about resources are valid.  (I doubt it.)

SAS Innovate 2025: Save the Date

 SAS Innovate 2025 is scheduled for May 6-9 in Orlando, FL. Sign up to be first to learn about the agenda and registration!

Save the date!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

SAS Training: Just a Click Away

 Ready to level-up your skills? Choose your own adventure.

Browse our catalog!

Discussion stats
  • 15 replies
  • 2686 views
  • 0 likes
  • 6 in conversation