Hello,
Is it better to index data sets before merging with a Data step?
Is it better to index data sets before merging with Proc SQL?
Let's say that I want to do an inner join of SASHELP.prdsal2 with SASHELP.us_data.
The merge will be done by the state field from SASHELP.prdsal2 and the statename field from SASHELP.us_data.
How can I know if data set SASHELP.prdsal2 is already indexed?
How can I know if data set SASHELP.us_data is already indexed?
@Ronein wrote:
Hello,
Is it better to index data sets before merging with a Data step?
Is it better to index data sets before merging with Proc SQL?
Let's say that I want to do an inner join of SASHELP.prdsal2 with SASHELP.us_data.
Better than what? Not creating an index? It depends on data set size: for smaller data sets it's not worth creating an index, and creating the index may take more time than the merge itself.
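For reference, if you did decide to create an index, here is a minimal sketch. It works on a WORK copy (SASHELP is read-only and should not be modified), and the simple index on STATE is just the merge key from this thread:

/* copy the table first; never index SASHELP itself */
proc copy in=sashelp out=work;
   select prdsal2;
run;

/* create a simple index on the merge key */
proc datasets library=work nolist;
   modify prdsal2;
   index create state;
quit;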
@Ronein wrote:
How can I know if data set SASHELP.prdsal2 is already indexed?
How can I know if data set SASHELP.us_data is already indexed?
Run a PROC CONTENTS on the data set and examine the output. One of the items displayed is whether any indexes exist. You can also query SASHELP.VTABLE to see if any indexes are present.
The CONTENTS Procedure

Data Set Name        SASHELP.PRDSAL2          Observations           23040
Member Type          DATA                     Variables              11
Engine               V9                       Indexes                0
Created              10/24/2018 20:14:37      Observation Length     120
Last Modified        10/24/2018 20:14:37      Deleted Observations   0
Protection                                    Compressed             NO
Data Set Type                                 Sorted                 YES
Label                Furniture sales data
Data Representation  SOLARIS_X86_64, LINUX_X86_64, ALPHA_TRU64, LINUX_IA64
Encoding             us-ascii  ASCII (ANSI)
Or query the SASHELP.VTABLE or SASHELP.VINDEX views.
/* check the Indexes entry in the PROC CONTENTS output */
proc contents data=sashelp.prdsal2;
run;

/* or list any indexes defined on the table via the VINDEX view */
proc sql;
   create table want as
   select *
   from sashelp.vindex
   where libname='SASHELP'
     and memname='PRDSAL2';
quit;
"Better" would need some description on what is considered "better". Indexing takes resources. With large data sets it may take a fair amount of time to add the index(s). With small sets the performance difference of indexed vs non-indexed may be trivial.
You can tell if data sets are indexed with Proc Contents, which reports the number of indexes, with Dictionary.Tables ("Type of Indexes"), or with Dictionary.Indexes, which has the name of each index and more information about it.
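A minimal sketch of both dictionary queries (the WHERE values are just the example table from this thread):

proc sql;
   /* Type of Indexes for the table as a whole */
   select libname, memname, indxtype
   from dictionary.tables
   where libname='SASHELP' and memname='PRDSAL2';

   /* one row per indexed variable, with the index name */
   select libname, memname, name, indxname, idxusage
   from dictionary.indexes
   where libname='SASHELP' and memname='PRDSAL2';
quit;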
SASHELP datasets are too small for indexes to be of any use. Also it is good policy not to tamper with the original SASHELP datasets so they work the same for all users.
If you are doing a join with SQL, you don't need to sort by state first.
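For example, a minimal sketch of the inner join from the question, assuming STATE and STATENAME are the key variables as described (the UPCASE is only a precaution in case the values differ in case):

proc sql;
   create table state_join as
   select a.*, b.*
   from sashelp.prdsal2 as a
        inner join
        sashelp.us_data  as b
        on upcase(a.state) = upcase(b.statename);
quit;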
I'll address just one of the ignored points here. When would you index a data set before a DATA step MERGE? Absolutely never, unless you have no choice.
The MERGE will be horribly inefficient if you use an index to read the observations in order. Ian Whitlock once wrote a Coder's Corner paper along the lines of "Why Did This Code Take 24 Hours to Run?" I can give you the technical details of why this is true if you are interested. Just sort your data, then merge it.
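A minimal sketch of that sort-then-merge pattern for the tables in this thread (the RENAME= only lines up the two differently named key variables; the WORK data set names are illustrative):

proc sort data=sashelp.prdsal2 out=prdsal2;
   by state;
run;

/* rename the key so both tables share a BY variable */
proc sort data=sashelp.us_data(rename=(statename=state)) out=us_data;
   by state;
run;

data inner;
   merge prdsal2(in=a) us_data(in=b);
   by state;
   if a and b;   /* keep matching STATE values only, i.e. an inner join */
run;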
To shine a light from a viewpoint other than the valid ones already mentioned:
Production tables, as we assume they will be used over and over, should be optimised for their use. Typically, they should be sorted by the most frequent merge key(s), indexed by other useful merge key(s), indexed by frequent query key(s) (such as: give me all records matching condition A and B). The idea is that the cost of creating the table will be offset many times by repeatedly using these sort and index properties.
Additionally, production tables could have other desirable properties such as: suitable compression, buffer size, password protection, labelling.
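As a rough sketch of what building such a production table can look like (the STAGING/PROD libraries and the CUSTID/REGION variables are made-up names; which options are worth setting depends entirely on the table and its use):

/* sort by the most frequent merge key */
proc sort data=staging.sales out=sales_sorted;
   by custid;
run;

/* write the production copy with the desired storage properties */
data prod.sales (compress=yes                /* suitable compression */
                 bufsize=64k                 /* tuned buffer size    */
                 index=(region)              /* frequent query key   */
                 label='Sales, sorted by CUSTID, indexed by REGION');
   set sales_sorted;
run;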
To further answer your question, the relative efficiency of an index depends on many factors, such as observation length, cardinality, and expected subset size (do you expect to extract 1% or 100% of the table?). So asking whether having an index is better (than what?) is like asking how long a piece of string is: it depends.
Note that merging in a data step requires either a sorted source data set, an indexed source data set, or an SPDE source data set. SQL does not have these requirements, but only because it does the sorting itself: the sorting still takes place.
Indexes only help when you need to extract a small subset (10% or so) of a data set. When more (or all) of the data set is needed, indexes will slow down the merge to the point where sorting plus merging is faster than merging with the index.
This is because reading the index file causes additional I/O, and the data set itself is then read in random order, which is always slower than sequential access.
Just to add: the 10% figure is a rule of thumb only.
Solid-state storage does not suffer from random access as much as spinning disks do, and disk performance also depends on the RAID level.
A narrow table will be less expensive to sort than a wide table, for a given number of observations.
And so on. Many factors to consider. As usual.
What one must never forget: on a physical level, SAS does not read observations, but dataset pages.
So, if you consider a data set with an observation length of 1,000 bytes and the typical page size of 128 KB, you have more than 100 observations per page, and statistically that means that even a small subset of observations will require reading nearly the whole physical data set. And that's why indexes usually slow everything down, unless the observation length approaches the page size.
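A quick back-of-the-envelope sketch of that effect, assuming the requested observations are spread uniformly across pages (the 1% subset and the sizes are simply the numbers from the paragraph above):

data _null_;
   obs_len      = 1000;              /* bytes per observation            */
   page_size    = 128*1024;          /* 128 KB page                      */
   obs_per_page = floor(page_size/obs_len);
   p            = 0.01;              /* extract 1% of the observations   */
   /* chance that a given page holds at least one requested observation  */
   frac_pages   = 1 - (1-p)**obs_per_page;
   put obs_per_page= frac_pages= percent8.1;
run;

With these numbers, roughly three quarters of all pages still have to be read to retrieve a 1% subset, which is why indexed random access rarely beats a sequential read.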