topic Re: Index data sets before Merge in SAS Programming

Index data sets before Merge

Ronein — Tue, 22 Jun 2021 20:09:27 GMT

Hello,

Is is better to index data sets before merging with Data step ?

Is is better to index data sets before merging with Proc SQL ?

Let's say that I want to do inner Join of SASHELP.prdsal12 with SASHELP.us_data.

The merge will be done by state field from SASHELP.prdsal12 with statename field from SASHELP.us_data.

How Can I know if data set SASHELP.prdsal12 is already indexed?

How Can I know if data set SASHELP.us_data is already indexed?

Re: Index data sets before Merge

Reeza — Tue, 22 Jun 2021 20:21:37 GMT

@Ronein wrote:

Hello,

Is is better to index data sets before merging with Data step ?

Is is better to index data sets before merging with Proc SQL ?

Let's say that I want to do inner Join of SASHELP.prdsal12 with SASHELP.us_data.

Better than what? Not doing an Index? Depends on data set size, for smaller data sets it's not worth creating an index and creating the index may take more time than the merge.

@Ronein wrote:

How Can I know if data set SASHELP.prdsal12 is already indexed?

How Can I know if data set SASHELP.us_data is already indexed?

Run a proc contents on the data set and examine the output. One of the displayed information is if any index exists. You can also query SASHELP.VTABLE to see if any indexes are present.

The CONTENTS Procedure

Data Set Name	SASHELP.PRDSAL2	Observations	23040
Member Type	DATA	Variables	11
Engine	V9	Indexes	0
Created	10/24/2018 20:14:37	Observation Length	120
Last Modified	10/24/2018 20:14:37	Deleted Observations	0
Protection	 	Compressed	NO
Data Set Type	 	Sorted	YES
Label	Furniture sales data	 	 
Data Representation	SOLARIS_X86_64, LINUX_X86_64, ALPHA_TRU64, LINUX_IA64	 	 
Encoding	us-ascii ASCII (ANSI)

Or query the SASHELP.VTABLE or VINDEX data sets.

proc contents data=SASHELP.prdsal2;
run;

proc sql;
create table want as
select * from sashelp.vindex 
where libname='SASHELP' 
and memname = 'PRDSAL2';
quit;

Re: Index data sets before Merge

ballardw — Tue, 22 Jun 2021 20:26:55 GMT

"Better" would need some description on what is considered "better". Indexing takes resources. With large data sets it may take a fair amount of time to add the index(s). With small sets the performance difference of indexed vs non-indexed may be trivial.

You can tell if sets are indexed with Proc Contents, which reports number of indexes, or Dictionary.Tables "Type of Indexes", or Dictionary.Indexes which will have the name of the index and more information about the index.

Re: Index data sets before Merge

SASKiwi — Tue, 22 Jun 2021 20:34:04 GMT

SASHELP datasets are too small for indexes to be of any use. Also it is good policy not to tamper with the original SASHELP datasets so they work the same for all users.

Re: Index data sets before Merge

Reeza — Tue, 22 Jun 2021 20:36:15 GMT

I assumed you were referring to PRDSAL2 by the way, I don't have a dataset name PRDSAL12.

Re: Index data sets before Merge

tarheel13 — Tue, 22 Jun 2021 21:55:19 GMT

If you are doing a join with SQL, you don't need to sort by state first.

Re: Index data sets before Merge

Astounding — Wed, 23 Jun 2021 00:00:32 GMT

I'll address just one of the ignored points here. When would you index a data set before a DATA step MERGE? Absolutely never, unless you have no choice.

The MERGE will be horribly inefficient, if you use an index to read in the observations in order. Ian Whitlock once wrote a Coder's Corner paper along the lines of "Why Did This Code Take 24 Hours to Run?" I can give you technical details about why this is true, if you are interested. Just sort your data, then merge it.

Re: Index data sets before Merge

ChrisNZ — Wed, 23 Jun 2021 04:06:53 GMT

To shine a light from another view point than the valid ones already mentioned:

Production tables, as we assume they will be used over and over, should be optimised for their use. Typically, they should be sorted by the most frequent merge key(s), indexed by other useful merge key(s), indexed by frequent query key(s) (such as: give me all records matching condition A and B). The idea is that the cost of creating the table will be offset many times by repeatedly using these sort and index properties.

Additionally, production tables could have other desirable properties such as: suitable compression, buffer size, password protection, labelling.

To further answer your question, index relative efficiency depends on many factors such as observation length, cardinality, expected subset size (do you expect to extract 1% or 100% of the table?). So asking if having an index is better (than what?) is like asking how long a piece of string is: It depends.

Note that merging in a data step requires either a sorted source dataset, and indexed source data set or an SPDE source data set. SQL does not have these requirements, but only because it will be doing the sorting itself: The sorting still takes place.

Re: Index data sets before Merge

Kurt_Bremser — Wed, 23 Jun 2021 05:53:12 GMT

Indexes only help when you need to extract a small (10% or so) subset of a dataset. When more (or all) of the dataset is needed, indexes will slow down the merge to a point where sorting + merging will be faster than just merging with the index.

This is because reading the index file as such will cause additional I/O, and reading the dataset is done in random order, which is always slower than sequential.

Re: Index data sets before Merge

ChrisNZ — Wed, 23 Jun 2021 06:15:25 GMT

Just add the 10% figure is a rule-of-thumb only.

A solid state storage will not suffer from random access as much as disks, and disks also depend on the RAID level.

A narrow table will be less expensive to sort than a wide table, for a given number of observations.

And so on. Many factors to consider. As usual.

Re: Index data sets before Merge

Kurt_Bremser — Wed, 23 Jun 2021 06:37:36 GMT

What one must never forget: on a physical level, SAS does not read observations, but dataset pages.

So, if you consider a dataset with an observation size of 1000 and the typical page size of 128K, you have more than 100 obs in a page, and statistically that means that even a small subset of obs will have to read the whole physical dataset. And that's why indexes usually slow everything down, unless obs size approaches page size.