BookmarkSubscribeRSS Feed
Ronein
Onyx | Level 15

Hello,

Is is better to index data sets before merging with Data step ? 

Is is better to index data sets before merging with Proc SQL ? 

Let's say that I want to do inner Join of SASHELP.prdsal12  with SASHELP.us_data.

The merge will be done by  state field from   SASHELP.prdsal12  with statename field from SASHELP.us_data.

How Can I know if data set SASHELP.prdsal12  is already indexed?

How Can I know if data set SASHELP.us_data  is already indexed?

 

10 REPLIES 10
Reeza
Super User

@Ronein wrote:

Hello,

Is is better to index data sets before merging with Data step ? 

Is is better to index data sets before merging with Proc SQL ? 

Let's say that I want to do inner Join of SASHELP.prdsal12  with SASHELP.us_data.


Better than what? Not doing an Index? Depends on data set size, for smaller data sets it's not worth creating an index and creating the index may take more time than the merge. 

 


@Ronein wrote:

 

How Can I know if data set SASHELP.prdsal12  is already indexed?

How Can I know if data set SASHELP.us_data  is already indexed?


Run a proc contents on the data set and examine the output. One of the displayed information is if any index exists. You can also query SASHELP.VTABLE to see if any indexes are present.

 

The CONTENTS Procedure

Data Set Name	SASHELP.PRDSAL2	Observations	23040
Member Type	DATA	Variables	11
Engine	V9	Indexes	0
Created	10/24/2018 20:14:37	Observation Length	120
Last Modified	10/24/2018 20:14:37	Deleted Observations	0
Protection	 	Compressed	NO
Data Set Type	 	Sorted	YES
Label	Furniture sales data	 	 
Data Representation	SOLARIS_X86_64, LINUX_X86_64, ALPHA_TRU64, LINUX_IA64	 	 
Encoding	us-ascii ASCII (ANSI)	 	 

Or query the SASHELP.VTABLE or VINDEX data sets.

proc contents data=SASHELP.prdsal2;
run;

proc sql;
create table want as
select * from sashelp.vindex 
where libname='SASHELP' 
and memname = 'PRDSAL2';
quit;
ballardw
Super User

"Better" would need some description on what is considered "better". Indexing takes resources. With large data sets it may take a fair amount of time to add the index(s). With small sets the performance difference of indexed vs non-indexed may be trivial.

 

You can tell if sets are indexed with Proc Contents, which reports number of indexes, or Dictionary.Tables "Type of Indexes", or Dictionary.Indexes which will have the name of the index and more information about the index.

SASKiwi
PROC Star

SASHELP datasets are too small for indexes to be of any use. Also it is good policy not to tamper with the original SASHELP datasets so they work the same for all users.

Reeza
Super User
I assumed you were referring to PRDSAL2 by the way, I don't have a dataset name PRDSAL12.
tarheel13
Rhodochrosite | Level 12

If you are doing a join with SQL, you don't need to sort by state first.

Astounding
PROC Star

I'll address just one of the ignored points here.  When would you index a data set before a DATA step MERGE?  Absolutely never, unless you have no choice.

 

The MERGE will be horribly inefficient, if you use an index to read in the observations in order.  Ian Whitlock once wrote a Coder's Corner paper along the lines of "Why Did This Code Take 24 Hours to Run?"  I can give you technical details about why this is true, if you are interested.  Just sort your data, then merge it.

ChrisNZ
Tourmaline | Level 20

To shine a light from another view point than the valid ones already mentioned:

Production tables, as we assume they will be used over and over, should be optimised for their use. Typically, they should be sorted by the most frequent merge key(s), indexed by other useful merge key(s), indexed by frequent query key(s) (such as: give me all records matching condition A and B). The idea is that the cost of creating the table will be offset many times by repeatedly using these sort and index properties. 

Additionally, production tables could have other desirable properties such as: suitable compression, buffer size, password protection, labelling.

 

To further answer your question, index relative efficiency depends on many factors such as observation length, cardinality, expected subset size (do you expect to extract 1% or 100% of the table?). So asking if having an index is better (than what?) is like asking how long a piece of string is: It depends.

 

Note that merging in a data step requires either a sorted source dataset, and indexed source data set or an SPDE source data set. SQL does not have these requirements, but only because it will be doing the sorting itself: The sorting still takes place.

Kurt_Bremser
Super User

Indexes only help when you need to extract a small (10% or so) subset of a dataset. When more (or all) of the dataset is needed, indexes will slow down the merge to a point where sorting + merging will be faster than just merging with the index.

This is because reading the index file as such will cause additional I/O, and reading the dataset is done in random order, which is always slower than sequential.

ChrisNZ
Tourmaline | Level 20

Just add the 10% figure is a rule-of-thumb only.

A solid state storage will not suffer from random access as much as disks, and disks also depend on the RAID level.

A narrow table will be less expensive to sort than a wide table, for a given number of observations.

And so on. Many factors to consider. As usual.

Kurt_Bremser
Super User

What one must never forget: on a physical level, SAS does not read observations, but dataset pages.

So, if you consider a dataset with an observation size of 1000 and the typical page size of 128K, you have more than 100 obs in a page, and statistically that means that even a small subset of obs will have to read the whole physical dataset. And that's why indexes usually slow everything down, unless obs size approaches page size.

hackathon24-white-horiz.png

The 2025 SAS Hackathon has begun!

It's finally time to hack! Remember to visit the SAS Hacker's Hub regularly for news and updates.

Latest Updates

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

SAS Training: Just a Click Away

 Ready to level-up your skills? Choose your own adventure.

Browse our catalog!

Discussion stats
  • 10 replies
  • 2500 views
  • 2 likes
  • 8 in conversation