Hello,
Is it better to index data sets before merging with a Data step?
Is it better to index data sets before merging with Proc SQL?
Let's say that I want to do an inner join of SASHELP.prdsal2 with SASHELP.us_data.
The merge will be done by the state field from SASHELP.prdsal2 and the statename field from SASHELP.us_data.
How can I know if data set SASHELP.prdsal2 is already indexed?
How can I know if data set SASHELP.us_data is already indexed?
@Ronein wrote:
Hello,
Is it better to index data sets before merging with a Data step?
Is it better to index data sets before merging with Proc SQL?
Let's say that I want to do an inner join of SASHELP.prdsal2 with SASHELP.us_data.
Better than what? Not creating an index? It depends on data set size: for smaller data sets it's not worth creating an index, and creating the index may take more time than the merge itself.
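For reference, if you did decide to create an index, here is a minimal sketch. It works on a WORK copy (SASHELP is read-only and should not be modified), and the simple index on STATE is just the merge key from this thread:

/* copy the table first; never index SASHELP itself */
proc copy in=sashelp out=work;
   select prdsal2;
run;

/* create a simple index on the merge key */
proc datasets library=work nolist;
   modify prdsal2;
   index create state;
quit;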
@Ronein wrote:
How can I know if data set SASHELP.prdsal2 is already indexed?
How can I know if data set SASHELP.us_data is already indexed?
Run a PROC CONTENTS on the data set and examine the output. One of the items displayed is whether any indexes exist. You can also query SASHELP.VTABLE to see if any indexes are present.
The CONTENTS Procedure

Data Set Name        SASHELP.PRDSAL2          Observations           23040
Member Type          DATA                     Variables              11
Engine               V9                       Indexes                0
Created              10/24/2018 20:14:37      Observation Length     120
Last Modified        10/24/2018 20:14:37      Deleted Observations   0
Protection                                    Compressed             NO
Data Set Type                                 Sorted                 YES
Label                Furniture sales data
Data Representation  SOLARIS_X86_64, LINUX_X86_64, ALPHA_TRU64, LINUX_IA64
Encoding             us-ascii  ASCII (ANSI)
Or query the SASHELP.VTABLE or SASHELP.VINDEX views.
/* check the Indexes entry in the PROC CONTENTS output */
proc contents data=sashelp.prdsal2;
run;

/* or list any indexes defined on the table via the VINDEX view */
proc sql;
   create table want as
   select *
   from sashelp.vindex
   where libname='SASHELP'
     and memname='PRDSAL2';
quit;
"Better" would need some description on what is considered "better". Indexing takes resources. With large data sets it may take a fair amount of time to add the index(s). With small sets the performance difference of indexed vs non-indexed may be trivial.
You can tell if data sets are indexed with Proc Contents, which reports the number of indexes, with Dictionary.Tables ("Type of Indexes"), or with Dictionary.Indexes, which has the name of each index and more information about it.
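A minimal sketch of both dictionary queries (the WHERE values are just the example table from this thread):

proc sql;
   /* Type of Indexes for the table as a whole */
   select libname, memname, indxtype
   from dictionary.tables
   where libname='SASHELP' and memname='PRDSAL2';

   /* one row per indexed variable, with the index name */
   select libname, memname, name, indxname, idxusage
   from dictionary.indexes
   where libname='SASHELP' and memname='PRDSAL2';
quit;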
SASHELP datasets are too small for indexes to be of any use. Also it is good policy not to tamper with the original SASHELP datasets so they work the same for all users.
If you are doing a join with SQL, you don't need to sort by state first.
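For example, a minimal sketch of the inner join from the question, assuming STATE and STATENAME are the key variables as described (the UPCASE is only a precaution in case the values differ in case):

proc sql;
   create table state_join as
   select a.*, b.*
   from sashelp.prdsal2 as a
        inner join
        sashelp.us_data  as b
        on upcase(a.state) = upcase(b.statename);
quit;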
I'll address just one of the ignored points here. When would you index a data set before a DATA step MERGE? Absolutely never, unless you have no choice.
The MERGE will be horribly inefficient if you use an index to read the observations in order. Ian Whitlock once wrote a Coder's Corner paper along the lines of "Why Did This Code Take 24 Hours to Run?" I can give you the technical details of why this is true if you are interested. Just sort your data, then merge it.
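A minimal sketch of that sort-then-merge pattern for the tables in this thread (the RENAME= only lines up the two differently named key variables; the WORK data set names are illustrative):

proc sort data=sashelp.prdsal2 out=prdsal2;
   by state;
run;

/* rename the key so both tables share a BY variable */
proc sort data=sashelp.us_data(rename=(statename=state)) out=us_data;
   by state;
run;

data inner;
   merge prdsal2(in=a) us_data(in=b);
   by state;
   if a and b;   /* keep matching STATE values only, i.e. an inner join */
run;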
To shine a light from a viewpoint other than the valid ones already mentioned:
Production tables, as we assume they will be used over and over, should be optimised for their use. Typically, they should be sorted by the most frequent merge key(s), indexed by other useful merge key(s), indexed by frequent query key(s) (such as: give me all records matching condition A and B). The idea is that the cost of creating the table will be offset many times by repeatedly using these sort and index properties.
Additionally, production tables could have other desirable properties such as: suitable compression, buffer size, password protection, labelling.
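As a rough sketch of what building such a production table can look like (the STAGING/PROD libraries and the CUSTID/REGION variables are made-up names; which options are worth setting depends entirely on the table and its use):

/* sort by the most frequent merge key */
proc sort data=staging.sales out=sales_sorted;
   by custid;
run;

/* write the production copy with the desired storage properties */
data prod.sales (compress=yes                /* suitable compression */
                 bufsize=64k                 /* tuned buffer size    */
                 index=(region)              /* frequent query key   */
                 label='Sales, sorted by CUSTID, indexed by REGION');
   set sales_sorted;
run;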
To further answer your question, the relative efficiency of an index depends on many factors, such as observation length, cardinality, and expected subset size (do you expect to extract 1% or 100% of the table?). So asking whether having an index is better (than what?) is like asking how long a piece of string is: it depends.
Note that merging in a data step requires either a sorted source data set, an indexed source data set, or an SPDE source data set. SQL does not have these requirements, but only because it does the sorting itself: the sorting still takes place.
Indexes only help when you need to extract a small subset (10% or so) of a data set. When more (or all) of the data set is needed, indexes will slow down the merge to the point where sorting plus merging is faster than merging with the index.
This is because reading the index file causes additional I/O, and the data set itself is then read in random order, which is always slower than sequential access.
Just to add: the 10% figure is a rule of thumb only.
Solid-state storage does not suffer from random access as much as spinning disks do, and disk performance also depends on the RAID level.
A narrow table will be less expensive to sort than a wide table, for a given number of observations.
And so on. Many factors to consider. As usual.
What one must never forget: on a physical level, SAS does not read observations, but dataset pages.
So, if you consider a data set with an observation length of 1,000 bytes and the typical page size of 128 KB, you have more than 100 observations per page, and statistically that means that even a small subset of observations will require reading nearly the whole physical data set. And that's why indexes usually slow everything down, unless the observation length approaches the page size.
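A quick back-of-the-envelope sketch of that effect, assuming the requested observations are spread uniformly across pages (the 1% subset and the sizes are simply the numbers from the paragraph above):

data _null_;
   obs_len      = 1000;              /* bytes per observation            */
   page_size    = 128*1024;          /* 128 KB page                      */
   obs_per_page = floor(page_size/obs_len);
   p            = 0.01;              /* extract 1% of the observations   */
   /* chance that a given page holds at least one requested observation  */
   frac_pages   = 1 - (1-p)**obs_per_page;
   put obs_per_page= frac_pages= percent8.1;
run;

With these numbers, roughly three quarters of all pages still have to be read to retrieve a 1% subset, which is why indexed random access rarely beats a sequential read.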