Efficiency Matters: Using Indexes in SAS

2 Likes

At some point, every SAS programmer working with large data asks the same question: “How can I make this run faster?” If you find yourself frequently filtering on certain variables, indexing may be the answer. Indexes give SAS a faster path to the data you need—no more scanning millions of rows to locate a handful of observations. In this post, I break down what indexes are, how to create them, and real examples of when they improve performance. Example code and data can be downloaded here.

What are Indexes?

An index is an optional file that can be created for a SAS data set that enhances the way SAS locates and retrieves data. Typically, SAS processes data in sequential order with “row by row” processing. In comparison, an index provides direct access to specific observations.

SAS indexes can be compared to an index in the back of a book.

Select any image to see a larger version.
Mobile users: To view the images, select the "Full" version at the bottom of the page.

In this book, if I want to know more about the ALL keyword, I can navigate to the back of the book and see that the ALL keyword appears on pages 2-57, 3-49-3-51, 3-62, 3-68, and 3-71. I can turn directly to each page to find the information. Without the index, I would have to flip through the entire book page by page, which would be much more time-consuming.

An index is a separate file, related to the data set. It has the same name as the data set with the file extension .sas7bndx.

Use Indexes If...

Indexes are best to use if:

you frequently use WHERE statements or BY group processing
the data set is large, but you only need a small subset of observations
you perform joins or merges on variables

Types of Indexes

There are two types of indexes: simple and composite.

A simple index consists of one variable. These are best to use when your data is typically accessed using a single variable.
A composite index consists of more than one variable. These are best to use when data is frequently accessed using multiple variables on the same WHERE expression. When creating a composite index, the first variable listed becomes the key variable. SAS can select the index if data is filtered on all variables in the index or only on the key variable.

How to Create Indexes

Indexes can be created using the DATA step, PROC SQL or PROC DATASETS. Indexes can be dropped only with PROC SQL and PROC DATASETS. It’s also important to note that indexes work best when the data is sorted by the indexed variable before indexing.

About the Data

The following examples use three main data sets: EFFI.ORDERS_TWOMILLION, EFFI.EMPLOYEE_MASTER and EFFI.EMPLOYEEPHONES.

EFFI.ORDERS_TWOMILLION contains data about customers and their orders from a fictitious retail company. The data set has two million observations and is not pre-sorted. Note the screenshot below only includes relevant variables.

EFFI.EMPLOYEE_MASTER contains information about employees at a fictitious company. There is one observation per employee ID. Note the screenshot below only includes relevant variables.

EFFI.EMPLOYEEPHONES contains employee home, work and cell phone numbers.

Sorting Data

Indexes work best when the data is first sorted. The PROC SORT step sorts the data and creates the output data set OUT.ORDERS_SORT.

proc sort data=effi.orders_twomillion out=out.orders_sort;
     by Continent Country State_Province Customer_Group
        Customer_Type Customer_ID Order_ID;
run;

Creating Indexes in the DATA Step

When creating indexes in the DATA step, the indexes are created on the output data set. It’s also possible to manipulate the data at the same time. In the example below, the data is filtered for orders where Country is equal to United States and two indexes are created on the output data set US_ORDERS: the simple index on State_Province and the composite index CustOrd composed of Customer_ID and Order_ID.

The log does not automatically display notes about index creation. The OPTIONS MSGLEVEL=i; statement adds those details.

options msglevel=i;
data us_orders (index=(State_Province) 
                CustOrd=(Customer_ID Order_ID)));
     set out.orders_sort;
     where Country=”United States”;
run;

Creating Indexes with PROC SQL

When creating indexes with PROC SQL, you can only create indexes on existing data. Because of this, I created a copy of the sorted data set named SQL_EX. In PROC SQL, CREATE INDEX statements are used to create indexes. Again, the simple index State_Province and composite index, CustOrd are created. Notice in PROC SQL, you must provide a name for each index. The log provides notes automatically about the indexes defined.

data sql_ex;
     set out.orders_sort;
run;

proc sql;
create index State_Province
     on sql_ex(State_Province);
create index CustOrd
     on sql_ex(Customer_ID, Order_ID);
quit;

Creating Indexes with PROC DATASETS

PROC DATASETS also requires that the data already exists. The first DATA step again creates a copy of the sorted data named DATASETS_EX. INDEX CREATE statements create the same simple and composite indexes as the other methods. If I already have a data set, PROC DATASETS is my preferred method because the code is less repetitive than PROC SQL. The log provides notes automatically about the indexes defined.

data datasets_ex;
     set out.orders_sort;
run;

proc datasets library=work nolist;
     modify datasets_ex;
     index create State_Province;
     index create CustOrd=(Customer_ID Order_ID);
quit;

When to Use Indexes

Now that the indexes exist, let’s see a few examples of how SAS can use them.

Example 1: Improving Efficiency in the WHERE Statement

Indexes can be used to improve efficiency in the WHERE statement.

In the first DATA step, the WHERE statement filters the original (non-indexed) data set:

data ex_1;
     set effi.orders_twomillion;
     where Customer_ID=24474;
run;

Sequential processing takes about 23 seconds in real time.

In comparison, the second DATA step filters the indexed version of the data. Remember, the CustOrd index is composed of Customer_ID and Order_ID. Customer_ID is the key variable in the index, therefore it is possible for SAS to use the index in this step.

options msglevel=i;
data ex_1;
     set datasets_ex;
     where Customer_ID=24474;
run;

The CustOrd composite index is selected for processing, and the step takes 0 seconds to run- a huge efficiency gain.

Example 2: Using an Index in PROC SQL with the WHERE Clause

Indexes can also be used in PROC SQL during a join or in the WHERE clause.

In the following DATA step, EMPLOYEE_MASTER is a copy of EFFI.EMPLOYEE_MASTER with an index on EmployeeID. This variable matches EmployeeID in the data set EFFI.EMPLOYEEPHONES.

data employee_master(index=(EmployeeID));
     set effi.employee_master (rename=(Employee_ID=EmployeeID));
run;

The following PROC SQL step performs an inner join that produces a report of a specific employee’s home, work and/or cell phone numbers. The index on EmployeeID is used for processing.

options msglevel=i;
proc sql;
select m.employeeID, m.employee_name, phoneType, phoneNumber
     from employee_master as m 
     inner join effi.employeephones as p
     on m.employeeID= p.employeeID
     where m.EmployeeID=120178;
quit;

Example 3: Using an Index on the BY Statement

Finally, indexes can be used for BY processing. Typically, when there is a BY statement in a procedure, the data set must be sorted by the BY variable(s) listed. However, if a data set is indexed, on the BY variable, SAS can avoid sorting. In the following DATA step, ORDERS_NOT_SORTED is a copy of EFFI.ORDERS_TWOMILLION with an index on State_Province.

data orders_not_sorted(index=(State_Province));
     set effi.orders_twomillion;
run;

When this data set is used in a MEANS procedure with the BY statement, the procedure runs and the index is selected even though the data set was not sorted by State_Province.

proc means data=orders_not_sorted mean;
     var quantity;
     by State_Province;  
run;

In the log, notice the procedure took about 30 seconds to run.

Even though this works, remember- indexes work best on sorted data. This is because the indexed values are in ascending order in the index, meaning when SAS looks them up, the observations are on consecutive data set pages. I ran the same means procedure on the sorted, indexed data set DATASETS_EX.

proc means data=datasets_ex mean;
     var quantity;
     by State_Province;  
run;

Notice in the log, the State_Province index was selected again, but the step only took about 3 seconds to run- a much more efficient run.

Conclusion

Indexes are a powerful tool for improving SAS performance, especially when working with large data. By creating simple or composite indexes, you allow SAS to locate observations efficiently without scanning the entire data set. Whether you are filtering with a WHERE statement, joining tables, or performing BY-group processing, indexes can reduce run time dramatically. With flexible creation methods in the DATA step, PROC SQL, or PROC DATASETS, indexes give you an easy way to optimize processing and make your programs run faster.

Find more articles from SAS Global Enablement and Learning here.