The SAS Communities Blog Efficiency Matters: Using Indexes in SAS provides an excellent review of what indexes are, how to create them, and how they can improve SAS query performance.
Its summary captures the essence well:
“Indexes are a powerful tool for improving SAS performance, especially when working with large data. By creating simple or composite indexes, you allow SAS to locate observations efficiently without scanning the entire data set. Whether you are filtering with a WHERE statement, joining tables, or performing BY-
group processing, indexes can reduce run time dramatically.”
Since many SAS users query large datasets in databases such as Microsoft SQL Server via SAS/ACCESS interfaces, I wanted to validate whether similar performance gains could be expected when indexes exist on database tables. Using scenarios similar to those in the SAS Communities blog, I found that querying SQL Server tables via SAS/ACCESS was significantly faster when indexes were in place:
This blog outlines the validation steps taken to arrive at these findings.
_______________________________________________________________________________________________________________________
Test Specifications
To test the three scenarios, I created duplicate tables in Microsoft SQL Server—one set with indexes and one without. I downloaded data from the Efficiency Matters blog on GitHub: link
For testing, I chose non-clustered indexes because they can be created on single or multiple columns, don’t require sorting, and allow multiple indexes per table. By contrast, a clustered index forces rows to be stored in the order of the index key, limiting each table to one clustered index.
Creating Indexes via SQL Passthrough
proc sql; connect using MSSQLSVR /*previously defined LIBNAME*/ as SQLDB; EXECUTE(CREATE NONCLUSTERED INDEX SPINDEX ON orders_10m_indx (State_Province)) BY SQLDB; EXECUTE(CREATE NONCLUSTERED INDEX CUSTINDEX ON orders_10m_indx (Customer_ID)) BY SQLDB; EXECUTE(CREATE NONCLUSTERED INDEX EMPINDEX ON orders_10m_indx (Employee_ID)) BY SQLDB; EXECUTE(CREATE NONCLUSTERED INDEX EMPINDEXO ON emp_master_indx (Employee_ID)) BY SQLDB; disconnect fromSQLDB;
quit;
To check or validate that indexes exist
proc sql;
connect using MSSQLSVR /*previously defined LIBNAME*/ as SQLDB;
create table indexed_fields as select * from connection to SQLDB
(select t.name as table_name, i.name as index_name, c.name as column_name,
i.type_desc as index_type from sys.tables t join sys.indexes i on
t.object_id=i.object_id join sys.index_columns ic on i.object_id=
ic.object_id and i.index_id=ic.index_id join sys.columns c on
ic.object_id=c.object_id and ic.column_id=c.column_id
where t.name in ('emp_master_indx','orders_10m_indx'));
disconnect from SQLDB;
quit;
Example output:
If you don’t have permission to create indexes, work with your SQL Server administrator.
_______________________________________________________________________________________________________________________
Environment Specifications
_______________________________________________________________________________________________________________________
Example 1: Effect of Index on WHERE Clause
data ex_1;
set MSSQLSVR.orders_10M;
where Customer_ID=10860;
run;
Results:
_______________________________________________________________________________________________________________________
Example 2: Joining Tables with WHERE Clause
proc sql;
create table ex10 as
select distinct(m.Employee_Name) as Name,
count(p.Order_ID) as orders
from mssqlsvr.emp_master as m
inner join mssqlsvr.orders_10M as p
on m.Employee_ID=p.Employee_ID
where m.Employee_ID=120178;
quit;
Results:
______________________________________________________________________________________________________________________
Example 3: BY-Group Processing
proc means data=mssqlsvr.orders_10M mean;
var quantity;
by State_Province;
run;
Results:
Why no Improvement? Ordering by a non-clustered index isn’t always faster because queries often need extra columns not in the index, forcing expensive lookups. If the index covers the query, performance improves.
To test this, I created a new covering index that included both variables in the Proc Means example (State_Province & Quantity).
proc sql;
connect using MSSQLSVR /*previously defined LIBNAME*/ as SQLDB;
EXECUTE(CREATE NONCLUSTERED INDEX COMBO_INDEX
ON orders_10m_indx (State_Province, Quantity)) BY SQLDB;
disconnect from SQLDB;
quit;
Results after rerunning Proc Means:
_______________________________________________________________________________________________________________________
Summary
Indexes in Microsoft SQL Server deliver substantial performance improvements when accessed through SAS/ACCESS, mirroring the benefits seen in native SAS datasets.
For SAS users working with large database tables, collaborating with database administrators to ensure proper indexing can dramatically reduce query times and improve overall efficiency.
_______________________________________________________________________________________________________________________
Resources
April 27 – 30 | Gaylord Texan | Grapevine, Texas
Walk in ready to learn. Walk out ready to deliver. This is the data and AI conference you can't afford to miss.
Register now and lock in 2025 pricing—just $495!
The rapid growth of AI technologies is driving an AI skills gap and demand for AI talent. Ready to grow your AI literacy? SAS offers free ways to get started for beginners, business leaders, and analytics professionals of all skill levels. Your future self will thank you.