Today we are examining SAS's newest ACCESS engine, which uses DuckDB to run high-performance SQL queries on parquet, JSON, Iceberg, and other modern file formats. These files can be stored directly in a Viya environment or in object storage (e.g., AWS S3, Azure Blob Storage, Google Cloud Storage). All processing is executed in the SAS runtime, with no need to push work out to external OLAP systems. SAS users now have the flexibility to use open file formats and run analytics with PROC SQL and DATA step logic on modern data lakes without leaving the Viya environment. To test these capabilities, a workload was created around synthetic Medicare data, described below.
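As a rough sketch of what this looks like in practice, a library can be assigned through the new engine and parquet-backed tables queried with ordinary PROC SQL. The engine keyword, path, and table name below are assumptions for illustration; check the SAS/ACCESS to DuckDB documentation for the exact syntax:

```sas
/* Hypothetical sketch: assign a libref through the DuckDB engine       */
/* pointing at a directory of parquet files; option names may differ.   */
libname lake duckdb "/data/omop/parquet";

/* Query a parquet-backed table with ordinary PROC SQL */
proc sql;
   select count(*) as n_persons
   from lake.person;
quit;
```

From the user's point of view, nothing changes except the engine name on the LIBNAME statement; DuckDB handles reading the open file format underneath.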
The CMS DE-SynPUF (Data Entrepreneurs' Synthetic Public Use File) is a synthetic dataset derived from de-identified Medicare claims data. It contains data for approximately 2.33 million synthetic patients and is structured to closely resemble real Medicare claims while ensuring patient privacy.
This dataset has been transformed into the OMOP Common Data Model (CDM) by the Observational Health Data Sciences and Informatics (OHDSI) community. Observational data is typically represented differently across organizations; the OMOP CDM addresses this by standardizing data into 37 tables with 394 fields. This makes it a valuable resource for:
It is widely used as a public testbed for observational research and analytics pipelines, especially in environments where access to real patient data is restricted.
For the test queries, only 22 of the 37 tables were used; across those 22 tables there are over 1.6 billion observations. The data takes approximately 90 GB of storage as sas7bdat files and approximately 26 GB as parquet files, a 3.46x size reduction. With cloud costs ever increasing, savings at this scale can add up to a substantial reduction in storage spend.
To test these new capabilities, a workload was created using four queries on the OMOP data. The queries are:
The third and fourth queries include Common Table Expressions (CTEs), and the fourth also uses window functions; neither feature was previously available in PROC SQL, but both are now possible with the DuckDB ACCESS engine. To view the original queries, visit the Query Library from OHDSI.
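For illustration, a query combining a CTE with a window function can be sent to DuckDB using the standard SAS/ACCESS explicit pass-through pattern. The `duckdb` connection keyword is an assumption here (connection options are omitted), and the query itself is a simplified illustration over OMOP CDM tables, not one of the four benchmark queries:

```sas
proc sql;
   /* Connect through the DuckDB engine; connection options omitted
      here and will depend on the documented syntax */
   connect to duckdb as db;

   /* CTE plus a window function, executed by DuckDB rather than PROC SQL */
   select * from connection to db (
      with drug_counts as (
         select person_id, count(*) as n_exposures
         from drug_exposure
         group by person_id
      )
      select person_id,
             n_exposures,
             rank() over (order by n_exposures desc) as exposure_rank
      from drug_counts
   );

   disconnect from db;
quit;
```

Everything inside `connection to db ( ... )` is passed verbatim to DuckDB, which is what unlocks SQL features PROC SQL itself does not implement.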
The first three queries were compared side by side on runtime, I/O operations, and compute utilization across the two sets of operations:
The fourth query demonstrates window functions, a new capability available through explicit pass-through to DuckDB, and was therefore not included in the side-by-side analysis. The tests were all run in a SAS Viya environment with 125 GB of RAM across 8 cores/16 vCPUs. The tables below summarize the results captured from SAS Enterprise Session Monitor (ESM).
Examining these tables together, it is clear that the DuckDB ACCESS engine is much faster for two reasons: parallelization and more efficient reading of the source data. DuckDB automatically parallelizes across all the compute resources available to it, which is why it can exceed 1,000% CPU in SAS Studio, splitting each query across multiple cores. The more efficient reading is visible in Analyses 1 and 3, where only 6.5% and 2.6% of the source data is read, respectively. This is because DuckDB takes advantage of parquet file metadata through projection and filter pushdown. In short, projection pushdown reads only the columns a query actually references, so unused columns are skipped entirely; filter pushdown uses parquet metadata (which records the min and max values of each row group) to skip whole row groups that cannot match the filter. To learn more about efficient querying with DuckDB, see this blog post. Note that during the second run no data is read at all, because the query results are cached and can be returned immediately. These cached results are stored in very small native DuckDB files, which is why ESM records so few write operations.
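To see why so little of the source data is read, consider a query that touches only two columns and filters on one of them. The table and column names come from the OMOP CDM; the libref `lake` is a hypothetical library assigned through the DuckDB engine, and the query is illustrative rather than one of the benchmark queries:

```sas
/* Assumes a libref (here, lake) assigned through the DuckDB engine */
proc sql;
   /* Projection pushdown: only these two columns are read from the     */
   /* parquet files; every other column is skipped entirely.            */
   /* Filter pushdown: row groups whose min/max metadata for            */
   /* condition_start_date falls entirely before the cutoff are skipped */
   /* without being decompressed.                                       */
   select person_id, condition_start_date
   from lake.condition_occurrence
   where condition_start_date >= '01JAN2009'd;
quit;
```

On a wide table like condition_occurrence, combining both pushdowns means the engine may read only a small fraction of the bytes on disk, which is consistent with the 6.5% and 2.6% read fractions observed above.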
After running these tests, there are a few key takeaways about the performance and capabilities SAS gains from DuckDB:
In summary, SAS/ACCESS to DuckDB represents a significant leap forward in performance, flexibility, and efficiency for SAS users. By enabling advanced SQL capabilities, supporting modern data formats like Parquet, and leveraging automatic parallelization, DuckDB empowers users to work with larger datasets more efficiently and cost-effectively. These enhancements not only streamline data workflows but also open the door to more scalable and modern analytics within the SAS ecosystem.
- GitHub repository containing scripts and results for the case study.
- Data from AWS S3 Bucket
- Article from OHDSI about OMOP CDM
It's finally time to hack! Remember to visit the SAS Hacker's Hub regularly for news and updates.