Introduction:
Today we are going to examine SAS' newest ACCESS engine, which uses DuckDB to provide high-performance SQL queries on Parquet, JSON, Iceberg, and other modern file formats. These files can be stored directly in a Viya environment or in blob storage (e.g., AWS S3, Azure Blob Storage, Google Cloud Storage). All processing is executed within the SAS runtime, with no need to push work out to external OLAP systems. SAS users now have the flexibility to use open file formats and run analytics using PROC SQL and DATA step logic on modern data lakes without leaving the Viya environment. To test these capabilities, a workload was created around synthetic Medicare data, described in more detail below.
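To make the workflow concrete, here is a minimal sketch of assigning a DuckDB library and querying it with ordinary PROC SQL. The engine name duckdb, the DATABASE= option, and the paths are assumptions based on standard SAS/ACCESS LIBNAME conventions; consult the SAS/ACCESS to DuckDB documentation for your Viya release for the exact syntax.

/* Assign a library through the DuckDB ACCESS engine.             */
/* Engine name and DATABASE= option are assumed, following the    */
/* usual SAS/ACCESS LIBNAME pattern.                              */
libname duck duckdb database="/data/omop/omop.duckdb";

/* Implicit pass-through: ordinary PROC SQL against the library;  */
/* SAS hands work to DuckDB where it can.                         */
proc sql;
   select count(*) as n_rows
   from duck.person;
quit;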
Data Background and Size:
The CMS DE-SynPUF (Data Entrepreneurs' Synthetic Public Use File) is a synthetic dataset derived from de-identified Medicare claims data. It contains data for approximately 2.33 million synthetic patients and is structured to closely resemble real Medicare claims while ensuring patient privacy.
This dataset has been transformed into the OMOP Common Data Model (CDM) by the Observational Health Data Sciences and Informatics (OHDSI) community. Observational data is typically represented differently across organizations; to address this, the OMOP CDM provides a data standardization method that formats data into 37 tables with 394 fields. This makes it a valuable resource for:
Testing and validating OHDSI tools
Developing and benchmarking patient-level prediction models
Training machine learning models on observational health data
Demonstrating cohort definitions, treatment pathways, and outcome analyses
It is widely used as a public testbed for observational research and analytics pipelines, especially in environments where access to real patient data is restricted.
For the test queries, only 22 of the 37 tables were used; across those 22 tables there are over 1.6 billion observations. This data takes approximately 90 GB of storage as sas7bdat files and approximately 26 GB as Parquet files, a 3.46x size reduction. With ever-increasing cloud costs, a reduction of this magnitude can translate into substantial storage savings.
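For readers curious how Parquet copies of SAS data might be produced, DuckDB itself can write Parquet via its COPY statement. A sketch using explicit pass-through follows; the connection options, paths, and compression choice are illustrative assumptions, while COPY ... (FORMAT PARQUET) is native DuckDB SQL.

proc sql;
   /* Connection options are assumed; check your release's docs.  */
   connect to duckdb as con (database="/data/omop/omop.duckdb");
   execute (
      /* Native DuckDB SQL: write a table out as a Parquet file.  */
      copy condition_occurrence
         to '/data/omop/parquet/condition_occurrence.parquet'
         (format parquet, compression zstd)
   ) by con;
   disconnect from con;
quit;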
Testing Parameters:
To test these new capabilities, a workload was created using 4 queries on the OMOP data. The queries are:
CO05 (Analysis 1): Health conditions by gender and age of patients with nausea as a symptom
CE03 (Analysis 2): Top 10 comorbidities for patients with diabetes
CE07 (Analysis 3): Condition duration stratified by age and gender
Cohort Generation for Atrial Fibrillation (AFib) with Ischemic Heart Disease
The third and fourth queries include Common Table Expressions (CTEs), and the fourth also includes window functions; neither feature was originally possible in PROC SQL, but both are now possible with the DuckDB ACCESS engine (the CTE pattern is sketched below). To view the original queries, visit the Query Library from OHDSI.
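In sketch form, a CTE sent to DuckDB through explicit pass-through looks like the following. The table and column names are standard OMOP CDM fields, but the query itself is illustrative rather than one of the benchmark queries, and the connection options are assumed.

proc sql;
   connect to duckdb as con (database="/data/omop/omop.duckdb");
   create table work.top_conditions as
   select * from connection to con (
      /* CTE: count condition occurrences per concept, then rank. */
      with cond_counts as (
         select condition_concept_id, count(*) as n_events
         from condition_occurrence
         group by condition_concept_id
      )
      select condition_concept_id, n_events
      from cond_counts
      order by n_events desc
      limit 10
   );
   disconnect from con;
quit;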
The first three queries were compared side by side – for runtime, I/O operations, and compute utilization – across two configurations (both are sketched after this list):
Using the Base SAS engine with PROC SQL to query sas7bdat files and write output to new sas7bdat files
Using the DuckDB ACCESS engine with PROC SQL to query Parquet files and write output to native DuckDB files
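The two configurations differ only in the library engines and output targets. A compact sketch follows; all names are illustrative, and duck.person is assumed to resolve to the Parquet data (e.g., via a DuckDB view over the Parquet files).

/* Configuration 1: Base SAS engine, sas7bdat in and out.         */
libname base "/data/omop/sas7bdat";
proc sql;
   create table base.result as
   select gender_concept_id, count(*) as n
   from base.person
   group by gender_concept_id;
quit;

/* Configuration 2: DuckDB ACCESS engine, Parquet in, native      */
/* DuckDB file out (engine and option names assumed as above).    */
libname duck duckdb database="/data/omop/omop.duckdb";
proc sql;
   create table duck.result as
   select gender_concept_id, count(*) as n
   from duck.person
   group by gender_concept_id;
quit;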
The fourth query demonstrates window functions, a new capability available through explicit pass-through to the DuckDB database, and was therefore not included in the side-by-side analysis; the pass-through pattern is sketched below, followed by the results. These tests were all run in a SAS Viya environment with 125 GB of RAM across 8 cores/16 vCPUs, and the results were captured with SAS Enterprise Session Monitor (ESM).
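The window-function portion of a cohort query takes roughly the following shape under explicit pass-through. Everything inside the parentheses is native DuckDB SQL; the table and column names are standard OMOP CDM fields used illustratively, not the actual AFib cohort logic, and the connection options are assumed.

proc sql;
   connect to duckdb as con (database="/data/omop/omop.duckdb");
   create table work.first_condition as
   select * from connection to con (
      /* Window function: keep each patient's earliest condition  */
      /* record by ranking rows within person_id.                 */
      select person_id, condition_start_date
      from (
         select person_id,
                condition_start_date,
                row_number() over (
                   partition by person_id
                   order by condition_start_date
                ) as rn
         from condition_occurrence
      ) as t
      where rn = 1
   );
   disconnect from con;
quit;

The tables below summarize the ESM results.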
Runtime comparison
Compute comparison
I/O comparison
Examining all these tables together, it is clear that the DuckDB ACCESS engine is much faster due to two factors: parallelization and more efficient reading of data. DuckDB automatically parallelizes across all the compute resources available to it, which allows it to exceed 1,000% CPU in SAS Studio by splitting queries across multiple cores. The more efficient reading can be seen in Analyses 1 and 3, where only 6.5% and 2.6% of the source data is read, respectively. This is because DuckDB takes advantage of Parquet file metadata through projection and filter pushdown. In short, projection pushdown reads only the columns a query references, allowing unused columns to be skipped entirely; filter pushdown uses Parquet metadata (which records the minimum and maximum values of each row group) to skip over full row groups. To learn more about efficient querying with DuckDB, see this blog post.
It should also be noted that during the second run no data is read at all, because the query results are cached and can be accessed at a moment's notice. The results are stored in very small native DuckDB files, so very few write actions occur and are recorded by ESM.
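To illustrate the pushdown behavior described above, consider a query like the sketch below. read_parquet and its glob-pattern support are real DuckDB features; the file path, column names, and connection options are illustrative assumptions.

proc sql;
   connect to duckdb as con (database="/data/omop/omop.duckdb");
   select * from connection to con (
      /* Projection pushdown: only these two columns are read;    */
      /* every other column in the files is skipped entirely.     */
      /* Filter pushdown: row groups whose min/max metadata rule  */
      /* out dates in range are never decompressed.               */
      select person_id, condition_start_date
      from read_parquet('/data/omop/parquet/condition_occurrence*.parquet')
      where condition_start_date >= date '2008-01-01'
   );
   disconnect from con;
quit;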
After running these tests, there are a few key takeaways about the performance and capabilities that DuckDB brings to SAS:
SAS/ACCESS to DuckDB enables advanced SQL, including CTEs and window functions not supported in SAS V9.
Performance is competitive: SAS/ACCESS to DuckDB shows faster real-time execution in most cases, especially when reading from Parquet.
Modern data formats: SAS/ACCESS to DuckDB supports Parquet natively, avoiding the need for conversion to sas7bdat.
Lower data storage costs: Parquet provided roughly a 71% reduction in data size here (90 GB down to 26 GB), and the files can be stored in blob storage at reduced cost.
Less I/O: Using Parquet and native DuckDB files takes advantage of more efficient querying and smaller output formats.
Seamless parallelization: DuckDB automatically queries data in parallel, using all available compute.
In summary, SAS/ACCESS to DuckDB represents a significant leap forward in performance, flexibility, and efficiency for SAS users. By enabling advanced SQL capabilities, supporting modern data formats like Parquet, and leveraging automatic parallelization, DuckDB empowers users to work with larger datasets more efficiently and cost-effectively. These enhancements not only streamline data workflows but also open the door to more scalable and modern analytics within the SAS ecosystem.
References:
- GitHub repository containing scripts and results for the case study
- Data from AWS S3 bucket
- Article from OHDSI about the OMOP CDM