
Performant Aggregations on Parquet using DuckDB

Storage and compute costs remain persistent concerns for enterprises. This motivates the adoption of open formats like Parquet, which reduce data footprint and enable efficient retrieval, alongside query engines like DuckDB, which deliver high performance with minimal data movement.

 

Rather than perceiving DuckDB as a conventional RDBMS, we should view it as a broad-based query processing engine that works with many file formats. The SQL dialect used to interact with DuckDB therefore contains features that distinguish it from other SQL dialects.
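
Two small examples of such dialect features (both documented DuckDB behaviour, shown here against a hypothetical sales table):

-- GROUP BY ALL groups by every non-aggregated column in the SELECT list
SELECT region, product, sum(total_cost) AS cost
FROM sales
GROUP BY ALL;

-- EXCLUDE trims named columns from a star expansion
SELECT * EXCLUDE (internal_id)
FROM sales;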

 

In addition to a dedicated engine and features to read and write Parquet files, SAS also offers a SAS/ACCESS Interface to DuckDB to access DuckDB-supported data sources from a SAS program. Parquet is one of many formats that DuckDB supports natively, enabling zero-copy reads without importing data first.

 

Many SAS programs employ PROC SQL to query datasets. PROC SQL, as the name implies, is a procedure for processing SQL queries, and it serves two purposes. It can process queries in a SAS compute environment, and it can also "pass through" queries to run in-database, which is usually more desirable because processing takes place closer to where the data is stored.

 

A common choice is between implicit and explicit passthrough. Implicit passthrough uses PROC SQL syntax on datasets referenced through a libname, automatically sending optimised SQL to the database when possible. However, if SAS-specific functions are used, or if equivalent functions or keywords are not recognised by the database, the data is fetched into SAS for processing, incurring data movement and processing costs.
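
As an illustrative sketch (the DUCKDB libname engine name and its options are assumptions here; consult the SAS/ACCESS Interface to DuckDB documentation for the exact syntax), an implicit passthrough might look like this:

libname duck duckdb database="/path/to/my.duckdb";  /* engine name and options are assumptions */

proc sql;
   select region, sum(total_cost) as cost
   from duck.sales
   group by region;
quit;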

 

Herein lies a challenge faced by SAS users who want to run SQL in DuckDB: to truly realise the benefits of DuckDB on Parquet, they need to rely more on explicit passthrough, which follows different syntactic patterns and requires knowledge of the DuckDB SQL dialect. We shouldn't assume that all users possess this knowledge uniformly.
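
For contrast, here is a minimal sketch of explicit passthrough. The inner query is sent to DuckDB verbatim, so it must be valid DuckDB SQL; the CONNECT alias and connection options are assumptions, and the file path is hypothetical:

proc sql;
   connect to duckdb (database="/path/to/my.duckdb");  /* connection options are placeholders */
   create table work.agg as
   select * from connection to duckdb (
      /* everything inside these parentheses runs in DuckDB */
      select region, sum(total_cost) as cost
      from read_parquet('/path/to/sales.parquet')
      group by region
   );
   disconnect from duckdb;
quit;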

 

To bridge this gap, a new SAS Studio custom step enables users to run DuckDB aggregation queries on Parquet without writing SQL. SAS Studio custom steps are lightweight UI wrappers for SAS programs that execute in SAS Studio, either standalone or as part of a flow.

 

  • Access the custom step here (an updated repository may also be provided in future as part of SAS's open-source contributions)
  • Access the SAS program called by this custom step here

 

 

What happens behind the scenes

 

To keep things simple, we focus on aggregations and offer 12 single-argument aggregate functions that are widely used in analytics.  With a little modification, this custom step can be tweaked to process other DuckDB aggregation functions too, a planned activity which you are welcome to beat me to.

 

[Embedded video: a walkthrough of the DuckDB Parquet aggregations custom step]

A quick walkthrough of this custom step is available in the video above. The custom step uses SAS/ACCESS to DuckDB, passing parameters via macro variables while DuckDB handles the optimised Parquet scan. Beyond the convenience and interoperability the custom step provides, the main advantages stem from Parquet's inherent storage efficiencies and DuckDB's optimised query engine.
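
Conceptually, the generated program looks something like the sketch below. This is not the actual custom step code (see the repository for that); the macro variable names, connection options, and file path are illustrative. Note that the quotes around the file path are stored inside the macro variable so that they survive into the passthrough text, where macro references inside single-quoted literals would not resolve:

%let inputFile = '/path/to/sales.parquet';  /* quotes kept in the value */
%let aggFunc   = sum;                       /* one of the supported aggregate functions */
%let aggColumn = total_cost;                /* column to aggregate */
%let groupVar  = region;                    /* group-by column */

proc sql;
   connect to duckdb (database="/path/to/my.duckdb");  /* connection options are placeholders */
   create table work.result as
   select * from connection to duckdb (
      select &groupVar., &aggFunc.(&aggColumn.) as agg_value
      from read_parquet(&inputFile.)
      group by &groupVar.
   );
   disconnect from duckdb;
quit;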

 

 

On the Parquet side

 

Parquet is a columnar file format. What this means, in simple terms, is that the data for each column is stored together and accompanied by useful statistics located in the column's metadata, in the footer section of a Parquet file. These statistics are pre-calculated (at the time of file creation) over chunks of observations called row groups, and various statistics (such as min, max and null counts) are available in zone maps within these row groups. The key advantages when we perform an aggregation function are:

 

  1. Only the columns relevant to the aggregation (i.e. the column being aggregated and any associated columns, such as GROUP BY and WHERE columns) need to be read into memory. This is referred to as column pruning.
  2. Only the row groups whose zone maps are relevant to the query need to be read. For example, if the query contains a WHERE clause specifying that a variable X should be < 10, only the row groups whose zone maps satisfy this condition are read into memory for processing. This is referred to as zone map skipping. You can inspect these statistics yourself, as shown after this list.
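
DuckDB ships a parquet_metadata table function that exposes the footer statistics per row group and per column (the file path below is hypothetical):

-- Per-row-group, per-column statistics stored in the Parquet footer
SELECT row_group_id, path_in_schema, stats_min, stats_max, num_values
FROM parquet_metadata('/path/to/sales.parquet');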

 

 

On the DuckDB side

 

We already mentioned that DuckDB reads from Parquet and other data sources without needing to make a copy of the data. In addition to this, DuckDB performs predicate pushdown, a technique used by query processing engines to apply any filters imposed by the query at the earliest possible stage. The benefit is that less data needs to be read in order to process the query.

 

Taking up that simple earlier example of a `WHERE X < 10` filter again: in the absence of predicate pushdown, DuckDB would read all the data and then apply the WHERE clause, incurring a cost in terms of I/O and CPU. With predicate pushdown, DuckDB automatically takes advantage of the zone maps offered by Parquet's metadata and can identify which row groups to consider (and which to skip) before loading the data into its engine for processing. Once candidate row groups are identified, column pruning ensures that only the columns relevant to the query are read, further reducing the data to be processed. You thus notice the symbiotic relationship between DuckDB and Parquet: Parquet supports columnar reads and offers metadata (zone maps), which DuckDB exploits to process the query efficiently.
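
A quick way to see the pushdown at work is DuckDB's EXPLAIN: in the resulting plan, the filter appears within the Parquet scan operator itself rather than as a separate step (file path hypothetical):

EXPLAIN
SELECT sum(total_cost)
FROM read_parquet('/path/to/sales.parquet')
WHERE x < 10;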

 

Another remarkable feature of DuckDB is its parallelism. While parallelisation is not fundamentally unique to DuckDB, what stands out is its ability to extract performance from a lightweight footprint, running as a single embedded engine without requiring large-scale distributed computing infrastructure. DuckDB can parallelise its operations across different row groups and chunks of data, and even across multiple source files. For example, users are able to run DuckDB queries across a set of Parquet files using a wildcard naming pattern, as shown in this example.

 

SELECT sum(total_cost)
FROM read_parquet('/path/to/*.parquet');
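
By default DuckDB uses all available cores; the degree of parallelism can also be capped explicitly, which is useful on shared machines:

-- Limit the current DuckDB session to four threads
SET threads = 4;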

 

 

In Summary

 

Modern Artificial Intelligence (AI) platforms have room for a variety of file formats, engines, and processing paradigms. As organisations sharpen their focus on cost amid growing data volumes, tools like this custom step bring the efficiency of DuckDB and Parquet into the SAS ecosystem.

 

 

References

 

  1. Apache Parquet format, https://github.com/apache/parquet-format
  2. DuckDB, https://duckdb.org/
  3. SAS/ACCESS Interface to DuckDB
  4. Why DuckDB, https://duckdb.org/why_duckdb
  5. The SAS SQL Procedure, PROC SQL
  6. GitHub repository for DuckDB Parquet aggregations, https://github.com/SundareshSankaran/duckdb-parquet-aggregations

 
