
The Quack is Back: SAS/ACCESS Meets DuckDB


Recently I learned that SAS is developing a new SAS/ACCESS engine for something called DuckDB. I wondered what’s so cool about it, since SAS already provides all sorts of access engines and they all pretty much do the same thing: enable smooth access to external data sources. I started to dig deeper and found that the two adjectives most often used to describe DuckDB are performance and simplicity. Other commonly heard arguments are that it’s lightweight enough to run locally and easy to configure. DuckDB is an in-process analytics engine and thus requires no server to be installed. This means that you can use it directly inside SAS, Python, R, or even Jupyter notebooks without any additional configuration.

 

Why does it perform so well compared to many other database engines? DuckDB uses something called vectorized execution on columnar storage. This makes it extremely fast for analytical queries, even on very large data. DuckDB can, for example, query Parquet, CSV, JSON, and Arrow files directly without producing additional copies of the data. The great thing about DuckDB is that you don’t first have to load data into a database: the engine can query the data where it lives.
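As a quick sketch of what this looks like in DuckDB’s own SQL (the file names here are hypothetical):

```sql
-- Query files where they live: no CREATE TABLE or load step needed
SELECT * FROM 'sales_2024.parquet' LIMIT 10;        -- Parquet, read in place
SELECT count(*) FROM read_csv_auto('events.csv');   -- CSV, schema inferred
SELECT * FROM read_json_auto('logs.json');          -- JSON, likewise
```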

 

Vectorized execution sounds complex, but it simply means that data is processed in chunks (vectors, or batches) instead of row by row, as traditional database engines do. Row-based engines (like PostgreSQL or SQLite) process one row at a time, which is fine for small data but cumbersome on very large datasets. A vectorized engine like DuckDB processes many values at once (2,048 values per vector by default), using cache-friendly memory access and SIMD (Single Instruction, Multiple Data) CPU instructions that apply the same operation to multiple data points concurrently. This is super-efficient on modern multicore CPUs, utilizes caches well, and significantly speeds up processing.
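If you want to see the plan DuckDB builds for this kind of execution, its EXPLAIN statement prints the physical operator tree (the table name here is borrowed from the benchmark later in this post):

```sql
-- Shows the operators (scan, hash group by, projection) that DuckDB
-- runs over vectors of values rather than single rows
EXPLAIN SELECT payment_type, avg(fare_amount)
FROM 'yellow_tripdata_2011.parquet'
GROUP BY payment_type;
```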

 

For the data analyst, DuckDB supports a rich SQL dialect with joins, common table expressions, window functions, aggregations, filtering, grouping, and regular expressions. This makes DuckDB ideal for data developers who prefer SQL over procedural code.
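For example, a common table expression combined with window functions looks like this (a sketch against a hypothetical Parquet file):

```sql
WITH trips AS (
    SELECT payment_type, fare_amount
    FROM 'yellow_tripdata.parquet'
    WHERE fare_amount > 0
)
SELECT payment_type,
       fare_amount,
       -- window functions: a per-group average and a global rank
       avg(fare_amount) OVER (PARTITION BY payment_type) AS avg_fare_by_type,
       rank() OVER (ORDER BY fare_amount DESC)           AS fare_rank
FROM trips;
```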

 

DuckDB is rapidly evolving, with extensions being developed to enhance its integration capabilities even further. Of particular interest are the extensions that integrate with cloud object stores like S3 and ADLS, and the extension that supports reading Iceberg tables from local or cloud-backed catalogs. The Iceberg extension works with Parquet-backed Iceberg tables and utilizes DuckDB’s high-performance Parquet reader. This extension is further described in the Iceberg Extension pages of the DuckDB documentation.
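In DuckDB’s own SQL, reading an Iceberg table is a sketch like this (the table path is hypothetical; check the Iceberg Extension documentation for the current syntax):

```sql
INSTALL iceberg;
LOAD iceberg;
-- iceberg_scan reads the Parquet data files behind an Iceberg table
SELECT count(*) FROM iceberg_scan('s3://my-bucket/warehouse/trips');
```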

 

Now, that’s enough theory; let’s have a look at how SAS/ACCESS to DuckDB works. If you’re familiar with connecting SAS to external data, you know it’s done with a SAS LIBNAME statement. SAS has one on the roadmap for DuckDB, and it will be called the SASIODUK libname engine.

 

Here’s an example of connecting the DuckDB engine to a folder of Parquet files:

libname duklib sasioduk
    file_type=parquet
    file_path="<your path to parquet files>";

After running it, SAS will report:

NOTE: Libref DUKLIB was successfully assigned as follows: 
      Engine:        SASIODUK 
      Physical Name: DuckDB

SAS procedures work with SASIODUK just like they do with SAS datasets, for example PROC CONTENTS:

proc contents data=duklib.'nocars.parquet'n;
quit;

Resulting in the expected output:

kuva01.png

 

But what’s truly impressive about DuckDB is the performance. Here is a SQL query with multiple calculations, filters, aggregation, and a sort, run against a 27 GB SAS dataset:

PROC SQL;
  SELECT
    passenger_count, 
    payment_type,
    count(*)           AS num_trips,
    avg(trip_distance) AS avg_distance,
    avg(fare_amount)   AS avg_fare,
    avg(tip_amount)    AS avg_tip
  FROM 
    saslib.yellow_tripdata_2011 
  WHERE
    passenger_count is not NULL AND
    passenger_count > 0 AND
    passenger_count < 5 AND
    trip_distance < 100 AND
    trip_distance > 0
  GROUP BY
    passenger_count, payment_type
  ORDER BY 
    payment_type, passenger_count;
QUIT;

On a SAS dataset, this query takes about 2 minutes to run:

NOTE: PROCEDURE SQL used (Total process time):
      real time           1:56.27
      cpu time            1:22.98

Next, run the same SQL query using the DuckDB engine on a Parquet table:

PROC SQL;
  SELECT
    passenger_count, 
    payment_type,
    count(*)           AS num_trips,
    avg(trip_distance) AS avg_distance,
    avg(fare_amount)   AS avg_fare,
    avg(tip_amount)    AS avg_tip
  FROM 
    duklib.'2011/*.parquet'n
  WHERE
    passenger_count is not NULL AND
    passenger_count > 0 AND
    passenger_count < 5 AND
    trip_distance < 100 AND
    trip_distance > 0
  GROUP BY
    passenger_count, payment_type
  ORDER BY 
    payment_type, passenger_count;
QUIT;

Looking at the execution time for this code, it’s easy to spot the difference:

NOTE: PROCEDURE SQL used (Total process time):
      real time           2.54 seconds
      cpu time            26.24 seconds

The same query, run through the DuckDB libname engine on the same data stored as a Parquet file of about 2.7 GB, finishes in a lightning-fast 2.5 seconds! The graph below illustrates the execution time comparison.

kuva02.png

 

I’ve been working with data for decades, and witnessing an almost 50x performance improvement with this new technology is still almost beyond belief! If you’re a bit of a sceptic like me and cling to your SAS datasets, I hear you... but I still recommend having a look at this. And not just because of the stellar performance DuckDB can bring, but also because of the kick-ass compression that Parquet offers, saving you money in storage costs.

 

Now that we have seen that DuckDB can do awesome things with locally stored Parquet files, you probably wonder what to do with all the Parquet files in your S3 or ADLS. No problem, DuckDB’s got you covered with the S3/AWS extension and the ADLS/Azure extension. Using ADLS as an example, we need to install the appropriate Azure and HTTPFS extensions into our extension directory. The DuckDB commands below have to be wrapped inside PROC SQL EXECUTE statements, referencing the previously created SASIODUK libname reference:

libname duklib sasioduk;

PROC SQL;
  connect using duklib;

  execute (SET extension_directory = '/tmp/.extensions') by duklib;
  execute (INSTALL httpfs) by duklib;
  execute (LOAD httpfs) by duklib;
  execute (INSTALL azure) by duklib;
  execute (LOAD azure) by duklib;

Then create a secret that uses an Azure Managed Identity for authentication:

  execute (
    CREATE OR REPLACE SECRET secret2 (
        TYPE         azure,
        PROVIDER     credential_chain,
        CHAIN        managed_identity,
        ACCOUNT_NAME '<your account name here>'
    )
  ) by duklib;

To prove it works, run a simple SQL SELECT against a Parquet file in ADLS. Note that ABFSS in the URI stands for Azure Blob File System Secure.

  select * from connection to duklib (
    SELECT * FROM 'abfss://<your directory here>/*.parquet'
    LIMIT 10
  );
QUIT;

Resulting in this log output (plus, of course, the actual 10 rows from the Parquet file):

NOTE: PROCEDURE SQL used (Total process time):
      real time           3.91 seconds
      cpu time            2.08 seconds
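For S3 the pattern is analogous. Here is a sketch I haven’t tested myself; the bucket, key names, and region are hypothetical:

```sas
PROC SQL;
  connect using duklib;
  /* DuckDB S3 secrets take a key id, secret, and region */
  execute (
    CREATE OR REPLACE SECRET s3_secret (
        TYPE   s3,
        KEY_ID 'my_key_id',
        SECRET 'my_secret_key',
        REGION 'eu-north-1'
    )
  ) by duklib;
  select * from connection to duklib (
    SELECT * FROM 's3://my-bucket/data/*.parquet' LIMIT 10
  );
QUIT;
```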

I’m super impressed with the flexibility that DuckDB offers through its extensions; you can find the full list in the DuckDB Core Extensions documentation. I like the easy adding of extensions so much that I’ll add it to my ‘Why DuckDB’ list: 1) Performance, 2) Simplicity, and 3) Extensibility.

 

In case I got you interested in using DuckDB with SAS, you’ll be positively surprised by the recent SAS Viya roadmap developments. Until now, DuckDB users had to rely on JDBC to get SAS working with DuckDB. The mighty duck is about to get some more air beneath its wings, since SAS/ACCESS to DuckDB, aka SASIODUK, is planned for the 2025.07 stable release of SAS Viya (with the usual caveats about software release schedules).

 

Comments

@jarno 

Thank you for this informative article. I would be interested in reading about testing community DuckDB extensions, similar to what you have done with selected core extensions.

I wonder which version of DuckDB this SASIODUK engine is based on? I hope it's 1.3.0+. How will it keep up with future DuckDB releases?

Great article @jarno , thanks!


Any experiences in SAS 9.4? And with multi-database support (https://duckdb.org/2024/01/26/multi-database-support-in-duckdb.html) queried from SAS?

Thanks for your interest @AhmedAl_Attar and @JuanS_OCS ! I'll try to answer all questions so far:

 

Which version of DuckDB is the SASIODUK engine based on?
- Current development and testing is being done with 1.3.0

 

Support for multi-database queries from SAS?
- SAS has not tested the multi-database support, though we would expect it to work. You would need to use explicit SQL to set it up as per the link in your question.
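As an untested sketch, that explicit SQL would use DuckDB's ATTACH statement through pass-through (the file and table names here are hypothetical):

```sas
PROC SQL;
  connect using duklib;
  /* ATTACH is DuckDB's multi-database syntax */
  execute (ATTACH 'other_db.duckdb' AS other) by duklib;
  select * from connection to duklib (
    SELECT * FROM other.some_table LIMIT 10
  );
QUIT;
```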

 

Any experiences in SAS 9.4?
- Currently there are no plans to support SAS 9.4

Tom

Aren't you missing an EXECUTE BY DUKLIB ( ... ) wrapper around the DuckDB syntax in your last PROC SQL step? I do not remember PROC SQL having support for LOAD, INSTALL, or REPLACE statements.

You're absolutely correct @Tom , it needs the EXECUTE statement in the form of:

PROC SQL;
connect using duklib;
EXECUTE ( ...

 

I did try to mention that in my explanation above:

The commands below can be wrapped inside our PROC SQL EXECUTE statement and referencing the previously created SASIODUK libname reference:


