We’ve already explained what DuckDB is, how DuckDB is integrated into SAS Viya, and how to use its proprietary format in-memory or on-disk.
Now it’s time to explore one of the capabilities that has generated so much excitement around DuckDB in the data engineering space: the ability to transparently and easily query open file and table formats (Parquet, Delta Lake, Iceberg, and others) without formally importing the data first.
DuckDB makes accessing external files extremely simple. It provides a rich library of native functions to read files directly, such as the read_parquet, read_json, delta_scan, and iceberg_scan functions:
SELECT TXT_1."VendorID", TXT_1."tpep_pickup_datetime", ... , TXT_1."congestion_surcharge", TXT_1."Airport_fee"
FROM read_parquet('/data/yellow_taxi/yellow_tripdata_2024-11.parquet') TXT_1
WHERE TXT_1."passenger_count" >= 5
In some situations, you don’t even need to explicitly call a function because DuckDB can automatically infer the format from the file extension.
CREATE OR REPLACE TABLE nov24 AS FROM 'yt_2411.parquet'
That covers reading data.
For writing, DuckDB uses the COPY TO directive together with either the file extension or the FORMAT keyword to generate the desired output format:
COPY (SELECT TXT_1."trip_id", TXT_1."pickup_datetime", ... , TXT_1."review_comment" FROM "reviews" TXT_1)
TO '/data/json/reviews.json' (FORMAT JSON)
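COPY also accepts format-specific options alongside FORMAT. A small sketch (table and path are illustrative; COMPRESSION is a standard DuckDB Parquet option):

```sql
-- Write a table to Parquet with explicit compression (illustrative names)
COPY reviews TO '/data/parquet/reviews.parquet' (FORMAT PARQUET, COMPRESSION ZSTD);
```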
This simple and powerful syntax greatly accelerates interoperability across the wide variety of file formats commonly used today.
And this works not only with files stored locally on disk, but also with files stored in cloud object storage systems, as we will see later.
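In plain DuckDB SQL, cloud access looks much like local access once credentials are registered. A sketch for AWS S3 (bucket, key, and region values are placeholders; the httpfs extension handles s3:// URLs):

```sql
INSTALL httpfs;
LOAD httpfs;

-- Register S3 credentials via DuckDB's Secrets Manager (placeholder values)
CREATE SECRET taxi_s3 (
    TYPE S3,
    KEY_ID 'AKIA...',
    SECRET '...',
    REGION 'us-east-1'
);

-- Query the bucket as if it were a local directory
SELECT COUNT(*) FROM read_parquet('s3://my-bucket/yellow_taxi/*.parquet');
```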
DuckDB’s syntax already makes it easy to access a variety of open file formats. SAS Viya simplifies the experience even further.
No need to remember which function to use for an Avro or Iceberg dataset, and no need to worry about the correct options to generate a Parquet or JSON file.
To make these open file and table formats easy to work with in SAS, the DuckDB library engine provides two additional options, FILE_TYPE and FILE_PATH:
libname myduck duckdb file_type=parquet file_path="/gelcontent/data/yellow_taxi" ;
You can then use this SAS library just like any other SAS library. In this case, every input or output table transparently maps to a Parquet dataset. Reading and writing Parquet files therefore becomes seamless, and your users may not even realize they are interacting with Parquet data underneath.
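For instance, a plain DATA step works unchanged. The member names below are illustrative; each maps to a Parquet file of the same name under the library path:

```sas
/* Reads yt_2411.parquet and writes nov24_big.parquet (illustrative names) */
data myduck.nov24_big ;
   set myduck.yt_2411 ;
   where passenger_count >= 5 ;
run ;
```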
If you inspect the DuckDB SQL generated behind the scenes, you will recognize the functions and special DuckDB syntax discussed earlier. It’s also a great way to learn how DuckDB operates internally.
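One way to surface that generated SQL, assuming the standard SAS/ACCESS tracing options apply to the DuckDB engine as they do to other engines, is:

```sas
/* Echo the SQL passed to the engine in the SAS log */
options sastrace=',,,d' sastraceloc=saslog nostsuffix ;
```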
And if multiple formats coexist in the same directory, no problem: you can simply define multiple libraries pointing to the same location with different FILE_TYPE options.
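For example, for a hypothetical directory holding both Parquet and JSON files:

```sas
/* Two views of the same directory, one per file format (illustrative path) */
libname pq duckdb file_path="/data/mixed" file_type=parquet ;
libname js duckdb file_path="/data/mixed" file_type=json ;
```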
The FILE_TYPE option currently supports the following values:
AVRO, CSV, DELTA, EXCEL, ICEBERG, JSON, and PARQUET.
Parquet datasets are often partitioned, either using a structured Hive-style layout or a more unstructured directory organization. In practice, you may want to query an entire tree of Parquet files — assuming they share the same schema — as if they represented a single table.
DuckDB supports this using its globbing syntax:
SELECT COUNT(*) FROM read_parquet('/data/yellow_taxi/year2024/**/*.parquet')
** enables recursive directory traversal, while * performs single-level wildcard matching.
With SAS and the DuckDB library engine, a simple option enables DuckDB’s globbing capabilities: DIRECTORIES_AS_DATA (a library or dataset option valid only when reading Parquet files).
For example, suppose we want to read the following Hive-partitioned Parquet dataset:
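A typical Hive-style layout for such a dataset might look like this (directory and file names are illustrative):

```text
/data/yellow_taxi/yt_part/
├── year=2024/
│   ├── month=01/part-0.parquet
│   └── month=02/part-0.parquet
└── year=2025/
    └── month=01/part-0.parquet
```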
Running this SAS code:
libname duck duckdb file_path="/data/yellow_taxi"
file_type=parquet ;
proc sql ;
select count(*)
from duck.yt_part(directories_as_data=yes) ;
quit ;
triggers the following DuckDB SQL:
SELECT COUNT(*) FROM read_parquet('/data/yellow_taxi/yt_part/**/*.parquet')
The really interesting part is that you can even use DuckDB's globbing syntax directly from SAS by using SAS name literals (currently supported for Parquet, JSON, and CSV files):
libname duck duckdb file_path="/data/csv"
file_type=csv ;
proc sql ;
select count(*) from duck."table1/*"n ;
quit ;
data sas_table2 ;
set duck."table2/**/*"n ;
run ;
Today, data is increasingly stored in cloud environments, often within object storage systems. DuckDB can also access open file and table formats directly from cloud object storage platforms such as AWS S3, Microsoft ADLS, and Google Cloud Storage.
In SAS, this capability is surfaced through additional options in the DuckDB library definition, as shown below for Microsoft ADLS:
libname az_ib duckdb file_path="az://data/iceberg"
file_type=iceberg
azure_tenant_id="&tenant_id"
azure_client_id="&client_id"
azure_client_secret="&client_secret"
azure_accountName="ytsa" ;
This makes it possible to work directly with modern cloud-native datasets while still benefiting from familiar SAS programming workflows.
These capabilities are covered in depth in the course Working with DuckDB in SAS Viya, which explores DuckDB usage, performance, open file formats, and advanced integration scenarios.
Thanks for reading.
Find more articles from SAS Global Enablement and Learning here.