
Keep Calm and Duck On - Your Lakehouse Just Got Smarter


 

Open file formats, introduced some years ago, provide many useful advantages, such as lightning-fast compute performance and stellar compression. However, on top of operating on flat files, data practitioners have seen the need to add database-like capabilities. First came schema evolution, shortly followed by ACID compliance and eventually snapshot management. These features improve manageability by providing modifiable table structures, guaranteed transactions, and data history.

 

Various new open table standards emerged to support changing data in the lake. Iceberg and Delta Lake are the most widely used, as both of these table formats were designed to handle open file formats stored on blob storage. For example, Iceberg uses JSON and Avro files to store table metadata. This, of course, creates the additional overhead of managing these metadata files.

 

The combination of open file formats and open table formats led to the concept of a lakehouse. A data lakehouse can be defined as a modern data architecture that combines the scalability and flexibility of data lakes with the transactional reliability and performance of data warehouses. Adding openness gives us the open lakehouse: a modern, flexible, and scalable arrangement for managing data using open standards and cloud-native technologies.

 

This approach has several benefits:
• Scalability: limited only by the cloud resources you spend.
• Flexibility: supports all data types and formats.
• Cost-efficiency: saves costs through efficient data compression.
• Cloud agnostic: portable across all popular cloud platforms.

 

In my opinion, the main drivers are cost-efficiency and flexibility. With open standards come freedom and the ability to integrate with other tools and technologies that also conform to open standards, such as Parquet. This is why commercial vendors also support these open technologies: the data community wants the best of each world, instead of committing to a single vendor or technology across the entire platform.

 

[Figure: the landscape of competing open table format technologies]

 

As you can see from the picture above, there are many competing technologies in the field. The latest entrant in the table format wars is DuckLake, joining the battle and proudly promising to keep your ducks in a row! The established combatants are Iceberg, Delta Lake, and Hudi, which have already built stable user bases. See the table below for a comparison of common table format features:

 

[Table: comparison of common table format features]

 

Coming in with a quack, DuckLake utilizes a database system to manage your metadata, with the catalog running on any standard SQL database (e.g., PostgreSQL, MySQL, DuckDB, SQLite).
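
As a minimal sketch of how light this can be, here is a local setup where the catalog is just a DuckDB file. The names and paths are illustrative; the syntax follows the DuckLake documentation:

INSTALL ducklake;

-- Catalog metadata goes into the local file 'metadata.ducklake';
-- table data is written as Parquet files under 'data_files/'
ATTACH 'ducklake:metadata.ducklake' AS my_ducklake (DATA_PATH 'data_files/');
USE my_ducklake;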

 

Having your metadata stored in a separate database provides:
• Faster metadata lookups
• Easier schema evolution and snapshot management (see the sketch after this list)
• No need for external catalog services
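
To make the snapshot point concrete, here is a hedged sketch of what this looks like in DuckDB SQL. The table name is hypothetical; the function and time-travel syntax follow the DuckLake extension documentation:

-- List the snapshots that the catalog database has recorded
SELECT * FROM ducklake_snapshots('my_ducklake');

-- Time travel: query a table as of an earlier snapshot version
SELECT * FROM my_ducklake.demo AT (VERSION => 2);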

 

DuckLake claims that this arrangement makes additional catalog services redundant and reduces the operational overhead of maintaining a lakehouse. Having a centralized metadata store also improves consistency and reliability across multiple tables.

 

While there are many advantages that a lakehouse can provide, neither the competition nor DuckLake currently supports constraints, keys, or indexes. These features, found in traditional databases, are commonly used to maintain data integrity, make data rows identifiable, and speed up database searches.

 

The architecture of DuckLake is straightforward. The key factor is the separation of the data layer from the metadata layer. The data layer contains the data in Parquet file format. The metadata layer consists of the catalog database, which holds the table schemas, snapshots, versioning, and transaction logs.
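
A small sketch illustrates the split, assuming the attachment from earlier (table and values are illustrative): ordinary SQL writes produce Parquet files in the data layer, while everything about the table itself lands in the catalog.

CREATE TABLE my_ducklake.demo (id INTEGER, name VARCHAR);
INSERT INTO my_ducklake.demo VALUES (1, 'duck'), (2, 'goose');
-- The inserted rows are written as Parquet files under DATA_PATH;
-- the schema, the new snapshot, and the transaction log entry
-- go to the catalog database.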

 

[Figure: DuckLake architecture, with the data layer separated from the metadata layer]

 

If you’re wondering whether DuckLake is a table format or a data catalog, given that it contains both: the DuckLake team defines it as a data lakehouse format. This is because it also contains a metadata catalog for storing the schema of the data, which typically lives in Parquet files. You can think of it as something similar to Delta Lake with Unity Catalog, or Iceberg coupled with Lakekeeper or Polaris for catalog management.

 

Is DuckLake picky about the infrastructure? The data files of DuckLake must be stored in Parquet; using DuckDB files as storage is not supported at the moment. Data can be stored on any of the popular blob storages, like AWS S3, Azure Blob Storage, Google Cloud Storage, or Cloudflare R2. The database that runs the catalog component, however, simply needs to be a SQL database and can be installed practically anywhere. Attaching the catalog database to DuckLake is very simple:

 

-- Install the DuckLake extension and the PostgreSQL connector
INSTALL ducklake;
INSTALL postgres;

-- Make sure that the database `ducklake_catalog` exists in PostgreSQL.
-- Catalog metadata goes to PostgreSQL; data files are written under DATA_PATH.
ATTACH 'ducklake:postgres:dbname=ducklake_catalog host=your_postgres_host' AS my_ducklake
     (DATA_PATH 'data_files/');
USE my_ducklake;

This example comes from ducklake.select and uses PostgreSQL as the catalog database.
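
From here, everyday operations are plain SQL. As one hedged example (assuming a table named demo already exists in the lake), schema evolution is a single statement that only touches the catalog:

-- Add a column; DuckLake records the schema change in the catalog database
-- without rewriting the existing Parquet data files
ALTER TABLE my_ducklake.demo ADD COLUMN created_at TIMESTAMP;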

 

Why are DuckLake and its contenders important to SAS? You have probably heard that SAS recently released SAS/ACCESS to DuckDB for added performance when accessing open file formats such as Parquet, ORC, and Avro. SAS promotes freedom of choice in storage platforms for optimal cost, capacity, and performance. Data lakehouse platforms like DuckLake, combined with open file formats like Parquet, mean our customers can reap the benefits of a fast-evolving open data architecture with the ability to integrate the best of data management and analytics from SAS.

 

To wrap it up, DuckLake is a cool and innovative data lakehouse platform, because it brings a whole new way to manage both metadata and data. Undeniably, DuckLake is lightweight and performant while preserving flexibility. To learn more about DuckLake, this is a great place to start. And if you prefer to ingest your information in video format, check out this one.

 


Comments

Hi @jarno,

Nice article, I'm a big fan of all things DuckDB and DuckLake 😊

Here is another video showing DuckLake in action: DuckLake w/ Hannes Mühleisen - Practical Data Lunch and Learn, June 4, 2025 (DuckLake from install to use, starts at 14:30).

 
