Parquet is a good choice for analytical workloads that involve filtering and aggregating large datasets. Its efficient compression and rich metadata help accelerate query performance while reducing storage costs. However, writing Parquet files isn't always straightforward, and a naïve approach can produce results that differ from what you expect.
Recently, we released a SAS Studio Custom Step called DuckDB – Extract Parquet Metadata on our GitHub repository. This custom step, built for SAS Studio in SAS Viya, combines a SAS program and a user interface to help users extract and introspect Parquet metadata and copy files with metadata options that optimise common usage patterns. Because the step exposes several parameters and file-writing options, we also highlight potential pitfalls and provide guidance to help you plan a robust Parquet conversion strategy.
Two levers influence the structure and performance of a Parquet file: row groups and partitions. These terms are sometimes used interchangeably, but they operate at different levels. Let's look at what they are and how they differ.
Row groups are horizontal segments of data observations within a Parquet file. At a basic level, if you split a file into two equal halves, each identified by its own ID, you've effectively created two row groups. When Parquet writers (programs that write Parquet files) write a file, each row group is summarised, and column-level statistics are stored in the file's metadata. These summaries allow Parquet readers (such as those triggered by queries against those Parquet files) to quickly assess which row groups contain relevant data, avoiding unnecessary reads.
By default, Parquet writers distribute observations equally across all row groups based on existing order (which may or may not be sorted). However, this tends to provide limited performance benefit in query execution, especially when querying for small subsets of records, for which data can be scattered anywhere in the file. Suppose you have a file with only 20% of customers in delinquent status (BAD = 1). If the target data happens to be fragmented, a query is forced to search all row groups to find observations with BAD = 1 even though most row groups contain no relevant data.
Sorting the data and defining row groups prior to the write can yield significant performance benefits. For instance, if the file happens to be sorted by BAD and written into five row groups (each representing 20% of the file), the query engine can skip four of them entirely after scanning metadata and touch only the row group that contains BAD = 1: in other words, "all eggs in one basket". This translates to faster query performance and smaller I/O overhead, especially on large files.
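As a minimal sketch of this idea in DuckDB SQL (the hmeq table name, output file name, and row group size are illustrative assumptions, not part of the custom step):

-- Sort by BAD before writing, so rows with the same BAD value
-- land in contiguous row groups whose min/max statistics let
-- readers skip irrelevant groups entirely.
COPY (SELECT * FROM hmeq ORDER BY BAD)
TO 'hmeq_sorted.parquet'
(FORMAT PARQUET, ROW_GROUP_SIZE 100000);

ROW_GROUP_SIZE caps the number of rows per row group; choose a value that balances skipping granularity against metadata overhead.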
The Parquet writer options contained in the SAS Studio Custom Step help you tune metadata and row group structure to align with your common query patterns. You can identify those patterns through code analysis or domain knowledge and design row groups that best support how your data is used.
Partitioning organises files into a hierarchical directory structure based on one or more partition columns. As with row groups, the goal is to accelerate data retrieval by minimising the amount of data scanned during queries.
First, let's note that a Parquet "file" can actually consist of multiple files stored within a directory. By default, when a program like DuckDB writes out Parquet data without partitioning, it produces files under a flat structure such as:
MY_FILE.parquet
- data-0.parquet
- data-1.parquet
- ...data-n.parquet
The naming convention and the number of files produced can vary depending on the writer and its configuration parameters, such as ROW_GROUPS_PER_FILE and ROW_GROUP_SIZE.
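As a hedged illustration (table and target names are assumptions), a DuckDB write that caps the row groups per file can produce exactly this kind of multi-file layout:

-- Limit each output file to two row groups; DuckDB then emits
-- a sequence of data files under the target as needed.
COPY hmeq TO 'MY_FILE.parquet'
(FORMAT PARQUET, ROW_GROUPS_PER_FILE 2);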
When partitioning is specified, the Parquet writer creates subdirectories corresponding to each unique value of the partitioning variable. For example, if BAD is specified as a partitioning variable, two subdirectories are written: one containing all observations with BAD = 1 and the other containing all observations with BAD = 0.
MY_FILE.parquet
- BAD = 0
  - data-0.parquet
- BAD = 1
  - data-1.parquet
The hierarchy deepens as more partitioning variables are specified. For example, specifying an additional variable (say, Sex, with values M and F) results in each of the BAD = 1 and BAD = 0 directories being further subdivided into subdirectories for Sex = M and Sex = F.
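A minimal sketch of such a partitioned write in DuckDB SQL (table and directory names are illustrative):

-- Write Hive-style partitions: one directory per BAD value,
-- each further subdivided by Sex.
COPY hmeq TO 'MY_FILE.parquet'
(FORMAT PARQUET, PARTITION_BY (BAD, Sex));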
At query time, users benefit because the engine needs to traverse only those directories that satisfy the query criteria. DuckDB applies this when planning queries that use functions such as read_parquet() and parquet_scan(). Again, only the relevant portion of the data is scanned, and unnecessary reads are avoided.
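For example, a query along these lines (paths are assumptions based on the layout above) would read only the BAD = 1 branch of the directory tree:

-- hive_partitioning tells DuckDB to interpret directory names
-- like BAD=1 as columns and prune non-matching directories.
SELECT COUNT(*)
FROM read_parquet('MY_FILE.parquet/**/*.parquet', hive_partitioning = true)
WHERE BAD = 1;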
The DuckDB documentation on writing and scanning Parquet with partitioning is located here: https://duckdb.org/docs/data/parquet. Also, refer to this link for details on how the Apache Parquet file format supports partitioning: https://parquet.apache.org/docs/file-format/#partitioning.
Let's bring these two levers – row groups and partitioning – together and see how they complement each other. While row groups determine how data is organised within a file, partitioning determines how the Parquet files themselves are organised on disk. Choosing the right combination of settings for both levers can deliver significant benefits for query engines such as DuckDB and their host applications or allied programs such as SAS Viya and SAS.
Having familiarised ourselves with row groups and partitioning, let's tackle the question of how best to write a Parquet file. Several options determine row groups and partitioning, and it is easy to get lost in the weeds. Therefore, consider the following guiding principles.
Know Thy Writer
Not all Parquet writers are created equal. The Parquet format was first released in 2013 and is now more than 12 years old. It has evolved steadily through many version updates; as of February 2026, the Apache Parquet library was at version 2.12.0. Not every writer tool has kept pace with these developments. Using older or minimally maintained writers may lead to problems in capturing metadata, the use of outdated compression techniques and encodings, or even failure to capture summary statistics per column. These issues undermine the predicate pushdown that DuckDB is known to exploit efficiently, and hinder query performance.
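One quick way to see what a writer actually recorded is to inspect the file's metadata; DuckDB's parquet_metadata() function surfaces row group and column statistics (the file name below is an assumption):

-- If stats_min/stats_max come back NULL, the writer did not
-- record the statistics that enable predicate pushdown.
SELECT row_group_id, column_id, num_values, stats_min, stats_max
FROM parquet_metadata('my_file.parquet');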
A compelling case can be made for using DuckDB as your Parquet writer, and in the context of the custom step and analytical platform mentioned earlier, through the SAS/Access Interface to DuckDB. DuckDB provides simple SQL mechanisms to write Parquet files, such as the COPY … TO 'file.parquet' command to export tables directly to Parquet. It can create Hive-compatible partitioned directory structures which enable selective scans of the data. DuckDB also records comprehensive row group and column-level statistics in Parquet metadata, giving users faster query performance. Its vectorized processing engine aligns with Parquet's columnar format in the way it processes data, and its multi-threaded execution maximises CPU utilisation, resulting in faster reads and writes.
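For instance, a basic export looks like this (table name, file name, and codec are illustrative; compression defaults to Snappy if omitted):

-- Export a table straight to a single Parquet file,
-- here requesting ZSTD compression.
COPY my_table TO 'my_table.parquet'
(FORMAT PARQUET, COMPRESSION ZSTD);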
Avoid combining partitioning (PARTITION_BY) and row ordering (ORDER BY) within the same write operation. While both are useful options, they can cause subtle and sometimes breaking changes when combined. Partitioning breaks global row ordering: when DuckDB writes to partitions in multi-threaded mode, it writes each partition independently and concurrently, which can break the sort order. There are also increased chances of runtime errors due to other conditions that may occur during the write process. These errors tend to be a consequence of aggressive type checking that occurs in those situations, such as "Invalid unicode (byte sequence mismatch)", for which the linkage to partitioning may not be readily apparent. Some of these errors are captured and discussed on the DuckDB discussion and issue boards.
As a temporary workaround, you can set DuckDB to execute in single-threaded mode with this command:
SET threads TO 1;
If you do this, you stand a better chance of success because all processing is forced under a single thread. Note, however, that this only reduces the likelihood of concurrency-related issues; it does not guarantee that they won't arise. Note also that single-threaded execution impacts performance.
A safer tactic is to separate the two operations: sort first, then partition, as sketched below. While this seems like additional processing, it preserves partitions and ordering to the extent required.
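Here is a minimal sketch of the two-step approach, with illustrative table, column, and directory names (and no guarantee that a multi-threaded partition write preserves within-partition order; combine with SET threads TO 1 if that matters):

-- Step 1: materialise the data in the desired order.
CREATE OR REPLACE TABLE hmeq_sorted AS
SELECT * FROM hmeq ORDER BY REASON;

-- Step 2: partition the pre-sorted table, with no ORDER BY
-- in the same COPY statement.
COPY hmeq_sorted TO 'hmeq_partitioned'
(FORMAT PARQUET, PARTITION_BY (BAD));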
DuckDB provides several configuration options that control how Parquet files are written. Some of these affect row group formation while others affect partitioning. A clear understanding of which options interact with each other helps you avoid runtime errors.
For example, ROW_GROUPS_PER_FILE limits how many row groups' worth of information you can pack into a single data file. However, combining this option with partitioning can create conflicts during write operations. Row groups, as mentioned earlier, are segments of observations (rows) that writers tend to distribute evenly, a characteristic partitions cannot guarantee. The reason is that row groups organise data horizontally within a file, while partitions separate data across multiple files and directories. When both settings are used together, DuckDB's writer cannot reconcile the expected file creation logic, which leads to ambiguity about where a row group should be written. Conflicts also arise from parallel partition writes, because observations from the same row group may be written to different files at the same time, as dictated by partition logic.
To handle this more safely, the SAS Studio custom step contains additional code that automatically ignores ROW_GROUPS_PER_FILE whenever PARTITION_BY is specified. When using raw DuckDB SQL, however, developers should watch out for this conflict and not specify ROW_GROUPS_PER_FILE if they choose to partition.
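In raw DuckDB SQL, that guidance amounts to the following (names are illustrative):

-- Risky: a row-groups-per-file cap combined with partitioning
-- can fail or behave unpredictably at write time.
-- COPY hmeq TO 'hmeq_partitioned'
--   (FORMAT PARQUET, PARTITION_BY (BAD), ROW_GROUPS_PER_FILE 2);

-- Safer: let partition logic alone drive file creation.
COPY hmeq TO 'hmeq_partitioned'
(FORMAT PARQUET, PARTITION_BY (BAD));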
Remember the story of Goldilocks? You want the number of partitions to be neither too many nor too few, but just right. The optimal number of partitions for a file is rarely obvious, and even experienced engineers find that it depends on how the file will be queried later. A rule of thumb is to align partitions with anticipated query filters. Start with the business use case: which columns tend to be frequently accessed and are most likely to be used as selection filters? Those are strong candidates for partitioning. However, not all columns are good partition candidates. High-cardinality or continuous values (e.g., timestamps, IDs, transaction amounts) can generate an excessive number of directories, rendering partitioning ineffective: filesystem metadata overhead increases, the structure grows complicated, and query performance suffers. Low-cardinality columns, such as most categorical variables (Region, Year, Status, etc.), are better candidates if they tend to be used often for filtering. Do not go overboard: for most use cases, one or two partitioning columns are sufficient, and deeply nested hierarchies may not yield much real benefit.
Refer to the Apache Parquet partitioning page for additional guidelines on partitioning: https://parquet.apache.org/docs/file-format/#partitioning
In short: let your data access patterns drive your partition design, and when uncertain, keep it simple. A shallow hierarchy of low-cardinality partitions almost always outperforms an elaborate or over-fitted structure.
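A quick cardinality check can help vet candidates before you commit; here is a hedged sketch with assumed table and column names:

-- Columns with a handful of distinct values are reasonable
-- partition keys; columns with thousands are not.
SELECT
    COUNT(DISTINCT Region)      AS region_levels,
    COUNT(DISTINCT Status)      AS status_levels,
    COUNT(DISTINCT customer_id) AS id_levels  -- almost certainly too high
FROM my_table;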
Avoid overwriting an existing partitioned file unless you fully understand the behaviour of your writer's OVERWRITE option in its environment. Here, the context covers operations in SAS Studio using the SAS/Access Interface to DuckDB on Parquet files located on shared storage, a common usage pattern for SAS users. Acquaint yourself with DuckDB's documentation, which states that overwrite behaviour varies by file system.
Even if overwriting happens to work, it raises logical and structural problems. A partitioned Parquet file is really many Parquet files under different directories. Overwriting might create new subdirectories reflecting the new partition pattern but tends to leave behind stale directories from earlier partitions, which has the potential to cause confusion. Consider this simple example. Imagine that you created a structure which segmented on BAD (a categorical variable with two levels) and REASON (another categorical variable with two levels). The ensuing directory structure would be as follows:
- BAD = 0
  - REASON = Debt Consolidation
  - REASON = Home Improvement
- BAD = 1
  - REASON = Debt Consolidation
  - REASON = Home Improvement
So far, so good. Now, what if you realise that you don't use BAD as a filter very often and therefore don't need that variable? Say you decide to partition the file again (based on REASON only) and attempt to overwrite it. Unless you check the structure manually and make extra sure that old files have been cleared (which may be a simple or laborious task depending on the file structure and the number of files written), there is always a chance that a new user with no knowledge of past operations encounters a directory structure under the Parquet file that looks like this:
- BAD = 0
  - REASON = Debt Consolidation
  - REASON = Home Improvement
- BAD = 1
  - REASON = Debt Consolidation
  - REASON = Home Improvement
- REASON = Debt Consolidation
  - data-0.parquet
- REASON = Home Improvement
  - data-0.parquet
Notice the problem here? Even if overwriting succeeds in removing existing files, old partition directories tend to hang around, and a non-trivial amount of time and effort is taken up verifying that overwrites worked as expected.
As a safer practice, create a new file as a copy and then shift code references to the new file, deleting the older file after some time. This might save you some anxiety. You can version the new file through a naming convention to indicate the changes that have taken place. Once you confirm that the new directory structure is complete, correct and usable, you may then delete or replace the old file. This "copy-then-swap" practice, sketched below, tends to be safer and easier to audit, especially in shared environments or environments where files are created through automated programs.
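A hedged sketch of copy-then-swap in DuckDB SQL (the versioned directory names are assumptions):

-- Write the re-partitioned data to a fresh, versioned location
-- rather than overwriting the existing one.
COPY hmeq TO 'hmeq_v2.parquet'
(FORMAT PARQUET, PARTITION_BY (REASON));

-- After validating hmeq_v2.parquet, point downstream code at it,
-- then remove the old hmeq_v1.parquet directory (a file system
-- operation outside DuckDB).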
In Summary: Be Cautious, Not Afraid
We do not want to convey the message that writing Parquet using DuckDB is full of traps. Rather, we want you to understand that, like any other tool, DuckDB demands thoughtful consideration and care. Failure modes do crop up, and we should plan for them beforehand to the extent practically possible.
DuckDB is very capable at handling Parquet files and provides speed and compression benefits that enable flexible analytics. Its SQL interface and efficient writer functionality make it suitable for both experimentation and production use, provided you design your data structure in a thoughtful manner.
Preparation at early stages pays off later in stability, performance and convenience. Enjoy working with DuckDB and the SAS Studio Custom Step, and do feel free to chime in with additional caveats I have doubtless missed by dropping me a note here: Sundaresh.sankaran@sas.com.