We at SAS have always been very confident about our capability to interact with any kind of data. Be it flat files, databases or cloud storage, if there is a will, there is always a way.
A few claims to set the stage: CSV has been, and still is, the de facto standard for data interchange between systems, and relational databases are still doing a fantastic job at dependable transactional processing. Now that we’ve established that, let’s look at what else is out there.
From the early days of SAS, we saw the need for a storage format that reliably stores structured data and is very fast for full table scans. For transactional processing, row-by-row access is perfect, but for analytics, the fastest way is typically to scan all the rows in one go. Therefore, the sas7bdat format was introduced in the late 90s with the SAS System version 7. In short, it’s a binary encoded package that contains the data along with some metadata, such as column names, labels, SAS formats and SAS informats. One often overlooked benefit was that it was accessible across all supported OS platforms of the time.
While sas7bdat remains great for full table scans in a SAS9 environment, in modern memory-rich computing environments it has a drawback. Data stored in sas7bdat requires conversion when loading into an in-memory processing engine such as SAS Viya’s Cloud Analytic Services (CAS). Like its predecessor engine, LASR, the CAS engine uses sashdat as its native storage format. CAS-based sashdat tables differ from traditional sas7bdat in several ways, including scalability, multi-user access, and parallel processing. As sashdat was designed from the ground up to support in-memory processing, it can be directly memory mapped and requires no conversion at load.
Both of these formats are reliable and perform well, but as data volumes constantly grow, the market has been asking for still better compression efficiency: cloud storage may be plentiful, but it is not free. Both of these SAS proprietary file formats offer compression, and for sas7bdat you can enable it simply by adding compress=yes to your libname statement:
libname mylib 'pathtofolder' compress=yes;
This enables compression for all tables created in the specified library. Back in the day, SAS developers refrained from using compression because CPU power was scarce and expensive. Modern multi-core processors can handle it without skipping a beat, and it doesn’t introduce much lag in dataset access times.
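The CPU-for-storage trade-off behind that advice is easy to see with any general-purpose compressor. Here is a minimal Python sketch using the standard library's zlib (illustrative only; this is not the algorithm SAS uses internally, and the sample payload is made up):

```python
import zlib

# A repetitive payload, standing in for a typical dataset row dump.
data = b"Alfred,M,14,69.0,112.5\n" * 20000

# Higher compression levels spend more CPU for (usually) smaller output.
for level in (1, 6, 9):
    out = zlib.compress(data, level)
    print(f"level {level}: {len(out)} bytes")
```

On hardware from the 90s, level 9 on a large table was a real cost; on a modern core it is barely noticeable, which is why turning compression on is now an easy call.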
Similarly for sashdat, compression can be specified when writing out a sashdat table with the table.save CAS action, using the compress=true option:
proc cas;
   table.save /
      caslib="casuser"
      name="mydata.sashdat"
      table={name="my_table", caslib="casuser"}
      compress=true;
quit;
While storage efficiency can be improved with compression, further optimization is always a benefit when storing data. Also, I’m not hiding the fact that these are SAS proprietary storage formats, and we live in the age of openness. Along with the introduction of cloud storage there has been a rise of new file formats that are open by definition: any vendor can build access tools for them, and many already have. With SAS Viya, SAS too has introduced support for open file formats, including Parquet and ORC. Accessing Parquet files with SAS Viya is simple with the Parquet libname engine:
libname myparq parquet "/export/home/users/user/parquet_folder";
After this, the Parquet files are available to SAS procedures. To verify that the file is actually in Parquet format, we can use the CONTENTS procedure. Pay attention to the engine type and the compression algorithm being used:
proc contents data=myparq.class;
run;
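Openness is the whole point here: the same files can be written and inspected by any Parquet-capable tool, not just SAS. As a sketch, here is the rough Python equivalent of PROC CONTENTS using pyarrow (assuming pyarrow is installed; the table contents and file name are made up for illustration):

```python
import pyarrow as pa
import pyarrow.parquet as pq

# Build a small table and write it out as Parquet. pyarrow compresses
# with Snappy by default, much as the SAS Parquet engine compresses
# by default.
table = pa.table({"name": ["Alfred", "Alice"], "age": [14, 13]})
pq.write_table(table, "class.parquet")

# Inspect the file metadata: row count and the compression codec,
# the same details PROC CONTENTS surfaces on the SAS side.
meta = pq.ParquetFile("class.parquet").metadata
print(meta.num_rows)
print(meta.row_group(0).column(0).compression)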
The libname engine for ORC works exactly the same. To create a library using the ORC engine:
libname myorc orc "/export/home/users/user/orc_folder";
And likewise, SAS procedures work on ORC data. Again we use the CONTENTS procedure to check that the file is in ORC format and that it is compressed, with ZLIB by default:
proc contents data=myorc.class;
run;
One useful performance tip you can also find in the documentation: in the ORC and Parquet engines, compression is on by default, whereas in other SAS engines it is off by default. Turning off compression for these engines almost always reduces performance, especially with a cloud storage system.
To illustrate the compression capabilities of various file formats, I’m including the table below, which compares SAS file formats with the most common open file formats:
This test was run on a randomized synthetic 1GB dataset and represents average compression rates for these storage formats. On live data, compression may perform even better, as real-world data tends to be repetitive.
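That last point, that repetitive data compresses far better than random data, is easy to demonstrate with a few lines of stdlib Python (the sample records are invented for illustration):

```python
import os
import zlib

# Repetitive, "real world"-style records vs. purely random bytes.
repetitive = b"2024-01-01,store_42,SKU-001,19.99\n" * 30000
random_bytes = os.urandom(len(repetitive))

# Compressed size as a fraction of the original size.
ratio_rep = len(zlib.compress(repetitive)) / len(repetitive)
ratio_rnd = len(zlib.compress(random_bytes)) / len(random_bytes)

print(f"repetitive: {ratio_rep:.3f}")  # far below 1.0
print(f"random:     {ratio_rnd:.3f}")  # about 1.0 -- incompressible
```

This is why a randomized synthetic benchmark is a conservative baseline: your production tables will usually beat it.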
Now you might be thinking that sas7bdat has already served SAS users reliably for many years and you’ve never had any problems with respect to size or performance. Rest assured, sas7bdat is not going away, even as vendors like SAS invest heavily in better support for the open file formats.
It’s comforting to know that if your data volumes are growing and cloud storage costs are a pain in your cloud budget, there are reliable remedies to the problem. I know that not all data is created equal: some data is used all the time, while other data is mainly stored for safekeeping. You may have heard the buzzword data tiering, and that just might be the topic for my next blog. Stay tuned!
Extremely useful, thanks a lot for sharing @jarno . In particular, the benchmark table by format X compression method is priceless. 🙂 Wanted for a long time to have such figures. How many obs and variables (CHAR, NUM) did the 1Gb reference table have ?
I've been messing around some with storing my SAS data using the parquet LIBNAME engine. I'm impressed with both the reduction in size on disk and the speed of access. It's awesome!
Hello @Ronan_Lincoln, the 1GB dataset used in the benchmark table has 12 vars and 8.6 million obs.
The comparison table is great! One question: if you stored the data compressed, is the data easily accessible to work with, or do you need to decompress it first?
Hello @touwen_k and thanks for the comment! In general, compressed SAS datasets work just like uncompressed ones. With the open file formats it's a bit different. Have a look at this link to find out about possible limitations: ORC and Parquet engines - Restrictions for SAS Features