SAS Viya 3.5 introduces support for several new file types, among them two very popular columnar storage formats widely used in the Hadoop ecosystem: Apache Parquet and Apache ORC. Let’s talk about Parquet file support: what it is, what it means from a CAS perspective, and what initial benefits we can expect from it.
“Apache Parquet is a free and open-source column-oriented data storage format of the Apache Hadoop ecosystem.”
Simply put, instead of storing data row by row, values are arranged and stored column by column, as illustrated below:
(Image: row-oriented vs. column-oriented storage layout, extract of https://www.slideshare.net/cloudera/hadoop-summit-36479635)
Columnar storage was designed as an alternative to row-based storage, and Apache Parquet brings significant advantages: values of the same data type are stored together, so they compress and encode very efficiently, and analytical queries can read only the columns they need instead of scanning entire rows.
Starting with Viya 3.5, CAS supports reading and writing Apache Parquet files through three CASLIB types: PATH, DNFS, and S3. So, CAS can access and write Parquet files located on a file system path visible to CAS (PATH), on an NFS mount shared by all the CAS nodes for parallel access (DNFS), and on Amazon S3 storage (S3).
Notice that the HDFS CASLIB is not in scope. Also, the Parquet file support is available only on Linux for both SMP and MPP CAS.
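As a quick illustration, here is a minimal sketch of the read side, assuming a hypothetical /gelcontent/data directory that contains a megacorp.parquet file:

```sas
cas mysess;                                       /* start a CAS session */

/* Path-based CASLIB pointing at a directory that holds Parquet files */
/* (the path and file names are placeholders)                         */
caslib pqlib datasource=(srctype="path")
   path="/gelcontent/data" sessref=mysess;

proc casutil incaslib="pqlib" outcaslib="pqlib";
   /* load a single .parquet file into memory as the CAS table MEGACORP */
   load casdata="megacorp.parquet" casout="megacorp" replace;
quit;
```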
From a physical standpoint, CAS can READ Parquet data from a single file (.parquet extension) or from a directory of Parquet partitions. In the latter case, both the directory and the partition files it contains carry the .parquet extension.
As for WRITE, CAS only creates Parquet files in directories. It does not create a single Parquet file.
In order to see the Parquet files using CAS tools, the CASLIB has to be defined with the “subdirs” option, as in the sketch below.
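Continuing the hypothetical example above, this sketch redefines the CASLIB with SUBDIRS, saves the in-memory table back to Parquet (remember that CASOUT here names a .parquet directory of partition files, not a single file), and lists the files:

```sas
caslib pqlib drop;                                /* drop, then redefine with SUBDIRS */
caslib pqlib datasource=(srctype="path")
   path="/gelcontent/data" subdirs sessref=mysess;

proc casutil incaslib="pqlib" outcaslib="pqlib";
   /* SAVE creates a megacorp_pq.parquet directory of partition files */
   save casdata="megacorp" casout="megacorp_pq.parquet" replace;
   list files;                                    /* .parquet directories now appear */
quit;
```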
Quoting Brian Bowman from the R&D Data Management CAS Team, “Apache Parquet is deeply integrated into CAS table architecture internals and therefore exploits massive thread and MPP parallelism for PATH, DNFS and S3 CASLIBs.” However, Parquet is not (yet) the format used for persisted CAS tables, either in CAS memory or in the CAS disk cache. When one explicitly loads and persists a Parquet file in Viya 3.5, the resulting CAS table is in SASHDAT format.
When using Parquet files to back/source CAS tables, one can expect the following benefits over using SASHDAT files:
Although it depends on many criteria, we have seen files up to 30 times smaller with Parquet than with SASHDAT. I’m confident other folks at SAS and customers will see even better ratios too.
For example, a custom-created 20-million-row MEGACORP data set is an 11.6 GB SASHDAT file but only 483 MB in Parquet format, almost 25 times smaller.
When using S3, smaller means cheaper.
Load times are shorter too. Since the input file is smaller, the load requires less I/O. The Parquet file is still converted to SASHDAT internally (in CAS memory and in the CAS disk cache), but as mentioned earlier, Parquet is well integrated with CAS and the loading phase does not suffer from this conversion:
| 4-node CAS, PATH CASLIB | Load time (sec.) with default COPIES=1 | Load time (sec.) with COPIES=0 |
|---|---|---|
| SASHDAT (11.6 GB) | 97.12 | 54.38 |
| Parquet (483 MB) | 51.66 | 18.65 |
| Speedup with Parquet | ~2x | ~3x |
Here we see loads that are 2 to 3 times faster when using a Parquet file from a PATH CASLIB.
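For reference, a load similar to the COPIES=0 measurement above could look like the following sketch, reusing the hypothetical names from earlier (COPIES= controls how many redundant copies of the in-memory table blocks CAS keeps for failover):

```sas
proc casutil incaslib="pqlib" outcaslib="pqlib";
   /* COPIES=0 turns off redundant block copies, trading failover */
   /* protection for a faster load                                */
   load casdata="megacorp.parquet" casout="megacorp" copies=0 replace;
quit;
```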
Keep in mind that once the table is loaded, any subsequent CAS processes on this table run in similar times regardless of the source file because the CAS table is in the same format (SASHDAT) in both cases.
Finally, the Parquet file format is quickly gaining adoption in open source and cloud, which makes it a good standard for exchanging data easily and efficiently in modern ecosystems.
Thanks to Brian Bowman for providing early insights on the Parquet file format support in CAS.