
SAS Viya 3.5 Parquet file support - Quicker loads and smaller files

Started 01-09-2020 · Modified 01-09-2020 · Views 7,891

SAS Viya 3.5 introduces support for a couple of new file types. Among them are two very popular columnar storage formats widely used in the Hadoop ecosystem: Apache Parquet and Apache ORC. Let’s talk about the Parquet file support: what it is, what it means from a CAS perspective, and what first benefits we can expect from it.

What it is

Apache Parquet is a free and open-source column-oriented data storage format of the Apache Hadoop ecosystem.

 

Simply put, instead of storing data row by row, values are arranged and stored column by column, as shown below:

 

[Figure: row-based vs. columnar data layout. Extract of https://www.slideshare.net/cloudera/hadoop-summit-36479635]

 

Columnar storage was designed as an alternative to row-based storage:

  • A row-based layout is great when you need to access many columns across many records of a big data set
  • A columnar layout is great when you need to compute various statistics on a few columns of a big data set

Apache Parquet has significant advantages:

  • It limits the I/O to only the data that is needed
    • Unused columns are NOT read
  • It saves (a lot of) space
    • Column layout enables a better compression

What it means from a CAS perspective

Starting with Viya 3.5, CAS supports the reading and writing of Apache Parquet files through 3 CASLIB types: PATH, DNFS and S3.

 

So, CAS can read and write:

  • Parquet files on the CAS Controller
  • Parquet files on a network location accessible from all the CAS nodes
  • Parquet files on AWS S3

Notice that the HDFS CASLIB is not in scope. Also, the Parquet file support is available only on Linux for both SMP and MPP CAS.

 

From a physical standpoint, CAS can READ Parquet data either from a single file (with the .parquet extension) or from a directory of Parquet partitions. In the latter case, both the directory and the partition files are named with the .parquet extension.

 

As for WRITE, CAS only creates Parquet files in directories. It does not create a single Parquet file.

 

To see the Parquet files using CAS tools, the CASLIB has to be defined with the “subdirs” option.

 

Quoting Brian Bowman from the R&D Data Management CAS Team, “Apache Parquet is deeply integrated into CAS table architecture internals and therefore exploits massive thread and MPP parallelism for PATH, DNFS and S3 CASLIBs.” However, for persisted CAS tables, Parquet is not (yet) the format used in CAS memory or in the CAS disk cache. When one explicitly loads and persists a Parquet file in Viya 3.5, the CAS table is in SASHDAT format.

What are the benefits

When using Parquet files to back/source CAS tables, one can expect the following benefits over using SASHDAT files:

  • Way smaller files on disk
  • Faster load times
  • Easier integration with 3rd party tools

Although it depends on many criteria, we have seen up to 30 times smaller files when using Parquet instead of SASHDAT. I’m confident other folks at SAS and customers will see even better ratios too.

 

For example, a custom-built 20-million-row MEGACORP data set is an 11.6 GB SASHDAT file but only 483 MB in Parquet format: almost 25 times smaller.
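That ratio is simple arithmetic on the two file sizes quoted above:

```python
# File sizes from the MEGACORP example above.
sashdat_mb = 11.6 * 1024  # 11.6 GB expressed in MB
parquet_mb = 483

ratio = sashdat_mb / parquet_mb
print(round(ratio, 1))  # 24.6 -- "almost 25 times smaller"
```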

 

When using S3, smaller means cheaper.

 

Load times are shorter too. Since the input file is smaller, the load requires less I/O. The Parquet file is still converted to SASHDAT internally (CAS memory and CAS disk cache), but as mentioned earlier, Parquet is well integrated with CAS and the loading phase does not suffer from this conversion:

 

4-node CAS, PATH CASLIB   Load time (s), default COPIES (1)   Load time (s), COPIES=0
SASHDAT - 11.6 GB         97.12                               54.38
Parquet - 483 MB          51.66                               18.65
Times faster              ~2                                  ~3

 

Here we are seeing 2 to 3 times faster loads when using a Parquet file from a PATH CASLIB.

 

Keep in mind that once the table is loaded, any subsequent CAS processes on this table run in similar times regardless of the source file because the CAS table is in the same format (SASHDAT) in both cases.

 

Finally, the Parquet file format is quickly gaining adoption in open source and cloud, which makes it a good standard for exchanging data easily and efficiently in modern ecosystems.

 

Thanks to Brian Bowman for providing early insights on the Parquet file format support in CAS.
