SAS Viya 3.5 introduces support for several new file types, among them two very popular columnar storage formats widely used in the Hadoop ecosystem: Apache Parquet and Apache ORC. Let’s talk about Parquet file support: what it is, what it means from a CAS perspective, and what initial benefits we can expect from it.
“Apache Parquet is a free and open-source column-oriented data storage format of the Apache Hadoop ecosystem.”
Simply put, instead of storing data row by row, values are arranged and stored column by column, as illustrated below:
(Image: row-oriented vs. column-oriented storage layout, extract of https://www.slideshare.net/cloudera/hadoop-summit-36479635)
Columnar storage was designed as an alternative to row-based storage, and Apache Parquet brings significant advantages: values of the same data type are stored together, so they compress and encode very efficiently, and analytical queries can read only the columns they need instead of scanning entire rows.
Starting with Viya 3.5, CAS supports reading and writing Apache Parquet files through three CASLIB types: PATH, DNFS, and S3. So, CAS can access and write Parquet files located on a file system path visible to CAS (PATH), on an NFS mount shared by all the CAS nodes for parallel access (DNFS), and on Amazon S3 storage (S3).
Notice that the HDFS CASLIB is not in scope. Also, the Parquet file support is available only on Linux for both SMP and MPP CAS.
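As a quick illustration, here is a minimal sketch of the read side, assuming a hypothetical /gelcontent/data directory that contains a megacorp.parquet file:

```sas
cas mysess;                                       /* start a CAS session */

/* Path-based CASLIB pointing at a directory that holds Parquet files */
/* (the path and file names are placeholders)                         */
caslib pqlib datasource=(srctype="path")
   path="/gelcontent/data" sessref=mysess;

proc casutil incaslib="pqlib" outcaslib="pqlib";
   /* load a single .parquet file into memory as the CAS table MEGACORP */
   load casdata="megacorp.parquet" casout="megacorp" replace;
quit;
```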
From a physical standpoint, CAS can READ Parquet data from a single file (.parquet extension) or from a directory of Parquet partitions. In the latter case, both the directory and the partition files it contains carry the .parquet extension.
As for WRITE, CAS only creates Parquet files in directories. It does not create a single Parquet file.
In order to see the Parquet files using CAS tools, the CASLIB has to be defined with the “subdirs” option, as in the sketch below.
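Continuing the hypothetical example above, this sketch redefines the CASLIB with SUBDIRS, saves the in-memory table back to Parquet (remember that CASOUT here names a .parquet directory of partition files, not a single file), and lists the files:

```sas
caslib pqlib drop;                                /* drop, then redefine with SUBDIRS */
caslib pqlib datasource=(srctype="path")
   path="/gelcontent/data" subdirs sessref=mysess;

proc casutil incaslib="pqlib" outcaslib="pqlib";
   /* SAVE creates a megacorp_pq.parquet directory of partition files */
   save casdata="megacorp" casout="megacorp_pq.parquet" replace;
   list files;                                    /* .parquet directories now appear */
quit;
```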
Quoting Brian Bowman from the R&D Data Management CAS Team, “Apache Parquet is deeply integrated into CAS table architecture internals and therefore exploits massive thread and MPP parallelism for PATH, DNFS and S3 CASLIBs.” However, Parquet is not (yet) the format used for persisted CAS tables, either in CAS memory or in the CAS disk cache. When one explicitly loads and persists a Parquet file in Viya 3.5, the resulting CAS table is in SASHDAT format.
When using Parquet files to back/source CAS tables, one can expect the following benefits over using SASHDAT files:
Although it depends on many criteria, we have seen files up to 30 times smaller with Parquet than with SASHDAT. I’m confident other folks at SAS and customers will see even better ratios too.
For example, a custom-created 20-million-row MEGACORP data set is an 11.6 GB SASHDAT file but only 483 MB in Parquet format, almost 25 times smaller.
When using S3, smaller means cheaper.
Load times are shorter too. Since the input file is smaller, the load requires less I/O. The Parquet file is still converted to SASHDAT internally (in CAS memory and in the CAS disk cache), but as mentioned earlier, Parquet is well integrated with CAS and the loading phase does not suffer from this conversion:
| 4-node CAS, PATH CASLIB | Load time (sec.) with default COPIES=1 | Load time (sec.) with COPIES=0 |
|---|---|---|
| SASHDAT (11.6 GB) | 97.12 | 54.38 |
| Parquet (483 MB) | 51.66 | 18.65 |
| Speedup with Parquet | ~2x | ~3x |
Here we see loads that are 2 to 3 times faster when using a Parquet file from a PATH CASLIB.
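For reference, a load similar to the COPIES=0 measurement above could look like the following sketch, reusing the hypothetical names from earlier (COPIES= controls how many redundant copies of the in-memory table blocks CAS keeps for failover):

```sas
proc casutil incaslib="pqlib" outcaslib="pqlib";
   /* COPIES=0 turns off redundant block copies, trading failover */
   /* protection for a faster load                                */
   load casdata="megacorp.parquet" casout="megacorp" copies=0 replace;
quit;
```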
Keep in mind that once the table is loaded, any subsequent CAS processes on this table run in similar times regardless of the source file because the CAS table is in the same format (SASHDAT) in both cases.
Finally, the Parquet file format is quickly gaining adoption in open source and cloud, which makes it a good standard for exchanging data easily and efficiently in modern ecosystems.
Thanks to Brian Bowman for providing early insights on the Parquet file format support in CAS.