SAS Viya 3.5 - CAS and S3 Parquet files

With the SAS Viya 3.5 release, CAS can read and write .parquet files in an Amazon S3 bucket without any intermediate AWS services (AWS EMR, Athena, etc.). CAS can directly read Parquet files that third-party applications (Apache Spark, Hive, etc.) have written to an S3 location. This post describes how to read and write S3 Parquet files from CAS.

An S3-type caslib supports data access to S3 Parquet files. A Parquet data file name must have the .parquet suffix to be loaded into CAS. When a big dataset is split into a set of Parquet data files (with the same file structure) placed in a sub-folder, the sub-folder name must also have the .parquet suffix. Data moves between CAS and the S3 bucket in parallel, with each CAS node reading from and writing to the S3 location.

The following diagram describes the data access from S3-Parquet files to CAS.

The following steps describe how to load a CAS table from, and save a CAS table to, S3 Parquet data files.

Prerequisites:

A valid AWS access key on the CAS controller and nodes.
Read/write permission on the S3 bucket.
Data file and folder names with the .parquet suffix.

The following screenshot shows an S3 bucket and sub-folder with Parquet files (generated and stored by a third-party application).

Parquet data load/save to CAS

With valid AWS access keys in place on the CAS controller and node servers, the following code can be used to load a CAS table from, and save a CAS table to, S3 Parquet files. The code loads a CAS table from the sub-folder containing the set of Parquet data files, then saves the CAS table back to S3 as Parquet files.

When new or incremental Parquet files are placed in the same S3 sub-folder, the CAS table must be reloaded. CAS does not automatically load incremental (delta) data files.
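
As a minimal sketch of such a reload, assuming an active CAS session and the AWSCAS1 caslib and baseball.parquet sub-folder used in the code in this post, rerunning the load with the replace option rebuilds the in-memory table from the Parquet files currently in the sub-folder:

```sas
/* Sketch only: assumes the AWSCAS1 caslib from this post is already defined */
/* and that new part files have been added under baseball.parquet in S3.     */
proc casutil incaslib="AWSCAS1" outcaslib="AWSCAS1";
   /* replace overwrites the existing in-memory BASEBALL table */
   load casdata="baseball.parquet" casout="baseball" replace;
   /* confirm the table is present after the reload */
   list tables;
quit;
```

This is a full reload rather than an append, which matches the behavior described above: every refresh reads the whole sub-folder again.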

Code:

/* Start a CAS session */
CAS mySession SESSOPTS=(CASLIB=casuser TIMEOUT=99 LOCALE="en_US" metrics=true);

/* Define an S3 caslib that points at the bucket folder holding the Parquet data */
caslib AWSCAS1 datasource=(srctype="s3",
                   awsConfigPath="/opt/sas/viya/config/data/AWSData/config",
                   awsCredentialsPath="/opt/sas/viya/config/data/AWSData/credentials",
                   awsCredentialsProfile="default",
                   region="US_East",
                   bucket="sas-viyadeploymentworkshop",
                   objectpath="/gel/LargeData/"
               );

/* Load the baseball.parquet sub-folder (a set of Parquet data files) into CAS */
proc casutil incaslib="AWSCAS1" outcaslib="AWSCAS1";
   load casdata="baseball.parquet" casout="baseball" replace;
   list tables;
run;
quit;

/* Save the CAS table back to S3 as Parquet */
proc casutil incaslib="AWSCAS1" outcaslib="AWSCAS1";
   save casdata="baseball" casout="baseball_new.parquet" replace;
run;
quit;

/* End the CAS session */
CAS mySession TERMINATE;

Log extract:

…………
……..
83   
84   caslib AWSCAS1 datasource=(srctype="s3",
85                      awsConfigPath="/opt/sas/viya/config/data/AWSData/config",
86                      awsCredentialsPath="/opt/sas/viya/config/data/AWSData/credentials",
87                      awsCredentialsProfile="default",
88                      region="US_East",
89                      bucket="sas-viyadeploymentworkshop",
90                      objectpath="/gel/LargeData/"
91                  );
NOTE: Executing action 'table.addCaslib'.
NOTE: 'AWSCAS1' is now the active caslib.
NOTE: Cloud Analytic Services added the caslib 'AWSCAS1'.

93   
94   proc casutil incaslib="AWSCAS1"  outcaslib="AWSCAS1";
NOTE: The UUID '80594f6b-53d1-ee4a-9f74-a7aab23423de' is connected using session MYSESSION.
95      
95 !     load casdata="baseball.parquet" casout="baseball" replace ;
NOTE: Executing action 'table.loadTable'.
NOTE: Cloud Analytic Services made the file baseball.parquet available as table BASEBALL in caslib AWSCAS1.
NOTE: Action 'table.loadTable' used (Total process time):
….
………….

83   proc casutil  incaslib="AWSCAS1"  outcaslib="AWSCAS1";
NOTE: The UUID '650d9df6-96c9-6640-8ca4-548477542db4' is connected using session MYSESSION.
84      save casdata="baseball" casout="baseball_new.parquet" replace ;
NOTE: Executing action 'table.save'.
NOTE: Cloud Analytic Services saved the file baseball_new.parquet in caslib AWSCAS1.


…..
……………..

Result extract:

S3 Bucket and folder with CAS table saved as Parquet file:
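
The saved output can also be checked from SAS instead of the S3 console. The following is a minimal sketch, assuming the CAS session is still active (that is, run before the TERMINATE statement) and the AWSCAS1 caslib from this post is defined:

```sas
/* Sketch only: lists the files visible through the AWSCAS1 caslib,        */
/* which should include baseball_new.parquet after the save step has run.  */
proc casutil incaslib="AWSCAS1";
   list files;
quit;
```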

Parquet file CAS load/save performance:

The data load/save performance between CAS and S3 depends on the CAS hardware resources and the network speed between the CAS servers and the AWS S3 service. The following test results are from a RACE CAS environment with standard network speed to the AWS S3 service.

Test environment details:
File upload speed from the RACE server to the S3 bucket = ~381.584 Mbps.
RACE CAS servers = 1 controller + 4 nodes, with 32 GB memory and 4 CPUs on each node.

Run time for saving a CAS table to the S3 bucket as Parquet files:

Run time for loading S3 Parquet data files into CAS:

Summary:

To load a CAS table from S3 Parquet data, the file and sub-folder names must have the .parquet extension.
Data loads from S3 Parquet files to CAS run in parallel.
Parquet files are about three times smaller than the corresponding CAS table.
The in-memory CAS table data stays in SASHDAT format even though it is loaded from Parquet files.
CAS can load Parquet files generated by third-party applications.

Note: CAS writes date/datetime columns as double (numeric) values, so third-party applications may not be able to interpret them properly.
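
One way to see which columns this note applies to is to inspect the table's column metadata before saving. The sketch below assumes an active CAS session and uses the baseball table from this post only as an illustration (it may not contain date columns); columns that carry a date or datetime format in the output are the ones to document for downstream consumers of the Parquet file:

```sas
/* Sketch only: prints column names, types, and formats for the in-memory table. */
/* Columns with date/datetime formats are stored as doubles when saved.          */
proc casutil incaslib="AWSCAS1";
   contents casdata="baseball";
quit;
```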

Related Article:
SAS Viya 3.5 Parquet file support – Quicker loads and smaller files