With the SAS Viya 3.5 release, CAS can read and write .parquet files in an Amazon S3 bucket without any intermediate AWS service (AWS EMR, Athena, etc.). CAS can directly read parquet files that third-party applications (Apache Spark, Hive, etc.) have written to an S3 location. This post describes how to read and write S3 parquet files from CAS.
The S3-type CASLIB supports data access to S3 parquet files. A parquet data file name must have the .parquet suffix to load into CAS. When a set of parquet data files with the same structure, together forming one large dataset, is placed in a sub-folder, the sub-folder name must also have the .parquet suffix. Reads and writes between CAS and the S3 bucket run in parallel, with each CAS node reading and writing data to the S3 location.
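The naming rule above can be checked before attempting a load. The following is a minimal Python sketch, not part of SAS or CAS: the helper name `casdata_name` and the sample keys are hypothetical, and it assumes that keys are given relative to the caslib's object path and that the suffix match is case-sensitive.

```python
from typing import Optional

SUFFIX = ".parquet"  # suffix required by the naming rule described above


def casdata_name(key: str) -> Optional[str]:
    """Return the name to pass as casdata= for an object key, or None.

    `key` is assumed to be relative to the caslib object path and is either
    a single file ("baseball.parquet") or a part file inside a dataset
    sub-folder ("baseball.parquet/part-0001.parquet").
    """
    parts = [p for p in key.split("/") if p]
    # Only a file, or a sub-folder plus a file, is considered here.
    if not parts or len(parts) > 2:
        return None
    # Both the sub-folder (if any) and the file must carry the suffix.
    if not all(p.endswith(SUFFIX) for p in parts):
        return None
    return parts[0]


if __name__ == "__main__":
    for key in ("baseball.parquet",
                "baseball.parquet/part-0001.parquet",
                "baseball/part-0001.parquet"):
        print(key, "->", casdata_name(key))
```

A key whose sub-folder lacks the suffix, such as the last one above, yields `None`, which signals that the folder should be renamed before loading.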
The following diagram describes the data access from S3-Parquet files to CAS.
The following steps describe the CAS load/save from the S3-Parquet data file.
The following screenshot shows an S3 bucket and sub-folder with parquet files (generated and stored by a third-party application).
With valid AWS access keys placed on the CAS controller and node servers, the following code can be used to load a CAS table from, and save a CAS table to, S3 parquet files. The code loads a CAS table from the sub-folder containing the list of parquet data files, and then saves the CAS table to S3 as parquet files.
When new or incremental parquet files are placed in the same S3 sub-folder, the CAS table must be reloaded. CAS does not automatically load incremental (delta) data files.
/* Start a CAS session */
CAS mySession SESSOPTS=(CASLIB=casuser TIMEOUT=99 LOCALE="en_US" metrics=true);

/* Define an S3-type caslib that points at the bucket and object path */
caslib AWSCAS1 datasource=(srctype="s3",
   awsConfigPath="/opt/sas/viya/config/data/AWSData/config",
   awsCredentialsPath="/opt/sas/viya/config/data/AWSData/credentials",
   awsCredentialsProfile="default",
   region="US_East",
   bucket="sas-viyadeploymentworkshop",
   objectpath="/gel/LargeData/"
);

/* Load the parquet data from S3 into CAS and list the tables */
proc casutil incaslib="AWSCAS1" outcaslib="AWSCAS1";
   load casdata="baseball.parquet" casout="baseball" replace;
   list tables;
run;
quit;

/* Save the CAS table back to S3 as parquet */
proc casutil incaslib="AWSCAS1" outcaslib="AWSCAS1";
   save casdata="baseball" casout="baseball_new.parquet" replace;
run;
quit;

/* End the CAS session */
CAS mySession TERMINATE;
…………
……..
83
84 caslib AWSCAS1 datasource=(srctype="s3",
85 awsConfigPath="/opt/sas/viya/config/data/AWSData/config",
86 awsCredentialsPath="/opt/sas/viya/config/data/AWSData/credentials",
87 awsCredentialsProfile="default",
88 region="US_East",
89 bucket="sas-viyadeploymentworkshop",
90 objectpath="/gel/LargeData/"
91 );
NOTE: Executing action 'table.addCaslib'.
NOTE: 'AWSCAS1' is now the active caslib.
NOTE: Cloud Analytic Services added the caslib 'AWSCAS1'.
93
94 proc casutil incaslib="AWSCAS1" outcaslib="AWSCAS1";
NOTE: The UUID '80594f6b-53d1-ee4a-9f74-a7aab23423de' is connected using session MYSESSION.
95
95 ! load casdata="baseball.parquet" casout="baseball" replace ;
NOTE: Executing action 'table.loadTable'.
NOTE: Cloud Analytic Services made the file baseball.parquet available as table BASEBALL in caslib AWSCAS1.
NOTE: Action 'table.loadTable' used (Total process time):
….
………….
83 proc casutil incaslib="AWSCAS1" outcaslib="AWSCAS1";
NOTE: The UUID '650d9df6-96c9-6640-8ca4-548477542db4' is connected using session MYSESSION.
84 save casdata="baseball" casout="baseball_new.parquet" replace ;
NOTE: Executing action 'table.save'.
NOTE: Cloud Analytic Services saved the file baseball_new.parquet in caslib AWSCAS1.
…..
……………..
S3 Bucket and folder with CAS table saved as Parquet file:
The data load/save performance between CAS and S3 depends on the CAS hardware resources and the network speed between the CAS servers and the AWS S3 service. The following test results are from a RACE CAS environment with standard network speed to the AWS S3 service.
Test Environment details:
File upload speed from RACE server to S3 bucket: ~381.584 Mbps.
RACE CAS servers: 1+4 nodes, each node with 32 GB memory and 4 CPUs.
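As a rough baseline, the measured upload speed can be converted into a single-stream transfer-time estimate. This is a minimal Python sketch: the function names and the 1,000 MB example size are illustrative assumptions rather than part of the test above. Because CAS nodes transfer data in parallel, actual CAS run times may differ from this figure.

```python
def mbps_to_mb_per_s(mbps: float) -> float:
    """Convert megabits per second to megabytes per second (8 bits per byte)."""
    return mbps / 8.0


def transfer_seconds(size_mb: float, mbps: float) -> float:
    """Estimate the seconds needed to move size_mb megabytes at the given link speed."""
    return size_mb / mbps_to_mb_per_s(mbps)


if __name__ == "__main__":
    upload_mbps = 381.584      # measured upload speed from the test environment
    example_size_mb = 1000.0   # hypothetical file size, for illustration only
    print(f"{mbps_to_mb_per_s(upload_mbps):.3f} MB/s")
    print(f"{transfer_seconds(example_size_mb, upload_mbps):.1f} s for {example_size_mb:.0f} MB")
```

At ~381.584 Mbps the link moves about 47.7 MB per second, so the hypothetical 1,000 MB file would take roughly 21 seconds over a single stream.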
Run time for saving a CAS table to the S3 bucket as parquet files:
Run time for loading S3 parquet data files into CAS:
Note: CAS writes date/datetime columns as double (numeric) values, which third-party applications may not be able to interpret properly.
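A third-party application can recover calendar values from such columns if they hold standard SAS date and datetime values, which count days and seconds, respectively, from January 1, 1960. The following Python sketch shows that conversion. It assumes the doubles written by CAS follow this SAS convention, and the function names are hypothetical.

```python
from datetime import date, datetime, timedelta

# SAS date and datetime values are counted from this epoch.
SAS_EPOCH_DATE = date(1960, 1, 1)
SAS_EPOCH_DATETIME = datetime(1960, 1, 1)


def sas_date_to_date(days: float) -> date:
    """Convert a SAS date value (days since 1960-01-01) to a Python date."""
    return SAS_EPOCH_DATE + timedelta(days=int(days))


def sas_datetime_to_datetime(seconds: float) -> datetime:
    """Convert a SAS datetime value (seconds since 1960-01-01 00:00:00)."""
    return SAS_EPOCH_DATETIME + timedelta(seconds=seconds)


if __name__ == "__main__":
    print(sas_date_to_date(21915.0))          # a double read from a date column
    print(sas_datetime_to_datetime(86400.0))  # a double read from a datetime column
```

The same offset can be applied column-wise in other tools, as long as the reader knows which numeric columns were dates and which were datetimes. That information is not carried by the double values themselves.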
SAS Viya 3.5 Parquet file support – Quicker loads and smaller files