SAS Viya (CAS) users can read and write Parquet data files on cloud storage (ADLS2, S3, and GCS) and on path-based storage (DNFS, NFS, and the local Unix file system). The SAS Viya Compute Server also supports reading and writing Parquet files on two cloud storage platforms (GCS and S3) and on path-based storage (DNFS, NFS, and the local Unix file system), with various compression options.
This post highlights the additional Parquet features supported in recent SAS Viya releases.
SAS Viya (CAS) can read and write Parquet data files in an S3 bucket using an S3 CASLIB. If a third-party application such as Spark or Databricks wrote the Parquet files to the S3 bucket, you may notice a few additional process status files (like _SUCCESS_, _started_, _committed_) alongside the actual data files. These status files used to cause problems when CAS loaded from a folder containing multiple Parquet data files.
With the Viya 2022.12 release, CAS can read an S3 Parquet data folder that contains these additional status files. The CAS load process ignores the status files and considers only the actual data files.
The following screenshot shows an S3 bucket folder with Parquet files and process status files generated by a Spark cluster.
The following code loads a CAS table from the S3 bucket folder shown above. If the folder name does not have a .parquet extension, you need to specify IMPORTOPTIONS=(FILETYPE="PARQUET") in the CAS load action.
%let userid=utkuma;
%let s3bucket=&userid.dmviya4 ;
%let aws_config_file="/gelcontent/keys/awskeyconfig" ;
%let aws_credentials_file="/gelcontent/keys/credentials" ;
%let aws_profile="182696677754-testers" ;
%let aws_region="US_East";
%let objpath="/data/";

CAS mySession SESSOPTS=( CASLIB=casuser TIMEOUT=99 LOCALE="en_US" metrics=true);

/* CASLIB to S3 bucket data folder */
caslib AWSCAS1 datasource=(srctype="s3",
      awsConfigPath=&aws_config_file,
      awsCredentialsPath=&aws_credentials_file ,
      awsCredentialsProfile=&aws_profile,
      region=&aws_region,
      bucket=&s3bucket,
      objectpath=&objpath
   ) subdirs ;

/* Load CAS from Parquet data files ( with _SUCCESS _started files) */
proc casutil incaslib="AWSCAS1" outcaslib="public";
   droptable casdata="BaseballSpark" incaslib="public" quiet;
   load casdata="PARQUET/BaseballSpark" casout="BaseballSpark"
      IMPORTOPTIONS=(FILETYPE="PARQUET") promote ;
run;
quit;

proc casutil incaslib="public" outcaslib="public";
   list tables ;
run;
quit;

CAS mySession TERMINATE;
.... ..............
101
102  /* Load CAS from Parquet data files ( with _SUCCESS _started files) */
103  proc casutil incaslib="AWSCAS1" outcaslib="public";
NOTE: The UUID 'a5e9318c-4422-5243-9375-b3aa82c2f214' is connected using session MYSESSION.
105!  load casdata="PARQUET/BaseballSpark" casout="BaseballSpark" IMPORTOPTIONS=(FILETYPE="PARQUET") promote ;
NOTE: Executing action 'table.loadTable'.
NOTE: Cloud Analytic Services made the file PARQUET/BaseballSpark available as table BASEBALLSPARK in caslib public.
NOTE: Action 'table.loadTable' used (Total process time):
NOTE: real time 0.938480 seconds
NOTE: cpu time 0.475109 seconds (50.63%)
NOTE: total nodes 4 (32 cores)
NOTE: total memory 251.04G
NOTE: memory 61.29M (0.02%)
NOTE: bytes moved 438.31K
NOTE: The Cloud Analytic Services server processed the request in 0.93848 seconds.
106  run;
107  quit;
.... ..............
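The log shows that the PROC CASUTIL load runs the table.loadTable action. For readers who prefer CASL, the same load can be sketched by calling that action directly. This is a minimal sketch, not the author's code: it assumes the CAS session and the AWSCAS1 caslib from the code above are still active (that is, it runs before the session is terminated), and the output table name BaseballSpark_casl is a hypothetical name chosen so the promoted table is left alone.

```sas
proc cas;
   /* Load the Parquet folder; fileType is needed because the  */
   /* folder name does not end in .parquet                     */
   table.loadTable /
      caslib="AWSCAS1"
      path="PARQUET/BaseballSpark"
      importOptions={fileType="parquet"}
      /* Hypothetical session-scope output table name */
      casOut={caslib="public", name="BaseballSpark_casl", replace=true};

   /* Confirm the table exists and check its row and column counts */
   table.tableInfo /
      caslib="public"
      name="BaseballSpark_casl";
quit;
```

Because the sketch omits promotion, the table stays at session scope and disappears when the session ends, which is convenient for a quick check.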
The SAS Viya Compute Server supports access to Parquet files in GCS, in path-based locations, and in S3 buckets. With the last few releases, additional data types are supported when reading Parquet data files into SAS. More details are available in the documentation under Conversion between Parquet and SAS Data types.
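To see how the Compute Server maps Parquet columns to SAS types, one option is to assign a library with the PARQUET engine and run PROC CONTENTS on a table in it. The following is a minimal sketch for a path-based location only. The directory /tmp/parquet_demo and the libref pqlib are hypothetical, and the directory is assumed to exist and be writable by the Compute Server session.

```sas
/* Hypothetical directory on path storage for the Parquet files */
%let pqpath=/tmp/parquet_demo;

/* Assign a library that uses the PARQUET engine */
libname pqlib parquet "&pqpath";

/* Write a small SAS table out as a Parquet file */
data pqlib.class;
   set sashelp.class;
run;

/* Read the Parquet file back and list the column types SAS reports */
proc contents data=pqlib.class;
run;

libname pqlib clear;
```

Pointing the same library at a folder of Parquet files written by another tool is a quick way to check which of their data types your Viya release can read.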
The following data types are supported with the 2022.12 release.
The following data types are supported with the 2023.01 release.
Find more articles from SAS Global Enablement and Learning here.