
SAS Viya and Parquet files – additional features


SAS Viya (CAS) users can read and write parquet data files to cloud storage (ADLS2, S3, and GCS) and path storage (DNFS, NFS, and local UNIX file systems). The SAS Viya Compute Server also supports parquet file read and write for two cloud storage platforms (GCS and S3) and path storage (DNFS, NFS, and local UNIX file systems), with various compression methods.
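For example, on the Compute Server a parquet file can be read directly with the Parquet LIBNAME engine. The following is a minimal sketch; the library path and table name (sales) are hypothetical and assume parquet files already exist at that location.

Code:

/* Assign a library using the Parquet engine (hypothetical path) */
libname pqt parquet "/gelcontent/data/parquet";

/* Read a parquet file (sales.parquet) into a SAS dataset */
data work.sales;
   set pqt.sales;
run;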

 

This post highlights the additional features for SAS Viya and Parquet files introduced in recent SAS Viya releases.

 

CAS Load from S3-Parquet folder with status (_SUCCESS) files

 

SAS Viya (CAS) can read and write parquet data files to an S3 bucket using an S3 CASLIB. If parquet files are generated in an S3 bucket by a third-party application like Spark or Databricks, you may notice a few additional process status files (such as _SUCCESS, _started, _committed) along with the actual data files. These status files used to be an issue for CAS when loading from a folder containing a number of parquet data files.

 

With the SAS Viya 2022.12 release, CAS can read an S3 parquet data folder that contains these additional status files. The CAS load process ignores the status files and considers only the actual data files.

 

The following screenshot shows an S3 bucket folder with parquet files and process status files generated by a Spark cluster.

 

uk_1_SASViya_ParquetFile_Additional_Features_1.png


 

The following code describes the CAS load from the above S3 bucket folder. If the folder name does not have a .parquet extension, you need to specify IMPORTOPTIONS=(FILETYPE="PARQUET") in the CAS load action.

 

Code:

 

/* Environment-specific parameters */
%let userid=utkuma;
%let s3bucket=&userid.dmviya4;
%let aws_config_file="/gelcontent/keys/awskeyconfig";
%let aws_credentials_file="/gelcontent/keys/credentials";
%let aws_profile="182696677754-testers";
%let aws_region="US_East";
%let objpath="/data/";

CAS mySession SESSOPTS=(CASLIB=casuser TIMEOUT=99 LOCALE="en_US" metrics=true);

/* CASLIB to S3 bucket data folder */
caslib AWSCAS1 datasource=(srctype="s3",
                           awsConfigPath=&aws_config_file,
                           awsCredentialsPath=&aws_credentials_file,
                           awsCredentialsProfile=&aws_profile,
                           region=&aws_region,
                           bucket=&s3bucket,
                           objectpath=&objpath
                           ) subdirs;

/* Load CAS from parquet data files (with _SUCCESS, _started status files) */
proc casutil incaslib="AWSCAS1" outcaslib="public";
   droptable casdata="BaseballSpark" incaslib="public" quiet;
   load casdata="PARQUET/BaseballSpark" casout="BaseballSpark"
        IMPORTOPTIONS=(FILETYPE="PARQUET") promote;
run;
quit;

/* Verify the loaded table */
proc casutil incaslib="public" outcaslib="public";
   list tables;
run;
quit;

CAS mySession TERMINATE;

 

Log extract:

 

....
..............
101  
102  /* Load CAS from Parquet data files ( with _SUCCESS _started files) */
103  proc casutil incaslib="AWSCAS1"  outcaslib="public";
NOTE: The UUID 'a5e9318c-4422-5243-9375-b3aa82c2f214' is connected using session MYSESSION.
105!     load casdata="PARQUET/BaseballSpark" casout="BaseballSpark" IMPORTOPTIONS=(FILETYPE="PARQUET")  promote ;
NOTE: Executing action 'table.loadTable'.
NOTE: Cloud Analytic Services made the file PARQUET/BaseballSpark available as table BASEBALLSPARK in caslib public.
NOTE: Action 'table.loadTable' used (Total process time):
NOTE:       real time               0.938480 seconds
NOTE:       cpu time                0.475109 seconds (50.63%)
NOTE:       total nodes             4 (32 cores)
NOTE:       total memory            251.04G
NOTE:       memory                  61.29M (0.02%)
NOTE:       bytes moved             438.31K
NOTE: The Cloud Analytic Services server processed the request in 0.93848 seconds.
106  run;
107  quit;
....
..............

 

Result Output:

 

uk_2_SASViya_ParquetFile_Additional_Features_2.png
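CAS can also write parquet files back to the S3 bucket. The following is a minimal sketch, assuming an active CAS session and the AWSCAS1 caslib from the code above; the output file name is a hypothetical example, and the .parquet extension indicates the output file format.

Code:

/* Save the CAS table back to the S3 caslib as a parquet file (hypothetical file name) */
proc casutil incaslib="public" outcaslib="AWSCAS1";
   save casdata="BaseballSpark" casout="PARQUET/BaseballSpark_copy.parquet" replace;
run;
quit;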

  

Parquet Engine supports additional data types

 

The SAS Viya Compute Server supports access to parquet files at GCS, path locations, and S3 buckets. Over the last few releases, additional data types are supported when reading parquet data files into SAS. More details are available in the documentation under Conversion between Parquet and SAS Data types; a quick way to check the type mapping is shown in the sketch after the lists below.

 

The following data types are supported with the 2022.12 release.

 

  • DATE
  • INTERVAL
  • TIME
  • TIMESTAMP

 

The following data types are supported with the 2023.01 release.

  • STRING
  • ENUM
  • UUID
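
To see how these parquet types surface in SAS, you can assign a Parquet LIBNAME library and run PROC CONTENTS against a table. This is a minimal sketch; the library path and table name (events) are hypothetical.

Code:

/* Assign a library using the Parquet engine (hypothetical path) */
libname pqt parquet "/gelcontent/data/parquet";

/* Inspect how parquet column types map to SAS data types and formats */
proc contents data=pqt.events;
run;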

   

Important Links:

 

About Parquet and ORC Engines

Conversion between Parquet and SAS Data types

Parquet and ORC LIBNAME statement and Options

Restriction for Parquet File Features 

 

Find more articles from SAS Global Enablement and Learning here.
