
SAS Viya: CAS access to Parquet file at ADLS2 Blob Storage


With the new SAS Viya release (2021.1 and later), CAS can read and write Parquet data files at Azure ADLS2 Blob storage, in addition to CSV and ORC data files. Parquet read and write is supported through an ADLS CASLIB with a parallel data transfer mechanism: the CAS worker PODs load the Parquet data files from ADLS2 Blob storage in parallel.

 

CAS-supported data file types at ADLS2 (Blob Storage)

  • ORC - serial method to load and save CAS data files at ADLS2
  • CSV - parallel method to load CAS, serial method to save CAS data to ADLS2
  • Parquet - parallel method to load and save CAS data files at ADLS2 (a save sketch covering all three formats follows this list)
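
For a path-based ADLS CASLIB, the file format written by a save is driven by the extension on the casout= file name. The following minimal sketch (reusing the ADLS2 CASLIB and the "cars" CAS table from the code section further below, so it assumes those already exist) saves the same table in each supported format:

proc casutil incaslib="ADLS2" outcaslib="ADLS2";
   /* Output file format follows the casout= extension */
   save casdata="cars" casout="cars.csv"     replace;  /* CSV: serial save, parallel load */
   save casdata="cars" casout="cars.orc"     replace;  /* ORC: serial save and load */
   save casdata="cars" casout="cars.parquet" replace;  /* Parquet: parallel save and load */
run;quit;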

 

Prerequisites

  • User access to the Azure Storage Account with the Storage Blob Data Contributor role
  • An Azure application with permission to access Azure Data Lake and Azure Storage
  • An Azure File Share mounted to the CAS PODs using a K8s PVC to hold the access key
  • The CAS system parameter cas.AZUREAUTHCACHELOC= set so that CAS reads the Azure access key from the File Share PVC

 

Data access path

The following picture describes the CAS load/save of Parquet data files stored at Azure ADLS2 Blob storage.

 

Figure: CAS load/save of Parquet data files at Azure ADLS2 Blob storage (ADLS_CASLIB_and_ParquetFile_1_1.png)

 

Loading CAS in parallel from ADLS2 Blob storage requires access to an Azure access key file from each CAS worker POD. You can mount a PVC to the CAS PODs and configure the CAS system parameter cas.AZUREAUTHCACHELOC= to share the Azure access key file amongst the CAS PODs.

 

The detailed steps for creating and mounting a PVC to the CAS PODs against an Azure File Share, and for updating the system parameter cas.AZUREAUTHCACHELOC=, are discussed in the article manage-azure-access-key-with-azureauthcacheloc-part-2.

 

The detailed steps for configuring the Azure user application and the Storage Account to access ADLS2 Blob storage are discussed in the article cas-accessing-azure-data-lake-files.

 

The following screenshot shows the location of the Azure access key file (a .json file) shared amongst the CAS PODs.

 

Screenshot: Azure access key (.json) file shared amongst the CAS PODs (ADLS_CASLIB_and_ParquetFile_3.png)

 

CAS save/load to/from Parquet data files at ADLS2 Blob storage

 

With the CAS PODs mounted to the Azure File Share and the AZUREAUTHCACHELOC= parameter pointing to it, you can use the following code to save and load Parquet data files at ADLS2 Blob storage. The ADLS CASLIB shares the Azure access key amongst the CAS PODs from this central location.

 

Code:

 

/* Site-specific values: storage account, ADLS2 file system (container),
   Azure tenant ID, and registered application ID */
%let MYSTRGACC="utkumaviya4adls2";
%let MYSTRGFS="fsdata";
%let MYTNTID="b1c14d5c-XXXXXXXX-xxxxxxxxxxxx" ;
%let MYAPPID="a2e7cfdc-XXXXXXXXXXXX-xxxxxxxxxxxx";

CAS mySession  SESSOPTS=(CASLIB=casuser TIMEOUT=99 LOCALE="en_US" metrics=true);

/* Path-based ADLS CASLIB against the ADLS2 Blob storage file system */
caslib ADLS2 datasource=(
      srctype="adls"
      accountname=&MYSTRGACC
      filesystem=&MYSTRGFS
      dnsSuffix=dfs.core.windows.net
      timeout=50000
      tenantid=&MYTNTID
      applicationId=&MYAPPID
   )
   path="/sample_data"
   subdirs;

proc casutil incaslib="ADLS2";
   list files ;
run;quit;

/* Save CAS Data to ADLS2 storage data file (parquet) */
proc casutil incaslib="ADLS2" outcaslib="ADLS2" ;
load data=sashelp.cars casout="cars" replace;
save casdata="cars" casout="cars.parquet"  replace;
run;quit;


/* CAS load from Parquet data files at ADLS2 storage */
proc casutil  incaslib="ADLS2"  outcaslib="ADLS2";
  load casdata="cars.parquet" casout="cars_parquet" replace ;
  list tables ;
run;quit;

cas mysession terminate;

 

Log extract:

 

.....
..............
78   /* Save CAS Data to ADLS2 storage data file (parquet) */
79   proc casutil incaslib="ADLS2" outcaslib="ADLS2" ;
NOTE: The UUID '2d06260a-99c8-5543-b9b4-478a03fd394f' is connected using session MYSESSION.
80   load data=sashelp.cars casout="cars" replace;
NOTE: The INCASLIB= option is ignored when using the DATA= option in the LOAD statement.
NOTE: Executing action 'table.addTable'.
NOTE: Action 'table.addTable' used (Total process time):
NOTE: SASHELP.CARS was successfully added to the "ADLS2" caslib as "CARS".
81   save casdata="cars" casout="cars.parquet"  replace;
NOTE: Executing action 'table.save'.
NOTE: Cloud Analytic Services saved the file cars.parquet in caslib ADLS2.
NOTE: Action 'table.save' used (Total process time):
......
.............
..................
78   /* CAS load from Parquet data files at ADLS2 storage */
79   proc casutil  incaslib="ADLS2"  outcaslib="ADLS2";
NOTE: The UUID '2d06260a-99c8-5543-b9b4-478a03fd394f' is connected using session MYSESSION.
81     load casdata="cars.parquet" casout="cars_parquet" replace ;
NOTE: Executing action 'table.loadTable'.
NOTE: Cloud Analytic Services made the external data from cars.parquet available as table CARS_PARQUET in caslib ADLS2.
NOTE: Action 'table.loadTable' used (Total process time):
NOTE: The Cloud Analytic Services server processed the request in 0.739576 seconds.
..........
.................

 

The following screenshot shows the Parquet data file saved to Azure ADLS2 Blob storage.
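
You can also confirm the save without the Azure portal by relisting the files visible to the CASLIB from SAS, using the same PROC CASUTIL pattern as earlier:

proc casutil incaslib="ADLS2";
   /* The saved cars.parquet file should appear in the file listing */
   list files ;
run;quit;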

 

 

 

CAS and Parquet data files at ADLS2 - facts and features

  • CAS can read a single .parquet file or n-number of .parquet files from a subfolder.
  • CAS can only read Parquet files and folders that carry the .parquet extension.
  • When CAS is loaded from an ADLS2 subfolder containing n-number of .parquet data files with a matching file structure, all of the data files become part of a single consolidated CAS table (see the sketch after this list).
  • CAS can read Parquet data files with a simple, flat structure from ADLS2. Columns with complex data types like "struct", "array", "JSON", etc. cannot be read into separate columns.
  • Before loading a complex Parquet data file (containing arrays, structs, maps, etc.) into CAS, read it into tabular format using a third-party application (Azure Synapse, Databricks, or an HDInsight cluster).
  • CAS tables saved to Parquet files at ADLS2 are uncompressed.
  • Parquet files created by CAS at ADLS2 can be read by third-party applications like Azure Synapse, Databricks, Spark, etc.
  • If a partitioned CAS table is saved to Parquet data files at ADLS2, the data files do not retain the CAS partitioning.
  • CAS creates Parquet data files at ADLS2 based on the number of CAS worker nodes and the size of the CAS table. It does not create partitioned data files based on data columns.
  • CAS does not support reading by data-column partition from a subfolder containing partitioned Parquet data files.
  • A global CAS table loaded from Parquet data files does not keep the Parquet format in memory; it is converted to SASHDAT format. This can mean a considerable size difference between the Parquet data files and the CAS table (e.g., ~330 MB of Parquet data files became a ~5.8 GB CAS table, roughly 16 times larger).
  • Loading only a subset of columns is not supported; e.g., if a Parquet data file has 20 columns, you cannot load just 5 of them into CAS.
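
The consolidated subfolder load mentioned above can be sketched as follows. The subfolder name cars_parts.parquet is hypothetical (note that it must carry the .parquet extension for CAS to read it) and is assumed to contain several .parquet data files with matching structure:

proc casutil incaslib="ADLS2" outcaslib="ADLS2";
   /* All .parquet files under the hypothetical cars_parts.parquet
      subfolder are loaded into one consolidated CAS table */
   load casdata="cars_parts.parquet" casout="cars_all" replace ;
run;quit;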

 

Important Links:

CAS accessing Azure Data Lake files

Manage Azure Access Key with AZUREAUTHCACHELOC= (Part 2)

 

cas.AZUREAUTHCACHELOC

AZUREAUTHCACHELOC System Option

Azure Data Lake Storage Data Source    

 

Find more articles from SAS Global Enablement and Learning here.
