BookmarkSubscribeRSS Feed

SAS Viya: CAS access to Parquet file at ADLS2 Blob Storage

Started ‎03-23-2021 by
Modified ‎08-23-2021 by
Views 7,246

With the new SAS Viya release (2021.1 and later), CAS can read and write Parquet data files to Azure ADLS2 Blob storage, in addition to CSV and, ORC data files. The data read and write for the parquet file is supported by ADLS CASLIB with the parallel data load mechanism. The CAS worker PODs are loading Parquet data file in parallel from ADLS2 Blob storage.

 

CAS supported data file type at ADLS2 (Blob Storage)

  • ORC - Serial method to load/save CAS data files to ADLS2
  • CSV - Parallel method to load CAS, serial method to save CAS data to ADLS2
  • Parquet - Parallel method to load and save CAS data files to ADLS2

 

Pre-requisites

  • User access to Azure Storage Account with Storage Blob Data Contributor role
  • Azure application with permission to access Azure Data lake and Azure Storage
  • Azure File Share mounted to CAS PODs using K8s PVC for the access key
  • CAS System parameter Azureauthcacheloc= set to use Azure Access key from File Share PVC

 

Data access path

The following picture describes the CAS load/save from parquet data files stored at Azure Blob Storage.

 

ADLS_CASLIB_and_ParquetFile_1_1.png

 Select any image to see a larger version.

Mobile users: To view the images, select the "Full" version at the bottom of the page.

 

To load CAS in parallel from ADLS2 Blob Storage, it requires access to an Azure Access Key file from each CAS worker's PODs. You can mount a PVC to CAS PODs and configure the CAS system parameter cas.AZUREAUTHCACHELOC= to share the Azure Access Key file amongst the CAS PODs.  

 

The detailed steps are discussed in the article manage-azure-access-key-with-azureauthcacheloc-part-2 , for creating and mounting a PVC to CAS PODs against Azure File Share and updating system parameter cas.AZUREAUTHCACHELOC= .  

 

The detailed steps are discussed in the article cas-accessing-azure-data-lake-files, for configuring Azure user application and Storage Account to access ALDS2 Blob storage.  

 

The following screenshot describes the location of the Azure Access Key file (.json file) shared amongst the CAS PODs.  

 

ADLS_CASLIB_and_ParquetFile_3.png

 

CAS save/load to/from Parquet data files at ADLS2 Blob storage

 

With the CAS PODs mounted to Azure File share and AZUREAUTHCACHELOC= parameter set to Azure File Share, you can use the following code to save and load Parquet data files to ADLS2 blob storage. The ADLS CASLIB will share the Azure Access key amongst the CAS PODs from the central location.  

 

Code:

 

%let MYSTRGACC="utkumaviya4adls2";
%let MYSTRGFS="fsdata";
%let MYTNTID="b1c14d5c-XXXXXXXX-xxxxxxxxxxxx" ;
%let MYAPPID="a2e7cfdc-XXXXXXXXXXXX-xxxxxxxxxxxx";

CAS mySession  SESSOPTS=(CASLIB=casuser TIMEOUT=99 LOCALE="en_US" metrics=true);

caslib ADLS2 datasource=(
      srctype="adls"
      accountname=&MYSTRGACC
      filesystem=&MYSTRGFS
      dnsSuffix=dfs.core.windows.net
      timeout=50000
      tenantid=&MYTNTID
      applicationId=&MYAPPID
   )
   path="/sample_data"
   subdirs;

proc casutil incaslib="ADLS2";
   list files ;
run;quit;

/* Save CAS Data to ADLS2 storage data file (parquet) */
proc casutil incaslib="ADLS2" outcaslib="ADLS2" ;
load data=sashelp.cars casout="cars" replace;
save casdata="cars" casout="cars.parquet"  replace;
run;quit;


/* CAS load from Parquet data files at ADLS2 storage */
proc casutil  incaslib="ADLS2"  outcaslib="ADLS2";
  load casdata="cars.parquet" casout="cars_parquet" replace ;
  list tables ;
run;quit;

cas mysession terminate;

 

Log extract :

 

.....
..............
78   /* Save CAS Data to ADLS2 storage data file (parquet) */
79   proc casutil incaslib="ADLS2" outcaslib="ADLS2" ;
NOTE: The UUID '2d06260a-99c8-5543-b9b4-478a03fd394f' is connected using session MYSESSION.
80   load data=sashelp.cars casout="cars" replace;
NOTE: The INCASLIB= option is ignored when using the DATA= option in the LOAD statement.
NOTE: Executing action 'table.addTable'.
NOTE: Action 'table.addTable' used (Total process time):
NOTE: SASHELP.CARS was successfully added to the "ADLS2" caslib as "CARS".
81   save casdata="cars" casout="cars.parquet"  replace;
NOTE: Executing action 'table.save'.
NOTE: Cloud Analytic Services saved the file cars.parquet in caslib ADLS2.
NOTE: Action 'table.save' used (Total process time):
......
.............
..................
78   /* CAS load from Parquet data files at ADLS2 storage */
79   proc casutil  incaslib="ADLS2"  outcaslib="ADLS2";
NOTE: The UUID '2d06260a-99c8-5543-b9b4-478a03fd394f' is connected using session MYSESSION.
81     load casdata="cars.parquet" casout="cars_parquet" replace ;
NOTE: Executing action 'table.loadTable'.
NOTE: Cloud Analytic Services made the external data from cars.parquet available as table CARS_PARQUET in caslib ADLS2.
NOTE: Action 'table.loadTable' used (Total process time):
NOTE: The Cloud Analytic Services server processed the request in 0.739576 seconds.
..........
.................

 

The following screenshot describes Parquet data file saved to Azure ADLS2 Blob storage.

 

 

 

CAS and Parquet data files at ASDLS2 - facts and features

  • CAS can read single .parquet files or n-number of .parquet files from a subfolder.
  • CAS can read parquet files and folders with .parquet extensions only.
  • When CAS is loaded from an ADLS2 sub-folder containing n-number of .parquet data files with matching file structure, all of the data files become part of a consolidated single CAS table.
  • CAS can read simple and plain structure parquet data files from ADLS2. If the data files contain complex data types like "struct", "array", "Json", etc. these columns cannot be read into separate multiple columns.
  • Before loading a complex (containing arrays, structs, map, etc.) parquet data file into CAS, read it using a third-party application (Azure Synapse or Databricks or HDInsight cluster) into tabular format.
  • CAS tables saved to parquet files at ADLS2 are uncompressed.
  • Parquet files created by CAS at ADLS2 can be read by third-party applications like Azure Synapse, Databricks, SPARK, etc.
  • If a partitioned CAS table is saved into parquet data files at ADLS2, the data files does not retain the CAS partition.
  • CAS creates parquet data files at ADLS2 based on number of CAS worker nodes and size of the CAS table. It does not create partitioned data files based on data columns.
  • CAS does not support data read by data column partition from a sub-folder containing partitioned parquet data file.
  • A global CAS table loaded from parquet data file does not retain the same format into in-memory table, it converts into SASHDAT format. This results into considerable data size difference between parquet data file and CAS table size (e.g. ~ 330 MB parquet data files = ~ 5.8 GB CAS table (~16 times).
  • If parquet data file structure has 20 columns and looking to load CAS from just 5 columns data, is not supported .

 

Important Links:

CAS accessing Azure Data Lake files

Manage Azure Access Key with AZUREAUTHCACHELOC= (Part -2)

 

cas.AZUREAUTHCACHELOC

AZUREAUTHCACHELOC System Option

Azure Data Lake Storage Data Source    

 

Find more articles from SAS Global Enablement and Learning here.

Version history
Last update:
‎08-23-2021 11:19 AM
Updated by:
Contributors

SAS Innovate 2025: Call for Content

Are you ready for the spotlight? We're accepting content ideas for SAS Innovate 2025 to be held May 6-9 in Orlando, FL. The call is open until September 25. Read more here about why you should contribute and what is in it for you!

Submit your idea!

Free course: Data Literacy Essentials

Data Literacy is for all, even absolute beginners. Jump on board with this free e-learning  and boost your career prospects.

Get Started

Article Tags