
SAS Viya and Parquet Data Files


SAS Viya users can read and write Parquet data files in Azure ADLS2 and AWS S3 cloud storage. Starting with the SAS Viya 2021.1 release, users can also load CAS from data files that do not have the .parquet suffix, which was a limitation in previous releases. Specifying IMPORTOPTIONS=(FILETYPE="PARQUET") in the CAS load action enables reading data files and folders without the .parquet suffix.
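
For reference, IMPORTOPTIONS= maps to the importOptions parameter of the underlying table.loadTable action. The following is a minimal sketch of the same load through PROC CAS, not taken from the original article; the caslib name "mycaslib" and the path "baseball" are placeholder assumptions.

proc cas;
   /* CASL equivalent of the PROC CASUTIL load; names are placeholders */
   table.loadTable /
      caslib="mycaslib"                        /* source caslib             */
      path="baseball"                          /* file or folder, no suffix */
      importOptions={fileType="parquet"}       /* force the Parquet reader  */
      casOut={name="baseball", replace=true};  /* output CAS table          */
quit;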

 

Users can load CAS from a single parquet data file or from a folder containing any number of parquet data files. CAS reads and writes Parquet files to ADLS2 and S3 in parallel, meaning each CAS worker node loads data directly from the source parquet data files.

 

This article discusses loading CAS from non .parquet suffix data files located in ADLS2 and S3.

 

Prerequisites

ADLS2:

  • User access to an Azure Storage Account with the Storage Blob Data Contributor role
  • Azure Access Key available to the CAS pods via a PVC to an Azure File Share
  • An Azure application with permission to access Azure Data Lake and Azure Storage
  • CAS system parameter AZUREAUTHCACHELOC= set to use the Azure Access Key from the File Share PVC

S3:

  • User read/write access to the S3 bucket
  • AWS Access Key available to the CAS pods via an NFS mount or PVC

 

Data access path

The following picture illustrates SAS Viya (CAS) access to the parquet data files in ADLS2 and S3.

 

 

uk_Viya4_and_ParquetFile_1.png


 

 

uk_Viya4_and_ParquetFile_2.png

 

 

CAS load from non .parquet suffix data files at ADLS2

A parallel CAS load from ADLS2 Blob Storage requires that each CAS worker pod has access to an Azure Access Key file and that the user has access to the storage account.

 

You can mount a PVC to the CAS pods and configure the CAS system parameter cas.AZUREAUTHCACHELOC= to share the Azure Access Key file amongst the CAS pods. The detailed steps are discussed in the blog manage-azure-access-key-with-azureauthcacheloc-part-2.

 

The detailed steps for configuring the Azure user application and Storage Account to access ADLS2 Blob storage are discussed in the blog cas-accessing-azure-data-lake-files.

 

With the CAS pods mounted to the Azure File Share and the AZUREAUTHCACHELOC= parameter pointing to it, you can use the following code to load CAS from Parquet data files in ADLS2 blob storage. The ADLS CASLIB shares the Azure Access Key amongst the CAS pods from this central location.

 

The following picture shows the location of the non .parquet suffix data file in ADLS2 that is loaded into CAS using IMPORTOPTIONS=(FILETYPE="PARQUET") in the CAS load action.

 

 

uk_3_Viya4_and_ParquetFile_3.png

 

 

Code:

 

/* Storage account, file system (container), tenant ID, and application ID */
%let MYSTRGACC="utkumaviya4adls2";
%let MYSTRGFS="fsdata";
%let MYTNTID="b1c14d5c-XXXXXXXX-xxxxxxxxxxxx" ;
%let MYAPPID="a2e7cfdc-XXXXXXXXXXXX-xxxxxxxxxxxx";

CAS mySession  SESSOPTS=(CASLIB=casuser TIMEOUT=99 LOCALE="en_US" metrics=true);

/* CASLIB to ADLS2 storage */
caslib ADLS2 datasource=(
      srctype="adls"
      accountname=&MYSTRGACC
      filesystem=&MYSTRGFS
      dnsSuffix=dfs.core.windows.net
      timeout=50000
      tenantid=&MYTNTID
      applicationId=&MYAPPID
   )
   path="/"
   subdirs;


/* CAS load from non .parquet suffix data files at ADLS2 storage */
proc casutil  incaslib="ADLS2"  outcaslib="ADLS2";
  load casdata="baseball" casout="baseball_parquet" IMPORTOPTIONS=(FILETYPE="PARQUET")  replace ;
  list tables ;
run;quit;
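
/* (Hedged sketch, not part of the original article.) CAS can also     */
/* write Parquet back to ADLS2: SAVE writes the CAS table as a         */
/* .parquet file. The name "baseball_out.parquet" is a placeholder.    */
proc casutil incaslib="ADLS2" outcaslib="ADLS2";
  save casdata="baseball_parquet" casout="baseball_out.parquet" replace;
run;quit;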

cas mysession terminate;

 

Log extract :

 

.....
..............
105
106  /* CAS load from non .parquet suffix data files at ADLS2 storage */
NOTE: PROCEDURE CASUTIL used (Total process time):
      real time           0.25 seconds
      cpu time            0.04 seconds

107  proc casutil  incaslib="ADLS2"  outcaslib="ADLS2";
NOTE: The UUID 'b81c4007-3e24-c246-a68c-d29c7c160851' is connected using session MYSESSION.
108    load casdata="baseball" casout="baseball_parquet" IMPORTOPTIONS=(FILETYPE="PARQUET")  replace ;
NOTE: Executing action 'table.loadTable'.
NOTE: Cloud Analytic Services made the external data from baseball available as table BASEBALL_PARQUET in caslib ADLS2.
NOTE: Action 'table.loadTable' used (Total process time):
NOTE:       real time               0.467833 seconds
NOTE:       cpu time                0.259730 seconds (55.52%)
NOTE:       total nodes             4 (16 cores)
NOTE:       total memory            125.46G
NOTE:       memory                  37.01M (0.03%)
NOTE:       bytes moved             438.31K
NOTE: The Cloud Analytic Services server processed the request in 0.467833 seconds.
..........
.................

 

The following picture shows the non .parquet suffix data files loaded into CAS.

 

uk_Viya4_and_ParquetFile_4.png

 

 

CAS load from non .parquet suffix data files at S3

A parallel CAS load from an S3 bucket requires that each CAS worker pod has access to an AWS Access Key file.

 

You can mount a PVC or an NFS share to the CAS pods to share the AWS Access Key file amongst them. With the AWS Access Key available to the CAS pods via the NFS mount, you can use the following code to load CAS from Parquet data files in S3. The S3 CASLIB shares the AWS Access Key amongst the CAS pods from this central location.
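
The S3 CASLIB reads the standard AWS shared config and credentials files named by awsConfigPath= and awsCredentialsPath=. As a minimal sketch (not from the original article), the credentials file might look like the following; the profile name matches the code below and the key values are placeholders.

[182696677754-testers]
aws_access_key_id = AKIAXXXXXXXXXXXXXXXX
aws_secret_access_key = xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx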

 

The following picture shows the location of the non .parquet suffix data file in S3 that is loaded into CAS using IMPORTOPTIONS=(FILETYPE="PARQUET") in the CAS load action.

 

 

uk_Viya4_and_ParquetFile_5.png

 

Code:

 

%let userid=utkuma;
%let s3bucket=&userid.dmviya4 ;
%let aws_config_file="/mnt/viya-share/config/access-key/config"  ;
%let aws_credentials_file="/mnt/viya-share/config/access-key/credentials"  ;
%let aws_profile="182696677754-testers"  ;
%let aws_region="US_East";
%let objpath="/data/";

CAS mySession  SESSOPTS=( CASLIB=casuser TIMEOUT=99 LOCALE="en_US" metrics=true);

/* CASLIB to S3 bucket data folder */
caslib AWSCAS1 datasource=(srctype="s3",
                   awsConfigPath=&aws_config_file,
                   awsCredentialsPath=&aws_credentials_file ,
                   awsCredentialsProfile=&aws_profile,
                   region=&aws_region,
                   bucket=&s3bucket,
                   objectpath=&objpath
                   ) subdirs ;

/* Load CAS from non .parquet suffix data-folder */
proc casutil incaslib="AWSCAS1"  outcaslib="public";
    droptable casdata="baseball_prqt" incaslib="public"  quiet;
    load casdata="PARQUET/baseball_prqt" casout="baseball_prqt" IMPORTOPTIONS=(FILETYPE="PARQUET") promote ;
run;quit;


/* Load CAS from single non .parquet suffix data file*/
proc casutil incaslib="AWSCAS1"  outcaslib="public";
    droptable casdata="baseball_prqt_1" incaslib="public"  quiet;
    load casdata="PARQUET/baseball_prqt/baseball_prqt_1" casout="baseball_prqt_1" IMPORTOPTIONS=(FILETYPE="PARQUET") promote;
run;quit;


proc casutil incaslib="public"  outcaslib="public";
list tables ;
run;quit;
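
/* (Hedged sketch, not part of the original article.) LIST FILES shows   */
/* the source Parquet files in the S3 CASLIB, and SAVE writes a CAS      */
/* table back to S3 as Parquet; "baseball_out.parquet" is a placeholder. */
proc casutil incaslib="AWSCAS1" outcaslib="AWSCAS1";
    list files;
    save casdata="baseball_prqt" incaslib="public"
         casout="baseball_out.parquet" replace;
run;quit;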

CAS mySession  TERMINATE;

 

Log extract :

 

.....
..............
118
119  /* Load CAS from non .parquet suffix data-folder */
120  proc casutil incaslib="AWSCAS1"  outcaslib="public";
NOTE: The UUID 'e142c752-c755-034c-bd82-a09eaeaa526a' is connected using session MYSESSION.
121      droptable casdata="baseball_prqt" incaslib="public"  quiet;
NOTE: Executing action 'table.dropTable'.
NOTE: The Cloud Analytic Services server processed the request in 0.019752 seconds.
122
122!     load casdata="PARQUET/baseball_prqt" casout="baseball_prqt" IMPORTOPTIONS=(FILETYPE="PARQUET") promote ;
NOTE: Executing action 'table.loadTable'.
NOTE: Cloud Analytic Services made the file PARQUET/baseball_prqt available as table BASEBALL_PRQT in caslib public.
NOTE: The Cloud Analytic Services server processed the request in 0.494753 seconds.
123  run;
124  quit;
NOTE: PROCEDURE CASUTIL used (Total process time):
      real time           0.53 seconds
      cpu time            0.03 seconds

125
126
127  /* Load CAS from single non .parquet suffix data file*/
128  proc casutil incaslib="AWSCAS1"  outcaslib="public";
NOTE: The UUID 'e142c752-c755-034c-bd82-a09eaeaa526a' is connected using session MYSESSION.
129      droptable casdata="baseball_prqt_1" incaslib="public"  quiet;
NOTE: Executing action 'table.dropTable'.
NOTE: The Cloud Analytic Services server processed the request in 0.01951 seconds.
130
130!     load casdata="PARQUET/baseball_prqt/baseball_prqt_1" casout="baseball_prqt_1" IMPORTOPTIONS=(FILETYPE="PARQUET") promote ;
NOTE: Executing action 'table.loadTable'.
NOTE: Cloud Analytic Services made the file PARQUET/baseball_prqt/baseball_prqt_1 available as table BASEBALL_PRQT_1 in caslib
      public.
NOTE: Action 'table.loadTable' used (Total process time):
NOTE: PROCEDURE CASUTIL used (Total process time):
      real time           0.36 seconds
      cpu time            0.01 seconds
..........
.................

 

 

The following picture shows the non .parquet suffix data files loaded into CAS.

 

uk_Viya4_and_ParquetFile_6.png

 

Summary:

  • CAS can read parquet files and folders with or without the .parquet extension from ADLS2 and S3.
  • CAS can load from a sub-folder containing any number of .parquet data files with a matching file structure. All of the data files become part of a single consolidated CAS table.
  • CAS can read plain-structure parquet data files from ADLS2 and S3. If the data files contain complex data types like "struct", "array", "JSON", etc., these are not read into separate individual data element columns; see the structure check after this list.
  • Before loading a complex (containing arrays, structs, etc.) parquet data file into CAS, read it into a tabular format using a third-party application such as Synapse, Databricks, or Athena.
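
To confirm how CAS surfaced the columns from a loaded Parquet file, a quick structure check helps. The following is a minimal sketch, not from the original article, assuming the ADLS2 CASLIB and table from the earlier example are still available in an active session.

/* Hedged sketch: inspect the column structure CAS created from the */
/* Parquet file; complex types do not expand into separate columns. */
proc casutil incaslib="ADLS2";
   contents casdata="baseball_parquet";
run;quit;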

 

Important Links:

 

CAS accessing Azure Data Lake files

Manage Azure Access Key with AZUREAUTHCACHELOC= (Part -2)

CAS access to Parquet file at ADLS2 Blob Storage?  

 

Find more articles from SAS Global Enablement and Learning here.
