SAS Viya users can read and write Parquet data files in Azure ADLS2 and AWS S3 cloud storage. With the SAS Viya 2021.1 release, users can also load CAS from data files that lack a .parquet suffix, a limitation of previous releases. Specifying IMPORTOPTIONS=(FILETYPE="PARQUET") in the CAS load action enables reading data files and folders without the .parquet suffix.
Users can load CAS from a single Parquet data file or from a folder containing any number of Parquet data files. CAS reads and writes Parquet files to ADLS2 and S3 in parallel, meaning each CAS worker node loads data directly from the source Parquet data file.
This article discusses loading CAS from non .parquet suffix data files located in ADLS2 and S3.
ADLS2:
S3:
The following picture illustrates SAS Viya (CAS) access to Parquet data files in ADLS2 and S3.
A parallel CAS load from ADLS2 Blob storage requires access to an Azure access key file from each CAS worker pod, plus user access to the storage account.
You can mount a PVC to the CAS pods and configure the CAS system parameter cas.AZUREAUTHCACHELOC= to share the Azure access key file amongst the CAS pods. The detailed steps are discussed in the blog post manage-azure-access-key-with-azureauthcacheloc-part-2.
The detailed steps for configuring the Azure user application and storage account to access ADLS2 Blob storage are discussed in the blog post cas-accessing-azure-data-lake-files.
With the CAS pods mounted to an Azure file share and the AZUREAUTHCACHELOC= parameter pointing at that file share, you can use the following code to load CAS from Parquet data files in ADLS2 Blob storage. The ADLS CASLIB shares the Azure access key amongst the CAS pods from this central location.
The following picture shows the location of the non .parquet suffix data file in ADLS2, which is loaded into CAS using IMPORTOPTIONS=(FILETYPE="PARQUET") in the CAS load action.
Code:
%let MYSTRGACC="utkumaviya4adls2";
%let MYSTRGFS="fsdata";
%let MYTNTID="b1c14d5c-XXXXXXXX-xxxxxxxxxxxx" ;
%let MYAPPID="a2e7cfdc-XXXXXXXXXXXX-xxxxxxxxxxxx";
CAS mySession SESSOPTS=(CASLIB=casuser TIMEOUT=99 LOCALE="en_US" metrics=true);
caslib ADLS2 datasource=(
srctype="adls"
accountname=&MYSTRGACC
filesystem=&MYSTRGFS
dnsSuffix=dfs.core.windows.net
timeout=50000
tenantid=&MYTNTID
applicationId=&MYAPPID
)
path="/"
subdirs;
/* CAS load from non .parquet suffix data files at ADLS2 storage */
proc casutil incaslib="ADLS2" outcaslib="ADLS2";
load casdata="baseball" casout="baseball_parquet" IMPORTOPTIONS=(FILETYPE="PARQUET") replace ;
list tables ;
run;quit;
cas mysession terminate;
Log extract:
.....
..............
105
106 /* CAS load from non .parquet suffix data files at ADLS2 storage */
NOTE: PROCEDURE CASUTIL used (Total process time):
real time 0.25 seconds
cpu time 0.04 seconds
107 proc casutil incaslib="ADLS2" outcaslib="ADLS2";
NOTE: The UUID 'b81c4007-3e24-c246-a68c-d29c7c160851' is connected using session MYSESSION.
108 load casdata="baseball" casout="baseball_parquet" IMPORTOPTIONS=(FILETYPE="PARQUET") replace ;
NOTE: Executing action 'table.loadTable'.
NOTE: Cloud Analytic Services made the external data from baseball available as table BASEBALL_PARQUET in caslib ADLS2.
NOTE: Action 'table.loadTable' used (Total process time):
NOTE: real time 0.467833 seconds
NOTE: cpu time 0.259730 seconds (55.52%)
NOTE: total nodes 4 (16 cores)
NOTE: total memory 125.46G
NOTE: memory 37.01M (0.03%)
NOTE: bytes moved 438.31K
NOTE: The Cloud Analytic Services server processed the request in 0.467833 seconds.
..........
.................
The following picture shows the non .parquet suffix data files loaded into CAS.
Loading CAS in parallel from an S3 bucket requires access to the AWS access key file from each CAS worker pod.
You can mount a PVC or an NFS share to the CAS pods to share the AWS access key file amongst them. With the AWS access key available to the CAS pods via the NFS mount, you can use the following code to load CAS from Parquet data files in S3. The S3 CASLIB shares the AWS access key amongst the CAS pods from this central location.
The following picture shows the location of the non .parquet suffix data file in S3, which is loaded into CAS using IMPORTOPTIONS=(FILETYPE="PARQUET") in the CAS load action.
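For reference, the config and credentials files referenced by the CASLIB below follow the standard AWS CLI file format. The profile name matches the one used in this article's code; the key values are placeholders:

```ini
# /mnt/viya-share/config/access-key/credentials (standard AWS credentials format)
[182696677754-testers]
aws_access_key_id     = AKIAXXXXXXXXXXXXXXXX
aws_secret_access_key = xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
```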
Code:
%let userid=utkuma;
%let s3bucket=&userid.dmviya4 ;
%let aws_config_file="/mnt/viya-share/config/access-key/config" ;
%let aws_credentials_file="/mnt/viya-share/config/access-key/credentials" ;
%let aws_profile="182696677754-testers" ;
%let aws_region="US_East";
%let objpath="/data/";
CAS mySession SESSOPTS=( CASLIB=casuser TIMEOUT=99 LOCALE="en_US" metrics=true);
/* CASLIB to S3 bucket data folder */
caslib AWSCAS1 datasource=(srctype="s3",
awsConfigPath=&aws_config_file,
awsCredentialsPath=&aws_credentials_file ,
awsCredentialsProfile=&aws_profile,
region=&aws_region,
bucket=&s3bucket,
objectpath=&objpath
) subdirs ;
/* Load CAS from non .parquet suffix data-folder */
proc casutil incaslib="AWSCAS1" outcaslib="public";
droptable casdata="baseball_prqt" incaslib="public" quiet;
load casdata="PARQUET/baseball_prqt" casout="baseball_prqt" IMPORTOPTIONS=(FILETYPE="PARQUET") promote ;
run;quit;
/* Load CAS from single non .parquet suffix data file*/
proc casutil incaslib="AWSCAS1" outcaslib="public";
droptable casdata="baseball_prqt_1" incaslib="public" quiet;
load casdata="PARQUET/baseball_prqt/baseball_prqt_1" casout="baseball_prqt_1" IMPORTOPTIONS=(FILETYPE="PARQUET") promote;
run;quit;
proc casutil incaslib="public" outcaslib="public";
list tables ;
run;quit;
CAS mySession TERMINATE;
Log extract:
.....
..............
118
119 /* Load CAS from non .parquet suffix data-folder */
120 proc casutil incaslib="AWSCAS1" outcaslib="public";
NOTE: The UUID 'e142c752-c755-034c-bd82-a09eaeaa526a' is connected using session MYSESSION.
121 droptable casdata="baseball_prqt" incaslib="public" quiet;
NOTE: Executing action 'table.dropTable'.
NOTE: The Cloud Analytic Services server processed the request in 0.019752 seconds.
122
122! load casdata="PARQUET/baseball_prqt" casout="baseball_prqt" IMPORTOPTIONS=(FILETYPE="PARQUET") promote ;
NOTE: Executing action 'table.loadTable'.
NOTE: Cloud Analytic Services made the file PARQUET/baseball_prqt available as table BASEBALL_PRQT in caslib public.
NOTE: The Cloud Analytic Services server processed the request in 0.494753 seconds.
123 run;
124 quit;
NOTE: PROCEDURE CASUTIL used (Total process time):
real time 0.53 seconds
cpu time 0.03 seconds
125
126
127 /* Load CAS from single non .parquet suffix data file*/
128 proc casutil incaslib="AWSCAS1" outcaslib="public";
NOTE: The UUID 'e142c752-c755-034c-bd82-a09eaeaa526a' is connected using session MYSESSION.
129 droptable casdata="baseball_prqt_1" incaslib="public" quiet;
NOTE: Executing action 'table.dropTable'.
NOTE: The Cloud Analytic Services server processed the request in 0.01951 seconds.
130
130! load casdata="PARQUET/baseball_prqt/baseball_prqt_1" casout="baseball_prqt_1" IMPORTOPTIONS=(FILETYPE="PARQUET") promote ;
NOTE: Executing action 'table.loadTable'.
NOTE: Cloud Analytic Services made the file PARQUET/baseball_prqt/baseball_prqt_1 available as table BASEBALL_PRQT_1 in caslib
public.
NOTE: Action 'table.loadTable' used (Total process time):
NOTE: PROCEDURE CASUTIL used (Total process time):
real time 0.36 seconds
cpu time 0.01 seconds
..........
.................
The following picture shows the non .parquet suffix data files loaded into CAS.
Important Links:
CAS accessing Azure Data Lake files
Manage Azure Access Key with AZUREAUTHCACHELOC= (Part -2)
CAS access to Parquet file at ADLS2 Blob Storage?
Find more articles from SAS Global Enablement and Learning here.