Amazon Simple Storage Service (S3) is an object storage platform with a simple web service interface to store and retrieve any amount of data. You can store data files in various formats in Amazon S3 storage; however, CAS and SAS have limited capability to read them. With SAS Viya 3.4, CAS can read and write only SASHDAT and CSV formatted data files in an S3 bucket, using an S3 type CASLIB. Base SAS can manage an S3 bucket using PROC S3, with operations such as creating a bucket or folder and uploading/downloading data files. You cannot point a FILENAME or LIBNAME statement at an S3 bucket to read and write data files.
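For reference, a minimal PROC S3 sketch of those bucket-management operations looks like the following. The key ID, secret, bucket name, and file paths are placeholders, and the option values should be verified against the PROC S3 documentation for your release.

proc s3 keyid="XXXXXXXXXXXXXXXXXXXX"
        secret="XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX"
        region=useast;
   create "/mybucket";                                                    /* create a bucket    */
   mkdir "/mybucket/data";                                                /* create a folder    */
   put "/tmp/prdsale.sas7bdat" "/mybucket/data/prdsale.sas7bdat";         /* upload a file      */
   get "/mybucket/data/prdsale.sas7bdat" "/tmp/prdsale_dl.sas7bdat";      /* download a file    */
run;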
Consider a situation where you want to store a SAS data set file (.sas7bdat) in an S3 bucket and access it from both a Base SAS and a CAS process. Currently, there is no direct method to access a .sas7bdat file stored in S3: you have to download the .sas7bdat files from the S3 bucket to the local servers before using them with SAS or CAS. What if you had an option to mount the S3 bucket as a remote filesystem and read and write the data files from both the CAS and SAS processes? Mounting an AWS S3 bucket as a filesystem means you can use all your existing tools to interact with the S3 bucket and perform read and write operations on files and folders.
This post is about mounting an S3 bucket as an NFS filesystem on the CAS and SAS (Viya client) servers. A Base SAS process can save a .sas7bdat data file to the NFS-mounted S3 bucket, and a CAS session can load it without physically copying it to the CAS server. Similarly, a CAS session can save a .sas7bdat data file to the NFS-mounted S3 bucket, and a Base SAS process can access it without downloading it to the SAS server.
There are a few software/technology options that enable you to mount an S3 bucket as an NFS filesystem on a UNIX/Windows server (the SAS and CAS servers). This post discusses the method using S3FS to mount an S3 bucket as a remote NFS filesystem; a subsequent article will cover the other two methods.
s3fs is a FUSE filesystem that allows you to mount an Amazon S3 bucket as a local filesystem. It stores files natively and transparently in S3 so that other programs can access the same files. The maximum size of objects that s3fs can handle depends on Amazon S3: up to 5 GB when using the single PUT API, and up to 5 TB when using the multipart upload API. s3fs is stable software and is used in a number of production environments, e.g., rsync backup to S3. S3FS allows Linux and macOS to mount an S3 bucket as an NFS filesystem via FUSE. It's compatible with AWS S3, Google Cloud Storage, and other S3-based object stores.
The following steps describe how to install and configure the s3fs software on the SAS and CAS servers and access the S3 bucket with data files.
s3fs is free software and is available as a pre-built package in the EPEL repository for RHEL and CentOS. The yum command can be used on the Unix servers (SAS and CAS) to install the software.
[root@intviya01 ~]# yum repolist
repo id repo name status
!epel/x86_64 Extra Packages for Enterprise Linux 7 - x86_64 12,914
[root@intviya01 ~]# yum install -y s3fs-fuse
…
….….
…………….
==================================================================================================================================
Package Arch Version Repository Size
==================================================================================================================================
Installing:
s3fs-fuse x86_64 1.84-3.el7 epel 240 k
Transaction Summary
==================================================================================================================================
Install 1 Package
…
…….
…………
Installed:
s3fs-fuse.x86_64 0:1.84-3.el7
Complete!
[root@intviya01 ~]#
To access an S3 bucket, you need an IAM (Identity and Access Management) user's ACCESS_KEY_ID and SECRET_ACCESS_KEY. The access key and secret key are used by s3fs to connect to the S3 bucket and mount it as an NFS drive. You can obtain the ACCESS_KEY_ID and SECRET_ACCESS_KEY using the standard IAM user security-credentials workflow in the AWS console.
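If you prefer the AWS CLI over the console, an access key pair for an existing IAM user can be generated as shown below; the user name is a placeholder.

$ aws iam create-access-key --user-name sas-s3fs-user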
Identify (or create) an S3 bucket where you can store (read/write) the data files using the access key.
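For example, a new bucket can be created from the AWS CLI; the bucket name below matches the one mounted later in this post, and the region is an assumption.

$ aws s3 mb s3://gelsas --region us-east-1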
The default location for the s3fs password file is ${HOME}/.passwd-s3fs; alternatively, you can keep the password at /etc/passwd-s3fs. The password file must have permission 600.
$ echo XXXXXXXXXDHPGXHFMDGA:XXXXXXXXXoK9mLS0BrB9roLV7oqRVkXXXXXXXXXX > ${HOME}/.passwd-s3fs
$ chmod 600 ${HOME}/.passwd-s3fs
$ mkdir /opt/sas/s3mnt
$ chown sas:sas /opt/sas/s3mnt
$ chmod 755 /opt/sas/s3mnt
$ echo user_allow_other >> /etc/fuse.conf
The following mount statement includes the password file, uid, gid, and umask options, which provide the ownership and permission metadata for S3 objects on the Unix machine. By default, without a specific uid and gid in the mount statement, the S3 files appear on Unix as owned by root. Note: In my test environment, uid=2000 and gid=2000 correspond to the user 'sas'.
To mount the file system:
$ s3fs gelsas /opt/sas/s3mnt -o passwd_file=${HOME}/.passwd-s3fs -o allow_other,uid=2000,gid=2000,umask=000
To mount an S3 bucket outside the AWS US region, additional parameters endpoint= and url= are required.
$ s3fs gelsas /opt/sas/s3mnt -o passwd_file=${HOME}/.passwd-s3fs -o endpoint=ap-southeast-2 -o url="https://s3-ap-southeast-2.amazonaws.com" -o allow_other,uid=2000,gid=2000,umask=000 -o multireq_max=5
The following statement can be used to unmount a file system.
$ fusermount -u /opt/sas/s3mnt
The following entry can be used in the /etc/fstab file to mount the S3 bucket during OS boot.
s3fs#gelsas /opt/sas/s3mnt fuse _netdev,allow_other,umask=000,uid=2000,gid=2000 0 0
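After adding the entry, the mount can be verified without a reboot (assuming the fstab line above):

$ mount -a
$ df -k /opt/sas/s3mnt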
Since the uid, gid, and umask options are used in the mount statement, the data files listed from the S3 bucket are owned by a specific user (sas), with read/write permission for others.
[root@intviya01 ~]# df -k /opt/sas/s3mnt
Filesystem 1K-blocks Used Available Use% Mounted on
s3fs 274877906944 0 274877906944 0% /opt/sas/s3mnt
[root@intviya01 ~]# ls -l /opt/sas/s3mnt
total 1153
-rwxrwxrwx. 1 sas sas 917504 Feb 12 11:50 customers.sas7bdat
-rwxrwxrwx. 1 sas sas 262144 Feb 12 11:49 prdsale.sas7bdat
[root@intviya01 ~]#
A process runs at the OS level to support access to the S3 bucket and keep its objects available.
[root@intviya01 sas]# ps -eaf | grep s3fs
root 22565 1 0 11:53 ? 00:00:00 s3fs gelsas /opt/sas/s3mnt -o passwd_file=/root/.passwd-s3fs -o allow_other
root 29258 6826 0 11:56 pts/1 00:00:00 grep --color=auto s3fs
[root@intviya01 sas]#
Once you have access to the S3 bucket via a filesystem, you can copy a data file to the S3 bucket using standard Unix commands. The copied file is stored in the S3 bucket and is accessible from the AWS console and CLI.
[root@intviya01 s3mnt]# pwd
/opt/sas/s3mnt
[root@intviya01 s3mnt]# ls -l
total 1158
-r-xr-x---. 1 sas sas 917504 Feb 12 11:50 customers.sas7bdat
-r-xr-x---. 1 sas sas 262144 Feb 12 11:49 prdsale.sas7bdat
[root@intviya01 s3mnt]# cp /gelcontent/demo/DM/data/order_fact.sas7bdat .
[root@intviya01 s3mnt]# ls -l
total 146627
-r-xr-x---. 1 sas sas 917504 Feb 12 11:50 customers.sas7bdat
-r-xr-x---. 1 sas sas 148964352 Feb 12 15:34 order_fact.sas7bdat
-r-xr-x---. 1 sas sas 262144 Feb 12 11:49 prdsale.sas7bdat
[root@intviya01 s3mnt]#
After mounting the S3 bucket on the SAS compute server, the data files can be accessed from Base SAS using LIBNAME and FILENAME statements. The following example describes S3 bucket data file access from Base SAS using a LIBNAME statement.
libname mylib "/opt/sas/s3mnt" ;
proc print data=mylib.customers ;
run;
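Because the mount is writable, Base SAS can also save a .sas7bdat file to the S3 bucket through the same library. A minimal sketch (the output table name is arbitrary):

libname mylib "/opt/sas/s3mnt" ;
/* write a new SAS data set to the NFS-mounted S3 bucket */
data mylib.customers_copy ;
set mylib.customers ;
run;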
The following example describes S3 bucket data file access from Base SAS using a FILENAME statement.
filename mydata "/opt/sas/s3mnt/dept.txt";
DATA _NULL_;
INFILE mydata ;
INPUT;
LIST;
RUN;
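Similarly, a FILENAME reference to the mounted path can be used to write a text file back to the S3 bucket. A minimal sketch (the file name and records are arbitrary):

filename myout "/opt/sas/s3mnt/dept_new.txt";
DATA _NULL_;
FILE myout;
PUT "1001,Sales";
PUT "1002,Marketing";
RUN;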
After mounting the S3 bucket on the CAS controller, the data files can be accessed using a path-based CASLIB. The following example describes loading data into CAS from an S3 bucket and saving data from CAS to an S3 bucket. The new objects/files saved in the S3 bucket can be accessed using the AWS console and CLI, and other SAS applications can also use the newly saved .sas7bdat file.
CAS mySession SESSOPTS=( CASLIB=casuser TIMEOUT=99 LOCALE="en_US");
caslib caslibs3 datasource=(srctype="path") path="/opt/sas/s3mnt" ;
/* load a S3 data file to CAS */
PROC CASUTIL incaslib="caslibs3" outcaslib="caslibs3";
droptable casdata="customers" quiet;
LOAD casdata="customers.sas7bdat" CASOUT="customers" copies=0
importoptions=(filetype="basesas", dtm="auto", debug="dmsglvli");
RUN;
quit;
/* Save a CAS table to S3 with .sashdat extension */
proc casutil incaslib="caslibs3" outcaslib="caslibs3";
save casdata="customers" casout= "customers_new" replace ;
run;
quit;
/* Save a CAS table to S3 with .sas7bdat extension */
proc casutil incaslib="caslibs3" outcaslib="caslibs3";
save casdata="customers" casout= "customers_new.sas7bdat" replace ;
run;
quit;
/* Save a CAS table to S3 with .csv extension */
proc casutil incaslib="caslibs3" outcaslib="caslibs3";
save casdata="customers" casout= "customers_new.csv" replace ;
run;
quit;
/* load a .sashdat file from S3 to CAS */
proc casutil incaslib="caslibs3" outcaslib="caslibs3";
droptable casdata="customers_new_hdat" quiet;
load casdata="customers_new.sashdat" casout="customers_new_hdat" ;
run;
quit;
/* load a .csv file from S3 to CAS */
proc casutil incaslib="caslibs3" outcaslib="caslibs3";
droptable casdata="customers_new_csv" quiet;
load casdata="customers_new.csv" casout="customers_new_csv" ;
run;
quit;
proc casutil;
list tables incaslib="caslibs3";
list files incaslib="caslibs3";
run;
/* Shutdown CAS Session */
CAS mySession TERMINATE;
Loading data into CAS from an S3 bucket using S3FS is a slow process. S3FS is not a hard link to the data; it is bounded by the network speed between the CAS server and the S3 bucket. S3 data access performance can be improved by using the S3FS cache on the local machine, which requires additional disk space. Parallel load from an S3FS-mounted file system is not supported, so the load times listed were measured using serial mode.
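For example, a local cache directory can be added to the s3fs mount with the use_cache option; the cache path below is a placeholder and must have enough free disk space for the files being read and written.

$ mkdir -p /opt/sas/s3cache
$ s3fs gelsas /opt/sas/s3mnt -o passwd_file=${HOME}/.passwd-s3fs -o allow_other,uid=2000,gid=2000,umask=000 -o use_cache=/opt/sas/s3cache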
The following data load tests were conducted using the .sas7bdat data file format.
The following data load tests were conducted using the .sashdat data file format.
Note: An S3 type CASLIB supports only the SASHDAT and CSV file formats, so the above tests were conducted using the .sashdat file format for a better comparison.
Note: SAS Viya 3.5 will have an improved mechanism to read and write data files at an S3 location. I ran the same data load test on Viya 3.5 (pre-release software), and an 8 GB .sashdat file took ~120 seconds to load into CAS using an S3 type CASLIB.
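For comparison, an S3 type CASLIB (the mechanism mentioned above) is defined directly against the bucket rather than a mounted path. A rough sketch follows; the data source option names and region value should be verified against the CASLIB documentation for your Viya release, and the keys and bucket are placeholders.

caslib s3dm datasource=(srctype="s3",
accessKeyId="XXXXXXXXXXXXXXXXXXXX",
secretAccessKey="XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX",
region="US_East",
bucket="gelsas",
objectPath="/") ;

/* load a .sashdat file from the S3 bucket into CAS */
proc casutil incaslib="s3dm" outcaslib="s3dm";
droptable casdata="customers_s3" quiet;
load casdata="customers_new.sashdat" casout="customers_s3";
run;
quit;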
Generally, S3 cannot offer the same performance or semantics as a local file system: random writes or appends require rewriting the entire object, metadata operations such as listing directories are slow, and there are no atomic renames of files or directories and no hard links.
Note: Don’t try to use the accessKeyId and secretAccessKey used in this post; they are expired.
Important link: S3FS-FUSE