Amazon Simple Storage Service (S3) is an object storage platform with a simple web service interface to store and retrieve any amount of data. You can store data files in various formats in Amazon S3 storage; however, CAS and SAS have limited capability to read them. With SAS Viya 3.4, CAS can read and write only SASHDAT and CSV formatted data files in an S3 bucket, using an S3 type CASLIB. Base SAS can manage an S3 bucket using PROC S3, with operations such as creating a bucket or folder and uploading/downloading data files. You cannot point a FILENAME or LIBNAME statement at an S3 bucket to read and write data files.
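For reference, a minimal PROC S3 sketch of those bucket-management operations looks like the following. The key ID, secret, bucket name, and file paths are placeholders, and the option values should be verified against the PROC S3 documentation for your release.

proc s3 keyid="XXXXXXXXXXXXXXXXXXXX"
        secret="XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX"
        region=useast;
   create "/mybucket";                                                    /* create a bucket    */
   mkdir "/mybucket/data";                                                /* create a folder    */
   put "/tmp/prdsale.sas7bdat" "/mybucket/data/prdsale.sas7bdat";         /* upload a file      */
   get "/mybucket/data/prdsale.sas7bdat" "/tmp/prdsale_dl.sas7bdat";      /* download a file    */
run;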
Consider a situation where you want to store a SAS data set file (.sas7bdat) in an S3 bucket and access it from both a Base SAS and a CAS process. Currently, there is no direct method to access a .sas7bdat file stored in S3: you have to download the .sas7bdat files from the S3 bucket to the local servers before using them with SAS or CAS. What if you had an option to mount the S3 bucket as a remote filesystem and read and write the data files from both the CAS and SAS processes? Mounting an AWS S3 bucket as a filesystem means you can use all your existing tools to interact with the S3 bucket and perform read and write operations on files and folders.
This post is about mounting an S3 bucket as an NFS filesystem on the CAS and SAS (Viya client) servers. A Base SAS process can save a .sas7bdat data file to the NFS-mounted S3 bucket, and a CAS session can load it without physically copying it to the CAS server. Similarly, a CAS session can save a .sas7bdat data file to the NFS-mounted S3 bucket, and a Base SAS process can access it without downloading it to the SAS server.
There are a few software/technology options that enable you to mount an S3 bucket as an NFS filesystem on a UNIX/Windows server (the SAS and CAS servers). This post discusses the method using S3FS to mount an S3 bucket as a remote NFS filesystem; a subsequent article will cover the other two methods.
s3fs is a FUSE filesystem that allows you to mount an Amazon S3 bucket as a local filesystem. It stores files natively and transparently in S3 so that other programs can access the same files. The maximum size of objects that s3fs can handle depends on Amazon S3: up to 5 GB when using the single PUT API, and up to 5 TB when using the multipart upload API. s3fs is stable software and is used in a number of production environments, e.g., rsync backup to S3. S3FS allows Linux and macOS to mount an S3 bucket as an NFS filesystem via FUSE. It's compatible with AWS S3, Google Cloud Storage, and other S3-based object stores.
The following steps describe how to install and configure the s3fs software on the SAS and CAS servers and access the S3 bucket with data files.
s3fs is free software and is available as a pre-built package in the EPEL repository for RHEL and CentOS. The yum command can be used on the Unix servers (SAS and CAS) to install the software.
[root@intviya01 ~]# yum repolist
repo id repo name status
!epel/x86_64 Extra Packages for Enterprise Linux 7 - x86_64 12,914
[root@intviya01 ~]# yum install -y s3fs-fuse
…
….….
…………….
==================================================================================================================================
Package Arch Version Repository Size
==================================================================================================================================
Installing:
s3fs-fuse x86_64 1.84-3.el7 epel 240 k
Transaction Summary
==================================================================================================================================
Install 1 Package
…
…….
…………
Installed:
s3fs-fuse.x86_64 0:1.84-3.el7
Complete!
[root@intviya01 ~]#
To access an S3 bucket, you need an IAM (Identity and Access Management) user's ACCESS_KEY_ID and SECRET_ACCESS_KEY. The access key and secret key are used by s3fs to connect to the S3 bucket and mount it as an NFS drive. You can obtain the ACCESS_KEY_ID and SECRET_ACCESS_KEY using the standard IAM user security-credentials workflow in the AWS console.
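If you prefer the AWS CLI over the console, an access key pair for an existing IAM user can be generated as shown below; the user name is a placeholder.

$ aws iam create-access-key --user-name sas-s3fs-user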
Identify (or create) an S3 bucket where you can store (read/write) the data files using the access key.
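For example, a new bucket can be created from the AWS CLI; the bucket name below matches the one mounted later in this post, and the region is an assumption.

$ aws s3 mb s3://gelsas --region us-east-1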
The default location for the s3fs password file is ${HOME}/.passwd-s3fs; alternatively, you can keep the password at /etc/passwd-s3fs. The password file must have permission 600.
$ echo XXXXXXXXXDHPGXHFMDGA:XXXXXXXXXoK9mLS0BrB9roLV7oqRVkXXXXXXXXXX > ${HOME}/.passwd-s3fs
$ chmod 600 ${HOME}/.passwd-s3fs
$ mkdir /opt/sas/s3mnt
$ chown sas:sas /opt/sas/s3mnt
$ chmod 755 /opt/sas/s3mnt
$ echo user_allow_other >> /etc/fuse.conf
The following mount statement includes the password file, uid, gid, and umask options, which provide the ownership and permission metadata for S3 objects on the Unix machine. By default, without a specific uid and gid in the mount statement, the S3 files appear on Unix as owned by root. Note: In my test environment, uid=2000 and gid=2000 correspond to the user 'sas'.
To mount the file system:
$ s3fs gelsas /opt/sas/s3mnt -o passwd_file=${HOME}/.passwd-s3fs -o allow_other,uid=2000,gid=2000,umask=000
To mount an S3 bucket outside the AWS US region, additional parameters endpoint= and url= are required.
$ s3fs gelsas /opt/sas/s3mnt -o passwd_file=${HOME}/.passwd-s3fs -o endpoint=ap-southeast-2 -o url="https://s3-ap-southeast-2.amazonaws.com" -o allow_other,uid=2000,gid=2000,umask=000 -o multireq_max=5
The following statement can be used to unmount a file system.
$ fusermount -u /opt/sas/s3mnt
The following entry can be used in the /etc/fstab file to mount the S3 bucket during OS boot.
s3fs#gelsas /opt/sas/s3mnt fuse _netdev,allow_other,umask=000,uid=2000,gid=2000 0 0
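After adding the entry, the mount can be verified without a reboot (assuming the fstab line above):

$ mount -a
$ df -k /opt/sas/s3mnt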
Since the uid, gid, and umask options are used in the mount statement, the data files listed from the S3 bucket are owned by a specific user (sas), with read/write permission for others.
[root@intviya01 ~]# df -k /opt/sas/s3mnt
Filesystem 1K-blocks Used Available Use% Mounted on
s3fs 274877906944 0 274877906944 0% /opt/sas/s3mnt
[root@intviya01 ~]# ls -l /opt/sas/s3mnt
total 1153
-rwxrwxrwx. 1 sas sas 917504 Feb 12 11:50 customers.sas7bdat
-rwxrwxrwx. 1 sas sas 262144 Feb 12 11:49 prdsale.sas7bdat
[root@intviya01 ~]#
A process runs at the OS level to support access to the S3 bucket and keep its objects available.
[root@intviya01 sas]# ps -eaf | grep s3fs
root 22565 1 0 11:53 ? 00:00:00 s3fs gelsas /opt/sas/s3mnt -o passwd_file=/root/.passwd-s3fs -o allow_other
root 29258 6826 0 11:56 pts/1 00:00:00 grep --color=auto s3fs
[root@intviya01 sas]#
Once you have access to the S3 bucket via a filesystem, you can copy a data file to the S3 bucket using standard Unix commands. The copied file is stored in the S3 bucket and is accessible from the AWS console and CLI.
[root@intviya01 s3mnt]# pwd
/opt/sas/s3mnt
[root@intviya01 s3mnt]# ls -l
total 1158
-r-xr-x---. 1 sas sas 917504 Feb 12 11:50 customers.sas7bdat
-r-xr-x---. 1 sas sas 262144 Feb 12 11:49 prdsale.sas7bdat
[root@intviya01 s3mnt]# cp /gelcontent/demo/DM/data/order_fact.sas7bdat .
[root@intviya01 s3mnt]# ls -l
total 146627
-r-xr-x---. 1 sas sas 917504 Feb 12 11:50 customers.sas7bdat
-r-xr-x---. 1 sas sas 148964352 Feb 12 15:34 order_fact.sas7bdat
-r-xr-x---. 1 sas sas 262144 Feb 12 11:49 prdsale.sas7bdat
[root@intviya01 s3mnt]#
After mounting the S3 bucket on the SAS compute server, the data files can be accessed from Base SAS using LIBNAME and FILENAME statements. The following example describes S3 bucket data file access from Base SAS using a LIBNAME statement.
libname mylib "/opt/sas/s3mnt" ;
proc print data=mylib.customers ;
run;
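Because the mount is writable, Base SAS can also save a .sas7bdat file to the S3 bucket through the same library. A minimal sketch (the output table name is arbitrary):

libname mylib "/opt/sas/s3mnt" ;
/* write a new SAS data set to the NFS-mounted S3 bucket */
data mylib.customers_copy ;
set mylib.customers ;
run;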
The following example describes S3 bucket data file access from Base SAS using a FILENAME statement.
filename mydata "/opt/sas/s3mnt/dept.txt";
DATA _NULL_;
INFILE mydata ;
INPUT;
LIST;
RUN;
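Similarly, a FILENAME reference to the mounted path can be used to write a text file back to the S3 bucket. A minimal sketch (the file name and records are arbitrary):

filename myout "/opt/sas/s3mnt/dept_new.txt";
DATA _NULL_;
FILE myout;
PUT "1001,Sales";
PUT "1002,Marketing";
RUN;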
After mounting the S3 bucket on the CAS controller, the data files can be accessed using a path-based CASLIB. The following example describes loading data into CAS from an S3 bucket and saving data from CAS to an S3 bucket. The new objects/files saved in the S3 bucket can be accessed using the AWS console and CLI, and other SAS applications can also use the newly saved .sas7bdat file.
CAS mySession SESSOPTS=( CASLIB=casuser TIMEOUT=99 LOCALE="en_US");
caslib caslibs3 datasource=(srctype="path") path="/opt/sas/s3mnt" ;
/* load a S3 data file to CAS */
PROC CASUTIL incaslib="caslibs3" outcaslib="caslibs3";
droptable casdata="customers" quiet;
LOAD casdata="customers.sas7bdat" CASOUT="customers" copies=0
importoptions=(filetype="basesas", dtm="auto", debug="dmsglvli");
RUN;
quit;
/* Save a CAS table to S3 with .sashdat extension */
proc casutil incaslib="caslibs3" outcaslib="caslibs3";
save casdata="customers" casout= "customers_new" replace ;
run;
quit;
/* Save a CAS table to S3 with .sas7bdat extension */
proc casutil incaslib="caslibs3" outcaslib="caslibs3";
save casdata="customers" casout= "customers_new.sas7bdat" replace ;
run;
quit;
/* Save a CAS table to S3 with .csv extension */
proc casutil incaslib="caslibs3" outcaslib="caslibs3";
save casdata="customers" casout= "customers_new.csv" replace ;
run;
quit;
/* load a .sashdat file from S3 to CAS */
proc casutil incaslib="caslibs3" outcaslib="caslibs3";
droptable casdata="customers_new_hdat" quiet;
load casdata="customers_new.sashdat" casout="customers_new_hdat" ;
run;
quit;
/* load a .csv file from S3 to CAS */
proc casutil incaslib="caslibs3" outcaslib="caslibs3";
droptable casdata="customers_new_csv" quiet;
load casdata="customers_new.csv" casout="customers_new_csv" ;
run;
quit;
proc casutil;
list tables incaslib="caslibs3";
list files incaslib="caslibs3";
run;
/* Shutdown CAS Session */
CAS mySession TERMINATE;
Loading data into CAS from an S3 bucket using S3FS is a slow process. S3FS is not a hard link to the data; it is bounded by the network speed between the CAS server and the S3 bucket. S3 data access performance can be improved by using the S3FS cache on the local machine, which requires additional disk space. Parallel load from an S3FS-mounted file system is not supported, so the load times listed were measured using serial mode.
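For example, a local cache directory can be added to the s3fs mount with the use_cache option; the cache path below is a placeholder and must have enough free disk space for the files being read and written.

$ mkdir -p /opt/sas/s3cache
$ s3fs gelsas /opt/sas/s3mnt -o passwd_file=${HOME}/.passwd-s3fs -o allow_other,uid=2000,gid=2000,umask=000 -o use_cache=/opt/sas/s3cache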
The following data load tests were conducted using the .sas7bdat data file format.
The following data load tests were conducted using the .sashdat data file format.
Note: An S3 type CASLIB supports only the SASHDAT and CSV file formats, so the above tests were conducted using the .sashdat file format for a better comparison.
Note: SAS Viya 3.5 will have an improved mechanism to read and write data files at an S3 location. I ran the same data load test on Viya 3.5 (pre-release software), and an 8 GB .sashdat file took ~120 seconds to load into CAS using an S3 type CASLIB.
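For comparison, an S3 type CASLIB (the mechanism mentioned above) is defined directly against the bucket rather than a mounted path. A rough sketch follows; the data source option names and region value should be verified against the CASLIB documentation for your Viya release, and the keys and bucket are placeholders.

caslib s3dm datasource=(srctype="s3",
accessKeyId="XXXXXXXXXXXXXXXXXXXX",
secretAccessKey="XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX",
region="US_East",
bucket="gelsas",
objectPath="/") ;

/* load a .sashdat file from the S3 bucket into CAS */
proc casutil incaslib="s3dm" outcaslib="s3dm";
droptable casdata="customers_s3" quiet;
load casdata="customers_new.sashdat" casout="customers_s3";
run;
quit;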
Generally, S3 cannot offer the same performance or semantics as a local file system: random writes or appends require rewriting the entire object, metadata operations such as listing directories are slow, and there are no atomic renames of files or directories and no hard links.
Note: Don’t try to use the accessKeyId and secretAccessKey used in this post; they are expired.
Important link: S3FS-FUSE