Blobfuse is an open-source virtual filesystem driver for Azure Blob storage. It allows you to access blob data from an Azure Storage Account through the Linux filesystem.
With all the excitement around SAS and the Azure cloud, Blobfuse can be a useful tool for accessing SAS data sets stored in Azure Blob Storage, a cost-effective and reliable data storage service.
As a SAS user, you may use Azure Blob Storage to store files of all types, including .sas7bdat and .sashdat files. But there is no LIBNAME engine or CASLIB connector to directly read and write .sas7bdat and .sashdat files in Azure Blob Storage. With the SAS Viya 3.5 release, SAS SPRE supports the ORC LIBNAME engine for ORC data files in Blob Storage and the ADLS FILENAME statement for other file types. CAS supports ORC and CSV data file access in Azure Blob Storage using an ADLS CASLIB.
Blobfuse can therefore be a viable option for SAS users migrating SAS datasets (.sas7bdat files) to Azure Blob Storage. With Blobfuse, a SAS user can mount an Azure Blob Storage location as an additional filesystem on the Unix server hosting the SAS Compute Server or CAS servers. The Blobfuse mount enables SAS users to use a SAS LIBNAME statement or a PATH-based CASLIB to access the .sas7bdat and .sashdat data files.
The following diagram describes the data access path from Azure Blob Storage to the SAS Compute Server and CAS servers using Blobfuse.
Blobfuse is free to download. With internet access, you can use the following statements to install Blobfuse on the Unix server.
sudo rpm -Uvh https://packages.microsoft.com/config/rhel/7/packages-microsoft-prod.rpm
sudo yum install blobfuse
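If you want to confirm that the package installed successfully, a quick query with rpm (a standard package check, not specific to Blobfuse) can be used:
# verify the blobfuse package is installed and show its version
rpm -q blobfuse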
Blobfuse provides near-native performance by buffering and caching open files in a temporary path on the local filesystem. For best performance, use the most performant disk available, or a ramdisk, as the temporary path. In Azure, you can use the ephemeral disk (SSD) on a VM to provide a low-latency buffer for Blobfuse.
sudo mkdir /mnt/blobfusetmp -p
sudo chown utkuma:sasusers /mnt/blobfusetmp
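If you prefer a ramdisk for the temporary path, a minimal sketch using tmpfs follows; the mount point, size, and ownership below are illustrative assumptions, and the VM must have enough free RAM. Note that a tmpfs cache is lost on reboot, which is acceptable here because the Blobfuse cache is disposable.
sudo mkdir /mnt/ramdisk -p
# tmpfs size is an illustrative assumption; size it to fit your cached files
sudo mount -t tmpfs -o size=16g tmpfs /mnt/ramdisk
sudo mkdir /mnt/ramdisk/blobfusetmp
sudo chown utkuma:sasusers /mnt/ramdisk/blobfusetmp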
Blobfuse requires the Storage Account credentials, stored in a text file in the following format. Create the configuration file in your home directory, or at another safe location, with the Storage Account name, key, and container name.
tee ~/fuse_connection.cfg > /dev/null << "EOF"
accountName utkuma3adls2strg
accountKey 3R4oxwqyqrTqb4e4v7jsI2viFPkouln9qwNAXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
containerName fsutkuma3adls2strg
EOF
sudo chmod 600 ~/fuse_connection.cfg
Mounting an Azure Blob Storage container to the Unix server requires an empty folder. During the filesystem mount, you can use the "-o allow_other" option to enable access for other users.
sudo mkdir /opt/fscontainer
sudo chown utkuma:sasusers /opt/fscontainer
sudo blobfuse /opt/fscontainer --tmp-path=/mnt/blobfusetmp --config-file=/home/utkuma/fuse_connection.cfg -o allow_other -o attr_timeout=240 -o entry_timeout=240 -o negative_timeout=120
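When the filesystem is no longer needed, it can be detached like any other FUSE mount; a minimal sketch:
# standard FUSE unmount (sudo umount /opt/fscontainer also works)
sudo fusermount -u /opt/fscontainer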
Once the Blob storage location is mounted to the Unix server using Blobfuse, you can view the data files under the mounted folder.
[utkuma@intviya01 root]$ ls -l /opt/fscontainer
total 0
-rwxrwxrwx. 1 root root 11671314432 Jul 31 14:01 dm_fact_mega_corp_10g.sas7bdat
-rwxrwxrwx. 1 root root 1167204352 Aug 4 16:34 dm_fact_mega_corp_1g_1.sas7bdat
-rwxrwxrwx. 1 root root 1167204352 Jul 31 11:34 dm_fact_mega_corp_1g.sas7bdat
-rwxrwxrwx. 1 root root 1221188352 Aug 4 16:56 dm_fact_mega_corp_1G.sashdat
-rwxrwxrwx. 1 root root 2334334976 Aug 4 16:35 dm_fact_mega_corp_2g_1.sas7bdat
-rwxrwxrwx. 1 root root 2334334976 Jul 31 11:35 dm_fact_mega_corp_2g.sas7bdat
-rwxrwxrwx. 1 root root 2442365904 Aug 4 16:57 dm_fact_mega_corp_2G.sashdat
-rwxrwxrwx. 1 root root 5835661312 Aug 4 16:38 dm_fact_mega_corp_5g_1.sas7bdat
-rwxrwxrwx. 1 root root 5835661312 Jul 31 11:36 dm_fact_mega_corp_5g.sas7bdat
-rwxrwxrwx. 1 root root 6105880216 Aug 4 17:02 dm_fact_mega_corp_5G.sashdat
-rwxrwxrwx. 1 root root 949092352 Jul 30 16:12 dm_fact_mega_corp.sas7bdat
-rwxrwxrwx. 1 root root 131072 Jul 30 16:06 fish_sas.sas7bdat
drwxrwxrwx. 2 root root 4096 Dec 31 1969 sample_data
[utkuma@intviya01 root]$
The files in the above output are the same files located in the ADLS2 Blob storage.
When a Blob Storage location is mounted to the SAS Compute Server (Unix) using Blobfuse, users can point a SAS LIBNAME statement at the mounted folder to access .sas7bdat files.
libname azshrlib "/opt/fscontainer" ;
proc sql outobs=20;
   select * from azshrlib.fish_sas;
quit;
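Because the Blobfuse mount is read/write, the same library can also be used to write a .sas7bdat file back to Blob Storage, subject to the Blobfuse limitations discussed later in this post. A minimal sketch, where fish_sas_copy is an illustrative output table name:
/* copy an existing table into the Blobfuse-backed library,
   creating fish_sas_copy.sas7bdat in Blob Storage */
data azshrlib.fish_sas_copy;
   set azshrlib.fish_sas;
run;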
While accessing the data files, users will notice that Blobfuse buffers them at the temporary path specified in the Blobfuse mount statement. The buffered files are temporary and are deleted automatically.
[utkuma@intviya01 ~]$ ls -l /mnt/blobfusetmp/root/
total 128
-rwxrwxrwx. 1 root root 131072 Jul 30 16:06 fish_sas.sas7bdat
[utkuma@intviya01 ~]$
Similarly, a PATH-based CASLIB pointed at the Blobfuse mount folder enables CAS to load .sas7bdat and .sashdat data files.
CAS mySession SESSOPTS=( CASLIB=casuser TIMEOUT=99 LOCALE="en_US" metrics=true);
caslib azshcaslib datasource=(srctype="path") path="/opt/fscontainer" ;
proc casutil outcaslib="azshcaslib" incaslib="azshcaslib" ;
load casdata="dm_fact_mega_corp.sas7bdat" casout="dm_fact_mega_corp" replace;
load casdata="dm_fact_mega_corp_1G.sashdat" casout="dm_fact_mega_corp_H" replace;
list tables;
quit;
CAS mySession TERMINATE;
Again, while CAS accesses the data files, Blobfuse buffers them at the temporary path on the CAS server. The buffered files are temporary and are deleted automatically.
[root@intcas01 ~]# ls -l /mnt/blobfusetmp/root
total 2250436
-rwxrwxrwx. 1 root root 1221188352 Aug 4 16:56 dm_fact_mega_corp_1G.sashdat
-rwxrwxrwx. 1 root root 949092352 Jul 30 16:12 dm_fact_mega_corp.sas7bdat
[root@intcas01 ~]#
You may ask: why not use a DNFS-type CASLIB against the Blobfuse folder for parallel access to the files in Azure Blob Storage? The answer is yes and no. In a test environment, I noticed that a DNFS-type CASLIB using the Blobfuse folder does not work well.
A CAS load from a .sashdat file works in parallel, with Blobfuse caching the .sashdat file on each CAS node. However, a CAS table save to Blob Storage in .sashdat format creates a corrupted file that cannot be loaded back into CAS. A PATH-based CASLIB, which uses the serial method, saves data to Blob Storage in .sashdat format without issue.
A CAS load from .sas7bdat files does not work with a DNFS-type CASLIB using Blobfuse, though I noticed that if the Blob Storage is mounted to each CAS node, Blobfuse caches the data file on each node during the load. A CAS table save to Blob Storage as a .sas7bdat file is always serial: the CAS controller writes the data to Blob Storage, even if the CAS nodes have the Blob Storage mounted using Blobfuse.
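For example, to save a CAS table back to Blob Storage in .sashdat format through the serial PATH-based CASLIB, something like the following should work. This is a minimal sketch against the azshcaslib CASLIB defined earlier, assuming an active CAS session; the output file name is illustrative:
proc casutil incaslib="azshcaslib" outcaslib="azshcaslib";
   /* serial save of the in-memory table to the Blobfuse-backed path as .sashdat */
   save casdata="dm_fact_mega_corp" casout="dm_fact_mega_corp_copy.sashdat" replace;
quit;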
Notes:
The following shows the run time and data transfer speed while moving data in and out of CAS to Azure Blob Storage. Notice that the save to Blob Storage is slower compared to the CAS load.
CAS server (SMP) hosted on an Azure VM (instance Standard_D14_v2: 16 vCPUs, 112 GB RAM, max IOPS = 64x500).
Many thanks to Erwan Granger for his help and collaboration on this topic.
@UttamKumar thank you for sharing; an easy solution.
I am wondering how this blob storage will perform, in more detail, and whether there are any caveats when it is used as shared storage, i.e., for SAS Grid Manager or SAS Viya's CAS.
I have some questions, as I am curious about your experiences:
Does it provide the minimum performance needed of 100 MB/sec/core? Or, from another perspective, what is the maximum I/O throughput we can expect? Are there any file-locking issues?
Can it be used only for normal data files, or also for SASWORK/UTILLOC?
Thank you in advance,
Best regards,
Juan
@JuanS_OCS, to answer your questions, I suggest you give the documentation linked at the bottom of the post a good read. You will find some limitations there, such as "Blobfuse doesn't guarantee 100% POSIX compliance as it simply translates requests into Blob REST APIs."
Additional considerations can be found on GitHub: https://github.com/azure/azure-storage-fuse#considerations
After reading those, I would not recommend it as a shared filesystem for SAS Grid Manager.
Hi @EdoardoRiva, thank you very much. I read the document and arrived at the same conclusion, although it was just my personal assumption. What you mention confirms it.