
Azure Blobfuse to access Blob Storage


Blobfuse is an open-source virtual filesystem driver backed by Azure Blob Storage. It allows you to access blob data in an Azure Storage Account through the Linux filesystem.

 

With all the excitement around SAS and the Azure cloud, Blobfuse can be a useful tool for accessing SAS data sets stored in Azure Blob Storage. Azure Blob Storage is a cost-effective and reliable service for storing data.

 

As a SAS user, you may use Azure Blob Storage to store all kinds of file types, including “.sas7bdat” and “.sashdat” files. However, there is no LIBNAME engine or CASLIB connector to directly read and write “.sas7bdat” and “.sashdat” files to Azure Blob Storage. With the SAS Viya 3.5 release, SAS SPRE supports the ORC LIBNAME engine for ORC data files in Blob Storage and the ADLS FILENAME statement for other file types. CAS supports ORC and CSV data file access in Azure Blob Storage using the ADLS CASLIB.

 

Blobfuse can be a viable option for SAS users migrating SAS datasets (.sas7bdat files) to Azure Blob Storage. With Blobfuse, a SAS user can mount the Azure Blob Storage location as an additional filesystem on the Unix server hosting the SAS Compute Server or CAS servers. The Blobfuse mount enables SAS users to use a SAS LIBNAME statement or a PATH-based CASLIB to access the .sas7bdat and .sashdat data files.

 

The following diagram describes the data access path from Azure Blob Storage to SAS Compute Server and CAS Servers using Blobfuse.

 

Blobfuse_and_SAS_1.png


 

Blobfuse_and_SAS_2.png

How to mount Blob Storage as a filesystem on a Unix server

  • Install the Blobfuse software.

     

    Blobfuse is an open-source project and is free to download. With internet access, you can use the following commands to install Blobfuse on the Unix server (this example targets RHEL/CentOS 7). A quick verification check follows the install commands.

     

    sudo rpm -Uvh https://packages.microsoft.com/config/rhel/7/packages-microsoft-prod.rpm
    
    sudo yum install blobfuse
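
    You can quickly confirm that the package installed before moving on. This is a minimal check, assuming an RPM-based system as in the install commands above.

    rpm -q blobfuse
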
    

     

  • Prepare the Unix OS for the Blobfuse mount.

     

    Blobfuse provides near-native performance by buffering and caching open files in a temporary path on the local filesystem. For best performance, use the most performant disk available, or a ramdisk, as the temporary path; a ramdisk sketch follows the commands below. In Azure, you may use the ephemeral disk (SSD) on the VM to provide a low-latency buffer for Blobfuse.

     

    sudo mkdir /mnt/blobfusetmp -p
    
    sudo chown utkuma:sasusers /mnt/blobfusetmp
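
    If you prefer a ramdisk for the temporary path, one option is a tmpfs mount on the same directory. This is a sketch; the 16 GB size is an assumption and should be adjusted to the memory available on your VM. Keep in mind that tmpfs contents are lost on reboot, so the Blob container remains the system of record.

    sudo mount -t tmpfs -o size=16g tmpfs /mnt/blobfusetmp
    sudo chown utkuma:sasusers /mnt/blobfusetmp
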
    

     

  • Configure Storage Account Credentials at Unix Server.

     

    Blobfuse reads the Storage Account credentials from a text file in the following format. Create the configuration file in your home directory, or another secure location, with the Storage Account name, account key, and container name. An environment-variable alternative is sketched after the chmod command.

     

    tee  ~/fuse_connection.cfg > /dev/null << "EOF"
    accountName utkuma3adls2strg
    accountKey 3R4oxwqyqrTqb4e4v7jsI2viFPkouln9qwNAXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
    containerName fsutkuma3adls2strg
    EOF
    
    sudo chmod 600 ~/fuse_connection.cfg
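
    As an alternative to the configuration file, Blobfuse can also pick up the account credentials from environment variables, as documented in the Blobfuse README; the key value below is a placeholder. The container name is then supplied on the blobfuse command line (check the --container-name option for your Blobfuse version).

    export AZURE_STORAGE_ACCOUNT=utkuma3adls2strg
    export AZURE_STORAGE_ACCESS_KEY=<storage-account-key>
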
    

     

  • Mount an empty directory to Azure Blob Storage.

     

    Mounting an Azure Blob Storage container to the Unix server requires an empty folder. During the filesystem mount, you can use the “-o allow_other” option to enable access for other users.

     

    sudo mkdir /opt/fscontainer 
    
    sudo chown utkuma:sasusers /opt/fscontainer
    
    sudo blobfuse /opt/fscontainer --tmp-path=/mnt/blobfusetmp  --config-file=/home/utkuma/fuse_connection.cfg -o allow_other -o attr_timeout=240 -o entry_timeout=240 -o negative_timeout=120
    

     

    Once the Blob Storage location is mounted to the Unix server using Blobfuse, you can view the data files under the mounted folder.

     

    [utkuma@intviya01 root]$ ls -l /opt/fscontainer
    total 0
    -rwxrwxrwx. 1 root root 11671314432 Jul 31 14:01 dm_fact_mega_corp_10g.sas7bdat
    -rwxrwxrwx. 1 root root  1167204352 Aug  4 16:34 dm_fact_mega_corp_1g_1.sas7bdat
    -rwxrwxrwx. 1 root root  1167204352 Jul 31 11:34 dm_fact_mega_corp_1g.sas7bdat
    -rwxrwxrwx. 1 root root  1221188352 Aug  4 16:56 dm_fact_mega_corp_1G.sashdat
    -rwxrwxrwx. 1 root root  2334334976 Aug  4 16:35 dm_fact_mega_corp_2g_1.sas7bdat
    -rwxrwxrwx. 1 root root  2334334976 Jul 31 11:35 dm_fact_mega_corp_2g.sas7bdat
    -rwxrwxrwx. 1 root root  2442365904 Aug  4 16:57 dm_fact_mega_corp_2G.sashdat
    -rwxrwxrwx. 1 root root  5835661312 Aug  4 16:38 dm_fact_mega_corp_5g_1.sas7bdat
    -rwxrwxrwx. 1 root root  5835661312 Jul 31 11:36 dm_fact_mega_corp_5g.sas7bdat
    -rwxrwxrwx. 1 root root  6105880216 Aug  4 17:02 dm_fact_mega_corp_5G.sashdat
    -rwxrwxrwx. 1 root root   949092352 Jul 30 16:12 dm_fact_mega_corp.sas7bdat
    -rwxrwxrwx. 1 root root      131072 Jul 30 16:06 fish_sas.sas7bdat
    drwxrwxrwx. 2 root root        4096 Dec 31  1969 sample_data
    [utkuma@intviya01 root]$
    

     

    The files in the above output are the same files located in the ADLS2 Blob Storage container.

     

    Blobfuse_and_SAS_3.png
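
    When the Blob container is no longer needed on the server, you can release the mount with the standard FUSE unmount command:

    sudo fusermount -u /opt/fscontainer
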

Azure Blob data file access from SAS and CAS

  • SAS LIBNAME to access Blob storage data.

     

    When a Blob Storage location is mounted on the SAS Compute Server (Unix) using Blobfuse, users can write a SAS LIBNAME statement against the mounted folder to access .sas7bdat files.

     

    libname azshrlib "/opt/fscontainer" ;
    
    proc sql outobs=20;
       select * from azshrlib.fish_sas;
    quit;
    

     

    While accessing the data files, you will notice that Blobfuse buffers them in the temporary location specified in the Blobfuse mount command. The buffered files are temporary and are deleted automatically.

     

    [utkuma@intviya01 ~]$ ls -l /mnt/blobfusetmp/root/
    total 128
    -rwxrwxrwx. 1 root root 131072 Jul 30 16:06 fish_sas.sas7bdat
    [utkuma@intviya01 ~]$
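
    As a quick write test (a sketch, assuming the mount was not made read-only and your user has write access to the container), you can copy a sample table to the mounted libref. Blobfuse uploads the file to the Blob container when SAS closes it.

    data azshrlib.fish_copy;   /* creates fish_copy.sas7bdat on the Blobfuse mount */
       set sashelp.fish;
    run;
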
    

     

  • PATH CASLIB to access Blob Storage data. When a Blob Storage location is mounted on the CAS controller server (Unix) using Blobfuse, users can use a PATH-based CASLIB to access .sas7bdat and .sashdat files.

     

    CAS mySession  SESSOPTS=( CASLIB=casuser TIMEOUT=99 LOCALE="en_US" metrics=true);
    
    caslib azshcaslib datasource=(srctype="path")  path="/opt/fscontainer" ;
    proc casutil outcaslib="azshcaslib" incaslib="azshcaslib" ;
    	load casdata="dm_fact_mega_corp.sas7bdat" casout="dm_fact_mega_corp" replace;
        load casdata="dm_fact_mega_corp_1G.sashdat" casout="dm_fact_mega_corp_H" replace;
        list tables;
    quit;
    
    CAS mySession  TERMINATE;
    

     

    While accessing the data files, you will notice that Blobfuse buffers them in the temporary location specified in the Blobfuse mount command. The buffered files are temporary and are deleted automatically.

     

    [root@intcas01 ~]# ls -l /mnt/blobfusetmp/root
    total 2250436
    -rwxrwxrwx. 1 root root 1221188352 Aug  4 16:56 dm_fact_mega_corp_1G.sashdat
    -rwxrwxrwx. 1 root root  949092352 Jul 30 16:12 dm_fact_mega_corp.sas7bdat
    [root@intcas01 ~]#
    

     

    You may ask: why not use a DNFS-type CASLIB against the Blobfuse folder for parallel access to the Azure Blob Storage files? The answer is yes and no. In a test environment, I noticed that a DNFS-type CASLIB using the Blobfuse folder does not work well.

     

    The CAS load from a .sashdat file works in parallel; Blobfuse caches the .sashdat file on each CAS node. However, a CAS table save to Blob Storage in .sashdat format creates a corrupted file that cannot be loaded back into CAS. Saving data to Blob Storage in .sashdat format through a PATH-based CASLIB (serial method) works well, as in the sketch below.
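
    For illustration, here is a minimal sketch of such a serial save through the PATH-based CASLIB defined earlier; it assumes an active CAS session, and the output file name is made up for the example.

    proc casutil incaslib="azshcaslib" outcaslib="azshcaslib";
       save casdata="dm_fact_mega_corp" casout="dm_fact_mega_corp_serial.sashdat" replace;
    quit;
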

     

    A CAS load from .sas7bdat files does not work with a DNFS-type CASLIB using Blobfuse, although I noticed that if Blob Storage is mounted to each CAS node with Blobfuse, the data file is cached on each node during the CAS load. A CAS table save to Blob Storage as a .sas7bdat file is always serial: the CAS controller writes the data to Blob Storage, even if the CAS worker nodes also have Blob Storage mounted using Blobfuse.

     

    Notes:
    • Mounting the same blob container from multiple CAS nodes is recommended only for read-only scenarios; a read-only mount sketch follows these notes.
    • While a Blob container is mounted, the data in the container should not be modified by any process other than Blobfuse. This includes other instances of Blobfuse running on this or other machines. Doing so could cause data loss or corruption. Mounting other containers is fine.
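
    For the read-only scenario across multiple CAS nodes, one option is to add the standard FUSE read-only option to the mount command used earlier; this is a sketch reusing the same paths and configuration file.

    sudo blobfuse /opt/fscontainer --tmp-path=/mnt/blobfusetmp --config-file=/home/utkuma/fuse_connection.cfg -o ro -o allow_other
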

    CAS load/save Performance using Blobfuse.

    The performance of a CAS load through Blobfuse depends on the locations of the CAS Unix server and the Azure Blob Storage account. For better data transfer between the CAS server and Blob Storage, keep them in the same Azure region; you can confirm the regions with the Azure CLI, as sketched below.
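
    If you are not sure where each resource runs, the Azure CLI can report the regions; the resource group and VM names below are placeholders.

    az storage account show -g <resource-group> -n utkuma3adls2strg --query location -o tsv
    az vm show -g <resource-group> -n <vm-name> --query location -o tsv
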

     

    The following shows the run time and data transfer speed while moving data between CAS and Azure Blob Storage. Notice that saving data to Blob Storage is slower compared to the CAS load.

     

    CAS server (SMP) hosted on an Azure VM (instance Standard_D14_v2, 16 vCPUs, 112 GB RAM, Max IOPS = 64x500).

     

    Blobfuse_and_SAS_4.png

     

    Notes:
    • For better performance, use the latest series of Azure VMs (e.g., E32s_v3 or E32ds_v4).
    • Since the I/O goes through the network, it is recommended to enable Accelerated Networking on the Azure VM (for supported instance types); a quick check is sketched below.
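
    A quick way to confirm the setting with the Azure CLI; the resource group and NIC names are placeholders.

    az network nic show -g <resource-group> -n <nic-name> --query enableAcceleratedNetworking -o tsv
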
    CAS server hosted on a SAS RACE machine (4 vCPUs, 32 GB RAM).

     

    Blobfuse_and_SAS_5.png

     

    Many thanks to Erwan Granger for help and collaboration on this topic.

    Resource

    How to Mount Azure Blob Storage Container to a Unix Server
Comments

@UttamKumar thank you for sharing, an easy solution.

 

I am wondering how this blob storage will work in terms of performance, in more detail, and whether there are any caveats when it is used as shared storage, i.e. for SAS Grid Manager or SAS Viya's CAS.

 

I have some questions, as I am wondering from your experiences:

 

Does it provide the minimum required performance of 100 MB/sec/core? Or, from another perspective, what maximum I/O throughput can we expect? Are there any issues with file locking?

 

Can it be used only for normal data files, or also for SASWORK/UTILLOC?


Thank you in advance,

Best regards,

 

Juan

@JuanS_OCS , to answer your questions, I suggest you give the documentation linked at the bottom of the post a good read. There you will find some limitations, such as "Blobfuse doesn't guarantee 100% POSIX compliance as it simply translates requests into Blob REST APIs."

Additional considerations can be found in GitHub: https://github.com/azure/azure-storage-fuse#considerations

After reading it, I would not recommend this as a shared filesystem for SAS Grid Manager.

Hi @EdoardoRiva , thank you very much. I read the document and arrived at the same conclusion, although it was just my personal assumption. What you mention confirms it.


