In my previous post about Azure storage, you can find details on Azure disk storage options, i.e. storage dedicated to a single host. In this new installment, the focus is storage shared between multiple virtual machines. What options does Azure provide to SAS Architects?
In the Cloud world, the industry is moving toward object storage mechanisms, such as Azure Blob storage, which is the foundation for Azure Data Lake Storage Gen2 (ADLS2). Object storage is optimized for storing massive amounts of unstructured data, such as logs, images, video, and audio.
Moreover, when direct access is not available, you can leverage third-party projects, for example Azure Blobfuse.
This does not mean that traditional file storage does not exist in the cloud; file storage is here and will stay as a long-term requirement.
SAS, too, has not abandoned file storage; on the contrary, it remains one of the most widely used methods to read, write, and share data, probably rivaled only by access to relational databases.
To read more about object storage see Cloud Object Storage and SAS by Stephen Foerster. The rest of this post, instead, focuses on the options available on Azure to provide shared file systems so that multiple hosts can access common data, including SAS 9, SAS Grid Manager, and SAS Viya.
When moving from traditional on-prem environments to the cloud, you can be overwhelmed by the number of options for the simple task of sharing a file system between multiple hosts. As a SAS Architect, you will very often need such a shared file system in the infrastructure, and you should be able to articulate the different requirements for common usages in many SAS 9.x or SAS Viya scenarios: to host deployment artifacts, to support High Availability, as a prerequisite for backup tools, to satisfy the I/O requirements of SAS Grid Manager shared storage, as an RWX Persistent Volume for SAS Viya 4 in Kubernetes, and so on.
Here are some of the most common storage solutions used by SAS architects.
(Yes, the ones described in my previous post).
A first, simple option is to do just as you would on-premises: attach virtual disks to one VM, then use NFS (Linux) or CIFS (Windows) to export that storage to other machines in the environment.
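As a minimal sketch of this do-it-yourself approach on Linux, the steps below export a directory over NFS and mount it on a client. All names (the /sasdata path, the sasserver01 host, the 10.0.0.0/24 subnet) are hypothetical placeholders for your own environment.

```shell
# On the file-server VM: export the directory backed by the attached
# virtual disks to the clients' subnet, then enable the NFS server.
echo "/sasdata 10.0.0.0/24(rw,sync,no_root_squash)" | sudo tee -a /etc/exports
sudo exportfs -ra
sudo systemctl enable --now nfs-server

# On each client VM: add the mount to /etc/fstab and mount it.
echo "sasserver01:/sasdata /sasdata nfs rw,hard 0 0" | sudo tee -a /etc/fstab
sudo mkdir -p /sasdata && sudo mount /sasdata
```

Everything in this flow — exports, server daemon, client mounts — is yours to configure, patch, and monitor, which is exactly the maintenance burden the managed options below remove.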
With this, you are using Azure just to provide the infrastructure, keeping all the traditional software configuration and maintenance for yourself. It is really not a managed solution.
You still have some advantages compared to traditional, on-prem solutions, when you consider that the file server machine and the exported disks are often single points of failure of your environment. On the Azure cloud, a standalone virtual machine can have a guaranteed availability above 99.9%, and managed disks are provisioned from redundant storage. According to Azure documentation:
“Managed disks are designed for 99.999% availability. Managed disks achieve this by providing you with three replicas of your data, allowing for high durability.”
A downside of this approach is that you need one dedicated machine to be the file server, and you have to manage it yourself.
In simple environments (dev, test), you may simply delegate this role to one of the SAS hosts.
This is the “default” way of sharing file storage on Azure. Azure Files offers fully managed file shares in the cloud. Although Azure documentation lists multiple benefits, I think this is the most important:
“Azure file shares can be created without the need to manage hardware or an OS. This means you don't have to deal with patching the server OS with critical security upgrades or replacing faulty hard disks.”
It’s a step forward in being cloud-native, compared to the previous option.
Azure file shares use the Server Message Block (SMB) protocol, i.e. they behave as shares created from a Windows server. Although originally not a Linux native protocol, Linux hosts can use Azure file shares by mounting them with the CIFS kernel client.
When using this with SAS, you can encounter some limitations of the CIFS protocol: the most notable is that once you choose a user/group as the owner of the mounted share and a permissions mode, these properties are fixed and you cannot change them without unmounting and remounting.
As an example, suppose you create a share called utils in a storage account named mysasstorage123 (The name must be unique across all existing storage account names in Azure.) After entering the correct credentials in /etc/smb_credentials, you can mount it using this line in /etc/fstab:
//mysasstorage123.file.core.windows.net/utils /mnt/utils cifs rw,vers=3.0,credentials=/etc/smb_credentials,uid=AzureUser,gid=users,file_mode=0775,dir_mode=0775,serverino 0 0
This mounts the share with ownership assigned to AzureUser and the file_mode and dir_mode of your choice.
If you create any subdirectory or file there, it will all be owned by the same user and have the same permissions.
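You can verify this behavior directly on the mounted share; the snippet below assumes the mount options shown in the /etc/fstab example above, and the code subdirectory and program.sas file are hypothetical.

```shell
# No matter which OS user runs these commands, the resulting ownership
# and mode come from the uid/gid/file_mode/dir_mode mount options,
# not from the creating user's identity or umask.
mkdir -p /mnt/utils/code
touch /mnt/utils/code/program.sas
stat -c '%U:%G %a %n' /mnt/utils/code/program.sas
# Expected: AzureUser:users 775 /mnt/utils/code/program.sas
```

Attempting to chown or chmod files on the share has no effect; the only way to change these properties is to remount with different options.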
This can be a good fit to share data in some specific use cases:
On the other hand, it is not a good fit for home directories, because each user would see everybody else’s content – including secrets such as ssh keys – unless you mount a dedicated share per user.
Another huge limitation, intrinsic to the CIFS protocol, is that it requires communication on port 445; while this is usually not a problem between hosts running in your datacenter, or in the same virtual network in the cloud, that port is usually closed by administrators on firewalls and between different networks. This means that you probably will not be able to mount these shares on any server outside Azure, including on-prem hosts.
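Before troubleshooting a failed mount, it can save time to check port 445 reachability from the client; this quick test reuses the hypothetical storage account name from the example above.

```shell
# If TCP port 445 is blocked anywhere along the path, the CIFS mount
# will fail regardless of credentials or mount options.
nc -zvw3 mysasstorage123.file.core.windows.net 445 \
  && echo "port 445 reachable" \
  || echo "port 445 blocked - CIFS mount will not work"
```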
Although the name sounds similar to the previous option, this is a totally different kind of shared storage. You have to request onboarding to Azure NetApp Files, following the Register for Azure NetApp Files instructions. After your subscription has been authorized to use the service and you have registered the Azure Resource Provider for Azure NetApp Files, you can start creating storage artifacts: storage accounts, storage pools, volumes. Finally, you can mount those volumes to multiple hosts using native NFS for Linux or SMB for Windows.
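The provisioning flow can be sketched with the Azure CLI; all names below (sas-rg, sasanf, saspool, sasdata, sas-vnet, anf-subnet, the mount IP) are hypothetical placeholders, and your subscription must already be approved for the service.

```shell
# Register the resource provider (one-time, after onboarding approval).
az provider register --namespace Microsoft.NetApp

# NetApp account -> capacity pool -> volume.
az netappfiles account create --resource-group sas-rg \
    --account-name sasanf --location eastus

az netappfiles pool create --resource-group sas-rg --account-name sasanf \
    --pool-name saspool --size 4 --service-level Ultra   # pool size in TiB

az netappfiles volume create --resource-group sas-rg --account-name sasanf \
    --pool-name saspool --volume-name sasdata --service-level Ultra \
    --usage-threshold 4096 --file-path sasdata \
    --vnet sas-vnet --subnet anf-subnet --protocol-types NFSv3

# On each Linux host, mount the volume with native NFS; the target IP
# is shown in the volume's mount instructions in the Azure portal.
sudo mount -t nfs -o rw,hard,vers=3,rsize=65536,wsize=65536 \
    10.0.1.4:/sasdata /sasdata
```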
Since ANF uses native protocols, it supports multiple users and permissions on subdirectories. Although more expensive than regular Azure file shares, it is also more performant. You can provision from a minimum of 4 TiB up to a few petabytes.
Just as with Azure Files, a huge benefit is that ANF is fully managed: Azure handles hardware maintenance, updates, and critical issues for you.
It also comes with some specific considerations:
Azure file shares are provisioned from a storage account; each Azure subscription can have multiple storage accounts, and different types of storage accounts. The two types of storage accounts that can provide Azure file shares to SAS environments are:
Standard Storage Accounts have a fixed maximum bandwidth of 60 MiB/s, which can be upgraded to 300 MiB/s.
Premium Storage Account bandwidth is proportional to the allocated storage size; the maximum possible allocation is 100 TiB, which delivers 6200 MiB/s (read) and 4130 MiB/s (write), but the limits per single file are much lower: 300 MiB/s (read) and 200 MiB/s (write).
Due to these low I/O throughput limits, Azure file shares are not recommended to process SAS datasets across multiple machines.
Azure NetApp Files comes in three service levels: Standard, Premium, Ultra. Ultra is usually the best choice for SAS, both in terms of cost and performance. The throughput limit for a volume is determined by a combination of the quota assigned to the volume and the service level selected; the maximum empirical throughput that has been observed in testing is about 4500 MiB/s (read) and 2000 MiB/s (write).
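The quota-times-service-level relationship lends itself to a back-of-the-envelope sizing check. The per-TiB throughput factors below (Standard = 16, Premium = 64, Ultra = 128 MiB/s per TiB of quota) reflect Azure documentation at the time of writing; the 30 TiB quota is just an illustrative value.

```shell
# ANF volume throughput limit = quota (TiB) x service-level factor
# (MiB/s per TiB of quota). Ultra's factor is 128.
quota_tib=30
ultra_factor=128
limit=$((quota_tib * ultra_factor))
echo "${quota_tib} TiB Ultra volume -> up to ${limit} MiB/s"
```

Note that the volume's theoretical limit is only one constraint; the empirical ~4500 MiB/s read ceiling mentioned above, and the per-VM network caps discussed below, can bind first.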
SAS Grid Manager can run on Azure using Azure NetApp Files as a shared storage for small environments, given proper OS and infrastructure sizing and tuning. The storage can scale to accommodate up to 24 physical cores for the compute nodes (3 nodes with 8 cores each, or 6 with 4 cores each).
As a final consideration, it is important to understand an intrinsic performance limit of any sharing technology: they all transfer data through the Azure network and are thus subject to virtual machine networking limits. For this reason, SAS requires using accelerated networking. The network bandwidth allocated to each virtual machine is capped on egress (outbound) traffic, while ingress is not metered or limited directly. However, other factors, such as CPU and storage limits, can impact a virtual machine’s ability to process incoming data. In practice, this means that Azure virtual machines enforce a hard limit on the maximum write throughput towards any shared storage, while read bandwidth is virtually unlimited, up to the maximum that the storage can provide. As an example, E32s_v3 machines have a write limit of 2000 MiB/s.
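Azure VM size tables quote network bandwidth in megabits per second, so a rough divide-by-8 conversion connects the documented NIC cap to storage throughput; the 16,000 Mbps figure below is the documented expected bandwidth for an E32s_v3-class VM.

```shell
# Convert the per-VM network cap from megabits/s to megabytes/s.
nic_mbps=16000
nic_mbs=$((nic_mbps / 8))
echo "${nic_mbps} Mbps -> ${nic_mbs} MB/s"
```

This matches the ~2000 MiB/s write ceiling cited above (ignoring the small MB-vs-MiB difference): no shared storage, however fast, can receive writes from this VM faster than its egress cap allows.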
The following image shows how the performance limits discussed so far can affect the storage performance of a SAS Grid Manager environment:
Bottom line: the most constrained "pipe" at ANY point in the data path determines the best throughput your workload can achieve!
The options presented in this post are the most common, but not the only possible ones. There can be specific use cases where other shared disk technologies can be used:
“All templates / modules / resources in this repo are released for use "AS IS" without any warranties of any kind, including, but not limited to their installation, use, or performance.”
This post closes with the same consideration as the previous one: there are many resource and configuration choices available within Azure. To select the proper shared storage for your SAS environment, you may have to overprovision storage capacity in order to obtain the I/O throughput that SAS requires.
The Cloud is an ever-evolving environment; as you are reading this post, cloud vendors have already added new capabilities to the technologies presented here, and SAS engineers are further testing them to always highlight the best fit.
Stay tuned to read the results of these performance tests with a special focus on Azure NetApp Files and SAS Grid Manager.