In my previous post about Azure storage, you can find details on Azure disk storage options, i.e. storage dedicated to a single host. In this new installment, the focus is storage shared between multiple virtual machines. What options does Azure provide to SAS Architects?
Evolving technologies
In the Cloud world, the industry is moving toward object storage mechanisms, such as Azure Blob storage, which is the foundation for Azure Data Lake Storage Gen2 (ADLS2). Object storage is optimized for storing massive amounts of unstructured data, such as logs, images, video, and audio.
SAS is continuously implementing new features to provide robust integration with data stored in these new providers. SAS Cloud Analytic Services (CAS) can access CSV and ORC files stored in ADLS2.
Moreover, when direct access is not available, you can leverage third-party projects such as Azure Blobfuse.
This does not mean that traditional file storage does not exist in the cloud; file storage is here to stay as a long-term requirement.
SAS has not abandoned file storage either; on the contrary, it remains one of the most widely used methods to read, write, and share data, rivaled probably only by access to relational databases.
To read more about object storage see Cloud Object Storage and SAS by Stephen Foerster. The rest of this post focuses on the options available on Azure to provide shared file systems, so that multiple hosts can access common data, whether for SAS 9, SAS Grid Manager, or SAS Viya.
Choices, choices, choices
When moving from traditional on-prem environments to the cloud, you can be overwhelmed by the number of options for the seemingly simple task of sharing a file system between multiple hosts. As a SAS Architect, you will very often need such a shared file system in the infrastructure, and you should be able to articulate the different requirements for common usages in SAS 9.x or SAS Viya scenarios: hosting deployment artifacts, supporting High Availability, serving as a prerequisite for backup tools, satisfying the I/O requirements of SAS Grid Manager shared storage, providing an RWX Persistent Volume for SAS Viya 4 in Kubernetes, and so on.
Here are some of the most common storage solutions used by SAS architects.
Managed disks
(Yes, the ones described in my previous post).
A first, simple option is to do just as on-premises: attach virtual disks to one VM, then use NFS (Linux) or CIFS (Windows) to export that storage to the other machines in the environment.
With this, you are using Azure just to provide the infrastructure, keeping all the traditional software configuration and maintenance for yourself. It is really not a managed solution.
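As a minimal sketch of that setup – assuming a hypothetical file server host named sasnfs01, a managed disk already formatted and mounted at /shared/sasdata, and SAS clients on the 10.0.0.0/24 subnet:

# On the file server: export the directory backed by the managed disk
sudo yum install -y nfs-utils
echo "/shared/sasdata 10.0.0.0/24(rw,sync,no_root_squash)" | sudo tee -a /etc/exports
sudo systemctl enable --now nfs-server
sudo exportfs -ra

# On each SAS host: mount the exported file system
sudo mkdir -p /shared/sasdata
sudo mount -t nfs sasnfs01:/shared/sasdata /shared/sasdata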
You still have some advantages compared to traditional on-prem solutions when you consider that the file server machine and the exported disks are often single points of failure in your environment. On the Azure cloud, a standalone virtual machine can have a guaranteed availability above 99.9%, and managed disks are provisioned from redundant storage. According to Azure documentation:
“Managed disks are designed for 99.999% availability. Managed disks achieve this by providing you with three replicas of your data, allowing for high durability.”
A downside of this approach is that you need one dedicated machine to be the file server, and you have to manage it yourself.
In simple environments (dev, test), you may simply delegate this role to one of the SAS hosts.
Managed Azure Files
This is the “default” way of sharing file storage on Azure. Azure Files offers fully managed file shares in the cloud. Although Azure documentation lists multiple benefits, I think this is the most important:
“Azure file shares can be created without the need to manage hardware or an OS. This means you don't have to deal with patching the server OS with critical security upgrades or replacing faulty hard disks.”
It’s a step forward in being cloud-native, compared to the previous option.
Azure file shares use the Server Message Block (SMB) protocol, i.e. they behave as shares created from a Windows server. Although SMB is not originally a native Linux protocol, Linux hosts can use Azure file shares by mounting them with the CIFS kernel client.
In using this with SAS, you can encounter some limitations of the CIFS protocol: the most notable is that once you choose a user/group as the owner of the mounted share and a permissions mode, these properties are fixed and cannot be changed without unmounting and remounting.
As an example, suppose you create a share called utils in a storage account named mysasstorage123 (the name must be unique across all existing storage account names in Azure). After entering the correct credentials in /etc/smb_credentials, you can mount it using this line in /etc/fstab:
//mysasstorage123.file.core.windows.net/utils /mnt/utils cifs rw,vers=3.0,credentials=/etc/smb_credentials,uid=AzureUser,gid=users,file_mode=0775,dir_mode=0775,serverino 0 0
This creates a share owned by AzureUser, with the file_mode and dir_mode of your choice.
If you create any subdirectory or file there, it will all be owned by the same user and have the same permissions.
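For completeness, here is a hedged sketch of how the storage account and the utils share themselves could be created with the Azure CLI (the resource group and region are illustrative placeholders):

# Create a general purpose v2 storage account; the name must be globally unique
az storage account create --name mysasstorage123 --resource-group sas-rg --location eastus --sku Standard_LRS --kind StorageV2

# Create the "utils" file share inside it
az storage share create --name utils --account-name mysasstorage123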
This can be a good fit to share data in some specific use cases:
installation artifacts as in the example above, including the SAS 9 depot, SAS Viya mirror, Ansible playbooks, etc.
shared vault for SAS Viya backups or central vault for SAS 9 backups
shared file system for CAS, required if your deployment includes a secondary CAS controller
On the other hand, it is not a good fit for home directories, because each user would see everybody else’s content – including secrets such as ssh keys – unless you mount a dedicated share per user.
Another huge limitation, intrinsic to the CIFS protocol, is that it requires communication on port 445. While this is usually not a problem between hosts running in your datacenter, or in the same vnet in the cloud, that port is usually closed by administrators on firewalls and between different networks. This means that you probably will not be able to mount these shares on any server outside Azure, including on-prem hosts.
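A quick way to verify whether port 445 is reachable from a given Linux host before adding the fstab entry (using standard tooling; the storage account name is the one from the example above):

# Succeeds only if outbound TCP 445 is open along the whole path
nc -zvw3 mysasstorage123.file.core.windows.net 445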
Managed Azure NetApp Files (ANF)
Although the name sounds similar to the previous option, this is a totally different kind of shared storage. You have to request onboarding to Azure NetApp Files, following the Register for Azure NetApp Files instructions. After your subscription has been authorized to use the service and you have registered the Azure Resource Provider for Azure NetApp Files, you can start creating storage artifacts: storage accounts, storage pools, and volumes. Finally, you can mount those volumes on multiple hosts using native NFS for Linux or SMB for Windows.
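As a sketch of that workflow with the Azure CLI – all resource names here are hypothetical, and flag spellings may vary across CLI versions, so treat this as an outline rather than a recipe:

# Register the resource provider (after the subscription has been onboarded)
az provider register --namespace Microsoft.NetApp

# Create the NetApp account, a 4 TiB Ultra capacity pool, and an NFS volume
az netappfiles account create --resource-group sas-rg --location eastus --account-name sas-anf
az netappfiles pool create --resource-group sas-rg --location eastus --account-name sas-anf --pool-name sas-pool --size 4 --service-level Ultra
az netappfiles volume create --resource-group sas-rg --location eastus --account-name sas-anf --pool-name sas-pool --volume-name sasdata --service-level Ultra --usage-threshold 4096 --file-path sasdata --vnet sas-vnet --subnet anf-subnet --protocol-types NFSv3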
Since ANF uses native protocols, it supports multiple users and permissions on subdirectories. Although more expensive than regular Azure file shares, it is also more performant. You can provision from a minimum of 4 TiB up to a few petabytes.
Just as with Azure Files, a huge benefit is that ANF is fully managed: Azure handles hardware maintenance, updates, and critical issues for you.
It also comes with some specific considerations:
It requires specific mount options to avoid performance degradation during metadata updates and cache coherency issues; see the sample mount line after this list.
At the time of writing, there are no workarounds to the NFS limit on the number of groups (16) that a user can belong to. If you use Active Directory – including its Azure variants – as your identity source, Linux will probably only see the first 16 groups each user belongs to. These could even be “random” email groups instead of the business units you based your security design upon.
Currently, it cannot scale beyond one storage volume per environment. This puts a hard cap on the total bandwidth the storage can provide (about 5000 MiB/s read, 2000 MiB/s write). There are workarounds, but they require manual, ad-hoc work from NetApp engineers, or expensive vnet peering (see the next point).
Each volume has to reside in a specific subnet assigned to the same vnet as the compute tier. To mount that volume in another vnet, you have to configure vnet peering, which means Microsoft will charge you a few cents for every gigabyte that crosses the boundary. That can easily add up to millions of dollars per year.
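To illustrate the first point, here is a sample NFS mount line for an ANF volume. The endpoint IP and export path are placeholders, and the exact option list should come from current SAS and NetApp guidance for your workload:

sudo mount -t nfs -o rw,bg,hard,vers=3,proto=tcp,rsize=65536,wsize=65536 10.0.1.4:/sasdata /sasdata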
Performance considerations
Azure file shares are provisioned from a storage account; each Azure subscription can have multiple storage accounts, and different types of storage accounts. The two types of storage accounts that can provide Azure file shares to SAS environments are:
General purpose version 2 (GPv2) storage accounts: these are backed by standard, hard disk-based hardware. In addition to storing Azure file shares, GPv2 storage accounts can store other storage resources such as blob containers, queues, or tables. As an example, the SAS Viya Quickstart Template for Azure recommends using a blob container to host the Viya license and mirror. This storage can be cheap (no fixed minimum price, you only pay per usage), but you can quickly hit performance limits. Utilization by other storage services affects the Azure file shares allocated in the same storage account; for example, if you reach the maximum storage account capacity with Azure Blob storage, you will not be able to create new files on your Azure file share.
FileStorage storage accounts: these allow you to deploy Azure file shares on premium, solid-state disk-based hardware. FileStorage accounts can only be used to store Azure file shares; no other storage resources (blob containers, etc.) can be deployed in a FileStorage account. FileStorage provides higher throughput than GPv2, but it can be more expensive (you pay per provisioned space/throughput, even when not using it, with a minimum of 100 GiB).
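As a hedged sketch, the difference boils down to the kind and SKU chosen at creation time, plus the provisioned quota on premium shares (all names are placeholders):

# FileStorage account: premium, SSD-backed, can hold file shares only
az storage account create --name mysaspremium123 --resource-group sas-rg --location eastus --sku Premium_LRS --kind FileStorage

# Premium shares are billed on provisioned size, e.g. 1 TiB
az storage share create --name sasdata --account-name mysaspremium123 --quota 1024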
Standard storage accounts have a fixed maximum bandwidth of 60 MiB/s, which can be upgraded to 300 MiB/s.
Premium storage account bandwidth is proportional to the allocated storage size; the maximum possible allocation is 100 TiB, which delivers 6200 MiB/s (read) and 4130 MiB/s (write), but the limits for a single file are much lower: 300 MiB/s (read) and 200 MiB/s (write).
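Those figures are consistent with the provisioned-throughput formulas Azure documented at the time, approximately egress ≈ 60 + 0.06 × provisioned GiB and ingress ≈ 40 + 0.04 × provisioned GiB (in MiB/s): at the 100 TiB maximum (102,400 GiB), that works out to about 60 + 6,144 ≈ 6,200 MiB/s read and 40 + 4,096 ≈ 4,130 MiB/s write.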
Due to these low I/O throughput limits, Azure file shares are not recommended to process SAS datasets across multiple machines.
Azure NetApp Files comes in three service levels: Standard, Premium, and Ultra. Ultra is usually the best choice for SAS, in terms of both cost and performance. The throughput limit for a volume is determined by a combination of the quota assigned to the volume and the selected service level; the maximum empirical throughput observed in testing is about 4500 MiB/s (read) and 2000 MiB/s (write).
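As a rough worked example – assuming the per-TiB rates documented at the time (Standard 16, Premium 64, Ultra 128 MiB/s per TiB of quota) – an 8 TiB Ultra volume would cap at about 8 × 128 = 1,024 MiB/s, so approaching the ~4,500 MiB/s observed maximum requires provisioning a quota of roughly 35 TiB or more.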
SAS Grid Manager can run on Azure using Azure NetApp Files as a shared storage for small environments, given proper OS and infrastructure sizing and tuning. The storage can scale to accommodate up to 24 physical cores for the compute nodes (3 nodes with 8 cores each, or 6 with 4 cores each).
As a final consideration, it is important to understand an intrinsic performance limit of any sharing technology: they all transfer data through the Azure network and are thus subject to virtual machine networking limits. For this reason, SAS requires using accelerated networking. The network bandwidth allocated to each virtual machine is capped on egress (outbound) traffic, while ingress is not metered or limited directly; however, other factors, such as CPU and storage limits, can impact a virtual machine’s ability to process incoming data. In practice, this means that Azure virtual machines enforce a hard limit on the maximum write throughput towards any shared storage, while read bandwidth is virtually unlimited, up to the maximum that the storage can provide. As an example, E32s_v3 machines have a write limit of 2000 MiB/s.
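Accelerated networking is normally enabled when the NIC is created; as a hedged sketch, it can also be toggled on an existing NIC (typically with the VM deallocated first; resource names are placeholders):

az network nic update --resource-group sas-rg --name sasnode1-nic --accelerated-networking true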
The following image shows how the performance limits discussed so far can affect the storage performance of a SAS Grid Manager environment:
Bottom line: the most "throttled" pipe or constraint at any point in the data path determines the best throughput your workload can achieve!
Additional Considerations
The options presented in this post are the most common, but not the only possible ones. There can be specific use cases where other shared disk technologies can be used:
DDN Whamcloud Cloud Edition for Lustre. This is the cloud community edition of Lustre, a clustered file system that has proved to be a good fit for SAS, including SAS Grid Manager. The biggest consideration for this offering is that it is not managed, i.e. you are responsible for installation, configuration, management, and maintenance just as on-prem. Not everyone has the skills to do that.
Azure file shares using native NFS instead of SMB. This can remove some of the Azure Files limitations, such as user ownership and permissions, but it is still in preview in limited regions and does not improve the available performance. For this reason, it is still not recommended for sharing SAS datasets.
IBM Spectrum Scale. This is, traditionally, a good fit for sharing SAS data, including for SAS Grid Manager. Unfortunately, it is still not available on Azure. The linked GitHub page only provisions the infrastructure, not the actual file system; it also reads:
“All templates / modules / resources in this repo are released for use "AS IS" without any warranties of any kind, including, but not limited to their installation, use, or performance.”
As already stated in the previous post, Azure shared disks allow you to attach a managed disk to multiple virtual machines simultaneously, but they are not suitable for SAS. They require specific operating system support and specialized applications to provide the sharing capability. It may be possible to use them through Red Hat GFS2, but this has not been tested or vetted yet.
Conclusion
This post closes with the same consideration as the previous one: there are many resource and configuration choices available within Azure. To select the proper shared storage for your SAS environment, you may have to overprovision storage capacity in order to obtain the I/O throughput that SAS requires.
The Cloud is an ever-evolving environment; as you read this post, cloud vendors have already added new capabilities to the technologies presented here, and SAS engineers keep testing them to highlight the best fit.
Stay tuned to read the results of these performance tests with a special focus on Azure NetApp Files and SAS Grid Manager.