Last Updated: 12APR2022
Information added on 12APR2022: Information about new Ebds_v5 instances.
This post discusses specifics for running SAS® (either SAS 9.4 or Viya 3.x) in the Microsoft (MS) Azure public cloud. Please review the SAS® Global Forum 2020 paper “Important Performance Considerations When Moving SAS® to a Public Cloud” for critical information that we will not cover in this post.
To get the most from the guidelines in this post, you need to understand the compute resource needs (cores, memory, IO throughput, and network bandwidth) of your SAS applications. If you know this information, you can override the generic IO throughput recommendations discussed in this post.
Please remember that most public cloud instances list CPUs as virtual CPUs (vCPUs). These vCPUs might be hyperthreaded (two threads per physical core). You need to understand whether the vCPU count includes hyperthreads so that you can ensure you have the correct number of physical cores for SAS. To convert hyperthreaded Intel vCPUs to physical cores, divide the number of vCPUs by 2.
In addition to the information about Azure instances types, storage and networking, please follow the best practices in the “Optimizing SAS on RHEL (April 2019, V 1.3.1 or later)” tuning guide. The information in the “2.4.4.4 Virtual Memory Dirty Page Tuning for SAS 9” section on page 17 is essential.
Azure instance types. This link brings you to the list of instance types. Read the description carefully to thoroughly understand what compute resources are available with each instance.
If the instance type can be backed by multiple processor models – for example, the Esv3 series can use Broadwell, Skylake, or Cascade Lake processors – you need to confirm which Intel processor each instance is using, since you cannot select the chip set that will be used for the VM from the Portal. After an instance is instantiated, running the lscpu command will list the CPU model name for the system, as shown in the example below.
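For example, a quick check after the instance is running:

# Display the processor model reported to the guest
lscpu | grep "Model name"
# A value of 2 here means the vCPUs are hyperthreads; divide the vCPU count by 2 for physical cores
lscpu | grep "Thread(s) per core"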
For SAS Grid compute nodes and CAS Controllers/Workers, we recommend that all these systems be the same CPU generation. This ensures consistent performance overall, rather than performance dictated by the slowest, oldest CPU instance. Please work with your Microsoft account team to determine how to make this happen. We also strongly suggest investing in Unified (a.k.a. Premier) Support when deploying SAS in Azure.
General Tuning Guidance
Review “Max uncached disk throughput IOPS/MBps” to see the maximum IO throughput in MB per second available between the instance you are looking at and Premium Storage. For a Standard_E32s_v4 instance (one of the most popular MS Azure instances used for SAS compute systems), the maximum IO throughput (instance total, not per physical core) is 768 MB per second. For a 16-physical-core system, this means 48 MB/sec/physical core of IO bandwidth for all the data stored on external Premium Storage. If you need more IO throughput per physical core to the external Premium Storage, you can constrain the number of cores in the instance. There are more details on “constraining cores” later in this post. UPDATE: With the new Ebds_v5 instances, the maximum IO throughput has been increased significantly. Update added 12APR2022.
Review “Max NICs/Expected network bandwidth (Mbps)” to see the maximum network egress bandwidth. For a Standard_E32s_v4 instance, the maximum network egress bandwidth is constrained to 16 Gigabit/sec, whereas ingress is constrained only by network card speed and the number of network connections. Refer to this page for details; the first 4 paragraphs are a must-read. Please note, SAS recommends a network bandwidth of at least 10 Gigabit between the SAS systems within a SAS infrastructure.
Review “Temp storage (SSD) GB” and “Max cached and temp storage throughput: IOPS/MBps (cache size in GB)” to see the size and maximum IO throughput of the local, ephemeral disk. For a Standard_E32s_v4 instance, the maximum size of the internal SSD that could be used for temporary SAS file systems (SAS WORK/UTILLOC or CAS_DISK_CACHE) is 512 GB and the maximum IO throughput is 512 MB/sec (32 MB/sec/physical core). This temp storage is both small and operates at a much lower IO throughput than SAS recommends, so you will probably not want to use it for temporary SAS file systems. When the local ephemeral storage is inadequate, more IO is required from the external Premium Storage, which also has a cap on its IO throughput – see number 2) above. UPDATE: With the new Ebds_v5 instances, the maximum IO throughput has been increased significantly. Update added 12APR2022.
Please note: You can utilize Constrained Cores with Azure instances to reduce the number of vCPUs (and thus physical cores) presented to the instance’s operating system. This would turn a Standard_E32s_v4 from a 16-physical-core system into an 8-physical-core system, effectively doubling the IO bandwidth per core. This brings the IO throughput per physical core closer to the minimum recommended for SAS workloads. Details on this feature, and a list of instances that can be constrained, can be found here.
When using RHEL 7.x (3.10 kernel) on SAS compute nodes, you may encounter sporadic NMI lockups that hold processing while a thread waits for an available vCPU. There is a known issue in the iSCSI and SCSI drivers in this kernel which can cause CPU lockups under heavy IO load. Without going into too much technical detail, it boils down to Linux having a ring buffer of IO completions that it occasionally wants to flush. In some cases, the flushing can take a very long time due to system defaults, which can cause your CPUs to lock up. This in turn may result in timeouts of SAS servers, which can cause job failures.
There are two workarounds to resolve the issue. Add either of the following options to GRUB and reboot the machine; an example of applying the preferred option follows the list.
Decrease the ring buffer size and increase the vCPUs per channel (preferred solution):
hv_storvsc.storvsc_ringbuffer_size=131072 hv_storvsc.storvsc_vcpus_per_sub_channel=1024
Disable blk-mq:
scsi_mod.use_blk_mq=n
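As an illustration, on RHEL 7.x the chosen option is typically appended to the GRUB_CMDLINE_LINUX line in /etc/default/grub, after which the GRUB configuration is regenerated and the machine rebooted (the path below assumes a BIOS-based image; adjust for UEFI):

# 1. Edit /etc/default/grub and append the chosen option(s) to the GRUB_CMDLINE_LINUX line, e.g.:
#    GRUB_CMDLINE_LINUX="... hv_storvsc.storvsc_ringbuffer_size=131072 hv_storvsc.storvsc_vcpus_per_sub_channel=1024"
# 2. Regenerate the GRUB configuration and reboot:
sudo grub2-mkconfig -o /boot/grub2/grub.cfg
sudo reboot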
Network
To achieve optimal network bandwidth, Azure Accelerated Networking must be enabled. Accelerated Networking is available on any Azure VM with 2 or more physical cores.
To validate that Accelerated Networking is enabled on a Linux instance, run the following commands and ensure your output looks like the output on this web site.
lspci
ethtool -S eth0 | grep vf_
uname -r
In addition to Accelerated Networking, SAS needs to be on an isolated cloud VNET, Resource Group, etc. This VNET should “share nothing” with other customer infrastructure. The exception is placing the instances for your shared file system and any RDBMSs dedicated to SAS on this VNET as well.
To get consistent instance-to-instance IO and throughput, ensure that all your instances are in the same Azure Proximity Placement Group (see the Azure CLI sketch at the end of this section).
DNS resolution must be verified prior to SAS installation. The FQDNs used to communicate between Nodes within Azure should resolve to Azure internal IP addresses. From nodes external to Azure (client desktops running SAS Enterprise Guide or the SAS Plug-in for Microsoft Office) FQDNs must resolve to the public / external IP addresses of nodes hosting the SAS server tier running inside Azure.
When attaching SAS clients that maintain persistent connections (for example, SAS Enterprise Guide) to Azure instances from outside of Azure, we have seen the connections dropped if they are idle for more than 4 minutes. This is a feature of the Azure NSG. To prevent this from happening, go into SAS Management Console, add the KeepAlive setting to the Workspace Server, and set it to a value of less than 4 minutes.
SAS nodes must be able to communicate directly with each other without contention. SAS Compute nodes need both extremes of high throughput and low-latency communications. Throughput is needed for loading data into memory. Low-latency is needed to coordinate and perform complex analytics between nodes and to provide data resilience via copy redundancy. Please start your SAS deployment with an isolated VNET, using private IP addresses and private DNS. (At minimum, SAS nodes should be in their own subnet.) If you need to deploy a SAS solution in Azure and you do not have cross-premises connectivity (e.g. ExpressRoute, VPN), then use one of following approaches to enhance your security:
Use Azure Bastion Host (Preferred) - https://docs.microsoft.com/en-us/azure/bastion/bastion-overview
Create a “jump box” that is the public entry point (with a Private IP address)
Azure Default VM Network MTU Size - Azure strongly recommends that the default network MTU size of 1500 not be adjusted, because Azure’s virtual network stack will attempt to fragment a packet at 1400 bytes. To learn more, please review this “Azure and MTU” article.
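As a minimal sketch of the networking items above (resource group, VNet, VM, and placement group names are placeholders, and the image and size are only examples), a proximity placement group and a VM with Accelerated Networking enabled can be created from the Azure CLI:

# Create a proximity placement group for the SAS instances
az ppg create --resource-group sas-rg --name sas-ppg --location eastus2 --type Standard
# Create a SAS compute VM in that placement group with Accelerated Networking and no public IP
az vm create --resource-group sas-rg --name sas-compute01 \
  --image RHEL --size Standard_E32s_v4 \
  --ppg sas-ppg \
  --vnet-name sas-vnet --subnet sas-subnet \
  --accelerated-networking true \
  --public-ip-address ""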
External Storage
To achieve the most IO throughput for SAS, please make sure that you follow the best practices in the “Optimizing SAS on RHEL (April 2019, V 1.3.1 or later)” tuning guide. The information in the “2.4.4.4 Virtual Memory Dirty Page Tuning for SAS 9” section on page 17 is essential.
The following architecture recommendations cover scale-up scenarios. Scale-out recommendations will follow later, pending validation.
Premium Storage: Like the instance types, there is a maximum IO throughput per Premium Disk. These values can be found on the “Throughput per disk” row of this table. Enough Premium Disks should be attached to the instance to meet or exceed the instance’s “Max uncached disk throughput IOPS/MBps”. These disks should then be striped together by the operating system to create a single file system that can utilize the full throughput across all the disks, as in the example below.
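As an illustration only – device names, volume names, stripe size, and the mount point below are placeholders and will vary by environment – four attached Premium Disks can be striped into a single XFS file system with LVM:

# Create physical volumes on the attached Premium Disks (device names are examples)
sudo pvcreate /dev/sdc /dev/sdd /dev/sde /dev/sdf
# Build a volume group and a logical volume striped across all four disks
sudo vgcreate vg_sasdata /dev/sdc /dev/sdd /dev/sde /dev/sdf
sudo lvcreate --extents 100%FREE --stripes 4 --stripesize 64 --name lv_sasdata vg_sasdata
# Create and mount a single file system that spans the striped disks
sudo mkfs.xfs /dev/vg_sasdata/lv_sasdata
sudo mkdir -p /sasdata
sudo mount /dev/vg_sasdata/lv_sasdata /sasdata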
When creating disk storage, you will be prompted to set a Storage Caching value. Please set the following based on the type of files that will be placed on these disks (an Azure CLI example follows this list):
ReadWrite for your operating system storage
None* for your persistent SAS data files
None* for your SAS temporary files
* This value was changed on 07DEC2021 after additional testing.
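If you script disk attachment with the Azure CLI, the caching value can be set when the disk is attached. A minimal sketch, assuming an existing Premium Disk named sasdata-disk01 (all names are placeholders):

# Attach an existing Premium Disk with host caching disabled (recommended above for SAS data and temporary files)
az vm disk attach --resource-group sas-rg --vm-name sas-compute01 --name sasdata-disk01 --caching None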
With the RHEL 7.x distribution and 3.x kernel, testing has shown that leaving the virtual-guest tuned profile in place (vm.dirty_ratio = 30 and vm.dirty_background_ratio = 10) achieves the best IO throughput when using Premium Storage.
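To confirm those settings on a running instance (a simple check, not a tuning change):

# Show the active tuned profile (virtual-guest is expected on these images)
tuned-adm active
# Verify the dirty page ratios referenced above
sysctl vm.dirty_ratio vm.dirty_background_ratio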
Azure NetApp Files (ANF): Please review this blog on how to use ANF with SAS. Azure NetApp Files: A shared file system to use wi... - SAS Support Communities
EXAScaler Cloud Shared File System: Please review this blog on how to use EXAScaler Cloud (Lustre) with SAS. EXAScaler Cloud by DDN: A shared file system to us... - SAS Support Communities. EXAScaler Cloud has replaced the community version of Lustre.
IBM Spectrum Scale: Please review this blog on how to use IBM Spectrum Scale with SAS. Sycomp Storage Fueled by IBM Spectrum Scale: A new... - SAS Support Communities
Veritas InfoScale: Please review this blog on how to use InfoScale with SAS. InfoScale by Veritas: A shared file system to use ... - SAS Support Communities (added 01APR2022)
Azure Disk Storage is the only shared block storage in the cloud that supports both Windows and Linux based clustered or distributed applications, allowing you to run your most demanding enterprise applications – like clustered databases, parallel file systems, stateful containers, and machine learning applications – in the cloud without compromising on well-known deployment patterns for fast failover and high availability. While this storage will technically function with all SAS applications, we do not believe you will be able to achieve the IO throughput required by most SAS applications.
As a reminder, SAS temporary files and directories such as SAS WORK, SAS UTILLOC and CAS_DISK_CACHE should be placed on storage with the highest proven throughput possible. Today that usually means Premium Storage or the instance’s local SSD.
Reference Instances for SAS Compute Nodes
To summarize the above, the following are good example configurations for SAS 9.4 or SAS Viya 3.5 compute nodes.
Standard_E16bds_v5 or E32bds_v5 (specs for this system): recommended instances, but they may not be available everywhere since these are newly released.
Ice Lake processor.
8 or 16 physical cores (16 or 32 vCPUs)
128 or 256 GB RAM
For persistent storage, use six P30 Premium Disks striped together for a total of 6 TB. If more disk space is needed, then add more P30 disks or larger Premium Disks.
The internal SSD drive can be used for SAS temporary file systems, but it cannot be increased in size.
30 Gigabit egress network connectivity
Standard_E64-32ds_v4 or E64-16ds_v4 (specs for this system): recommended instances
Cascade Lake processor.
8 or 16 physical cores (16 or 32 vCPUs)
504 GB RAM
For persistent storage, use six P30 Premium Disks striped together for a total of 6 TB. If more disk space is needed, then add more P30 disks or larger Premium Disks. Remember the maximum IO bandwidth to the E64 instance is 1,200 MB/sec. With the constrained cores, this equates to 75 MB/sec/physical core for the Standard_E64-32ds_v4 and 150 MB/sec/physical core for the Standard_E64-16ds_v4. SAS recommends at least 100 MB/sec/physical core.
The internal 2,400 GB SSD drive can be used for SAS temporary file systems, but it cannot be increased in size. The throughput for this storage is 1,936 MB/sec which equates to 121 MB/sec/physical core. SAS recommends at least 150 MB/sec/physical core.
30 Gigabit egress network connectivity
Standard_E32s_v4 - specs for this system:
Broadwell, Skylake, or Cascade Lake processor. The inability to determine which chip set you will get with this instance type makes it a poor choice for SAS Grid implementations.
16 physical cores (32 vCPUs)
256 GB RAM
For persistent storage, use four P30 Premium Disks striped together for a total of 4 TB. If more disk space is needed, then add more P30 disks or larger Premium Disks. Remember the maximum IO bandwidth to the E32 instance is 768 MB/sec. This equates to 48 MB/sec/physical core. SAS recommends at least 100 MB/sec/physical core.
The internal 512 GB SSD drive can be used for SAS temporary file systems, but it cannot be increased in size. The throughput for this storage is 512 MB/sec, which equates to 32 MB/sec/physical core. SAS recommends at least 150 MB/sec/physical core.
16 Gigabit egress network connectivity
Standard_L32s_v2 - specs for this system:
AMD EPYC 7551 processor.
If you plan to use SAS 9.4M6 or earlier releases of SAS 9.4 on these instances, you will need to set the Linux environment variable MKL_DEBUG_CPU_TYPE to a value of 5 (see the note after this list for one way to make this persistent). Here is the command to do this: export MKL_DEBUG_CPU_TYPE=5
16 physical cores (32 vCPUs).
256 GB RAM
For persistent storage, use four P30 Premium Disks striped together for a total of 4 TBs. If more disk space is needed, then add more P30 disks or larger Premium Disks. Remember the maximum IO bandwidth to the L32 instance is 640 MB/sec. This equates to 40 MB/sec/physical core. SAS recommends at least 100 MB/sec/physical core.
This system has four internal 1.92 TB NVMe drives which can be used for temporary file systems. These disks can be OS-striped to create a 7.5 TB file system. The maximum IO throughput to these drives is 8,000 MB/sec! That equates to 500 MB/sec/physical core. SAS recommends at least 150 MB/sec/physical core.
16 Gigabit egress network connectivity
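Regarding the MKL_DEBUG_CPU_TYPE variable noted above: the export command only affects the current shell session. One way to make it persistent for all users (the file name is just an example) is:

# Persist the MKL workaround across logins (example file name; any profile.d script will work)
echo 'export MKL_DEBUG_CPU_TYPE=5' | sudo tee /etc/profile.d/sas_mkl.sh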
Conclusion
There are many resources, configuration settings, and constraints to check within Azure to configure an instance that meets the needs of your SAS application. It is highly likely you will have to provision an instance with more physical cores than you need (with or without a constrained core count) in order to get the commensurate IO throughput required by your application. Likewise, you may also have to over-provision storage capacity to achieve the IO throughput needed for your SAS application.
As always, there are cost-versus-performance choices. These selections need to be based on your SLAs and business needs for the SAS applications running in Azure versus where they are currently running.
Acknowledgements
Many thanks to SAS R&D, SAS Technical Support, Microsoft Azure, Azure NetApp Files, Sycomp, Veritas, and DDN experts for reviewing this post.
Margaret Crevar, SAS
Jim Kuell, SAS
Chris Marin, Microsoft
Jarrett Long, Microsoft
Gert van Teylingen, Azure NetApp Files
Chad Morgenstern, Azure NetApp Files
Dan Armistead, Azure NetApp Files
Greg Marino, Azure NetApp Files
James Cooper, DDN
John Zawistowski, Sycomp
Joseph D’Angelo, Veritas