
Keeping Analytics Available: SAS Viya now supports Multi-AZ Deployment!


The 2025.10 version of SAS Viya comes with an exciting new feature that will make many customers very happy! 😊

 

Indeed, starting with this version, you can deploy the SAS Viya platform into a cluster that spans multiple availability zones (AZ) in M... and remain fully supported!

 

After many customer requests to support and leverage a SAS Viya deployment across multiple Availability Zones, this became a product management priority in recent months.

 

While the support of multiple AZ is now available, it is still limited to a selection of offerings and specific Cloud platforms. 

 

This post provides an overview of the requirements and limits of such a deployment, as well as some additional considerations and an update on the current Multi-AZ support scope (as of December 2025).

 

 

 

Availability Zones in the Cloud

 

But before we dive into the SAS Viya requirements and limitations, let’s quickly review what this concept of Availability Zones (aka "AZ") means in the Cloud.

 

Public Cloud providers (such as Azure, AWS and GCP) organize their resources hierarchically: by region, then zone, then data center.

 

In the Cloud, "zones" correspond to distinct physical locations. Each zone corresponds to a separate group of data centers within a region. As an example, in Azure, the West US 2 region contains three different zones.

 

01_RP_multi-az-regions-diagram-1024x578.png


 

Availability zones are typically separated by several kilometers, and usually within 100 kilometers. So they're close enough to have low-latency connections to other availability zones through a high-performance network. However, they're far enough apart to reduce the likelihood that more than one will be affected by local outages or weather.

 

The benefit of deploying an application across multiple AZs is improved availability, even in the case of a major disaster that would impact a whole zone (earthquake, flood, cyber-attack, etc.).

 

When you create a managed Kubernetes cluster in the public Cloud, you choose the region, and you can also choose one or more zones (if the region supports availability zones).
By default, the AKS control plane is already zone resilient (there is nothing to configure for that). But as explained in the Azure documentation:

 

"control plane resiliency isn't sufficient for your cluster to remain operational when a zone fails. For the system node pool and any user node pools that you deploy, you must enable availability zone support to help ensure that your workloads are resilient to availability zone failures."

 

For example, if you want a multi-AZ cluster and you are using the SAS IaC project for Azure to provision your AKS cluster, you need to set the default_nodepool_availability_zones and node_pools_availability_zone Terraform variables to define which availability zones are associated with your default and user node pools.
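As a sketch, the corresponding entries in a terraform.tfvars file might look like the following. The variable names are the ones referenced in this post; their exact types and defaults may vary across versions of the IaC project, so verify them against your copy before use:

```hcl
# Illustrative terraform.tfvars fragment for a multi-AZ AKS cluster
# (variable names as referenced in this post; zone values are examples).

# Spread the default (system) node pool across three availability zones:
default_nodepool_availability_zones = ["1", "2", "3"]

# Zone placement for the user node pools; adjust so that the user node
# pools land in the zones your workload layout requires:
node_pools_availability_zone = "1"
```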

 

 

Requirements…

 

Now, if you decide to deploy SAS Viya in a Cloud managed Kubernetes service where multiple availability zones are enabled, there are some requirements and limits to be aware of. They are now officially documented in the SAS Viya Platform Operations Guide.

 

 

Zone-redundant storage

 

The first requirement is at the storage level.

 

If you deploy the SAS Viya platform across multiple availability zones, some pods hosting SAS Viya platform components may be rescheduled (as part of the normal pod lifecycle) and land in a different zone from the one where the persistent storage was initially provisioned and attached.

 

So we need to provision "zone-redundant storage" (aka "ZRS") for all Persistent Volumes created by the SAS Viya deployment in order to ensure data availability across zones.

 

As an example, using a storage class based on Single Virtual Machine with an NFS Server (made available through an NFS provisioner) for the SAS Viya platform Persistent Volumes does NOT provide "Zone-redundant" storage.

 

So for Viya volumes that require "RWX" ("Read Write Many") access, the "Single NFS Server VM" option (which is the "standard" storage type with the SAS-provided IaC tools for Azure, AWS and GCP) does NOT meet the multi-AZ requirements: the VM is located in a single zone, and in case of a zone failure, the storage of the Viya platform would not be accessible.

 

Instead, managed storage with multi-zone support (such as Azure NetApp Files or Amazon FSx for NetApp ONTAP) should be used, and properly configured for HA too.

 

For Viya volumes that require "RWO" ("Read Write Once") access, block storage is recommended and typically used. However, we must ensure that the Cloud storage class used for block storage is zone-redundant. As an example, on AKS the "managed-csi" and "managed-csi-premium" RWO storage classes meet the requirement when they use Azure zone-redundant (ZRS) SKUs to create managed disks. Amazon Elastic Block Store is not zone-redundant; look to Amazon FSx for NetApp ONTAP or Amazon Elastic File System instead.
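To make the ZRS requirement concrete, here is a minimal StorageClass sketch for zone-redundant premium managed disks on AKS, using the Azure disk CSI driver. The class name is hypothetical; the skuName parameter value is the Azure disk CSI setting for zone-redundant premium SSDs:

```yaml
# Illustrative StorageClass: zone-redundant (ZRS) Azure managed disks
# for RWO volumes. The class name is hypothetical.
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: managed-premium-zrs          # hypothetical name
provisioner: disk.csi.azure.com
parameters:
  skuName: Premium_ZRS               # zone-redundant premium SSD
reclaimPolicy: Delete
volumeBindingMode: WaitForFirstConsumer
allowVolumeExpansion: true
```

With ZRS disks, a pod rescheduled into another zone can still attach its persistent volume, which is exactly the scenario described above.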

 

 

Configure SAS Viya (and critical 3rd party tools) for HA

 

The second requirement is to configure High Availability (aka "HA") in the SAS Viya platform deployment as instructed in Configure High Availability in SAS Viya Platform: Deployment Guide.

 

The High Availability configuration essentially applies a specific PatchTransformer that creates two replicas for each stateless microservice. Most of the stateful platform components (Consul, RabbitMQ, Redis) are already deployed in HA mode by default, with two or three replicas.
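As a sketch, enabling the SAS-provided HA transformer amounts to referencing it from the transformers block of your kustomization.yaml. The sas-bases path below follows the pattern from the deployment guide; verify it against the sas-bases of your cadence:

```yaml
# kustomization.yaml fragment (sketch): enable the SAS HA PatchTransformer.
# Path taken from the deployment guide pattern; check your sas-bases version.
transformers:
  - sas-bases/overlays/scaling/ha/enable-ha-transformer.yaml
```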

 

Others like CAS and OpenSearch are not deployed in HA by default and should be specifically configured for it.

 

Also note that, to truly make the SAS Viya platform more resilient, some critical 3rd party software components that the platform relies on (ingress-nginx, the NFS provisioner, and potentially SAS Viya monitoring) also need to be configured for HA.

 

If you install these 3rd party applications with the DaC ("Deployment as Code") project's baseline role, the tool does not currently allow you to deploy the ingress controller or the NFS CSI driver controller with multiple replicas. You need to do it manually.

 

 

Pod Topology Spread Constraints

 

While not explicitly listed in the officially documented requirements, there are also some considerations regarding the distribution of pod replicas across the availability zones.

 

It is not enough to have multiple replicas of the pods. We also want to make sure they are spread across distinct zones, so that if a whole zone fails, we know all our replicas were not sitting in that zone!

 

While you can tweak topology spread constraints and set them individually at the pod level, Kubernetes already has built-in cluster-level default constraints in place.

 

If you don’t change anything, the kube-scheduler acts as if you had configured the topology constraints as below:

 

02_RP_TopologySpreadDefaultConstraints-1024x349.png

Source: https://kubernetes.io/docs/concepts/scheduling-eviction/topology-spread-constraints/#internal-defaul...

 

Where:

 

  • maxSkew: the maximum allowed difference in the number of matching pods between any two topology domains. Understanding the maxSkew parameter is not trivial (see the explanations in the Kubernetes documentation), but basically it describes the maximum degree to which pods can be unevenly distributed. Setting it to "1" ensures a near-even spread.
  • topologyKey: for zone spreading, this must be set to the zone label: topology.kubernetes.io/zone.
  • whenUnsatisfiable specifies what action should be taken when maxSkew can't be satisfied:
    • DoNotSchedule (default) tells the scheduler not to schedule the pod. It's a "hard constraint" (the incoming pod remains in PENDING state if the scheduler can’t satisfy the constraint).
    • ScheduleAnyway tells the scheduler to still schedule the pod while prioritizing nodes that reduce the skew. It's a "soft constraint".
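Putting the fields above together, a pod-level override requesting a strict, near-even zone spread could be sketched like this (the app label and its value are hypothetical):

```yaml
# Illustrative pod-spec fragment: hard constraint forcing a near-even
# spread of matching pods across availability zones (maxSkew: 1).
topologySpreadConstraints:
  - maxSkew: 1
    topologyKey: topology.kubernetes.io/zone
    whenUnsatisfiable: DoNotSchedule
    labelSelector:
      matchLabels:
        app: my-app          # hypothetical pod label
```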

 

As we can see, it relies on the fact that the Kubernetes nodes have both kubernetes.io/hostname and topology.kubernetes.io/zone labels set (which is the case by default in the SAS Viya supported Cloud Managed Kubernetes platforms).

 

While a maxSkew value of "1" is recommended for the most even distribution of pods across the AZs, the default Kubernetes settings appear to be adequate for the distribution of SAS Viya pods.

 

In repeated tests run by SAS, the default pod topology spread constraints distributed pods across zones at deployment time without additional configuration. Even though the default maxSkew is higher than "1", multiple tests and stop/start operations resulted in all StatefulSets and Deployments (configured with multiple pods) being evenly distributed. This explains why there are currently no specific pod placement recommendations in the official documentation.

 

Finally, note that Cloud providers usually don’t let you change this cluster-level default configuration. Instead, they recommend overriding it by specifying the topologySpreadConstraints field at the pod level if you require different maxSkew settings, for example (that’s the case for AKS and EKS).

 

 

...and limits for SAS Viya

 

As of the end of 2025, several limitations remain in various areas of SAS support for deployments across multiple availability zones.

 

The table below summarizes these limitations.

 

| Area | Support Limitations | Comments/Notes |
|---|---|---|
| Cloud providers | Only Azure and AWS. | Work is in progress to add support for SAS Viya deployments in GCP (target: stable 2025.12). |
| SAS Viya offerings | Only for SAS Viya Analytics, SAS Visual Statistics, SAS Viya, SAS Viya Advanced, SAS Viya Enterprise, and Visual Investigator. | |
| Automated recovery from a zone failure | CAS and Compute jobs require manual intervention. | In case of a zone failure, CAS may need to be restarted manually. Jobs that use SAS/CONNECT fail, or keep waiting and never reach a failed state, after experiencing a failover in multi-availability-zone environments; these jobs need to be restarted manually. In general, SAS Compute jobs may also need to be restarted manually if they were running on a node located in the failed zone. |
| Multi-region | All availability zones must be contained within a single region. | SAS Viya deployments across multiple regions are not supported. |
| PostgreSQL server (SAS Infrastructure Data Server) | An internal SAS PostgreSQL server is not supported; a zone-redundant database for SAS Infrastructure Data Server is required. | Azure Database for PostgreSQL, Amazon RDS for PostgreSQL, and Cloud SQL for PostgreSQL support HA across multiple availability zones. For existing deployments with internal PostgreSQL, it is possible to migrate from an internal to an external PostgreSQL cluster (see the PostgreSQL Data Transfer Guide). Note that the Cloud IaC tools don’t always allow the provisioning of multi-zone external PostgreSQL clusters (as of December 2025). |

 

The official documentation also states:

 

  • SAS testing with multi-zone deployments, while thorough, should not be considered to be exhaustive. Some SAS Viya platform components might have specific HA considerations that are not addressed by these requirements.

 

Indeed… while zone failures have been thoroughly tested (with tools such as "Azure Chaos Studio"), it would be impossible to test every possible combination of failing components in a zone failure scenario since, by nature, Kubernetes schedules pod replicas across the zones in a non-deterministic way.

 

Additional considerations for CAS

 

One of the key limitations noted in the table above relates to CAS automated recovery.

 

CAS is not a standard Kubernetes Deployment object. In case of a zone failure, a manual CAS restart will very likely be required. For end users actively working with CAS, this means an outage will occur and prevent them from using CAS for a little while.

 

However, with a proper alerting mechanism and an automated restart/table-reload process in place, the impact on the RTO (Recovery Time Objective) can be limited: re-enabling CAS in the remaining zones is likely to take less time than bringing a whole failed zone back up in the Cloud.

 

But in such cases, it is important that the number of CAS nodes in the remaining zones can be scaled out to accommodate the defined number of CAS workers. If the maximum node value has not been set appropriately in the Cloud autoscaler, a manual scale-up of the nodes might be needed.

 

Another concern, or limit, is the impact on CAS performance. Remember that MPP (Massively Parallel Processing) CAS relies on multi-node communications to perform its analytics processing. While no performance tests have been run yet to evaluate the exact impact on CAS action execution times, spreading CAS workers and CAS controllers across different geographical zones (which increases network latency) is likely to increase them.

 

Multi-AZ support: current and future state

 

The addition of support for multiple availability zones for a SAS Viya deployment is a phased effort:

 

  • Phase 1: Test the SAS Viya offering across multiple availability zones (in a single region) on AKS
  • Phase 2: Validate SAS Viya Enterprise and other platform products in a managed Kubernetes cluster spread across multiple availability zones (in a single region)
  • Phase 3: Validate all SAS Viya offerings/solutions in a managed Kubernetes cluster spread across multiple availability zones (in a single region)

The SAS documentation has been updated in version 2025.10 to reflect the progress and the successful phase 2 testing activities on AKS and EKS. The target SAS Viya version for "phase 2" support of multiple AZs in Google Kubernetes Engine is currently 2025.12.

Note that the resilience of the SAS Viya platform during a zone failure event has been tested in the Phase 2 scenarios using "Chaos testing" methods to ensure each component could automatically recover in a healthy zone.

 

Conclusion

 

So while SAS now officially supports multi-availability-zone deployments of the SAS Viya platform (which is great news), the right architectural decisions still must be made. They should allow the SAS Viya platform to remain available, with limited manual effort and an acceptable RTO (Recovery Time Objective), in case of a zone failure: zone-redundant and replicated storage with a failover mechanism, HA configuration for SAS Viya pods and critical 3rd party applications, a zone-redundant external PostgreSQL, and so on.

 

At the time of writing, this support also remains limited to a subset of the SAS Viya platform offerings. However, work is in progress to test and support more offerings and solutions, and to allow more and more customers to benefit from this increased resilience of their SAS Viya deployment in the Cloud.

 

While not explicitly stated in the limitations, the level of resilience also depends on the number of nodes that are provisioned. For example, if there is only one node in the "Compute" or "System" node pool, having multiple zones in the cluster will not help: if that single Compute or System node is in a failing zone, the environment will not remain available to end users.

 

Similarly, a true High Availability configuration for several of our StatefulSet services (Consul, RabbitMQ) requires 3 replicas distributed across distinct nodes. As a consequence, if we want to keep this HA state even in case of a zone failure, we should have at least 2 nodes in each of the three zones to host these StatefulSet pods (and ensure that each replica runs in a different zone).

 

The cost of these additional nodes adds to the already high cost of the specific infrastructure deployed with high-availability options (e.g. ZRS disks, multiple NetApp volumes with cross-zone replication, a highly available Azure PostgreSQL Flexible Server, etc.).

 

Customers should be made aware, during the architecture discussions, that setting up such a multi-zone environment, capable of limiting the impact of a zone failure on the availability of the platform, requires a significant infrastructure budget.

 

Finally, in addition to the budget aspect, another concern to take into account in these discussions is the potential impact on CAS performance (discussed in the "limits" section above). "Tradeoffs" is the key word here :).

 

 

 

Find more articles from SAS Global Enablement and Learning here.
