
Keeping Analytics Available: SAS Viya now supports Multi-AZ Deployment!


The 2025.10 version of SAS Viya comes with an exciting new feature that will make many customers very happy! 😊

 

Indeed, starting with this version, you can deploy the SAS Viya platform into a cluster that spans multiple availability zones (AZ) in M... and remain fully supported!

 

After many customer requests to support and leverage a SAS Viya deployment across multiple Availability Zones, this became a product management priority in recent months.

 

While the support of multiple AZ is now available, it is still limited to a selection of offerings and specific Cloud platforms. 

 

This post provides an overview of the requirements and limits of such a deployment, as well as some additional considerations and an update on the current Multi-AZ support scope (as of December 2025).

 

 

 

Availability Zones in the Cloud

 

But before we dive into the SAS Viya requirements and limitations, let’s quickly review what this concept of Availability Zones (aka "AZ") means in the Cloud.

 

Public Cloud providers (such as Azure, AWS and GCP) organize their resources hierarchically: by region, then zone, then data center.

 

In the Cloud, "zones" correspond to distinct physical locations. Each zone corresponds to a separate group of data centers within a region. As an example, in Azure, the West US 2 region contains three different zones.

 

01_RP_multi-az-regions-diagram-1024x578.png


 

Availability zones are typically separated by several kilometers, and usually within 100 kilometers. So they're close enough to have low-latency connections to other availability zones through a high-performance network. However, they're far enough apart to reduce the likelihood that more than one will be affected by local outages or weather.

 

The benefit of deploying an application across multiple AZs is improved availability, even in the case of a major disaster that would impact a whole zone (earthquake, flood, cyber-attack, etc.).

 

When you create a managed Kubernetes cluster in the public Cloud, you choose the region, and you can also choose one or more zones (if the region supports availability zones).
By default, the AKS control plane is already zone resilient (there is nothing to configure for that). But as explained in the Azure documentation:

 

"control plane resiliency isn't sufficient for your cluster to remain operational when a zone fails. For the system node pool and any user node pools that you deploy, you must enable availability zone support to help ensure that your workloads are resilient to availability zone failures."

 

For example, if you want a multi-AZ cluster and you are using the SAS IaC project for Azure to provision your AKS cluster, you need to set the default_nodepool_availability_zones and node_pools_availability_zone Terraform variables to define which availability zones are associated with your default and user node pools.
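As a sketch, the corresponding entries in a terraform.tfvars file might look like the following. The variable names are the ones referenced in this post; their exact types and defaults may vary across versions of the IaC project, so verify them against your copy before use:

```hcl
# Illustrative terraform.tfvars fragment for a multi-AZ AKS cluster
# (variable names as referenced in this post; zone values are examples).

# Spread the default (system) node pool across three availability zones:
default_nodepool_availability_zones = ["1", "2", "3"]

# Zone placement for the user node pools; adjust so that the user node
# pools land in the zones your workload layout requires:
node_pools_availability_zone = "1"
```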

 

 

Requirements…

 

Now, if you decide to deploy SAS Viya in a Cloud managed Kubernetes service where multiple availability zones are enabled, there are some requirements and limits to be aware of. They are now officially documented in the SAS Viya Platform Operations Guide.

 

 

Zone-redundant storage

 

The first requirement is at the storage level.

 

If you deploy the SAS Viya platform across multiple availability zones, some pods hosting SAS Viya platform components may be rescheduled (as part of the normal pod lifecycle) and land in a different zone from the one where the persistent storage was initially provisioned and attached.

 

So we need to provision "zone-redundant storage" (aka "ZRS") for all Persistent Volumes created by the SAS Viya deployment in order to ensure data availability across zones.

 

As an example, using a storage class based on Single Virtual Machine with an NFS Server (made available through an NFS provisioner) for the SAS Viya platform Persistent Volumes does NOT provide "Zone-redundant" storage.

 

So for Viya volumes that require "RWX" ("Read Write Many") access, the "Single NFS Server VM" option (which is the "standard" storage type with the SAS-provided IaC tools for Azure, AWS and GCP) does NOT meet the multi-AZ requirements: the VM is located in a single zone, and in case of a zone failure, the storage of the Viya platform would not be accessible.

 

Instead, managed storage with multi-zone support (such as Azure NetApp Files or Amazon FSx for NetApp ONTAP) should be used, and properly configured for HA too.

 

For Viya volumes that require "RWO" ("Read Write Once") access, block storage is recommended and typically used. However, we must ensure that the Cloud storage class used for block storage is zone-redundant. As an example, on AKS the "managed-csi" and "managed-csi-premium" RWO storage classes meet the requirement when they use Azure zone-redundant (ZRS) SKUs to create managed disks. Amazon Elastic Block Store is not zone-redundant; look to Amazon FSx for NetApp ONTAP or Amazon Elastic File System instead.
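To make the ZRS requirement concrete, here is a minimal StorageClass sketch for zone-redundant premium managed disks on AKS, using the Azure disk CSI driver. The class name is hypothetical; the skuName parameter value is the Azure disk CSI setting for zone-redundant premium SSDs:

```yaml
# Illustrative StorageClass: zone-redundant (ZRS) Azure managed disks
# for RWO volumes. The class name is hypothetical.
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: managed-premium-zrs          # hypothetical name
provisioner: disk.csi.azure.com
parameters:
  skuName: Premium_ZRS               # zone-redundant premium SSD
reclaimPolicy: Delete
volumeBindingMode: WaitForFirstConsumer
allowVolumeExpansion: true
```

With ZRS disks, a pod rescheduled into another zone can still attach its persistent volume, which is exactly the scenario described above.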

 

 

Configure SAS Viya (and critical 3rd party tools) for HA

 

The second requirement is to configure High Availability (aka "HA") in the SAS Viya platform deployment as instructed in Configure High Availability in SAS Viya Platform: Deployment Guide.

 

The High Availability configuration essentially applies a specific PatchTransformer that creates two replicas for each stateless microservice. Most of the stateful platform components (Consul, RabbitMQ, Redis) are already deployed in HA mode by default, with two or three replicas.
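As a sketch, enabling the SAS-provided HA transformer amounts to referencing it from the transformers block of your kustomization.yaml. The sas-bases path below follows the pattern from the deployment guide; verify it against the sas-bases of your cadence:

```yaml
# kustomization.yaml fragment (sketch): enable the SAS HA PatchTransformer.
# Path taken from the deployment guide pattern; check your sas-bases version.
transformers:
  - sas-bases/overlays/scaling/ha/enable-ha-transformer.yaml
```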

 

Others like CAS and OpenSearch are not deployed in HA by default and should be specifically configured for it.

 

Also note that, to truly make the SAS Viya platform more resilient, some critical 3rd party software components that the platform relies on (ingress-nginx, the NFS provisioner, and potentially SAS Viya monitoring) also need to be configured for HA.

 

If you install these 3rd party applications with the DaC ("Deployment as Code") project's baseline role, the tool does not currently allow you to deploy the ingress controller or the NFS CSI driver controller with multiple replicas. You need to do it manually.

 

 

Pod Topology Spread Constraints

 

While not explicitly listed in the officially documented requirements, there are also some considerations regarding the distribution of pod replicas across the availability zones.

 

It is not enough to have multiple replicas of the pods. We also want to make sure they are spread across distinct zones, so that if a whole zone fails, we know all our replicas were not sitting in that zone!

 

While you can tweak topology spread constraints and set them individually at the pod level, Kubernetes already has built-in cluster-level default constraints in place.

 

If you don’t change anything, the kube-scheduler acts as if you had configured the topology constraints as below:

 

02_RP_TopologySpreadDefaultConstraints-1024x349.png

Source: https://kubernetes.io/docs/concepts/scheduling-eviction/topology-spread-constraints/#internal-defaul...

 

Where:

 

  • maxSkew: the maximum allowed difference in the number of matching pods between any two topology domains. Understanding the maxSkew parameter is not trivial (see the explanations in the Kubernetes documentation), but basically it describes the maximum degree to which pods can be unevenly distributed. Setting it to "1" ensures a near-even spread.
  • topologyKey: for zone spreading, this must be set to the zone label: topology.kubernetes.io/zone.
  • whenUnsatisfiable specifies what action should be taken when maxSkew can't be satisfied:
    • DoNotSchedule (default) tells the scheduler not to schedule the pod. It's a "hard constraint" (the incoming pod remains in PENDING state if the scheduler can’t satisfy the constraint).
    • ScheduleAnyway tells the scheduler to still schedule the pod while prioritizing nodes that reduce the skew. It's a "soft constraint".
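Putting the fields above together, a pod-level override requesting a strict, near-even zone spread could be sketched like this (the app label and its value are hypothetical):

```yaml
# Illustrative pod-spec fragment: hard constraint forcing a near-even
# spread of matching pods across availability zones (maxSkew: 1).
topologySpreadConstraints:
  - maxSkew: 1
    topologyKey: topology.kubernetes.io/zone
    whenUnsatisfiable: DoNotSchedule
    labelSelector:
      matchLabels:
        app: my-app          # hypothetical pod label
```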

 

As we can see, it relies on the fact that the Kubernetes nodes have both kubernetes.io/hostname and topology.kubernetes.io/zone labels set (which is the case by default in the SAS Viya supported Cloud Managed Kubernetes platforms).

 

While a maxSkew value of "1" is recommended for the most even distribution of pods across the AZs, the default Kubernetes settings appear to be adequate for the distribution of SAS Viya pods.

 

In repeated tests run by SAS, the default pod topology spread constraints distributed pods across zones at deployment time without additional configuration. Even though the default maxSkew is higher than "1", multiple tests and stop/start operations resulted in all StatefulSets and Deployments (configured with multiple pods) being evenly distributed. This explains why there are currently no specific pod placement recommendations in the official documentation.

 

Finally, note that Cloud providers usually don’t let you change this cluster-level default configuration. Instead, they recommend overriding it by specifying the topologySpreadConstraints field at the pod level if you require different maxSkew settings, for example (that’s the case for AKS and EKS).

 

 

...and limits for SAS Viya

 

As of the end of 2025, several limitations remain in various areas of SAS support for deployments across multiple availability zones.

 

The table below summarizes these limitations.

 

| Area | Support Limitations | Comments/Notes |
|---|---|---|
| Cloud providers | Only Azure and AWS. | Work is in progress to add support for SAS Viya deployments in GCP (target: stable 2025.12). |
| SAS Viya offerings | Only for SAS Viya Analytics, SAS Visual Statistics, SAS Viya, SAS Viya Advanced, SAS Viya Enterprise, and Visual Investigator. | |
| Automated recovery from a zone failure | CAS and Compute jobs require manual intervention. | In case of a zone failure, CAS may need to be restarted manually. Jobs that use SAS/CONNECT fail, or keep waiting and never reach a failed state, after experiencing a failover in multi-availability-zone environments; these jobs need to be restarted manually. In general, SAS Compute jobs may also need to be restarted manually if they were running on a node located in the failed zone. |
| Multi-region | All availability zones must be contained within a single region. | SAS Viya deployments across multiple regions are not supported. |
| PostgreSQL server (SAS Infrastructure Data Server) | An internal SAS PostgreSQL server is not supported; a zone-redundant database for SAS Infrastructure Data Server is required. | Azure Database for PostgreSQL, Amazon RDS for PostgreSQL, and Cloud SQL for PostgreSQL support HA across multiple availability zones. For existing deployments with internal PostgreSQL, it is possible to migrate from an internal to an external PostgreSQL cluster (see the PostgreSQL Data Transfer Guide). Note that the Cloud IaC tools don’t always allow the provisioning of multi-zone external PostgreSQL clusters (as of December 2025). |

 

The official documentation also states:

 

  • SAS testing with multi-zone deployments, while thorough, should not be considered to be exhaustive. Some SAS Viya platform components might have specific HA considerations that are not addressed by these requirements.

 

Indeed… while zone failures have been thoroughly tested (with tools such as "Azure Chaos Studio"), it would be impossible to test every possible combination of failing components in a zone failure scenario since, by nature, Kubernetes schedules pod replicas across the zones in a non-deterministic way.

 

Additional considerations for CAS

 

One of the key limitations noted in the table above relates to CAS automated recovery.

 

CAS is not a standard Kubernetes Deployment object. In case of a zone failure, a manual CAS restart will very likely be required. For end users actively working with CAS, this means an outage will occur and prevent them from using CAS for a little while.

 

However, with a proper alerting mechanism and an automated restart/table-reload process in place, the impact on the RTO (Recovery Time Objective) can be limited: re-enabling CAS in the remaining zones is likely to take less time than bringing a whole failed zone back up in the Cloud.

 

But in such cases, it is important that the number of CAS nodes in the remaining zones can be scaled out to accommodate the defined number of CAS workers. If the maximum node value has not been set appropriately in the Cloud autoscaler, a manual scale-up of the nodes might be needed.

 

Another concern, or limit, is the impact on CAS performance. Remember that MPP (Massively Parallel Processing) CAS relies on multi-node communications to perform its analytics processing. While no performance tests have been run yet to evaluate the exact impact on CAS action execution times, spreading CAS workers and CAS controllers across different geographical zones (which increases network latency) is likely to increase them.

 

Multi-AZ support: current and future state

 

The addition of support for multiple availability zones for a SAS Viya deployment is a phased effort:

 

  • Phase 1: Test the SAS Viya offering across multiple availability zones (in a single region) on AKS
  • Phase 2: Validate SAS Viya Enterprise and other platform products in a managed Kubernetes cluster spread across multiple availability zones (in a single region)
  • Phase 3: Validate all SAS Viya offerings/solutions in a managed Kubernetes cluster spread across multiple availability zones (in a single region)

The SAS documentation has been updated in version 2025.10 to reflect the progress and the successful phase 2 testing activities on AKS and EKS. The target SAS Viya version for "phase 2" support of multiple AZs in Google Kubernetes Engine is currently 2025.12.

Note that the resilience of the SAS Viya platform during a zone failure event has been tested in the Phase 2 scenarios using "Chaos testing" methods to ensure each component could automatically recover in a healthy zone.

 

Conclusion

 

So while SAS now officially supports multi-availability-zone deployments of the SAS Viya platform (which is great news), the right architectural decisions still must be made. They should allow the SAS Viya platform to remain available, with limited manual effort and an acceptable RTO (Recovery Time Objective), in case of a zone failure: zone-redundant and replicated storage with a failover mechanism, HA configuration for SAS Viya pods and critical 3rd party applications, a zone-redundant external PostgreSQL, and so on.

 

At the time of writing, this support also remains limited to a subset of the SAS Viya platform offerings. However, work is in progress to test and support more offerings and solutions, and to allow more and more customers to benefit from this increased resilience of their SAS Viya deployment in the Cloud.

 

While not explicitly stated in the limitations, the level of resilience also depends on the number of nodes that are provisioned. For example, if there is only one node in the "Compute" or "System" node pool, having multiple zones in the cluster will not help: if that single Compute or System node is in a failing zone, the environment will not remain available to end users.

 

Similarly, a true High Availability configuration for several of our StatefulSet services (Consul, RabbitMQ) requires 3 replicas distributed across distinct nodes. As a consequence, if we want to keep this HA state even in case of a zone failure, we should have at least 2 nodes in each of the three zones to host these StatefulSet pods (and ensure that each replica runs in a different zone).

 

The cost of these additional nodes adds to the already high cost of the specific infrastructure deployed with high-availability options (e.g. ZRS disks, multiple NetApp volumes with cross-zone replication, a highly available Azure PostgreSQL Flexible Server, etc.).

 

Customers should be made aware, during the architecture discussions, that setting up such a multi-zone environment, capable of limiting the impact of a zone failure on the availability of the platform, requires a significant infrastructure budget.

 

Finally, in addition to the budget aspect, another concern to take into account in these discussions is the potential impact on CAS performance (discussed in the "limits" section above). "Tradeoffs" is the key word here :).

 

 

 

Find more articles from SAS Global Enablement and Learning here.
