
SAS Viya High Availability: Fault Tolerance with the 2020.1 and Later Releases

Started 06-03-2021 by
Modified 07-02-2021 by

In my previous article, I presented some considerations to ponder when architecting High Availability with SAS Viya. This article covers more details about the actual capabilities and configuration settings of SAS Viya deployments for the 2020.1 and later releases.

Fault Tolerance at multiple levels

As discussed in the previous article, fault tolerance can be provided by availability capabilities that exist at multiple levels:

  • Infrastructure (Cloud Provider): Cloud environments provide availability for the infrastructure (e.g., storage, load balancers, and the Kubernetes control plane all have a guaranteed uptime).
  • Kubernetes cluster: Kubernetes provides a basic level of availability by automatically distributing services across multiple nodes, monitoring and restarting failed pods, and routing communication only to healthy instances.
  • SAS Viya deployment: SAS Viya servers and services can be clustered to increase their availability.

Let's look in more detail at the default capabilities at each level.

 

Infrastructure

By default, cloud providers guarantee the availability of many IaaS resources at a level that, just a few years ago, would have required careful design, high cost, and complex maintenance in any on-premises data center.

 

Using a computing node (virtual machine) as an example, an individual VM underpinning an AKS cluster on Azure can guarantee an SLA of 99.9% without any special configuration (source: https://docs.microsoft.com/en-us/azure/architecture/high-availability/building-solutions-for-high-av...). This means less than 45 minutes of downtime per month!
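The downtime figure follows from simple arithmetic: a 99.9% uptime SLA allows 0.1% of the period to be unavailable. The Python sketch below computes that allowance for months of different lengths; the function name is hypothetical, and the sketch ignores how each provider actually measures SLA breaches and grants credits.

```python
def max_downtime_minutes(sla_percent: float, days: int = 30) -> float:
    """Maximum downtime, in minutes, allowed by an uptime SLA over `days` days."""
    if not 0.0 <= sla_percent <= 100.0:
        raise ValueError("SLA must be a percentage between 0 and 100")
    total_minutes = days * 24 * 60
    # The unavailable fraction of the period is (100 - SLA) percent.
    return total_minutes * (100.0 - sla_percent) / 100.0


for days in (28, 30, 31):
    print(f"99.9% over {days} days: {max_downtime_minutes(99.9, days):.1f} min")
# prints 40.3, 43.2 and 44.6 minutes respectively
```

Even in a 31-day month the allowance is about 44.6 minutes, which is where the "less than 45 minutes" figure comes from.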

 

Using cloud-provided storage for Persistent Volumes can also guarantee a default SLA level for your data and configuration files, depending on the chosen storage provider.

 

To achieve even better resilience, you can deploy your Kubernetes cluster across multiple availability zones (in the same region) to protect it from data center outages. You can spread only the managed Kubernetes control plane across zones, only the nodes, or both. Control plane availability affects your ability to control, monitor, and configure the cluster (including starting, stopping, and scaling resources); node availability affects SAS Viya availability.

 

While multi-zone clusters can improve availability for end-users, they may come with additional considerations:

 

  1. Performance – Crossing zone boundaries may introduce network latency that can slow down your applications; CAS workers are particularly affected.
  2. Cost – Network traffic between different zones may not be free.
  3. Storage architecture – Mounting Persistent Volumes (PVs) across different zones may not be possible, or may limit pod placement to a specific zone. For example, in AKS, PVs backed by Azure Disks cannot be mounted by pods running in a zone other than the one where the disk was created. In GKE, each regional PV is replicated in only 2 of the 3 zones in which a cluster can be deployed. Specialized storage such as Azure NetApp Files does not have these limitations.
  4. Resiliency – If one availability zone (AZ) is lost, do you have enough nodes in the surviving AZs to run all the pods that must be rescheduled, including CAS?
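The resiliency question can be approached with a back-of-the-envelope calculation. The Python sketch below is a hypothetical helper, not a SAS or Kubernetes tool: it assumes you know each node's allocatable CPU and the total CPU requested by all pods, and it deliberately ignores memory, per-node fragmentation, node-pool labels and taints (such as dedicated CAS nodes), and pods pinned to a zone by their storage.

```python
from dataclasses import dataclass


@dataclass
class Node:
    name: str
    zone: str
    allocatable_cpu: float  # cores available to pods on this node


def survives_zone_loss(nodes: list[Node], requested_cpu: float) -> dict[str, bool]:
    """For each zone, report whether the nodes in the *other* zones could
    still host the total CPU requested by all pods if that zone were lost."""
    result = {}
    for lost in sorted({n.zone for n in nodes}):
        surviving = sum(n.allocatable_cpu for n in nodes if n.zone != lost)
        result[lost] = surviving >= requested_cpu
    return result


# Hypothetical cluster: two zones with an uneven number of nodes.
nodes = [
    Node("node-a1", "zone-1", 15.0),
    Node("node-a2", "zone-1", 15.0),
    Node("node-b1", "zone-2", 15.0),
]
print(survives_zone_loss(nodes, requested_cpu=28.0))
# prints {'zone-1': False, 'zone-2': True}
```

Losing zone-1 leaves only 15 cores for 28 requested, so not every pod could be rescheduled; losing zone-2 leaves 30 cores, which is enough. Balancing node counts across zones is one way to make both answers `True`.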

Kubernetes

Kubernetes provides multiple artifacts and automations that affect the availability of your applications, and SAS Viya 2020.1 has been designed to take native advantage of many of them.

 

  • Kubernetes Controllers
    In previous SAS releases, if a server or service failed, there were few options to restart it:
    • Manual intervention by an administrator
    • 3rd party/OS High Availability tools
    • SAS Grid Manager High Availability management
    In Kubernetes, pods are managed by controllers. A controller continuously compares the current state of the pods it manages with the desired state and acts to bring the two closer together; this includes automatically restarting failed pods. Examples of controllers are StatefulSets and Deployments (which manage ReplicaSets). SAS Viya includes custom controllers, in the form of Operators, to manage some components such as CAS and Elasticsearch.
  • Probes
    Kubernetes can monitor the lifecycle of applications via probes. A probe can check a TCP endpoint, check an HTTP endpoint, or execute a command and validate its return code. SAS Viya pods include probes that allow Kubernetes to monitor their startup, liveness, and readiness. Kubernetes can react to probe failures, for example by automatically restarting stuck containers or by routing traffic only to pods that are ready.
  • Horizontal Pod Autoscaling
    Horizontal Pod Autoscalers (HPAs) are specialized controllers that can automatically scale pods based on dynamic conditions, such as CPU utilization. In the current release, SAS Viya can optionally use HPAs to keep 2 replicas of stateless services up and running.
  • Pod disruption budgets
    Pod disruption budgets (PDBs) limit the number of pods of a replicated application that can be down simultaneously due to voluntary disruptions. For example, when a node is drained, Kubernetes ensures that evicting its pods will not bring any service below the minimum set by its PDB. SAS Viya specifies PDBs for quorum-based applications (Consul, RabbitMQ) to ensure that the number of running replicas never drops below the minimum needed to maintain a quorum. For all other services, when HPAs are used, PDBs ensure that at least one replica is always running.
  • Soft anti-affinity for pods
    Pod definitions can include affinity and anti-affinity rules that tell Kubernetes on which nodes the pods should be scheduled. If all pods of the same service run on the same node, a failure of that node completely disrupts the service, and draining the node cannot be completed while a PDB requires at least one running replica, because no surviving instance would remain elsewhere. SAS Viya specifies soft anti-affinity for stateless and stateful services so that replicas of the same service prefer to be scheduled on separate nodes (but are not required to be, when there are no other options).
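The controller behavior described above is a reconciliation loop: observe the current state, compare it with the desired state, and act on the difference. The following toy model is a hypothetical illustration of that idea for a Deployment-style controller, where replacement pods get new names (StatefulSet pods, by contrast, keep stable identities). It is not how Kubernetes is implemented; for instance, containers that fail liveness probes are restarted in place by the kubelet rather than replaced by a controller.

```python
from dataclasses import dataclass, field
from itertools import count


@dataclass
class Pod:
    name: str
    running: bool = True  # False simulates a lost pod, e.g. its node failed


@dataclass
class ToyController:
    """Hypothetical, heavily simplified Deployment-style controller."""
    app: str
    desired_replicas: int
    pods: list[Pod] = field(default_factory=list)
    _ids: count = field(default_factory=count)  # source of new pod suffixes

    def reconcile(self) -> None:
        # Observe: forget pods that are no longer running.
        self.pods = [p for p in self.pods if p.running]
        # Act: create replacements until observed == desired.
        while len(self.pods) < self.desired_replicas:
            self.pods.append(Pod(f"{self.app}-{next(self._ids)}"))
        # Act: remove surplus pods if the desired count was lowered.
        del self.pods[self.desired_replicas:]


ctl = ToyController("example-svc", desired_replicas=3)
ctl.reconcile()               # creates example-svc-0, -1, -2
ctl.pods[1].running = False   # simulate losing example-svc-1
ctl.reconcile()               # replaces it with example-svc-3
print([p.name for p in ctl.pods])
# prints ['example-svc-0', 'example-svc-2', 'example-svc-3']
```

The loop never tracks *why* a pod disappeared; it only converges on the desired count, which is what makes this style of automation robust to many kinds of failure.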
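The difference between soft and hard anti-affinity can be illustrated with a simplified placement function. This is a hypothetical sketch: the real kube-scheduler combines many weighted scoring plugins, while here the only criterion is how many replicas of the same service a node already hosts. The point it shows is that a preference steers placement but never makes a pod unschedulable.

```python
def pick_node(nodes: dict[str, list[str]], app: str) -> str:
    """Pick a node for a new replica of `app` under a soft anti-affinity rule.

    `nodes` maps each schedulable node name to the app labels of the pods
    already running on it (a hypothetical, simplified cluster view).
    Nodes without a replica of `app` are preferred; if every node already
    has one, the pod is still placed, on the node with the fewest replicas.
    """
    if not nodes:
        raise ValueError("no schedulable nodes")
    # Score = number of same-app replicas on the node; ties broken by name.
    return min(nodes, key=lambda name: (nodes[name].count(app), name))


spread = {"node-1": ["svc-a"], "node-2": [], "node-3": ["svc-b"]}
print(pick_node(spread, "svc-a"))    # node-2: avoids the node already running svc-a

crowded = {"node-1": ["svc-a"], "node-2": ["svc-a", "svc-a"]}
print(pick_node(crowded, "svc-a"))   # node-1: no ideal node, but the pod is still scheduled
```

With a hard (required) anti-affinity rule, the second case would instead leave the new pod Pending until a suitable node became available.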

SAS Viya

SAS Viya servers and services can be clustered to increase their availability. With clustering, if a member of the cluster goes down, the surviving ones keep servicing client requests. Stateful services are configured for High Availability by default at initial deployment. By default, your SAS Viya environment starts with:

  • 3 replicas for Consul
  • 3 replicas for Postgres
  • 3 replicas for RabbitMQ
  • 2 replicas for CacheLocator
  • 2 replicas for CacheServer
  • 1 replica for everything else
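The replica counts for the quorum-based services follow from majority arithmetic, assuming (as for Consul's Raft-based servers) that a cluster of n voting members needs floor(n/2) + 1 of them to stay available. The short sketch below shows why 3 is the smallest useful size: 2 members tolerate no more failures than 1. If a PDB's minimum were set to the quorum size (an assumption for illustration; check the actual SAS Viya manifests), a 3-member cluster would allow at most one voluntary eviction at a time.

```python
def quorum(members: int) -> int:
    """Smallest majority of a cluster with `members` voting members."""
    if members < 1:
        raise ValueError("a cluster needs at least one member")
    return members // 2 + 1


def tolerated_failures(members: int) -> int:
    """Members that can be down (voluntarily or not) while a majority survives."""
    return members - quorum(members)


for n in (1, 2, 3, 5):
    print(f"{n} member(s): quorum={quorum(n)}, tolerates {tolerated_failures(n)} failure(s)")
# 1 member(s): quorum=1, tolerates 0 failure(s)
# 2 member(s): quorum=2, tolerates 0 failure(s)
# 3 member(s): quorum=2, tolerates 1 failure(s)
# 5 member(s): quorum=3, tolerates 2 failure(s)
```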

 

CAS High Availability can be enabled by deploying CAS across several nodes (MPP) with a backup controller, as discussed in a previous article.

 

An optional Kustomize transformer can be used to enable two replicas for the stateless microservices.
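The HPAs mentioned earlier follow the scaling rule documented for the Kubernetes Horizontal Pod Autoscaler: desired replicas = ceil(current replicas × current metric / target metric), clamped between the HPA's minimum and maximum, with no change when the ratio is within a tolerance (0.1 by default) of 1.0. The Python sketch below models only that rule. The minimum and maximum of 2 in the first two calls are an assumption meant to mirror the "two replicas" behavior, not values read from the SAS transformer.

```python
import math


def hpa_desired_replicas(current_replicas: int, current_metric: float,
                         target_metric: float, min_replicas: int,
                         max_replicas: int, tolerance: float = 0.1) -> int:
    """Simplified model of the Kubernetes HPA scaling rule (one metric, no
    stabilization window, no handling of missing or unready pods)."""
    if target_metric <= 0:
        raise ValueError("target_metric must be positive")
    ratio = current_metric / target_metric
    if abs(ratio - 1.0) <= tolerance:
        desired = current_replicas  # close enough to target: no scaling
    else:
        desired = math.ceil(current_replicas * ratio)
    # The HPA never goes outside its configured bounds.
    return max(min_replicas, min(max_replicas, desired))


# Assumed min = max = 2: the replica count stays at 2 whatever the CPU load.
print(hpa_desired_replicas(2, current_metric=10, target_metric=80, min_replicas=2, max_replicas=2))
print(hpa_desired_replicas(2, current_metric=160, target_metric=80, min_replicas=2, max_replicas=2))
# With a higher (hypothetical) maximum, the same load would scale out to 4.
print(hpa_desired_replicas(2, current_metric=160, target_metric=80, min_replicas=2, max_replicas=5))
```

Pinning the minimum and maximum to the same value turns the autoscaler into a guard that keeps a fixed replica count, which is the availability-oriented behavior described above rather than load-driven scaling.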

 

In a similar way, another Kustomize transformer can configure a highly-available cluster for Open Distro for Elasticsearch, as described in this article.

 

Do not forget that your environment depends on additional software that may be running in other namespaces of the same cluster as SAS Viya. For example, cert-manager, the NGINX Ingress Controller, and the SAS Viya Monitoring solution for Kubernetes may be critical to the availability of SAS Viya, yet may have been deployed by default with a single replica, making them less highly available than SAS Viya itself. To increase their availability, consult their respective documentation.

 

Conclusion

SAS Viya integration with cloud platforms is not a simple lift and shift of previous technologies. By taking advantage of the capabilities provided at multiple levels by infrastructure services, platform components, and deployed applications, you can give end users a better experience while managing costs and complexity.

 

Find more articles from SAS Global Enablement and Learning here.

Comments

Hi @EdoardoRiva ,

 

this is a very nice overview, thank you!

 

Do you have plans to discuss and elaborate on the scalability available in Viya 2020.x, and on the road map as well? I am very much looking forward to this topic!

 

Thank you in advance!

Best regards,

Juan
