SAS Viya High Availability: Finding the Right Questions

One of the enterprise capabilities that our customers often request during the architecture planning of a new implementation is High Availability. Many times, this comes in the form of a technical request: “Can I cluster SAS Viya?” or “How can I create multiple instances of service X?”. Occasionally, the question focuses on the consequences of a possible failure: “What happens to the data loaded in memory in case of failure?”.

Sometimes, asking the right question is more difficult than articulating a satisfying answer. If the question already hints at a pre-conceived solution, any explanation that deviates from the anticipated line may seem at least partially incomplete.

When discussing a broad topic such as High Availability, it’s easy to stick to concepts and solutions that proved valuable in past implementations. These may not be the best fit when moving to a newer, cloud-native platform such as SAS Viya, where multiple infrastructure layers provide different capabilities, all working together to increase the system’s overall resilience to failures.

Technical Capabilities and Business Requirements

What capabilities does the system have? How are they implemented? When confronted with these questions, I tend to ask back: What business requirement are you trying to address?

Yes, if you are a SAS Administrator tasked with keeping users happy, it’s important to know how SAS Viya works and how its services are configured and managed. But if you are an Architect designing the environment, it’s more important to focus on business requirements and how to address them.

Let’s see how this translates when discussing SAS Viya availability.

Capabilities

These are technical capabilities of a system such as a SAS Viya deployment:

Availability. You can deploy multiple instances of a server or service so that, in case one fails, the others are still available to provide service. This is a basic capability, often provided by clustering.

Failover. This is a step further than availability: in case of failure, the services that were provided by the failed instance are moved over to the surviving ones. This capability requires additional techniques, such as session replication, using locks to synchronize transactions, etc.

Non-functional Requirements

Implementing capabilities such as the ones just highlighted can satisfy multiple business objectives:

Fault tolerance (FT). The overall system can tolerate the failure of one or multiple components without disruption of services provided to end-users. Availability and failover can be tools to provide fault tolerance. Building on SAS 9 experience, fault tolerance has been a key goal for SAS Viya since the beginning: SAS is used for enterprise-class systems and customers expect a reliable system.

Non-Disruptive Updates (NDU). Software updates can be applied online and "during working hours"; active users of the system do not perceive any outage. An administrator should be able to perform the following operations without impacting live users of the system:
- Change configuration
- Apply patches and fixes
- Upgrade the software to a supported release
- Install add-ons and changes in products
This is a new and more complex objective in SAS Viya 2020.1 and later; although many services satisfy this requirement, a few still require some user disruption (such as a stop and restart).

Disaster Recovery (DR). This is the ability to re-establish service, within an acceptable timeframe, if a major disruption occurs to the facility hosting the environment. Service (and data) can be restored using a separate environment. Although often confused with High Availability, DR is a different business requirement, addressed by implementing proven business practices on top of software capabilities.

Finding the right question

How are users impacted by a system failure? This is probably the most relevant question. In fact, by moving the focus from capabilities to business requirements, it becomes natural to assess how users are affected. We can evaluate the business impact with a classification like the following:

Level	Description
None	No customer impacts.
Small	A failure could cause some user disruption, but the system will recover on its own in a few minutes.
Medium	A failure will cause user disruption and the system will not recover on its own. User intervention is required (e.g. users need to log back on). Users may need to recreate work in progress.
Large	A failure will cause user disruption and the system will not recover on its own. An administrator has to act to re-establish a functional environment for end-users.
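
The classification above can be read as a small decision rule. Here is a minimal Python sketch of that reading; the names `Impact` and `classify_impact`, and the three yes/no inputs, are hypothetical and only illustrate the table, they are not part of any SAS Viya tooling.

```python
from enum import Enum


class Impact(Enum):
    """Business impact levels, as listed in the table above."""
    NONE = "None"
    SMALL = "Small"
    MEDIUM = "Medium"
    LARGE = "Large"


def classify_impact(users_disrupted: bool,
                    recovers_on_its_own: bool,
                    admin_action_required: bool) -> Impact:
    """Map the observable consequences of a failure to an impact level.

    Hypothetical helper: the three flags are an interpretation of the
    table's descriptions, not an official classification algorithm.
    """
    if not users_disrupted:
        return Impact.NONE
    if recovers_on_its_own:
        # Some disruption, but the system heals itself in a few minutes.
        return Impact.SMALL
    if admin_action_required:
        # Only an administrator can re-establish a functional environment.
        return Impact.LARGE
    # No self-recovery, but users can fix it themselves (e.g. log back on).
    return Impact.MEDIUM


# Example: a failure that logs users out, with no administrator involved.
print(classify_impact(users_disrupted=True,
                      recovers_on_its_own=False,
                      admin_action_required=False).value)
```

Walking each failure scenario through a rule like this keeps the discussion on what users experience rather than on which component failed.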

SAS Viya Architecture and Deployment Choices

After assessing the acceptable level of user disruption in case of failure, you can focus on the actual High Availability configuration required to satisfy it. SAS Viya balances costs and agility with a cloud-native platform and built-in automation. In fact, availability capabilities exist at multiple levels:

Infrastructure (Cloud Provider). Cloud environments provide availability at the infrastructure level (for example, storage, load balancers, and the Kubernetes control plane all have a guaranteed uptime).
Kubernetes cluster. Kubernetes provides availability by automatically distributing services across multiple nodes, monitoring and restarting failed pods, and routing communication only to healthy instances.
SAS Viya deployment. Viya servers and services can be clustered to increase their availability.
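
To get a feel for how these levels combine, here is a rough back-of-the-envelope sketch in Python. It assumes that failures are independent, which is rarely strictly true in practice, and every number in it is illustrative rather than an actual SLA or a SAS Viya measurement. The helper names `replicated` and `stacked` are hypothetical.

```python
from math import prod


def replicated(availability: float, replicas: int) -> float:
    """Availability of a service that is up while at least one replica is up.

    Simplifying assumption: replicas fail independently of each other.
    """
    if replicas < 1:
        raise ValueError("replicas must be at least 1")
    return 1 - (1 - availability) ** replicas


def stacked(layer_availabilities: list[float]) -> float:
    """Availability of a stack in which every layer must be up."""
    return prod(layer_availabilities)


# Illustrative figures only, not real SLAs.
infrastructure = 0.9995
kubernetes = 0.999
viya_service = replicated(0.99, replicas=2)

overall = stacked([infrastructure, kubernetes, viya_service])
print(f"Estimated overall availability: {overall:.4%}")
```

Under these assumptions, adding replicas raises the availability of a single layer, but the stack as a whole can never be more available than its weakest layer.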

You should consider all levels, as they build on each other. They all influence the assessment of how failures can impact users:

What are the SAS Viya availability requirements, such as the Recovery Time Objective (RTO) and Recovery Point Objective (RPO)? In other words, how much downtime is acceptable (minutes, hours)? Maybe the SLA provided by the infrastructure is enough to satisfy these business requirements.
What is the desired balance between cost, performance, and availability (pick two)? Depending on the availability needs of your specific environment, the cost and design complexity will vary.
Is the Kubernetes cluster itself highly available? Is it deployed across multiple zones?
How many pod replicas should be used? Is the default SAS Viya configuration enough?
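
The first of these questions is partly simple arithmetic: an uptime percentage translates directly into a downtime budget. Here is a minimal Python sketch, assuming a 30-day measurement period and an illustrative 99.9% uptime figure (neither comes from an actual provider SLA). The function names are hypothetical, and the sketch covers downtime only, not data loss (RPO).

```python
def downtime_budget_minutes(uptime_percent: float,
                            period_days: float = 30.0) -> float:
    """Minutes of downtime allowed by an uptime percentage over a period."""
    if not 0.0 <= uptime_percent <= 100.0:
        raise ValueError("uptime_percent must be between 0 and 100")
    period_minutes = period_days * 24 * 60
    return period_minutes * (100.0 - uptime_percent) / 100.0


def sla_is_enough(uptime_percent: float,
                  acceptable_downtime_minutes: float,
                  period_days: float = 30.0) -> bool:
    """True if the SLA's downtime budget fits within what the business accepts."""
    budget = downtime_budget_minutes(uptime_percent, period_days)
    return budget <= acceptable_downtime_minutes


# Illustrative: can a 99.9% uptime SLA satisfy "at most one hour per month"?
budget = downtime_budget_minutes(99.9)
print(f"Downtime budget: {budget:.1f} minutes per 30 days")
print("SLA is enough:", sla_is_enough(99.9, acceptable_downtime_minutes=60))
```

If the infrastructure SLA alone already fits the acceptable downtime, additional clustering may add cost and complexity without a matching business benefit.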

Conclusion

In this article, we have seen the importance of having the right perspective when designing a highly available SAS Viya environment. Before diving into technical details about how services can be configured, consider the user and business requirements, the capabilities available at all levels (Cloud Infrastructure, Kubernetes Platform, SAS Viya), and how they all relate. Maybe you will find that a default deployment already satisfies all your needs!

In the next article, you will find the levers and knobs at your disposal, across these multiple levels, to influence SAS Viya resilience to failures.

Find more articles from SAS Global Enablement and Learning here.