Break Glass in Case of Fail…over? (SAS 9.4 Grid Environments)

High availability has long been a hot topic in computing, whether it concerns cloud or on-premises resources. Companies and users alike expect modern environments, products, and services to be resilient against outages, and so we in the industry strive for high uptime. This problem can be addressed in several ways, usually more than one at a time, and solutions range from monitoring, alerts, and disaster recovery to load balancing, multi-region deployments, and containerization. In this post I will talk about a redundancy-based high availability concept, failover, and how it’s accomplished on both SAS 9.4 Grid Manager and SAS 9.4 Grid Manager for Platform. To begin, however, I want to set the stage by discussing failover as it pertains to SAS Viya.

First, what’s Failover and how does it function in SAS Viya?

Failover is a concept that ensures no, or at least limited, service interruptions. For a given system, server, component, or service, this is accomplished by keeping redundant copies on standby that we can fall back on if the original goes down.

Since SAS Viya is cloud-native, containerized, and Kubernetes-based, you may take for granted the high availability it provides. Kubernetes helps here because containerized services are distributed across the nodes in a cluster, so failover occurs at the service level. If a particular service goes down, its associated pods can simply be restarted by Kubernetes. You can also use components such as ReplicaSets, which ensure a specified number of redundant pods are kept running to limit interruptions when one or more of those pods fail.
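To make the "keep a specified number of pods running" idea concrete, here is a minimal toy sketch in Python. It is not Kubernetes code and does not call any Kubernetes API; the class name `ToyReplicaSet` and the pod names are made up for illustration. It only shows the reconcile idea: compare the running count with the desired count and start replacements for whatever is missing.

```python
from dataclasses import dataclass, field
from itertools import count


@dataclass
class ToyReplicaSet:
    """Toy model (hypothetical) of a controller that keeps N pods running."""
    desired: int
    pods: set = field(default_factory=set)
    _ids: count = field(default_factory=count, repr=False)

    def reconcile(self):
        """Start replacement pods until the running count matches `desired`."""
        started = []
        while len(self.pods) < self.desired:
            name = f"pod-{next(self._ids)}"
            self.pods.add(name)
            started.append(name)
        return started

    def fail(self, name):
        """Simulate a pod crash by removing it from the running set."""
        self.pods.discard(name)


rs = ToyReplicaSet(desired=3)
rs.reconcile()                  # initial start: pod-0, pod-1, pod-2
rs.fail("pod-1")                # one pod crashes
replacements = rs.reconcile()   # the gap is noticed and one new pod is started
```

After the second `reconcile()` call, three pods are running again, and the failed pod has been replaced rather than repaired.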

SAS Grid Manager and SAS Grid Manager for Platform, the two SAS 9 Grid offerings, are not built on Kubernetes. Still, they have failover mechanisms and even come with a few that require no additional configuration beyond the initial deployment of the software! The rest of this post will explore how failover is achieved on SAS 9.4 Grid Environments.

Failover with SAS Grid Manager

Upon initial deployment of SAS Grid Manager, a basic failover scheme for the control tier is put in place. The scheme is as follows:

As part of deployment, you will have selected one server host as the “SAS Workload Orchestrator Master Host”, and other grid nodes can be selected to act as “SAS Workload Orchestrator Master Candidates”. The SAS Workload Orchestrator – or SWO – Master Host controls job scheduling, and master candidates are grid nodes that process jobs like any other grid nodes. If the master host goes down or is unreachable, however, a master candidate from the pool of candidates is appointed to be the new SWO Master Host. Any traffic, such as incoming jobs that need to be sorted into queues (the job of the SWO Master Host), will be redirected to this newly appointed SWO Master Host.

Master candidates are the failover, or fallback, option. They may act as ordinary grid nodes, processing jobs sent by the SWO Master Host, until one of them is promoted to be the new SWO Master Host in the event of a failure. This means that master candidate hosts don’t have to be idle, redundant machines, and can instead be put to consistent use no matter what situation the grid is in. This is a great out-of-the-box feature of SAS Grid Manager, and it can be further enhanced during post-configuration with automatic client discovery using the SAS Web Server as a proxy. More detail on this enhancement, as well as a more detailed explanation of how SAS Workload Orchestrator failover works, can be found in this excellent post by Rob Collum for SAS 9.4M6: SAS Workload Orchestrator Options for Availability and Encryption
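The promotion logic described above can be sketched in a few lines of Python. This is a simplified illustration, not SAS Workload Orchestrator code: the function `elect_master`, the host names, and the "first reachable candidate wins" rule are all assumptions made for the example.

```python
def elect_master(current_master, candidates, is_reachable):
    """Return the host that should act as master.

    Keeps the current master while it is reachable; otherwise promotes
    the first reachable candidate. The ordering rule is an assumption
    for this sketch, not documented product behavior.
    """
    if is_reachable(current_master):
        return current_master
    for host in candidates:
        if is_reachable(host):
            return host
    raise RuntimeError("no reachable master candidate")


# Hypothetical host names; grid1 is the original master and has gone down.
up = {"grid1": False, "grid2": True, "grid3": True}
new_master = elect_master("grid1", ["grid2", "grid3"], lambda h: up[h])
```

Here `grid2` becomes the new master, while `grid3` carries on as an ordinary grid node that remains eligible for promotion later.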

It’s possible to configure failover beyond the control tier. In the compute tier we have out-of-the-box failover due to how SAS Workload Orchestrator operates. If a grid node fails, jobs running on it will terminate. Jobs terminated this way become pending jobs and are rescheduled to a surviving grid node by SAS Workload Orchestrator.
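As a rough illustration of that requeue behavior, the toy scheduler below moves jobs from a failed node back to a pending queue and dispatches them to the surviving nodes. The class `ToyScheduler`, the node and job names, and the least-loaded placement rule are assumptions for this sketch; they are not how SAS Workload Orchestrator is implemented.

```python
from collections import deque


class ToyScheduler:
    """Toy model (hypothetical) of rescheduling jobs after a node failure."""

    def __init__(self, nodes):
        self.nodes = set(nodes)   # surviving grid nodes
        self.pending = deque()    # jobs waiting for a node
        self.running = {}         # job name -> node name

    def _load(self, node):
        return sum(1 for n in self.running.values() if n == node)

    def dispatch(self):
        """Place each pending job on the least-loaded surviving node."""
        while self.pending and self.nodes:
            job = self.pending.popleft()
            node = min(sorted(self.nodes), key=self._load)
            self.running[job] = node

    def submit(self, job):
        self.pending.append(job)
        self.dispatch()

    def node_failed(self, node):
        """Jobs on the failed node terminate, become pending, and are rescheduled."""
        self.nodes.discard(node)
        lost = [job for job, n in self.running.items() if n == node]
        for job in lost:
            del self.running[job]
            self.pending.append(job)
        self.dispatch()


sched = ToyScheduler(["nodeA", "nodeB"])
for job in ("job1", "job2", "job3"):
    sched.submit(job)
sched.node_failed("nodeA")   # job1 and job3 were on nodeA and get rescheduled
```

After `node_failed("nodeA")`, all three jobs are running on `nodeB` and the pending queue is empty.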

SAS Metadata Servers can be clustered with active and standby instances, allowing standby instances to take over if the active SAS Metadata Server goes down.

In the middle tier, SAS Web Application Servers can be clustered behind a load balancer. In this case, when a server fails, traffic is routed to the surviving servers that can accept it (for example, an authentication request to sign in to SAS Studio that would have gone to a failed SAS Web Application Server is instead routed by the load balancer to another server in the cluster).
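A minimal sketch of that routing behavior, assuming a simple round-robin policy with health flags, might look like the following. The class `ToyLoadBalancer` and the server names are hypothetical; a real load balancer would detect failures through its own health checks rather than a manual `mark_down` call.

```python
class ToyLoadBalancer:
    """Toy round-robin balancer (hypothetical) that skips failed servers."""

    def __init__(self, servers):
        self.servers = list(servers)
        self.healthy = set(servers)
        self._next = 0

    def mark_down(self, server):
        """Record that a server has failed and should receive no traffic."""
        self.healthy.discard(server)

    def mark_up(self, server):
        """Return a recovered server to the rotation."""
        if server in self.servers:
            self.healthy.add(server)

    def route(self):
        """Return the next healthy server in round-robin order."""
        for _ in range(len(self.servers)):
            server = self.servers[self._next % len(self.servers)]
            self._next += 1
            if server in self.healthy:
                return server
        raise RuntimeError("no healthy servers available")


lb = ToyLoadBalancer(["web1", "web2", "web3"])
lb.mark_down("web2")                        # web2 has failed
targets = [lb.route() for _ in range(4)]    # requests only reach web1 and web3
```

With `web2` marked down, the four requests alternate between `web1` and `web3`, so a sign-in request never lands on the failed server.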

A shared file system, which is important in a SAS Grid Manager environment to allow cooperation between nodes in the grid cluster, can be configured with high availability. High availability is delivered by giving multiple servers access to the same storage, and if one fails, another takes over as the surviving node to share the hosted files. A shared file system set up in this way is critical for maintaining high availability in other aspects of the cluster. A redundant host filling the role of a failed host is all well and good, but it’s almost pointless if the next host can’t access files in the cluster because the original host went down. A highly available shared storage scheme fixes this issue by removing that single point of failure observed when just one host shares files that other hosts access.

Failover with SAS Grid Manager for Platform

Upon initial deployment of SAS Grid Manager for Platform, specifically during Platform LSF configuration steps, a basic failover scheme for the control tier is put in place. The scheme is as follows:

You may find this picture familiar... You might even be thinking it’s a lazy copy of something you saw a few minutes ago. My artistic choices aside, I’ve talked before about how SAS Grid Manager and SAS Grid Manager for Platform are similar offerings! As such, they both have failover capabilities that result in the same outcomes.

In SAS Grid Manager for Platform the out-of-the-box failover scheme is handled by Platform LSF’s internal high availability mechanism. LSF provides a primary master daemon and one or more backup master daemons. If the primary fails, a backup daemon takes over.

The primary master daemon schedules jobs, monitors resources, and ensures cluster coordination. Backup master daemons are passive programs that maintain cluster state awareness in order to take over if the primary master daemon fails. The backup daemons themselves do not do much except monitor cluster health and wait to be promoted to master, so they have an incredibly light workload and their resource utilization is negligible. Execution daemons handle the processing of jobs that are dispatched to them by the primary master daemon. Execution daemons may run on the same hosts that run the primary master daemon or the backup master daemons, or on hosts that run neither.

SAS Grid Manager for Platform’s execution hosts, metadata tier, middle tier, and shared file system can be made highly available in much the same way that SAS Grid Manager’s can.

The Platform Process Manager, a workflow and scheduling manager in the Platform Suite for SAS, can be made highly available in the same way as the control tier’s primary master daemon. Backup Platform Process Manager instances provide redundancy across the cluster and can take over the work of a failed primary Process Manager instance.

Conclusion

SAS Grid Manager and SAS Grid Manager for Platform, though they use different methods than SAS Viya, still have mechanisms to deliver high availability through redundancy and failover. One of the benefits of clustered computing is this concept of high availability, and I wanted to show that the SAS 9 Grid hasn’t fallen behind on this important method for keeping errors and problems away from a system’s daily users. Problems will always arise in any computing environment... a robust failover scheme ensures end users don’t feel those effects! (Or at least, they don’t feel them too much.)

Related Links

SAS Workload Orchestrator Options for Availability and Encryption

SAS Help Center: Enabling Master Host Failover for Job Flow Scheduler

SAS Help Center: Master Host Failover

SAS Help Center: High Availability and SAS Grid Manager for Platform

SAS Help Center: Setting Up High Availability for Critical Applications
