SAS Cloud Analytic Services Availability for Administrators

Starting with SAS Viya 2020.1, the SAS Cloud Analytic Services server can leverage Kubernetes capabilities to simplify management and operations. Service availability is one key area where this integration improves on previous releases. In this article, we'll look into some of the integration details to help SAS Administrators build and maintain a reliable SAS Viya environment.

CAS Availability Architecture: the Basics

SAS Cloud Analytic Services (CAS) can be deployed in a distributed analytic cluster, to use multiple machines for AI, data management, and analytics (this is also called MPP mode - Massively Parallel Processing mode). In this configuration, CAS is more resilient to failures: even if a CAS worker node fails, the service as a whole is still available.

Starting with SAS Viya 3.3, CAS can also have a backup (secondary) controller. The two controllers continuously keep themselves in sync; this enables the backup controller to provide service rapidly in case of failure of the primary. To detect failures, the controllers exchange heartbeat messages every few seconds.

A CAS backup controller can only be used in a distributed server architecture: there are no failover capabilities for single-node CAS SMP. Deploying a backup controller is optional, and each CAS deployment supports at most one primary and one backup controller.

The primary and backup controllers share some specific storage directories, using ReadWriteMany (RWX) PersistentVolumeClaims (see Rob Collum's Kubernetes Primer for SAS Viya: Storage for CAS for additional info on CAS storage). This is a significant improvement compared to SAS Viya 3: in that release, each controller has a dedicated, local permstore area that can get out of sync in case of failover. To recover, SAS Administrators have to synchronize the permstore manually as documented here.

Managing the pods

Let's now discuss some possible failure scenarios, and see how the system reacts. Before covering the details, it's important to understand how, starting with SAS Viya 2020.1, CAS can leverage Kubernetes capabilities. SAS Viya provides a Kubernetes CustomResourceDefinition that defines the CASDeployment custom resource, along with the operator that manages it: the CAS operator. The CAS operator manages the lifecycle of CAS pods; with respect to CAS availability, it monitors the pods and manages their automatic restart policies.
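To see what the CAS operator is managing, an administrator can list the CASDeployment resources and the CAS pods. The following is a minimal sketch, not an official procedure: the function names are hypothetical, and the `sas-cas-server-default` pod name prefix is an assumption derived from the service name used in this article.

```shell
# List the CASDeployment custom resources in a namespace.
# Usage: cas_deployments name-of-namespace
cas_deployments() {
  kubectl -n "$1" get casdeployment
}

# List the CAS server pods together with the node each one runs on.
# Assumption: CAS pod names start with "sas-cas-server-default".
# Usage: cas_pods name-of-namespace
cas_pods() {
  kubectl -n "$1" get pods -o wide | grep '^sas-cas-server-default'
}
```

Running `cas_pods name-of-namespace` before and after a failure is one way to observe which pods were restarted and where they landed.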

1. CAS SMP

CAS SMP runs inside a single pod; if it fails, the CAS operator restarts it. All active connections and sessions are terminated; any in-memory data is lost. The new pod may end up on a different node and get a new internal IP address. This is not a problem for CAS clients that want to re-connect: the Kubernetes service sas-cas-server-default-client, used to reach the CAS controller pod, gets automatically re-routed to the new pod.

2. CAS MPP without a Backup Controller

What if the controller of an MPP CAS server without a backup controller fails?

Again, the CAS operator manages the restart policies for the CAS server pods; as soon as it registers the controller failure or termination, all pods (including the workers) are automatically deleted and restarted. The reason for this behavior is simple: after the only controller is gone, all CAS sessions are lost and workers become useless. Forcing a complete restart enables the CAS operator to reset CAS to a clean state. Obviously, this implies that all active connections and sessions are terminated, and any in-memory data is lost.

Just as in the previous case, each pod may end up on a different node and get a new internal IP address; the Kubernetes service sas-cas-server-default-client gets re-routed to the new controller pod and clients can reconnect.
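The re-routing described above can be checked by looking at the endpoints behind the client service. This is a small sketch with a hypothetical function name; it only assumes the service name given in this article and standard `kubectl` behavior.

```shell
# Show the pod IP address(es) currently backing the CAS client service.
# After a controller restart, the address listed here should change
# to that of the new controller pod.
# Usage: cas_client_endpoints name-of-namespace
cas_client_endpoints() {
  kubectl -n "$1" get endpoints sas-cas-server-default-client
}
```

Comparing the reported address with the pod IPs from `kubectl get pods -o wide` shows which pod clients reach through the service.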

3. CAS MPP with a Backup Controller

If the controller of an MPP CAS server with a backup controller fails, the result is different. The CAS operator declines to restart a failed controller – either primary or backup – when the other one can step in. In this case, if the primary controller dies or is terminated, it remains in a stopped state and its pod gets evicted.

The backup controller takes control of the cluster, and the Kubernetes service sas-cas-server-default-client gets re-routed to the backup controller pod.

After this transfer, all active connections and sessions are maintained, any in-memory data is preserved, and processing continues uninterrupted.

Starting with SAS Viya 2020.1, there is an important difference in restart behavior when compared to previous versions. With SAS Viya 3.x, when an administrator issues a stop command to the primary controller, CAS interprets it as a command to stop the whole cluster, and all nodes come down. With SAS Viya 2020.1 and later, when an administrator stops the primary controller, it is considered a deviation from the desired state, and the CAS operator starts a failover to the backup controller, just as if it were a failure. The reason for this change is that we now use a declarative approach instead of the traditional imperative one. The proper way to stop the CAS server is therefore to declare the desired state by setting the "shutdown" property of the CASDeployment to "true":

kubectl -n name-of-namespace patch casdeployment default --type=json -p='[{"op": "add", "path": "/spec/shutdown", "value":true}]'

Recovering CAS server from a failure

As we have seen in the SMP or MPP cases without a backup controller, the CAS operator reacts to a failure of the primary by restarting the whole CAS server: nothing else is left for the administrator to recover. Yes, they may need to reload data and manage client disconnections, but nothing is left to be done with the backend.

Instead, when a backup controller steps in after the primary controller has failed, we have seen that the CAS operator declines to restart it. How can an administrator get back to the initial situation, with the primary in charge of the CAS cluster?

For the current release, recovering from a controller failure requires a planned outage. An administrator has to stop and restart the CAS server: after CAS restarts, the primary controller resumes its role and the backup controller falls back to providing fault tolerance.
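The stop and restart can be sketched with the same declarative patch shown earlier in this article. The helper names below are hypothetical, and restarting by setting the shutdown value back to `false` is an assumption that should be verified against the SAS Viya documentation for your release.

```shell
# Build the JSON patch that sets the desired shutdown state.
# $1 is "true" (stop CAS) or "false" (assumed to let the operator start it again).
cas_shutdown_patch() {
  printf '[{"op": "add", "path": "/spec/shutdown", "value":%s}]' "$1"
}

# Apply the desired shutdown state to the "default" CASDeployment.
# Usage: cas_set_shutdown name-of-namespace true|false
cas_set_shutdown() {
  kubectl -n "$1" patch casdeployment default --type=json -p="$(cas_shutdown_patch "$2")"
}
```

A planned outage would then consist of `cas_set_shutdown name-of-namespace true`, waiting until the CAS pods have terminated, and `cas_set_shutdown name-of-namespace false`.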

4. What about CAS Workers?

So far, we have seen multiple scenarios covering the CAS controllers. What about CAS workers? The rule is always the same: the CAS operator manages the restart policies for the CAS server pods. As soon as it registers a worker failure or termination, it starts a new pod to take its place. The new CAS worker starts empty: no loaded data, no user sessions. This is not an issue: data availability is still addressed as it is with previous SAS Viya releases; if tables are loaded in-memory with redundant copies, they are still there after a worker failure and restart. One of the surviving workers activates or loads the data blocks for tables that were previously managed by the failed worker, and CAS keeps working.

Closing

The availability of an enterprise system such as SAS Viya is a complex topic that can be viewed from multiple perspectives. In this article we have given SAS Administrators an overview of CAS availability with SAS Viya 2020.1 and later, highlighting how the integration with Kubernetes streamlines the management and recovery of failed components. A different perspective would be to consider how client applications react to CAS failures and how users are (or are not) impacted. That will be the topic of another article.
