
SAS Viya 3.4 High Availability: a couple of updates you don’t want to miss

Started ‎07-30-2019 by
Modified ‎07-30-2019 by

In 2019, our enterprise customers are adopting SAS Viya widely. They demand a robust architecture, and High Availability is often at the top of the list of requirements.

 

Recently, SAS has released new updates to address potential issues that could compromise the effectiveness of service clustering in the event of failovers; this impacts the CAS Server and the SAS Message Broker (RabbitMQ).

 

A previous post describes how clustering enhances the availability of Viya servers and services. For most of them, clustering also guarantees failover: in case one member of the cluster dies, its functionality is seamlessly transferred to another member, so that any client can still use the service capabilities. Unfortunately, there are cases where this might not work as originally intended.

CAS Server

20190325_01_CASCluster.png


 

One of the most important things to understand when implementing a CAS cluster with High Availability is that, although you can configure two controllers, they are not "equals". One is the Primary Controller, the other the Backup Controller. The names are meaningful: the backup controller cannot operate autonomously, except after a failure of the primary one. Consider these implications:

  • You cannot operate a CAS cluster from the backup controller host. The start/stop script, for the whole CAS cluster, is located only on the primary machine.
  • There is no option in that script, nor in any other interface such as SAS Environment Manager, to only start a backup controller. You always start the primary, which, in turn, starts the backup controller followed by all worker nodes.
  • If the primary controller fails, the cluster operates without fault tolerance for the controller until a planned outage. During the planned outage, the site should recover the failed controller and return to a redundant state.

Unveiling a hidden problem

SAS has recently discovered an unintended consequence of the behavior described above.

 

By design, when the primary controller fails, the backup controller keeps working (and steps in, to service clients); in the same way, should the backup controller fail, the primary controller keeps working unaffected. So far, so good. Well, the keyword here is "fail".

 

Should an administrator send the "stop" command to any controller, whether to the primary or the backup, the whole CAS cluster stops. You may think: "I don’t see the problem, you told us above that there is no way to operate the backup controller independently. This implies that a stop command always comes from the primary". Well, not always.

 

Guess what happens when both primary and backup controllers are operating nominally, and an administrator decides to shut down the host where the backup controller is located? Maybe the administrator is thinking: "The primary controller is fine, so there will be no interruption for end-users".

 

Unfortunately, during the shutdown, the operating system sends a stop command to the backup controller. This is not a failure: the operating system is gently asking the process "could you please stop?" The backup controller interprets this as a shutdown request for the whole cluster, and CAS is shut down cleanly. All of CAS: backup, primary, workers.

 

For the technical geeks out there: a failure would be sending a SIGKILL a.k.a. kill -9 to the backup controller process; here we are talking about sending a "clean" SIGQUIT or SIGTERM to the process.

 

To recap: while a CAS cluster can survive failures, it might not survive normal maintenance.

The fix

After all the words spent above to describe the potential issue, the fix is easy.

 

SAS Note 63522 addresses this issue; an update is available for both Viya 3.3 and Viya 3.4. After applying the update, a backup controller that is being stopped nicely will not invoke a shutdown command for the whole CAS cluster. This is now documented with an update to the SAS Viya Administration guide:

 

"If the backup controller fails or is shut down while the primary controller continues to operate, the site can continue to operate without fault tolerance."

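The behavior before and after the update can be summarized as a small decision function. This is only a toy model of what this article describes, not SAS code: the function name, its parameters, and the returned strings are made up for illustration, and it assumes that a clean stop of the primary controller still stops the whole cluster after the update.

```python
RUNNING_DEGRADED = "running without controller fault tolerance"
STOPPED = "stopped"

def cas_cluster_state(target, event, patched):
    """Toy model: state of the CAS cluster after `event` hits controller `target`.

    target:  "primary" or "backup"
    event:   "fail" (for example SIGKILL or a host crash) or "stop" (clean stop request)
    patched: True if the update from SAS Note 63522 is applied
    """
    if target not in ("primary", "backup") or event not in ("fail", "stop"):
        raise ValueError("unknown target or event")
    if event == "fail":
        # A failure of either controller leaves the other one serving clients.
        return RUNNING_DEGRADED
    if target == "backup" and patched:
        # After the update, a clean stop of the backup no longer stops the cluster.
        return RUNNING_DEGRADED
    # Clean stop of the primary (assumed), or of an unpatched backup: everything stops.
    return STOPPED

if __name__ == "__main__":
    for patched in (False, True):
        print("patched:", patched, "->", cas_cluster_state("backup", "stop", patched))
```

The only row that changes with the update is the clean stop of the backup controller, which is exactly the host-maintenance scenario described earlier.
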
SAS Message Broker

20190325_02_RabbitMQCluster.png

 

The SAS Message Broker, which uses RabbitMQ, is the subject of another recent discovery around clustering and failures.

RabbitMQ High Availability

During the initial deployment, it is possible to configure a cluster of two or more RabbitMQ servers by specifying the hostnames in the inventory.ini file, under the [rabbitmq] host group. After that, message queues are automatically set up for High Availability. During normal operations, messages are mirrored across members, but for each queue, only one specific instance is considered its master. All operations for a given queue are first applied on the queue's master node and then propagated to mirrors (this involves enqueueing publishes, delivering messages to consumers, tracking acknowledgments from consumers, and so on). If the node that hosts a queue master fails, one of the surviving mirrors becomes the new master.
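
The master/mirror relationship described above can be pictured with a toy Python model. It is a deliberately simplified sketch, not RabbitMQ's implementation: the `MirroredQueue` class and the node names are hypothetical, and promoting the first listed mirror is an assumption, since the text only says that one of the surviving mirrors takes over.

```python
class MirroredQueue:
    """Toy model of one queue with a master copy and mirrored copies."""

    def __init__(self, master, mirrors):
        self.master = master
        self.mirrors = list(mirrors)
        # One copy of the queue contents per node.
        self.copies = {node: [] for node in [master, *self.mirrors]}

    def publish(self, message):
        # Operations are applied on the master first, then propagated to the mirrors.
        self.copies[self.master].append(message)
        for node in self.mirrors:
            self.copies[node].append(message)

    def node_failed(self, node):
        self.copies.pop(node, None)
        if node == self.master:
            if not self.mirrors:
                raise RuntimeError("no surviving mirror to promote")
            # Assumption: promote the first listed mirror.
            self.master = self.mirrors.pop(0)
        elif node in self.mirrors:
            self.mirrors.remove(node)

if __name__ == "__main__":
    queue = MirroredQueue("rabbit1", ["rabbit2", "rabbit3"])
    queue.publish("m1")
    queue.node_failed("rabbit1")
    print(queue.master, queue.copies[queue.master])
```

Because every publish was already propagated to the mirrors, the promoted node holds the same messages the failed master had.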

The split-brain problem

Network connection failures between cluster members can affect data consistency and node availability. The RabbitMQ documentation provides a very detailed description of how the software detects network partitions, how it reacts during a partition and how it recovers after the partition is resolved. It explains how RabbitMQ supports different configuration settings, so that it can react differently, for example favoring data consistency versus availability.

 

Multiple customers have told us that the default configuration adopted by SAS with Viya can create issues. A network partition could cause a single node to go down, as well as prevent the rest of the cluster from automatically recovering.

The fix

RabbitMQ is capable of dealing with network partitions automatically. SAS R&D has determined that the best configuration for Viya is to set a specific property, as described by SAS Note 63804:

  1. In /opt/sas/viya/config/etc/rabbitmq-server/rabbitmq.config.ssl, add cluster_partition_handling = pause_minority
  2. Restart the sas-viya-rabbitmq-server-default service on all SAS Message Broker hosts.

A more permanent fix is coming in the form of a software update: that property will be set automatically during the deployment.

An important architectural consideration

When implementing the above fix, it is important to be aware of how RabbitMQ behaves after setting the cluster_partition_handling = pause_minority property. In the page referenced above, there is a section titled "More about pause-minority mode." You can read there:

 

"Also note that RabbitMQ will pause nodes which are not in a strict majority of the cluster - i.e. containing more than half of all nodes. It is therefore not a good idea to enable pause-minority mode on a cluster of two nodes since in the event of any network partition or node failure, both nodes will pause."

 

This is important to know when designing a High Availability architecture. When the objective is to achieve High Availability with the minimum number of machines, it is quite common to create a cluster of only two nodes for most components; until now, the documentation stated an explicit minimum of three nodes only for Consul. Its third instance can be placed on another node, for example, collocated with the CAS controller, thus saving on hardware costs. After this fix, the same "3-nodes-minimum" requirement applies to RabbitMQ.
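
The arithmetic behind the three-node minimum is simple enough to sketch in Python. The function names below are hypothetical and the model only encodes the "strict majority" rule quoted from the RabbitMQ documentation; it does not talk to a real broker.

```python
def pauses_under_pause_minority(reachable_nodes, cluster_size):
    """True if a node that can reach `reachable_nodes` members (itself included) pauses."""
    if not 1 <= reachable_nodes <= cluster_size:
        raise ValueError("reachable_nodes must be between 1 and cluster_size")
    # A node keeps running only if its side holds MORE than half of all nodes.
    return not (2 * reachable_nodes > cluster_size)

def surviving_nodes(partition_sizes):
    """Given the size of each side of a partition, count the nodes that keep running."""
    total = sum(partition_sizes)
    return sum(size for size in partition_sizes
               if not pauses_under_pause_minority(size, total))

if __name__ == "__main__":
    print("2 nodes split 1/1:", surviving_nodes([1, 1]))   # neither side has a majority
    print("3 nodes split 2/1:", surviving_nodes([2, 1]))   # the side of two keeps running
```

In a two-node cluster no side of a split can ever hold more than half of the nodes, so both pause; with three nodes, the side of two keeps the message broker available.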

Version history
Last update:
‎07-30-2019 01:14 PM