SAS Viya topology changes with Stable 2021.2.6

The release of Stable 2021.2.6 brought some changes that will affect your deployment topology, workload placement plan, and node selection. SAS has changed the default set of workload classes, so the CONNECT workload class is now optional and requires additional steps to be enabled. In this post we will discuss the new topology and when you may still need to implement the CONNECT workload class. An additional change that will affect your node selection is that the use of GPUs is now supported with some Compute processing, not just the CAS Server.

CONNECT Workload Class Changes

With Stable 2021.2.6 the default workload classes have changed: the “connect” workload class has been removed from the default configuration.

This reflects the fact that when the SAS/CONNECT Spawner is supporting connections from a SAS 9.4M7, Viya 3.5 or another Viya 4 system (client), the Spawner acts purely as a service; it does not run any of the remote workload.

This change affects the sas-connect-spawner Deployment definition. All references to the connect workload class have been removed from the sas-connect-spawner Deployment definition (this includes the labels, nodeAffinity and tolerations) and have been replaced with the “stateless” workload class. The result is that the SAS/CONNECT Spawner will now be scheduled on “stateless” nodes by default.

To illustrate the changes, I ran the ‘icdiff’ command to show the differences between Stable 2021.2.5 and 2021.2.6 for the sas-connect-spawner deployment.

In (1) you will see the label change to categorize the Spawner as a stateless service, applying the stateless workload class label, (2) shows the node affinity for the stateless nodes, and (3) shows the update to the pod tolerations. As with the other stateless services, the Spawner pod has a toleration for both the stateful and stateless taints.
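To make the toleration behaviour concrete, here is a minimal Python sketch of how Kubernetes matches pod tolerations against node taints. It is a simplified model and not the SAS manifests: it ignores node affinity, and it assumes the taint key is workload.sas.com/class with the NoSchedule effect. Check your own node definitions before relying on those names.

```python
from dataclasses import dataclass

# Assumption: the key used for the SAS workload class labels and taints.
CLASS_KEY = "workload.sas.com/class"


@dataclass(frozen=True)
class Taint:
    key: str
    value: str
    effect: str = "NoSchedule"


@dataclass(frozen=True)
class Toleration:
    key: str
    value: str = ""
    operator: str = "Equal"  # "Equal" or "Exists"
    effect: str = ""         # an empty effect matches every effect


def tolerates(toleration, taint):
    """Return True if this single toleration matches this single taint."""
    if toleration.effect and toleration.effect != taint.effect:
        return False
    if toleration.operator == "Exists":
        # An empty key with "Exists" matches every taint.
        return toleration.key in ("", taint.key)
    return toleration.key == taint.key and toleration.value == taint.value


def can_schedule(tolerations, node_taints):
    """A pod passes the taint check only if every taint is tolerated."""
    return all(any(tolerates(tol, taint) for tol in tolerations)
               for taint in node_taints)


# The Spawner pod tolerates both the stateful and stateless taints.
spawner_tolerations = [
    Toleration(CLASS_KEY, "stateful", effect="NoSchedule"),
    Toleration(CLASS_KEY, "stateless", effect="NoSchedule"),
]
stateless_node = [Taint(CLASS_KEY, "stateless")]
stateful_node = [Taint(CLASS_KEY, "stateful")]
connect_node = [Taint(CLASS_KEY, "connect")]
```

With these definitions, can_schedule(spawner_tolerations, stateless_node) and can_schedule(spawner_tolerations, stateful_node) are both True, while can_schedule(spawner_tolerations, connect_node) is False, because the default Spawner pod no longer carries a toleration for the connect taint.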

Stepping back from the YAML changes, let’s take a moment to discuss SAS/CONNECT and the different session types that are supported, and how this can affect your deployment topology (the number of node pools).

The following description is from the SAS Viya Programming Documentation. “SAS/CONNECT software is a SAS client/server toolset that provides the ability to manage, access, and process data in a distributed and parallel SAS environment. As a client/server application, SAS/CONNECT links a SAS client session to a SAS server (SAS/CONNECT Server) session.”

The SAS/CONNECT Spawner is a SAS Viya service that launches processes on behalf of SAS/CONNECT clients. The client processes can be launched in their own pods (referred to as “dynamically launched pods”) or in the SAS/CONNECT Spawner pod (in this mode, the Spawner pod supports the sessions from multiple clients).

When the client process is launched in its own pod, the “dynamically launched pod”, the new pod is started using a Kubernetes PodTemplate (sas-connect-pod-template) and runs on the Compute nodes by default. The dynamically launched pod contains the SAS/CONNECT Server for that client session.

In the second case, when the client process is launched in the SAS/CONNECT Spawner pod, the SAS/CONNECT Server process runs in the Spawner pod, and the Spawner pod may be supporting multiple client sessions. We could call this “legacy” mode, as it is how the legacy clients are supported.

Note, clients from SAS 9.4M6 and earlier releases, and SAS Viya 3.4 and earlier, do NOT support dynamically launched pods. So, by default their processes are launched in the SAS/CONNECT Spawner pod. They are the SAS/CONNECT “legacy clients”.

This raises the question: “When do I need a node pool dedicated to the SAS/CONNECT workload?”

I have created a decision flow to help answer this question (see the “Topology decision flow” section later in this post).

From a resource consumption perspective, the dynamically launched pods are similar to the SAS Compute Server workload, and as previously stated, the launched pods run on the Compute nodes by default.

However, when the SAS/CONNECT Spawner pod is running multiple client sessions it can consume significant resources. Therefore, much like the CAS pods, the SAS/CONNECT Spawner pod should be assigned a dedicated Kubernetes node and should be configured with a guaranteed Quality of Service (QoS).
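As a reminder of what “guaranteed” means here: Kubernetes assigns the Guaranteed QoS class only when every container in the pod has CPU and memory limits and its requests equal those limits. The sketch below is a simplified Python model of that rule, not Kubernetes code. It compares quantities as plain strings (so "1" and "1000m" would not be treated as equal), and the resource values in the usage note are made up for illustration.

```python
def qos_class(containers):
    """Classify a pod from its containers' resource settings.

    containers: list of dicts with optional "requests" and "limits" dicts,
    for example {"requests": {"cpu": "4"}, "limits": {"cpu": "4"}}.
    """
    any_set = False
    guaranteed = True
    for container in containers:
        limits = container.get("limits", {})
        # Kubernetes defaults a missing request to the limit when one is set.
        requests = {**limits, **container.get("requests", {})}
        if requests or limits:
            any_set = True
        for resource in ("cpu", "memory"):
            if resource not in limits or requests.get(resource) != limits[resource]:
                guaranteed = False
    if not any_set:
        return "BestEffort"
    return "Guaranteed" if guaranteed else "Burstable"
```

For example, a container with requests and limits both set to {"cpu": "4", "memory": "16Gi"} is classified as Guaranteed, while a container that only sets a CPU request is Burstable.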

Hence, if you do not have any legacy client sessions, the SAS/CONNECT Spawner can happily run as a “stateless service”. To support this, as of Stable 2021.2.6, the SAS/CONNECT Spawner is deployed in the stateless workload class by default. This means that implementing the connect workload class is ONLY required, or recommended, if you are supporting the legacy clients.

To enable the ‘connect’ workload class, there are two new patch transformers:

enable-spawned-servers.yaml
use-connect-workload-class.yaml

Along with applying the patch transformers you also must create the ‘connect’ node pool and label and taint the nodes for the ‘connect’ workload class. This is what I have called the “old topology” in the decision flow.
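For the label and taint step, a small Python helper like the one below can generate the kubectl commands for a list of nodes. This is only a sketch: the node name is hypothetical, and the workload.sas.com/class key and the NoSchedule effect are assumptions that you should confirm against the SAS Viya deployment documentation. On managed Kubernetes services, labels and taints are usually set on the node pool when it is created rather than node by node.

```python
import shlex

# Assumption: the key used for the SAS workload class labels and taints.
CLASS_KEY = "workload.sas.com/class"


def connect_node_commands(node_names, workload_class="connect"):
    """Build the kubectl label and taint commands for each node name."""
    commands = []
    for node in node_names:
        name = shlex.quote(node)
        commands.append(
            f"kubectl label nodes {name} {CLASS_KEY}={workload_class} --overwrite"
        )
        commands.append(
            f"kubectl taint nodes {name} "
            f"{CLASS_KEY}={workload_class}:NoSchedule --overwrite"
        )
    return commands


# "connect-node-1" is a hypothetical node name.
for line in connect_node_commands(["connect-node-1"]):
    print(line)
```

The helper only prints the commands, so you can review them before running anything against the cluster.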

GPU support for SAS Compute

The other change that I would like to briefly touch on is that the SAS Programming Environment container can now make use of GPUs; that is, it can use the SAS GPU reservation service. Prior to Stable 2021.2.6, the GPU reservation service was only used by the CAS Server.

The update extends support to SAS IML workloads (PROC IML) running on the Compute Server. For a complete list of the GPU support see the Offerings and Action Sets that Support GPU Capabilities section in the System Requirements for SAS Viya.

It is important to note that GPU support (for CAS or Compute) is not available when running on Red Hat OpenShift. Also see the following blog by Raphaël Poumarede, Add a CAS “GPU-enabled” Node pool to boost your SAS Viya Analytics Platform!

Topology decision flow

Even prior to this change, it wasn’t mandatory to implement a dedicated node pool for the connect workload; it was possible to use one of the other node pools for the CONNECT Spawner pod. However, depending on the tainting of the nodes, this may have needed a custom configuration for the sas-connect-spawner Deployment.

For example, you might do this if the CONNECT workload was quite light, that is, the Spawner pod is only supporting a small number of sessions. I’m sorry I can’t give you a formula to help you determine when a dedicated node pool is required. I would see this as part of regular capacity planning. Monitor the performance and resource usage and scale out to a dedicated CONNECT node when needed.

As discussed above, the Compute nodes are a good fit for the launched pods, as they are just another type of compute session.

However, there are cases where you might still want to implement a ‘connect’ node pool to isolate the connect processing. For example, with the change in support for GPUs, the Compute nodes could be GPU enabled, but this is not required for the CONNECT sessions. Therefore, to optimize costs you might want to change the default configuration to use a different node type for the CONNECT workload.

Below is a decision flow to help assess whether the ‘connect’ workload class and a dedicated ‘connect’ node pool need to be implemented.

The most likely paths through the decision flow are shown as ‘A’ (the blue path) and ‘B’ (the green path). I would hope that most customers will use the default configuration and are not supporting the legacy clients, which means they will be using the green path (B).
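The reasoning in this post can also be written down as a few lines of Python. Please treat this as a rough sketch of the points made in the text, not a transcription of the decision flow diagram: the function name, the parameters and the returned descriptions are my own, and a lightly loaded Spawner may not need a dedicated node pool, as noted earlier.

```python
def connect_topology(has_legacy_clients, isolate_connect_sessions=False):
    """Suggest a topology for the SAS/CONNECT workload.

    has_legacy_clients: True if any client sessions run inside the
        Spawner pod rather than in dynamically launched pods.
    isolate_connect_sessions: True if the launched pods should be kept
        off the Compute nodes, for example because those nodes are
        GPU enabled and therefore more expensive.
    """
    if has_legacy_clients:
        # The Spawner pod runs the sessions itself and can consume
        # significant resources.
        return "connect workload class with a dedicated connect node pool"
    if isolate_connect_sessions:
        return "optional connect node pool for the launched pods"
    return "default: Spawner on stateless nodes, launched pods on Compute nodes"
```

For example, connect_topology(has_legacy_clients=False) returns the default topology, which matches the configuration delivered with Stable 2021.2.6.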

Conclusion

The good news is that with Stable 2021.2.6, if there are no legacy clients, there is no need to have a dedicated node pool for SAS/CONNECT; by default there is no ‘connect’ workload class. The SAS/CONNECT Spawner follows the Viya architecture pattern and works as a stateless service.

However, a key thing to remember is that this change will NOT be available in LTS 2022.1 (May); customers on the LTS cadence will have to wait until LTS 2022.2 (November). In the meantime, when using the LTS cadence, running without a dedicated connect node pool will continue to require a custom configuration.

Similarly, the new GPU support will not be available in the LTS cadence until LTS 2022.2.

Finally, the recent changes make it easier to “grow” or “shrink” the topology. For example, start with three node pools and grow to four, or five (to separate the Stateful and Stateless services), when needed.

I hope this is useful and thanks for reading.
