Running SAS Viya on a shared Kubernetes cluster – Part 1


When we think about running SAS Viya in a shared Kubernetes cluster there are two use cases:

Using a dedicated Kubernetes cluster for SAS Viya, running multiple SAS Viya deployments.
Running SAS Viya in a Kubernetes cluster that is shared with other (3rd party) applications.

With the withdrawal of support for application multi-tenancy within SAS Viya as of Stable 2023.10 and LTS 2023.10, the requirement to run multiple SAS Viya environments in a shared cluster may become more common.

I will start by stating that it is possible to run SAS Viya in a shared cluster in both cases, but there are several considerations. In this post I will focus on running SAS Viya in a Kubernetes cluster shared with other applications.

When using a Kubernetes (K8s) cluster for multiple applications it is important to understand all the application workloads, their individual system requirements and the service levels associated with the applications.

These factors impact the K8s cluster design including the design for the node pools and workload placement. Complications often arise when an organisation doesn’t want to dedicate nodes to applications and does not want to taint nodes.

Applications such as SAS Viya have some specific system requirements, for example:

the need for local storage on the nodes for SASWORK and CAS disk cache
the in-memory processing requirements for the CAS server
a requirement to use GPUs.

These requirements are not common for other business applications.

SAS Viya also needs specific node taints and labels. Here I’m particularly thinking about the sas-programming-environment (Compute Server) pods and CAS. Functions like the SAS Workload Orchestrator and the sas-programming pre-pull function rely on having nodes labelled with: workload.sas.com/class=compute

Remember, in Kubernetes, node taints are used to repel pods and node labels are used to attract pods. These are used in conjunction with the node affinity and toleration definitions in the application pods. See the Kubernetes documentation: Assigning Pods to Nodes
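The repel side of this can be sketched in a few lines of Python. This is a simplified model of how a NoSchedule taint is matched against a pod's tolerations, not the scheduler's actual code. The taint key and value used here (workload.sas.com/class=cas) are an assumption that mirrors the label key mentioned above, and the class and function names are made up for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Taint:
    key: str
    value: str = ""
    effect: str = "NoSchedule"


@dataclass(frozen=True)
class Toleration:
    key: str = ""
    operator: str = "Equal"  # "Equal" or "Exists"
    value: str = ""
    effect: str = ""         # an empty effect matches any taint effect


def tolerates(toleration: Toleration, taint: Taint) -> bool:
    """Return True if a single toleration matches a single taint."""
    if toleration.effect and toleration.effect != taint.effect:
        return False
    if toleration.operator == "Exists":
        # An empty key with Exists tolerates every taint.
        return toleration.key == "" or toleration.key == taint.key
    return toleration.key == taint.key and toleration.value == taint.value


def can_schedule(tolerations, taints) -> bool:
    """A pod is repelled unless every NoSchedule taint on the node is tolerated."""
    return all(
        any(tolerates(tol, taint) for tol in tolerations)
        for taint in taints
        if taint.effect == "NoSchedule"
    )


# Assumed taint and toleration, mirroring the workload class label key.
cas_taint = Taint("workload.sas.com/class", "cas")
cas_toleration = Toleration("workload.sas.com/class", "Equal", "cas", "NoSchedule")

print(can_schedule([cas_toleration], [cas_taint]))  # CAS pod on a tainted cas node
print(can_schedule([], [cas_taint]))                # other application's pod on the cas node
print(can_schedule([], []))                         # any pod on an untainted node
```

The last call is the important one for a shared cluster: an untainted node repels nothing, so every pod, including the SAS Viya pods, is free to land there.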

As the SAS Viya configuration defaults to using preferred scheduling, appropriately labelling and tainting nodes is important to achieving the desired workload placement for a (specific) topology.

The node pools are a way to optimise the deployment for specific application requirements (like GPUs and storage) and workloads.

Another consideration when sharing the cluster with other applications is the level of permissions required to deploy and run SAS Viya. This can be a concern for some organisations. For example, the need for cluster-wide admin permissions.

SAS Viya also has specific system requirements for functions such as the ingress controller. These might differ from, or conflict with, the application ecosystem that is currently in place.

Hence, the system requirements and admin permissions for SAS Viya can be key drivers for dedicating a cluster to running SAS Viya.

With that said, let’s now look at some deployment scenarios. I felt the easiest way to illustrate some of the deployment concerns was to run some tests.

Running in a cluster with untainted nodes

As the default SAS Viya deployment uses preferred scheduling, one of the key concerns in achieving a target topology is what I call “pod drift”. Pod drift is where the Viya pods end up running on nodes that aren’t designated for the SAS Viya processing. As a result, the target workload placement for the desired SAS Viya topology is not achieved.

For most components this isn’t a problem, but for functions that have specific node requirements this can lead to poor performance and/or resource utilization, or worse, it might even break the Viya deployment in some way!
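Pod drift can be illustrated with a toy model of preferred scheduling. The sketch below assumes every node is untainted (so all nodes are feasible) and scores nodes only on preferred node affinity; the real scheduler combines many more scoring factors. The node names and the weight of 100 are hypothetical.

```python
def preferred_score(node_labels: dict, preferences: list) -> int:
    """Sum the weights of the preferred terms that the node's labels satisfy."""
    return sum(
        weight
        for weight, key, value in preferences
        if node_labels.get(key) == value
    )


def pick_node(nodes: dict, preferences: list) -> str:
    """Pick the highest-scoring node; preferred terms never exclude a node."""
    return max(nodes, key=lambda name: preferred_score(nodes[name], preferences))


CLASS = "workload.sas.com/class"
cas_prefs = [(100, CLASS, "cas")]  # hypothetical weight

# cas node pool scaled to zero: only unlabelled, untainted apps nodes exist.
running = {"apps-0": {}, "apps-1": {}}
print(pick_node(running, cas_prefs))  # an apps node is chosen: pod drift

# One cas node is available, so the preference can take effect.
running["cas-0"] = {CLASS: "cas"}
print(pick_node(running, cas_prefs))
```

Because a preference only adds to a node's score, a pod with no matching node available is still scheduled somewhere, which is exactly how it drifts onto nodes intended for other applications.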

To illustrate this, I tested the following scenario when running SAS Viya in a cluster with untainted nodes:

The organisation doesn’t taint nodes but has agreed to label and taint some nodes for SAS Viya.
The SAS Viya node pools have been defined to allow them to scale to zero.
A single node pool is dedicated to SAS Viya.

Test 1

In my first test I created an AKS cluster with the required SAS Viya node pools, with the labels and taints applied, but scaled to zero. In addition to the four node pools that correspond to the standard SAS Viya workload classes (cas, compute, stateful, stateless) and a ‘system’ node pool for the Kubernetes system pods, there was also an ‘apps’ node pool (without any taints applied).

This was to simulate a scenario where there are other applications running and several nodes available that didn’t have any taints applied.

At the start of the SAS Viya deployment the state of the cluster is shown in the image below.


The ‘apps’ node pool had 6 nodes running and the system node pool had one node.

When I deployed SAS Viya, the only node pool that was used was the apps node pool. Even with CAS auto-resources enabled, a new node was started in the apps node pool rather than in the cas node pool. This can be seen in the image below.

In the image you can see that sas-cas-server-default-controller is running on the seventh node (aks-apps-37136449-vmss000006) in the 'apps' node pool.

When I started a SAS Studio session, I couldn’t get a compute server as there were no compute nodes available. This is shown below.

In the image, you can see that I failed to get a SAS Studio compute context, and the SAS Compute Service and the SAS Launcher pods are both running in the apps node pool, but there isn’t any SAS Compute Server pod. There were no nodes available with the compute label:

workload.sas.com/class=compute

At this point I should state that I had the SAS Workload Orchestrator (SWO) enabled with the default configuration. As there weren’t any nodes available with the workload.sas.com/class=compute label, the Compute Server for the SAS Studio session was not started.

Obviously, this isn’t a good experience for the SAS users.

It also highlights how critical it is to have at least one node available with the compute label when using the SAS Workload Orchestrator, and shows what happens when the cluster autoscaler doesn’t scale up the compute node pool.
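A simple pre-flight check for this condition can be sketched as follows. The function works on a dictionary in the shape of the JSON returned by `kubectl get nodes -o json` (a NodeList with `items[].metadata.labels`). The sample data and node names are hand-written for illustration, and the function name is hypothetical.

```python
import json


def nodes_with_label(node_list: dict, key: str, value: str) -> list:
    """Return the names of nodes whose labels include key=value."""
    return [
        item["metadata"]["name"]
        for item in node_list.get("items", [])
        if item["metadata"].get("labels", {}).get(key) == value
    ]


# Made-up sample shaped like `kubectl get nodes -o json` output,
# representing a cluster where the compute node pool is scaled to zero.
sample = json.loads("""
{
  "kind": "NodeList",
  "items": [
    {"metadata": {"name": "apps-0", "labels": {"agentpool": "apps"}}},
    {"metadata": {"name": "system-0", "labels": {"agentpool": "system"}}}
  ]
}
""")

compute_nodes = nodes_with_label(sample, "workload.sas.com/class", "compute")
if not compute_nodes:
    print("No nodes labelled workload.sas.com/class=compute")
```

In practice the input would come from the live cluster rather than a literal string. An empty result corresponds to the situation in this test, where SAS Studio could not obtain a compute server.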

So, what would happen if there were cas and compute nodes available?

Test 2

In test 2 I manually scaled the cas and compute node pools to have one node each. The image below shows the state of the cluster when I started the test.

The hope here was that I would get the CAS server and Compute Server pods running on the desired nodes.

This deployment was better, as the Kubernetes scheduler placed pods on the cas and compute nodes. The CAS Server (sas-cas-server-default-controller) was running on the cas node, and I was able to get a compute server when I started SAS Studio. This is shown in the following image.

Here you can see that the sas-compute-server pod is running on the compute node, and the SAS Compute Service and SAS Launcher pods are running in the ‘apps’ node pool.

This still wasn’t perfect, as I didn’t get any pods running on stateless or stateful nodes. For that to happen, nodes would have needed to be available in the stateless and stateful node pools.

Another problem with this Kubernetes configuration is that, when SWO is disabled, there is no guarantee that all the sas-compute-server pods will run in the compute node pool. Let’s assume that there is a significant programming workload and there isn’t sufficient capacity on the existing compute server node(s).

Rather than starting a new compute node, the K8s scheduler could select one of the existing apps nodes to run the new sessions. At this point the users would start to see the error shown above in the first test: a timeout in SAS Studio while waiting for the Compute Server pod (the SAS Studio compute context) to start. This is due to the time required to pull the sas-programming-environment image down to a node.

A user could strike it lucky, and the Compute Server could be scheduled onto a node that already has the sas-programming-environment image. But the Compute Server still has dependencies on things like storage for SASWORK and maybe the need for GPUs.

This illustrates the problem of using preferred scheduling in a cluster with untainted nodes.

You can probably live with the stateless and stateful pods running on any available node in the cluster, but how do you ensure that the compute and cas pods are running on the target nodes even when the node pools scale to zero?

To ensure that the Viya pods run on the desired nodes when there are untainted nodes, ‘required node affinity’ (required scheduling) must be used. I will discuss this in more detail in part 2.
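To make the difference concrete, the sketch below builds the two forms of the Kubernetes nodeAffinity stanza as Python dictionaries. The field names are those of the Kubernetes pod spec; the helper functions and the weight of 100 are hypothetical, and this is not the configuration shipped with SAS Viya.

```python
import json

LABEL_KEY = "workload.sas.com/class"


def match_class(workload_class: str) -> dict:
    """Node selector term matching nodes labelled with the given class."""
    return {
        "matchExpressions": [
            {"key": LABEL_KEY, "operator": "In", "values": [workload_class]}
        ]
    }


def preferred_affinity(workload_class: str, weight: int = 100) -> dict:
    """Soft rule: the scheduler favours matching nodes but may use others."""
    return {
        "nodeAffinity": {
            "preferredDuringSchedulingIgnoredDuringExecution": [
                {"weight": weight, "preference": match_class(workload_class)}
            ]
        }
    }


def required_affinity(workload_class: str) -> dict:
    """Hard rule: the pod stays Pending until a matching node exists."""
    return {
        "nodeAffinity": {
            "requiredDuringSchedulingIgnoredDuringExecution": {
                "nodeSelectorTerms": [match_class(workload_class)]
            }
        }
    }


print(json.dumps(required_affinity("compute"), indent=2))
```

With the required form, a pod that cannot find a matching node is left Pending instead of drifting onto an apps node.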

Conclusion

Sharing the cluster with other applications is possible, but this needs to be carefully planned to ensure the best result is achieved for ALL applications.

The SAS Viya system requirements and admin permissions need to be discussed and can be key drivers for dedicating a cluster to running SAS Viya.

As we have seen, there are many factors that affect the Kubernetes scheduling, including node availability, the labels and taints on nodes, as well as the application configuration (in this case SAS Viya), to name a few.

In part 2 we will look at using required scheduling to force a topology (when there are untainted nodes) and dedicating a single node pool to SAS Viya within the shared cluster.

Find more articles from SAS Global Enablement and Learning here.

touwen_k · ‎06-25-2024

Well-explained concepts; it makes clear why one needs to implement the taints. I love the scenarios, they make the blog easy to follow.