
Running SAS Viya on a shared Kubernetes cluster – Part 2


This is Part 2 of the post on running SAS Viya on a shared Kubernetes cluster. In Part 1 I discussed some of the challenges that can be encountered when deploying SAS Viya in a cluster that has untainted nodes, as well as the main deployment considerations.

 

In this post I will discuss implementing required scheduling (required nodeAffinity) to force a desired topology when there are untainted nodes in the cluster.

 

Again, I will share some of the tests that I ran to help illustrate the issues.

 

To recap, for my testing I created an Azure Kubernetes Service (AKS) cluster with the required SAS Viya node pools, with the labels and taints applied, but they were scaled to zero. In addition to the four node pools that correspond to the standard SAS Viya workload classes (cas, compute, stateful, and stateless) and a ‘system’ node pool for the Kubernetes control plane, I also created an ‘apps’ node pool (without any taints applied).

 

This was to simulate a scenario where other applications are running in the cluster and several untainted nodes are available.

 

Using required scheduling to force a topology

 

To ensure that the SAS Viya pods run on the desired nodes (when there are untainted nodes), ‘required node affinity’ must be used.

 

I will start by saying that the easiest path, the simplest configuration option when running with untainted nodes, is to focus on configuring the CAS and Compute pods. This is easier than reconfiguring all the stateless and stateful services to use required scheduling. There is also a patch transformer for CAS supplied in the overlays: require-cas-label.yaml

 

See the SAS Viya README: Optional CAS Server Placement Configuration
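To apply it, the overlay is added to the transformers block of the kustomization.yaml. A minimal sketch is shown below; the path reflects my deployment and may differ slightly depending on your sas-bases version, so check it against the README:

transformers:
  # ... other transformers ...
  - sas-bases/overlays/cas-server/require-cas-label.yaml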

 

This means that you only need to focus on the sas-programming-environment components (pods).

 

I discuss enabling required scheduling in the following post: Creating custom Viya topologies – Part 2 (using custom node pools for the compute pods).

 

For this test I configured required scheduling for both CAS and Compute, reset the cas and compute node pools to zero nodes, then deployed SAS Viya.

 

I created the following patch transformers for the sas-programming-environment pods:

  • sas-compute-job-config-require-compute-label.yaml
  • sas-batch-pod-template-require-compute-label.yaml
  • sas-launcher-job-config-require-compute-label.yaml
  • sas-connect-pod-template-require-compute-label.yaml

 

To summarize the configuration, the following patch is applied in each of these transformers to update the nodeAffinity for the Compute components:

 

patch: |-
  - op: remove
    path: /template/spec/affinity/nodeAffinity/preferredDuringSchedulingIgnoredDuringExecution
  - op: add
    path: /template/spec/affinity/nodeAffinity/requiredDuringSchedulingIgnoredDuringExecution/nodeSelectorTerms/0/matchExpressions/-
    value:
      key: workload.sas.com/class
      operator: In
      values:
      - compute
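For completeness, here is a sketch of what one of these files could look like in full, using sas-compute-job-config-require-compute-label.yaml as the example. The target (the sas-compute-job-config PodTemplate) reflects my deployment, so verify the name against your own manifests:

---
apiVersion: builtin
kind: PatchTransformer
metadata:
  name: sas-compute-job-config-require-compute-label
patch: |-
  - op: remove
    path: /template/spec/affinity/nodeAffinity/preferredDuringSchedulingIgnoredDuringExecution
  - op: add
    path: /template/spec/affinity/nodeAffinity/requiredDuringSchedulingIgnoredDuringExecution/nodeSelectorTerms/0/matchExpressions/-
    value:
      key: workload.sas.com/class
      operator: In
      values:
      - compute
target:
  kind: PodTemplate
  version: v1
  name: sas-compute-job-config

The four files are then referenced in the transformers block of the kustomization.yaml, alongside the CAS overlay.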

 

While this worked, it did highlight the problem for the first user to log in to SAS Studio, as it is at this point that the first compute node is created. The SAS Studio session timed out while waiting for the container images to download and the Compute Server (pod) to start.

 

I waited for the compute node to fully start, then started a new SAS Studio session. This time I got the Compute Server context.

 

This does beg the question: Should you ever let the compute node pool scale to zero?

 

Looking at the SAS Viya deployment, I still didn’t have any pods running on the stateful or stateless nodes. These node pools were still scaled to zero; all the stateless and stateful pods were running on the ‘apps’ nodes.

 

But would starting some stateless and stateful nodes prior to the SAS Viya deployment fix this?

 

At this point I did another test; I manually scaled the stateful node pool to have one node, as sketched below. After this scaling, and before the SAS Viya deployment, the AKS cluster had the nodes shown in the following image.
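A minimal sketch of that scaling command is shown here; the resource group and cluster names are placeholders for my environment:

az aks nodepool scale \
  --resource-group myResourceGroup \
  --cluster-name myAKSCluster \
  --name stateful \
  --node-count 1

Note that if the cluster autoscaler is enabled for the pool, you would adjust the autoscaler minimum count instead of scaling the pool directly.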

 

MG_1_202405_test4-nodes.png


 

You can see that I had three nodes available in the SAS Viya node pools and again I had several ‘apps’ nodes (aks-apps-xxxx-vmssnnnn) available to simulate nodes being used for other applications.

 

I did this to illustrate the Kubernetes pod scheduling behaviour. Kubernetes (the cluster autoscaler) will not scale a node pool until it is needed, regardless of the node affinity for the pod. If an “acceptable” node is available, in this case one without any taints, the pod will be scheduled there as a first choice.

 

Hence, at the end of this deployment I still only had one stateful node and no stateless nodes.

 

Once SAS Viya had deployed, I used the following command to view the pods running on the stateful node (aks-stateful-12471038-vmss000000) and found there were only 8 pods running on the node.

 

kubectl -n namespace get pods --field-selector spec.nodeName=node-name
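For example, using the stateful node above and assuming a namespace of ‘viya’ (a placeholder), the command becomes:

kubectl -n viya get pods --field-selector spec.nodeName=aks-stateful-12471038-vmss000000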

 

MG_1_5_202405_test4-stateful-node-pods.png

 

The rest of the Viya stateful and stateless pods were running on the 'apps' nodes.

 

Using one node pool dedicated to SAS Viya

 

In this scenario, perhaps the organisation wants to dedicate, or reserve, only one node pool to SAS Viya. When sharing a cluster with other applications, this is probably the simplest approach to ensure that nodes are available to meet the requirements of the SAS Viya Compute Server and CAS functions.

 

I like this configuration as you only need to create one additional node pool in an existing cluster. Using required scheduling for the CAS and Compute pods also allows the node pool to scale to zero.

 

The problem with scaling the node pools to zero is avoided, because starting the CAS server triggers the initial scale-up of the shared node pool.

 

For this test I created a ‘viya’ node pool, with the compute label and taint applied.
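As a sketch, such a node pool could be created with the Azure CLI along the following lines; the resource group, cluster name, VM size and autoscaler limits are illustrative only:

az aks nodepool add \
  --resource-group myResourceGroup \
  --cluster-name myAKSCluster \
  --name viya \
  --node-vm-size Standard_D16s_v5 \
  --enable-cluster-autoscaler \
  --min-count 0 \
  --max-count 5 \
  --labels workload.sas.com/class=compute \
  --node-taints workload.sas.com/class=compute:NoSchedule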

 

To avoid pod drift, the sas-programming-environment pods should be updated to use required scheduling with the standard compute labels and taints, as described above.

 

Additionally, as the CAS and Compute pods are sharing the same node pool, the CAS server configuration had to be updated to target the shared node pool, the ‘viya’ node pool. For this I used the provided CAS overlay as a template and targeted the workload.sas.com/class=compute label. The tolerations also had to be updated for the compute taint.

 

The patch transformers for the CAS configuration are shown below. The first patch transformer updates the CASDeployment.

 

# PatchTransformer to make the compute label required and provide a toleration for the compute taint
---
apiVersion: builtin
kind: PatchTransformer
metadata:
  name: run-cas-on-compute-nodes
patch: |-
  # Remove existing nodeAffinity
  - op: remove
    path: /spec/controllerTemplate/spec/affinity/nodeAffinity/preferredDuringSchedulingIgnoredDuringExecution
  # Add new nodeAffinity
  - op: add
    path: /spec/controllerTemplate/spec/affinity/nodeAffinity/requiredDuringSchedulingIgnoredDuringExecution/nodeSelectorTerms/0/matchExpressions/-
    value:
      key: workload.sas.com/class
      operator: In
      values:
      - compute
  # Set tolerations
  - op: replace
    path: /spec/controllerTemplate/spec/tolerations
    value:
      - effect: NoSchedule
        key: workload.sas.com/class
        operator: Equal
        value: compute

target:
  group: viya.sas.com
  kind: CASDeployment
  name: .*
  version: v1alpha1

 

The second patch transformer updates the 'sas-cas-pod-template'.

 

---
apiVersion: builtin
kind: PatchTransformer
metadata:
  name: set-cas-pod-template-tolerations
patch: |-
  - op: replace
    path: /template/spec/tolerations
    value:
      - effect: NoSchedule
        key: workload.sas.com/class
        operator: Equal
        value: compute

target:
  kind: PodTemplate
  version: v1
  name: sas-cas-pod-template

 

The final update I made was to adjust the CPU and memory requests and limits for CAS. This was to ensure that there was space available on the shared nodes for the Compute Server pods.

 

For example, I used nodes with 16 vCPU and 128GB of memory, and set the CAS limits to 12 vCPU and 96GB of memory. I also set the requests equal to the limits to enforce guaranteed QoS. Using guaranteed QoS is important to protect the CAS Server pods when the nodes are busy, as it ensures that the CAS pods are among the last to be evicted from a node.

 

This was my target topology.

 

MG_2_202405_Shared-node-pool-v2.png

 

To configure the CPU and memory for CAS see the example in sas-bases:

../sas-bases/examples/cas/configure/cas-manage-cpu-and-memory.yaml
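Based on that example, a sketch of the settings described above (12 vCPU and 96GB of memory, with requests equal to limits) might look like the following; treat the patch path as illustrative and copy the exact structure from the sas-bases example for your release:

---
apiVersion: builtin
kind: PatchTransformer
metadata:
  name: cas-manage-cpu-and-memory
patch: |-
  - op: replace
    path: /spec/controllerTemplate/spec/containers/0/resources
    value:
      requests:
        cpu: 12
        memory: 96Gi
      limits:
        cpu: 12
        memory: 96Gi
target:
  group: viya.sas.com
  kind: CASDeployment
  name: .*
  version: v1alpha1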

 

When I deployed SAS Viya with an MPP CAS Server, this had the added benefit of ensuring that multiple nodes were available for the Compute Server pods. The Compute pods were configured with the default resource requests and limits for CPU and memory.

 

As a side note (running SAS Viya 2024.03), if you inspect the sas-compute-server pod you will see two running containers. The sas-programming-environment container has resource limits of 2 cpu and 2Gi memory, and the sas-process-exporter container has limits of 2 cpu and 4Gi memory.
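One way to confirm this in your own environment (the namespace and pod name are placeholders) is to list the container names and resource limits for a running compute server pod:

kubectl -n viya get pod <sas-compute-server-pod-name> \
  -o jsonpath='{range .spec.containers[*]}{.name}{"\t"}{.resources.limits}{"\n"}{end}'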

 

Once SAS Viya was running, I then started multiple SAS Studio sessions. In the image below you can see that the Compute Server pods are running on multiple ‘viya’ nodes.

 

MG_3_202405_test6-studio-pods-v2.png
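To reproduce this view, a simple (if crude) approach is to list the pods with their node names and filter for the compute server pods; the namespace is a placeholder:

kubectl -n viya get pods -o wide | grep sas-compute-server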

 

As an end user, life was good: my SAS Studio sessions started without any failures. 😊

 

Conclusion

 

Sharing the cluster with other applications is possible, but this needs to be carefully planned to ensure the best result is achieved for ALL applications.

 

As always, it’s important to focus on the requirements for the SAS Viya platform. When the cluster has untainted nodes, you should configure required scheduling to ensure you have an operational SAS Viya platform. The simplest approach is probably just to focus on Compute and CAS.

 

A key question is: Do the untainted nodes meet the system requirements for SAS Viya?

 

If not, additional node pools WILL be required to run the SAS Viya platform.

 

Here I have proposed the concept of sharing a node pool for Compute and CAS, and shown how you could reserve some capacity for the two workloads. You should do some capacity planning and sizing to establish suitable node sizes and the appropriate resource reservations.

 

But keep in mind, it is possible to have a shared node pool for CAS and Compute and still use CAS auto-resources to effectively dedicate some nodes within the node pool to running CAS.

 

Finally, I have left you with a couple of questions to ponder (from an end-user perspective):

  • Should you ever let the compute node pool scale to zero?
  • How many nodes should you have available for the Compute pods to avoid SAS Studio timeouts?

 

I’m sure this will lead to interesting discussions! Maybe it’s a topic for another day…
