
What happened? That doesn’t look like the SAS Viya deployment I wanted!


In previous posts I have talked about creating a Workload Placement Plan or Strategy. One of the benefits of running in the cloud is that the cloud providers offer elastic infrastructure. In Kubernetes terms, this equates to node pools that can scale from zero nodes up to a maximum number of nodes. But if you are using node pools that can auto-scale down to zero nodes, you might get some unexpected results.

 

I was recently testing a deployment in Azure with the SAS Viya Infrastructure as Code (IaC) GitHub project, using the minimal pattern with two node pools, both of which could scale to zero nodes. When I deployed SAS Viya, all the pods ended up running in a single node pool! This wasn’t what I was after.

 

So, what went wrong with my CAS workload placement?

 

Let’s have a look at why this happened.

 

I built my Azure Kubernetes Service (AKS) cluster using the minimal IaC sample, which provides a system node pool, plus a node pool called ‘generic’ and one called ‘cas’. As the names might suggest, the ‘cas’ node pool was to be dedicated to running the SAS Cloud Analytic Services (CAS) pods and the ‘generic’ node pool was for everything else.

 

Both the ‘generic’ and 'cas' node pools could auto-scale to zero nodes, which meant that when I built the cluster it only had the system node pool active, with one node running. For the cas node pool, the nodes had the CAS workload label and taint (workload.sas.com/class=cas) applied. The generic node pool didn’t have any taint, but it had the following labels applied (a trimmed sketch of the node metadata for both pools follows the list):

 

  • workload.sas.com/class=compute
  • launcher.sas.com/prepullImage=sas-programming-environment
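
To make that concrete, here is a trimmed, illustrative sketch of how the node metadata for the two pools might look. Real AKS nodes carry many more labels, and I am assuming the NoSchedule taint effect used by the IaC samples, so treat this as a sketch rather than the exact output:

# Trimmed, illustrative node definitions; not complete AKS node objects.
apiVersion: v1
kind: Node
metadata:
  name: aks-cas-xxxxxx-vmss000000        # a node from the 'cas' pool
  labels:
    workload.sas.com/class: cas          # CAS workload label
spec:
  taints:
  - key: workload.sas.com/class          # CAS workload taint
    value: cas
    effect: NoSchedule                   # assumed effect; check your IaC settings
---
apiVersion: v1
kind: Node
metadata:
  name: aks-generic-xxxxxx-vmss000000    # a node from the 'generic' pool
  labels:
    workload.sas.com/class: compute
    launcher.sas.com/prepullImage: sas-programming-environment
# note: no taints are applied to the generic nodes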

 

These labels are used as part of the pre-pull process for the 'sas-programming-environment' pods. As this was a test environment, my first deployment used an SMP CAS server (with a default SAS Viya configuration) without the CAS auto-resources transformer. After seeing that all the pods, including the CAS pods, ended up running on the generic nodes, I did a second deployment using an MPP CAS server to confirm what I was seeing. This is shown in Figure 1.

 


 

Figure 1. IaC minimal sample without CAS auto-resources

 

The default SAS Viya configuration uses preferred node affinity; see the Kubernetes documentation on Assigning Pods to Nodes. Hence, I could have labeled the figure as “Preferred node affinity without CAS auto-resources”.
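
As an illustrative fragment (not the full generated CAS pod spec, which carries additional terms and weights), the preferred rule has roughly this shape:

affinity:
  nodeAffinity:
    preferredDuringSchedulingIgnoredDuringExecution:
    - weight: 100
      preference:
        matchExpressions:
        - key: workload.sas.com/class
          operator: In
          values:
          - cas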

 

As you can see, all the CAS pods are running on the generic nodes, and I ended up with three workers running on the same node (aks-generic-xxxxxx-vmss00000g). Having the three workers on the same node was understandable (as I was not using the CAS auto-resources), but why didn’t the cas nodes get used?

 

The answer to this question lies in the default pod configuration, which uses preferred scheduling (preferredDuringSchedulingIgnoredDuringExecution) for the node affinity, combined with the node pool configuration and the state of the cluster at the time SAS Viya was deployed. Let’s explore what I mean by that.

 

Both node pools could auto-scale to zero; scaling to zero happens when the minimum node count for a node pool is set to zero. So the cluster started with no generic or cas nodes running, and the rest of the answer is in the SAS Viya start-up sequence and the SAS Viya default configuration.

 

That is, some of the first objects to start are the stateful and stateless pods, which meant that by the time the CASDeployment operator went to start the CAS controller and worker pods (or the SMP CAS server), there were already generic nodes available.

 

This is where the preferred node affinity comes into play. It is only a preference that the cas nodes are used; if there aren’t any cas nodes available, another option is evaluated. Instead of triggering a new cas node to be provisioned, the scheduler placed the pods on the generic nodes, as they didn’t have any taint applied.

 

Hence, it was the combination of three factors (the untainted generic nodes, zero cas nodes at deployment time, and preferred affinity) that led to this situation. If you create node pools with a non-zero minimum number of nodes, you may never see this behavior.

 

Finally, I should state that Kubernetes doesn’t have the concept of a node pool, just nodes; a node pool is a construct developed by the cloud providers in their managed Kubernetes implementations, and it is how they provide elasticity for the Kubernetes node infrastructure.

 

Simple!   Maybe we should look at some more examples to explain what is happening.

 

 

Using CAS auto-resources

 

At this point you might think, “I know how to fix this, I just need to use the CAS auto-resources transformer.” The CAS auto-resources transformer automatically adjusts the resource requests and limits for the CAS pods (controller and workers), moving them from the ‘Burstable’ Quality of Service (QoS) class to the ‘Guaranteed’ QoS class, with values of approximately 86 percent of the available resources (memory and CPU) of the first node found with the CAS label.
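
For example, with Guaranteed QoS the requests equal the limits. The values below are illustrative only, assuming a hypothetical 16-vCPU, 128 GiB node; the operator calculates the real numbers from the node it inspects:

resources:
  requests:
    cpu: "13"            # roughly 86% of the node's CPU (illustrative)
    memory: 110Gi        # roughly 86% of the node's memory (illustrative)
  limits:
    cpu: "13"            # Guaranteed QoS: limits equal requests
    memory: 110Gi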

 

That description might be a simplification of what the CAS operator is doing, but it reflects the out-of-the-box behavior.

 

Enabling the auto-resources does two things for us: first, it ensures that there is only one CAS pod per node; second, it adjusts the resources for the CAS pods without you (the SAS administrator) having to calculate and set a value. The resources are set based on the size of the nodes.

 

If you are familiar with SAS Viya 3.x deployments, using the CAS auto-resources (along with the CAS workload taint) gives you the same topology as the CAS host group in SAS Viya 3.x: nodes dedicated to running the CAS server.

 

So, what happened with this configuration?

 


 

Figure 2. Preferred node affinity with CAS auto-resources

 

In Figure 2, you can see that the CAS controller and worker pods are now all running on separate nodes, but still in the generic node pool. I should also state that there may have been other pods running on those nodes alongside the CAS pods. I didn’t check, but Kubernetes could still schedule other pods to those nodes, depending on their resource requests; remember, the generic nodes did not have any taint applied.

 

So, better, but not perfect.

 

 

Using Required nodeAffinity

 

The deployment assets provide an overlay to change the CAS node affinity from preferred scheduling to required scheduling (requiredDuringSchedulingIgnoredDuringExecution). The overlay is called require-cas-label.yaml and is located under the sas-bases folder:
sas-bases/overlays/cas-server/require-cas-label.yaml

 

Using this overlay means that the CAS pods will only run on nodes that have the ‘workload.sas.com/class=cas’ label. Therefore, you need to ensure that there are sufficient nodes available to run all the CAS pods. Otherwise, some of the CAS pods will not be able to run.
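
For comparison with the preferred rule shown earlier, the required form is an illustrative fragment along these lines (again, not the literal content of the overlay):

affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
      - matchExpressions:
        - key: workload.sas.com/class
          operator: In
          values:
          - cas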

 

At this point my kustomization.yaml has the definition for using the CAS auto-resources and it also includes the require-cas-label.yaml overlay. Figure 3 shows the results of using the two transformers.
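
For reference, the relevant part of my kustomization.yaml looked something like the sketch below. The require-cas-label.yaml path is as given above; the auto-resources entries are indicative only, so check the README under sas-bases/overlays/cas-server for the exact resources and transformers entries in your cadence:

resources:
  - sas-bases/base
  - sas-bases/overlays/cas-server/auto-resources                         # assumed entry; see the README
transformers:
  - sas-bases/overlays/cas-server/auto-resources/remove-resources.yaml   # assumed entry; see the README
  - sas-bases/overlays/cas-server/require-cas-label.yaml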

 


 

Figure 3. Required node affinity with CAS auto-resources

 

As you can see, the CAS controller and worker pods are now all running on the cas nodes, with one CAS pod per node. This is what I wanted: the cas node pool is now being used. 😊
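
The CAS pods can land on these tainted nodes because they carry a toleration matching the CAS workload taint. An illustrative fragment (I am assuming the Equal operator form here) looks like this:

tolerations:
- key: workload.sas.com/class
  operator: Equal
  value: cas
  effect: NoSchedule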

 

Just to round out the discussion, if you were wondering what the deployment would look like with required node affinity but without the CAS auto-resources enabled, this is shown in Figure 4.

 


 

Figure 4. Required node affinity without CAS auto-resources

 

As can be seen, the CAS Controller and Worker pods are now all running on a single cas node (aks-cas-xxxxxx-vms000001).

 

 

Conclusion

 

Coming back to my question “What went wrong with my CAS workload placement?”

 

The short answer was nothing, Kubernetes did exactly what it was told to do!

 

The pod scheduling rules in Kubernetes are complex, and many different conditions can affect where a pod is started, that is, which node will be used. In this post we have discussed node affinity and taints, but there are also node anti-affinity and pod anti-affinity rules that will affect where a pod runs.

 

Using the CAS auto-resources transformer enables you to set the CAS pods’ resources based on the size of the nodes being used and configures the pods to run with a Guaranteed QoS. I would expect that most, if not all, production deployments will use the CAS auto-resources configuration, unless the environment makes little use of the CAS server.

 

Remember, running the CAS pods on the same node defeats the benefits of using an MPP CAS server, namely fault tolerance, scalability, and performance. Therefore, I would always recommend using the CAS auto-resources.

 

However, there is one possible scenario for not using the auto-resources: using them requires the CASDeployment operator to have a ClusterRole with "get" access on the Kubernetes nodes. This role gives the operator the ability to inspect the nodes to determine their resources.

 

If your organization's IT (Kubernetes) standards do not allow this role assignment, so it is not possible to grant the ClusterRole access, then you should do the following (a sketch of the resulting kustomization.yaml entries follows the list):

 

  • Manually calculate the resources needed and use the ‘cas-manage-cpu-and-memory.yaml’ transformer to set the resources, and
  • Enable required node affinity with the ‘require-cas-label.yaml’ transformer.
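
In that case, the kustomization.yaml would reference your edited copy of the example file instead of the auto-resources overlay. A minimal sketch, assuming the common convention of copying the example into a site-config folder (check the deployment README for the example file's exact location under sas-bases):

transformers:
  - site-config/cas-manage-cpu-and-memory.yaml            # your edited copy of the SAS example, with manually calculated values
  - sas-bases/overlays/cas-server/require-cas-label.yaml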

 

See the deployment documentation, Configure CAS Settings - Adjust RAM and CPU Resources for CAS Servers.  But that’s a story for another article.

 

To summarize the key takeaways…

 

  • When using preferred scheduling, the CAS pods may end up on other nodes if the preferred (cas) nodes are not available.
  • Using CAS auto-resources with preferred scheduling does NOT guarantee that the cas nodes will be used. However, they will be used if available.
  • Use the require-cas-label.yaml transformer to implement required (strict) node affinity, especially if there are untainted nodes in the cluster.
    • This forces the CAS pods to use only the cas node pool nodes, that is, the nodes with the CAS workload class label (workload.sas.com/class=cas), and will trigger the cas node pool to scale up if possible.
    • But you need to ensure that there are sufficient nodes available to run all the CAS pods. Otherwise, some pods may end up in a pending state.

 

Finally, you might have noticed that each screenshot shows a different set of node names. This is because between each test I deleted the SAS Viya namespace and waited for the AKS cluster to scale back down to just the system node. In AKS, it appears that when a node is stopped its name is still marked as used, so a new node is started with the next name in the sequence. Hence, you can see that I ran the test in Figure 4 before the test in Figure 3. I hope this is useful and thanks for reading.

 

 

References

The SAS Viya Infrastructure as Code (IaC) project is available for AWS, Google GCP and Microsoft Azure.

 

Find more articles from SAS Global Enablement and Learning here.

