
Workload placement in SAS Viya for the installation engineer – part 2


In the first part of this article series, we introduced the workload placement concept for a SAS Viya deployment.

 

We presented the five pre-defined SAS workload classes (with the associated affinities and tolerations for the SAS Viya pods), which must be used in accordance with the preparation of the Kubernetes nodes (with labels and taints).

 

In this second part, we'll review why workload placement is a critical aspect of the SAS Viya deployment, dig a little deeper into the details, and explain how the labels are used during the SAS Viya deployment.


Finally, we will explore a few additional considerations for a deployment in AKS.

 

Why does it matter and how does it work? 

The primary goals of the workload placement strategy are to ensure a high quality of service for critical SAS Viya components, enable scale-to-zero cost-saving operations, and allow the use of specialty nodes for SAS Viya workloads.

 

For example, you can schedule the CAS and Compute workloads on dedicated, highly performant hosts, or define different instance types and scalability ranges depending on the type of workload.

 

rp_1_legos.png


 

The workload placement strategy also allows for differentiated control and maintenance of the nodes.

 

For example, nodes that host the stateless and stateful workload class resources can be "drained" one at a time by the Kubernetes administrator and updated with no additional management concerns (if the HA transformer was used for the deployment).

But other nodes, hosting the compute, cas, or connect workload classes, require additional care (cordon the node or terminate the pods) – see the official documentation for details.
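
As an illustration, here is what a typical one-node-at-a-time maintenance sequence looks like with kubectl. The node name below is just a placeholder; always check the official SAS and Kubernetes documentation before draining nodes that host CAS, compute, or connect workloads:

# Mark the node unschedulable so no new pods land on it
kubectl cordon aks-stateless-12345678-vmss000000

# Evict the pods running there (DaemonSet pods are left in place)
# (use --delete-local-data instead of --delete-emptydir-data on older kubectl versions)
kubectl drain aks-stateless-12345678-vmss000000 --ignore-daemonsets --delete-emptydir-data

# Perform the maintenance, then make the node schedulable again
kubectl uncordon aks-stateless-12345678-vmss000000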

 

But that's not all: several other mechanisms rely on this workload placement strategy…

 

Dedicated hosts for CAS

Thanks to the node taints, only the CAS controller(s) and the CAS worker pods are allowed to run on the Nodes with the CAS workload taint.

However, that alone would NOT prevent several CAS pods (in a CAS MPP deployment) from running together on the same node… and in general you don't want that.

 

There are two things that prevent several CAS pods from co-existing on the same node:

 

1. Pod anti-affinity: a CAS controller or worker pod prefers not to be on the same node as another CAS controller or worker pod – see the official Kubernetes documentation for additional information, and the sketch after this list.

rp_2_podantiaffinity.png

2. CAS auto-resources... You don't know what that is? OK, just continue reading.
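
For reference, a preferred pod anti-affinity rule has roughly the following shape. This is only a minimal sketch of the mechanism, not the exact manifest generated for the CAS pods, and the label key and value in the matchExpressions are illustrative:

affinity:
  podAntiAffinity:
    preferredDuringSchedulingIgnoredDuringExecution:
    - weight: 100
      podAffinityTerm:
        topologyKey: kubernetes.io/hostname      # "spread across nodes"
        labelSelector:
          matchExpressions:
          - key: app.kubernetes.io/name          # illustrative label, not necessarily the one CAS uses
            operator: In
            values:
            - sas-cas-server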

 

CAS Auto-resources

If you use the initial kustomization.yaml file with the cas-server/auto-resources overlay, the CAS operator determines the amount of RAM and CPU for your CAS pods based on the RAM and CPU available on the Kubernetes nodes where CAS is supposed to run.

 

For this "CAS auto-resourcing" feature to work properly, you must have set the CAS workload class label on your "CAS" nodes, because that is how the CAS operator identifies the "CAS" nodes in the cluster and sets the corresponding CAS pod resource requests and limits.
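
If you label the nodes manually (rather than through an IaC tool), the command looks like this – the node name is a placeholder:

kubectl label nodes aks-cas-12345678-vmss000000 workload.sas.com/class=cas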

 

The "CAS auto-resourcing" overlay does not only set the CPU and memory requests and limits; it also sets them in such a way that the CAS pods run with the "Guaranteed" Quality of Service (QoS) class in Kubernetes.

 

Hmm… OK… but what does that mean?

 

For the detailed explanation, you can again refer to the official Kubernetes documentation, but basically a pod gets the "Guaranteed" QoS class when every one of its containers has the same values for its CPU and memory requests and limits.
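
In YAML terms, a container that qualifies for the "Guaranteed" QoS class looks like this. The values below are purely illustrative and are not the ones computed by the CAS operator:

resources:
  requests:
    cpu: "3"
    memory: 22Gi
  limits:
    cpu: "3"        # limits identical to requests => "Guaranteed" QoS
    memory: 22Gi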

 

The example below corresponds to a cas-server pod whose resource requests and limits have been set by the CAS operator, running on a node of the "Standard_E4s_v3" Azure instance type (with 4 vCPUs and 32 GB of RAM).

 

rp_3_GuarenteedQoS.png

 

You can see that the pod's resource requests amount to around two thirds of the node's available CPU and memory. Also note that the values are the same for the requests and the limits.

 

The important thing to know about the "Guaranteed" QoS class is that it prevents Kubernetes from evicting the pod when the node runs low on resources (Guaranteed pods are the last candidates for eviction).

 

As the CAS pods host all the CAS sessions/servers (potentially hundreds of user sessions), we really don't want them to be evicted.
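
You can check which QoS class Kubernetes assigned to a pod directly from its status. For example (the namespace and pod name are placeholders for your own deployment):

kubectl -n viya get pod sas-cas-server-default-controller -o jsonpath='{.status.qosClass}'
# Expected output for a CAS pod deployed with the auto-resources overlay: Guaranteed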

 

Dedicated hosts for SAS/CONNECT

By design the nodes with the "CAS" and "Connect" taints are reserved for specific types of pods.

 

While it makes sense to dedicate an entire node to CAS (CAS is designed to use all the available CPU on the machine to solve analytics problems as fast as possible), you might wonder why we would want to dedicate a node "only" to the SAS/CONNECT pod...

 

SAS/CONNECT has the same characteristic as CAS: multiple (or all) sessions/servers are started within the same pod (the sas-connect-spawner pod).

 

So, if Kubernetes takes that pod down because it overran a resource limit, all the sessions/servers go down with it. As with CAS, isolating the Connect spawner on a dedicated node mitigates the risk of this happening.

 

SAS/CONNECT legacy vs "dynamically launched"

However, the requirement for a dedicated "connect" node is expected to be temporary, because there is an alternative approach referred to as "dynamically launched" CONNECT servers.

 

In this mode, each new CONNECT session is started in its own dedicated, independent pod.

 

With "dynamically launched" CONNECT sessions, there is no need for this dedicated "connect" node, and we could instead scale up the number of nodes in the "compute" node pool.

 

Unfortunately, in the current Viya versions (LTS 2021.1 and Stable 2021.1.1), "dynamically launched" CONNECT sessions can only be initiated from SAS Viya clients running inside the same cluster. So, if you have SAS 9.x or SAS Viya 3.x CONNECT clients (or SAS Viya 2020.1 and later clients in different clusters), they will all start CONNECT sessions inside the single sas-connect-spawner pod, using the legacy CONNECT method.

 

To summarize, the connect workload class is a temporary stopgap until SAS/CONNECT supports dynamic launch (i.e. launching new CONNECT servers through the Launcher API/service) across all use cases. After that, the "connect" workload class will no longer be required, and the sas-connect pods could use the "compute" workload class.

 

If you want to know more about SAS/CONNECT in SAS Viya, my colleague @EdoardoRiva recently published a great article about it.

 

The Compute "pre-pull" trick

The SAS workload class labels are also used to improve the end-user experience with SAS Studio and the Compute server.


When a user logs in to SAS Studio, a new dedicated Compute pod is spawned for the session, and the Compute container image is quite big (14-17 GB).


So, if the Compute container image is not already cached on the node where the Compute pod is scheduled to run, it can take up to 30 minutes (or more) for the image to be pulled and the pod to start.

 

We really want to avoid such a terrible experience (who wants to wait more than a few seconds before being able to use a GUI?).

The idea is to "pre-pull" the Compute pod image during the SAS Viya deployment, so that it is already available on the nodes by the end of the deployment.

As part of the deployment, the sas-prepull DaemonSet runs on all the "compute" nodes, that is, the nodes that carry the workload.sas.com/class=compute label.

 

 

rp_4_compute-prepull.png

 

 

So, by the time the first user logs in to SAS Studio, the Compute container image is already available in the image cache of the labeled "compute" nodes and does not have to be downloaded on the fly.
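
A quick way to verify that the pre-pull mechanism is in place is to check that the DaemonSet pods are running on every labeled compute node (the namespace below is a placeholder for your own SAS Viya namespace):

# Nodes that carry the compute workload class label
kubectl get nodes -l workload.sas.com/class=compute

# The pre-pull DaemonSet and its pods (one pod is expected per labeled compute node)
kubectl -n viya get daemonset sas-prepull
kubectl -n viya get pods -o wide | grep sas-prepull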

 

But this nifty process only works if you have labeled at least one node with the workload.sas.com/class=compute label before launching the deployment.

 

Considerations for the SAS Viya 4 (2021.1 and later) IaC project

If you create your Cloud infrastructure using the SAS Viya 4 IaC tool from the sassoftware GitHub project (which we strongly recommend), then there are a few extra considerations...

rp_5_viya4-iac-azure.png

 

Node Pools

In Kubernetes managed services running in the Cloud, we have the concept of "node pools" (Azure and GCP) or "node groups" (AWS).

 

A "node pool" is a group of nodes within a cluster that all have the same configuration. A "node pool" can contain one or multiple identical nodes and can scale up (more nodes added to the pool) or scale down (nodes removed from the pool).

 

With the SAS Viya 4 IaC tool, the Terraform variables file (where you customize the infrastructure definition) lets you define the Azure node pools with the standard node labels and taints, and it comes with a default setting for all the node pools.

 

 

rp_6_nodepool-tf-template.png
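
To give an idea of what this looks like, here is a fragment in the spirit of the sample terraform.tfvars shipped with the viya4-iac-azure project. The exact field names, machine types, and defaults can differ between releases of the project, so treat this as a sketch rather than a reference:

node_pools = {
  cas = {
    "machine_type" = "Standard_E16s_v3"
    "os_disk_size" = 200
    "min_nodes"    = 1
    "max_nodes"    = 4
    "node_taints"  = ["workload.sas.com/class=cas:NoSchedule"]
    "node_labels"  = {
      "workload.sas.com/class" = "cas"
    }
  },
  compute = {
    "machine_type" = "Standard_E16s_v3"
    "os_disk_size" = 200
    "min_nodes"    = 1
    "max_nodes"    = 2
    "node_taints"  = ["workload.sas.com/class=compute:NoSchedule"]
    "node_labels"  = {
      "workload.sas.com/class" = "compute"
    }
  }
}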

 

So, by default, the IaC tool provisions and configures the nodes in accordance with the recommendations of the official documentation: labeling and tainting all the nodes to match the five defined workload classes.

 

Cluster autoscaler limitations

CAS Auto-Resourcing

We explained the CAS auto-resources mechanism above: it does a good job of ensuring that each CAS pod runs on its own dedicated node and can take advantage of most of the node's resources.

 

However, being so directive about the way the pods operate can sometimes cause other issues.

 

Customers in the Cloud might want to set the minimum node count to 0 in all their node pools, so that when the SAS Viya services are stopped or scaled down, the nodes are automatically "deprovisioned" by the AKS cluster autoscaler – which helps with costs.


This "autoscaling" capability is probably one of the most important "Cloud native" characteristics expected of applications running in Kubernetes, as it allows you to pay for the infrastructure only when you need it.

 

Until recently, the "cas-server/auto-resources" feature prevented you from using AKS node pool autoscaling from 0. The good news is that this has been fixed in the Stable 2021.1.1 version: the CAS operator now takes into account scenarios where the minimum node count has been set to 0 in the CAS node pool.

Compute server

It is also recommended NOT to configure the "compute" node pool autoscaling with a minimum node count of 0, because the deployment relies on a node with the "compute" workload class label being present to pre-pull the compute image (as explained above).

 

If no such node exists at deployment time, it will likely be provisioned at the first SAS Studio session request, and once provisioned, the compute image will have to be downloaded onto the new node, so the end user will have to wait for a very loooooonnnnnng time before being able to type some code in SAS Studio.
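
In practice, with the viya4-iac-azure variables shown earlier, this simply means keeping at least one node in the compute pool (again, the field name is the one used by the IaC project at the time of writing and may change):

compute = {
  ...
  "min_nodes" = 1   # keep one compute node up so the pre-pull DaemonSet has somewhere to cache the image
  ...
}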

 

 

"User" and "System" node pools

In AKS (Azure Kubernetes Service), there is also a concept of "system" and "user" node pools (see the Azure documentation). AKS automatically assigns the label kubernetes.azure.com/mode: system to the nodes in the "system" node pool.

 

This causes AKS to prefer scheduling "system" pods (critical system pods such as CoreDNS) on node pools that carry this label.

 

When a new AKS cluster is created, a "system" node pool is automatically created to host these critical system components. Node pools that are created and added afterwards are "user" node pools (by default).
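
You can see which of your nodes belong to the system pool by displaying that label as a column:

kubectl get nodes --label-columns=kubernetes.azure.com/mode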

 

Something important to know is that, by default, the SAS Viya workloads will NOT be scheduled to run on the "system" node pool in Microsoft Azure.

 

Indeed, all the SAS Viya pods have a little something that prevents them from landing on the AKS "system" nodes.

 

It is a "hard" node affinity rule that instructs the Kubernetes scheduler to only place the pod onto nodes that do NOT have the label kubernetes.azure.com/mode with a value of system.

 

rp_7_not-insystem.png

 

The use of the NotIn operator is what tells the scheduler to avoid nodes carrying this label.
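
In plain YAML, such a rule has roughly the following shape (a minimal sketch of the pattern shown in the screenshot above, not a copy of the actual SAS Viya manifests):

affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
      - matchExpressions:
        - key: kubernetes.azure.com/mode
          operator: NotIn
          values:
          - system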

 

So, don’t be surprised if no SAS Viya pods can be scheduled or rescheduled there. It is by design. 🙂

If you are not aware of this, you could be in for some surprises…

 

Here is an example

Imagine that you deployed SAS Viya following the default node pool configuration and you end up with 11 "user" nodes (4 in the CAS "MPP" node pool, 1 in the "compute" node pool, 1 in the "connect" node pool, 2 in the "stateful" node pool and 4 in the "stateless" node pool for all the microservices).

 

rp_8_taints-example.png

 

You also have two “system” nodes, so a total of 13 nodes.


The user node pools have the default taints corresponding to our 5 SAS Viya workload classes (cas, compute, connect, stateful and stateless).

 

Now, you realize that your customer has a very limited need for SAS/CONNECT, and you decide that having a dedicated node just for connect is too much.

 

In that case, you want to decommission the "connect" node. The way to do it in Kubernetes is to "cordon" the node and then "drain" it, which forces any pod running there to be evicted and rescheduled somewhere else. Actually, only one pod should be running on our "connect" node: the sas-connect-spawner pod.

 

Now guess what? If you do that, your sas-connect-spawner pod will stay pending forever...

 

Why?

 

Well, where can it go?

 

Yes… nowhere 😟 Unfortunately, the nodes' taints prevent it from landing on the nodes of our five "user" node pools, and the node hard affinity (described above) prevents it from landing on one of the "system" nodes.

 

The best way to fix the issue is to add a toleration to the sas-connect-spawner pod definition so that it can run, for example, on the "compute" node... but if you were not aware of these subtleties, you might be surprised when you perform this kind of operation.
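
As a hedged illustration only (the resource kind, name, and patch path below are assumptions about your deployment and should be checked against your own manifests), such a toleration could be added with a kustomize patch along these lines:

# site-config/connect-spawner-compute-toleration.yaml
apiVersion: builtin
kind: PatchTransformer
metadata:
  name: add-compute-toleration-to-connect-spawner
patch: |-
  - op: add
    path: /spec/template/spec/tolerations/-    # assumes a tolerations list already exists
    value:
      key: workload.sas.com/class
      operator: Equal
      value: compute
      effect: NoSchedule
target:
  kind: Deployment                             # assumption: sas-connect-spawner runs as a Deployment
  name: sas-connect-spawner

You would then typically reference this file in the transformers section of your kustomization.yaml and rebuild and apply the manifest for it to take effect.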

 

Conclusion

Workload placement is a key, core topic of the SAS Viya (2020.1 and later) deployment and architecture: it underpins the software topology and is related to many other aspects (node pools, autoscaling, node maintenance).

 

But it is a complex topic, and it might also be very new to you if you don't have previous Kubernetes experience or knowledge.

 

Take the time to familiarize yourself with these concepts and play with them (change tolerations, node taints and labels, and see what happens), because they are representative of this new Cloud/Kubernetes world where Viya now plays.

 

Find more articles from SAS Global Enablement and Learning here.

