
Where SAS Viya Relies on Kubernetes for Workload Placement

Started 05-23-2022
Modified 07-11-2022

SAS Viya relies on Kubernetes to provide an environment that is elastic, highly available, and extensible. One way that Kubernetes achieves these goals is through sophisticated workload management controls. By default, SAS Viya takes advantage of these controls in specific ways to ensure a smooth software experience. But there's no one-size-fits-all way to deploy SAS Viya, so understanding these controls can help you fine-tune the environment to your site's specific needs.

 

Let's take a quick spin through several of these Kubernetes workload management controls and see how SAS Viya implements them to get the job done.

 

Requests and Limits

 

Of course, running in a Kubernetes (k8s) environment means that SAS Viya software runs in pods, and each pod consists of one or more containers. Those containers can optionally set boundaries on the amount of CPU and memory they expect to use.

 

A container can establish a request for CPU and/or memory, which k8s essentially treats as a minimum that it'll try to reserve in the cluster. The analogy is like making reservations for dinner at a restaurant - and arbitrarily requesting too much CPU or RAM without using it is like booking a large party table but not showing up with enough people. Containers can, of course, use more CPU or RAM than initially requested.

 

We can also establish limits for CPU and/or memory usage of containers. This means the runtime prevents the container from using more than the configured limit. So, for example, if a 2GB memory limit is set, but then the container attempts to use more than that, the system kernel's Out of Memory (OOM) Killer might terminate the process.

 

If you look at a pod's spec, you might see requests and limits defined similar to:

 

resources:
  limits:
    cpu: '2'
    memory: 3Gi
  requests:
    cpu: 50m
    memory: 1536Mi

 

In this example, the pod will request 50 "millicores" of CPU (i.e. 5% of a single CPU's capacity) and 1,536 mebibytes (i.e. 1,610,612,736 bytes) of RAM. Over the course of the pod's lifecycle, it can use more than these original requests, up to the limits shown (2 full CPUs and/or 3 gibibytes of memory).

 

Quality of Service

 

Quality of Service (QoS) is a classification system which determines the scheduling (and eviction) priority of pods in a k8s cluster. QoS is not something set explicitly - instead it's determined based on requests and limits defined in the deployment. There are 3 classes of QoS:

 

Guaranteed

 

When a pod is assigned a Guaranteed QoS class, k8s will only schedule it to nodes which have sufficient memory and CPU resources to satisfy its requests/limits.

 

A pod's QoS class will be Guaranteed when:

 

  • Every container in the pod has both a memory request and a memory limit
  • The memory request equals the corresponding limit in every container of the pod
  • Every container in the pod has both a CPU request and a CPU limit
  • The CPU request equals the corresponding limit in every container of the pod.
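Putting those criteria together, a container spec that would earn the Guaranteed class might look like the following sketch (the specific values here are illustrative, not from any SAS Viya deployment):

```yaml
# Illustrative container resources block: requests equal limits for
# both CPU and memory, so Kubernetes assigns the Guaranteed QoS class.
resources:
  limits:
    cpu: '2'
    memory: 4Gi
  requests:
    cpu: '2'
    memory: 4Gi
```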

 

In a typical SAS Viya deployment, we don't usually see pods or jobs with a Guaranteed QoS. It might be worth considering Guaranteed QoS for crucial stateful services, like Postgres.

 

Burstable

 

When a pod's QoS class is Burstable, then k8s will schedule it to any node that has available resources. Limits per pod are enforced, but the combined limits of all pods on a node can exceed that node's capacity. Usually you'll see these nodes with total requests within the available capacity while the combined limit total is much more.

 

A pod's QoS class will be Burstable when:

 

  • It isn't Guaranteed QoS
  • At least 1 container in the pod has either a CPU or memory request or limit.

 

For a typical SAS Viya deployment, most pods or jobs will have a Burstable QoS.

 

BestEffort

 

Pods assigned a QoS of BestEffort are scheduled to run on any node that has available resources, and they can use any amount of free CPU and/or memory on the node. While flexible, this requires care to ensure that such pods aren't resource hogs which contend with other pods and degrade service.

 

A pod's QoS class will be BestEffort when:

 

  • It isn't Guaranteed or Burstable QoS
  • No container in the pod has defined CPU or memory requests or limits.
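In other words, a BestEffort pod simply omits resources entirely. A minimal illustrative pod spec (the name and image here are placeholders):

```yaml
# Illustrative pod spec: no resources block in any container,
# so Kubernetes assigns the BestEffort QoS class.
apiVersion: v1
kind: Pod
metadata:
  name: besteffort-example
spec:
  containers:
    - name: app
      image: busybox
      command: ['sleep', '3600']
      # no resources.requests or resources.limits defined
```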

 

In a typical SAS Viya deployment, pods or jobs which handle lower-priority background tasks will get BestEffort QoS. These include tasks like regularly scheduled backups, image pullers, update checkers, and so on.  

 

Labels and Taints

 

There are often times when you want certain pods to run on specific nodes of your k8s cluster. CAS, in particular, is a great example because it's often desirable to run it on larger nodes with many CPUs and maximum RAM - very different from the nodes needed to run the other components of a SAS Viya deployment.

 

Labels and nodeSelector

 

Labels are used in k8s to help schedule pods to the correct nodes. For SAS Viya, we assign labels to nodes of the cluster to associate pods with machines which are optimized (or simply reserved) for that use. Using the label "workload.sas.com/class", SAS Viya defines several workload classes. Besides cas, there are also workload classes called stateless, stateful, compute, and (sometimes) connect. You'll usually see SAS Viya nodes with the label "workload.sas.com/class" set to one of those values.

 

In your pod specification, nodeSelector is the most straightforward way to associate a pod with nodes using labels. All labels specified in the nodeSelector must be present on a node for the pod to be scheduled to it.
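As a sketch, a pod spec fragment using nodeSelector to target nodes carrying the SAS Viya compute workload class label might look like:

```yaml
# Illustrative pod spec fragment: the scheduler will only place this
# pod on nodes that carry the label workload.sas.com/class=compute.
spec:
  nodeSelector:
    workload.sas.com/class: compute
```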

 

One particular example of the nodeSelector approach is used when the SAS Workload Manager add-on is licensed. Specifically, the SAS Workload Orchestrator daemonset only runs on nodes with the label "workload.sas.com/class=compute" (doc). If that label isn't defined on any node, then SAS Workload Orchestrator is unable to start a SAS Compute Server.

 

Labels are great when you want to steer pods toward specific nodes, but what if you want to keep other pods away from particular nodes?

 

Taints and Tolerations

 

Taints are used in k8s to set nodes aside for use only by specific pods. Only pods with a toleration for the taint can run there.

 

This approach can be used for CAS. We can define nodes for CAS with the label "workload.sas.com/class=cas", but that alone doesn't reserve them for CAS' use; other pods might end up there and we don't want that. So we also place the taint "workload.sas.com/class=cas:NoSchedule" on the nodes where we want CAS to run. This taint directs k8s to allow only pods with a matching toleration - like the CAS pods - and excludes other pods from running there. Because CAS is a heavy-duty, high-performance analytics engine, we want to ensure maximum value is gained by reserving those nodes to run only CAS work.
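To sketch how the two halves fit together, here's an illustrative node taint and the corresponding pod toleration (the structure follows the standard Kubernetes taint/toleration schema):

```yaml
# On the node object: the taint that keeps non-CAS pods away.
spec:
  taints:
    - key: workload.sas.com/class
      value: cas
      effect: NoSchedule
---
# On the pod: the toleration that permits scheduling onto
# nodes carrying that taint. Pods without it are excluded.
spec:
  tolerations:
    - key: workload.sas.com/class
      operator: Equal
      value: cas
      effect: NoSchedule
```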

 

We recommend using taints sparingly in a k8s environment. It's not hard to over-taint the environment such that pods eventually won't have a place to run.  

 

Affinities, Inter-pod Affinity, and Anti-Affinities

 

Labels and taints are great, but in many ways they follow a relatively strict set of binary rules: allow or not, schedule or evict. There are cases where we might want to give k8s more flexibility in determining where pods can run - a preference, but not a hard rule. This is where affinities and anti-affinities come in. They have a flexible expression syntax to accommodate more nuanced situations.

 

Node affinity

 

Node affinity is pretty similar to nodeSelector except you can qualify how strict the rule is with:

 

  • requiredDuringSchedulingIgnoredDuringExecution: k8s will only schedule the pod to a node if the rule is met (very similar to nodeSelector)
  • preferredDuringSchedulingIgnoredDuringExecution: k8s will attempt to schedule the pod to a node that satisfies the rule, but if no matching node is available, then k8s will still schedule the pod to a node with available resources.
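For instance, an illustrative pod spec fragment using the "required" (hard) variant to insist on nodes labeled for the CAS workload class might look like:

```yaml
# Illustrative node affinity rule: the pod may only be scheduled to
# nodes whose workload.sas.com/class label has the value "cas".
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
          - matchExpressions:
              - key: workload.sas.com/class
                operator: In
                values:
                  - cas
```

Swapping in preferredDuringSchedulingIgnoredDuringExecution (with a weight) would turn this hard requirement into a preference.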

 

By default, SAS Viya deploys using the "preferred" affinity approach. For production environments with hardware optimized for specific workloads (like CAS), you may want to enable the "required" affinity approach for some pods.

 

See Mike Goddard's post, What happened? That doesn’t look like the SAS Viya deployment I wanted!, for much more information about configuring your SAS Viya deployment to place workload as desired.

 

Inter-pod Affinity and Anti-Affinities

 

When you want pods to run together on the same node, you're describing inter-pod affinity. And when you want pods to avoid each other and run on different nodes, that's anti-affinity. These affinity rules don't look at the labels on the nodes, but those on the pods themselves. Just like node affinity rules, inter-pod affinity and anti-affinity offer:

 

  • requiredDuringSchedulingIgnoredDuringExecution
  • preferredDuringSchedulingIgnoredDuringExecution.

 

For typical deployments, SAS Viya configures pod affinities in several places, but most notably for stateful services (like RabbitMQ and Consul) to establish podAntiAffinities to prevent multiple instances of redundant services from running together on the same node. That way, if one node goes offline unexpectedly, it won't wipe out a quorum of a service's running instances.
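As a sketch of that pattern, a podAntiAffinity rule spreading replicas of one service across nodes might look like the following (the app=consul label selector here is a hypothetical example, not SAS Viya's actual labeling):

```yaml
# Illustrative anti-affinity rule: prefer that pods matching the
# (hypothetical) label app=consul land on different nodes, so one
# node failure can't take out a quorum of the service's replicas.
spec:
  affinity:
    podAntiAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
        - weight: 100
          podAffinityTerm:
            labelSelector:
              matchLabels:
                app: consul
            topologyKey: kubernetes.io/hostname
```

The topologyKey of kubernetes.io/hostname is what makes "different nodes" the unit of spreading; other keys (like a zone label) would spread across zones instead.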

 

Beyond Kubernetes, Look to SAS Viya

 

Configuring k8s for workload placement works really well up to a certain point. But there are areas where k8s has limited ability to "see" what's going on. Instead of trying to manage the workload at the infrastructure level, configuration needs to take place higher, at the application level.

 

And within SAS Viya we have a multi-level approach with many levers and switches to configure workload placement.

 
