When SAS Viya 2024.02 was released, it added support for Kubernetes 1.28 on all supported providers.
However, as noted in the "What’s new" page, Kubernetes 1.28 has implemented a change in behavior that could affect the CAS server if your cluster is configured with Linux cgroups v2.
In this post, after explaining how Kubernetes uses cgroups to better control the cluster's workload and why this Kubernetes change is likely to impact CAS server availability, we'll discuss the situations and use cases where this problem could occur and review the possible workarounds.
But if you just want the TL;DR version, jump straight to the conclusion paragraph 🙂
Cgroups (also known as "control groups") on Linux provide a way to control how resources such as CPU time and memory are shared among the processes running on the system.
Kubernetes relies on cgroups (either v1 or v2) to enforce the CPU and memory limits for containers, which are defined in the "resources" sections of the pod specifications.
Here is how it works:
Sometimes a picture is worth a thousand words, so here is a nice one about cgroups that was posted on X (formerly known as Twitter).
(source: https://twitter.com/b0rk/status/1214341831049252870)
In the SAS Viya deployment manifests, memory and CPU limits are defined in the resource definitions to ensure that the Viya pods (including the CAS pods) do not use more CPU or memory than expected.
For example, in the CAS container: if one of the processes (each corresponding to a CAS session) starts to use more memory than allowed, the OOM killer terminates that process and the CAS session ends, preserving the CAS pod and the underlying node.
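To make this concrete, here is what such a resources section looks like for a container. The values below are placeholders for illustration, not the ones shipped in the SAS Viya manifests; on cgroups v2 nodes, the kubelet translates these limits into the container cgroup's settings (memory.max and cpu.max).

# Illustrative container resources section with placeholder values
# (not the SAS-provided ones). The kubelet maps the limits onto the
# container's cgroup, e.g. memory.max and cpu.max with cgroups v2.
resources:
  requests:
    cpu: "2"
    memory: 8Gi
  limits:
    cpu: "4"
    memory: 16Gi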
But Kubernetes 1.28 introduces a significant change in its handling of cgroups v2:
If using cgroups v2, then the "cgroup aware OOM killer" is enabled for container cgroups via `memory.oom.group` . This causes processes within the cgroup to be treated as a unit and killed simultaneously in the event of an OOM kill on any process in the cgroup.
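If you want to verify whether this grouped kill behavior is active for your CAS server, the flag can be read directly from the container's cgroup files. This is only a quick check, assuming your nodes run cgroups v2 and that the namespace, pod, and container names below match your deployment; a value of 1 means the cgroup aware OOM killer is enabled for that container.

# Read the memory.oom.group flag from inside the CAS container
# (namespace, pod, and container names are examples; adjust to your deployment).
kubectl -n viya exec sas-cas-server-default-controller -c cas -- \
  cat /sys/fs/cgroup/memory.oom.group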
If I had to represent this new "cgroup aware OOM killer" with our little cartoon, it would look like this:
(source: https://twitter.com/b0rk/status/1214341831049252870)
Starting with Kubernetes 1.28, if cgroups v2 is enabled: instead of killing an individual CAS session process when the sas-cas-server container memory limit is exceeded, this new "cgroup aware OOM killer" kills ALL the processes that belong to the CAS cgroup (that is, all the other CAS processes running on the node), so the whole controller or worker pod is terminated.
You might already have faced this CAS OOM problem in the field. It is a known issue that is documented in the "CAS Troubleshooting" section of the official documentation.
Here is an extract:
Explanation: This error message indicates that a session was killed. One possible cause is that the out-of-memory (OOM) killer terminated a session because of high memory use.
But while users could previously just restart their CAS session and try again when the problem arose, this event can be much more disruptive in Kubernetes 1.28 (because of the memory.oom.group change)…
In SAS Technical Support's experience, some CAS actions started by Model Studio or the Risk solutions, for example, are frequent CAS killers because of their intensive memory usage. When that memory usage is not backed by enough space in the CAS_DISK_CACHE, a single session can exceed the CAS container's configured memory limit and trigger the OOM killer.
If your environment is already under pressure from CAS sessions and users, you can check your logs to see whether these OOM events have occurred.
Here is what you will see in the sas-cas-server container log if you hit the CAS OOM issue:
Child terminated by signal: PID nnn, signal 9, status 0x00000009
You can also check the CAS node(s) system log and search for messages like:
Feb 14 16:59:38 sasnode06 kernel: [11768.075646] Memory cgroup out of memory: Killed process 525176 (cas) total-vm:6054896kB, anon-rss:3983660kB, file-rss:60768kB, shmem-rss:4kB, UID:1001 pgtables:10980kB oom_score_adj:1000
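If you have shell access to the nodes, a quick way to search the kernel messages for these events is the following (assuming systemd-journald is in use on the node):

# Search the node's kernel log for cgroup OOM kills.
journalctl -k | grep -i "memory cgroup out of memory"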
If you see these kinds of messages in your logs and are using cgroups v2, then you are very likely affected by the change coming with Kubernetes 1.28.
If you are already facing these OOM issues in your environment today, one option is to increase the resources on the CAS nodes to resolve the OOM events, but other workarounds are provided below.
The OOM killer is triggered as soon as the sum of the memory used by the CAS processes on the node exceeds what has been set as the sas-cas-server container's memory limit (the "resources" section in the pod's definition).
While before Kubernetes 1.28 this would affect only one CAS session and leave the others alive, it now kills the whole CAS server (for SMP CAS) or the whole CAS worker pod (for MPP CAS), bringing down all the CAS sessions running there.
So one way to significantly reduce the chances of experiencing the problem is to run CAS with a sufficient memory allocation (whether with a Guaranteed Quality of Service or with the CAS auto-resources configuration) and to provision additional capacity in case the OOM issue appears.
For example, with the default CAS auto-resources configuration (as explained in this post), the sas-cas-server container's memory limit is set to the amount of available RAM on the underlying CAS node.
With the CAS auto-resources configuration, if one or more CAS session processes cause an OOM, it means they are using more memory than the available RAM on the machine, which is a good indicator that the infrastructure is not adequate for the workload… Even without the specific "cgroup aware OOM killer" issue, in such a situation the whole CAS instance might be killed anyway, as Kubernetes would evict the CAS pod to prevent exhaustion of the node's resources.
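To check what memory limit the sas-cas-server container actually ended up with (for example, the value computed by the auto-resources configuration), you can query the CAS pod. The namespace and pod name below assume the default CAS deployment and may differ in your environment.

# List each container of the CAS controller pod with its resource limits
# (namespace and pod name are assumptions based on the default deployment).
kubectl -n viya get pod sas-cas-server-default-controller \
  -o jsonpath='{range .spec.containers[*]}{.name}{": "}{.resources.limits}{"\n"}{end}'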
A PR (pull request) has been opened in the Kubernetes GitHub repository to add a "singleProcessOOMKill" flag to the kubelet configuration. Setting it to "true" enables single-process OOM killing in cgroups v2: in this mode, if a single process is OOM killed within a container, the remaining processes are not OOM killed.
These changes are currently targeted at Kubernetes 1.30. So another option would be to stay on Kubernetes 1.27 and wait for a newer Kubernetes version that provides a way to disable the "cgroup aware OOM killer". According to the current plan, support for Kubernetes 1.30 could be added in September with SAS Viya 2024.09 (with support for 1.27 dropped at the same time).
However, it might not be easy to change the kubelet configuration (especially in cloud-managed Kubernetes environments), and this new feature remains to be confirmed.
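For clusters where the kubelet configuration can be customized, and assuming the flag ships with the name proposed in the PR, the setting would likely land in the KubeletConfiguration file along these lines. This is only a sketch: the flag is not available in released Kubernetes versions at the time of writing, and its final name or default could still change.

# Hypothetical KubeletConfiguration excerpt based on the open PR:
# singleProcessOOMKill is NOT available in Kubernetes 1.28/1.29.
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
singleProcessOOMKill: true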
Another problem is that jumping from Kubernetes 1.27 to 1.30 is a major change and may not be aligned with the customer's upgrade strategy for their Kubernetes environment (except perhaps where they opted for an LTS tier, if one is made available for Kubernetes 1.27 by their cloud provider).
Changing the cgroups type might be a better option.
Remember that this problem only affects CAS nodes configured to use cgroups v2. So if you are using workload placement for your CAS pods, ensuring that the nodes in your CAS node pool use cgroups v1 is a way to avoid the issue.
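As a reminder, CAS workload placement is typically driven by the workload.sas.com/class node label (and, optionally, a matching taint), so dedicating cgroups v1 nodes to CAS could look like the sketch below (the node name is a placeholder).

# Sketch: dedicate a cgroups v1 node to CAS workloads (placeholder node name).
kubectl label nodes cas-node-1 workload.sas.com/class=cas
kubectl taint nodes cas-node-1 workload.sas.com/class=cas:NoSchedule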
If you are not sure which cgroups implementation is used in your environment, here is a command that you can run on your CAS nodes to determine whether cgroups v1 or cgroups v2 is in use:
$ stat -c %T -f /sys/fs/cgroup
Example output for cgroups v1 implementation:
tmpfs
Example output for cgroups v2 implementation:
cgroup2fs
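If you cannot open a shell on the nodes directly, the same check can be run from a node debug pod (assuming kubectl debug is available in your cluster and an Ubuntu image can be pulled; the node's root filesystem is mounted under /host in the debug container):

# Run the cgroups check from an ephemeral debug pod on the node
# (node name is a placeholder; the output matches the examples above).
kubectl debug node/cas-node-1 -it --image=ubuntu -- stat -f -c %T /host/sys/fs/cgroup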
While most cloud-managed Kubernetes services now use cgroups v2 by default, it is still possible to switch back to cgroups v1.
The table below summarizes the cgroups implementation and the ways to switch to cgroups v1 for the Kubernetes distributions currently supported for a SAS Viya deployment (the information comes from SAS R&D's work on this topic).
| Kubernetes implementation | Notes |
|---|---|
| AKS (Azure) | A DaemonSet can be used to fall back to cgroups v1; see this page for details and instructions. |
| EKS (AWS) | According to this page, at present EKS does not offer an optimized AMI that supports cgroups v2; its availability is currently being tracked in this GitHub issue. However, it remains possible to use other images (AMIs) to configure EKS with cgroups v2. |
| GKE (GCP) | When the GKE cluster is provisioned, you can choose between cgroups v1 and cgroups v2 in the node configurations. The default is cgroups v2 (see the sketch after this table). |
| OCP (Red Hat OpenShift) | As of OpenShift Container Platform 4.14, OpenShift Container Platform uses Linux control group version 2 (cgroups v2). But it is also possible to configure cgroups v1 in OpenShift. |
| Upstream open-source Kubernetes (OSS) | Cgroups are discussed in the Kubernetes project documentation. The kubelet and the underlying container runtime rely on the cgroups version that is enabled on the underlying Linux nodes (see above to determine which cgroups version is used in your case). |
One of my SAS R&D colleagues gave a very good summary of the issue that is discussed in this blog post:
“On Kubernetes 1.27, the OOM killer would just kill the PID that had the worst OOM score. Which usually/hopefully, would just be a CAS Session that is hogging a lot of memory. Therefore just that session would die, but the rest of the users would be unaffected. In 1.28, it kills all the PIDS which take down the whole CAS server.”
While this change in Kubernetes presents a real risk for customers with a significant CAS workload, the good news is that this specific issue was detected quite early, researched, reviewed, communicated, and well documented. SAS R&D as a whole has done a great job of providing workarounds and guidance for customers if the issue is seen in the field.
In a companion follow-up post, I will get my hands dirty and show how to reproduce this issue with a simple SAS program, highlighting the differences between an environment that is affected by this change and one that is not. I will also show how to implement the cgroups v1 workaround.
Find more articles from SAS Global Enablement and Learning here.
@RPoumarede thank you very much for this valuable information, and indeed it is great work by SAS R&D. We have checked inside our organization and know that for the next Kubernetes update to version 1.30 we should be fine. Please post an update when there is a solution in place for this issue.