This post is a follow-up to a previous blog post in which we explained how recent changes in Control Group v2 management in Kubernetes have impacted the stability of the CAS server in the SAS Viya platform.
In that post, we discussed some workarounds, but the latest versions of SAS Viya now include a new feature in the SAS software that avoids this issue in most cases.
This new feature is called "Backing Store for CAS Memory Allocations", and this post explains how it works and how it can be set up to mitigate the impact of CAS OOM (Out of Memory) issues in the Viya environment.
This post contains a lot of details and is fairly technical, so if you don't have time to read the whole article, here is a little “TL;DR” summary 😊
A new CAS feature was introduced in SAS Viya 2024.09 to limit CAS memory allocation. If it is enabled, when a CAS action uses more memory than a defined limit, the action is interrupted with a failure and a meaningful error message, but the CAS session remains active and the Linux OOM killer is not triggered (which avoids the situation where the OOM killer performs a “group” kill of every process in the CAS container, leading to a CAS service interruption). It greatly improves the stability and reliability of the CAS server service.
The technical mechanism used behind the scenes to limit CAS memory allocation (before the OOM killer kicks in) is to back the memory allocations onto a special Kubernetes volume (an emptyDir of “Memory” medium, i.e. a RAM-backed filesystem) for which we can set a SizeLimit value. When this value (usually set to 80% of the CAS container's memory limit) is exceeded, the CAS action is interrupted but the CAS session and services remain available for the end users.
The "Backing Store for CAS Memory Allocations" feature was first introduced with SAS Viya stable 2024.09 (which means that it is also included in the latest SAS Viya LTS 2024.09) and instructions where provided to patch or edit the CASDeployment Custom resource to enable it.
Then with the new stable 2024.11 version, new PatchTransformers were added, so it is now possible to manually enable the backing store for CAS memory as part of the initial deployment of SAS Viya (with the proper configuration in the kustomization.yaml file).
The current plan is to make the backing store the default in upcoming versions (in the first months of 2025).
We know that the CAS_DISK_CACHE can be used to cache CAS tables on disk (as SASHDAT files) using memory-mapped files. CAS can leverage the CAS_DISK_CACHE to quickly and efficiently "swap in" and "swap out" memory blocks as needed, and hence hold a volume of data that is larger than the amount of physical memory on the machine. (Read these nice posts from Rob Colum and Nicolas Robert if you want to know more about how and when the CAS Disk Cache is used.)
But in CAS, an action also allocates "resident" (physical) memory in order to process a table, for example to create computed columns or views that require RSS memory to hold rows and columns.
For this part of the analytics processing, the CAS_DISK_CACHE is not used; CAS allocates memory using the Threaded Kernel (TK), and the memory is obtained through the "mmap" system call with the MAP_ANONYMOUS flag. With this flag, the pages are only backed by either real memory or the paging files (with MAP_ANONYMOUS the mapping is not backed by any file).
However, in most Kubernetes systems, containers do not have a configured paging file, so CAS can only use real memory. This increases the risk of the CAS session processes (running in containers) being terminated by the OOM killer as soon as the container's defined memory limit is exceeded.
Having session processes forcibly killed is not a great experience for the end users.
As an aggravating factor, we know (from my previous blog) that, starting with recent Kubernetes versions (1.28), when a single CAS session process is killed by the OOM killer, all the other processes in the same cgroup (v2) are also killed, which causes the whole CAS server (SMP) or individual CAS workers (MPP) to go down, impacting the CAS service.
While it is understandable that, from time to time, a CAS action could fail (because the system is not equipped with enough memory to run it), it is hardly acceptable that the failure of a single CAS action causes the entire CAS deployment to restart and interrupts the CAS service...
That's the problem that is addressed by this new "Backing Store for CAS Memory Allocations" feature.
It is now possible to enable a "backing store" to support CAS memory allocations and prevent the whole system from being impacted when a single action requires more memory than has been made available to the CAS container.
With this feature, the CAS Threaded Kernel (TK) is informed that the files in a specified directory can be used to back most of the memory allocations.
The TK_BACKING_STORE_DIR environment variable is set in the CAS container and points to a specific path where the Threaded Kernel stores the mapping files (for example, /cas/tkMemory).
The path is mounted in the pod and mapped to a Kubernetes emptyDir volume.
As noted in the official Kubernetes documentation: "The emptyDir is created when a pod is assigned to a node and is initially empty. When a Pod is removed from a node for any reason, the data in the emptyDir is deleted permanently".
By default, the content of emptyDir volumes is stored on the node's root disk (typically under /var/lib/kubelet). However, if the emptyDir.medium field is set to "Memory", the documentation explains that Kubernetes mounts a tmpfs filesystem (a RAM-backed virtual filesystem) instead of using the disk.
A size limit can then be specified to limit the capacity of the emptyDir volume. That's how we can control/restrict the memory consumption before it is too late...
In the CASDeployment custom resource specification, the result would look roughly like the sketch below. Note that this is an illustration rather than the exact content produced by the SAS-provided transformers: the volume name and the 24Gi size limit are example values.
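```yaml
# Illustrative fragment of a CASDeployment pod template with the backing store enabled.
# The volume name (cas-backing-store) and the 24Gi sizeLimit are example values.
spec:
  controllerTemplate:
    spec:
      containers:
        - name: cas
          env:
            - name: TK_BACKING_STORE_DIR    # tells the CAS Threaded Kernel where to create its memory-mapping files
              value: /cas/tkMemory
          volumeMounts:
            - name: cas-backing-store       # RAM-backed volume mounted at the backing store path
              mountPath: /cas/tkMemory
      volumes:
        - name: cas-backing-store
          emptyDir:
            medium: Memory                  # tmpfs: pages live in RAM, not on the node's disk
            sizeLimit: 24Gi                 # cap on the backing store size (typically ~80% of the container memory limit)
```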
The backing store for CAS memory allocations relies on the SizeLimit parameter as THE way to control the maximum amount of memory that a CAS action can use, without involving the OOM killer.
With the backing store feature enabled, when a CAS action is submitted to CAS, the action starts to map the memory files in the emptyDir, and if the size of the memory-mapped files exceeds the defined SizeLimit (24Gi in the example above), the CAS action fails individually.
The error message shown to the end users (for example in SAS Studio or any other CAS client) will inform them that the CAS action has failed because it ran out of memory.
It will look like this:
Or like that:
With this new behavior, the problem is addressed before involving the OOM killer and seeing it terminate all the nearby CAS processes. It helps ensure that the CAS server and even the CAS session itself suffer less impact from large-memory tasks, allowing the end-users to continue to work with CAS.
The “fail fast” principle is used here to prevent the OOM killer issue and improve the CAS overall stability.
Finally, the release of the files in the backing store follows the same pattern as releasing memory back to the operating system without this feature enabled. When the CAS session process terminates, all of the files used to back the TK memory are released (which ensures there will be no leaked resources).
While most of the use cases benefit from the “Backing Store for CAS Memory Allocations” feature to prevent OOM kills, there are some exceptions.
Not all memory consumption is constrained by the size of the backing store; some types of allocations bypass it.
For example, Python or Java code running inside the CAS session does not go through the CAS TK memory allocation and, as such, is not subject to the memory control via the TK backing store.
The SizeLimit value defines the maximum amount of memory that can be used by CAS actions before causing an "out of memory" failure.
For the CAS backing store to be effective, it is important that this threshold is lower than the threshold that would trigger the OOM killer. The goal is to ensure that the action fails and displays an "out of memory" error message to the end user running it, before the CAS container memory limit is exceeded and all processes are killed.
Testing with the default CAS auto-resources configuration has shown that setting the backing store size limit to about 80% of the memory available on the node seems to be effective in preventing OOM kills in most situations. In situations where the CAS resource requests and limits are manually set (usually to place more than one CAS pod on a node), it is recommended to use a similar fraction (80%) of the container memory limit.
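As a rough illustration of that sizing rule (the numbers below are made up for the example and are not SAS recommendations), a minimal pod-spec fragment could look like this:

```yaml
# Fragment only, with made-up values: a CAS container manually limited to 32Gi
# of memory, and a backing store sized at roughly 80% of that limit.
spec:
  containers:
    - name: cas
      resources:
        requests:
          memory: 32Gi
        limits:
          memory: 32Gi
  volumes:
    - name: cas-backing-store     # same illustrative volume name as in the earlier sketch
      emptyDir:
        medium: Memory
        sizeLimit: 26Gi           # ~80% of the 32Gi container memory limit
```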
In CAS, it is possible to implement resource management through policies that are created with the Viya CLI. You can create up to five "priority-level" policies per CAS server, which are used to place space quotas on table data. Each “priority-level-n” policy is associated with a group of users. If you enable this kind of configuration, it is also possible to align the backing store configuration with the policies and define distinct TK memory allocation limits for each policy.
Finally, note that SAS R&D is also currently researching additional ways to prevent the OOM killer from terminating CAS processes. One alternative workaround is based on the use of a Kubernetes Daemonset to overwrite the cgroups configuration files created by Kubernetes. The issue with the OOM "group-kill" change in Kubernetes is not specific to CAS, other applications have been impacted and a pull request to allow configuration of the "group OOM kill" behavior should be included in Kubernetes 1.32.
When the "Backing Store for CAS Memory Allocations" capability was introduced for the first time in SAS Viya, the official way to enable it was to patch the CASDeployment Custom resource deployment or to directly edit the CASDeployment Custom resource.
But since version 2024.11, there is an official Kustomize PatchTransformer to apply the changes. A new "Configure a Backing Store for Memory Allocations" paragraph has been added in the "Optional Customization" section of the official documentation.
There are actually four distinct transformers; which one to apply depends on the CAS configuration that is in place.
As usual, once you have copied and, if needed, adjusted the values in the transformer YAML file, don't forget to reference it in the transformers section of your main kustomization.yaml file.
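For example (the site-config path below is an assumption about your own layout; use the location where you copied the transformer file):

```yaml
# kustomization.yaml (fragment) – reference the copied transformer file;
# the site-config path shown here is an assumed example location.
transformers:
  - site-config/cas-enable-default-backing-store.yml
```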
Note that the Kubernetes manifest generated from the Kustomize build (with the updated reference) must be re-applied, and the CAS server must be restarted to pick up the configuration change.
Here is an example of what you would see in the CAS pod specification when CAS auto-resources is enabled (no use of CAS Resource Management policies) and when the backing store is enabled with the default transformer (cas-enable-default-backing-store.yml).
The first screenshot shows the effect of the CAS auto-resource configuration on an 8 CPU/64GB machine. The CAS container requests and limits are automatically computed by the CAS operator.
Then you can see below the effect of enabling the Backing Store in the CASDeployment CR:
The previous posts on this topic talked about switching from cgroups v2 back to cgroups v1 as a workaround to avoid the OOM killer issue in Kubernetes. However, the Kubernetes community has decided to move cgroups v1 into maintenance mode as of Kubernetes 1.31, so a solution that relies on moving away from cgroups v2 may be poorly received... that's another reason to opt for this new "Backing Store for CAS Memory Allocations" feature instead.
Finally, nothing really prevents a specific CAS action from running out of memory at some point. It's always a possibility... maybe because the code has not been optimized, or the dataset is too large, or simply because running this action on this amount of data requires more memory than is physically available in the infrastructure. However, with this new "backing store" feature, this type of situation is reported as an "out of memory" condition and only affects the individual user's session, as opposed to all sessions running in a CAS pod being arbitrarily killed.
That's why this change really improves the overall reliability and availability of the CAS server.
Now, as a little reward for your perseverance in reading this post 😉, you can find below a short 3-minute video that quickly demonstrates the benefit of using the "Backing Store for CAS Memory Allocations".
Thanks for reading!
Find more articles from SAS Global Enablement and Learning here.