
Unable to launch a compute session due to timeouts? Check your fsGroupPolicy…


A few months ago, various SAS teams and groups reported sporadic or systematic issues related to the startup of the SAS Compute Server pod.

 

01_RP_SASStudioComputefailed.png

Several tickets were opened to describe this issue, and the problem was even reported on the Azure Q&A board. The symptom takes various forms, but the underlying cause is always the same: the compute session could not be created because the sas-compute-server pod failed to start within the default 60-second timeout.

 

The issue can be seen from different applications (failure to run pipelines in SAS Model Studio, ad hoc analysis of a table in SAS Information Catalog, etc.) because, of course, many components of the platform rely on the successful launch and execution of SAS Compute Server sessions.

 

Most of the time, the issue appears randomly and inconsistently, and quite often a restart of SAS Workload Management makes the problem go away... but not always!

 

Some brave individuals were able to identify the root cause of this problem (and how to avoid it), and that's what we'll describe here (with a bit of detective work) 😊

 

A little "ToC" should help you navigate this rather technical post:

  • What really happens: the underlying issue
  • fsGroup and fsGroupChangePolicy: how do they work?
  • Does the recursive change of ownership always happen?
  • What changes with a CSI driver?
  • Am I affected? What are the factors?
  • A reproducible and observable behavior
  • What can we do?
  • Conclusion

What really happens: the underlying issue

 

After looking carefully at the logs and doing some research, a promising lead explaining the compute server pod timeout was found: "For large volumes, checking and changing ownership and permissions can take a lot of time, slowing Pod startup."

 

This intuition was confirmed by additional troubleshooting. When looking at the Kubernetes kubelet's logs on the compute node for this environment, it appeared that the mounting of a PVC for the sas-compute-server pod was taking longer than expected and was preventing the init containers from completing.

 

E0805 18:41:01.568119    3996 pod_workers.go:1301] "Error syncing pod, skipping" 
err="unmounted volumes=[python-volume], unattached volumes=[], failed to process 
volumes=[]: context canceled" pod="rtr/sas-compute-server-caa26268-7d7c-43a4-87c4-
6a61f8b78489-5646" podUID="ec636326-1a66-4131-84e7-2584239b75a5"

 

One of the particularities of the python-volume (which is mounted into the SAS Compute Server pod when "Integration with External Languages" is configured) is that it contains many folders and files (around 79,000!). So it is likely that some operations on the mounted volumes take too much time and cause the SAS Compute Server to time out…
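To get a feel for why a recursive ownership change over tens of thousands of files can be slow, here is a minimal sketch (the directory, file count, and use of chgrp are illustrative; the kubelet performs a comparable recursive walk when it enforces fsGroup):

```shell
# Create a scratch tree with many small files, then time a recursive
# group change -- roughly the kind of work done when fsGroup is enforced.
dir=$(mktemp -d)
for i in $(seq 1 2000); do : > "$dir/file_$i"; done

# chgrp -R must visit and change every entry in the tree; on a local disk
# this is quick, but over NFS with ~80,000 files the per-file round trips
# can easily exceed a 60-second pod startup timeout.
time chgrp -R "$(id -g)" "$dir"

rm -rf "$dir"
```

The absolute numbers will vary wildly by storage backend; the point is that the cost grows linearly with the number of files in the volume.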

 

 

fsGroup and fsGroupChangePolicy: how do they work?

 

At this point, a little reminder on how the mounted volume permissions are set in Kubernetes would not hurt 😊

 

As nicely explained in the official Kubernetes documentation:

 

  • “By default, Kubernetes recursively changes ownership and permissions for the contents of each volume to match the fsGroup specified in a Pod's securityContext when that volume is mounted.” The goal is to set the volume's folder permissions so that whoever runs the container's process can access the underlying storage.
  • However, the page also explains that changing ownership and permissions in the volume can take time (and significantly slow down the startup of the pod...), and that another field, called fsGroupChangePolicy, can be used to control the way Kubernetes checks and manages ownership and permissions for a volume.
    In summary, if fsGroupChangePolicy is set to OnRootMismatch, the recursive change of permissions only occurs if the permissions and ownership of the root directory do not match the expected permissions of the volume.
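As a sketch, this is how the two fields sit in a pod specification (the pod name, image, mount path, and PVC name below are hypothetical, for illustration only):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: fsgroup-demo                       # hypothetical pod
spec:
  securityContext:
    fsGroup: 2000                          # group applied to mounted volumes
    fsGroupChangePolicy: "OnRootMismatch"  # skip the recursive chown if the
                                           # volume root already matches
  containers:
    - name: main
      image: busybox:1.36
      command: ["sleep", "3600"]
      volumeMounts:
        - name: data
          mountPath: /data
  volumes:
    - name: data
      persistentVolumeClaim:
        claimName: demo-pvc                # hypothetical PVC name
```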

 

Let’s look at an example of the securityContext definition for our SAS Compute Server to illustrate this.

 

02_RP_fsGroup.png


 

We can see that the fsGroup value is automatically set to the GID (group ID) value returned by the identities service for the user who starts the compute server (note that in this example the GID is a randomly generated hash value, but in other cases it could be the POSIX "group id" attribute of the end user, fetched from the identity provider).

 

In addition, the fsGroupChangePolicy value is set to OnRootMismatch (which means that if a volume's root folder is already owned by the group with GID 515741841, the recursive change of permissions is skipped).

 

 

Does the recursive change of ownership always happen?

 

Actually, the fsGroup specification in the Pod's securityContext does NOT always cause the recursive change of permissions; it depends on the type of volume and file system.

 

As noted in the Kubernetes documentation, “For certain multi-writer volume types, such as NFS or Gluster, the cluster doesn’t perform recursive permission changes even if the pod has a fsGroup. Other volume types may not even support chown()/chmod(), which rely on Unix-style permission control primitives.”

 

From what we’ve seen in our environment, when using static provisioning with the standard "nfs" volume type, fsGroup is NOT enforced. The same is true when you use the nfs-subdir-external-provisioner to create your NFS-based storage class.
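For reference, a statically provisioned PV of this kind uses the Kubernetes in-tree NFS support, i.e. spec.nfs is populated directly (the server address and export path below are placeholders):

```yaml
apiVersion: v1
kind: PersistentVolume
metadata:
  name: nfs-static-pv            # illustrative name
spec:
  capacity:
    storage: 10Gi
  accessModes:
    - ReadWriteMany
  nfs:                           # in-tree NFS: the kubelet mounts it directly,
    server: nfs.example.com      # and fsGroup is not enforced on the contents
    path: /exports/data          # placeholder export path
```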

 

However, things are different when the NFS folder is exposed and mounted through a storage class provided by a CSI driver…

 

 

What changes with a CSI driver?

 

When the volume is mounted with a CSI driver, the volume's "permission and ownership change" behavior can differ from what happens without one: the enforcement of fsGroup is delegated to the CSI driver, and the driver's own configuration determines how it is applied.

 

The fsGroupPolicy field of the CSIDriver object is what determines this behavior.

 

When looking at the official Kubernetes CSI documentation, we can see that there are 3 possible values:

 

03_RP_fsGroupPolicy-Modes.png

 

So with the File mode, the fsGroup effect is enforced, while with the None mode, volumes are mounted with no ownership or permission modifications. The last available mode, ReadWriteOnceWithFSType, only modifies ownership and permissions under 2 conditions: the fsType is defined and the PV's access mode is ReadWriteOnce.
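The field lives on the CSIDriver object itself. Below is a simplified sketch of such an object for the NFS CSI driver; the fields other than fsGroupPolicy are abbreviated and may differ from what your installation actually deploys:

```yaml
apiVersion: storage.k8s.io/v1
kind: CSIDriver
metadata:
  name: nfs.csi.k8s.io
spec:
  fsGroupPolicy: File        # one of: File | None | ReadWriteOnceWithFSType
  volumeLifecycleModes:
    - Persistent
```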

 

 

Am I affected? What are the factors?

 

What appears to be the combined configuration causing the compute server timeout issue is the sas-pyconfig Python volume (required when SAS Viya integration with external languages is configured) coupled with the installation of the new NFS CSI driver (which is now recommended by SAS, as discussed in a previous post).

 

By default, the new NFS CSI driver (nfs.csi.k8s.io) is configured with a fsGroupPolicy value of File, which means that the recursive change of the volume's file and folder ownership is enforced when fsGroup is defined in the pod's security context (which is the case for our SAS Compute Server podTemplate definition, as seen above).

 

04_RP_nfs-csi-driver-default-fsGroupPolicy-1024x529.png

 

However, note that, while having Python integration configured is a common cause of the SAS Compute Server timeout (due to the number of files and folders), some teams also saw it failing due to mounts of certain data folders. Integration with Python is one example, but it could happen with other mounts as well (especially those with a lot of folders and files to traverse, which can cause the permission changes for all the volumes attached to the pod to take longer than the pod timeout limit).

 

 

A reproducible and observable behavior

 

We were able to reproduce the issue in our lab environment.

 

After having implemented both configurations ("Integration with External Languages" and the new NFS CSI driver for Kubernetes) and redeployed, a first attempt to start a SAS Studio session gives this error.

 

05_RP_sas-studio-timeout.png

 

However, on the second attempt (with the same user), the startup of the SAS Compute Server is generally successful. This allows us to exec into the sas-programming container and confirm that the ownership of the sas-pyconfig mounted volumes has been changed.

 

06_RP_sas-py-config-ownership-1024x418.png

 

It is very likely this operation that took too long the first time (almost 80,000 file permissions to change!) and caused the timeout of the Compute Server!

 

If we now connect to the NFS server and look at the physical folder permissions, we can see that the owning group of the python-volume (and other writable volumes) actually depends on who last started a Compute Server session…

 

07_RP_ownership-change-alex-ahmed-1024x539.png

 

Interestingly, we can also see this message in the kubelet logs, which confirms the root cause of the issue discussed in this blog post.

 

Sep 22 13:22:58 sasnode08 kubelet[3431]: W0922 13:22:58.874504    3431 
volume_linux.go:49] Setting volume ownership for /var/lib/kubelet/pods/cc703b02-bba2-
4f5c-b946-913dcc60b2e9/volumes/kubernetes.io~csi/pvc-284a67a5-7fb1-42cd-b77d-
c1f8fce88e4f/mount and fsGroup set. If the volume has a lot of files then setting 
volume ownership could be slow, see 
https://github.com/kubernetes/kubernetes/issues/69699

 

Finally, on another "timeout" occasion, we could observe that, while the pod was trying to perform the next volume mount operation, a message reported that the “pod startup duration” was too long 😊

 

08_RP_observed-pod-startup-duration-1024x141.png

 

At this point it looks like the detective work is over and that we have caught the main suspects 😊

 

 

What can we do?

 

To prevent the problem from happening, the solution found so far is to change the behavior of the CSI driver by changing its fsGroupPolicy value.

 

We can manually update the value with the kubectl "edit" or "patch" commands to change the fsGroupPolicy value from File to either ReadWriteOnceWithFSType or None, as shown below:

 

kubectl patch csidriver nfs.csi.k8s.io -p '{"spec":{"fsGroupPolicy": "None"}}'

 

It is also possible to disable the fsGroupPolicy when installing the CSI driver.

 

For example, with Helm you can use the --set feature.enableFSGroupPolicy=false option. Note, however, that in this case the CSI driver's fsGroupPolicy value changes from File to ReadWriteOnceWithFSType.
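Equivalently, assuming you install the driver from the csi-driver-nfs Helm chart, the same option can be set in a values file instead of on the command line:

```yaml
# values.yaml fragment for the csi-driver-nfs Helm chart (illustrative);
# equivalent to passing --set feature.enableFSGroupPolicy=false
feature:
  enableFSGroupPolicy: false
```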

 

 

Conclusion

 

If you are using a CSI driver (such as the newly recommended open-source NFS CSI driver) and have also configured the SAS Viya platform integration with external languages (with the Python volume, which contains a lot of folders and files), then you may have noticed some random failures of your SAS Compute Server sessions.

 

In this case, it is likely that you are affected by the fsGroupPolicy defined for the CSI driver.

 

The problem was not observed with the older open-source NFS provisioner tool because it used the Kubernetes native in-tree NFS support (meaning it just creates PVs with .spec.nfs populated), whereas the CSI driver performs the mount itself.

 

You could avoid the random Compute Server session failure by setting the CSI driver’s fsGroupPolicy to None or ReadWriteOnceWithFSType.

 

Note that this change was implemented in the DaC (Deployment as Code) GitHub project in the release published at the end of September 2025: the fsGroupPolicy is now set to ReadWriteOnceWithFSType by default.

If you are not using the DaC project to install the NFS CSI driver but have installed it manually (to comply with the latest recommendation from the SAS documentation), you may also want to consider making this change in the CSI driver configuration.

 

Finally, note that with the Viya November 2025 stable version (2025.11), more freedom is given to SAS administrators to work around any CSI driver constraint that would not allow disabling the fsGroup settings. A new configuration option, "fsgroup.enabled", makes the PodTemplate's fsGroup and fsGroupChangePolicy settings optional in the SAS Launcher configuration, so they can be completely disabled when the underlying storage system already enforces access control, or when the volumes are mounted with sufficiently open permissions (e.g., 0777 or per-user subpaths).

 

I hope you enjoyed this post and learned a few things about Kubernetes and SAS Viya (I know I did! 😊)

 

Find more articles from SAS Global Enablement and Learning here.
