
Unable to launch a compute session due to timeouts? Check your fsGroupPolicy…


A few months ago, various SAS teams and groups reported sporadic or systematic issues related to the startup of the SAS Compute Server pod.

 

01_RP_SASStudioComputefailed.png

Several tickets were opened to describe this issue, and the problem was even reported on the Azure Q&A board. The symptom takes various forms, but the underlying cause is always the same: the compute session could not be created because the sas-compute-server pod failed to start within the default 60-second timeout.

 

The issue can be seen from different applications (failure to run pipelines in SAS Model Studio, ad hoc analysis of a table in SAS Information Catalog, etc.) because, of course, many components of the platform rely on the successful launch and execution of SAS Compute Server sessions.

 

Most of the time, the issue appears randomly and inconsistently, and quite often a restart of SAS Workload Management makes the problem go away... but not always!

 

Some brave individuals were able to identify the root cause of this problem (and how to avoid it), and that's what we'll describe here (with a bit of detective work) 😊

 

A little "ToC" should help you navigate this rather technical post:

  • What really happens: the underlying issue
  • fsGroup and fsGroupChangePolicy: how do they work?
  • Does the recursive change of ownership always happen?
  • What changes with a CSI driver?
  • Am I affected? What are the factors?
  • A reproducible and observable behavior
  • What can we do?
  • Conclusion

What really happens: the underlying issue

 

After looking carefully at the logs and doing some research, a promising lead explaining the compute server pod timeout was found: "For large volumes, checking and changing ownership and permissions can take a lot of time, slowing Pod startup."

 

This intuition was confirmed by additional troubleshooting. When looking at the Kubernetes kubelet's logs on the compute node for this environment, it appeared that the mounting of a PVC for the sas-compute-server pod was taking longer than expected and was preventing the init containers from completing.

 

E0805 18:41:01.568119    3996 pod_workers.go:1301] "Error syncing pod, skipping" 
err="unmounted volumes=[python-volume], unattached volumes=[], failed to process 
volumes=[]: context canceled" pod="rtr/sas-compute-server-caa26268-7d7c-43a4-87c4-
6a61f8b78489-5646" podUID="ec636326-1a66-4131-84e7-2584239b75a5"

 

One of the particularities of the python-volume (which is mounted into the SAS Compute Server pod when "Integration with External Languages" is configured) is that it contains many folders and files (around 79,000!). So it is likely that some operations on the mounted volumes take too much time and cause the SAS Compute Server to time out…
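To get a feel for why a recursive ownership change over tens of thousands of files can be slow, here is a minimal sketch (the directory, file count, and use of chgrp are illustrative; the kubelet performs a comparable recursive walk when it enforces fsGroup):

```shell
# Create a scratch tree with many small files, then time a recursive
# group change -- roughly the kind of work done when fsGroup is enforced.
dir=$(mktemp -d)
for i in $(seq 1 2000); do : > "$dir/file_$i"; done

# chgrp -R must visit and change every entry in the tree; on a local disk
# this is quick, but over NFS with ~80,000 files the per-file round trips
# can easily exceed a 60-second pod startup timeout.
time chgrp -R "$(id -g)" "$dir"

rm -rf "$dir"
```

The absolute numbers will vary wildly by storage backend; the point is that the cost grows linearly with the number of files in the volume.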

 

 

fsGroup and fsGroupChangePolicy: how do they work?

 

At this point, a little reminder on how the mounted volume permissions are set in Kubernetes would not hurt 😊

 

As nicely explained in the official Kubernetes documentation:

 

  • “By default, Kubernetes recursively changes ownership and permissions for the contents of each volume to match the fsGroup specified in a Pod's securityContext when that volume is mounted.” The goal is to set the volume's folder permissions so that whoever runs the container's process can access the underlying storage.
  • However, the page also explains that changing ownership and permissions in the volume can take time (and significantly slow down the startup of the pod...), and that another field, called fsGroupChangePolicy, can be used to control the way Kubernetes checks and manages ownership and permissions for a volume.
    In summary, if fsGroupChangePolicy is set to OnRootMismatch, the recursive change of permissions only occurs if the permissions and ownership of the root directory do not match the expected permissions of the volume.
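As a sketch, this is how the two fields sit in a pod specification (the pod name, image, mount path, and PVC name below are hypothetical, for illustration only):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: fsgroup-demo                       # hypothetical pod
spec:
  securityContext:
    fsGroup: 2000                          # group applied to mounted volumes
    fsGroupChangePolicy: "OnRootMismatch"  # skip the recursive chown if the
                                           # volume root already matches
  containers:
    - name: main
      image: busybox:1.36
      command: ["sleep", "3600"]
      volumeMounts:
        - name: data
          mountPath: /data
  volumes:
    - name: data
      persistentVolumeClaim:
        claimName: demo-pvc                # hypothetical PVC name
```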

 

Let’s look at an example of the securityContext definition for our SAS Compute Server to illustrate this.

 

02_RP_fsGroup.png


 

We can see that the fsGroup value is automatically set to the GID (group ID) value returned by the identities service for the user who starts the compute server (note that in this example the GID is a randomly generated hash value, but in other cases it could be the POSIX "group id" attribute of the end user, fetched from the identity provider).

 

In addition, the fsGroupChangePolicy value is set to OnRootMismatch (which means that if a volume's root folder is already owned by the group with GID 515741841, the recursive change of permissions is skipped).

 

 

Does the recursive change of ownership always happen?

 

Actually, the fsGroup specification in the Pod's securityContext does NOT always cause the recursive change of permissions; it depends on the type of volume and file system.

 

As noted in the Kubernetes documentation, “For certain multi-writer volume types, such as NFS or Gluster, the cluster doesn’t perform recursive permission changes even if the pod has a fsGroup. Other volume types may not even support chown()/chmod(), which rely on Unix-style permission control primitives.”

 

From what we’ve seen in our environment, when using static provisioning with the standard "nfs" volume type, fsGroup is NOT enforced. The same is true when you use the nfs-subdir-external-provisioner to create your NFS-based storage class.
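For reference, a statically provisioned PV of this kind uses the Kubernetes in-tree NFS support, i.e. spec.nfs is populated directly (the server address and export path below are placeholders):

```yaml
apiVersion: v1
kind: PersistentVolume
metadata:
  name: nfs-static-pv            # illustrative name
spec:
  capacity:
    storage: 10Gi
  accessModes:
    - ReadWriteMany
  nfs:                           # in-tree NFS: the kubelet mounts it directly,
    server: nfs.example.com      # and fsGroup is not enforced on the contents
    path: /exports/data          # placeholder export path
```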

 

However, things are different when the NFS folder is exposed and mounted through a storage class provided by a CSI driver…

 

 

What changes with a CSI driver?

 

When the volume is mounted with a CSI driver, the volume's "permission and ownership change" behavior can differ from what happens without one: the enforcement of fsGroup is delegated to the CSI driver, and the driver's own configuration determines how it is applied.

 

The fsGroupPolicy field of the CSIDriver object is what determines this behavior.

 

When looking at the official Kubernetes CSI documentation, we can see that there are 3 possible values:

 

03_RP_fsGroupPolicy-Modes.png

 

So with the File mode, the fsGroup effect is enforced, while with the None mode, volumes are mounted with no ownership or permission modifications. The last available mode, ReadWriteOnceWithFSType, only modifies ownership and permissions under 2 conditions: the fsType is defined and the PV's access mode is ReadWriteOnce.
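The field lives on the CSIDriver object itself. Below is a simplified sketch of such an object for the NFS CSI driver; the fields other than fsGroupPolicy are abbreviated and may differ from what your installation actually deploys:

```yaml
apiVersion: storage.k8s.io/v1
kind: CSIDriver
metadata:
  name: nfs.csi.k8s.io
spec:
  fsGroupPolicy: File        # one of: File | None | ReadWriteOnceWithFSType
  volumeLifecycleModes:
    - Persistent
```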

 

 

Am I affected? What are the factors?

 

What appears to be the combined configuration causing the compute server timeout issue is the sas-pyconfig Python volume (required when SAS Viya integration with external languages is configured) coupled with the installation of the new NFS CSI driver (which is now recommended by SAS, as discussed in a previous post).

 

By default, the new NFS CSI driver (nfs.csi.k8s.io) is configured with a fsGroupPolicy value of File, which means that the recursive change of the volume's file and folder ownership is enforced when fsGroup is defined in the pod's security context (which is the case for our SAS Compute Server podTemplate definition, as seen above).

 

04_RP_nfs-csi-driver-default-fsGroupPolicy-1024x529.png

 

However, note that, while having Python integration configured is a common cause of the SAS Compute Server timeout (due to the number of files and folders), some teams also saw it failing due to mounts of certain data folders. Integration with Python is one example, but it could happen with other mounts as well (especially those with a lot of folders and files to traverse, which can cause the permission changes for all the volumes attached to the pod to take longer than the pod timeout limit).

 

 

A reproducible and observable behavior

 

We were able to reproduce the issue in our lab environment.

 

After having implemented both configurations ("Integration with External Languages" and the new NFS CSI driver for Kubernetes) and redeployed, a first attempt to start a SAS Studio session gives this error.

 

05_RP_sas-studio-timeout.png

 

However, on the second attempt (with the same user), the startup of the SAS Compute Server is generally successful. This allows us to exec into the sas-programming container and confirm that the ownership of the sas-pyconfig mounted volumes has been changed.

 

06_RP_sas-py-config-ownership-1024x418.png

 

It is very likely this operation that took too long the first time (almost 80,000 file permissions to change!) and caused the timeout of the Compute Server!

 

If we now connect to the NFS server and look at the physical folder permissions, we can see that the owning group of the python-volume (and other writable volumes) actually depends on who last started a Compute Server session…

 

07_RP_ownership-change-alex-ahmed-1024x539.png

 

Interestingly, we can also see this message in the kubelet logs, which confirms the root cause of the issue discussed in this blog post.

 

Sep 22 13:22:58 sasnode08 kubelet[3431]: W0922 13:22:58.874504    3431 
volume_linux.go:49] Setting volume ownership for /var/lib/kubelet/pods/cc703b02-bba2-
4f5c-b946-913dcc60b2e9/volumes/kubernetes.io~csi/pvc-284a67a5-7fb1-42cd-b77d-
c1f8fce88e4f/mount and fsGroup set. If the volume has a lot of files then setting 
volume ownership could be slow, see 
https://github.com/kubernetes/kubernetes/issues/69699

 

Finally, on another "timeout" occasion, we could observe that, while the pod was trying to perform the next volume mount operation, a message reported that the “pod startup duration” was too long 😊

 

08_RP_observed-pod-startup-duration-1024x141.png

 

At this point it looks like the detective work is over and that we have caught the main suspects 😊

 

 

What can we do?

 

To prevent the problem from happening, the solution found so far is to change the behavior of the CSI driver by changing its fsGroupPolicy value.

 

We can manually update the value with the kubectl "edit" or "patch" commands to change the fsGroupPolicy value from File to either ReadWriteOnceWithFSType or None, as shown below:

 

kubectl patch csidriver nfs.csi.k8s.io -p '{"spec":{"fsGroupPolicy": "None"}}'

 

It is also possible to disable the fsGroupPolicy when installing the CSI driver.

 

For example, with Helm you can use the --set feature.enableFSGroupPolicy=false option. Note, however, that in this case the CSI driver's fsGroupPolicy value changes from File to ReadWriteOnceWithFSType.
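Equivalently, assuming you install the driver from the csi-driver-nfs Helm chart, the same option can be set in a values file instead of on the command line:

```yaml
# values.yaml fragment for the csi-driver-nfs Helm chart (illustrative);
# equivalent to passing --set feature.enableFSGroupPolicy=false
feature:
  enableFSGroupPolicy: false
```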

 

 

Conclusion

 

If you are using a CSI driver (such as the newly recommended open-source NFS CSI driver) and have also configured the SAS Viya platform integration with external languages (with the Python volume, which contains a lot of folders and files), then you may have noticed some random failures of your SAS Compute Server sessions.

 

In this case, it is likely that you are affected by the fsGroupPolicy defined for the CSI driver.

 

The problem was not observed with the older open-source NFS provisioner tool because it used the Kubernetes native in-tree NFS support (meaning it just creates PVs with .spec.nfs populated), whereas the CSI driver performs the mount itself.

 

You could avoid the random Compute Server session failure by setting the CSI driver’s fsGroupPolicy to None or ReadWriteOnceWithFSType.

 

Note that this change was implemented in the DaC (Deployment as Code) GitHub project in the release published at the end of September 2025: the fsGroupPolicy is now set to ReadWriteOnceWithFSType by default.

If you are not using the DaC project to install the NFS CSI driver but have installed it manually (to comply with the latest recommendation from the SAS documentation), you may also want to consider making this change in the CSI driver configuration.

 

Finally, note that with the Viya November 2025 stable version (2025.11), more freedom is given to SAS administrators to work around any CSI driver constraint that would not allow disabling the fsGroup settings. A new configuration option, "fsgroup.enabled", makes the PodTemplate's fsGroup and fsGroupChangePolicy settings optional in the SAS Launcher configuration, so they can be completely disabled when the underlying storage system already enforces access control, or when the volumes are mounted with sufficiently open permissions (e.g., 0777 or per-user subpaths).

 

I hope you enjoyed this post and learned a few things about Kubernetes and SAS Viya (I know I did! 😊)

 

Find more articles from SAS Global Enablement and Learning here.
