A few months ago, various SAS teams and groups reported sporadic or systematic issues related to the startup of the SAS Compute Server pod.
Several tickets were opened to describe this issue, and the problem was even reported on the Azure Q&A board. The symptom takes various forms, but the underlying cause is always the same: the compute session could not be created because the sas-compute-server pod failed to start up within the default 60-second timeout.
The issue can surface in different applications (failure to run pipelines in SAS Model Studio, ad-hoc analysis of a table in SAS Information Catalog, etc.) because, of course, many components of the platform rely on the successful launch and execution of SAS Compute Server sessions.
Most of the time, the issue appears randomly and is not consistent, and quite often a restart of SAS Workload Management makes the problem go away... but not always!
Some brave individuals were able to understand the root cause of this problem (and how to avoid it) and that’s what we’ll describe here (with a bit of detective work) 😊
A little "ToC" should help you to navigate in this rather technical post...
After looking carefully at the logs and doing some research, a promising lead explaining the compute server pod timeout was found: "For large volumes, checking and changing ownership and permissions can take a lot of time, slowing Pod startup."
This intuition was confirmed by additional troubleshooting. Looking at the Kubernetes kubelet's logs on the compute node for this environment, it appeared that the mounting of a PVC for the sas-compute-server pod was taking longer than expected and was the reason the init containers did not complete.
E0805 18:41:01.568119 3996 pod_workers.go:1301] "Error syncing pod, skipping"
err="unmounted volumes=[python-volume], unattached volumes=[], failed to process
volumes=[]: context canceled" pod="rtr/sas-compute-server-caa26268-7d7c-43a4-87c4-
6a61f8b78489-5646" podUID="ec636326-1a66-4131-84e7-2584239b75a5"
One of the particularities of the python-volume (which is mounted into the SAS Compute Server pod when "Integration with External Languages" is configured) is that it contains many folders and files (around 79,000!). So it is likely that some operations on the mounted volumes take too much time and cause the SAS Compute Server to time out.
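To get a feel for the scale involved, you can count every entry the kubelet would have to chown recursively. The real python-volume path on your NFS server is site-specific, so this sketch builds a tiny stand-in tree just to make the command runnable anywhere:

```shell
# Hypothetical stand-in for the python-volume export; point PYVOL at the real
# NFS path in your environment instead.
PYVOL=$(mktemp -d)
mkdir -p "$PYVOL/lib/python3.11/site-packages/pkg"
touch "$PYVOL/lib/python3.11/site-packages/pkg/a.py" \
      "$PYVOL/lib/python3.11/site-packages/pkg/b.py"
# Every file AND folder counts toward the recursive ownership change:
find "$PYVOL" | wc -l
```

On the real volume this command returns a number in the tens of thousands, which is what makes the recursive chown so slow.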
At this point, a little reminder on how the mounted volume permissions are set in Kubernetes would not hurt 😊
As nicely explained in the Kubernetes official documentation:
Let’s look at an example of the SecurityContext definition for our SAS Compute Server to illustrate that.
We can see that the fsGroup value is automatically set to the GID (Group ID) value returned by the Identities service for the user who starts the compute server (note that in this example, the GID is a randomly generated hash value, but in other cases it could be the POSIX "group id" attribute of the end user, fetched from the identity provider).
In addition, the fsGroupChangePolicy value is set to OnRootMismatch, which means that if the volume's root folder is already owned by the group with GID 515741841, the recursive change of permissions is skipped.
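For readers without access to the screenshot, the relevant part of the pod's security context looks roughly like this (an illustrative excerpt, not the full SAS podTemplate; the GID is the example value from this environment):

```yaml
# Illustrative excerpt of the compute server pod's SecurityContext
spec:
  securityContext:
    fsGroup: 515741841                  # GID returned by the Identities service for the launching user
    fsGroupChangePolicy: OnRootMismatch # skip the recursive chown if the volume root already matches
```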
Actually, the fsGroup specification in the Pod's SecurityContext does NOT always cause the recursive change of permissions, depending on the type of volume and File System.
As noted there, “For certain multi-writer volume types, such as NFS or Gluster, the cluster doesn’t perform recursive permission changes even if the pod has a fsGroup. Other volume types may not even support chown()/chmod(), which rely on Unix-style permission control primitives.”
From what we’ve seen in our environment, when using static provisioning and standard "nfs" volume type, the fsGroup is NOT enforced. It is also the case when you use the nfs-subdir-external-provisioner automatic provisioner to create your "NFS based" storage class.
However, things are different when the NFS folder is exposed and mounted through a Storage Class backed by a CSI driver.
When the volume is mounted by a CSI driver, the "permission and ownership change" behavior is delegated to the driver: the CSI driver's own configuration determines how (and whether) the fsGroup is applied. The fsGroupPolicy field of the CSIDriver object is what determines this behavior.
When looking at the official Kubernetes CSI documentation, we can see that there are three possible values:
So with the File mode, the fsGroup effect is enforced, while with the None mode, the volumes are mounted with no ownership or permission changes. The last available mode, ReadWriteOnceWithFSType, only modifies ownership and permissions under two conditions: the fsType is defined and the PV's access mode is ReadWriteOnce.
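The three modes live on the CSIDriver object itself. A minimal sketch of where the field sits (here annotated with all three possible values, using the NFS CSI driver's name):

```yaml
apiVersion: storage.k8s.io/v1
kind: CSIDriver
metadata:
  name: nfs.csi.k8s.io
spec:
  # Possible values:
  #   File                    -> always honor fsGroup (recursive chown/chmod of the volume)
  #   None                    -> never change ownership or permissions
  #   ReadWriteOnceWithFSType -> only when fsType is set AND the PV access mode is ReadWriteOnce
  fsGroupPolicy: File
```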
What appears to be the combined configuration causing the compute server timeout issue is the sas-pyconfig Python volume (required when SAS Viya integration with external languages is configured) coupled with the installation of the new NFS CSI driver (which is now recommended by SAS, as discussed in a previous post).
By default, the new NFS CSI driver (nfs.csi.k8s.io) is configured with a value of File for the fsGroupPolicy, which means that the recursive change of volume's files and folders ownership is enforced when fsGroup is defined in the pod’s Security context (which is the case for our SAS Compute Server podTemplate definition - as seen above).
However, note that while having Python integration configured is a common cause of the SAS Compute Server timeout (due to the sheer number of files and folders), some teams also saw it fail with mounts of certain data folders. Integration with Python is one example, but it could happen with other mounts as well, especially those with a lot of folders and files to traverse, which can cause the permission changes for all the volumes attached to the pod to take longer than the pod timeout limit.
We were able to reproduce the issue in our lab environment.
After implementing both configurations ("Integration with External Languages" and the new NFS CSI driver for Kubernetes) and redeploying, a first attempt to start a SAS Studio session gives this error.
However, on the second attempt (with the same user), the startup of the SAS Compute Server is generally successful. This allows us to navigate inside the sas-programming container and confirm that the ownership of the sas-pyconfig mounted volumes has been changed.
Very likely, it is this operation that took too long the first time (almost 80,000 file permissions to change!) and caused the Compute Server to time out!
If we now connect to the NFS server and look at the physical folder permissions, we can see that the owning group of the python-volume (and other writable volumes) actually depends on who last started a Compute Server session.
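The check itself boils down to comparing the volume's numeric owning group against the fsGroup of the last session. A minimal runnable sketch of that comparison, on a throwaway local directory since the real NFS export path is site-specific:

```shell
# Stand-in for a file on the NFS export; in practice, run this against the
# python-volume path on the NFS server after a compute session has started.
d=$(mktemp -d)
touch "$d/sample"
# GNU stat: print the numeric owning GID -- the value the kubelet compares
# (and, with fsGroupPolicy: File, rewrites) to match the pod's fsGroup.
stat -c '%g' "$d/sample"
```

On the real export, this GID would match the fsGroup of whichever user most recently launched a compute session.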
Interestingly, we can also see this message in the kubelet logs, which confirms the root cause of the issue discussed in this blog post.
Sep 22 13:22:58 sasnode08 kubelet[3431]: W0922 13:22:58.874504 3431
volume_linux.go:49] Setting volume ownership for /var/lib/kubelet/pods/cc703b02-bba2-
4f5c-b946-913dcc60b2e9/volumes/kubernetes.io~csi/pvc-284a67a5-7fb1-42cd-b77d-
c1f8fce88e4f/mount and fsGroup set. If the volume has a lot of files then setting
volume ownership could be slow, see
https://github.com/kubernetes/kubernetes/issues/69699
Finally, on another "timeout" occasion, using this command we could see that, while the pod was trying to perform the next volume mount operation, a message appeared reporting that the "pod startup duration" was too long 😊
At this point it looks like the detective work is over and that we have caught the main suspects 😊
To prevent the problem from happening, the solution that was found so far was to change the behavior of the CSI driver by changing the fsGroupPolicy value.
We can manually update the value with the kubectl "edit" or "patch" command to change the fsGroupPolicy value from File to either ReadWriteOnceWithFSType or None, as shown below:
kubectl patch csidriver nfs.csi.k8s.io -p '{"spec":{"fsGroupPolicy": "None"}}'
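After the patch, the CSIDriver object should look roughly like this (an illustrative, abbreviated view of the spec):

```yaml
# Resulting CSIDriver object after the patch (fields abbreviated)
apiVersion: storage.k8s.io/v1
kind: CSIDriver
metadata:
  name: nfs.csi.k8s.io
spec:
  fsGroupPolicy: None   # was File; ReadWriteOnceWithFSType also avoids the recursive chown on NFS mounts
```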
It is also possible to disable the fsGroupPolicy feature when installing the CSI driver. For example, with Helm you can use the --set feature.enableFSGroupPolicy=false option. Note, however, that in this case the CSI driver's fsGroupPolicy value changes from File to ReadWriteOnceWithFSType.
If you are using a CSI driver (such as the newly recommended open-source NFS CSI driver) and have also configured the SAS Viya platform integration with external languages (with the Python volume, which contains a lot of folders and files), then you may have noticed some random failures of your SAS Compute Server sessions.
In this case, it is likely that you are affected by the fsGroupPolicy defined for the CSI driver.
The problem was not observed with the older open-source NFS provisioner tool because it used the Kubernetes native in-tree NFS support (meaning it just creates PVs with .spec.nfs populated), whereas the CSI driver performs the mount itself.
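To make the distinction concrete, here is a sketch of the kind of in-tree PV the older provisioner creates (names and paths are hypothetical). Because no CSIDriver object is involved, no fsGroupPolicy applies, and the kubelet does not enforce fsGroup on this kind of NFS volume:

```yaml
# In-tree NFS PV as created by the older provisioner (illustrative example)
apiVersion: v1
kind: PersistentVolume
metadata:
  name: pvc-example            # hypothetical name
spec:
  capacity:
    storage: 10Gi
  accessModes: [ReadWriteMany]
  nfs:                         # in-tree driver: no CSIDriver object, hence no fsGroupPolicy
    server: nfs.example.com    # hypothetical NFS server
    path: /export/pyconfig     # hypothetical export path
```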
You could avoid the random Compute Server session failure by setting the CSI driver’s fsGroupPolicy to None or ReadWriteOnceWithFSType.
Note that this change was implemented in the DaC (Deployment as Code) GitHub project in the release published at the end of September 2025: the fsGroupPolicy is now set to ReadWriteOnceWithFSType by default.
If you are not using the DaC project to install the NFS CSI driver but have installed it manually (to comply with the latest recommendation in the SAS documentation), you may also want to consider making this change in the CSI driver configuration.
Finally, note that with the Viya November 2025 stable version (2025.11), SAS administrators are given more freedom to remove any CSI driver constraint that would prevent disabling the fsGroup settings. A new configuration option, "fsgroup.enabled", makes the PodTemplate's fsGroup and fsGroupChangePolicy settings optional in the SAS Launcher configuration, so they can be completely disabled when the underlying storage system already enforces access control, or when the volumes are mounted with sufficiently open permissions (e.g., 0777 or per-user subpaths).
I hope you enjoyed this post and learned a few things about Kubernetes and SAS Viya (I know I did! 😊)
Find more articles from SAS Global Enablement and Learning here.