
Kubernetes Storage Patterns for SASWORK – part 2


Welcome to Part 2 of this series about storage patterns in Kubernetes in support of SASWORK. If you haven't yet, please refer to Part 1 to become familiar with many of the baseline concepts that we will assume here.

 

For this part, we will focus on setting up SASWORK so that it can support Checkpoint-Restart functionality.

 

About Checkpoint-Restart

 

Checkpoint-Restart is an optional feature when submitting jobs using SAS Workload Management, offered as part of the SAS Viya platform running in Kubernetes. Using Checkpoint-Restart comes with some requirements, such as:

 

  • The batch SAS job must be identified as restartable when submitted
  • The queue to which the batch job is submitted must be configured to restart jobs
  • The original execution of the batch job must be preempted (that is, interrupted) by SAS Workload Orchestrator (SWO) for a reason like priority
    - or -
    The batch job's execution host goes offline unexpectedly.

 

To be clear, this means that batch jobs which fail for other reasons are not eligible for Checkpoint-Restart processing. They must be resubmitted to run in the normal fashion.

 

Checkpoint-Restart can take three different forms, depending on how the job is declared restartable when it's submitted:

 

  • Data Step Restart: The status of DATA and PROC steps in the SAS program code is recorded so that the restart resumes work at the interrupted step.
  • Label Restart: The status of the SAS program code at user-defined labels is recorded so that the restart resumes work at the interrupted label.
  • Job Restart: The restart will begin at the top of the program code.

 

This is just at a high level, of course. And there is functionality, like the CHECKPOINT EXECUTE_ALWAYS statement, that can be used in conjunction with your own program logic to refine processing to suit your needs.
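To make the label form a little more concrete, here is a minimal, hypothetical sketch (invented for illustration, not taken from a real job) showing how user-defined labels and the CHECKPOINT EXECUTE_ALWAYS statement might appear in a program:

/* Each label marks a point where a restarted job can resume. */
prep_data:
proc sort data=work.raw out=work.sorted;
   by customer_id;
run;

score_model:
data work.scored;
   set work.sorted;
   score = 0.42 * amount;   /* placeholder logic */
run;

/* Ask SAS to run the next step on every restart, even if it completed before the interruption. */
checkpoint execute_always;

report_results:
proc means data=work.scored;
   var score;
run;

With label-level restart, execution resumes at the last recorded label rather than at the top of the program.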

 

The purpose of Checkpoint-Restart is for your SAS program code to resume where it left off when it was interrupted. This can save hours of rework for very long-running batch jobs. It also means that we need to preserve the state of files and processing at the point where the job was interrupted. In particular, we focus on SASWORK here. For one thing, it's the default location of the Checkpoint Library, which records where things stood when the job was interrupted. But it might also contain data sets, catalogs, utility files, or whatever else was in-flight during the original execution. We want to get back to those files. So SASWORK needs to be backed by a persistent volume accessible to multiple nodes concurrently.

 

If your site doesn't require this functionality, then you don't need to architect the environment and configure the software to support Checkpoint-Restart.

 

 

The Process

 

Let's walk through the process at a high level to understand what's going on.

 

Using the sas-viya CLI utility, you submit a batch job to run your SAS program code, specifying a queue that has its job restart attribute enabled and a parameter like "--restart-label" to mark the job itself as restartable using label checkpoints. SWO tracks the job's status from PENDING to STARTING to RUNNING, corresponding to the state of your sas-batch-server pod in Kubernetes.
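For illustration, a hypothetical submission might look like the following. Only "--restart-label" is the flag discussed above; the program path, the context name, and the other flag spellings are assumptions for this sketch, so confirm the exact options for your release with the CLI's built-in help for "sas-viya batch jobs submit-pgm".

sas-viya batch jobs submit-pgm \
    --pgm-path ./monthly_rollup.sas \
    --context checkpoint-batch \
    --restart-label

Here "checkpoint-batch" stands in for a Batch Context whose queue is configured to restart jobs (more on dedicated Batch Contexts later in this post).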

 

Now let's say a second job with a higher priority is submitted and the SWO determines that your batch job must be preempted to allow the high-priority job to run. The SWO will kill your job and mark its status back to PENDING. In Kubernetes, your sas-batch-server pod will be terminated but, importantly, your job's SASWORK location is not deleted.

 

The high-priority job runs to completion and the job slot is available to run your batch job again. The SWO restarts your job and the SAS Batch Service configures it to use the same set of directories and files as it used before, so the SAS Batch Server can find the original SASWORK files and status in the Checkpoint Library. The SAS runtime spins up and takes a survey of where things left off, placing notes in the log indicating that it's restarting from a checkpoint:

 

NOTE: Begin CHECKPOINT (at label) execution mode.
NOTE: Begin CHECKPOINT-RESTART(at label: MyImportantTask-6) execution mode.    

 

In this case, it's announcing that it will resume execution at the point labeled as "MyImportantTask-6" which you know is about midway through your SAS program. Nice!

 

And then your batch job runs the rest of the way to completion as expected.

 

 

SASWORK for Checkpoint-Restart

 

In the previous post we showed examples where SASWORK was defined to locations dynamically provisioned for each SAS Compute Server pod. That is, each pod had its own dedicated space for SASWORK and, ideally, that space could be automatically released when SAS no longer needed it.

 

But now, we need a volume for SASWORK that a) persists to hold on to the data that might be needed for restart and b) is shared for access by multiple instances of the SAS runtime at once.

 

In Kubernetes parlance then, we need an RWX (ReadWriteMany) volume instead of an RWO (ReadWriteOnce) volume. RWX volumes are managed by specific types of storage providers. For example, an NFS server is often the low-cost, easy approach to get RWX volumes. The cloud infrastructure providers also have managed offerings to do the job, like Amazon Elastic File System, Azure NetApp Files, and Google Filestore, among others.

 

And then finally, one more thing - and it's important: SAS requires that volumes backing SASWORK be fully POSIX compliant. That's because SAS often needs to manage file owner and group permissions in those volumes for security reasons. Refer to the "Requirements for the CAS Server and Programming Runtime Environment" documentation for details.

 

 

Provide SAS Batch Server with Static PVC to RWX Storage

 

Not every instantiation of the SAS runtime needs RWX storage for SASWORK. Checkpoint-Restart functionality is only available for batch jobs, which means they're submitted to run using the SAS Batch Server (not the SAS Compute or SAS Connect Servers). Therefore, we only need to modify the sas-batch-server pod template to use the Static PVC to RWX Storage for the "/viya" mount. SAS Compute and Connect Servers can continue to use their own individual SASWORK locations.

 

01_RC_k8s-saswork-rwx-1536x882.png


 

Note that with this change, all SAS Batch Servers will mount one and the same persistent volume for their SASWORK. You read that right - and it's difficult to illustrate clearly - but every instance of the sas-batch-server refers to the same statically defined PVC, which binds to the same PV, backed by the same single volume in the shared storage provider.
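As a rough sketch of what that static claim could look like, here is a hypothetical PersistentVolumeClaim. The name, namespace, storage class, and size are all placeholders; your deployment's pod template for the SAS Batch Server would then reference this claim for its "/viya" mount instead of ephemeral storage.

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: sas-batch-saswork        # placeholder name referenced by the batch pod template
  namespace: viya                # your SAS Viya namespace
spec:
  accessModes:
    - ReadWriteMany              # RWX: mountable by many pods on many nodes at once
  storageClassName: fsx-shared   # placeholder for a POSIX-compliant, RWX-capable storage class
  resources:
    requests:
      storage: 1Ti               # size for concurrent batch SASWORK usage; adjust for your workload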

 

A few considerations here:

 

  1. What technology should I use for this RWX volume?

     

    That depends on many factors including where your infrastructure is hosted, how much Checkpoint-Restart activity your site expects, budget tolerance for higher-end clustered file system offerings, and more.

     

    SAS generally recommends avoiding cheap and simple NFS for SASWORK in production environments due to the heavy usage interactions that are likely. On the other end of the spectrum, managed cluster file system offerings can be expensive in the cloud and might not be desirable if Checkpoint-Restart is rarely used.

     

    The illustration above, based on Amazon infrastructure, shows Amazon FSx as the shared storage backend. That's in part because Amazon Elastic File System (EFS) isn't compatible with Viya's POSIX requirements for SASWORK. But really, there are a variety of suitable storage providers, managed and unmanaged (that is, managed by your IT team directly), that can work as long as they meet the minimum requirements of the software and workload.

 

  2. How do we prevent collisions between different users or jobs that might name things the same?

     

    The good news is: we don't have to.

     

    The SAS runtime is already set up to gracefully handle this for us. It uses a directory naming convention and structure to separate users and their jobs from each other, preventing any inadvertent file naming collisions. Sweet!

     

    Here's an example of such a path as seen from inside the sas-programming-runtime container of the sas-batch-server pod:

     


    /opt/sas/viya/config/var/run/batch/default/uid518005308/JOB_20250822_162405_845_1/WORK/SAS_workB1860000028F_sas-batch-server-9bf32816-9a42-4ce1-8ce1-05d47268aea6-60/

     

    I want to call special attention to the uid and JOB_id portions of the path. These work together in support of Checkpoint-Restart. The JOB_id in particular is used by the SAS Batch Server as a Fileset ID. When a preempted job is scheduled for restart, the SAS Batch Service advertises the previous Fileset ID to the new job so that it can find the files it needs, including the Checkpoint Library.

 

  3. How big should this one SASWORK volume, shared by all SAS Batch Server instances, be?

     

    For those familiar with answering this question for SAS 9.4 deployments, you'll find things haven't changed all that much. The volume needs to be sized for the concurrent usage of all jobs: SAS processing itself, ancillary utility files (like the interim files of sort processing), and whatever users' program code puts there (temporary data sets, catalogs, and so on).

     

    It will also need some regular maintenance to delete orphaned files from failed jobs.

     

    Sidebar:

     

    Greg Wootton recently authored a new open-source tool called "sas-cleanwork-vk" as part of the SAS Technical Support team's project in the SAS Communities GitHub organization. It offers a Viya equivalent to the SAS 9.4 Cleanwork utility. Of course, if we choose to use ephemeral volumes for SASWORK, then we don't need it.

     

    However, for the persistent RWX volume necessary for Checkpoint-Restart, it's possible that SASWORK files will occasionally get left behind, and they'll remain there consuming space until they're removed. Just be careful: we really do want those files to stay as long as they’re needed to support a subsequent “-restart” of the job.

     

    Kubernetes doesn't clean up inside the volume for us, so the files must be removed manually when they're no longer needed. If you're able to schedule a maintenance window and take Viya offline briefly, you could simply delete the PVC itself and Kubernetes will then delete the volume for SASWORK (assuming reclaimPolicy=Delete). Then redefine the static PVC with the same storage class as before, and when you bring the Viya services back up, the PVC will request a new persistent volume for SASWORK through the associated CSI driver.

     

    Another approach is to create your own temporary pod and attach the PVC used for SASWORK to it. Then exec into the pod, navigate the directory structure to find the "SAS_work" directories, identify which are no longer actively used, and delete them individually. Basically, that's what "sas-cleanwork-vk" does. The challenge with this approach is that the pod needs root-level privileges to do its job - and some sites don't allow that. But if uninterrupted uptime is important, then this might be the way to go (a minimal sketch of such a cleanup pod follows this list).

 

 

  4. What if I don't want all Batch Server instances to use a shared SASWORK?
     

    So far, the discussion has been about modifying the SAS Batch Server (specifically, the "/viya" mount point in its pod template) as it's defined for the initial deployment of the SAS Viya platform. However, it is also possible to set up additional Batch Contexts, and those can reference different pod templates that you define.

     

    In that way, you can keep the default Batch Context using ephemeral storage for SASWORK, and provide a special Batch Context whose Batch Server pod template implements a shared SASWORK volume for the "/viya" mount. Then, for those jobs that need it, specify that special Batch Context when submitting the batch job so that it gains the benefits of Checkpoint-Restart afforded by a shared SASWORK volume.

     

    Consider a use-case where the really big, long-running jobs are only submitted once a month or once a quarter. Then this approach would allow for spinning up the expensive clustered file system only when it's really needed. The rest of the time, the normal batch jobs from users could rely on the ephemeral storage approach for SASWORK.
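Returning to the cleanup question from consideration 3, here is a minimal sketch of the temporary-pod approach. Everything here is hypothetical (the pod name, namespace, image, and claim name are placeholders), and it assumes your site permits a root-capable utility pod:

apiVersion: v1
kind: Pod
metadata:
  name: saswork-janitor            # hypothetical utility pod
  namespace: viya
spec:
  restartPolicy: Never
  containers:
    - name: janitor
      image: busybox:1.36
      command: ["sleep", "86400"]  # keep the pod alive while you work
      securityContext:
        runAsUser: 0               # root is typically needed to remove other users' files
      volumeMounts:
        - name: saswork
          mountPath: /saswork
  volumes:
    - name: saswork
      persistentVolumeClaim:
        claimName: sas-batch-saswork   # the same static RWX PVC shared for batch SASWORK

After applying it with kubectl, you would exec into the pod (for example, kubectl exec -it saswork-janitor -n viya -- sh), browse the directory structure under /saswork, and delete only those SAS_work directories whose Fileset IDs are no longer needed for any pending restart. Delete the pod when you're done.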

 

 

Some final notes

Checkpoint-Restart is very powerful and useful. It provides an elegant approach for managing the lifecycle of long-running batch jobs as part of the overall SAS workload, in a way that Kubernetes alone cannot match.

 

And I think it's interesting how our reliance on clustered file systems has evolved since SAS 9.4 Grid Manager. That software offering required a serious clustered file system right up front. And with a clustered file system on hand, we used it for all kinds of things: hosting a single set of SAS executable binaries to run across multiple hosts, SASWORK of course, other grid provider software, local data mart files, and more.

 

Now, with the SAS Viya platform running in Kubernetes, a lot of those uses for a clustered file system have moved elsewhere. Storage is abstracted in such a way that we can address specific needs with exactly the right solution in each case. Take, for example, the fact that we can differentiate storage for SASWORK depending on the type of SAS runtime (SAS Compute, SAS Connect, SAS Batch). And we no longer need to mount a shared file system on multiple hosts for shared software binaries - we use a container registry for that purpose now.

 

That said, while the number of use-cases has been reduced, there are still important occasions where performant and scalable clustered file system technology is needed for SAS Viya, especially where plain old NFS just won't cut it. For Checkpoint-Restart, we need SASWORK for batch jobs backed by shared storage. And to handle the I/O stress that SAS can place on a file system, we need to provision a storage backend with the horsepower to keep up.

 

 

References and Resources

 

In addition to the links above, the GEL team also provides courses on learn.sas.com:

 

 

 

Find more articles from SAS Global Enablement and Learning here.
