SAS Viya High Throughput Batch Processing: Part 1 – Reusable Batch Servers


SAS Innovate is an amazing venue where customers can deep dive into SAS’ latest breakthroughs. This April I had the privilege to attend and present at SAS Innovate 2024 in Las Vegas. Amongst many energizing sessions, I particularly enjoyed the Super Demo “Need for Speed: Fast-Tracking Batch Workloads with SAS Viya Workload Management”, presented by Wes Gelpi, Director R&D, SAS Viya Compute, and Prasad Poozhikalayil, Sr Product Manager, SAS Viya Platform. The topic caught my attention, and I decided to look deeper into some details and test the new capabilities.

In short, starting with the SAS Viya 2024.04 stable release, SAS introduced new batch capabilities that benefit customers who engage in extensive batch processing involving time-sensitive, high volumes of very short SAS jobs. The new features address inefficiencies that customers noted in their previous experience, such as longer turnaround time, excessive resource utilization, and potential timeouts and job failures due to a sudden surge in pod creation.

This article is the first in a series where I share the result of my reading and researching. It describes reusable batch servers.

The second article will focus on optimized input and output file storage.

Finally, the third and last article will present some performance tests that compare the throughput and total run time of a burst of short-running jobs when run on default batch servers versus high-throughput batch servers.

SAS Viya Batch Servers

The SAS batch processing infrastructure in the SAS Viya platform is designed to provide an ad-hoc set of servers and services that, by default, are tailored to processing long-running, scheduled batch jobs. It also supports interactive execution to simplify development and testing, but at its core the focus is on long-running, fire-and-forget batch jobs.

In this default configuration, each job submission is processed by a batch server (a dedicated SAS process) running in a Kubernetes pod that is started on-demand, executes that one job and then shuts down.

What if your use case centers on managing a significant volume of small, time-sensitive data jobs? With the default batch server, you might face a critical performance bottleneck: the latency between the moment a pod starts and the moment it actually begins running the job. Starting a new pod can add an overhead of 10 to 15 seconds to the total execution time. Even worse, if your Kubernetes cluster is configured for dynamic auto-scaling and there are no compute nodes available, starting and initializing a new node might add a further overhead of a few minutes. These overheads have little impact on a set of scheduled jobs when each job runs for a few hours, or even a few minutes. Executing a job in 20 minutes or in 20 minutes and 15 seconds is a small difference of about 1%. But when your use case is submitting quick bursts of hundreds of 15-second batch jobs, adding 15 seconds to start up a pod each time means doubling the total execution time. Or, if you want to look at it from a different angle, the total throughput in terms of jobs per minute halves. That’s a significant performance hit! The high rate of pod creation and destruction also stresses the Kubernetes API services, with the risk of excessive resource utilization leading to potential timeouts and failures.
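The arithmetic above can be checked with a few lines of Python. This is only a back-of-the-envelope sketch: the function names are made up for illustration, and the 15-second startup cost is the upper figure quoted in the text, not a measured value.

```python
def relative_overhead(job_seconds: float, startup_seconds: float) -> float:
    """Extra wall-clock time added by pod startup, as a fraction of the job run time."""
    return startup_seconds / job_seconds


def jobs_per_minute(job_seconds: float, startup_seconds: float = 0.0) -> float:
    """Throughput of a single server that runs jobs back to back."""
    return 60.0 / (job_seconds + startup_seconds)


STARTUP = 15.0       # assumed pod startup overhead, in seconds
LONG_JOB = 20 * 60   # a 20-minute scheduled job
SHORT_JOB = 15       # a 15-second job

print(f"long job overhead:  {relative_overhead(LONG_JOB, STARTUP):.2%}")   # 1.25%
print(f"short job overhead: {relative_overhead(SHORT_JOB, STARTUP):.0%}")  # 100%
print(f"short jobs/min, no startup cost:   {jobs_per_minute(SHORT_JOB):.1f}")           # 4.0
print(f"short jobs/min, with startup cost: {jobs_per_minute(SHORT_JOB, STARTUP):.1f}")  # 2.0
```

The same 15 seconds is noise for the long job but doubles the elapsed time of the short one, which is why throughput drops from 4 to 2 jobs per minute per server.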

Animation showing default batch servers processing 3 short jobs.


Reusable Batch Servers

Starting with the SAS Viya 2024.04 stable release, the SAS Viya platform provides a new feature: “Reusable Batch Servers”. A reusable batch server is a batch server pod that can process multiple jobs sequentially, without stopping and restarting the pod. These servers decouple jobs from pod creation and initialization.

SAS developed this feature to benefit our customers who engage in extensive batch processing involving time-sensitive, high volumes of very short SAS jobs. It addresses the inefficiencies noted above, such as longer total execution time, excessive resource utilization due to a sudden surge in pod creation, and potential timeouts and job failures.

Animation showing reusable batch servers processing 3 short jobs.
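To get a feel for what decoupling jobs from pod startup means for a whole burst, here is a deliberately simplified model in Python. It is a sketch under stated assumptions, not a description of how the batch service schedules work: all jobs have the same duration, a fixed number of jobs run concurrently in both modes, reusable servers are already up when the burst arrives, and the 15-second startup cost is illustrative. The function name and parameters are hypothetical.

```python
def burst_seconds(n_jobs: int, servers: int, job_seconds: float,
                  startup_per_job: float = 0.0) -> float:
    """Elapsed time for a burst of identical jobs spread over a fixed number of slots.

    startup_per_job > 0 models the default mode (a fresh pod for every job);
    startup_per_job = 0 models reusable servers that are already running.
    """
    rounds = -(-n_jobs // servers)  # ceiling division: jobs each slot must run in turn
    return rounds * (job_seconds + startup_per_job)


JOBS, SERVERS, JOB_SECONDS, STARTUP = 100, 7, 15.0, 15.0  # assumed values

default_total = burst_seconds(JOBS, SERVERS, JOB_SECONDS, startup_per_job=STARTUP)
reusable_total = burst_seconds(JOBS, SERVERS, JOB_SECONDS)

print(f"default servers:  {default_total:.0f} s")   # 450 s
print(f"reusable servers: {reusable_total:.0f} s")  # 225 s
```

Real numbers depend on the environment, so treat this only as an illustration of where the savings come from.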

How to Configure and Enable Reusable Batch Servers

To use reusable batch servers, you must enable the capability in the batch context, configure a few settings – such as the minimum number of servers that run all the time – and specify a shared account under which they should run.

Let’s describe these steps in more detail, including considerations about creating an efficient workload management configuration.

Configure the shared account credential

To configure the batch server to run under a shared account, you must use the SAS Viya CLI to create a new credential for this shared account. With the CLI, log in with an administrative account and use the batch plug-in to store the shared account user ID and password with the following command:

sas-viya batch credentials create

The CLI will prompt for a username and password, then save them in the credentials microservice.

If you do not want to store a password, you also have the option to use an OAuth token. This is useful, for example, to support SCIM authentication providers. See the documentation for detailed steps.

(Optional) Create a dedicated workload queue

While not strictly required, creating a dedicated workload queue to be used for high-throughput batch servers can simplify further tuning of the SAS Viya platform. For example, you can segregate high throughput batch jobs from other kinds of compute workloads. An administrator can use the Workload Orchestrator page of SAS Environment Manager to create and configure a dedicated queue.

Create and configure a new batch context

You may want to create a new batch context for high-throughput batch and leave the default context configured as is, to support long-running jobs. An administrator can use the SAS Viya CLI:

sas-viya batch contexts create \
    --name "high_throughput" \
    --launcher-context-name "SAS Batch service launcher context" \
    --queue "batch_high_throughput" \
    --run-as "user1" \
    --reusable \
    --min-servers "7" \
    --max-servers "7"

The same result can be accomplished with the Contexts page of SAS Environment Manager.
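If you create contexts for several environments, it can help to generate the CLI call from parameters instead of editing it by hand. The Python sketch below only assembles and prints the command; it does not run it. The helper name `build_context_command` and the min/max sanity check are assumptions made for illustration, and the options are limited to the ones shown in the command above.

```python
import shlex


def build_context_command(name: str, queue: str, run_as: str,
                          min_servers: int, max_servers: int,
                          launcher_context: str = "SAS Batch service launcher context") -> list[str]:
    """Assemble the 'sas-viya batch contexts create' call shown in this article."""
    # Sanity check added for this sketch only; the CLI may validate differently.
    if not 0 <= min_servers <= max_servers:
        raise ValueError("expected 0 <= min_servers <= max_servers")
    return [
        "sas-viya", "batch", "contexts", "create",
        "--name", name,
        "--launcher-context-name", launcher_context,
        "--queue", queue,
        "--run-as", run_as,
        "--reusable",
        "--min-servers", str(min_servers),
        "--max-servers", str(max_servers),
    ]


cmd = build_context_command("high_throughput", "batch_high_throughput", "user1", 7, 7)
print(shlex.join(cmd))  # shell-quoted, ready to paste into a terminal
```

Returning the arguments as a list keeps quoting unambiguous, because the launcher context name contains spaces.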

As soon as the new context is created, the batch service starts the preset number of minimum servers.

An administrator can use the Workload Orchestrator page of SAS Environment Manager to verify the newly started servers. Notice that they are running under the shared account set in the context configuration.

About Server Sizing

You may have noticed in the screenshots above that both the new workload queue and the new batch context use the same value for the number of servers: 7 in these examples. How did we arrive at that number?

In this case, we only had one compute node with 8 cores, and we decided to fully dedicate it to this test. A common practice with batch jobs is to dedicate 1 core to each job. We wanted to leave 1 core to the OS and Kubernetes services, which means that our batch service can use 7 cores, leading to the above setting of 7 reusable servers. The workload queue parameter was set accordingly.
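The sizing rule described above is simple enough to write down as a tiny Python helper. The function name and parameter names are hypothetical, and the one-core-per-job and one-reserved-core defaults are the choices made for this test environment, not a general recommendation.

```python
def reusable_server_count(node_cores: int, reserved_cores: int = 1,
                          cores_per_job: int = 1) -> int:
    """Number of reusable batch servers that fit on one dedicated compute node."""
    usable = node_cores - reserved_cores  # cores left after the OS and Kubernetes services
    if usable < 0:
        raise ValueError("reserved_cores cannot exceed node_cores")
    return usable // cores_per_job


print(reusable_server_count(8))  # 7, the value used for the context and the queue
```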

Also, for this environment we use a sustained flow of batch jobs, so we set the minimum and maximum number of servers to the same value. This optimizes server allocation, because the maximum number of servers is immediately available at startup, but it comes with the associated cost of full infrastructure utilization. If you know that your batch jobs come in bursts of variable throughput, you can set a lower minimum, even down to zero. As usage falls off and the system is increasingly idle, the unused batch servers will terminate automatically, leaving the minimum number up and running and saving on infrastructure cost. The default server idle timeout is 300 seconds, and a custom value can be set when defining the new batch context.
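The scale-down behavior can be pictured with a small Python model. This is not the batch service's actual algorithm, only a sketch of the rule as described here: servers that have been idle for at least the timeout are stopped, but never below the configured minimum. The function name and the sample idle times are invented for illustration.

```python
def servers_after_scale_down(idle_seconds: list[float], min_servers: int,
                             idle_timeout: float = 300.0) -> int:
    """Servers left running once idle ones are stopped, respecting the minimum."""
    expired = sum(1 for idle in idle_seconds if idle >= idle_timeout)
    return max(min_servers, len(idle_seconds) - expired)


# Seven running servers with made-up idle times, in seconds.
idle = [10, 350, 400, 20, 600, 5, 310]

print(servers_after_scale_down(idle, min_servers=2))  # 3: four servers exceeded the timeout
print(servers_after_scale_down(idle, min_servers=5))  # 5: the minimum keeps two idle servers alive
```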

Each customer environment is unique, and these values should be set after proper system sizing and architecture design. But do not feel compelled to find the perfect tuning on the first try. SAS Workload Management gives you the tools to put into practice the mantra “Plan. Monitor. Refine”.

For further reading, a very detailed discussion about the factors that influence the tuning of these settings is presented by Rob Collum in his article Determine how many SAS Viya analytics pods can run on a Kubernetes node – part 3.

Coming up next

When discussing this new feature with some colleagues, I noticed we were using many different names. We call it high-performance batch, high-throughput batch, reusable batch… In the end, call it what you want: what matters is that your demands for high-volume batch processing can be satisfied in a shorter time. Who does not like that?

In the next article of this series, we’ll explore another change that improves batch performance by eliminating the need to download and upload files exchanged between batch clients and servers. Stay tuned!

Edoardo Riva

Find more articles from SAS Global Enablement and Learning here.
