
SAS Viya High Throughput Batch Processing: Part 3 – Performance Testing


This is the third and final article in the series that investigates the new batch capabilities introduced with the SAS Viya 2024.04 stable release. These new features address customers who run extensive, time-sensitive batch processing involving high volumes of SAS jobs of very short duration.

 

[ Part 1 | Part 2 ]

 

We close the topic by presenting some performance tests that compare the throughput and total run time of a burst of short-running jobs when run on default batch servers versus high-throughput batch servers.

 

The test environment

 

The results presented in this article do not come from a comprehensive benchmark. We are simply sharing the results we measured and some considerations noted while exploring these new capabilities in our small internal environment, based on SAS Viya 2024.06 stable release.

 

The Kubernetes cluster is hosted on-premises, with 4 worker nodes running Kubernetes 1.29. Each node has 8 CPU cores and 64 GB of memory. It is not built for performance, rather the opposite: it is just enough to test the deployment and configuration of this new capability. Yet, by limiting the concurrency of the submitted jobs to 1 per core (with 1 core left for Kubernetes services), we verified that the results were not skewed by an overloaded environment. To achieve this, we bound the compute server to run on only one of the worker nodes and made sure that the node was almost completely dedicated to it. We then configured SAS Workload Orchestrator to allow at most 7 concurrent jobs on the default queue and to keep 7 pre-started batch servers on the high-throughput queue.
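
For reference, dedicating a node to the compute workload can be done with the standard SAS Viya workload class label and taint. The commands below are only a sketch: the node name k8s-worker-1 is a hypothetical placeholder, while the workload.sas.com/class=compute label and taint follow the usual SAS Viya node setup conventions.

# Reserve one worker node for compute (batch server) pods
# (node name is a placeholder; adjust to your cluster)
kubectl label node k8s-worker-1 workload.sas.com/class=compute --overwrite
kubectl taint node k8s-worker-1 workload.sas.com/class=compute:NoSchedule --overwrite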

 

The test jobs

 

We used the very basic, short, CPU-bound job introduced in the previous article. The batch job reads a small input CSV file, outputs a table report, and artificially consumes 100% of a CPU core for 5 seconds. The whole program runs in a bit less than 5.5 seconds:

 

filename csv "!BATCHJOBDIR/iris.csv";

/* Import the CSV file  */
proc import datafile=csv out=work.myiris dbms=csv;
  getnames=yes;
run;

/* Print the first 10 observations */
proc print data=work.myiris(obs=10);
run;

/* load the CPU for a specified amount of time */
%let maxtime=5;    ** max # of seconds **;
data _null_;
  stt= time();
  i=1;
  do until (i = 0);
    i=i+1;
    j=i**2;
    b=j-9+i;
    now= time();
    elap=(now-stt);
    *put stt now elap;
    if elap > &maxtime then do;
      put "looped " i " times in " elap " seconds";
      i=0;
    end;
  end;
run;

 

We tested two different batch use cases:

 

  • the default case, which is the only one available in older SAS Viya versions. It uses on-demand batch servers and uploads/downloads files between the client and the server via the Files service.
  • the high-throughput case, which leverages the new capabilities introduced with SAS Viya 2024.04: pre-started batch servers and optimized I/O sharing.

 

For the former case the batch jobs were submitted to the default SAS batch context, while the latter leveraged the high-throughput context, configured as described in the first article of the series.

 

The submission commands were similar to the ones used in the previous article:

 

sas-viya batch jobs submit-pgm -c default --name "Default-job" \
  --pgm "importIris.sas" \
  --job-file "iris.csv"

sas-viya batch jobs submit-pgm -c high_throughput --name "HT-job" \
  --rem-pgm-path "/mnt/viya-share/data/importIris.sas" \
  --rem-output-dir "/mnt/viya-share/data/@FILESETID@"

 

For each case, we used two client machines to submit a total of 200 batch jobs, so that the submission rate (in jobs per minute) was high enough to saturate the backend for a few minutes.
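
For illustration, each client simply looped over the submission command shown above. A minimal sketch of the burst for the high-throughput case, assuming 100 jobs per client, could look like this:

# Submit 100 high-throughput batch jobs in a burst from this client
for i in $(seq 1 100); do
  sas-viya batch jobs submit-pgm -c high_throughput --name "HT-job-${i}" \
    --rem-pgm-path "/mnt/viya-share/data/importIris.sas" \
    --rem-output-dir "/mnt/viya-share/data/@FILESETID@"
done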

 

The results: execution times and overall throughput

 

Each set of tests ran for a few minutes, after which we collected and consolidated multiple job runtime statistics. We learned in the previous article that you can get job metrics with the following commands (using JSON output to avoid rounding of the reported times):

 

sas-viya --output json batch jobs list --details
sas-viya --output json batch filesets list
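
To consolidate the collected records, the JSON output can be filtered with a tool such as jq. The snippet below is only a sketch: the items wrapper and the field names (name, submittedTime, startedTime, endedTime) are hypothetical placeholders, so check the actual JSON returned by the commands above for the exact names in your release.

# Extract per-job time stamps into a CSV file (field names are placeholders)
sas-viya --output json batch jobs list --details \
  | jq -r '.items[] | "\(.name),\(.submittedTime),\(.startedTime),\(.endedTime)"' \
  > job_times.csv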

 

From these results, we focused on the following time stamps:

 

Time Stamp | Description
File Set Creation | This marks the beginning of each submission from the client.
Job Submitted | This is when the client sends the job to the midtier (batch service), after uploading any input files into the file set.
Job Started | This is when the backend receives a processing request from the midtier. For on-demand batch servers, it is the request to start a new pod, while for pre-started batch servers it is directly the request to start the job.
Job Ended | This is when the job ends on the backend server. For on-demand batch servers, this is after shutting down the server pod.
Job Modified | This marks the end of the submission, when the midtier marks the job as completed.

 

Then, from these time stamps, we calculated the following metrics:

 

Metric | Formula | Description
Before Job Overhead | Job Submitted - File Set Creation | Includes the time to upload the input files.
After Job Overhead | Job Modified - Job Ended | Difference between when the job ends on the midtier versus when it ends on the backend. Includes the time to upload the output files.
Midtier Overhead | Before Job Overhead + After Job Overhead | The time the midtier spends managing the job.
Job Pending Time | Job Started - Job Submitted | How long the job was kept pending because the backend was already busy processing other jobs.
Job Backend Runtime | Job Ended - Job Started | The total job runtime on the backend. For on-demand batch servers, this includes starting and stopping a new pod and server process.
Total Runtime | Job Modified - File Set Creation - Job Pending Time | The total job runtime as seen from the client, excluding the pending time.
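
As a quick illustration of how these metrics can be derived from the raw time stamps, the difference between two ISO 8601 time stamps can be computed in seconds with GNU date and bc; the two sample values below are made up for the example.

# Job Backend Runtime = Job Ended - Job Started (sample values, requires GNU date)
started="2024-10-12T10:15:03.120Z"
ended="2024-10-12T10:15:20.340Z"
echo "$(date -d "$ended" +%s.%N) - $(date -d "$started" +%s.%N)" | bc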

 

The results show significant gains when using high-throughput batch servers (all times are in seconds unless otherwise noted):

 

Metric | Default Batch Server (Min / Max / Average) | High-Throughput Batch Server (Min / Max / Average)
Before Job Overhead | 0.58 / 7.36 / 0.99 | 0.14 / 0.75 / 0.19
After Job Overhead | 0.42 / 8.61 / 3.76 | 0.01 / 1 / 0.52
Midtier Overhead | 1.19 / 15.97 / 4.74 | 0.16 / 1.33 / 0.71
Job Pending Time | 0 / 5 min 41 s / 2 min 47 s | 0 / 2 min 31 s / 1 min 15 s
Job Backend Runtime | 14 / 24 / 17.2 | 6.31 / 8.95 / 7.41
Total Runtime | 16.78 / 33.30 / 21.35 | 7.19 / 9.28 / 8.12

 

Looking at the average measures, we can see that handling the file set adds about 4 seconds to the midtier overhead. Starting and stopping pods more than doubles the job backend runtime and the total runtime, while the overall increase in server utilization doubles the time jobs spend waiting (pending time). Looking at the min and max values, we can also see that the variability between best and worst cases is much higher for default batch servers, while times are much more consistent in the high-throughput case.

 

Considering that the SAS code in each job runs in about 5.2 seconds, we can see that the high-throughput batch only adds a total overhead of less than 3 seconds, while the default server adds on average more than 16 seconds!

 

After analyzing these “per job” metrics, we also looked at overall times and system throughput.

 

Running 200 jobs on the default batch server took a total of 8 minutes and 7 seconds, while the same test on the high-throughput batch ended in 3 minutes and 47 seconds: less than half. This corresponds to a measured throughput of ~25 jobs/minute in the default case, versus ~53 jobs/minute in the high-throughput case. I'd say the name fits! Again, for comparison, the theoretical maximum throughput without any overhead would be ~80 jobs/minute (7 cores running 5.2-second jobs for 60 seconds).

 

Monitoring the system

 

As we discussed in the initial article, the new batch capabilities not only lower the execution times of short batch jobs and increase their throughput, but also help lower resource utilization, including the number of API calls between SAS Viya services and to the Kubernetes API. In turn, this lowers the risk of timeouts and failures, making the environment more stable. The metrics above already show that the system is more loaded in the default case, which leads to less consistent execution times with greater variability. We also checked a few Grafana dashboards to verify whether this is visible in other metrics as well.

 

The following dashboard shows the spikes in the Files service in the default case (left) versus the optimized case (right): the latter uses fewer resources for a shorter time.

 

[Figure: Grafana dashboard of the Files service, default case (left) versus high-throughput case (right)]

 

This is true across other services as well. For example, the Authorization service dashboard shows almost identical spikes at the same times. This makes sense, since each call to the Files service has to be authorized, which requires a call to the Authorization service.

 

Looking at the Kubernetes API read/write metrics, we can see that the high-throughput case is almost indistinguishable from the background noise. The default case, instead, clearly sends a lot of API requests to Kubernetes while creating and destroying batch pods:

 

[Figure: Grafana dashboard of Kubernetes API read/write requests, default case versus high-throughput case]

 

The pod churn is highlighted in the following screenshot, taken during the execution of the default case, where you can see batch server pods in various stages of starting, running, and terminating:

 

[Screenshot: batch server pods in different states (starting, running, terminating) during the default case]
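
A similar live view of the pod churn can be obtained directly with kubectl. This is only a sketch, assuming the deployment namespace is sas-viya and that the batch server pod names contain the string "batch"; adjust both to your environment.

# Watch batch server pods being created and terminated during the test
kubectl get pods -n sas-viya --watch | grep -i batch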

 

Conclusion

 

In this series of three articles, we have described the new capabilities that support high-throughput batch server environments, including pre-started batch servers and shared storage between batch clients and servers. We have presented some metrics comparing default batch servers with reusable batch servers and optimized I/O. The results clearly highlight the improved processing times, better job throughput, and lower resource utilization.

 

 

Find more articles from SAS Global Enablement and Learning here.
