This post is the second in a series investigating the new batch capabilities introduced with the SAS Viya 2024.04 stable release. Continuing from where we left off, we’ll now look at the options that provide optimized input and output file storage.
[ Part 1 | Part 3 ]
In short, SAS introduced new batch capabilities that benefit customers who engage in extensive batch processing involving high volumes of time-sensitive SAS jobs of very short duration.
Default Input and Output File Storage
By default, batch clients do not interact directly with batch servers. File transfers are mediated by the Files service, which uses the SAS Infrastructure Data Server (PostgreSQL) to store them. This additional step can add significant latency to short, high-throughput batch submissions.
Let’s assume you have an input CSV file and you want to run a batch SAS program that reads it, performs some quick ETL, and sends back some output. Here is what happens by default.
1.1 The user prepares the input CSV file and the program in a local directory on the client.
1.2 The batch client sends the input CSV file and the SAS program file to the Files service as an input file set.
1.3 The Files service writes the content of the file set to PostgreSQL.
2.1 When it’s ready to run the job, the batch server queries the Files service to get the input file set.
2.2 The Files service retrieves the file set from PostgreSQL.
2.3 The Files service sends the file set to the batch server.
2.4 The batch server writes the content of the input file set locally in a temporary directory.
SAS Batch input file set
Similarly, to get back the results:
3.1 At the end of the process, the batch server writes all output files locally in a temporary directory. This includes a copy of the input SAS program, the LOG and LIST output files, and any output that the client requested.
3.2 The server sends all output files to the Files service as an output file set.
3.3 The Files service writes the content of the file set to PostgreSQL.
4.1 When the user wants to retrieve the results, the batch client queries the Files service to get the output file set.
4.2 The Files service retrieves the file set from PostgreSQL.
4.3 The Files service sends the file set to the batch client.
4.4 The batch client saves the content of the output file set to a local directory.
SAS Batch output file set
Only at the end of this process can the user open the resulting output or consult the execution logs.
Just going through this list takes time, doesn’t it?
Optimized Input and Output File Storage
Starting with version 1.9 of the batch CLI, available since April 2024, you can use new options to optimize input and output file sharing between the client and the server. In short, if there is shared storage available to both the client and the server, you can use it to store input and output files, and the Files service (with its PostgreSQL backend) is not used at all. The 14-step process listed above shrinks to two simple steps:
The user prepares the input CSV file and the program in a shared directory that both the client and the server can access.
At the end of the process, the batch server saves all output files to the shared directory, where they are immediately available.
Easy, short, effective.
SAS Batch optimized I/O
A note about the shared storage architecture.
Although the focus of this article is not to discuss how to best architect a storage solution that can satisfy the use case at hand, I don’t want to leave you wandering in the dark. Here are some links to get you started on the topic:
File System and Shared Storage Recommendations: https://go.documentation.sas.com/doc/en/sasadmincdc/default/itopssr/n0ampbltwqgkjkn1j3qogztsbbu0.htm#p0u8ihdebannnxn1oe7fh89kavwj
SAS on Azure Architecture: https://learn.microsoft.com/en-us/azure/architecture/guide/sas/sas-overview#permanent-remote-storage-for-sas-data
Shared file system to use SAS Grid Manager or SAS Viya: https://docs.aws.amazon.com/whitepapers/latest/best-practices-for-deploying-sas-server/shared-file-system-to-use-sas-grid-manager.html
Azure Storage for SAS Architects: Do You Want to Share? https://communities.sas.com/t5/SAS-Communities-Library/Azure-Storage-for-SAS-Architects-Do-You-Want-to-Share/ta-p/687864
Keep in mind that most of the online literature on the topic correctly focuses on architecting a storage solution that can satisfy the high-throughput demands of SAS Viya computing processes. Our use case presents two key differences:
As the picture above shows, the storage area should be shared between a submitting client and the Kubernetes nodes where a batch server could start, not only between the Kubernetes nodes.
This area is only used to host input and output files, not the whole computing process data I/O. For this reason, the I/O requirements may be much lower than those usually required, for example, for SASWORK folders.
The default NFS storage optionally created by the Infrastructure as Code (IaC) GitHub projects is a perfect starting point to use as a practical example.
Configuring SAS Viya with the Shared File System
At this point, let’s assume you have a shared file system that can be used by the batch clients and servers. In our test environment, we used the SAS Viya 4 Infrastructure as Code (IaC) for Open Source Kubernetes to create a Jump server and an NFS server while deploying the infrastructure for the Kubernetes cluster. The Jump server will be our client, so we deployed the sas-viya CLI with the batch plug-in there. We used the SAS Viya 4 Deployment project to deploy SAS Viya and its prerequisites, including configuring the shared storage for us. That includes mounting the NFS storage on the Jump host:
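If you had to reproduce that mount manually, a minimal sketch could look like the following; the NFS server address is an assumption, not a value taken from our environment:
sudo mkdir -p /viya-share
# mount the NFS export at the path the batch client will use
sudo mount -t nfs nfs.example.com:/export /viya-share
With this in place, /viya-share/dac/data on the Jump server is the same physical folder as /export/dac/data on the NFS server.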
It also creates a patch transformer that mounts the proper subfolder of that path into all SAS server pods.
In summary, the path /mnt/viya-share/data (as seen by server processes) and the path /viya-share/dac/data (as seen by the client) both map to the shared NFS folder /export/dac/data.
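A quick way to verify that mapping is to create a test file from the client and confirm that server processes can see it under their own path (the file name here is just an example):
# on the Jump server (the batch client)
echo "shared storage test" > /viya-share/dac/data/hello.txt
Any program launched by a batch server can then read that same file as /mnt/viya-share/data/hello.txt.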
Now, it’s time to use it!
Let’s use the Shared Storage
To use the shared storage, we must properly reference input and output files so that both the client and the backend process know where to find them, avoiding the file sets managed by the Files service. We can follow the examples and references available in Using the batch Plug-In for the SAS Viya Platform Command-Line Interface.
For the input files, it’s just a matter of referencing them correctly. If we look at the Example: Add Input Data to the File Set, we see the filename statement is using the BATCHJOBDIR path:
filename csv "!BATCHJOBDIR/mydata.csv";
BATCHJOBDIR always points to the default temporary server directory where the batch server reads and writes files, including those read from, or written to, the file set. We can simply change the statement to use the backend path where the input file is stored. In our test environment that’s the /mnt/viya-share/data/ path discussed in the previous section, so the statement becomes:
filename csv "/mnt/viya-share/data/mydata.csv";
For the output files, we can use the new options introduced with version 1.9 of the batch CLI described in Output Specifications and File Locations, as shown in the Example: Specify a Location That Is Mounted in a Pod: --rem-output-dir, --rem-list-path, --rem-log-path.
Let’s not forget the SAS program with the code to execute. The same Example: Specify a Location That Is Mounted in a Pod shows how to use the --rem-pgm-path option instead of --pgm when the program is directly accessible by the batch server in the Kubernetes pod. This instructs the batch CLI to avoid uploading the program to the file set, and similarly tells the batch server to avoid downloading it.
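Putting the input side together, a program such as importIris.sas could be as simple as the following sketch; the CSV layout and the output table name are assumptions for illustration:
/* importIris.sas: read the input directly from the shared path instead of BATCHJOBDIR */
filename csv "/mnt/viya-share/data/iris.csv";

proc import datafile=csv out=work.iris dbms=csv replace;
   guessingrows=max;
run;

proc means data=work.iris;
run;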
To summarize with an example, compare the following batch client commands:
sas-viya batch jobs submit-pgm -c high_throughput --name "Import-fileset" \
--pgm "importIris.sas" \
--job-file "iris.csv"
sas-viya batch jobs submit-pgm -c high_throughput --name "Import-filesystem" \
--rem-pgm-path "/mnt/viya-share/data/importIris.sas" \
--rem-output-dir "/mnt/viya-share/data/@FILESETID@"
The first command leverages the file set facility: --pgm and --job-file point to files available locally on the client, and there are no “output options”.
The second command leverages shared storage: there are no input files to upload, --rem-pgm-path tells the batch server where to find the SAS program inside its pod, and --rem-output-dir tells the batch server where to save the output log and listing, again inside its pod and skipping the file set. The input program is not listed in this second command, since it is directly referenced in the revised code as explained above.
Note the @FILESETID@ placeholder in the output directory specification. This will be replaced at runtime with the value of the file set ID (every batch job gets a file set ID even if it’s not storing any files in the file set), and the output directory will be automatically created by the batch server.
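For example, in the test run shown later in this post, the second job’s file set ID is JOB_20241008_210032_554_1, so --rem-output-dir "/mnt/viya-share/data/@FILESETID@" resolves to /mnt/viya-share/data/JOB_20241008_210032_554_1 on the server side, a directory the client can read right away as /viya-share/dac/data/JOB_20241008_210032_554_1.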
Also note that, while these examples use the high-throughput context, leveraging shared storage does not require pre-started batch servers. But we can use both capabilities to build a high-throughput batch facility!
What is the measurable impact?
SAS Viya documentation explains that, by default, SAS batch clients and servers only store selected files in the job's file set: the SAS program file and the LOG and LIST output files. You can optionally use the --job-file option to specify input files that are also stored in the file set, to be copied to the batch server pod and used by the SAS program. This latter option should only be used for small files. Large files should always be handled differently: for example, data can be stored in a database, or files can be placed in a volume mounted to the batch server pod.
In summary, a job’s file set should be small both in size and number of files. Does it really make any difference, in terms of total runtime, whether you use it or not?
In our test environment we measured the runtime difference between the sample commands listed in the previous section.
We extracted job statistics with the following command (using JSON format to avoid rounding of the reported times):
sas-viya --output json batch jobs list --details
Here are the results:
{
"items": [
{
"contextId": "99dddb38-c821-463e-89c3-f81fd9f9d656",
"createdBy": "viya_admin",
"creationTimeStamp": "2024-10-08T21:00:17.680645Z",
"endedTimeStamp": "2024-10-08T21:00:25Z",
"id": "9b3dbeb7-e4b3-4fa8-b5c3-553b2d9e9503",
"modifiedBy": "user1",
"modifiedTimeStamp": "2024-10-08T21:00:25.043386Z",
"name": "Import-fileset",
"processId": "6d28ba2e-d6b1-4255-9025-f57943795d8d",
"returnCode": 0,
"startedTimeStamp": "2024-10-08T21:00:17.763363Z",
"state": "completed",
"submittedTimeStamp": "2024-10-08T21:00:17.680645Z"
},
{
"contextId": "99dddb38-c821-463e-89c3-f81fd9f9d656",
"createdBy": "viya_admin",
"creationTimeStamp": "2024-10-08T21:00:32.707924Z",
"endedTimeStamp": "2024-10-08T21:00:39Z",
"id": "c075e008-3b97-4266-95c9-f685266b9d90",
"modifiedBy": "user1",
"modifiedTimeStamp": "2024-10-08T21:00:39.675398Z",
"name": "Import-filesystem",
"processId": "88ebaa58-92e8-4518-ae6c-b8688ac5989e",
"returnCode": 0,
"startedTimeStamp": "2024-10-08T21:00:32.847071Z",
"state": "completed",
"submittedTimeStamp": "2024-10-08T21:00:32.707924Z"
}
]
}
There are a lot of timestamps! Let’s take the earliest and latest for both jobs: modifiedTimeStamp minus creationTimeStamp. This gives about 7.4s (using the file set) versus 7s (using the shared storage). This result proved consistent across a few repetitions of the test. It seems the I/O optimization can save up to 5% of the runtime.
But there is a catch: the file set is created, and files are uploaded, BEFORE the job gets submitted. When these job timestamps are recorded, that time has already passed, and it has not been captured. We could increase the logging levels to record more detailed transaction times, but there is a simpler way. We have seen that a file set ID is always created, even when it’s not used. We can check the file set creation times with the following command:
sas-viya --output json batch filesets list
Here is the output for our two sample jobs:
{
"items": [
{
"contextId": "99dddb38-c821-463e-89c3-f81fd9f9d656",
"createdBy": "viya_admin",
"creationTimeStamp": "2024-10-08T21:00:16.621873Z",
"id": "JOB_20241008_210016_621_1",
"modifiedBy": "viya_admin",
"modifiedTimeStamp": "2024-10-08T21:00:16.621873Z"
},
{
"contextId": "99dddb38-c821-463e-89c3-f81fd9f9d656",
"createdBy": "viya_admin",
"creationTimeStamp": "2024-10-08T21:00:32.554613Z",
"id": "JOB_20241008_210032_554_1",
"modifiedBy": "viya_admin",
"modifiedTimeStamp": "2024-10-08T21:00:32.554613Z"
}
]
}
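If you want to extract just the relevant timestamps instead of reading the raw JSON, a small jq filter over the two listings can help (a sketch, assuming jq is available on the client):
sas-viya --output json batch jobs list --details | \
  jq -r '.items[] | [.name, .creationTimeStamp, .modifiedTimeStamp] | @tsv'
sas-viya --output json batch filesets list | \
  jq -r '.items[] | [.id, .creationTimeStamp] | @tsv'
Subtracting the file set creation time from the corresponding job creation time gives the upload overhead discussed below.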
The difference between file set creation time and job creation time is about 1 second when the file set is used, and only about 0.15 seconds when it is not. Adding those to the job runtimes gives roughly 8.4 seconds end-to-end with the file set versus roughly 7.1 seconds with shared storage: the optimization actually saves more than 10% of the total runtime.
Repeating this test multiple times, we measured that creating the file set and uploading the files took between 0.5s and 1s, while the time without file uploads was consistently between 0.14s and 0.16s.
In summary, in our little experiment the optimized I/O reduced the total runtime by more than 10% and made runtimes more consistent.
Is it worth it? With a single job like our simple test, maybe not. In a high-throughput environment with hundreds of short-lived batch jobs, the savings can compound and become noticeable.
Coming up next
We have seen how to configure shared storage that is usable from both batch clients and servers, and which options are required to leverage it so that files are no longer uploaded to and downloaded from a file set managed by the Files service.
In the next post we’ll present a longer performance study that compares default batch servers versus reusable batch servers with optimized I/O.
Find more articles from SAS Global Enablement and Learning here.