In Part 2, we explored how SAS Model Manager simplifies the use of CAS Gateway. We learned that the standard integration uses a Python wrapper on a single CAS node (single=True). This “single-thread” execution is still exceptionally fast due to Zero-Copy memory access (Apache Arrow).
But the underlying CAS Gateway architecture is capable of much more. It was built for massive parallel processing (MPP).
In this final part, we will pop the hood and show you how to write custom code to unlock the full power of your CAS cluster. This is for the advanced user who needs to score hundreds of millions of rows in seconds.
Here is a diagram of the architecture we are working with:
We can see an MPP CAS server called BU1 (short for business unit 1) with 4 nodes and one table partitioned across those nodes.
The gateway.run action exposes two critical parameters that control parallelism:

- single: when True (the mode SAS Model Manager uses by default), your code runs on one worker node; when False, the gateway dispatches your code to every worker node.
- nthreads: the number of threads each node's Python process may use.
If you have a 10-node CAS cluster and set single=False (and nthreads=1), you are effectively launching 10 parallel Python processes. Each process sees only the slice of data local to its node.
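To build intuition before we touch the real API, here is a toy, standalone sketch (plain pandas/numpy, not actual gateway code; all names here are ours) of what single=False means: the table is split into node-local partitions, and each "node" runs the same scoring function on only its own slice.

```python
import numpy as np
import pandas as pd

# Toy stand-in for a CAS table with 1 million rows
table = pd.DataFrame({"x": np.arange(1_000_000)})

def score_partition(partition: pd.DataFrame) -> pd.DataFrame:
    # Each "node" sees only its local slice of the table
    out = partition.copy()
    out["score"] = out["x"] * 2  # placeholder for real model scoring
    return out

# single=False on a 10-node cluster behaves conceptually like this:
n_nodes = 10
partitions = np.array_split(table, n_nodes)
results = [score_partition(p) for p in partitions]  # in CAS these run in parallel
scored = pd.concat(results, ignore_index=True)

print(len(partitions), len(scored))  # 10 partitions, all rows scored
```

The key difference from this toy: in CAS the partitions already live on the worker nodes, so no data moves over the network when the code runs.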
Figure: Comparison of two CAS Gateway usage patterns. The top panel shows the standard SAS Model Manager scoring flow (single=True): Model Manager uses a Python wrapper that runs on a single CAS worker node while still leveraging CAS Gateway and Apache Arrow for zero-copy memory access. This reduces serialization overhead but does not distribute your scoring logic. The lower panel shows a custom code approach (for example, run from SAS Studio) that uses gateway.run with single=False: the gateway dispatches a Python interpreter to every worker node so each node processes its local partition in parallel. At scale, the custom approach can deliver much higher scoring throughput, but it may require refactoring the scoring logic to be stateless, careful nthreads tuning, and more operational oversight.
To leverage this, you must step outside the SAS Model Manager UI and write a custom CAS action call, typically in a Python notebook using swat or directly in SAS Studio.
Let’s look at a concrete example based on the SAS Model Manager Quick Start tutorial’s “Evaluate and Deploy a Python Model” exercise. This example follows the tutorial’s Python scoring flow and shows how the Wrapper code can be adapted for distributed execution with single=False when you need extreme scale.
First, connect and load the action set:
import os
import swat
import numpy as np
# retrieve connection parameters from environment variables
cas_host = os.environ.get('SAS_CAS_SERVER_DEFAULT_CLIENT_SERVICE_HOST')
cas_port = os.environ.get('SAS_CAS_SERVER_DEFAULT_CLIENT_SERVICE_PORT')
auth_token = os.environ.get('SAS_CLIENT_TOKEN')
# establish connection to CAS server
cas_session = swat.CAS(cas_host, cas_port, password=auth_token)
# load the gateway action set for distributed Python/R execution
cas_session.loadActionset("gateway")
Get the Gateway_Wrapper.sas code from the tutorial in SAS Model Manager and adapt it for distributed execution. The key change is that you will set single=False when calling gateway.run, and ensure your code is stateless and can run on any node with only local data access.
Now, define the Python code string that will run on each node. Notice how we handle reading and writing. Key point: gateway.read_table only reads the local partition of data!
In practice: paste the escaped string (everything between the triple double-quotes) into a text editor, use find/replace to remove the escape characters, strip the wrapping triple-quotes (or leading/trailing quotes), then save the result as the code block you submit to gateway.run.
distributed_score_code = """
import sys
import pandas as pd
import pyarrow as pa
import traceback
import logging
logging.captureWarnings(True)
sys.path.append("/models/resources/viya/022b89f5-b986-45a5-912f-55a04ca07f23/")
import settings_022b89f5_b986_45a5_912f_55a04ca07f23
settings_022b89f5_b986_45a5_912f_55a04ca07f23.pickle_path = "/models/resources/viya/022b89f5-b986-45a5-912f-55a04ca07f23/"
import hmeq_logistic_score
row_by_row = False
# gateway.read_table returns only the partition local to this node
data = gateway.read_table({'caslib': gateway.args['input_library'], 'name': gateway.args['input_table']})
rename_dict = {}  # map input column names to the names the score function expects, if needed
if rename_dict:
    data = data.rename(columns=rename_dict)
scoring_data = data[["REASON","JOB","YOJ","DEROG","DELINQ","CLAGE","NINQ","CLNO","DEBTINC"]]
try:
    scored_data = hmeq_logistic_score.score_hmeq_log_reg_model(**scoring_data)
    output = pd.concat([scored_data.reset_index(drop=True), data.reset_index(drop=True)], axis=1)
except Exception as e:
    sys.stdout.write("An error occurred while scoring the model using batch data processing. Error: " + str(e) + "\n")
    sys.stdout.write(traceback.format_exc())
    sys.stdout.write("Retrying using row-by-row data processing for scoring...\n")
    row_by_row = True
if row_by_row:
    scored_data = pd.DataFrame(columns=["EM_CLASSIFICATION","EM_EVENTPROBABILITY"])
    for index, row in scoring_data.iterrows():
        EM_CLASSIFICATION, EM_EVENTPROBABILITY = hmeq_logistic_score.score_hmeq_log_reg_model(**row)
        scored_data.loc[index] = [EM_CLASSIFICATION, EM_EVENTPROBABILITY]
    output = pd.concat([scored_data.reset_index(drop=True), data.reset_index(drop=True)], axis=1)
    sys.stdout.write("Note: Batch data processing is recommended for faster scoring of models.\n")
# all-null columns infer as Arrow's null type; cast them to string so the output table can store them
arrow_output = pa.Table.from_pandas(output)
arrow_output = arrow_output.cast(pa.schema([pa.field(field.name, pa.string() if field.type == pa.null() else field.type) for field in arrow_output.schema]))
gateway.write_table(arrow_output, {'caslib': gateway.args['output_library'], 'name': gateway.args['output_table'], 'promote': True})
"""
# SUBMIT TO CAS (Distributed Execution)
cas_session.gateway.run(
code=distributed_score_code,
args={
'input_library': 'Public',
'input_table': 'SCORE_1773700035_4',
'output_library': 'CASUSER',
'output_table': 'SCORE_1773700035_4'
},
# CRITICAL PARAMETERS FOR PARALLELISM
single=False, # Run on ALL nodes
nthreads=4 # Use 4 threads per node (if code supports multi-threading)
)
Tuning nthreads
How many threads should you use? In our brief but repeatable GEL (Global Enablement & Learning) lab tests, we found a possible “sweet spot”:
- Do not use all available vCPUs. The CAS main process needs CPU cycles to manage the table and orchestration.
- Recommendation: set nthreads to roughly half of the available vCPUs on your worker node. Example: if your CAS worker has 8 vCPUs, try nthreads=4. Going higher often resulted in diminishing returns due to resource contention between Python and CAS.
Practical tuning guidance
Telemetry to monitor: CAS node CPU and memory, per-node system load, gateway action logs, OS-level top/sar metrics, and CAS performance counters (if available).
The chart above shows elapsed time for scoring ~357,900,000 rows (lab test) using the CAS Gateway. The red bar labeled “1 (Single)” is the SAS Model Manager default (single=True) running on a single worker node; it took about 1,400 seconds (roughly 23 minutes) in our test. The blue bars show distributed runs (single=False) where the gateway ran on every worker node; nthreads is the number of threads used per node.
Key takeaways:
When running distributed code, some nodes may receive zero rows (especially with small datasets). This can cause downstream errors if not handled properly. In the example code, we check for empty input early and return an empty DataFrame with the expected schema to avoid KeyErrors later in the scoring logic.
# Handle the case of no rows (empty batch) early to avoid downstream KeyError
if input_array.empty:
    print("NOTE: input_array is empty — returning empty batch result")
    return pd.DataFrame(
        {"EM_CLASSIFICATION": pd.Series(dtype="object"),
         "EM_EVENTPROBABILITY": pd.Series(dtype="float")}
    )
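Wrapped into a self-contained helper, the guard can be unit-tested locally before you deploy it. This is a sketch: score_partition_safe is our own name, and the threshold-based scoring below is a stand-in for the real model call.

```python
import pandas as pd

EXPECTED_COLUMNS = {"EM_CLASSIFICATION": "object", "EM_EVENTPROBABILITY": "float"}

def score_partition_safe(input_array: pd.DataFrame) -> pd.DataFrame:
    """Score one node-local partition, tolerating empty partitions."""
    # Handle the case of no rows (empty batch) early to avoid downstream KeyError
    if input_array.empty:
        print("NOTE: input_array is empty — returning empty batch result")
        return pd.DataFrame(
            {name: pd.Series(dtype=dtype) for name, dtype in EXPECTED_COLUMNS.items()}
        )
    # Placeholder scoring logic: classify on a simple probability threshold
    prob = input_array["DEBTINC"].clip(0, 100) / 100.0
    return pd.DataFrame({
        "EM_CLASSIFICATION": (prob > 0.5).map({True: "1", False: "0"}),
        "EM_EVENTPROBABILITY": prob.astype("float"),
    })

# An empty partition yields an empty frame with the expected schema
empty = score_partition_safe(pd.DataFrame({"DEBTINC": pd.Series(dtype="float")}))
print(list(empty.columns))  # ['EM_CLASSIFICATION', 'EM_EVENTPROBABILITY']
```

Because the empty case returns the same columns and dtypes as a normal result, the downstream pd.concat and Arrow conversion succeed on every node, whether or not that node received any rows.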
If running distributed code is so fast, why doesn’t SAS Model Manager do it by default? Because distributed execution demands more of your code and your operations: the scoring logic must be stateless and able to run on any node with only local data, nthreads must be tuned to the hardware, and edge cases such as empty partitions must be handled. The single=True default trades peak throughput for a wrapper that works out of the box for any registered model.

The CAS Gateway represents a significant leap forward for open-source integration in SAS Viya. By understanding these two modes, you can architect the right solution for your specific performance needs.
TL;DR: For extreme-scale scoring, set single=False to run the gateway on every CAS worker and tune nthreads per node. Distributed execution can deliver massive throughput but requires stateless scoring logic, careful resource tuning, and stronger operational practices.
Prerequisites:
- SAS Viya LTS 2024.09+ with a multi-node cluster and SAS Model Manager 2025.01+ (or access to the gateway action set via swat).
- Python 3.10+ and libraries: pandas, pyarrow, swat (and any model-specific packages) available on each worker node image.
- Monitoring access to CAS node CPU/memory metrics and logs for tuning and debugging.
Security / Trust note: Distributed execution runs code on every worker node — only deploy vetted model bundles and avoid loading arbitrary pickles from unknown sources. Maintain artifact provenance and apply standard security reviews before enabling single=False in production.
Find more articles from SAS Global Enablement and Learning here.