
Supercharging SAS Model Manager with CAS Gateway (Part 3: Going Beyond)


In Part 2, we explored how SAS Model Manager simplifies the use of CAS Gateway. We learned that the standard integration uses a Python wrapper on a single CAS node (single=True). This “single-thread” execution is still exceptionally fast due to Zero-Copy memory access (Apache Arrow).

But the underlying CAS Gateway architecture is capable of much more. It was built for massive parallel processing (MPP).

In this final part, we will pop the hood and show you how to write custom code to unlock the full power of your CAS cluster. This is for the advanced user who needs to score hundreds of millions of rows in seconds.

 

In this three-part series, we will unpack this technology:

  • Part 1 explores the concept and architecture.
  • Part 2 explains how SAS Model Manager leverages this today (The “Comfort” Zone).
  • Part 3 (this post) explores how to unlock maximum performance using code (The “Power” Zone).

 

The architecture context

 

Here is a diagram of the architecture we are working with:

01_DE_supercharging-sas-model-manager-cas-gateway-page-3-mpp-cas-1536x564.png


 

 

We can see an MPP CAS server called BU1 (short for business unit 1) with 4 nodes and one table partitioned across those nodes.

 

The Key Parameters: single and nthreads

 

The gateway.run action exposes two critical parameters that control parallelism:

  1. single (boolean):
    • If True (MM default): Code runs on one designated worker node.
    • If False: Code runs on every worker node in the cluster.
  2. nthreads (integer):
    • Defines the number of threads per node to use for execution.

 

If you have a 10-node CAS cluster and set single=False (and nthreads=1), you are effectively launching 10 parallel Python processes. Each process sees only the slice of data local to its node.
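The arithmetic above can be sketched as a tiny helper. Note that `effective_workers` is an illustrative name for this post, not part of swat or the gateway action set:

```python
# Hypothetical helper: how many parallel Python execution streams a
# gateway.run call launches, given the cluster size and the two parameters.
def effective_workers(n_nodes: int, single: bool, nthreads: int = 1) -> int:
    """Parallel Python processes/threads launched across the cluster."""
    nodes_used = 1 if single else n_nodes
    return nodes_used * nthreads

# SAS Model Manager default on a 10-node cluster: one node, one thread
print(effective_workers(10, single=True))               # -> 1
# Custom code: every node, one thread each
print(effective_workers(10, single=False))              # -> 10
# Every node, 4 threads each
print(effective_workers(10, single=False, nthreads=4))  # -> 40
```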

 

 

Execution with single=True (SAS Model Manager default)

02_DE_supercharging-sas-model-manager-cas-gateway-page-3-singeltrue-1536x990.png

 

Execution with single=False (Custom code)

03_DE_supercharging-sas-model-manager-cas-gateway-page-3-singelfalse-1024x785.png

 

Figure: Comparison of two CAS Gateway usage patterns. The top panel shows the standard SAS Model Manager scoring flow (single=True): Model Manager uses a Python wrapper that runs on a single CAS worker node while still leveraging CAS Gateway and Apache Arrow for zero-copy memory access. This reduces serialization overhead without distributing your scoring logic. The lower panel shows a custom-code approach (for example, run from SAS Studio) that uses gateway.run with single=False: the gateway orchestration dispatches Python interpreters to every worker node, so each node processes its local partition in parallel. At scale, the custom-code approach can deliver much higher scoring throughput, but it may require refactoring the scoring logic to be stateless, careful nthreads tuning, and more operational oversight.

 

 

Example: Custom Scoring with gateway (Advanced)

 

To leverage this, you must step outside the SAS Model Manager UI and write a custom CAS action call, typically in a Python notebook using swat or directly in SAS Studio.

 

Let’s look at a concrete example based on the SAS Model Manager Quick Start tutorial’s “Evaluate and Deploy a Python Model” exercise. This example follows the tutorial’s Python scoring flow and shows how the Wrapper code can be adapted for distributed execution with single=False when you need extreme scale.

 

First, connect and load the action set:

import os
import swat
import numpy as np

# retrieve connection parameters from environment variables
cas_host = os.environ.get('SAS_CAS_SERVER_DEFAULT_CLIENT_SERVICE_HOST')
cas_port = os.environ.get('SAS_CAS_SERVER_DEFAULT_CLIENT_SERVICE_PORT')
auth_token = os.environ.get('SAS_CLIENT_TOKEN')

# establish connection to CAS server
cas_session = swat.CAS(cas_host, cas_port, password=auth_token)

# load the gateway action set for distributed Python/R execution
cas_session.loadActionset("gateway")

 

Get the Gateway_Wrapper.sas code from the tutorial in SAS Model Manager and adapt it for distributed execution. The key change is that you will set single=False when calling gateway.run, and ensure your code is stateless and can run on any node with only local data access.

 

04_DE_supercharging-sas-model-manager-cas-gateway-page-3b.png

 

Now, define the Python code string that will run on each node. Notice how we handle reading and writing. Key point: gateway.read_table only reads the local partition of data!

 

In practice: paste the escaped string (everything between the triple quotes) into a text editor, run a find/replace to remove the escape characters, remove the wrapping triple quotes or leading/trailing quotes, then save the result as the code block you submit to gateway.run.

 

# note: a raw string (r"""...""") so the embedded "\n" escapes reach the nodes intact
distributed_score_code = r"""
import sys
import pandas as pd
import pyarrow as pa
import traceback
import logging
logging.captureWarnings(True)
sys.path.append("/models/resources/viya/022b89f5-b986-45a5-912f-55a04ca07f23/")
import settings_022b89f5_b986_45a5_912f_55a04ca07f23
settings_022b89f5_b986_45a5_912f_55a04ca07f23.pickle_path = "/models/resources/viya/022b89f5-b986-45a5-912f-55a04ca07f23/"
import hmeq_logistic_score
row_by_row = False
data = gateway.read_table({'caslib': gateway.args['input_library'], 'name': gateway.args['input_table']})
rename_dict = ""
if rename_dict:
  data = data.rename(columns=rename_dict)
scoring_data = data[["REASON","JOB","YOJ","DEROG","DELINQ","CLAGE","NINQ","CLNO","DEBTINC"]]
try:
  scored_data = hmeq_logistic_score.score_hmeq_log_reg_model(**scoring_data)
  output = pd.concat([scored_data.reset_index(drop=True), data.reset_index(drop=True)], axis=1)
except Exception as e:
  sys.stdout.write("An error occurred while scoring the model using batch data processing. Error: " + str(e) + "\n")
  sys.stdout.write(traceback.format_exc())
  sys.stdout.write("Retrying using row-by-row data processing for scoring...\n")
  row_by_row = True
if row_by_row:
  scored_data = pd.DataFrame(columns=["EM_CLASSIFICATION","EM_EVENTPROBABILITY"])
  for index, row in scoring_data.iterrows():
    EM_CLASSIFICATION, EM_EVENTPROBABILITY = hmeq_logistic_score.score_hmeq_log_reg_model(**row)
    scored_data.loc[index] = [EM_CLASSIFICATION, EM_EVENTPROBABILITY]
  output = pd.concat([scored_data.reset_index(drop=True), data.reset_index(drop=True)], axis=1)
  sys.stdout.write("Note: Batch data processing is recommended for faster scoring of models.\n")
arrow_output = pa.Table.from_pandas(output)
arrow_output = arrow_output.cast(pa.schema([pa.field(field.name, pa.string() if field.type == pa.null() else field.type) for field in arrow_output.schema]))
gateway.write_table(arrow_output, {'caslib': gateway.args['output_library'], 'name': gateway.args['output_table'], 'promote': True})
"""

 

# SUBMIT TO CAS (Distributed Execution)
cas_session.gateway.run(
    code=distributed_score_code,
    args={
        'input_library': 'Public',
        'input_table': 'SCORE_1773700035_4',
        'output_library': 'CASUSER',
        'output_table': 'SCORE_1773700035_4'
    },
    # CRITICAL PARAMETERS FOR PARALLELISM
    single=False,  # Run on ALL nodes
    nthreads=4     # Use 4 threads per node (if code supports multi-threading)
)

 

Pro Tip

 

Tuning nthreads

 

How many threads should you use? In our brief but repeatable GEL (Global Enablement & Learning) lab tests, we found a possible “sweet spot”:

Do not use all available vCPUs. The CAS main process needs CPU cycles to manage the table and orchestration.

Recommendation: Set nthreads to roughly half of the available vCPUs on your worker node. For example, if your CAS worker has 8 vCPUs, try nthreads=4. Going higher often results in diminishing returns due to resource contention between Python and CAS.

 

Practical tuning guidance

 

  1. Start with nthreads = max(1, floor(vCPUs / 2)) on a small test partition.
  2. Measure node-level metrics during the run: CPU% (per-core), CAS CPU%, memory RSS, swap, load average, and I/O wait.
  3. If Python processes are CPU-bound (high Python CPU%) and CAS CPU% is low, try increasing nthreads by 1 and re-test.
  4. If CAS CPU% or memory pressure increases significantly, reduce nthreads to avoid impacting CAS responsiveness.
  5. Iterate with 2–3 runs; select the nthreads that provides the best throughput without causing sustained high memory/swap or CAS instability.
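Step 1 of the guidance above can be expressed as a one-liner. `starting_nthreads` is an illustrative helper for this post, not a SAS API:

```python
# Starting point for nthreads tuning: half the vCPUs, never below 1.
def starting_nthreads(vcpus: int) -> int:
    """Initial nthreads value per the tuning guidance: max(1, floor(vCPUs / 2))."""
    return max(1, vcpus // 2)

print(starting_nthreads(8))   # -> 4
print(starting_nthreads(1))   # -> 1
print(starting_nthreads(16))  # -> 8
```

From there, adjust up or down based on the node-level metrics described in steps 2 through 5.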

 

Telemetry to monitor: CAS node CPU and memory, per-node system load, gateway action logs, OS-level top/sar metrics, and CAS performance counters (if available).

 

Lab benchmark: runLang elapsed time vs nthreads

 

supercharging-sas-model-manager-cas-gateway-page-3c.png

 

The chart above shows elapsed time for scoring ~357,900,000 rows (lab test) using the CAS Gateway. The red bar labeled “1 (Single)” is the SAS Model Manager default (single=True) running on a single worker node; it took about 1,400 seconds (~23 minutes) in our test. The blue bars show distributed runs (single=False) where the gateway ran on every worker node; nthreads is the number of threads used per node.

  • single=True (single node): ~1,400 s (baseline)
  • single=False, nthreads=1: 294.4 s
  • single=False, nthreads=2: 160.1 s
  • single=False, nthreads=4: 97.8 s
  • single=False, nthreads=8: 65.1 s
  • single=False, nthreads=16: 50 s

Key takeaways:

  • Distributed execution (single=False) delivers a very large speedup vs the single-node default (up to ~28x in this run).
  • Increasing nthreads per node improves throughput but shows diminishing returns: the biggest gains come from moving from 1→4 threads, while gains from 8→16 are smaller.
  • Use the tuning guidance above (start at roughly half the vCPUs, monitor CPU/memory/CAS metrics) to find the sweet spot for your environment.
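As a quick check, the speedup figures quoted above follow directly from the elapsed times in the chart (a small sketch using the benchmark numbers from this post):

```python
# Speedup of each distributed run vs the single-node baseline (lab numbers above)
baseline = 1400.0  # single=True elapsed seconds (approx.)
distributed = {1: 294.4, 2: 160.1, 4: 97.8, 8: 65.1, 16: 50.0}  # nthreads -> seconds

for nthreads, elapsed in distributed.items():
    print(f"nthreads={nthreads:>2}: {baseline / elapsed:5.1f}x speedup")
```

The nthreads=16 row yields the ~28x figure cited in the takeaways.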

 

Handling Empty Batches

 

When running distributed code, some nodes may receive zero rows (especially with small datasets). This can cause downstream errors if not handled properly. To guard against it, check for empty input early inside your scoring function and return an empty DataFrame with the expected schema, which avoids KeyErrors later in the scoring logic:

# Handle the case of no rows (empty batch) early to avoid downstream KeyError
if input_array.empty:
  print("NOTE: input_array is empty — returning empty batch result")
  return pd.DataFrame(
    {"EM_CLASSIFICATION": pd.Series(dtype="object"),
    "EM_EVENTPROBABILITY": pd.Series(dtype="float")}
  )

 

Limitations and Considerations

 

If running distributed code is so fast, why doesn’t SAS Model Manager do it by default?

  1. State Management: Distributed code must be “stateless” across rows. Calculating a simple mean() of a column is tricky because each node only sees a subset of rows.
  2. Resource Usage: Launching 50 Python processes on a cluster requires careful resource management.
  3. Debugging: An error on Node #7 can be harder to troubleshoot than an error on a single node.
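To make point 1 concrete, here is a plain-Python sketch (no CAS involved) of why a global mean needs a separate combine step when each node only sees its own partition:

```python
# Illustrative sketch (not the CAS Gateway API): a global mean needs two stages.
# Each "node" only sees its local slice, so it can only emit partial aggregates.
partitions = [[1.0, 2.0, 3.0], [10.0], [4.0, 10.0]]  # data slices on 3 nodes

# Stage 1: per-node partial aggregates (what each gateway process could emit)
partials = [(sum(p), len(p)) for p in partitions]

# Stage 2: a single combine step produces the correct global mean
total, count = map(sum, zip(*partials))
print(total / count)  # -> 5.0
```

Averaging the per-node means directly would give the wrong answer here (the partitions have different sizes), which is exactly why stateful computations are tricky in distributed code.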

 

Conclusion

 

The CAS Gateway represents a significant leap forward for open-source integration in SAS Viya.

  • For small and medium (T-shirt size) workloads: Stick with the SAS Model Manager defaults (single=True). You get large speed gains from Zero-Copy/Arrow without added complexity.
  • For large and XL workloads: Write custom gateway actions with single=False. You can process billions of rows by throwing hardware at the problem.

By understanding these two modes, you can architect the right solution for your specific performance needs.

 

TL;DR, Prerequisites, and Security

 

TL;DR: For extreme-scale scoring, set single=False to run the gateway on every CAS worker and tune nthreads per node. Distributed execution can deliver massive throughput but requires stateless scoring logic, careful resource tuning, and stronger operational practices.

Prerequisites:

  • SAS Viya LTS 2024.09+ with a multi-node cluster and SAS Model Manager 2025.01+ (or access to the gateway action set via swat).
  • Python 3.10+ and libraries: pandas, pyarrow, swat (and any model-specific packages) available on each worker node image.
  • Monitoring access to CAS node CPU/memory metrics and logs for tuning and debugging.

Security / Trust note: Distributed execution runs code on every worker node — only deploy vetted model bundles and avoid loading arbitrary pickles from unknown sources. Maintain artifact provenance and apply standard security reviews before enabling single=False in production.

 

Find more articles from SAS Global Enablement and Learning here.
