
SAS Agentic AI – Deploy and Score Models – The Big Picture


Welcome back to the SAS Agentic AI Accelerator series! Today we’ll explore how to deploy and score code-wrapped Large Language Models (LLMs) in Azure, then call them from Agentic AI workflows inside SAS Viya.

 

To keep things clear, the topic is split into two parts:

  1. The Big Picture – a high-level overview with a short video and comparison tables that help you choose a deployment method. Azure is our example cloud.
  2. The Nitty-Gritty – a hands-on guide with deployment and scoring scripts.

 

 

Where We Are In The Series

 

In Part 1, Register and Publish Models, we introduced code-wrapped LLMs and showed how to publish them with the SAS Container Runtime (SCR). The end result was Docker images in a container registry.

Part 2 — this post — covers the deployment options.

 

 

Deployment and Scoring Overview

 

After registering and publishing an LLM code wrapper, you can deploy it as a Docker image in various environments. Once deployed, you can score using the SAS Container Runtime API.
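Before we compare the options, here’s what a scoring call looks like once a container is running. This is a minimal sketch, assuming the standard SCR JSON scoring interface; the host, module ID, and input variable name are placeholders, and your code wrapper defines the real variable names (check the container’s generated API metadata).

```python
import requests

# Placeholders: substitute the endpoint of your deployed container and the
# module ID assigned when the model was registered and published (Part 1).
SCR_HOST = "http://<container-host>:8080"   # ACI / Web App / ingress address
MODULE_ID = "phi3wrapper"                   # hypothetical module ID

# SCR modules score via a JSON POST of named input variables. The variable
# name "promptText" is an assumption; use the names your code wrapper defines.
payload = {"inputs": [{"name": "promptText",
                       "value": "Summarize this claim in one sentence."}]}

response = requests.post(f"{SCR_HOST}/{MODULE_ID}", json=payload, timeout=120)
response.raise_for_status()
print(response.json())   # output variables defined by the code wrapper
```

The same call works regardless of where the container runs; only the host and the transport (HTTP vs. HTTPS) change between the deployment options below.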

 


 

 

Deployment Options

 

Here’s a quick overview of deployment options in the Azure cloud. This isn’t an exhaustive list; it reflects only what I tested:

 

| Deployment Option | Use Case | Scalability | Ease of Setup | Security |
|---|---|---|---|---|
| Azure Container Instances | Lightweight, quick starts | Low | Simple | Public or private IP (HTTP only) |
| Azure Container Apps | Event-driven, auto-scaling | Medium | Managed | Public IP (HTTPS) |
| Azure Web Apps | Managed container hosting | Medium | Managed | Public IP (HTTPS) |
| Kubernetes Pods | Large-scale, fully orchestrated | High | Complex (requires YAML, must manage node resources) | Flexible: private/public IP (HTTP or HTTPS) |
| Containers on Virtual Machines | Legacy or custom configurations | Medium | Moderate | Flexible (private/public) |
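To make the first option concrete, here’s a minimal sketch of deploying the published SCR image to Azure Container Instances by driving the Azure CLI from Python. The resource group, registry, image, and credential values are placeholders, not the names from my environment, and port 8080 is the assumed scoring port.

```python
import subprocess

# Placeholders: substitute your own resource group, registry, and image.
# The image is the one pushed to the container registry in Part 1.
RESOURCE_GROUP = "my-rg"
ACR = "myregistry.azurecr.io"
IMAGE = f"{ACR}/phi-3-mini-4k:latest"

# 4 vCPUs / 16 GB is the ACI ceiling noted above, and what small LLMs need.
subprocess.run(
    [
        "az", "container", "create",
        "--resource-group", RESOURCE_GROUP,
        "--name", "phi3-scr",
        "--image", IMAGE,
        "--os-type", "Linux",
        "--cpu", "4",
        "--memory", "16",
        "--ports", "8080",              # scoring port (8080 assumed; match your image)
        "--ip-address", "Public",       # HTTP only; see the security column above
        "--registry-login-server", ACR,
        "--registry-username", "<acr-username>",
        "--registry-password", "<acr-password>",
    ],
    check=True,
)
```

Once the container group reports a public IP, that IP plus the port becomes the `SCR_HOST` in the scoring sketch shown earlier.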

 

 

Key Considerations for Each Deployment Option

 

As you can see, Azure offers many options. The following table should help you choose the one that fits your needs.

 

| Deployment Option | Feasibility | Advantages | Limitations |
|---|---|---|---|
| Azure Container Instances | Ideal for small open-source LLMs like phi-3-mini-4k (4 vCPUs, 16 GB RAM). | Easiest to launch. | Limited to 4 vCPUs, 16 GB RAM. |
| Azure Container Apps | Middle ground between Container Instances and Kubernetes clusters. | Built-in HTTPS ingress and auto-scaling. | Capped at 2 vCPUs and 8 GB RAM; may cause out-of-memory errors. |
| Azure Web Apps | Simple to deploy and scale; works for lighter workloads. | Supports deployment slots. | Resource limits can bottleneck performance; adding resources doesn’t always improve it. |
| Kubernetes Pods | Production-grade option for complex, large-scale workloads. | Fine-grained control over resources, scaling, and isolation. | Requires Kubernetes skills (a point customers raise again and again). |
| Containers on Virtual Machines | Highly flexible for legacy systems or custom configurations. | Complete control over CPU, RAM, and disk. | Higher cost and operational effort. |

 

 

Pricing Comparison

 

Any deployment choice in the cloud comes at a cost.

 

To help you evaluate the cost of each option, the table below summarizes typical daily costs based on Azure pricing estimates. Actual values vary by region, container size, and configuration. The estimates assume low, infrequent traffic: a few requests per day from your Agentic AI workflow to the deployed LLM.

 

| Deployment Option | Estimated Daily Cost | Details |
|---|---|---|
| Azure Container Instances | ~$3–$10/day | Depends on CPU and memory allocation (e.g., 2 vCPUs and 8 GB of memory). |
| Azure Container Apps | ~$5–$12/day | Includes management costs, ingress, and scalability features. |
| Azure Web Apps | ~$8–$15/day | Managed-service costs include app hosting and container runtime fees. |
| Kubernetes Pods | ~$10–$20/day | Varies with cluster size, node configuration, and resource requirements. |
| Containers on Virtual Machines | ~$15–$25/day | Includes VM hosting fees, container runtime costs, and storage costs for legacy systems. |

 

 

Observed Price Comparison

 

I ran my experiments for one week in one of our own Azure tenants. Here’s the actual daily cost for a phi-3-mini-4k LLM deployed in Azure Container Instances, Container Apps, Web Apps, and Azure Kubernetes Service, alongside the observed response time in seconds.

 

| Deployment Type | Cost Components | Estimated Cost Per Day (USD) | Response Time (sec) |
|---|---|---|---|
| Container Instances | Compute costs | $5.60 | 48.31 |
| Container Apps | Base pricing | $3.22 | 45 |
| App Service Plans (Web Apps) | Premium v3 P1mv3 | $4.32 | 90 |
| | Premium v3 P3mv3 | $18.48 | 25 |
| | Premium v3 P4mv3 | $35.52 | 15 |
| | Premium v3 P5mv3 | $71.24 | 10 |
| Kubernetes Deployment (extra node) | Compute costs (Standard_D4as_v5, 4 vCPUs, 16 GB) | ~$14.40 | 42 |
| | Disk costs | $2.16 | |
| | Total (compute + disk) | ~$16.56 | 42 |
| | Compute costs (Standard_D16as_v5, 16 vCPUs, 64 GB) | ~$57.60 | 43 |
| | Disk costs | $2.16 | |
| | Total (compute + disk) | ~$59.76 | 43 |

 

Findings:

 

  1. Container Instances: Listed first due to its simplicity and lower cost for isolated deployments:
    1. For the small open-source LLMs qwen-25-05-b and phi-3-mini-4k, we found that 4 vCPUs and 16 GB of memory are more appropriate. And these are small models.
    2. A larger open-source LLM, such as LLaMA 2-7B (Large Language Model Meta AI), may require 16–32 GB of RAM, 4–8 vCPUs, and plenty of disk space (20–50 GB). You could deploy it on 4 vCPUs and 16 GB of RAM, but the model would be severely constrained and the response time very high, if you get a response at all.
  2. Container Apps: Second, as it offers scalable, event-driven microservices at a competitive cost. The scalability feature is interesting: the app can spin up more replicas to handle concurrent scoring requests. Built-in ingress (HTTPS endpoints) and auto-scaling make this an excellent choice for lightweight open-source model deployments.
  3. Web Apps: Listed next, reflecting managed hosting options with varying performance tiers. Choose configurations that allow at least 4 vCPUs and 16 GB of RAM. We scaled up the App Service Plan gradually; as you can observe, throwing more resources at a model doesn’t proportionally reduce the response time. There’s a fine balance between cost and performance, and you can only find it by experimenting.
  4. Kubernetes Deployment: Last, as it is best suited for complex workflows requiring high scalability and orchestration. For the phi-3-mini-4k LLM we added a dedicated node and deployed the container in a pod (see the sketch after this list). Scaling up the node size didn’t seem to influence the response time; perhaps other parameters, such as disk type and IOPS, need fine-tuning. More work is needed.
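For reference, here’s a minimal sketch of an equivalent Kubernetes Deployment, written with the official kubernetes Python client to keep the examples in one language; in practice you’d typically apply the same thing as a YAML manifest with kubectl. The image, namespace, node label, and resource figures are placeholders, not my exact test configuration.

```python
from kubernetes import client, config

config.load_kube_config()  # local kubeconfig (e.g., from `az aks get-credentials`)

# Placeholder values; substitute your own image, namespace, and node label.
container = client.V1Container(
    name="phi3-scr",
    image="myregistry.azurecr.io/phi-3-mini-4k:latest",
    ports=[client.V1ContainerPort(container_port=8080)],
    resources=client.V1ResourceRequirements(
        requests={"cpu": "4", "memory": "16Gi"},   # sized like the D4as_v5 node above
        limits={"cpu": "4", "memory": "16Gi"},
    ),
)
template = client.V1PodTemplateSpec(
    metadata=client.V1ObjectMeta(labels={"app": "phi3-scr"}),
    spec=client.V1PodSpec(
        containers=[container],
        node_selector={"workload": "llm"},  # hypothetical label for the dedicated node
    ),
)
deployment = client.V1Deployment(
    metadata=client.V1ObjectMeta(name="phi3-scr"),
    spec=client.V1DeploymentSpec(
        replicas=1,
        selector=client.V1LabelSelector(match_labels={"app": "phi3-scr"}),
        template=template,
    ),
)
client.AppsV1Api().create_namespaced_deployment(namespace="default", body=deployment)
```

Exposing the pod then takes a Service (and optionally an ingress), which is where the HTTP/HTTPS flexibility noted in the options table comes from.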

 

Discussion

 

Some customers avoid proprietary LLMs from OpenAI, Google, or Azure because their data would leave their premises (or their cloud). They ask for a way to use open-source, on-premises LLMs.

 

After a month of testing, I’ve learned that the contest isn’t an even one:

 

  • Proprietary cloud LLMs almost always win on cost, latency, and accuracy.
  • Self-hosting shifts all compute costs to you, so each request costs more and takes longer.
  • High latency limits daily throughput, pushing the per-request price even higher.
  • That premium is the trade-off for keeping data inside your own walls.

 

 

Summary

 

The SAS Agentic AI Accelerator lets you deploy code-wrapped LLMs almost anywhere: Azure services, Kubernetes, or standalone VMs. Use the tables above to balance cost, performance, and operational effort.

 

Stay tuned for Part 3, where we’ll dig into deployment scripts, scoring calls, and security tips.

 

 

Acknowledgements

 

Thanks to Mike Goddard (@MichaelGoddard) for guidance on SAS Container Runtime Kubernetes deployments.

 

 

Additional Resources

 

 

Workshop Environment

 

The Agentic AI – How to with SAS Viya workshop is now available on learn.sas.com to SAS customers in the SAS Decisioning Learning Subscription and to SAS employees. The workshop provides step-by-step guidance and a bookable environment for creating agentic AI workflows.


 

 

If you liked the post, give it a thumbs up! Please comment and tell us what you think, and reach out if you need further guidance. Let us know how this solution works for you!

 

 

Find more articles from SAS Global Enablement and Learning here.
