Welcome back to the SAS Agentic AI Accelerator series! We’ve already cooked up LLM deployments with Docker and Azure’s managed services. Now, it’s time to turn up the heat with Kubernetes—the espresso machine of the cloud world. Sure, it has a few extra knobs and steam valves, but it gives you barista-level control.
If you crave fine-tuned control, serious scalability, and rock-solid HTTPS security, Kubernetes is your playground. Let’s roll up our sleeves and get an LLM running—with plenty of focus on keeping it secure and scalable. For simpler setups, Azure’s managed options work great, but for ultimate power and flexibility, Kubernetes is where magic happens!
In our example, we’ll securely deploy an LLM (the open-source Qwen2.5-0.5B model from Alibaba Cloud) behind an HTTPS endpoint on Kubernetes. Why HTTPS? Because you and your security officer will both sleep better at night.
You need a TLS certificate for HTTPS endpoints. Think of it as a VIP badge for secure web traffic. Here’s the concise version:
# Set up secrets directory
secrets_dir=~/project/deploy/models/secrets
mkdir -p "$secrets_dir" && cd "$secrets_dir"
# Variables
RG=resource_group
INGRESS_SAN="${RG}.gelenable.sas.com" # SAS Viya URL or LLM deployment DNS
GELEnvRootCA=my_folder # location of certificates and private key required for signing
# Generate private key and CSR
openssl req -newkey rsa:2048 -sha256 -nodes -keyout scr_key.pem -extensions v3_ca \
-config <(echo "[req]"; echo "distinguished_name=req"; echo "[v3_ca]"; \
echo "extendedKeyUsage=serverAuth"; \
echo "subjectAltName=DNS:${INGRESS_SAN}, DNS:*.${INGRESS_SAN}") \
-subj "/C=US/ST=NC/L=North Carolina/O=SAS/CN=${INGRESS_SAN}" \
-out scr_models.csr
# Sign CSR with Intermediate CA
# These options tell OpenSSL to use the Intermediate CA's certificate and private key to sign the new certificate, rather than creating a self-signed certificate.
echo "01" > scr_models.srl
openssl x509 -req -sha256 -extensions v3_ca \
-extfile <(echo "[v3_ca]"; echo "extendedKeyUsage=serverAuth"; \
echo "subjectAltName=DNS:${INGRESS_SAN}, DNS:*.${INGRESS_SAN}") \
-days 820 -in scr_models.csr \
-CA $GELEnvRootCA/intermediate.cert.pem \
-CAkey $GELEnvRootCA/intermediate.key.pem \
-CAserial scr_models.srl -out scr_cert.pem
# Append full certificate chain
cat $GELEnvRootCA/intermediate.cert.pem >> scr_cert.pem
cat $GELEnvRootCA/ca_cert.pem >> scr_cert.pem
# Remove temporary files
rm scr_models.*
# Optional: Review the certificate
openssl x509 -text -noout -in scr_cert.pem
# Trust the CA certificate system-wide (for cURL etc.)
sudo cp $GELEnvRootCA/ca_cert.pem /etc/pki/ca-trust/source/anchors/
sudo update-ca-trust
The above block assumes you have access to an intermediate CA's certificate and private key to sign the new certificate, rather than creating a self-signed one. For production, always use certificates signed by a trusted public Certificate Authority (CA), such as Let's Encrypt, DigiCert, or your organization's enterprise CA. This ensures secure, trusted, and verifiable connections for all clients.
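As an optional sanity check before loading the files into Kubernetes, you can verify that the signed certificate chains up correctly. A minimal sketch, assuming the same $GELEnvRootCA paths used above:
# Verify that scr_cert.pem chains to the root CA via the intermediate
openssl verify -CAfile "$GELEnvRootCA/ca_cert.pem" \
  -untrusted "$GELEnvRootCA/intermediate.cert.pem" scr_cert.pem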
That’s it; no need to get lost in a cryptographic jungle. I am simply reproducing a very reliable "TLS jungle trekking guide" produced by our SAS colleague, @MichaelGoddard. @StuartRogers is an authoritative source on TLS for SAS Viya and has plenty of trustworthy articles on SAS Communities.
Clear any coffee spills and set up a clean playground for your models:
kubectl delete ns llm
kubectl create ns llm
Large Language Models (LLMs) can be quite resource hungry. Open-source LLMs need lots of storage for model files, plus plenty of CPU and memory for processing. To keep everything running smoothly (and avoid stepping on other workloads’ toes), it’s best to give your LLMs their own dedicated node pool. Remember: choose the size of your node pool carefully, based on the specific LLMs you want to deploy and their technical requirements.
az aks nodepool add \
--resource-group $RG \
--cluster-name $AKS_NAME \
--name llmnp \
--node-count 1 \
--node-vm-size Standard_D16as_v5 \
--max-count 1 \
--min-count 0 \
--enable-cluster-autoscaler \
--node-taints workload=llm:NoSchedule \
--labels workload=llm node.kubernetes.io/name=llm workload/class=models
Check that your node is ready and properly labeled:
kubectl get nodes --show-labels
You should see labels like workload=llm and node.kubernetes.io/name=llm.
Think of these node labels as 'Reserved for LLMs' parking spots.
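If the full label dump is too noisy, you can narrow it down to the dedicated pool using the labels you just applied:
# List only the nodes in the LLM node pool and confirm the taint is in place
kubectl get nodes -l workload=llm
kubectl get nodes -l workload=llm -o jsonpath='{.items[*].spec.taints}'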
Load your certificate and key into Kubernetes as a secret:
kubectl -n llm create secret tls scr-certificate \
--key="scr_key.pem" \
--cert="scr_cert.pem"
# Check it’s there
kubectl -n llm get secrets
kubectl -n llm describe secret scr-certificate
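Optionally, you can also confirm that the certificate stored in the secret carries the subject alternative names you signed it for:
# Decode the certificate from the secret and print its SAN entries
kubectl -n llm get secret scr-certificate -o jsonpath='{.data.tls\.crt}' \
  | base64 -d | openssl x509 -noout -text | grep -A1 "Subject Alternative Name"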
Think of a pod as the smallest shipping box in Kubernetes. Inside that box is your running application, in our case, the containerized LLM model. The pod wraps it up with the resources, environment variables, and storage it needs. If the pod isn’t running, your LLM isn’t either.
A service is like the shipping label on the box. It makes sure traffic can find and reach your pod, even if the pod moves around, inside the cluster. In our YAML manifest, the service listens on port 443 (HTTPS) and forwards traffic to your LLM’s container, running inside the pod.
Ingress is the front desk or receptionist of your Kubernetes office building. It’s the entry point for outside traffic. Ingress decides which service gets what request, handles HTTPS/TLS, and acts as a secure gateway from the internet to your application.
# Variables
RG=resource_group
INGRESS_HOST=SAS_Viya_Ingress # DNS name covered by the TLS certificate (INGRESS_SAN above)
echo $INGRESS_HOST
az login
ACR_NAME=Your_Azure_Container_Registry
# LLM image must be stored here as a container image
az acr login --name $ACR_NAME
# LLM name
LLM=qwen_25_05b
LLMDASH=${LLM//_/-}
echo $LLM; echo $LLMDASH
# Create the deployment YAML file
tee ~/project/deploy/models/${LLMDASH}-tls-deployment.yaml > /dev/null <<EOF
# ${LLMDASH} model deployment
---
apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    app.kubernetes.io/name: ${LLMDASH}
    workload/class: models
  name: ${LLMDASH}
spec:
  # modify replicas to support the requirements
  replicas: 1
  selector:
    matchLabels:
      app.kubernetes.io/name: ${LLMDASH}
  template:
    metadata:
      labels:
        app: ${LLMDASH}
        app.kubernetes.io/name: ${LLMDASH}
        workload/class: models
    spec:
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
              - matchExpressions:
                  - key: kubernetes.azure.com/mode
                    operator: NotIn
                    values:
                      - system
                  - key: node.kubernetes.io/name
                    operator: In
                    values:
                      - llm
      containers:
        - name: ${LLMDASH}
          image: ${ACR_NAME}.azurecr.io/${LLM}:latest
          imagePullPolicy: Always # IfNotPresent or Always
          resources:
            requests: # Minimum amount of resources requested
              cpu: 1
              memory: 8Gi
            limits: # Maximum amount of resources allowed
              cpu: 4
              memory: 16Gi
          ports:
            - containerPort: 8080
              name: http # Name the port "http"
            - containerPort: 8443
              name: https # Name the port "https"
          env:
            - name: SAS_SCR_SSL_ENABLED
              value: "true"
            - name: SAS_SCR_SSL_CERTIFICATE
              value: /secrets/tls.crt
            - name: SAS_SCR_SSL_KEY
              value: /secrets/tls.key
            - name: SAS_SCR_LOG_LEVEL_SCR_IO
              value: TRACE
          volumeMounts:
            - name: tls
              mountPath: /secrets
      volumes:
        - name: tls
          secret:
            secretName: scr-certificate
            items: # Explicitly define the keys to mount
              - key: tls.crt
                path: tls.crt
              - key: tls.key
                path: tls.key
      tolerations:
        - key: workload/class
          operator: Equal
          value: models
          effect: NoSchedule
        - key: workload
          operator: Equal
          value: llm
          effect: NoSchedule
---
# TLS service definition
apiVersion: v1
kind: Service
metadata:
  name: ${LLMDASH}-tls-svc
  labels:
    app.kubernetes.io/name: ${LLMDASH}-tls-svc
spec:
  selector:
    app.kubernetes.io/name: ${LLMDASH}
    workload/class: models
  ports:
    - name: ${LLMDASH}-https
      port: 443
      protocol: TCP
      targetPort: 8080
  type: ClusterIP
---
# TLS ingress definition
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: ${LLMDASH}-ingress
  annotations:
    nginx.ingress.kubernetes.io/backend-protocol: HTTP
  labels:
    app.kubernetes.io/name: ${LLMDASH}-ingress
spec:
  ingressClassName: nginx
  tls:
    - hosts:
        - ${INGRESS_HOST}
      secretName: scr-certificate
  rules:
    - host: ${INGRESS_HOST}
      http:
        paths:
          - path: /${LLM}
            pathType: Prefix
            backend:
              service:
                name: ${LLMDASH}-tls-svc
                port:
                  number: 443
EOF
You’ll see three main sections in the YAML file. Here’s what each one does:
- Deployment: runs the containerized LLM on the dedicated node pool, with the resource requests and limits, TLS environment variables, and the mounted certificate secret.
- Service: exposes the model inside the cluster on port 443 and forwards traffic to the container in the pod.
- Ingress: terminates TLS and routes external requests for https://your-dns/qwen_25_05b to the service.
Deploy your model:
# Deploy
kubectl apply -f ~/project/deploy/models/${LLMDASH}-tls-deployment.yaml -n llm
# Wait for the pod to be ready (watch for the “Ready” status):
kubectl get pods -n llm
# Check logs
kubectl logs -n llm deployment/${LLMDASH}
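Before calling the endpoint from outside, it can also help to confirm that the service and ingress came up with the host and path you expect (names below assume LLMDASH=qwen-25-05b as set earlier):
# Confirm the service and ingress exist and show their routing details
kubectl -n llm get svc,ingress
kubectl -n llm describe ingress qwen-25-05b-ingress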
With everything live, you can send HTTPS requests to your Kubernetes ingress endpoint and watch your LLM do its magic.
curl --location --request POST "https://${INGRESS_HOST}/qwen_25_05b" \
  --header 'Content-Type: application/json' \
  --header 'Accept: application/vnd.sas.microanalytic.module.step.output+json' \
  --data-raw '{
    "inputs": [
      {"name":"userPrompt","value":"customer_name: Xin Little; loan_amount: 20000.0; customer_language: EN"},
      {"name":"systemPrompt","value":"You are tasked with drafting an email to respond to a customer whose mortgage loan application has been accepted by the SAS AI Bank. You will be provided with customer_name, loan_amount, customer_language. Follow the guidelines for a professional, friendly response."},
      {"name":"options","value":"{temperature:1,top_p:1,max_tokens:800}"}
    ]
  }' | jq
If you get a smart response, such as the following sample, congratulations! You’ve just deployed a secure, scalable LLM using Kubernetes.
Large Language Models (LLMs) are heavy weightlifters. They need generous CPU, memory, and storage, especially when running open-source versions. For best results, give LLMs their own dedicated node pool (or multiple node pools). This ensures your models won’t compete for resources with other workloads, keeping everything running smoothly.
When it comes to scaling, Kubernetes shines. You can adjust the number and size of nodes in your pool to match your workload. Just remember: the bigger the LLM, the beefier your node needs to be. Choose your node pool size based on the technical requirements of your models, don’t try to squeeze a heavyweight model into a tiny node!
For ultra-responsive performance, monitor CPU and memory usage and scale up as needed. And if you’re aiming for production-grade speed, keep an eye on response times as you adjust resources.
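As a rough starting point, here’s a minimal sketch of how you might watch usage and scale the deployment created above (it assumes metrics-server is running in the cluster and uses the qwen-25-05b names from this example):
# Watch CPU and memory usage for the model pods and the LLM node pool
kubectl top pods -n llm
kubectl top nodes -l workload=llm
# Scale out manually if a single replica can't keep up
kubectl -n llm scale deployment qwen-25-05b --replicas=2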
Security isn’t just an add-on—it’s essential. Always use HTTPS to protect data in transit. This means securing both your public endpoints and the internal traffic between your ingress, service, and pod. For extra peace of mind, forward traffic from the ingress to your pod over port 8443 (HTTPS), not just 8080 (HTTP).
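A minimal sketch of that change, assuming the resource names created by the manifest above: either edit the Service targetPort and the ingress backend-protocol annotation in the YAML, or patch the live objects directly.
# Point the service at the pod's HTTPS port (8443) instead of 8080
kubectl -n llm patch service qwen-25-05b-tls-svc --type merge \
  -p '{"spec":{"ports":[{"name":"qwen-25-05b-https","port":443,"protocol":"TCP","targetPort":8443}]}}'
# Tell the NGINX ingress controller to re-encrypt traffic to the backend
kubectl -n llm annotate ingress qwen-25-05b-ingress \
  nginx.ingress.kubernetes.io/backend-protocol=HTTPS --overwrite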
Make sure:
- the certificate's subject alternative names cover the DNS name used as the ingress host;
- the scr-certificate secret lives in the same namespace (llm) as the deployment and ingress;
- clients trust the CA that signed the certificate (as we did earlier with update-ca-trust for cURL).
Deploying LLMs in Kubernetes gives you flexibility, scalability, and strong security, provided you set things up right. With these best practices in place, your LLMs will run smoothly and securely, ready for whatever comes next.
And remember: in the world of Kubernetes, a little resource planning goes a long way. Happy deploying!
Thanks for following along! If you find this post helpful, give it a thumbs up, share your stories or questions in the comments, and let’s keep building better AI workflows together. Stay tuned for more!
Thanks to @MichaelGoddard for sharing his time and resources.
SAS offers a full workshop in the SAS Decisioning Learning Subscription with step-by-step exercises for deploying and scoring models using Agentic AI and SAS Viya on Azure.
Access it on learn.sas.com. The workshop provides step-by-step guidance and a bookable environment for creating agentic AI workflows.
For further guidance, reach out for assistance.
Find more articles from SAS Global Enablement and Learning here.
@Bogdan_Teleuca thank you for explaining complex things in a simple manner. For the extra peace of mind, you advised using secure backend communication; should one update the service like this?
- name: ${LLMDASH}-https
  port: 443        # Service port exposed inside the cluster
  protocol: TCP
  targetPort: 8443 # <-- forward to the pod's HTTPS port
I was actually using port 8080 inside the cluster, but I will consider changing it.
@touwen_k that's the ultra-secure version. Try it. Working on 8080 inside the cluster should be OK for most use cases. If you have ultra-sensitive data and you don't want users with cluster access to potentially snoop on it, working with 8443 is the way to go. Bear in mind that when you open 8443, 8080 stays open; the documentation explains how to close it.