Welcome back to the SAS Agentic AI Accelerator series! We’ve already cooked up LLM deployments with Docker and Azure’s managed services. Now, it’s time to turn up the heat with Kubernetes—the espresso machine of the cloud world. Sure, it has a few extra knobs and steam valves, but it gives you barista-level control.
If you crave fine-tuned control, serious scalability, and rock-solid HTTPS security, Kubernetes is your playground. Let’s roll up our sleeves and get an LLM running—with plenty of focus on keeping it secure and scalable. For simpler setups, Azure’s managed options work great, but for ultimate power and flexibility, Kubernetes is where magic happens!
In our example, we’ll securely deploy an LLM (the open-source Qwen2.5-0.5B model from Alibaba Cloud) behind an HTTPS endpoint on Kubernetes. Why HTTPS? Because you and your security officer will both sleep better at night.
You need a TLS certificate for HTTPS endpoints. Think of it as a VIP badge for secure web traffic. Here’s the concise version:
# Set up secrets directory
secrets_dir=~/project/deploy/models/secrets
mkdir -p "$secrets_dir" && cd "$secrets_dir"
# Variables
RG=resource_group
INGRESS_SAN="${RG}.gelenable.sas.com" # SAS Viya URL or LLM deployment DNS
GELEnvRootCA=my_folder # location of certificates and private key required for signing
# Generate private key and CSR
openssl req -newkey rsa:2048 -sha256 -nodes -keyout scr_key.pem -extensions v3_ca \
-config <(echo "[req]"; echo "distinguished_name=req"; echo "[v3_ca]"; \
echo "extendedKeyUsage=serverAuth"; \
echo "subjectAltName=DNS:${INGRESS_SAN}, DNS:*.${INGRESS_SAN}") \
-subj "/C=US/ST=NC/L=North Carolina/O=SAS/CN=${INGRESS_SAN}" \
-out scr_models.csr
# Sign CSR with Intermediate CA
# These options tell OpenSSL to use the Intermediate CA's certificate and private key to sign the new certificate, rather than creating a self-signed certificate.
echo "01" > scr_models.srl
openssl x509 -req -sha256 -extensions v3_ca \
-extfile <(echo "[v3_ca]"; echo "extendedKeyUsage=serverAuth"; \
echo "subjectAltName=DNS:${INGRESS_SAN}, DNS:*.${INGRESS_SAN}") \
-days 820 -in scr_models.csr \
-CA $GELEnvRootCA/intermediate.cert.pem \
-CAkey $GELEnvRootCA/intermediate.key.pem \
-CAserial scr_models.srl -out scr_cert.pem
# Append full certificate chain
cat $GELEnvRootCA/intermediate.cert.pem >> scr_cert.pem
cat $GELEnvRootCA/ca_cert.pem >> scr_cert.pem
# Remove temporary files
rm scr_models.*
# Optional: Review the certificate
openssl x509 -text -noout -in scr_cert.pem
# Trust the CA certificate system-wide (for cURL etc.)
sudo cp $GELEnvRootCA/ca_cert.pem /etc/pki/ca-trust/source/anchors/
sudo update-ca-trust
The above block assumes you have access to an intermediate CA's certificate and private key to sign the new certificate, rather than creating a self-signed one. For production, always use certificates signed by a trusted public Certificate Authority (CA), such as Let's Encrypt, DigiCert, or your organization's enterprise CA. This ensures secure, trusted, and verifiable connections for all clients.
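As an optional sanity check before loading the files into Kubernetes, you can verify that the signed certificate chains up correctly. A minimal sketch, assuming the same $GELEnvRootCA paths used above:
# Verify that scr_cert.pem chains to the root CA via the intermediate
openssl verify -CAfile "$GELEnvRootCA/ca_cert.pem" \
  -untrusted "$GELEnvRootCA/intermediate.cert.pem" scr_cert.pem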
That’s it; no need to get lost in a cryptographic jungle. I am simply reproducing a very reliable "TLS jungle trekking guide" produced by our SAS colleague, @MichaelGoddard. @StuartRogers is an authoritative source on TLS for SAS Viya and has plenty of trustworthy articles on SAS Communities.
Clear any coffee spills and set up a clean playground for your models:
kubectl delete ns llm
kubectl create ns llm
Large Language Models (LLMs) can be quite resource hungry. Open-source LLMs need lots of storage for model files, plus plenty of CPU and memory for processing. To keep everything running smoothly (and avoid stepping on other workloads’ toes), it’s best to give your LLMs their own dedicated node pool. Remember: choose the size of your node pool carefully, based on the specific LLMs you want to deploy and their technical requirements.
az aks nodepool add \
--resource-group $RG \
--cluster-name $AKS_NAME \
--name llmnp \
--node-count 1 \
--node-vm-size Standard_D16as_v5 \
--max-count 1 \
--min-count 0 \
--enable-cluster-autoscaler \
--node-taints workload=llm:NoSchedule \
--labels workload=llm node.kubernetes.io/name=llm workload/class=models
Check that your node is ready and properly labeled:
kubectl get nodes --show-labels
You should see labels like workload=llm and node.kubernetes.io/name=llm.
Think of these node labels as 'Reserved for LLMs' parking spots.
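If the full label dump is too noisy, you can narrow it down to the dedicated pool using the labels you just applied:
# List only the nodes in the LLM node pool and confirm the taint is in place
kubectl get nodes -l workload=llm
kubectl get nodes -l workload=llm -o jsonpath='{.items[*].spec.taints}'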
Load your certificate and key into Kubernetes as a secret:
kubectl -n llm create secret tls scr-certificate \
--key="scr_key.pem" \
--cert="scr_cert.pem"
# Check it’s there
kubectl -n llm get secrets
kubectl -n llm describe secret scr-certificate
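Optionally, you can also confirm that the certificate stored in the secret carries the subject alternative names you signed it for:
# Decode the certificate from the secret and print its SAN entries
kubectl -n llm get secret scr-certificate -o jsonpath='{.data.tls\.crt}' \
  | base64 -d | openssl x509 -noout -text | grep -A1 "Subject Alternative Name"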
Think of a pod as the smallest shipping box in Kubernetes. Inside that box is your running application, in our case, the containerized LLM model. The pod wraps it up with the resources, environment variables, and storage it needs. If the pod isn’t running, your LLM isn’t either.
A service is like the shipping label on the box. It makes sure traffic can find and reach your pod, even if the pod moves around, inside the cluster. In our YAML manifest, the service listens on port 443 (HTTPS) and forwards traffic to your LLM’s container, running inside the pod.
Ingress is the front desk or receptionist of your Kubernetes office building. It’s the entry point for outside traffic. Ingress decides which service gets what request, handles HTTPS/TLS, and acts as a secure gateway from the internet to your application.
# Variables
RG=resource_group
INGRESS_HOST=SAS_Viya_Ingress # DNS name covered by the TLS certificate (INGRESS_SAN above)
echo $INGRESS_HOST
az login
ACR_NAME=Your_Azure_Container_Registry
# LLM image must be stored here as a container image
az acr login --name $ACR_NAME
# LLM name
LLM=qwen_25_05b
LLMDASH=${LLM//_/-}
echo $LLM; echo $LLMDASH
# Create the deployment YAML file
tee ~/project/deploy/models/${LLMDASH}-tls-deployment.yaml > /dev/null <<EOF
# ${LLMDASH} model deployment
---
apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    app.kubernetes.io/name: ${LLMDASH}
    workload/class: models
  name: ${LLMDASH}
spec:
  # modify replicas to support the requirements
  replicas: 1
  selector:
    matchLabels:
      app.kubernetes.io/name: ${LLMDASH}
  template:
    metadata:
      labels:
        app: ${LLMDASH}
        app.kubernetes.io/name: ${LLMDASH}
        workload/class: models
    spec:
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
              - matchExpressions:
                  - key: kubernetes.azure.com/mode
                    operator: NotIn
                    values:
                      - system
                  - key: node.kubernetes.io/name
                    operator: In
                    values:
                      - llm
      containers:
        - name: ${LLMDASH}
          image: ${ACR_NAME}.azurecr.io/${LLM}:latest
          imagePullPolicy: Always # IfNotPresent or Always
          resources:
            requests: # Minimum amount of resources requested
              cpu: 1
              memory: 8Gi
            limits: # Maximum amount of resources allowed
              cpu: 4
              memory: 16Gi
          ports:
            - containerPort: 8080
              name: http # Name the port "http"
            - containerPort: 8443
              name: https # Name the port "https"
          env:
            - name: SAS_SCR_SSL_ENABLED
              value: "true"
            - name: SAS_SCR_SSL_CERTIFICATE
              value: /secrets/tls.crt
            - name: SAS_SCR_SSL_KEY
              value: /secrets/tls.key
            - name: SAS_SCR_LOG_LEVEL_SCR_IO
              value: TRACE
          volumeMounts:
            - name: tls
              mountPath: /secrets
      volumes:
        - name: tls
          secret:
            secretName: scr-certificate
            items: # Explicitly define the keys to mount
              - key: tls.crt
                path: tls.crt
              - key: tls.key
                path: tls.key
      tolerations:
        - key: workload/class
          operator: Equal
          value: models
          effect: NoSchedule
        - key: workload
          operator: Equal
          value: llm
          effect: NoSchedule
---
# TLS service definition
apiVersion: v1
kind: Service
metadata:
  name: ${LLMDASH}-tls-svc
  labels:
    app.kubernetes.io/name: ${LLMDASH}-tls-svc
spec:
  selector:
    app.kubernetes.io/name: ${LLMDASH}
    workload/class: models
  ports:
    - name: ${LLMDASH}-https
      port: 443
      protocol: TCP
      targetPort: 8080
  type: ClusterIP
---
# TLS ingress definition
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: ${LLMDASH}-ingress
  annotations:
    nginx.ingress.kubernetes.io/backend-protocol: HTTP
  labels:
    app.kubernetes.io/name: ${LLMDASH}-ingress
spec:
  ingressClassName: nginx
  tls:
    - hosts:
        - ${INGRESS_HOST}
      secretName: scr-certificate
  rules:
    - host: ${INGRESS_HOST}
      http:
        paths:
          - path: /${LLM}
            pathType: Prefix
            backend:
              service:
                name: ${LLMDASH}-tls-svc
                port:
                  number: 443
EOF
You’ll see three main sections in the YAML file. Here’s what each one does:
- Deployment: runs the containerized LLM on the dedicated node pool, with the resource requests and limits, TLS environment variables, and the mounted certificate secret.
- Service: exposes the model inside the cluster on port 443 and forwards traffic to the container in the pod.
- Ingress: terminates TLS and routes external requests for https://your-dns/qwen_25_05b to the service.
Deploy your model:
# Deploy
kubectl apply -f ~/project/deploy/models/${LLMDASH}-tls-deployment.yaml -n llm
# Wait for the pod to be ready (watch for the “Ready” status):
kubectl get pods -n llm
# Check logs
kubectl logs -n llm deployment/${LLMDASH}
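Before calling the endpoint from outside, it can also help to confirm that the service and ingress came up with the host and path you expect (names below assume LLMDASH=qwen-25-05b as set earlier):
# Confirm the service and ingress exist and show their routing details
kubectl -n llm get svc,ingress
kubectl -n llm describe ingress qwen-25-05b-ingress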
With everything live, you can send HTTPS requests to your Kubernetes ingress endpoint and watch your LLM do its magic.
curl --location --request POST "https://${INGRESS_HOST}/qwen_25_05b" \
  --header 'Content-Type: application/json' \
  --header 'Accept: application/vnd.sas.microanalytic.module.step.output+json' \
  --data-raw '{
    "inputs": [
      {"name":"userPrompt","value":"customer_name: Xin Little; loan_amount: 20000.0; customer_language: EN"},
      {"name":"systemPrompt","value":"You are tasked with drafting an email to respond to a customer whose mortgage loan application has been accepted by the SAS AI Bank. You will be provided with customer_name, loan_amount, customer_language. Follow the guidelines for a professional, friendly response."},
      {"name":"options","value":"{temperature:1,top_p:1,max_tokens:800}"}
    ]
  }' | jq
If you get a smart response, such as the following sample, congratulations! You’ve just deployed a secure, scalable LLM using Kubernetes.
Large Language Models (LLMs) are heavy weightlifters. They need generous CPU, memory, and storage, especially when running open-source versions. For best results, give LLMs their own dedicated node pool (or multiple node pools). This ensures your models won’t compete for resources with other workloads, keeping everything running smoothly.
When it comes to scaling, Kubernetes shines. You can adjust the number and size of nodes in your pool to match your workload. Just remember: the bigger the LLM, the beefier your node needs to be. Choose your node pool size based on the technical requirements of your models, don’t try to squeeze a heavyweight model into a tiny node!
For ultra-responsive performance, monitor CPU and memory usage and scale up as needed. And if you’re aiming for production-grade speed, keep an eye on response times as you adjust resources.
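As a rough starting point, here’s a minimal sketch of how you might watch usage and scale the deployment created above (it assumes metrics-server is running in the cluster and uses the qwen-25-05b names from this example):
# Watch CPU and memory usage for the model pods and the LLM node pool
kubectl top pods -n llm
kubectl top nodes -l workload=llm
# Scale out manually if a single replica can't keep up
kubectl -n llm scale deployment qwen-25-05b --replicas=2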
Security isn’t just an add-on—it’s essential. Always use HTTPS to protect data in transit. This means securing both your public endpoints and the internal traffic between your ingress, service, and pod. For extra peace of mind, forward traffic from the ingress to your pod over port 8443 (HTTPS), not just 8080 (HTTP).
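A minimal sketch of that change, assuming the resource names created by the manifest above: either edit the Service targetPort and the ingress backend-protocol annotation in the YAML, or patch the live objects directly.
# Point the service at the pod's HTTPS port (8443) instead of 8080
kubectl -n llm patch service qwen-25-05b-tls-svc --type merge \
  -p '{"spec":{"ports":[{"name":"qwen-25-05b-https","port":443,"protocol":"TCP","targetPort":8443}]}}'
# Tell the NGINX ingress controller to re-encrypt traffic to the backend
kubectl -n llm annotate ingress qwen-25-05b-ingress \
  nginx.ingress.kubernetes.io/backend-protocol=HTTPS --overwrite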
Make sure:
- the certificate's subject alternative names cover the DNS name used as the ingress host;
- the scr-certificate secret lives in the same namespace (llm) as the deployment and ingress;
- clients trust the CA that signed the certificate (as we did earlier with update-ca-trust for cURL).
Deploying LLMs in Kubernetes gives you flexibility, scalability, and strong security, provided you set things up right. With these best practices in place, your LLMs will run smoothly and securely, ready for whatever comes next.
And remember: in the world of Kubernetes, a little resource planning goes a long way. Happy deploying!
Thanks for following along! If you find this post helpful, give it a thumbs up, share your stories or questions in the comments, and let’s keep building better AI workflows together. Stay tuned for more!
Thanks to @MichaelGoddard for sharing his time and resources.
SAS offers a full workshop in the SAS Decisioning Learning Subscription with step-by-step exercises for deploying and scoring models using Agentic AI and SAS Viya on Azure.
Access it on learn.sas.com. The workshop provides step-by-step guidance and a bookable environment for creating agentic AI workflows.
For further guidance, reach out for assistance.
Find more articles from SAS Global Enablement and Learning here.
@Bogdan_Teleuca thank you for explaining complex things in a simple manner. For the extra peace of mind, you advised using secure backend communication; should one update the service like this?
- name: ${LLMDASH}-https
  port: 443        # Service port exposed inside the cluster
  protocol: TCP
  targetPort: 8443 # <-- forward to the pod's HTTPS port
I was actually using port 8080 inside the cluster, but I will consider changing it.
@touwen_k that's the ultra-secure version. Try it. Working on 8080 inside the cluster should be OK for most use cases. If you have ultra-sensitive data and you don't want users with cluster access to potentially snoop on it, working with 8443 is the way to go. Bear in mind that when you open 8443, 8080 stays open; the documentation explains how to close it.