10-14-2024
MichaelGoddard
SAS Employee
Member since
05-07-2017
- 24 Posts
- 1 Likes Given
- 0 Solutions
- 0 Likes Received
About
Michael is a Principal Technical Architect in the Global Enablement and Learning Team, SAS R&D. His primary focus is on architecture and deployment of the SAS platform.
Latest posts by MichaelGoddard
- 1413 views, posted 10-13-2024 09:27 PM
- 1534 views, posted 06-09-2024 07:23 PM
- 1835 views, posted 05-13-2024 04:53 PM
- 1381 views, posted 04-15-2024 12:40 AM
- 1511 views, posted 02-28-2024 05:48 PM
- 1218 views, posted 12-18-2022 06:00 PM
- 1550 views, posted 12-18-2022 05:12 PM
- 4403 views, posted 10-20-2022 06:53 PM
- 1801 views, posted 10-19-2022 06:38 PM
- 1775 views, posted 08-04-2022 06:13 PM
Activity Feed for MichaelGoddard
- Tagged Running SingleStore Studio within the SAS Viya namespace on SAS Communities Library. 10-13-2024 09:29 PM
- Posted Running SingleStore Studio within the SAS Viya namespace on SAS Communities Library. 10-13-2024 09:27 PM
- Tagged Running SAS Viya on a shared Kubernetes cluster – Part 2 on SAS Communities Library. 06-09-2024 07:25 PM
- Posted Running SAS Viya on a shared Kubernetes cluster – Part 2 on SAS Communities Library. 06-09-2024 07:23 PM
- Posted Re: Creating custom SAS Viya topologies – realizing the workload placement plan on SAS Communities Library. 05-13-2024 04:53 PM
- Tagged Running SAS Viya on a shared Kubernetes cluster – Part 1 on SAS Communities Library. 04-15-2024 12:41 AM
- Posted Running SAS Viya on a shared Kubernetes cluster – Part 1 on SAS Communities Library. 04-15-2024 12:40 AM
- Liked Using NFS Premium shares in Azure Files for SAS Viya on Kubernetes for AbhilashPA. 03-03-2024 03:12 PM
- Tagged SAS Viya topologies: sharing a node pool for Compute and CAS on SAS Communities Library. 02-28-2024 05:50 PM
- Tagged SAS Viya topologies: sharing a node pool for Compute and CAS on SAS Communities Library. 02-28-2024 05:49 PM
- Posted SAS Viya topologies: sharing a node pool for Compute and CAS on SAS Communities Library. 02-28-2024 05:48 PM
- Tagged Deploying SAS Container Runtime models on Azure Container Instances on SAS Communities Library. 12-18-2022 06:01 PM
- Posted Deploying SAS Container Runtime models on Azure Container Instances on SAS Communities Library. 12-18-2022 06:00 PM
- Posted Exploring the configuration: using Python with SAS Analytics Pro on SAS Communities Library. 12-18-2022 05:12 PM
- Tagged Exploring the configuration: using Python with SAS Analytics Pro on SAS Communities Library. 12-18-2022 05:12 PM
- Posted Creating model publishing destinations using the SAS Viya CLI on SAS Communities Library. 10-20-2022 06:53 PM
My Library Contributions
10-13-2024
09:27 PM
In this post we will look at running SingleStore Studio in the Kubernetes cluster, within the SAS Viya namespace. As some context, I’m talking about SAS with SingleStore deployments (orders). The SingleStore tools and SingleStore Studio are not shipped as part of the SAS order, and are usually installed on a machine external to the Kubernetes cluster.
There are many benefits to running SingleStore Studio (S2 Studio) within the SAS Viya namespace, but there are also some challenges; namely, SingleStore do not provide a standalone container image for deploying SingleStore Studio. Note, the SingleStore documentation also uses the term SingleStoreDB Studio.
Here we will look at creating a container image to run the SingleStore Client (command-line) and S2 Studio, and deploying it to the SAS Viya namespace.
I would like to start by saying that SingleStore do provide an image containing S2 Studio: the ‘singlestore/cluster-in-a-box’ image. As the name suggests, this image contains a complete environment and is targeted at developers.
SingleStore have several images on Docker Hub, see: https://hub.docker.com/u/singlestore. But they do not provide an image for just running S2 Studio, nor do SAS include this as part of the SAS Viya with SingleStore order.
As some background, with a SAS with SingleStore order all the SingleStore components (the memsql cluster) run within the Viya namespace.
Why run SingleStore Studio within the Viya namespace?
Let’s start by discussing the benefits of running S2 Studio on Kubernetes, within the Viya namespace.
The key benefits of running S2 Studio in the Kubernetes cluster are simplified networking and security, as the S2 Studio server application is connecting directly to the SingleStore services running within the SAS Viya namespace.
However, for a secure connection to the SingleStore cluster a WebSocket Proxy implementation is used. This means that a direct connection from the user’s browser to the backend is required. I will talk more about this in a follow-up post on enabling TLS security for the S2 Studio application.
The SingleStore documentation states the following:
“For situations where REQUIRE SSL is not mandatory, and if the additional configuration required to use a direct WebSocket connection becomes a bottleneck, it may be simpler to use the existing Studio architecture, where Studio is served over HTTPS and the singlestoredb-studio server is co-located with the Master Aggregator.”
The REQUIRE SSL attribute is a memsql user setting.
Therefore, running the singlestoredb-studio server within the Viya namespace effectively collocates it with the memsql cluster (the Master Aggregator). The communication over port 3306 (which is unencrypted) is contained within the Kubernetes cluster and is not exposed to the outside world.
The SingleStoreDB Studio Architecture page also states that multiple S2 Studio instances can communicate with an individual cluster, so you can easily scale out S2 Studio by creating new instances to manage user load. Running S2 Studio as a Kubernetes deployment is therefore another advantage of running it on Kubernetes, rather than installing it on a host machine outside of the K8s cluster.
Building the container image
To run S2 Studio on Kubernetes you first need to build a container image. For this you need to select a base image that contains the packages S2 Studio needs to run. This became a process of research (looking at what SingleStore were using for their images) and trial and error. The CentOS image works well and contains utilities like systemctl, but the resulting image is very large at over 600MB.
In the end I settled on almalinux/8-init as my base. The nice thing about this and the CentOS image is that they allow the standard install process for the SingleStore Client (CLI) and Studio to be used to build the container image.
Remember, when selecting an OS image for the container build it is important to do due diligence on the security of that image: can it be trusted?
You must create your own Docker build file (Dockerfile); the following image shows my build file. As mentioned above, I decided to build an image that contained the SingleStore CLI and Studio.
In the image, lines 3 to 6 install and update the required packages. Once that is in place the SingleStore Studio and CLI are installed (lines 8 to 13). Line 10 sets the permissions on the ‘singlestoredb-studio.hcl’ configuration file. This is required as the install runs under root, while the container will run as the memsql user (this is set on line 16).
In lines 18 – 20 I added several labels for the image. Lines 22 and 23 show the ports that are exposed. Note, I could have also used ports 80 and 443.
Finally, line 25 specifies the command to run within the container, which starts the S2 Studio server (application).
At this point I would like to acknowledge the assistance from Marc Price (Senior Principal Technical Support Engineer) in getting the Docker buildfile configuration finalised.
The next step is to build the image from the Dockerfile. The following is an example build command:
docker build --tag singlestore-tools --file singlestore-tools .
Note, it is important to include the dot at the end of the command.
This produced an image that was 479MB in size.
Once the image has been built you can use the ‘docker history’ command to review the image layers.
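For example, against the image tag used in the build command above (the layers and sizes you see will depend on your base image and package versions):
docker history singlestore-tools:latest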
Now that I had an image, I tested it by running it on the Docker server. For example:
docker run -d -p 8080:8080 --name singlestore-tools singlestore-tools:latest
Here you can see SingleStore Studio running on my Docker server.
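If you want a quick command-line check that the Studio server is listening, rather than using a browser, something like the following works (assuming the port mapping shown above):
curl -sI http://localhost:8080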
Once I was happy with the image, I tagged it and pushed it to my container registry.
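The tag and push steps would look something like the following; the registry name here is just the placeholder used later in the deployment manifest:
docker tag singlestore-tools:latest myregistry.azurecr.io/singlestore-tools:latest
docker push myregistry.azurecr.io/singlestore-tools:latest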
Deploying SingleStore Studio to the K8s cluster
Now that you have an image, the next step is to create the deployment manifests. You need to create the configuration for deploying the S2 Studio application, along with a service and ingress definition. To pre-configure the ‘studio.hcl’ file a Kubernetes ConfigMap is also required.
The S2 Studio application can be deployed as a single pod, or as a Kubernetes deployment so that it can be scaled. In this example I will show how to use a K8s deployment for S2 Studio. An overview of the configuration is shown in the diagram below.
A key decision is where should the S2 Studio application run?
In this example, it is configured to run on the Stateful nodes, using nodeAffinity for the Stateful nodes. But I could have also configured it to run in the singlestore node pool, as this is where the SingleStore Master Aggregator is running.
With that decided, the next decision is how many replicas to run; here I specified 2 replicas. I was testing in Microsoft Azure using an Azure Container Registry.
---
# singleStore-tools deployment YAML
apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    app.kubernetes.io/name: singlestore-tools
    workload.sas.com/class: singlestore
  name: singlestore-tools
spec:
  replicas: 2
  selector:
    matchLabels:
      app.kubernetes.io/name: singlestore-tools
  template:
    metadata:
      labels:
        app: singlestore-tools
        app.kubernetes.io/name: singlestore-tools
        workload.sas.com/class: singlestore
    spec:
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              - key: kubernetes.azure.com/mode
                operator: NotIn
                values:
                - system
              - key: workload.sas.com/class
                operator: In
                values:
                - stateful
      containers:
      - image: myregistry.azurecr.io/singlestore-tools:latest
        imagePullPolicy: Always # IfNotPresent or Always
        name: s2tools
        resources:
          requests: # Minimum amount of resources requested
            cpu: 1
            memory: 128Mi
          limits: # Maximum amount of resources requested
            cpu: 2
            memory: 256Mi
        ports:
        - containerPort: 8080 # The container exposes this port
          name: http # Name the port "http"
        volumeMounts:
        - name: studio-files-volume
          mountPath: /tmp/s2studio-files
        lifecycle:
          postStart:
            exec:
              command:
              - /bin/sh
              - '-c'
              - |
                cp /tmp/s2studio-files/studio.hcl /var/lib/singlestoredb-studio/studio.hcl
      tolerations:
      - effect: NoSchedule
        key: workload.sas.com/class
        operator: Equal
        value: stateful
      volumes:
      - name: studio-files-volume
        configMap:
          name: studio-files
A consideration for creating the deployment manifest is that when a ConfigMap is mounted as a volume it becomes read-only. Therefore, you can’t directly mount the studio.hcl file into the target location (as the S2 Studio server requires read-write access to the studio.hcl file).
Above you can see the ‘studio-files’ ConfigMap is mounted as the volume: ‘studio-files-volume’, with a mountPath of ‘/tmp/s2studio-files’.
So, the configMap file(s) are loaded into a temporary location, then copied into the configuration. This is achieved with the following copy command:
cp /tmp/s2studio-files/studio.hcl /var/lib/singlestoredb-studio/studio.hcl
This copies my pre-configured cluster definition (the studio.hcl file) into the Studio server configuration with the required permissions.
Another consideration when deploying multiple replicas is whether to define Pod Affinity / AntiAffinity rules.
For my test environment I defined a single node pool, called services, for the Viya stateful and stateless services. It had the stateful label and taint applied to the nodes. Below you can see that while I hadn’t defined any podAntiAffinity rules, I ended up with the S2 Studio pods (singlestore-tools) running on different nodes.
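To check where the S2 Studio pods have been scheduled, you can list them with their node assignments; the namespace (viya) is an assumption, and the label selector matches the label used in the deployment above:
kubectl -n viya get pods -l app.kubernetes.io/name=singlestore-tools -o wide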
Creating the Service and Ingress definitions
To be able to access the S2 Studio application, a service and an ingress definition are required. We will first look at the service definition.
---
apiVersion: v1
kind: Service
metadata:
  name: s2studio-http-svc
  labels:
    app.kubernetes.io/name: s2studio-http-svc
spec:
  selector:
    app.kubernetes.io/name: singlestore-tools
  ports:
  - name: s2studio-http
    port: 80
    protocol: TCP
    targetPort: 8080
  type: ClusterIP
Here you can see the service definition: the service is called s2studio-http-svc and maps port 80 to port 8080 on the container(s).
To access the S2 Studio application, I also needed a DNS name that would resolve to it; this is the host name used in the ingress definition. In my environment I had a DNS wildcard for:
*.camel-a20280-rg.gelenable.sas.com
Therefore, I used a host name of: s2studio.camel-a20280-rg.gelenable.sas.com
---
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: s2studio-ingress
  annotations:
    kubernetes.io/ingress.class: nginx
  labels:
    app.kubernetes.io/name: s2studio-ingress
spec:
  rules:
  - host: s2studio.camel-a20280-rg.gelenable.sas.com
    http:
      paths:
      - path: /
        pathType: Prefix
        backend:
          service:
            name: s2studio-http-svc
            port:
              number: 80
Here you can see the ingress is targeting the s2studio-http-svc service.
Create the Studio Server configuration
Given that the S2 Studio application is running in the Kubernetes cluster with SAS Viya it is possible to use the internal service name for the memsql cluster. The key advantage of using the service name is that it keeps the connection from the S2 Studio application to the memsql cluster internal to the K8s cluster.
The service name is also a known value for a SAS Viya with SingleStore deployment, which means it is possible to pre-configure the studio.hcl file with a connection profile for the memsql cluster.
The DDL service name is: svc-sas-singlestore-cluster-ddl
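If you want to confirm the service name in your environment, you can list it directly; the namespace (viya) is an assumption, use your SAS Viya namespace:
kubectl -n viya get svc svc-sas-singlestore-cluster-ddl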
The following was the ‘studio.hcl’ definition that I created.
version = 1

cluster "ViyaS2Profile" {
  name = "SAS Viya DDL Connection"
  description = "Connection using port 3306"
  hostname = "svc-sas-singlestore-cluster-ddl"
  port = 3306
  profile = "DEVELOPMENT"
  websocket = false
  websocketSSL = false
  kerberosAutologin = false
}
Once the file has been created, the following command can be used to create the ConfigMap.
kubectl -n namespace create configmap configmap_name --from-file=file_name
Note, it would have been possible to create an inline definition for the studio.hcl file in the S2 Studio deployment yaml. However, I prefer to keep this separate as it provides more flexibility and makes it easier to load (define) multiple files. We will see this in Part 2 of this post.
In my opinion it also makes it easier to create the files, as you don’t have to worry about yaml indentation. You just create the files as required.
The only consideration for this approach is that the ConfigMap must be in place prior to applying the deployment for the S2 Studio application.
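As a sketch of the overall ordering (the namespace and manifest file names here are illustrative), the steps would look something like:
# 1. Create the ConfigMap holding the pre-configured studio.hcl
kubectl -n viya create configmap studio-files --from-file=studio.hcl
# 2. Apply the deployment, service and ingress manifests
kubectl -n viya apply -f singlestore-tools-deployment.yaml
kubectl -n viya apply -f s2studio-service.yaml
kubectl -n viya apply -f s2studio-ingress.yaml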
The Results…
With the above configuration in place, you are set to start using SingleStore Studio. Below you can see the SingleStore Studio home page with the pre-configured cluster definition.
To review the configuration, the studio.hcl file has a pre-configured profile and the S2 Studio pods connect to the SingleStore Master Aggregator on port 3306 using the DDL service (svc-sas-singlestore-cluster-ddl).
Conclusion
Here we have looked at how to create a container image for the SingleStore Client and Studio. The configuration shown is using HTTP to connect to S2 Studio. In Part 2 I will show how to implement TLS using the SAS Viya secrets.
Finally, it is important to remember that the SingleStore Studio application is not maintained by SAS, and it is not shipped with the SAS Viya with SingleStore order. As such, SAS Technical Support will not provide support for this type of deployment.
Thanks for reading…
Michael Goddard
Find more articles from SAS Global Enablement and Learning here.
- Find more articles tagged with:
- GEL
- singlestore
- studio
06-09-2024
07:23 PM
3 Likes
This is Part 2 of the post on running SAS Viya on a shared Kubernetes cluster. In Part 1 I discussed some of the challenges that can be encountered when deploying SAS Viya in a cluster that has untainted nodes, as well as the main deployment considerations.
In this post I will discuss implementing required scheduling, required nodeAffinity, to force a desired topology when there are untainted nodes in the cluster.
Again, I will share some of the tests that I ran to help illustrate the issues.
To recap, for my testing I created an Azure Kubernetes Service (AKS) cluster with the required SAS Viya node pools, with the labels and taints applied, but they were scaled to zero. In addition to the four node pools that correspond to the standard SAS Viya workload classes (cas, compute, stateful, and stateless) and a ‘system’ node pool for the Kubernetes control plane, I also created an ‘apps’ node pool (without any taints applied).
This was to simulate a scenario where there are other applications running and having several nodes available that didn’t have any taints applied.
Using required scheduling to force a topology
To ensure that the SAS Viya pods run on the desired nodes (when there are untainted nodes), ‘required node affinity’ must be used.
I will start by saying that the easiest path, the simplest configuration option when running with untainted nodes, is to focus on configuring just the CAS and Compute pods. This is easier than reconfiguring all the stateless and stateful services to use required scheduling. There is also a patch transformer for CAS supplied in the overlays: require-cas-label.yaml
See the SAS Viya README: Optional CAS Server Placement Configuration
This means that you only need to focus on the sas-programming-environment components (pods).
I discuss enabling required scheduling in the following post: Creating custom Viya topologies – Part 2 (using custom node pools for the compute pods).
For this test I configured required scheduling for both CAS and Compute, reset the cas and compute node pools to zero nodes, then deployed SAS Viya.
I created the following patch transformers for the sas-programming-environment pods:
sas-compute-job-config-require-compute-label.yaml
sas-batch-pod-template-require-compute-label.yaml
sas-launcher-job-config-require-compute-label.yaml
sas-connect-pod-template-require-compute-label.yaml
To summarize the configuration, the following patch is applied to update the nodeAffinity for the Compute components:
patch: |-
  - op: remove
    path: /template/spec/affinity/nodeAffinity/preferredDuringSchedulingIgnoredDuringExecution
  - op: add
    path: /template/spec/affinity/nodeAffinity/requiredDuringSchedulingIgnoredDuringExecution/nodeSelectorTerms/0/matchExpressions/-
    value:
      key: workload.sas.com/class
      operator: In
      values:
      - compute
While this worked, it did highlight a problem for the first user to log in to SAS Studio, as it is at this point that the first compute node is created. The SAS Studio session timed out waiting for the Compute Server (pod) to download the container images and then start.
I waited for the compute node to fully start then started a new SAS Studio session. This time I got the Compute Server context.
This does beg the question: Should you ever let the compute node pool scale to zero?
Looking at the SAS Viya deployment I still didn’t have any pods running on the stateful or stateless nodes. These node pools were still scaled to zero, all the stateless and stateful pods were running on the ‘apps’ nodes.
But would starting some stateless and stateful nodes prior to the SAS Viya deployment fix this?
At this point I did another test; I manually scaled the stateful node pool to have one node. Before the SAS Viya deployment the AKS cluster had the following nodes.
You can see that I had three nodes available in the SAS Viya node pools and again I had several ‘apps’ nodes (aks-apps-xxxx-vmssnnnn) available to simulate nodes being used for other applications.
I did this to illustrate the Kubernetes pod scheduling behaviour. Kubernetes (the cluster autoscaler) will not scale a node pool until it is needed, regardless of the node affinity for the pod. If an “acceptable” node is available, in this case one without any taints, the pod will be scheduled there as a first choice.
Hence, at the end of this deployment I still only had one stateful node and no stateless nodes.
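A quick way to see which nodes are active and how they are labelled is to list the nodes with the SAS workload class as a column; the agentpool label is an AKS-specific assumption:
# Show nodes with their SAS workload class and AKS node pool
kubectl get nodes -L workload.sas.com/class -L agentpool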
Once SAS Viya had deployed, I used the following command to view the pods running on the stateful node (aks-stateful-12471038-vmss000000) and found there were only 8 pods running on the node.
kubectl -n namespace get pods --field-selector spec.nodeName=node-name
The rest of the Viya stateful and stateless pods were running on the 'apps' nodes.
Using one node pool dedicated to SAS Viya
In this scenario the organisation perhaps wants to dedicate, or reserve, only one node pool to SAS Viya. When sharing a cluster with other applications this is probably the simplest approach to ensure that there are nodes available to meet the requirements for the SAS Viya Compute Server and CAS functions.
I like this configuration as you only need to create one additional node pool in an existing cluster. Then using required scheduling for the CAS and Compute pods it also allows for scaling the node pool to zero.
The problem with scaling the node pools to zero is avoided because starting the CAS server triggers the initial scaling of the shared node pool.
For this test I created a ‘viya’ node pool, with the compute label and taint applied.
To avoid pod drift, the sas-programming-environment pods should be updated to use required scheduling using the standard compute labels and taints, as described above.
Additionally, as the CAS and Compute pods are sharing the same node pool, the CAS server configuration had to be updated to target the shared node pool, the ‘viya’ node pool. For this I used the provided CAS overlay as a template and targeted the workload.sas.com/class=compute label. The tolerations also had to be updated for the compute taint.
The patch transformers for the CAS configuration are shown below. The first patch transformer updates the CASDeployment.
# PatchTransformer to make the compute label required and provide a toleration for the compute taint
---
apiVersion: builtin
kind: PatchTransformer
metadata:
  name: run-cas-on-compute-nodes
patch: |-
  # Remove existing nodeAffinity
  - op: remove
    path: /spec/controllerTemplate/spec/affinity/nodeAffinity/preferredDuringSchedulingIgnoredDuringExecution
  # Add new nodeAffinity
  - op: add
    path: /spec/controllerTemplate/spec/affinity/nodeAffinity/requiredDuringSchedulingIgnoredDuringExecution/nodeSelectorTerms/0/matchExpressions/-
    value:
      key: workload.sas.com/class
      operator: In
      values:
      - compute
  # Set tolerations
  - op: replace
    path: /spec/controllerTemplate/spec/tolerations
    value:
    - effect: NoSchedule
      key: workload.sas.com/class
      operator: Equal
      value: compute
target:
  group: viya.sas.com
  kind: CASDeployment
  name: .*
  version: v1alpha1
The second patch transformer updates the 'sas-cas-pod-template'.
---
apiVersion: builtin
kind: PatchTransformer
metadata:
  name: set-cas-pod-template-tolerations
patch: |-
  - op: replace
    path: /template/spec/tolerations
    value:
    - effect: NoSchedule
      key: workload.sas.com/class
      operator: Equal
      value: compute
target:
  kind: PodTemplate
  version: v1
  name: sas-cas-pod-template
The final update I made was to adjust the CPU and memory requests and limits for CAS. This was to ensure that there was space available on the shared nodes for the Compute Server pods.
For example, I used nodes with 16vCPU and 128GB memory, then set the CAS limits to 12 vCPU and 96GB memory. I also configured the requests and limits the same to enforce guaranteed QoS. Using guaranteed QoS is important to protect the CAS Server pods when the nodes are busy. It ensures that the CAS pods are among the last to be evicted from a node.
This was my target topology.
To configure the CPU and memory for CAS see the example in sas-bases:
../sas-bases/examples/cas/configure/cas-manage-cpu-and-memory.yaml
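Once deployed, you can confirm that Kubernetes has assigned the Guaranteed QoS class to the CAS pods; the namespace (viya) and the default controller pod name are assumptions, adjust them for your environment:
kubectl -n viya get pod sas-cas-server-default-controller -o jsonpath='{.status.qosClass}'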
When I deployed SAS Viya with an MPP CAS Server, this had the benefit of ensuring that multiple nodes were available for the Compute Server pods. The Compute pods were configured with the resource defaults (CPU and memory requests and limits).
As a side note (running SAS Viya 2024.03), if you inspect the sas-compute-server pod you will see two running containers. The sas-programming-environment container has resource limits of 2 cpu and 2Gi memory, and the sas-process-exporter container has limits of 2 cpu and 4Gi memory.
Once SAS Viya was running, I then started multiple SAS Studio sessions. In the image below you can see that the Compute Server pods are running on multiple ‘viya’ nodes.
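To see this from the command line rather than a screenshot, you can list the compute server pods with their node assignments (the namespace is an assumption):
kubectl -n viya get pods -o wide | grep sas-compute-server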
As an end user life was good, as my SAS Studio sessions started without any failures. 😊
Conclusion
Sharing the cluster with other applications is possible, but this needs to be carefully planned to ensure the best result is achieved for ALL applications.
As always, it’s important to focus on the requirements for the SAS Viya platform. When the cluster has untainted nodes, you should configure required scheduling to ensure you have an operational SAS Viya platform. The simplest approach is probably just to focus on Compute and CAS.
A key question is: Do the untainted nodes meet the system requirements for SAS Viya?
If not, additional node pools WILL be required to run the SAS Viya platform.
Here I have proposed the concept of sharing a node pool for Compute and CAS, and shown how you could reserve some capacity for the two workloads. You should do some capacity planning and sizing to establish suitable node sizes and the appropriate resource reservations.
But keep in mind, it is possible to have a shared node pool for CAS and Compute and still use CAS auto-resources to effectively dedicate some nodes within the node pool to running CAS.
Finally, I have left you with a couple of questions to ponder (from an end-user perspective):
Should you ever let the compute node pool scale to zero?
How many nodes should you have available for the Compute pods to avoid SAS Studio timeouts?
I’m sure this will lead to interesting discussions! Maybe it’s a topic for another day…
- Find more articles tagged with:
- architecture
- deployment
- GEL
- SAS Viya
05-13-2024
04:53 PM
Hi @EyalGonen
The key reasons for possible incompatibilities are around the system requirements, and there could be changes introduced to the cluster-wide resources that would break the older Viya deployments.
Therefore, when sharing a cluster for multiple SAS Viya deployments it is important to ensure that they can coexist. For those reasons I also wouldn't recommend collocating Stable cadence and LTS cadence versions on the same cluster, as that may increase the risk of incompatibilities.
I hope that helps.
04-15-2024
12:40 AM
1 Like
When we think about running SAS Viya in a shared Kubernetes cluster there are two use cases:
Using a dedicated Kubernetes cluster for SAS Viya, running multiple SAS Viya deployments.
Running SAS Viya in a Kubernetes cluster that is shared with other (3rd party) applications.
With the withdrawal of support for application multi-tenancy within SAS Viya, as of Stable 2023.10 and LTS 2023.10, the requirement for running multiple SAS Viya environments in a shared cluster may become more common.
I will start by stating that it is possible to run SAS Viya in a shared cluster in both cases, but there are several considerations. In this post I will focus on running SAS Viya in a Kubernetes cluster shared with other applications.
When using a Kubernetes (K8s) cluster for multiple applications it is important to understand all the application workloads, their individual system requirements and the service levels associated with the applications.
These factors impact the K8s cluster design including the design for the node pools and workload placement. Complications often arise when an organisation doesn’t want to dedicate nodes to applications and does not want to taint nodes.
Applications such as SAS Viya have some specific system requirements, for example:
the need for local storage on the nodes for SASWORK and CAS disk cache
the in-memory processing requirements for the CAS server
a requirement to use GPUs.
This is not common for other business applications.
SAS Viya also has a need for specific node taints and labels. Here I’m particularly thinking about the sas-programming-environment (Compute Server) pods and CAS. Functions like the SAS Workload Orchestration and the sas-programming pre-pull function rely on having nodes labelled with: workload.sas.com/class=compute
Remember, in Kubernetes, node taints are used to repel pods and node labels are used to attract pods. These are used in conjunction with the node affinity and toleration definitions in the application pods. See the Kubernetes documentation: Assigning Pods to Nodes
As the SAS Viya configuration defaults to using preferred scheduling, appropriately labelling and tainting nodes is important to achieving the desired workload placement for a (specific) topology.
The node pools are a way to optimise the deployment for specific application requirements (like GPUs and storage) and workloads.
Another consideration when sharing the cluster with other applications is the level of permissions required to deploy and run SAS Viya. This can be a concern for some organisations. For example, the need for cluster-wide admin permissions.
SAS Viya also has specific system requirements for functions such as the ingress controller, these might be different to, or in conflict with, the application ecosystem that is currently in place.
Hence, the system requirements and admin permissions for SAS Viya can be key drivers for dedicating a cluster to running SAS Viya.
With that said, let’s now look at some deployment scenarios. I felt the easiest way to illustrate some of the deployment concerns was to run some tests.
Running in a cluster with untainted nodes
As the default SAS Viya deployment uses preferred scheduling, one of the key concerns to achieving a target topology is what I call “pod drift”. Pod drift is where the Viya pods end up running on nodes that aren’t designated for the SAS Viya processing. Therefore, you are not achieving the target workload placement for the desired topology for SAS Viya.
For most components this isn’t a problem, but for functions that have specific node requirements this can lead to poor performance and/or resource utilization, or worse, it might even break the Viya deployment in some way!
To illustrate this, I tested the following scenarios when running SAS Viya in a cluster with untainted nodes:
The organisation doesn't taint nodes but has agreed to label and taint some nodes for SAS Viya.
The SAS Viya node pools have been defined to allow them to scale to zero.
Dedicating a single node pool to SAS Viya.
Test 1
In my first test I created an AKS cluster with the required SAS Viya node pools, with the labels and taints applied, but they were scaled to zero. In addition to four node pools that correspond to the standard SAS Viya workload classes (cas, compute, stateful, stateless) and a ‘system’ node pool for the Kubernetes control plane, we also have the ‘apps’ node pool (without any taints applied).
This was to simulate a scenario where there are other applications running and several nodes available that didn’t have any taints applied.
At the start of the SAS Viya deployment the state of the cluster is shown in the image below.
The ‘apps’ node pool had 6 nodes running and the system node pool had one node.
When I deployed SAS Viya the only node pool that was used was the apps node pool. Even with the CAS auto-resources enabled the Kubernetes scheduler selected to start a new node in the apps node pool rather than using the cas node pool. This can be seen in the image below.
In the image you can see that sas-cas-server-default-controller is running on the seventh node (aks-apps-37136449-vmss000006) in the 'apps' node pool.
When I started a SAS Studio session, I couldn’t get a compute server as there were no compute nodes available. This is shown below.
In the image, you can see that I failed to get a SAS Studio compute context, and the SAS Compute Service and the SAS Launcher pods are both running in the apps node pool, but there isn’t any SAS Compute Server pod. There were no nodes available with the compute label:
workload.sas.com/class=compute
At this point I should state that I had the SAS Workload Orchestrator (SWO) enabled with the default configuration. As there weren't any nodes available with the workload.sas.com/class=compute label, the Compute Server for the SAS Studio session was not started.
Obviously, this isn’t a good experience for the SAS users.
It also highlights that it is critical to have node(s) available with the compute label when using the SAS Workload Orchestrator, and what happens when the cluster autoscaler doesn’t trigger the scaling of the compute node pool.
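A quick check for this condition is to list the nodes carrying the compute label; if the command returns nothing, SAS Studio sessions will not get a Compute Server until a node is started:
kubectl get nodes -l workload.sas.com/class=compute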
So, what would happen if there were cas and compute nodes available?
Test 2
In test 2 I manually scaled the cas and compute node pools to have one node each. The image below shows the state of the cluster when I started the test.
The hope here was that I would get the CAS server and Compute Server pods running on the desired nodes.
This deployment was better as the Kubernetes scheduler selected to run pods on the cas and compute nodes. The CAS Server (sas-cas-server-default-controller) was running on the cas node, and I was able to get a compute server when I started SAS Studio. This is shown in the following image.
Here you can see that the sas-compute-server pod is running on the compute node, and the SAS Compute Service and SAS Launcher pods are running in the ‘apps’ node pool.
This still wasn’t perfect as in this test I still didn’t get any pods running on the Stateless or Stateful nodes. I would have had to have nodes available in the stateless and stateful node pools for this to happen.
Another problem with this Kubernetes configuration is that when SWO is disabled, there is no guarantee that all the sas-compute-server pods will be running in the compute node pool. Let's assume that there is a significant programming workload and there isn't sufficient capacity on the existing compute server node(s).
Rather than starting a new compute node, the K8s scheduler could select one of the existing apps nodes to run the new sessions. At this point the users would start to see the error shown above in the first test: the timeout in SAS Studio waiting for the Compute Server pod (the SAS Studio compute context) to start. This is due to the time required to pull the sas-programming-environment image down to a node.
A user could strike it lucky, and the Compute Server could be scheduled onto a node that already has the sas-programming-environment image. But the Compute Server still has dependencies on things like storage for SASWORK and maybe the need for GPUs.
This illustrates the problem of using preferred scheduling in a cluster with untainted nodes.
You can probably live with the stateless and stateful pods running on any available node in the cluster, but how do you ensure that the compute and cas pods are running on the target nodes even when the node pools scale to zero?
To ensure that the Viya pods run on the desired nodes (when there are untainted nodes) ‘required node affinity’ (required scheduling) must be used. I will discuss this in more detail in part 2.
Conclusion
Sharing the cluster with other applications is possible, but this needs to be carefully planned to ensure the best result is achieved for ALL applications.
The SAS Viya system requirements and admin permissions need to be discussed and can be key drivers for dedicating a cluster to running SAS Viya.
As we have seen, there are many factors that affect the Kubernetes scheduling, including node availability, the labels and taints on nodes, as well as the application configuration (in this case SAS Viya), to name a few.
In part 2 we will look at using required scheduling to force a topology (when there are untainted nodes) and dedicating a single node pool to SAS Viya within the shared cluster.
Find more articles from SAS Global Enablement and Learning here.
- Find more articles tagged with:
- deployment
- GEL
- SAS Viya
02-28-2024
05:48 PM
1 Like
This is another post in my SAS Viya Topologies series. This time we will look at using a two-node-pool configuration and how to get the desired topology. We will examine sharing a node pool for Compute and CAS processing. For this I tested using both an SMP CAS Server and an MPP CAS Server.
For this testing I was working in the Microsoft Azure Cloud and using the SAS Infrastructure as Code GitHub project to build the Kubernetes cluster.
Let’s have a look at the details and how to get the desired topology.
Desired topology
As the title suggests, the desired topology was to use two node pools for my SAS Viya deployment. That is, a node pool for the microservices and a single node pool for the Compute and CAS pods. My goal was to use “small” commodity VMs for all services other than the Compute and CAS engines (pods). This is shown in the image below.
I had an objective of simplifying the deployment topology by using a single node type, node pool, for the “compute-tier”.
The rationale for using a single node pool for the Compute and CAS processing is that they have similar node requirements. Nodes with ample CPU and memory with local ephemeral disk. It is also a recognition that you can share the nodes by letting Kubernetes control the pod scheduling.
Deployment Decision
A key deployment, or architectural, decision is how to implement the single node pool for the compute-tier.
To minimise the “custom configuration” I wanted to make use of the standard SAS Viya workload class labels and taints where possible. So, should you configure the Compute pods to run on the CAS nodes, or is it better to configure CAS to run on the compute nodes?
I have previously written about Creating custom Viya topologies – Part 2 (using custom node pools for the compute pods), in that blog I discussed the need to update the following configuration:
sas-compute-job-config
sas-batch-pod-template
sas-launcher-job-config
sas-connect-pod-template
Two additional considerations that I would like to highlight when moving the compute pods are, firstly, the prepull function for the SAS programming environment container image and, secondly, whether SAS Workload Management (WLM) will be enabled. WLM requires at least one node to have the “compute” workload class label.
With the above in mind, and based on my testing, the best and/or simplest approach is to configure the CAS pods to run on the “compute” nodes. That is, nodes that have the ‘workload.sas.com/class=compute’ label and taint applied.
Realising the topology – IAC configuration
As I said in the introduction, I was working in the Azure Cloud and used the SAS Viya 4 Infrastructure as Code (IaC) for Microsoft Azure GitHub project to build the AKS cluster. The image below shows the node pool definitions that I used for my testing.
Here you can see the compute node pool definition is using the Standard_E8ds_v4 instance type; this provides 8 vCPUs with 64GiB of memory and 300GB of SSD temp storage. This will be used for the Compute and CAS pods.
The nodes have the standard labels and taint applied for the compute nodes.
The second node pool is called generic and does not have any labels or taints applied. It is using the Standard_D4s_v4 instance type, which provides 4 vCPUs with 16GiB of memory; it was my “commodity” VM instance. Note, this node pool has “max_nodes” set to 20, which isn't needed for SAS Viya to run; in fact, using this instance type the deployment spun up 6 nodes.
Tip! When using the IAC and defining nodes without any label or taint, you still must specify the “node_labels” and “node_taints” parameters, with null values, as shown above.
Realising the topology – SAS Viya configuration
The advantage of using the standard compute node configuration is that you only need to focus on the CAS configuration. Let’s look at what is required.
The core of the configuration is that CAS pods need to target the compute nodes and must have a toleration for the compute (workload.sas.com/class=compute) taint.
For this I used the require-cas-label.yaml as a template to configure required scheduling to use the compute label. The ‘require-cas-label.yaml’ can be found in the ../sas-bases/overlays/cas-server folder.
I also used this to set the tolerations for the CASDeployment. The following is the configuration that I used.
Line 10 is highlighted and shows the definition for the required scheduling. The example transformer in sas-bases only has the first '- op: add' statement, which provides the configuration to target CAS nodes; I updated this to target the Compute nodes.
In addition to this update, on lines 16 – 36 you can see the update that I added to replace the tolerations. This configuration also illustrates a change that was introduced at Stable 2023.05 (May 2023), the addition of two new workload classes for CAS.
For this, lines 23 – 29 set the tolerations for the controllerTemplateAdditions and lines 30 – 36 set the tolerations for the workerTemplateAdditions. As you can see the tolerations should be set in three definitions now, not just on the controllerTemplate definition.
Here is the template should you need to copy and paste it.
# PatchTransformer to make the compute label required
# in addition to the azure system label
---
apiVersion: builtin
kind: PatchTransformer
metadata:
  name: require-compute-label
patch: |-
  - op: add
    path: /spec/controllerTemplate/spec/affinity/nodeAffinity/requiredDuringSchedulingIgnoredDuringExecution/nodeSelectorTerms/0/matchExpressions/-
    value:
      key: workload.sas.com/class
      operator: In
      values:
      - compute
  - op: replace
    path: /spec/controllerTemplate/spec/tolerations
    value:
    - effect: NoSchedule
      key: workload.sas.com/class
      operator: Equal
      value: compute
  - op: replace
    path: /spec/controllerTemplateAdditions/spec/tolerations
    value:
    - effect: NoSchedule
      key: workload.sas.com/class
      operator: Equal
      value: compute
  - op: replace
    path: /spec/workerTemplateAdditions/spec/tolerations
    value:
    - effect: NoSchedule
      key: workload.sas.com/class
      operator: Equal
      value: compute
target:
  group: viya.sas.com
  kind: CASDeployment
  name: .*
  version: v1alpha1
In addition to the PatchTransformer above, you also need to set the tolerations for the sas-cas-pod-template. This is done using the following configuration (set-cas-pod-template-tolerations.yaml).
# Patch to update the sas-cas-pod-template pod configuration
---
apiVersion: builtin
kind: PatchTransformer
metadata:
  name: set-cas-pod-template-tolerations
patch: |-
  - op: replace
    path: /template/spec/tolerations
    value:
    - effect: NoSchedule
      key: workload.sas.com/class
      operator: Equal
      value: compute
target:
  kind: PodTemplate
  version: v1
  name: sas-cas-pod-template
The two PatchTransformers shown above form the core of the configuration to use the Compute nodes for the CAS pods.
An additional consideration is whether to use the CAS auto-resources configuration. I don't recommend doing this for a couple of reasons. Firstly, and most importantly, my testing showed that using the CAS auto-resourcing prevented the Compute prepull function from operating.
Secondly, the auto-resourcing is intended to dedicate nodes to the CAS pods, and this configuration is looking to share the nodes (between CAS and SAS programming workloads), so it doesn’t make sense to implement the auto-resourcing. See the Deployment Guide: Adjust RAM and CPU Resources for CAS Servers.
However, there is a final configuration that I would recommend. You should set the resource requests and limits for the CAS pods and implement Guaranteed Quality of Service (QoS).
Implementing Guaranteed QoS provides additional protection for the CAS pods and ensures that they will not be killed by the out-of-memory (OOM) processing. The Compute pods will be evicted from the nodes should an out-of-memory situation occur. It should be noted that the sas-compute pods are transient; this is a normal configuration, not an unintended consequence of using the two-node-pool topology.
If a Compute pod gets evicted it just affects one user, while if a CAS pod is evicted it will have an impact on all CAS users (depending on the CAS Server configuration and how the data has been loaded).
To set the CAS pod requests and limits you can use the cas-manage-cpu-and-memory.yaml example in the ../sas-bases/examples/cas/configure folder.
To implement the Guaranteed QoS you set the requests and limits to the same value. For my environment I was using the Standard_E8ds_v4 instance type, which provides 8 vCPUs with 64GiB of memory. For my testing I set the memory requests and limits to 48GiB and the CPU requests and limits to 6. This is shown in the example below.
# This block of code is for adding resource requests and resource limits for
# memory and CPU.
---
apiVersion: builtin
kind: PatchTransformer
metadata:
  name: cas-manage-cpu-and-memory
patch: |-
  - op: add
    path: /spec/controllerTemplate/spec/containers/0/resources/limits
    value:
      memory: 48Gi
  - op: replace
    path: /spec/controllerTemplate/spec/containers/0/resources/requests/memory
    value:
      48Gi
  - op: add
    path: /spec/controllerTemplate/spec/containers/0/resources/limits/cpu
    value:
      6
  - op: replace
    path: /spec/controllerTemplate/spec/containers/0/resources/requests/cpu
    value:
      6
target:
  group: viya.sas.com
  kind: CASDeployment
  # Uncomment this to apply to all CAS servers:
  name: .*
  # Uncomment this to apply to one particular named CAS server:
  #name: {{ NAME-OF-SERVER }}
  # Uncomment this to apply to the default CAS server:
  #labelSelector: "sas.com/cas-server-default"
  version: v1alpha1
Using this configuration will leave 2vCPU and 16GiB of memory for other pods. By default, each compute session will request 50millicores and 300MB of memory.
Finally, the kustomization.yaml needs the following updates to implement the configuration. For my environment I used a ‘cas’ folder under ‘/site-config’ to hold the configuration. The configuration needs to be added to the transformers section. For example.
transformers:
:
- site-config/cas/require-compute-label.yaml
- site-config/cas/set-cas-pod-template-tolerations.yaml
- site-config/cas/cas-manage-cpu-and-memory.yaml
Looking at the results
I tested using both an SMP CAS Server and an MPP CAS Server. One of the nice things about the MPP CAS Server deployment was that there were now multiple compute nodes available for the Compute pods. For example.
Here you can see my MPP CAS deployment, a Controller with 4 Workers, all running in the compute node pool (the compute nodes). Each is on a different compute node due to the CPU and memory resource reservations and pod anti-affinity settings.
To further test the configuration, I started two SAS Studio sessions; you can see that one sas-compute pod started on compute node vmss000001 and the other session started on node vmss000003. This is highlighted in the yellow box.
In this second example, I deployed an SMP CAS Server. The first SAS Studio session is using the same node as the CAS Server (vmss00000g). I then manually scaled the Compute node pool to have two nodes. Once the second node was ready, I started a second SAS Studio session; you can see that it is using the vmss00000h node.
Finally, I wanted to confirm the CAS pod configuration. For this I used the kubectl describe pod command.
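For example, assuming a namespace of viya and the default CAS server name, the following shows the resource settings of the controller pod:
kubectl -n viya describe pod sas-cas-server-default-controller | grep -A 2 -E 'Limits|Requests'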
Here you can see that the sas-cas-server pods have the requests and limits set as configured.
Conclusion
Hopefully this demonstrates that it is a relatively simple process to configure the SAS Viya deployment to share a node pool for the Compute and CAS pods.
I see this type of configuration mainly being used for Visual Analytics deployments supporting a small number of programmers. For large environments supporting many programmers and/or heavy CAS processing, or environments looking to further optimise the deployment, dedicated Compute and CAS node pools would still be used.
It should be noted that configuring the node pools using the IaC is a relatively trivial process, so a valid question is whether the added configuration complexity is worth the effort. I will let you decide that. But if your customer wants to limit the number of node pools, it is possible.
Finally, to recap, for a scenario where a shared node pool is desired for CAS and Compute:
Disable the cas auto-resourcing when sharing the nodes for both Compute and CAS workloads.
Manually configure the CAS CPU and memory requests and limits. I recommend using Guaranteed QoS so that the CAS pods are not killed by any OOM processing.
Find more articles from SAS Global Enablement and Learning here.
- Find more articles tagged with:
- architecture
- deployment
- GEL
- SAS Viya
12-18-2022
06:00 PM
In this post we will look at using Azure Container Instances to run SAS Container Runtime (SCR) model images. The SAS Container Runtime is a lightweight Open Container Initiative (OCI) compliant container that provides a runtime environment for SAS models and decision flows.
Once the model or decision flow has been published to a container registry it is truly portable. It can run on a Docker or containerd runtime environment, a Kubernetes cluster, or one of the cloud providers' serverless platforms. Azure Container Instances (ACI) is Microsoft's serverless platform.
Let’s look at deploying SAS Container Runtime models to ACI…
For simplicity when I refer to “SCR models” I mean analytical models and decision flows.
The illustration below provides a summary of the deployment options, depicting using SAS Model Manager to publish SAS models.
Before I get into the details of running SCR models on ACI, I want to take a moment to let you know about a key update that was provided in November (Stable 2022.11). As of Stable 2022.11 TLS support is now provided for the SCR model images. Prior to this only unencrypted access was supported. See the SAS Container Runtime documentation: Configuring TLS Security.
Azure Container Instances – what do you need to know?
The Microsoft documentation positions ACI as follows:
“Run Docker containers on-demand in a managed, serverless Azure environment. Azure Container Instances is a solution for any scenario that can operate in isolated containers, without orchestration. Run event-driven applications, quickly deploy from your container development pipelines, and run data processing and build jobs.”
See the Azure Container Instances documentation - serverless containers, on demand
So, is it a good fit for running the SCR models? I guess the short answer is “yes” or “it depends”!
I do think it is a good fit; the “it depends” comes down to the requirements and whether a serverless platform (ACI) is the best option.
Using Azure Container Instances
Deploying container images to ACI is a straightforward process, and like most things in the Azure cloud there are multiple options when it comes to deploying and configuring objects. You can use the Azure Portal GUI or one of the command line interfaces. In this article I will provide examples using the az command-line interface. The ‘az container create’ command is used to create an ACI container instance.
The ACI instance is associated with a resource group, and at a minimum the following key parameters must be specified:
The name for the ACI instance.
The name of the container image and any credentials to access (download) the image.
The port to be used by the ACI container.
Whether the ACI container will have a public IP address or whether it will run on a private network. If the container has a public IP address, you must also specify a DNS label. The DNS label is not used with private networking.
It is also possible to override the default for CPU and memory allocated to the ACI instance. But if increasing the CPU and memory allocations you must be mindful of the Azure Region you are using, as the limits do vary by region. At the time of writing this, the default in Azure EASTUS was 1 core with 1.5 GB memory. For more details see the following Microsoft documentation: Resource availability for Azure Container Instances in Azure regions.
As far as running SCR model images is concerned, the key consideration is that the Tomcat instance running in the SCR container is configured to listen on port 8080 by default, and on port 8443 when TLS is configured. It is not possible to remap port 8080 or 8443 as part of the ACI deployment, so the ‘--ports’ parameter must be set to ‘8080’ or ‘8443’.
The following example is for running a model with a Public IP address, with public access. Therefore, a DNS label must be specified.
az container create \
  --resource-group my-resource_group-rg \
  --subscription ${SUBSCRIPTION} \
  --name scr-qstree1 \
  --image myacr.azurecr.io/qs_tree1:latest \
  --registry-login-server ${ACR_SERVER} \
  --registry-username ${APP_CLIENT_ID} \
  --registry-password ${APP_CLIENT_SECRET} \
  --dns-name-label qstree1-xxxx \
  --ports 8080
In this example, the SCR model image was stored in the Azure Container Registry. An App Registration (service principal) is used to authenticate to the registry. The App ID and secret were stored in variables.
Once the ACI instance has been created you can view it using the command-line or in the Azure portal. Using the command-line, the ‘az container show’ command gives the following output.
$ az container show --name scr-qstree1 --resource-group ${RG} -o table
Name ResourceGroup Status Image IP:ports Network CPU/Memory OsType Location
----------- -------------------- -------- -------------------------------- ------------------ --------- --------------- -------- ----------
scr-qstree1 my-resource_group-rg Running myacr.azurecr.io/qs_tree1:latest 52.152.247.22:8080 Public 1.0 core/1.5 gb Linux eastus
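Another quick check is to review the container logs to confirm that the SCR application has started; the instance and resource group names are those used above:
az container logs --name scr-qstree1 --resource-group my-resource_group-rg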
The following images are from my test environment. The first shows the resource group and the ACI instance, called ‘scr-qstree1’.
Looking at the ACI instance, you can see the DNS name that was created. This has a format of: dns-label.region.azurecontainer.io
This probably isn't the best deployment, as the model is open to the world and the session traffic is unencrypted; it is not using TLS (HTTPS) encryption.
Therefore, a better approach is to deploy the SCR model image, the ACI instance, using secure private networking (and to configure TLS security). When using private networking the model would only be accessible from within the Azure Cloud, to surface the model a frontend load-balancer or an Azure Application Gateway could be used.
Using a frontend proxy (load-balancer or Azure Application Gateway) will allow the SCR port to be remapped, typically to port ‘80’ or ‘443’.
Using Private Networking
When using private networking you don’t define a DNS label, but you need to specify that the IP address is ‘private’ and the vnet and subnet to be used. For example, the following command would create an ACI instance using private networking:
az container create \
  --resource-group my-resource_group-rg \
  --subscription ${SUBSCRIPTION} \
  --name scr-qstree1 \
  --image myacr.azurecr.io/qs_tree1:latest \
  --registry-login-server ${ACR_SERVER} \
  --registry-username ${APP_CLIENT_ID} \
  --registry-password ${APP_CLIENT_SECRET} \
  --ports 8080 \
  --ip-address Private \
  --vnet my-vnet \
  --subnet my-aci-subnet
The virtual network and subnet must exist before you can deploy the ACI instance. For example, the subnet can be created using the ‘az network’ command:
az network vnet subnet create --name my-aci-subnet \
   --address-prefixes 192.168.3.0/24 \
   --resource-group my-resource_group-rg \
   --vnet-name my-vnet \
   --network-security-group my-nsg
The ‘--address-prefixes’ parameter specifies the CIDR range; in my environment I used 192.168.3.0/24. You can confirm the subnet creation using the following command:
az network vnet subnet list -g ${RG} --vnet-name ${vnet} -o table
In my environment this gave the following result. Note, my resource group had a number of subnets defined; the last one in the list was created for the ACI instance.
Conclusion
I hope you can see that it is very easy to deploy an SCR model image to ACI, but you do need to think about how the model(s) will be secured.
Finally, a limitation of using ACI is that there is no concept of a replica set. If a model needs to be highly available (perhaps distributed across Availability Zones, noting that not all regions support them) or scaled to multiple instances for performance reasons, you must manage that yourself. This could be done by deploying multiple ACI instances and then configuring the frontend load balancer to distribute the workload across the available ACI containers.
If the model(s) being published are mission critical, with service-level requirements mandating high availability and/or workload scalability, then it could be better to use a Kubernetes deployment.
Useful resources
GitHub project: SAS Container Runtime (SCR) - the Low Footprint, High Performance Container for SAS Models
Thanks for reading. Michael Goddard
- Find more articles tagged with:
- Azure
- GEL
- SAS Container Runtime
12-18-2022
05:12 PM
1 Like
In this post we will look at using Python with SAS Analytics Pro. More precisely, calling SAS (Analytics Pro) from a Python programming environment.
SAS provides several mechanisms for integrating the Python language with SAS data and analytics capabilities. One such tool is SASPy, which is a module that creates a bridge between Python and SAS (the SAS Foundation). In this post we will look at the configuration required to integrate SASPy with Analytics Pro.
What is SASPy?
SASPy provides Python APIs to the SAS system, allowing the Python programmer to start a SAS session and run analytics from Python through a combination of object-oriented methods and explicit SAS code submission. Data can be moved between SAS data sets and Pandas dataframes; SASPy also allows the exchange of values between Python variables and SAS macro variables.
Let’s have a look at how this works with Analytics Pro.
SASPy connectivity
SASPy supports several connection methods which are described in the SASPy documentation, see here.
When connecting to Analytics Pro, regardless of where it is running (Windows, Linux, Intel macOS, etc), the SSH (STDIO over SSH) connection method needs to be used. Reading the SASPy documentation you will see that this is for connecting to SAS environments running on a Linux platform. As Analytics Pro is running in a container whose base image is built on Linux [Red Hat Universal Base Image (UBI) 8], SSH connectivity must be used.
Note, in late 2020, Apple began the transition from Intel processors to Apple silicon in Mac computers. Analytics Pro is currently not supported on devices using this new CPU architecture.
Configuring SAS Analytics Pro
The Analytics Pro documentation describes the required configuration, see Enable Use of SASPy.
In my testing I was running Analytics Pro on a Linux server. The following graphic illustrates the environment that I used for my testing.
As previously stated, the SASPy connection uses SSH, so SSH (ideally passwordless SSH) is required. Passwordless SSH is often preferred, as it eliminates the need to prompt the user for a password when connecting to Analytics Pro.
To enable SSH access to the Analytics Pro container, you must configure the following:
Enable the SSH port, port 22 by default. Port 22 in the container needs to be mapped to a port on the Docker host. This is done on the ‘docker run’ command using the ‘--publish’ parameter. For example, ‘--publish 8022:22’. This maps port 22 to port 8022 on the Docker host.
SSHD configuration: a ‘sshd.conf’ configuration file is required in the sasinside/sasosconfig directory. The file doesn’t have to have any content.
In addition to the SSH configuration, two Linux capabilities are required: ‘AUDIT_WRITE’ and ‘SYS_ADMIN’. You enable these with the ‘--cap-add’ parameter on the ‘docker run’ command. The capabilities are required by the container operating system (UBI 8) when enabling SSH (it is not an Analytics Pro requirement). A sketch of a ‘docker run’ command combining these options follows this list.
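For example, a minimal sketch of the launch command might look like the following. Note that the image name and tag, the container name, and the volume path are placeholders, and a real ‘docker run’ command for Analytics Pro will include additional options (licensing, further volume mounts, and so on) as described in the documentation.
docker run --detach \
  --name sas-analytics-pro \
  --publish 8022:22 \
  --cap-add AUDIT_WRITE \
  --cap-add SYS_ADMIN \
  --volume /opt/sas/apro/sasinside:/sasinside \
  sas-analytics-pro:latest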
Once you have started Analytics Pro with this configuration, then it is ready for SASPy connections from the Python programming clients.
Python environment configuration
The configuration of the Python environment is fairly straight forward. The SAS documentation states that along with installing the SASPy package, you also need to install the ‘wheel’ and ‘pandas’ packages.
You also need to generate an SSH key pair (public and private keys) when using passwordless SSH.
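For example, on a Linux client (or from PowerShell on Windows with an OpenSSH client installed) the Python-side set-up might look like the following. The key file name matches the one used later in this post; adjust the paths to suit your environment.
# Install SASPy and the supporting packages
pip install saspy wheel pandas
# Generate an SSH key pair for passwordless SSH
ssh-keygen -t rsa -b 4096 -f ~/.ssh/my_rsa_key -N ""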
It is important to note that Windows doesn’t provide the OpenSSH client. However, there are a couple of options here:
Git for Windows installs its own SSH client.
There is also a GitHub project for OpenSSH, see PowerShell/Win32-OpenSSH.
Once you have generated the SSH key it needs to be copied to the Analytics Pro container.
Using a Linux programming client
On a Linux client you can use the ‘ssh-copy-id’ command to copy the SSH key to the Analytics Pro container. For example:
ssh-copy-id -i identities_file -l login_username docker_server -p port
where:
identities_file: is the SSH key (for example, ‘my_rsa_key’).
login_username: is the username for login to Analytics Pro.
port: is the port of the docker host that is being mapped to port 22 on the Analytics Pro container.
Using a Windows programming client
The OpenSSH client for Windows doesn’t provide the ‘ssh-copy-id’ command. So, manual steps are needed to copy the public key to the Analytics Pro container.
The contents of the public key must be copied to the ‘authorized_keys’ file in the user’s .ssh folder, which is in the user’s home directory in the Analytics Pro container. Depending on the set-up of the environment it may be necessary to create the .ssh folder prior to creating the authorized_keys file.
In my environment I also used the ‘ASKPASS’ utility to help with the SSH commands. It is used to pass the password to the SSH command. For example, I ran the following commands from PowerShell ISE to copy the public key to Analytics Pro.
# Create the users .ssh directory
$env:ASKPASS_PASSWORD = 'xxxxxxx'
$env:SSH_ASKPASS_REQUIRE = "force"
$env:SSH_ASKPASS = "C:\Program Files\OpenSSH\askpass_util.exe"
ssh -o StrictHostKeyChecking=accept-new -p 8022 docker_server -l username "mkdir .ssh"
# Use askpass to copy SSH Public Key to remote host
$env:ASKPASS_PASSWORD = 'xxxxxxx'
$env:SSH_ASKPASS_REQUIRE = "force"
$env:SSH_ASKPASS = "C:\Program Files\OpenSSH\askpass_util.exe"
type $env:USERPROFILE\.ssh\my_rsa_key.pub | ssh -p 8022 docker_server -l username "cat > .ssh/authorized_keys"
SASPy configuration
The final set-up step, once you have the SSH key copied to Analytics Pro, is to create the saspy configuration file, called ‘sascfg_personal.py’ by default.
Below is an example of the SSH profile when using my Windows client as the Python programming environment. Note, the ‘identity’ parameter needs to use escaped backslashes (‘\\’) so that the path is in a format that Python can read.
SAS_config_names = ['ssh']
SAS_config_options = {'lock_down': False,
'verbose' : True,
'prompt' : True
}
#SAS_output_options = {'output' : 'html5'} # not required unless changing any of the default
ssh = {'saspath'  : '/opt/sas/viya/home/SASFoundation/sas',
       'ssh'      : 'C:\\Program Files\\OpenSSH\\ssh',
       'identity' : 'C:\\Users\\student\\.ssh\\my_rsa_key',
       'host'     : 'docker_server',
       'luser'    : 'username',
       'port'     : '8022',
       'options'  : ["-fullstimer"]
       }
Looking at the ‘ssh’ profile:
The ‘saspath’ parameter specifies the path to the SAS foundation in the Analytics Pro container.
The ‘ssh’ parameter is the path to the SSH command on the programming client. In the profile on my Linux client this was set to ‘/usr/bin/ssh’.
The ‘identity’, ‘host’, ‘luser’ and ‘port’ parameters provide the information for the SSH connection.
The ‘options’ parameter is used to specify options on the SAS session.
Start program with SAS in Python
With the set-up completed you are now ready to start programming in Python and using SAS data and PROCs. For example, here is a simple program that I used to query the SASHELP.CLASS table (using my Windows client).
#!/usr/bin/env python
# coding: utf-8
import saspy
import pandas as pd
# Start the session with Analytics Pro
sas = saspy.SASsession(cfgfile='c:\\Users\\student\\saspy\\sascfg_personal.py', cfgname='ssh', results='text')
# Query SAS data
mydata = sas.sasdata("CLASS","SASHELP")
mydata.head()
mydata.describe()
# Close the session
sas.endsas()
This resulted in the following output.
Conclusion
As can be seen, the set-up of Analytics Pro and the Python programming environment is not complex. The only real complexity is when working on a Windows client: there isn’t an ‘ssh-copy-id’ command, so you have to perform the manual steps to copy the public key to the Analytics Pro container.
A final note on using a Windows client: the SASPy configuration and the Python script files need to be UTF-8 encoded.
I hope this is helpful and thanks for reading. @MichaelGoddard.
10-20-2022
06:53 PM
3 Likes
One of the updates with Stable 2022.1.2 was the ability to use the SAS Viya CLI to create model publishing destinations. While a publishing destination can be created using SAS Environment Manager, the credentials domain that is required for some destinations could not.
Prior to Stable 2022.1.2 you had to use the Viya REST API to create a base64 Credentials Domain. The ‘base64’ Credentials Domain is required when publishing a model to a Docker Registry (either a Private Docker destination or one of the Cloud Provider destinations).
I recently tested creating publishing destinations with SAS Viya Stable 2022.1.4 and version 1.20.0 of the SAS Viya CLI.
In this post we will look at this new functionality.
As I stated in the introduction, previously you had to use the Viya REST API to create the credentials domain. Once that had been created, you could use either the Viya REST API or Environment Manager to create the publishing destinations. These are required to publish models or decision flows to a registry, that is, to create the SAS Container Runtime (SCR) Docker container image.
In GitHub there is sample code to create the credentials domain and publishing destination. But you are working directly with the Viya REST APIs so it can be a little complicated and definitely harder than using a standard command-line interface (CLI).
See GitHub project: Configuring Publishing Destinations
The first thing to state is that you should always use the latest version of the Viya CLI to get the updates for creating publishing destinations. You can download the SAS Viya CLI directly from the SAS Support website. The download file is available here: Downloads: SAS Viya CLI.
For this support you need the CLI version 1.19.5 or higher. At the time of posting this, version 1.20.0 was available. For general information on using the CLI, see the SAS Viya Administration manual, SAS Viya: Using the Command-Line Interface.
What do you need to know?
Before I get into the details of using the CLI, here are some things to note:
You can’t separately create a credentials domain with the Viya CLI; it is created as part of creating the publishing destination.
This includes setting the user or group information for the domain.
The secrets stored in a base64 credentials domain must be base64 encoded.
While a credentials domain is created as part of creating a publishing destination, the domain can be shared with multiple publishing destinations.
The description fields must be quoted.
The Viya CLI is self-documenting, for example to get help on creating a publishing destination:
./sas-viya models destination --help
This gives the following output.
Figure 1. Viya CLI help
To get the help for creating an Azure publishing destination, you would use the following:
./sas-viya models destination createAzure --help
Creating Publishing Destinations using the CLI
Here is an example of creating an Azure publishing destination (in my Viya environment). When I use the CLI, I use the Viya namespace name as the profile name.
./sas-viya --profile ${NS} models destination createAzure \
--name "testACR" \
--description "Test ACR" \
--baseRepoURL ${ACR_SERVER} \
--subscriptionId ${SUBSCRIPTION} \
--tenantId ${TENANT} \
--region ${REGION} \
--kubernetesCluster ${AKS_NAME} \
--resourceGroupName ${RG} \
--credDomainID "ACRCredDomain" \
--credDescription "Azure ACR credentials" \
--clientId ${APP_CLIENT_ID} \
--clientSecret ${APP_CLIENT_SECRET} \
--identityType user \
--identityId sasadm
If we look at this in more detail, the image below highlights the parameters that relate to the credentials domain definition, see lines 10 - 15.
For Azure, the ‘clientId’ and ‘clientSecret’ are for the Azure App Registration, which is used to authenticate to the Azure Container Registry (ACR). They are stored as part of the base64 credentials domain, so the values used must be base64 encoded.
For this I used the following commands to set the variables being used:
export APP_CLIENT_ID=$(echo -n "xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx" | base64)
export APP_CLIENT_SECRET=$(echo -n "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx" | base64)
(The ‘-n’ option prevents a trailing newline from being included in the encoded value.)
The command output confirms that the publishing destination has been successfully created. For example, running the command above produced the following output:
ID Name Destination Type Description
4372408b-4a01-42a8-99dd-f8848f3285ed testACR azure Test ACR
Once the credentials domain has been created it can be used by other publishing destination definitions. When referring to an existing credentials domain (which could have been created via the REST API or using the Viya CLI) you just need to specify the “--credDomainID” parameter; lines 11 to 15 are not required.
For example, the following creates an Azure publishing destination called 'myACR' using the 'ACRCredDomain' credentials domain.
./sas-viya --profile ${NS} models destination createAzure \
--name "myACR" \
--description "Azure publishing destination for Mike" \
--baseRepoURL ${ACR_SERVER} \
--subscriptionId ${SUBSCRIPTION} \
--tenantId ${TENANT} \
--region ${REGION} \
--kubernetesCluster ${AKS_NAME} \
--resourceGroupName ${RG} \
--credDomainID "ACRCredDomain"
Once the publishing destination has been created you can use the ‘list’ command to confirm the available destinations. For example:
./sas-viya --profile ${NS} models destination list
You can also get the details of the publishing destination using the ‘models destination show’ command. For example, in this case:
./sas-viya --profile ${NS} models destination show -n testACR
When using this command (models destination show) there is a known issue where the ‘models’ plugin assumes a destination type of CAS. A fix is planned, but currently no fix date is available. Therefore, the best approach is still to use Environment Manager to view the details of a publishing destination. This is shown in the following image.
Conclusion
I like this update, as you no longer have to work directly with the Viya REST API. Using the SAS Viya CLI is a much better approach and hides the complexity of working with the REST API.
Finally, as can be seen from Figure 1, it is also possible to update and delete the publishing destinations using the CLI.
I hope this is useful and thanks for reading.
- Find more articles tagged with:
- GEL
- SAS Model Manager
- SAS Viya CLI
10-19-2022
06:38 PM
2 Likes
I have recently had a number of conversations around ‘autoscalers’: Cluster Autoscalers and Horizontal Pod Autoscalers (HPA). There seems to be some misunderstanding about how these are used. So, I thought it would be a good time to think about them and what is supported with SAS Viya.
In this post we will discuss the difference between Cluster Autoscalers and Horizontal Pod Autoscalers. I will also look at what is required to define a HPA and discuss an example of using an HPA for SAS Micro Analytics Service.
But first, some definitions. A Cluster Autoscaler automatically adjusts (grows and shrinks) the size of the Kubernetes cluster (the number of underlying nodes) under the following conditions:
There are pods that fail to run due to insufficient resources (this does not necessarily mean that all nodes are maxed out, as the pod scheduling is controlled by many factors).
There are nodes in the cluster that are underutilized for a defined period, it may be possible for the pods to be placed on another node that meets the scheduling criteria.
Horizontal Pod Autoscalers, on the other hand, apply to the Kubernetes (K8s) pods, as the name suggests. In Kubernetes, a HorizontalPodAutoscaler automatically updates a workload resource (such as a Deployment or StatefulSet), with the aim of automatically scaling the workload (pods) to match demand. It defines the conditions for scaling the number of pod replicas up and down.
So, Cluster Autoscaler & Horizontal Pod Autoscaler are two independent features that do have a relationship when we think about the “elasticity” of the Kubernetes cluster, and the infrastructure costs (particularly if running on one of the Cloud Providers platforms). That is, the number of running pods and the HPA definitions can trigger the Cluster Autoscaler.
But what does this mean for SAS Viya?
If we look at the default deployment of SAS Viya, there is some redundancy, High Availability (HA) if you like, for the stateful services (Consul, RabbitMQ, Postgres, Cache Locator and Server), with multiple pod replicas being configured for these services. By default, the configurations for CAS (SMP is the default) and OpenSearch are not deployed with redundancy.
But all the Stateless services (including the web applications) have a single pod instance defined.
There is the Kubernetes transformer (enable-ha-transformer.yaml) that enables HA for the Stateless microservices. This provides two replicas for the Stateless microservice pods.
However, at this point in time, the Viya deployment doesn’t support deploying the microservices with an HPA definition that uses different values for the ‘Min’ and ‘Max’ number of pod replicas. This is because we do not set (define) the Kubernetes HPA behaviors. More research is required on all our microservices to understand their behaviors before doing this.
The SAS documentation states the following “By default, the Horizontal Pod Autoscaler (HPA) setting for all services is set to a replica of 1. If you want to scale up your services or pods to more than 1 replica, then the default HPA setting should be modified.”
To help you understand the SAS Viya deployment, below are a couple of handy commands. For example, to get the summary information for an HPA, in this case for MAS (sas-microanalytic-score), you can use the following command:
kubectl get hpa sas-microanalytic-score -n viya-namespace
You will see output similar to the following.
In the image you can see the ‘TARGETS’ field, it shows the current CPU utilization and the target utilization. You can also see that the MIN and MAX number of pods is set to 1, and there is only one MAS pod running.
To get more detailed information on the HPA you can use the ‘kubectl describe’ command. For example.
kubectl describe hpa sas-microanalytic-score -n viya-namespace
Below is the output for my SAS Viya deployment.
Here we can see that the CPU resource utilization is expressed as a ‘percentage of the pod requests’. Once again, we can see that the Min replicas and Max replicas is set to one (1).
So, when might we use an HPA?
At this point I can hear you say, “but I thought you just told us not to use custom HPA definitions!”
Well yes, but there might be a limited number of scenarios where this is useful. For example, workloads running in the SAS Micro Analytic Service (MAS) and Event Stream Processing (ESP).
Let’s explore workloads running on SAS Micro Analytic Service. The key thing to remember here is that all the models and decision flows published to MAS (maslocal) run in the same pod, unlike SAS Container Runtime where there is only one model or decision per container image.
This affects the resources (CPU and memory) that the MAS (sas-microanalytic-score) pods need to run. The number of models and decision flows published will also affect the start-up time for the sas-microanalytic-score pods and the workload that they are handling.
Hence, this could be a good candidate for defining an HPA. Especially when we think about handling bursts of transactions.
But that might be a too simplistic view, as when the models and decision flows are embedded within ‘real-time’ business processes, high availability could be the primary driver, closely followed by latency (performance). Therefore, to meet the HA requirements you might deploy multiple MAS replicas and need multiple nodes for this workload. Remember, by default both MAS and ESP are defined as Stateless services, so will run with all the other Stateless pods.
Which brings me back to the Cluster Autoscaler. Scaling the nodes is not instantaneous, it can take a few minutes to get a new node. This is another key concern when designing the Viya platform to support the real-time processing.
Another consideration is that MAS is not a standalone service, the sas-microanalytic-score pod(s) are dependent on other SAS Viya services. Therefore, the MAS (or real-time) HA requirements will, or can, drive the need for an HA configuration for the SAS Viya environment.
Writing an HorizontalPodAutoscaler
In a former life before joining SAS, when modelling IT systems, we had a rule of thumb that burst traffic could be up to 20 times the average transaction rate. Think of your favorite retailer or airline making a “must have” offer that drives unprecedented demand.
The use of an HPA for MAS or ESP could be a good way to handle such peaks.
But this does drive the need for a deeper understanding of the application pods, including its resource requirements and how long it takes to scale up and be ready.
You also need to decide on the metric (CPU or memory utilization) and the threshold that will trigger the HPA. This is all defined in the HPA ‘target:’ spec definition. You should also set the behaviors for the pod; these define the rules for scaling up and down and should be based on how long it takes for the pod to be ready to accept workload.
To put it simply, the HorizontalPodAutoscaler controller operates on the ratio between desired metric value and current metric value.
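From the Kubernetes documentation that this discussion is based on (see the references), the scaling algorithm can be expressed roughly as:
desiredReplicas = ceil[ currentReplicas * ( currentMetricValue / desiredMetricValue ) ]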
It is also important to understand that when a targetAverageValue or targetAverageUtilization is specified, the currentMetricValue is computed by taking the average of the given metric across all pods in the HorizontalPodAutoscaler's scale target (the workload the HPA refers to; see the MAS example below), with the resulting replica count bounded by the ‘minReplicas’ and ‘maxReplicas’ definitions.
When managing the scale of a group of replicas using the HorizontalPodAutoscaler, it is possible that the number of replicas keeps fluctuating frequently due to the dynamic nature of the metrics evaluated. This is sometimes referred to as thrashing, or flapping. This is where the HPA behaviors definition comes into play.
Let’s look at some examples
Note, the generic examples below have been taken from the Kubernetes documentation, see the references.
Any HPA target can be scaled based on the resource usage of the pods in the scaling target. When defining the pod specification, the resource requests for CPU and memory should be specified; these are used to determine the resource utilization and are used by the HPA controller to scale the target up or down. For example, to use resource utilization-based scaling, specify a metric source as follows:
type: Resource
resource:
  name: cpu
  target:
    type: Utilization
    averageUtilization: 60
With this definition (metric) the HPA controller will keep the average utilization of the pods in the scaling target at 60%. This is done by scaling up or down the number of pods within the bounds of the ‘minReplicas’ and ‘maxReplicas’ definitions.
Configuring scaling behavior
The ability to define behaviors was introduced with v2 of the HorizontalPodAutoscaler API. The behavior field is used to configure separate scale-up and scale-down behaviors. You specify these behaviors by setting scaleUp and / or scaleDown under the behavior field. Additionally, you can specify a stabilization window that prevents ‘flapping’ the replica count for a scaling target.
The following example shows defining a behavior for scaling down:
behavior:
  scaleDown:
    policies:
    - type: Pods
      value: 4
      periodSeconds: 60
    - type: Percent
      value: 10
      periodSeconds: 60
The periodSeconds indicates the length of time in the past for which the policy must hold true. The first policy (type: Pods) allows at most 4 replicas to be scaled down in one minute. The second policy (type: Percent) allows at most 10% of the current replicas to be scaled down in one minute.
When you define multiple policies like this, by default the policy that allows the greatest amount of change is selected. In this example, the second policy will only be used when the number of pod replicas is more than 40. This is because the second policy specifies 10% of the running pods, and that value is only greater than 4 when there are more than 40 pod replicas.
Setting the Stabilization windows
As previously stated, the stabilization window is used to restrict the ‘flapping’ of the replica count, when the metrics used for scaling keeps fluctuating. Hence, the stabilization window is used to avoid unwanted changes.
For example, the following snippet shows specifying a scale down stabilization window. In this example, all desired states from the past 5 minutes will be considered.
behavior:
  scaleDown:
    stabilizationWindowSeconds: 300
Pulling this all together, here is a possible example for MAS…
Please note this isn’t a full worked example, which is another way of saying I haven’t tested it. 😊 Perhaps the HPA for MAS might look something like the following.
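Based on the goals unpacked below, a sketch of such an HPA definition might look like the following. This is untested; the API version, metadata, and the assumption that the scale target is the ‘sas-microanalytic-score’ Deployment are for illustration only.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: sas-microanalytic-score
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: sas-microanalytic-score
  minReplicas: 2
  maxReplicas: 6
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 60
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 0
      policies:
      - type: Pods
        value: 1
        periodSeconds: 30
    scaleDown:
      selectPolicy: Min
      policies:
      - type: Pods
        value: 2
        periodSeconds: 60
      - type: Percent
        value: 50
        periodSeconds: 60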
Let’s unpack this a little to see what I was trying to achieve:
I want a minimum of 2 MAS pods for HA reasons, but no more than 6 pods.
I should trigger the scale event on 60% CPU utilization.
There is no stabilization window for scaleUp events, but only one pod is added every 30 seconds. (You need to understand your environment to determine how long it takes for a MAS pod to be fully ready to receive workload.)
I have specified two scaleDown policies and the minimum of the two should be used. The first is that no more than 2 pods can be removed in a 60 second period and the second is that 50% of the pods can be removed in a 60 second period.
In all reality I would probably just specify a single scaleDown policy with such a small number of replicas. But I wanted to show an example of using two policies.
Hopefully this example highlights the need to understand the MAS workload and how long the MAS pods take to start. Remember, this will depend on the number of models that have been published.
Conclusion
In this post we have only just scratched the surface of understanding HPAs; it is a truly complex subject. But I hope I have highlighted the need for a deep understanding of Kubernetes and of how your applications run (behave) in order to properly specify an HPA.
While the MAS example defines the HPA based on utilization, it is also possible to set the target based on a value, for example, the number of milli-cores or cores used for CPU, or the amount of memory used.
Finally, I would recommend load testing to fine tune the HPA definition.
I hope this is useful and thanks for reading.
References
Kubernetes documentation: Horizontal Pod Autoscaling
The high-level description and examples above are based on this Kubernetes documentation.
Find more articles from SAS Global Enablement and Learning here.
- Find more articles tagged with:
- GEL
- Kubernetes
- Real-time
08-04-2022
06:13 PM
2 Likes
While SAS Analytics Pro doesn’t ship with the SAS Cloud Analytic Services (CAS) server, it is possible for Analytics Pro to use CAS. In this blog we will explore the configuration required to use a CAS Server from Analytics Pro. To enable the connectivity, configuration is required for both SAS Viya and SAS Analytics Pro.
Before we get into the configuration of SAS Analytics Pro, I want to take a moment to remind you that as of April 2022 (Stable 2021.2.6) there are two versions of Analytics Pro: SAS Analytics Pro and SAS Analytics Pro Advanced Programming. SAS Analytics Pro Advanced Programming contains the same SAS Foundation components as SAS Analytics Pro plus the following additional components:
SAS/IML
SAS/OR
SAS/QC, and
SAS/ETS.
Configuring access to CAS
To connect to a CAS Server from Analytics Pro, Analytics Pro must trust the certificate being used by the CAS Server. Along with the CA certificate, the SAS Viya platform needs to be configured to allow access to the CAS Server.
We will start by looking at the SAS Viya configuration.
If you want to connect to CAS in SAS Viya 4 from clients such as SAS Viya 3.5, SAS 9.4, or with open programming clients such as Python, R, and Java, you need to enable the binary CAS communication. In this case the connection is from Analytics Pro.
With SAS Viya now running on Kubernetes, the external connectivity requires additional configuration to enable external connections to CAS (connections from outside of the Kubernetes cluster).
For SAS Viya 4, you enable the CAS connectivity by including the cas-enable-external-services.yaml transformer, which is added to the transformers section of the 'kustomization.yaml' file.
The patch transformer is required to expose the CAS client connectivity ports. This can be done using either a NodePort configuration or a LoadBalancer configuration. For instructions on how to implement this see the SAS Viya Administration manual, Configure External Access to CAS.
You need to copy the example from the sas-bases directory ($deploy/sas-bases/examples/cas/configure/cas-enable-external-services.yaml) to your deployment’s site-config directory.
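For example, a sketch of the copy step might be the following (the target sub-folder under site-config is your choice; this assumes the file is placed directly under site-config):
cp $deploy/sas-bases/examples/cas/configure/cas-enable-external-services.yaml \
   $deploy/site-config/cas-enable-external-services.yaml
Then reference the copied file in the transformers block of the kustomization.yaml:
- site-config/cas-enable-external-services.yaml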
The default configuration defines the services as NodePorts; this is shown in the image below. If the SAS Viya platform is running in the cloud, on one of the Cloud Provider platforms (AWS EKS, Azure AKS or Google GKE), using a LoadBalancer configuration is the recommended approach.
The patch transformer (cas-enable-external-services.yaml) contains example configurations to enable the LoadBalancer configuration, see the highlighted code block below.
You need to uncomment the service spec to define the service type of Loadbalancer.
As part of updating the patch transformer, you should specify the allowed source CIDR ranges to secure the connections to the CAS server, see lines 32 to 34. This is used to define the firewall rules, such as the Azure Network Security Group (NSG) rules.
Note, some cloud providers may require additional configuration. For example, adding metadata annotations.
After the patch transformer has been applied, the following command can be used to get the port mappings for the CAS Server. You will need the port information for the 'sas-cas-server-default-bin' port to connect to CAS.
kubectl -n viya-namespace get svc | grep sas-cas-server
The output for a NodePort configuration will look similar to the following (using egrep to format the output).
And looks like the following for a LoadBalancer configuration. The public IP addresses for the LoadBalancer service are shown in the yellow box.
As can be seen, the LoadBalancer service has two public (External) IP addresses.
You would, or could, use a NodePort configuration when running on-premises, for example when using Red Hat OpenShift or the ‘Open Source Kubernetes’ support. However, it is important to note that when using a NodePort, the port mapping is exposed on ALL nodes within the cluster. Also, the IP address and port mapping are not static; if you redeployed SAS Viya, you would get a new mapping.
Hence, the LoadBalancer configuration is a better approach, as it exposes a single public IP address for each service. However, the public IP address is not static and will change each time you do a SAS Viya deployment. Therefore, it is best to assign a DNS alias to the public IP address so that end users have an unchanging reference, but you’ll still need to keep that DNS alias up to date if the load balancer changes at some point in the future.
A customer could use their own DNS, or if running in one of the Cloud Providers a DNS name can be set against the LoadBalancer resource. For example, in Azure you can use the following to assign a DNS name to the public IP address of the binary communication endpoint for CAS.
node_res_group=MC_xxxxxxxxx_xxxxxxxxxx_xxxxxx
SUBSCRIPTION=xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx
NS=viya-namespace
# Get LoadBalancer External IP
CASBIN_LBIP=$(kubectl get service -n ${NS} | grep sas-cas-server-default-bin | awk '{print $4}')
# Get the Public IP resource name
PublicIPName=$(az network public-ip list --subscription ${SUBSCRIPTION} --out table | grep ${CASBIN_LBIP} | awk '{print $1}')
# Get the ID for the Public IP
PublicIPID=$(az network public-ip show -g ${node_res_group} -n ${PublicIPName} --query "id" -o tsv)
# Create the DNS name
az network public-ip update \
-g ${node_res_group} \
--ids $PublicIPID --dns-name cas-bin-${NS}
For my environment, this gives a DNS name of ‘cas-bin-test.eastus.cloudapp.azure.com’ when using the viya namespace name (test) as part of the name prefix. You can check the DNS name assignment in the Azure Portal, see the screenshot below.
Note, the prefix must be unique so you could also use the resource group name as part of the DNS name. This would give ‘cas-bin-resource-group.eastus.cloudapp.azure.com’ as the name.
Now you have a logical name to use when connecting to the CAS Server.
Similarly, the following code would set the DNS name for the public IP address of the http communication endpoint for CAS.
node_res_group=MC_xxxxxxxxx_xxxxxxxxxx_xxxxxx
SUBSCRIPTION=xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx
NS=viya-namespace
# Get LoadBalancer External IP
CASHTTP_LBIP=$(kubectl get service -n ${NS} | grep sas-cas-server-default-http | awk '{print $4}')
PublicIPName=$(az network public-ip list --subscription ${SUBSCRIPTION} --out table | grep ${CASHTTP_LBIP} | awk '{print $1}')
PublicIPID=$(az network public-ip show -g ${node_res_group} -n ${PublicIPName} --query "id" -o tsv)
az network public-ip update \
-g ${node_res_group} \
--ids $PublicIPID --dns-name cas-http-${NS}
Before we look at the Analytics Pro configuration, let’s look in more detail at the resources that are created. In Azure, the public IP addresses are defined in a specific resource group, starting with “MC_”. For example, the two highlighted in the screenshot are for the CAS services. The third public IP address shown in the image was created to support the ingress access to the Viya web applications.
Looking at the Kubernetes Load balancer, you will see all the public IP addresses. Again, the highlighted ones are for the CAS services.
Finally, if you look at the NSG, you will see the rules that were created. You can see the rules for ports 5570 and 8777 using the source CIDR addresses that were defined in the patch transformer.
The screenshots confirm that the patch transformer has successfully configured the AKS cluster for my environment.
SAS Analytics Pro configuration
Now that SAS Viya has been configured to allow client access, or in this case access from Analytics Pro, the next task is to configure Analytics Pro.
As stated earlier, Analytics Pro needs to trust the certificate being used by the CAS Server. The first step is to obtain the SAS Viya Root CA (Certificate Authority), then you need to include it in the Analytics Pro trusted certificates, in the 'trustedcerts.pem' file. You need a running instance of Analytics Pro to do this.
Step 1. Obtaining the SAS Viya Root CA
The steps to get the CA certificate are detailed in the SAS Viya Administration manual, see: Obtain the Truststore Files or the SAS Viya Generated Root CA Certificate. The following command can be used to retrieve the SAS Viya CA certificate value from the secret:
kubectl -n viya-namespace get secret sas-viya-ca-certificate-secret -o=jsonpath="{.data.ca\.crt}"|base64 -d
Note, you need to run the command above from a client that has been configured to connect to the Kubernetes cluster (i.e., has the kubectl configuration for the cluster where SAS Viya is running).
You can pipe the output of the command to a file to save the certificate – ideally to a local directory corresponding to the mounted volume inside the sas-analytics-pro container referred to as “/sasinside”.
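For example, the following saves the certificate using the file name referenced later in this post (assuming ‘/opt/sasinside’ is the local directory mounted as ‘/sasinside’ in the container):
kubectl -n viya-namespace get secret sas-viya-ca-certificate-secret \
  -o=jsonpath="{.data.ca\.crt}" | base64 -d > /opt/sasinside/ca_cert.pem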
Step 2. Update Analytics Pro to trust the Viya certificate
Now, the Viya CA certificate needs to be included in the Analytics Pro trusted certificates, in the 'trustedcerts.pem' file. This file is found in the sas-analytics-pro container in the /opt/sas/viya/config/etc/SASSecurityCertificateFramework/cacerts/ directory.
To include the Viya certificate you need a running Analytics Pro environment. The easiest way to do this is to ‘exec’ into the Analytics Pro container. You can use the following command to append the certificate information. The example assumes that the Viya CA is stored in the Analytics Pro configuration ‘sasinside’ folder.
docker exec -u=root -it sas-analytics-pro bash \
-c "cat /sasinside/ca_cert.pem >> /opt/sas/viya/config/etc/SASSecurityCertificateFramework/cacerts/trustedcerts.pem"
If you want to confirm that the 'trustedcerts.pem' file has been updated, use the following command.
docker exec -u=root -it sas-analytics-pro bash \
 -c "cat /opt/sas/viya/config/etc/SASSecurityCertificateFramework/cacerts/trustedcerts.pem"
Note, the docker commands assume that the Analytics Pro container is called ‘sas-analytics-pro’.
It is possible to write a script to get the Viya certificate and update Analytics Pro as part of launching Analytics Pro.
Now that Analytics Pro has been updated, it can handle encrypted communication with the CAS server in your Kubernetes cluster. The last part of the configuration is to tell Analytics Pro where to find CAS with the following connection string.
options cashost='dns_alias_for_cas' casport=port-number authinfo='~/.authinfo';
For information on creating a ‘authinfo’ file see: SAS Help Center: Client Authentication Using an Authinfo File
For testing I used one of our GEL environments.
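Using the DNS alias created earlier, a minimal test might look like the following. The host name is from my environment, 5570 is the default CAS binary port exposed by the LoadBalancer service, and the session name ‘test’ is just an example.
options cashost='cas-bin-test.eastus.cloudapp.azure.com' casport=5570 authinfo='~/.authinfo';
cas test;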
You should see the following:
You have successfully started a session called 'TEST'.
The session is using 3 workers, and
You will see that the client user is 'sastest2', this was defined in the authinfo file.
I hope this is useful and thanks for reading.
References
SAS Help Center: Welcome to SAS Analytics Pro SAS Help Center: Client Authentication Using an Authinfo File
Find more articles from SAS Global Enablement and Learning here.
- Find more articles tagged with:
- GEL
07-28-2022
05:47 PM
4 Likes
With the release of Stable 2021.2.6 there were some changes that will affect your deployment topology, workload placement plan, and node selection. SAS has recently changed the default set of workload classes, so now the CONNECT workload class is optional and requires additional steps to be enabled. In this post we will discuss the new topology and when you still may need to implement the CONNECT workload class. An additional change that will affect your node selection is that the use of GPUs is now supported with some Compute processing, not just the CAS Server.
CONNECT Workload Class Changes
With Stable 2021.2.6 the default workload classes have changed, the “connect” workload class has been removed from the default configuration.
This is to reflect that when the SAS/CONNECT Spawner is supporting connections from a SAS 9.4M7, Viya 3.5 or another Viya 4 system (client), the Spawner is performing purely as a service, it is not running any of the remote workload.
This change affects the sas-connect-spawner Deployment definition. All references to the connect workload class have been removed from the sas-connect-spawner Deployment definition (this includes the labels, nodeAffinity and tolerations) and have been replaced with the “stateless” workload class. The result is that the SAS/CONNECT Spawner will now be scheduled on “stateless” nodes by default.
To illustrate the changes, I ran the ‘icdiff’ command to show the differences between Stable 2021.2.5 and 2021.2.6 for the sas-connect-spawner deployment.
In (1) you will see the label change to categorize the Spawner as a stateless service, applying the stateless workload class label, (2) shows the node affinity for the stateless nodes, and (3) shows the update to the pod tolerations. As with the other stateless services, the Spawner pod has a toleration for both the stateful and stateless taints.
Stepping back from the yaml changes. Let’s take a moment to discuss SAS/CONNECT and the different session types that are supported, and how this can affect your deployment topology (the number of node pools).
The following description is from the SAS Viya Programming Documentation. “SAS/CONNECT software is a SAS client/server toolset that provides the ability to manage, access, and process data in a distributed and parallel SAS environment. As a client/server application, SAS/CONNECT links a SAS client session to a SAS server (SAS/CONNECT Server) session.”
The SAS/CONNECT Spawner is a SAS Viya service that launches processes on behalf of SAS/CONNECT clients. The client processes can be launched in their own pods (referred to as “dynamically launched pods”) or in the SAS/CONNECT Spawner pod (in this mode, the Spawner pod supports the sessions from multiple clients).
When the client process is launched in its own pod, the “dynamically launched pod”, the new pod is started using a Kubernetes PodTemplate (sas-connect-pod-template) and runs on the Compute nodes by default. The dynamically launched pod contains the SAS/CONNECT Server for that client session.
In the second case, when the client process is launched in the SAS/CONNECT Spawner pod, the SAS/CONNECT Server process is running in the Spawner pod, and the Spawner pod may be supporting multiple client sessions. We could call this “legacy” mode, it is how the legacy clients are supported.
Note, clients from SAS 9.4M6 and earlier releases, and SAS Viya 3.4 and earlier, do NOT support dynamically launched pods. So, by default their processes are launched in the SAS/CONNECT Spawner pod. They are the SAS/CONNECT “legacy clients”.
This begs the question “When do I need a node pool dedicated to the SAS/CONNECT workload”?
I have created a decision flow to help answer this question, see later in this post.
From a resource consumption perspective, the dynamically launched pods are similar to the SAS Compute Server workload, and as previously stated, the launched pods run on the Compute nodes by default.
However, when the SAS/CONNECT Spawner pod is running multiple client sessions it can consume significant resources. Therefore, much like the CAS pods, the SAS/CONNECT Spawner pod should be assigned a dedicated Kubernetes node and should be configured with a guaranteed Quality of Service (QoS).
Hence, if you do not have any legacy client sessions, the SAS/CONNECT Spawner can happily run as a “stateless service”. To support this, as of Stable 2021.2.6, the SAS/CONNECT Spawner is deployed in the stateless workload class by default. This means that implementing the connect workload class is ONLY required, or recommended, if you are supporting the legacy clients.
To implement, enable, the ‘connect’ workload class there are two new transformers:
enable-spawned-servers.yaml
use-connect-workload-class.yaml
Along with applying the patch transformers, you also must create the ‘connect’ node pool and label and taint its nodes for the ‘connect’ workload class (see the sketch below). This is what I have called the “old topology” in the decision flow.
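For example, on a generic Kubernetes cluster the label and taint might be applied as follows. The node name is a placeholder; on a cloud provider you would normally define the label and taint on the node pool itself so that new nodes inherit them.
kubectl label nodes <connect-node-name> workload.sas.com/class=connect
kubectl taint nodes <connect-node-name> workload.sas.com/class=connect:NoSchedule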
GPU support for SAS Compute
The other change that I would like to briefly touch on is that the SAS Programming Environment container can now make use of GPUs, can use the SAS GPU reservation service. Prior to Stable 2021.2.6, the GPU reservation service was only used by the CAS Server.
The update extends support to SAS IML workloads (PROC IML) running on the Compute Server. For a complete list of the GPU support see the Offerings and Action Sets that Support GPU Capabilities section in the System Requirements for SAS Viya.
It is important to note that GPU support (for CAS or Compute) is not available when running on Red Hat OpenShift. Also see the following blog by Raphaël Poumarede, Add a CAS “GPU-enabled” Node pool to boost your SAS Viya Analytics Platform!
Topology decision flow
Even prior to this change, it wasn’t mandatory to implement a dedicated node pool for the connect workload, it was possible to use one of the other node pools for the CONNECT Spawner pod. However, depending on the tainting of the nodes this may have needed a custom configuration for the sas-connect-spawner Deployment.
For example, you might do this if the CONNECT workload is quite light and the Spawner pod is only supporting a small number of sessions. I’m sorry I can’t give you a formula to help you determine when a dedicated node pool is required; I would see this as part of regular capacity planning. Monitor the performance and resource usage and scale out to using a dedicated CONNECT node when needed.
As discussed above, the Compute nodes are a good fit for the launched pods, they are just another type of compute session.
However, there are cases where you might still want to implement a ‘connect’ node pool to isolate the connect processing. For example, with the change in support for GPUs, the Compute nodes could be GPU enabled, but this is not required for the CONNECT sessions. Therefore, to optimize costs you might want to change the default configuration to use a different node type for the CONNECT workload.
Below is a decision flow to help with the assessment of whether the ‘connect’ workload class and a dedicated ‘connect’ node pool need to be implemented.
The most likely paths through the decision flow are shown as ‘A’ (the blue path) and ‘B’ (the green path). I would hope that most customers will use the default configuration and are not supporting the legacy clients, they will be using the green path (B).
Conclusion
The good news is with Stable 2021.2.6, if there are no legacy clients, there is no need to have a dedicated node pool for SAS/CONNECT, by default there is no ‘connect’ workload class. The SAS/CONNECT Spawner follows the Viya architecture pattern and works as a stateless service.
However, a key thing to remember is that this change will NOT be available in LTS 2022.1 (May); customers will have to wait until LTS 2022.2 (November). In the meantime, sites on the LTS cadence will continue to require a custom configuration to implement this.
Similarly, the new GPU support will not be available in the LTS cadence until LTS 2022.2.
Finally, with the recent changes it makes it easier to “grow” or “shrink” the topology. For example, start with three node pools and grow to four, or five (to separate the Stateful and Stateless services) when needed.
I hope this is useful and thanks for reading.
Find more articles from SAS Global Enablement and Learning here.
- Find more articles tagged with:
- GEL
03-29-2022
04:18 PM
2 Likes
In previous posts I have talked about creating a Workload Placement Plan or Strategy. One of the benefits of running in the cloud is that the Cloud Providers offer elastic infrastructure. In Kubernetes terms, this equates to node pools that can scale from zero nodes to a maximum number of nodes. But if you are using node pools that can auto-scale (scale to zero nodes) you might get some unexpected results.
I was recently testing a deployment in Azure using the SAS Viya Infrastructure as Code (IaC) GitHub project, using the minimal pattern with two node pools where both could scale to zero nodes. When I deployed SAS Viya all the pods ended up running in a single node pool! This wasn’t what I was after.
So, what went wrong with my CAS workload placement?
Let’s have a look at why this happened.
I built my Azure Kubernetes Service (AKS) cluster using the minimal IaC example, see here. Which provides a System node pool, plus a node pool called ‘generic’ and one called ‘cas’. As the names might suggest, the ‘cas’ node pool was to be dedicated to running the SAS Cloud Analytic Services (CAS) pods and the ‘generic’ node pool was for everything else.
Both the ‘generic’ and ‘cas’ node pools could auto-scale to zero nodes, which meant that when I built the cluster it only had the system node pool active, with one node running. For the cas node pool, the nodes had the CAS workload label and taint (workload.sas.com/class=cas) applied. The generic node pool didn’t have any taint, but had the following labels applied:
workload.sas.com/class=compute
launcher.sas.com/prepullImage=sas-programming-environment
These labels are used as part of the pre-pull process for the 'sas-programming-environment' pods. As this was a test environment, my first deployment used an SMP CAS server (with a default SAS Viya configuration) without the CAS auto-resources transformer. After seeing that all the pods, including the CAS pods, ended up running on the generic nodes, I did a second deployment using an MPP CAS server to confirm what I was seeing. This is shown in Figure 1.
Figure 1. IaC minimal sample without CAS auto-resources
The default SAS Viya configuration uses preferred node affinity, see the Kubernetes documentation Assigning Pods to Nodes. Hence, I could have labeled the figure as “Preferred node affinity without CAS auto-resources”.
As you can see, all the CAS pods are running on the generic nodes, and I ended up with three workers running on the same node (aks-generic-xxxxxx-vmss00000g). Having the three workers on the same node was understandable (as I was not using the CAS auto-resources), but why didn’t the CAS nodes get used?
The answer to this question lies in the default pod configuration, which uses preferred scheduling (preferredDuringSchedulingIgnoredDuringExecution) for the node affinity, combined with the node pool configuration and its state at the time of deploying SAS Viya. Let’s explore what I mean by that.
Both node pools could autoscale to zero, which can occur when the minimum node count for a node pool is set to zero. Therefore, part of the answer lies in the SAS Viya start-up sequence and the SAS Viya default configuration.
That is, some of the first objects to start are the stateful and stateless pods. This meant that by the time the CASDeployment operator went to start the CAS controller and worker pods (or the SMP CAS server), there were already generic nodes available.
This is where the preferred node affinity comes into play. It is only a preference that the cas nodes are used; if there aren’t any cas nodes available, another choice is evaluated. Instead of spinning up a new cas node, the pods were started on the generic nodes because they didn’t have any taint applied.
Hence, it is a combination of these three factors (the untainted generic nodes, zero cas nodes and preferred affinity) that led to this situation. If you create node pools with a non-zero number of nodes, you may never see this behavior.
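To make the preferred affinity concrete, the Kubernetes pattern looks roughly like the following. This is an illustrative sketch of preferred node affinity for the CAS workload label, not an extract from the SAS deployment assets.
affinity:
  nodeAffinity:
    preferredDuringSchedulingIgnoredDuringExecution:
    - weight: 100
      preference:
        matchExpressions:
        - key: workload.sas.com/class
          operator: In
          values:
          - cas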
Finally, I should state that Kubernetes doesn’t have the concept of a node pool, just nodes, a node pool is a construct developed by the Cloud providers, in their implementation of Kubernetes. This is how they provide elasticity for the Kubernetes node infrastructure.
Simple! Maybe we should look at some more examples to explain what is happening.
Using CAS auto-resources
At this point you might think: I know how to fix this, I just need to use the CAS auto-resources transformer. The CAS auto-resources transformer automatically adjusts the resource limits and requests for the CAS pods (controller and workers), moving them from a ‘Burstable’ Quality of Service (QoS) to a ‘Guaranteed’ QoS, with values of approximately 86 percent of the available resources (memory and CPU) of the first node found with the “CAS” label.
This might be a simplification of what the CAS Operator is doing but is the “out-of-the-box” node affinity behavior.
Enabling the auto-resources does two things for us: firstly, it ensures that there is only one CAS pod per node, and secondly, it adjusts the resources for the CAS pods without you (the SAS administrator) having to calculate and set a value. The resources are set based on the size of the nodes.
If you are familiar with the SAS Viya 3.x deployment, using the CAS auto-resources allows you to have the same topology as using the CAS host group with SAS Viya 3.x. Using the CAS auto-resources (along with the CAS workload taint) allows you to have nodes dedicated to running the CAS Server.
So, what happened with this configuration?
Figure 2. Preferred node affinity with CAS auto-resources
In Figure 2, you can see that now the CAS Controller and Worker pods are all running on separate nodes, but still in the generic node pool. I should also state that there may have been other pods running on those nodes along with the CAS pods. I didn’t check, but Kubernetes could still schedule other pods to those nodes, depending on their resource requests. Remember, the generic nodes did not have any taint applied.
So, better, but not perfect.
Using Required nodeAffinity
The deployment assets provide an overlay to change the CAS node affinity from preferred scheduling to use required scheduling (requiredDuringSchedulingIgnoredDuringExecution) for the Node Affinity. The overlay is called require-cas-label.yaml. It is located under the sas-bases folder: sas-bases/overlays/cas-server/require-cas-label.yaml
Using this overlay means that the CAS pods will only run on nodes that have the ‘workload.sas.com/class=cas’ label. Therefore, you need to ensure that there are sufficient nodes available to run all the CAS pods. Otherwise, some of the CAS pods will not be able to run.
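Referencing the overlay in the kustomization.yaml might look like the following abridged snippet. Only the require-cas-label.yaml path is taken from above; the other entries are placeholders for whatever transformers your deployment already includes.
transformers:
# ... existing transformers, including the CAS auto-resources entries ...
- sas-bases/overlays/cas-server/require-cas-label.yaml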
At this point my kustomization.yaml has the definition for using the CAS auto-resources and it also includes the require-cas-label.yaml overlay. Figure 3 shows the results of using the two transformers.
Figure 3. Required node affinity with CAS auto-resources
As you can now see, the CAS Controller and Worker pods are all now running on the cas nodes, with one CAS pod per node. This is what I wanted, the cas node pool is now being used. 😊
Just to round out the discussion, if you wondered what the deployment would look like if I used the required node affinity without enabling the CAS auto-resources, this is shown in Figure 4.
Figure 4. Required node affinity without CAS auto-resources
As can be seen, the CAS Controller and Worker pods are now all running on a single cas node (aks-cas-xxxxxx-vms000001).
Conclusion
Coming back to my question “What went wrong with my CAS workload placement?”
The short answer was nothing, Kubernetes did exactly what it was told to do!
The pod scheduling rules in Kubernetes are complex and many different conditions can affect where a pod is started, what node will be used. In this post, we have discussed node affinity and taints, but there is also node anti-affinity and pod anti-affinity that will affect where a pod runs.
Using the CAS auto-resources transformer enables you to set the CAS pod resources based on the size of the nodes being used, and it configures the pods to run with a Guaranteed QoS. I would expect that most, if not all, production deployments will use the CAS auto-resources configuration, unless the environment makes little use of the CAS Server.
Remember, running the CAS pods on the same node defeats the benefits of using an MPP CAS server, namely fault tolerance, scalability, and performance. Therefore, I would always recommend using the CAS auto-resources.
However, there is one possible scenario for not using the auto-resources: doing so requires the CASDeployment Operator to have a ClusterRole with "get" access on the Kubernetes nodes. This role gives the Operator the ability to inspect the nodes and see their resources.
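For context, the kind of cluster-scoped access involved looks roughly like the sketch below. The role name is hypothetical and the actual definition ships with the deployment assets; this is only to illustrate the scope of access that some IT standards push back on.

# Illustrative only: the style of ClusterRole the CASDeployment Operator needs for
# auto-resources. The name is hypothetical; the real role is provided with the
# deployment assets.
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: cas-node-reader   # hypothetical name
rules:
  - apiGroups: [""]
    resources: ["nodes"]
    verbs: ["get"]   # read access to node capacity (CPU and memory)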
If your organization's IT (Kubernetes) standards do not allow this role assignment, so that it is not possible to grant the ClusterRole access, then you should do the following:
Manually calculate the resources needed and use the 'cas-manage-cpu-and-memory.yaml' transformer to set them (a hedged sketch follows after this list), and
Enable required node affinity with the ‘require-cas-label.yaml’ transformer.
See the deployment documentation, Configure CAS Settings - Adjust RAM and CPU Resources for CAS Servers, for the full procedure. But that's a story for another article.
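For illustration only, below is a hedged sketch of that style of transformer. The patch path, target group, and resource values are assumptions based on the general CASDeployment structure; treat the example file shipped in your sas-bases as the authoritative template and adjust the values to your node size.

# Illustrative sketch only: manually set the CAS resources when auto-resources
# cannot be used. The path, target group and values are assumptions; start from
# the example file in your sas-bases.
---
apiVersion: builtin
kind: PatchTransformer
metadata:
  name: cas-manage-cpu-and-memory
patch: |-
  - op: add
    path: /spec/controllerTemplate/spec/containers/0/resources
    value:
      requests:
        memory: 96Gi
        cpu: "14"
      limits:
        memory: 96Gi
        cpu: "14"
target:
  group: viya.sas.com
  kind: CASDeployment
  name: .*

Setting the requests equal to the limits keeps the CAS pods in the Guaranteed QoS class, mirroring what the auto-resources transformer would have done.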
To summarize the key takeaways…
When using preferred scheduling, the CAS pods may end up on other nodes when the preferred (cas) nodes are not available.
Using CAS auto-resources with preferred scheduling does NOT guarantee that the cas nodes will be used. However, they will be used if available.
Use the require-cas-label.yaml transformer to implement required (strict) node affinity, especially if there are un-tainted nodes in the cluster.
This forces the CAS pods to use only the CAS node pool nodes, that is, the nodes with the CAS workload class label (workload.sas.com/class=cas). It will also trigger the cas node pool to scale, if possible.
But you need to ensure that there are sufficient nodes available to run all the CAS pods. Otherwise, some pods may end up in a pending state.
Finally, you might have noticed that each screenshot shows a different set of node names. This is because between each test I deleted the SAS Viya namespace and waited for the AKS cluster to scale down to only a system node. In AKS, it appears that when a node is stopped its name is treated as used, so a new node is started with the next name in the sequence. Hence, you can see that I ran the test in Figure 4 before the test in Figure 3. I hope this is useful and thanks for reading.
References
The SAS Viya Infrastructure as Code (IaC) projects are available for AWS, Google Cloud Platform (GCP), and Microsoft Azure.
SAS Viya 4 Infrastructure as Code (IaC) for Amazon Web Services (AWS)
SAS Viya 4 Infrastructure as Code (IaC) for Google Cloud Platform (GCP)
SAS Viya 4 Infrastructure as Code (IaC) for Microsoft Azure
Find more articles from SAS Global Enablement and Learning here.
03-17-2022
02:11 PM
In this article I would like to discuss simplified deployment patterns and share some videos that I have previously published. In two previous articles I discussed SAS Viya deployment topologies: see Creating custom SAS Viya topologies – realizing the workload placement plan and Creating custom SAS Viya topologies – Part 2 (using custom node pools for the compute pods).
In this post I want to discuss two alternatives to using the default approach which employs five node pools: cas, stateless, stateful, compute and connect.
So, will a simplified deployment topology lower or reduce the infrastructure costs?
I could drive you all crazy by just saying “IT DEPENDS”! But I think this warrants a deeper look.
There are many non-functional requirements that can have an impact on the infrastructure requirements, for example performance, availability, and security. Of these, the performance and availability requirements are architecturally significant: they can have a significant impact on the infrastructure costs. So, it is important that we understand the requirements.
Understand the requirements
There is an adage in computing that the last 'nine' of availability you implement will be the most expensive IT spend. Therefore, understanding your organization's Recovery Time Objectives (RTO) and Recovery Point Objectives (RPO) is key to defining the right approach, getting the right deployment design, and avoiding unnecessary infrastructure costs.
When we think about performance, it is important to understand that not every organization or business process needs blinding performance, a "super car" (a McLaren or Ferrari). The reality is that organizations are after "acceptable performance within the constraints", and the biggest constraint is usually the budget (cost, $$$).
So, it is critical that you understand the availability and performance requirements to define the right deployment topology. I'm sorry if I'm preaching to the converted! 🙂
Comparing SAS Viya 3.x and SAS Viya 4 deployments
Before we get into the details of the SAS Viya topologies, let’s take a moment to set a baseline for the discussion. Below is a simple analogy between SAS Viya 3.x nodes (servers) and Kubernetes node pools.
A node pool can be compared to a single machine (physical or virtual) or multi-machine host group in a Bare OS SAS Viya 3.x deployment.
In a SAS Viya 3.x Bare OS deployment you could use one machine for the SAS Cloud Analytic Services (CAS) server and one machine for the rest of the SAS Viya services: a 2-machine (server) deployment. But larger deployments would have multiple servers. For example, you might have 5 machines for MPP CAS, 2 machines for the programming run-time, and 3 machines to provide high availability for the infrastructure servers and microservices, giving a total of 10 machines.
It is similar with the node pools; you can have just 2 node pools (CAS and general) or use the 5 node pool option to provide different node types for each type of SAS Viya workload.
However, the key difference between a node pool and the machines in a Viya 3.x deployment is that a node pool is a scalable template for VM instances. You define a node template (instance type, label, taint, with or without GPU, and so on) and then make it scalable by defining a minimum and maximum number of nodes (VM instances) in the node pool.
So, even if you start with a 2 node pool topology, it can still be scaled in terms of the number of nodes (if needed). But remember, all nodes within a node pool have the same specification and attributes (instance type, storage mounts, and Kubernetes labels and taints).
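As a concrete illustration, every node created from a 'cas' node pool would carry the same workload class label and taint. In the sketch below the label and taint follow the standard workload.sas.com/class convention, while the node name and the rest of the object are hypothetical.

# Illustrative only: the label and taint that every node in a 'cas' node pool carries.
# The node name is hypothetical.
apiVersion: v1
kind: Node
metadata:
  name: aks-cas-12345678-vmss000000
  labels:
    workload.sas.com/class: cas
spec:
  taints:
    - key: workload.sas.com/class
      value: cas
      effect: NoSchedule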
Using simplified topologies
Coming back to the question, will a simplified deployment topology lower or reduce the infrastructure costs?
It can, but you can't just think about the node pools in isolation, as one of the key benefits of using multiple node pools is that the compute instance types can be optimized for the processing needs and workload. So, the question could be rephrased as
“How many node pools should I have”?
But the question could also be “How many nodes do I need in the node pool”?
In the video I discuss two deployment patterns:
Using two node pools for the SAS Viya components (pods), and
Using three node pools for the SAS Viya components (pods).
In both patterns there is still the default or system node pool for non-Viya services, for example the ingress controller, cert-manager, or the monitoring and logging components. The examples show the use of an SMP CAS Server, but this could also be an MPP CAS Server.
Pattern 1: Using two node pools
This deployment pattern has a ‘General’ node pool, which is for everything other than CAS, and a CAS node pool.
This pattern would be a good choice for smaller environments or environments with a small programming (SAS Compute Server) workload, or where there isn’t the need to dedicate node(s) to the programming workload. This is shown in the image below.
Pattern 2: Using three node pools
The three-node pool deployment pattern provides dedicated resources for both the CAS and programming workloads. The three node pools are:
Services node pool – for the Stateless and Stateful services (application pods)
Compute node pool, and
CAS node pool.
This is shown in the image below.
The Video…
As this has ended up being longer than I had intended, I guess it’s about time I let you watch the video.
Conclusion
In this article and the video I have focused on node pools and nodes, but there are many things that can affect the infrastructure requirements, and hence the costs, such as the availability and performance requirements.
When running in the Cloud there are many storage options and the choice of storage can also have a significant impact on the infrastructure costs.
While the intent of this is to raise awareness of the many different deployment options and considerations, I have probably only scratched the surface of this topic, and you may have many more questions.
Finally, the SAS Viya 4 Infrastructure as Code (IaC) projects are available on GitHub, see the links below:
SAS Viya 4 Infrastructure as Code (IaC) for Microsoft Azure
SAS Viya 4 Infrastructure as Code (IaC) for Amazon Web Services (AWS)
SAS Viya 4 Infrastructure as Code (IaC) for Google Cloud Platform (GCP)
Below are the other videos in this series:
SAS Viya Topologies – An Introduction (Part 1 of 4)
SAS Viya Topologies – Basic Topologies (Part 2 of 4)
SAS Viya Topologies – Topologies 2 (Part 3 of 4)
I hope this is useful and thanks for reading.
Find more articles from SAS Global Enablement and Learning here.
03-14-2022
05:54 PM
Hi Alan, good questions. You are right that SAS/CONNECT is used for sessions from SAS 9 and other SAS Viya environments. If you only have a single SAS Viya environment and there is no requirement for sessions from other SAS environments, then yes, having a connect node pool is not needed.
The topology shown is for a "fully" scaled-out deployment, or what I call "separation by tier". In fact, unless you have a lot of SAS/CONNECT sessions using the spawner, I would just have a 'Compute' node pool to support the SAS/CONNECT and Compute (sas-programming-environment) pods.
For your environment, a three node pool topology is probably fine. That is, a shared node pool for the stateless and stateful pods, a shared node pool for Connect and Compute pods, and a CAS node pool.
To implement this you need to create a patch transformer to update the 'sas-connect-spawner' pod to use the Compute node pool.
Below is an example (that uses strict scheduling).
I hope that helps.
# This transformer changes the sas-connect-spawner pod to run on the compute nodes
---
apiVersion: builtin
kind: PatchTransformer
metadata:
  name: add-compute-label
patch: |-
  - op: remove
    path: /spec/template/spec/affinity/nodeAffinity/preferredDuringSchedulingIgnoredDuringExecution
    value:
      - preference:
          matchExpressions:
            - key: workload.sas.com/class
              operator: In
              values:
                - connect
          matchFields: []
        weight: 100
      - preference:
          matchExpressions:
            - key: workload.sas.com/class
              operator: NotIn
              values:
                - compute
                - stateless
                - stateful
          matchFields: []
        weight: 50
  - op: add
    path: /spec/template/spec/affinity/nodeAffinity/requiredDuringSchedulingIgnoredDuringExecution/nodeSelectorTerms/0/matchExpressions/-
    value:
      key: workload.sas.com/class
      operator: In
      values:
        - compute
  - op: replace
    path: /spec/template/spec/tolerations
    value:
      - effect: NoSchedule
        key: workload.sas.com/class
        operator: Equal
        value: compute
target:
  kind: Deployment
  name: sas-connect-spawner
03-13-2022
01:27 PM
In my last post, I described how to realize your SAS Viya workload placement plan (see here). In that article I discussed creating node pools to dedicate nodes to running SAS Micro Analytic Service (MAS) pods and the CAS Servers when running multiple SAS Viya environments (namespaces) in a shared Kubernetes cluster.
Both prior to that article and more recently, I have been asked about dedicating nodes to the Compute Server or, more correctly put, the 'sas-programming-environment' pods. In this blog I will share the required configuration changes and my thoughts on creating custom node pools to support the compute workloads.
First, we will look at the new workload placement plan, the target topology.
In this example, we will once again look at running two SAS Viya environments (production and discovery). As per last time, the stateless, stateful, connect and realtime nodes are shared by both SAS Viya environments, and the CAS Servers are running on dedicated, or separate, node pools for each environment.
But now we will add a new node pool for the programming workloads for the discovery environment. The ‘compute’ node pool is dedicated to the production environment and the ‘comp2’ node pool is dedicated to the discovery environment. Figure 1 illustrates my new workload placement plan.
Figure 1. Target topology running two SAS Viya environments.
A key driver for using this configuration would be the need to use different instance types (remember, instance types vary by the type and number of CPUs, RAM, local disk, and so on) for the SAS Viya environments. For example, perhaps the production workload is more controlled and better understood in terms of its resource demand profile; it is predictable in terms of the CPUs/cores and RAM required to complete the workloads within a given SLA. The discovery workload is more variable and needs larger nodes in terms of CPUs/cores and RAM to support the variability of the processing.
You are focusing on the resource (capacity) requirements for each workload and cost optimization.
Another driver for this topology might be the need to totally separate (isolate) the production and discovery processing, so that a "rogue" discovery job can't impact any production processing. The workload separation may still be needed even when using the new workload orchestration features (SAS Workload Management), as the orchestration works at a namespace (SAS Viya environment) level, not across multiple namespaces.
SAS Workload Management for SAS Viya was GA in November 2021, with Stable 2021.2.1 and Long-Term Support 2021.2.
Creating the cluster
In my last blog, I discussed creating a naming scheme and the recommendation not to over taint the nodes. For my testing I used the following labels and taints.
Node pool cas: Labels workload.sas.com/class=cas, environment/prod=cas; Taint workload.sas.com/class=cas
Node pool casnonprod: Labels workload.sas.com/class=cas, environment/discovery=cas; Taint workload.sas.com/class=cas
Node pool realtime: Labels workload/class=realtime; Taint workload/class=realtime
Node pool compute: Labels workload.sas.com/class=compute, environment/prod=compute; Taint workload.sas.com/class=compute
Node pool comp2: Labels workload.sas.com/class=compute, environment/discovery=compute; Taint workload.sas.com/class=compute
As can be seen from the table above, I have only used the standard SAS taints for the CAS and compute nodes. Again, I did my testing in Azure and used the SAS Viya 4 Infrastructure as Code (IaC) for Microsoft Azure GitHub project to create the cluster.
To confirm the configuration of the nodes, that is, the labels that have been assigned, I used the following command to list the node labels:
kubectl get nodes -L workload.sas.com/class,workload/mas,environment/prod,environment/discovery
This gave the following output (Figure 2).
Figure 2. Displaying node labels.
To confirm the taints that have been applied use the following command:
kubectl get node -o=custom-columns=NODE:.metadata.name,TAINTS:.spec.taints
This gave the following output for my AKS cluster.
Figure 3. Displaying node taints.
Updating the SAS Viya Configuration
In my last blog I discussed preferred scheduling versus strict scheduling, and the ability to drive pods to a node by using node label(s). In the following examples I have used the 'requiredDuringSchedulingIgnoredDuringExecution' node affinity definition, which specifies rules that must be met for a pod to be scheduled onto a node.
As I haven't added any environment taint to the compute nodes, both SAS Viya deployments must be updated to stop "pod drift" across the two node pools. If the default configuration were used, the compute pods could make use of both node pools. This wasn't my desired state, so I updated the SAS Viya configuration for both environments.
In Kubernetes the word 'drift' is used in several contexts. For example, "configuration drift" refers to an environment in which the running cluster becomes increasingly different from its intended state over time, usually due to manual changes and updates to the cluster. The term can also describe "container drift", usually within a security context, which refers to detecting and preventing misconfiguration in Kubernetes deployments.
In this context, “pod drift” is referring to pods that end up running on nodes that are not the target or desired location. A drift away from the target topology.
Controlling the use of the compute nodes is more complex than the CAS or MAS configuration. This is because the ‘sas-programming-environment’ has several components. If you look at the site.yaml you will see that the following configuration needs to be updated:
sas-compute-job-config
sas-batch-pod-template
sas-launcher-job-config
sas-connect-pod-template
I will not go into the details here, but the different ‘sas-programming-environment’ components are explained in the SAS Viya Administration documentation and this SAS Communities blog.
In the following examples, the patch transformers will make the following changes:
Remove the preferred scheduling to simplify the manifest, and
Add the definition in the required scheduling section for the node selection.
The discovery configuration is shown in the examples below. In all cases I have tested for the value of the environment label, but I could have just tested for the existence of the label, in this case ‘environment/discovery’.
This would look like the following:
- op: add
  path: /template/spec/affinity/nodeAffinity/requiredDuringSchedulingIgnoredDuringExecution/nodeSelectorTerms/0/matchExpressions/-
  value:
    key: environment/discovery
    operator: Exists
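The production deployment needs the mirror-image change, testing the environment/prod label instead. For example, using the value test (as in the transformers that follow) and the labels from the table above, the production entry would be:

- op: add
  path: /template/spec/affinity/nodeAffinity/requiredDuringSchedulingIgnoredDuringExecution/nodeSelectorTerms/0/matchExpressions/-
  value:
    key: environment/prod
    operator: In
    values:
      - compute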
Once you have created the patch transformers shown here, you need to update the kustomization.yaml to refer to the new configuration.
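For example, if the four patch transformers were saved under site-config with the (hypothetical) file names below, the transformers section of the kustomization.yaml would gain entries like these:

# kustomization.yaml fragment - the file names are illustrative
transformers:
  # ... existing entries ...
  - site-config/set-compute-job-label.yaml
  - site-config/set-batch-compute-label.yaml
  - site-config/set-launcher-job-label.yaml
  - site-config/set-connect-template-label.yaml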
Compute Server (sas-compute-job-config) configuration
The following example is a patch transformer to update the sas-compute-job-config PodTemplate.
# Patch to update the sas-compute-job-config pod configuration
---
apiVersion: builtin
kind: PatchTransformer
metadata:
  name: set-compute-job-label
patch: |-
  - op: remove
    path: /template/spec/affinity/nodeAffinity/preferredDuringSchedulingIgnoredDuringExecution
    value:
      - preference:
          matchExpressions:
            - key: workload.sas.com/class
              operator: In
              values:
                - compute
          matchFields: []
        weight: 100
      - preference:
          matchExpressions:
            - key: workload.sas.com/class
              operator: NotIn
              values:
                - cas
                - connect
                - stateless
                - stateful
          matchFields: []
        weight: 50
  - op: add
    path: /template/spec/affinity/nodeAffinity/requiredDuringSchedulingIgnoredDuringExecution/nodeSelectorTerms/0/matchExpressions/-
    value:
      key: environment/discovery
      operator: In
      values:
        - compute
target:
  kind: PodTemplate
  version: v1
  name: sas-compute-job-config
To view the changes made I used ‘icdiff’ to compare the default configuration (site.yaml) and the new configuration that was produced (compute-job-site.yaml). This is shown in Figure 4.
Figure 4. Review the compute server change.
As can be seen, the preferred scheduling section has been removed (shown in red) and the new entry for the required scheduling is shown in green.
SAS Batch Job (sas-batch-pod-template) configuration
The following example is a patch transformer to update the sas-batch-pod-template PodTemplate.
# Patch to update the sas-batch-pod-template configuration
---
apiVersion: builtin
kind: PatchTransformer
metadata:
  name: set-batch-compute-label
patch: |-
  - op: remove
    path: /template/spec/affinity/nodeAffinity/preferredDuringSchedulingIgnoredDuringExecution
    value:
      - preference:
          matchExpressions:
            - key: workload.sas.com/class
              operator: In
              values:
                - compute
          matchFields: []
        weight: 100
      - preference:
          matchExpressions:
            - key: workload.sas.com/class
              operator: NotIn
              values:
                - cas
                - connect
                - stateless
                - stateful
          matchFields: []
        weight: 50
  - op: add
    path: /template/spec/affinity/nodeAffinity/requiredDuringSchedulingIgnoredDuringExecution/nodeSelectorTerms/0/matchExpressions/-
    value:
      key: environment/discovery
      operator: In
      values:
        - compute
target:
  kind: PodTemplate
  version: v1
  name: sas-batch-pod-template
Once again, to view the changes made I used ‘icdiff’ to compare the default configuration (site.yaml) and the new configuration that was produced (batch-site.yaml). This is shown in Figure 5.
Figure 5. Review the batch job change.
Again, you can see the deletion in red and the additional configuration in green.
SAS Launcher Job (sas-launcher-job-config) configuration
The following example is a patch transformer to update the sas-launcher-job-config PodTemplate.
# Patch to update the sas-launcher-job-config pod configuration
---
apiVersion: builtin
kind: PatchTransformer
metadata:
  name: set-launcher-job-label
patch: |-
  - op: remove
    path: /template/spec/affinity/nodeAffinity/preferredDuringSchedulingIgnoredDuringExecution
    value:
      - preference:
          matchExpressions:
            - key: workload.sas.com/class
              operator: In
              values:
                - compute
          matchFields: []
        weight: 100
      - preference:
          matchExpressions:
            - key: workload.sas.com/class
              operator: NotIn
              values:
                - cas
                - connect
                - stateless
                - stateful
          matchFields: []
        weight: 50
  - op: add
    path: /template/spec/affinity/nodeAffinity/requiredDuringSchedulingIgnoredDuringExecution/nodeSelectorTerms/0/matchExpressions/-
    value:
      key: environment/discovery
      operator: In
      values:
        - compute
target:
  kind: PodTemplate
  version: v1
  name: sas-launcher-job-config
Connect Server (sas-connect-pod-template) configuration
The following example is a patch transformer to update the sas-connect-pod-template PodTemplate.
# Patch to update the sas-connect-pod-template pod configuration
---
apiVersion: builtin
kind: PatchTransformer
metadata:
  name: set-connect-template-label
patch: |-
  - op: remove
    path: /template/spec/affinity/nodeAffinity/preferredDuringSchedulingIgnoredDuringExecution
    value:
      - preference:
          matchExpressions:
            - key: workload.sas.com/class
              operator: In
              values:
                - compute
          matchFields: []
        weight: 100
      - preference:
          matchExpressions:
            - key: workload.sas.com/class
              operator: NotIn
              values:
                - cas
                - connect
                - stateless
                - stateful
          matchFields: []
        weight: 50
  - op: add
    path: /template/spec/affinity/nodeAffinity/requiredDuringSchedulingIgnoredDuringExecution/nodeSelectorTerms/0/matchExpressions/-
    value:
      key: environment/discovery
      operator: In
      values:
        - compute
target:
  kind: PodTemplate
  version: v1
  name: sas-connect-pod-template
Verifying the configuration
After both environments were running, I started four SAS Studio sessions and then used Lens to confirm that the compute server pods were running on the correct nodes. This is illustrated in the figure 6.
Figure 6. Verifying the configuration.
If you look closely you will see there are three SAS Studio sessions in the discovery environment (namespace). This is shown by the three pods running on the 'aks-comp2-3018…' node, while there is one production SAS Studio session running on the 'aks-compute-301…' node. (Remember that SAS Compute Servers run as "sas-launcher-" pods, and here we're looking at those Controlled By "Job", not "ReplicaSet".)
Conclusion
Here we have looked at some of the drivers for using separate node pools for the compute pods and seen how to implement this (with the ‘compute’ and ‘comp2’ node pools) for two SAS Viya environments.
The examples shown above rely on updating both SAS Viya deployments, as I haven’t created a custom taint for the new ‘comp2’ node pool. If you wanted to keep the production deployment as “vanilla” as possible the minimum approach would be to add an environment taint to the new compute (comp2) node pool for the discovery deployment.
But remember if you use preferred scheduling you could end up with “pod drift” across all the available compute node pools unless you add additional taints to keep unwanted pods away.
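To illustrate that taint-based approach, here is a hedged sketch. It assumes a hypothetical environment/discovery=compute:NoSchedule taint on the comp2 node pool and appends the matching toleration to one of the discovery pod templates; the same pattern would be repeated for the other sas-programming-environment templates, and it assumes the template already has a tolerations list to append to.

# Illustrative only: tolerate a hypothetical environment/discovery=compute:NoSchedule
# taint applied to the comp2 node pool. Repeat for the other pod templates.
---
apiVersion: builtin
kind: PatchTransformer
metadata:
  name: add-discovery-compute-toleration
patch: |-
  - op: add
    path: /template/spec/tolerations/-
    value:
      key: environment/discovery
      operator: Equal
      value: compute
      effect: NoSchedule
target:
  kind: PodTemplate
  version: v1
  name: sas-compute-job-config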
Using the four patch transformers it would be possible to optimize, or tailor, the deployment further to meet the customer’s specific needs, allowing the different pod types to use specific node pools (node types).
Finally, if you were to share a single compute node pool across multiple SAS Viya environments, the node pool must be sized appropriately to support the workload of all the SAS Viya environments.
This doesn't just mean selecting the right instance type (node size); you should also focus on elements such as the number of nodes in the node pool (min and max values) and setting the "max_pods" value for the nodes. Setting 'max_pods' can help stop the nodes from getting overloaded, but it may mean you incur higher costs for running the Kubernetes cluster.
This may need some tuning once you understand the workloads and system performance.
I hope this is useful and thanks for reading.