A few months ago, I was contacted by a colleague in the Iberia region with a technical question, and the next day a SAS Cloud engineer came to our team with a very similar request for another customer based in the US. The questions were about architecture and deployment considerations around the utilization of GPUs with CAS.
With Viya 4, you will typically have a CAS node pool grouping the CAS nodes that host the CAS pods (controller and workers).
Most CAS analytics run on standard CPUs, but how can I configure CAS in Viya 4 if I want to leverage GPU (Graphics Processing Unit) acceleration for deep learning models?
This type of question is really interesting to me as a technical architect who comes from an analytical background. With this in mind, I set up some infrastructure in Google Cloud Platform to create a Viya test environment so that I could share my observations in this post.
One of the questions was: “If a customer wants two CAS node pools (one with GPU and one without GPU), is there a way to schedule work to one vs. the other without creating two CAS servers?”
The answer is no. Remember that a CAS server is a single processing unit: you submit your analytics action to the CAS controller, and it decides how to break the work down across the CAS workers.
However, what you can do is have two CAS servers (each with a controller and one or more workers) inside the same Viya environment. Each CAS server can be either SMP or MPP.
So, an interesting setup could be, for example, one CAS MPP server for standard analytics (reporting, statistics, forecasting, etc.) and one CAS SMP server running on GPU-equipped nodes where you could train your deep learning models.
The diagram below represents the topology of this scenario.
As you can see on the far right, there is an extra “CASGPU” node pool in addition to the standard CAS node pool, and that is where we want the CAS pod of our secondary CAS SMP server to run.
To illustrate this use case, I wrote and tested a specific scenario.
The starting point is that we have already:
Then, we want to add a GPU node pool, perform a standard Viya deployment with the Deployment Operator, and finally run through the steps to add, configure, and run a secondary CAS SMP server whose CAS pods will be able to consume the GPU device for deep learning processing.
So, let’s see what the steps are to meet that goal.
1. Use the create-cas-server.sh script to generate the manifest to create a secondary CAS SMP server.
2. Change the Target section of the cas-manage-workers.yaml (and potentially the cas-manage-backup.yaml) file so it uses labelSelector instead of names to apply the transformations. See the SAS documentation for more details on that. (A sketch of this change is shown right after this list.)
3. Configure the node affinity and tolerations of the secondary CASDeployment so its pods run on our GPU-enabled node pool. (We will look at some of the details in the next section of the post.)
4. Add the GPU resource request patch (the cas-gpu-patch.yaml file) as explained in the official SAS documentation and the associated README example.
5. Set the PATH and LD_LIBRARY_PATH environment variables in the CASDeployment CRD. (We will look at some of the details in a later section of the post.)
6. Reference the new patch files in the kustomization.yaml file.
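As a rough sketch of the Target change in step 2, the transformer's target would move from a fixed name to a label selector. The label key and value below are assumptions for illustration; use a label that is actually present on the CASDeployment you want to target.

target:
  kind: CASDeployment
  # Before: the transformation was applied to one CAS server by name, for example:
  # name: default
  # After: select the CAS server(s) through a label instead (illustrative label):
  labelSelector: "sas.com/cas-server=default"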
As I’d like to keep this post to a reasonable length, I won’t cover the details of each of the steps above. 😊
(Most of them are already covered either in the official SAS documentation or in the Google documentation.)
However, let’s detail a little more the steps that are specific to our CAS GPU setup.
If you are using Terraform as part of the IaC project, the cool thing is that you can easily modify part of your infrastructure without having to rebuild it completely.
Terraform can identify the delta between what is currently deployed (as reflected in the local terraform state file) and what you are changing in the terraform variables file.
For example, here are the steps to “update” our infrastructure with a new node pool:
In this example, our node pool will always have a single node and use the “n1-highmem-8” instance type with an NVIDIA Tesla P100 GPU (16 GB of GPU memory).
You can also notice that we assign a “casgpu” label and a “casgpu” taint to our new CAS GPU node pool.
If you run:
terraform plan -var-file=./gel-vars.tfvars -state=terraform.tfstate
You should see something like:
The initial Terraform plan creates 42 resources, so the message with “3 to add, 1 to change, 2 to destroy” is a good indication that our node pool change has been taken into account and will be applied in “delta” mode.
Run the command below to build the new Terraform plan and apply it.
# Build the plan and keep it in a file
terraform plan -input=false \
  -var-file=./gel-vars.tfvars -out ./addingcasgpupool.plan

# Apply the saved plan
terraform apply ./addingcasgpupool.plan
You should see this line at the end:
We can then check with the kubectl get nodes command whether our new GPU node is present.
The purpose of this change is to make sure that the CASDeployment instance that corresponds to our secondary CAS server (shared-casgpu) will only start its CAS pod(s) on the node(s) carrying the “casgpu” label.
We start from the default node-affinity.yaml PatchTransformer file generated by the create-cas-server.sh script, and we make some changes as illustrated below.
As you can see in the screenshots, we removed the preferredDuringScheduling… nodeAffinity specification and modified the requiredDuringScheduling… one to associate the "casgpu" workload class. This change ensures that the CAS pod(s) from our secondary CAS server will always land on a node labeled with the "casgpu" workload class. We also get rid of the "Not In system" required affinity since it is no longer needed.
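As an illustration, the resulting required nodeAffinity could look roughly like this; only the affinity fragment is shown here, and the surrounding PatchTransformer and exact paths come from the node-affinity.yaml file generated in your own environment.

affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
      - matchExpressions:
        # require nodes labeled with the "casgpu" workload class
        - key: workload.sas.com/class
          operator: In
          values:
          - casgpu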
Then, in the CASDeployment CRD of our secondary CAS server, we need to add a toleration so the instantiated CAS pod(s) can be accepted on a node with the workload.sas.com/class=casgpu:NoSchedule taint.
In addition, since Google has automatically tainted our GPU node pool nodes with nvidia.com/gpu=present:NoSchedule, we also want to add a toleration for it. Once again, we use a PatchTransformer to do it:
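A minimal sketch of such a PatchTransformer is shown below, assuming the secondary CAS server's CASDeployment is named shared-casgpu and that the tolerations are appended to the controller pod template; the target name and patch paths are assumptions to adapt to the manifests generated in your environment.

apiVersion: builtin
kind: PatchTransformer
metadata:
  name: cas-gpu-tolerations
patch: |-
  # tolerate the taint we assigned to the CAS GPU node pool
  - op: add
    path: /spec/controllerTemplate/spec/tolerations/-
    value:
      key: workload.sas.com/class
      operator: Equal
      value: casgpu
      effect: NoSchedule
  # tolerate the taint that GKE automatically adds to GPU nodes
  - op: add
    path: /spec/controllerTemplate/spec/tolerations/-
    value:
      key: nvidia.com/gpu
      operator: Equal
      value: present
      effect: NoSchedule
target:
  kind: CASDeployment
  name: shared-casgpu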
With these two additional "tolerations", our CAS pod can run on a node with the following taints:
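For reference, those two taints would appear roughly like this in the GPU node's specification:

taints:
- key: workload.sas.com/class
  value: casgpu
  effect: NoSchedule
- key: nvidia.com/gpu
  value: present
  effect: NoSchedule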
Note: In theory, we should not have to explicitly add a toleration, since adding a special nvidia.com/gpu resource request for the pod that needs to consume the GPU should be enough in GCP, according to the Google documentation. However, since we use the CAS auto-resources mode, the cas-probe could not be started without this additional toleration.
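For context, such a GPU resource request in a container specification takes the generic Kubernetes form below; for CAS, the actual patch should follow the cas-gpu-patch.yaml example referenced earlier.

resources:
  requests:
    nvidia.com/gpu: 1   # request one GPU device for the container
  limits:
    nvidia.com/gpu: 1   # for extended resources, requests and limits must match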
While this next step (making the NVIDIA paths visible to CAS inside the pod) is not officially documented yet, it is required for our CAS pod to be able to utilize the GPU for CAS processing.
After applying all the previously documented configuration steps, I tested a program using GPU processing in my Google Cloud environment, and it failed with the following error:
After some troubleshooting and investigation, it appeared that, while the NVIDIA binaries and libraries were available inside the CAS pod on the GPU node, CAS was not able to find them in its execution and library paths.
But in Kubernetes, you can "inject" environment variables inside the pod using the "env" container specification. It looks like this in the Pod definition:
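As a generic illustration (the container name, variable name, and value are placeholders):

spec:
  containers:
  - name: cas               # container name shown for illustration
    env:
    - name: EXAMPLE_VAR     # placeholder environment variable
      value: "example-value"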
So, you can write a simple PatchTransformer to modify the CAS pod template in the CASDeployment Custom Resource and set the PATH and LD_LIBRARY_PATH variables for the NVIDIA drivers and libraries.
It would look like this:
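The sketch below is an approximation: it assumes the NVIDIA binaries and libraries are mounted under /usr/local/nvidia inside the pod (a common location on GKE GPU nodes) and that the secondary CAS server is named shared-casgpu; adjust the target, paths, and values to your own deployment.

apiVersion: builtin
kind: PatchTransformer
metadata:
  name: cas-gpu-env
patch: |-
  # make the NVIDIA binaries visible to CAS
  - op: add
    path: /spec/controllerTemplate/spec/containers/0/env/-
    value:
      name: PATH
      value: /usr/local/nvidia/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
  # make the NVIDIA shared libraries visible to CAS
  - op: add
    path: /spec/controllerTemplate/spec/containers/0/env/-
    value:
      name: LD_LIBRARY_PATH
      value: /usr/local/nvidia/lib64
target:
  kind: CASDeployment
  name: shared-casgpu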
Note that, depending on the cloud platform and GPU accelerator type, you might have to adjust the paths used in this specific case.
If our configuration worked, then after regenerating and reapplying the SASDeployment CRD in the cluster, we should see the CAS pod corresponding to our secondary CAS server (cas-gpu) start and run on our GPU node, as below:
We can also check whether the CAS auto-resources default configuration worked and set the resource requests and limits appropriately.
If we run this code to display the resource requests and limits:
Then we get the following results, which confirm that each type of CAS instance is using most of the associated node capacity (4 vCPU/26 GB of RAM for our 4 cas-default MPP nodes, and 8 vCPU/52 GB of RAM for our cas-gpu SMP node) with the appropriate resource requests and limits.
Finally, after all these configuration steps, we want to make sure that we can run a program performing some Analytics processing that takes advantage of the GPU acceleration.
It should be noted that only specific analytics processing tasks really take advantage of GPU devices. Currently, the code samples in the official SAS documentation focus on Python programs.
However, with the help of my GEL colleagues Beth Ebersole and Nicolas Robert, we managed to write a SAS program (which leverages CAS actions) that builds, trains, and scores a deep learning model using the node's GPU.
First, we need to open a CAS session on the secondary CAS server, with something like:
Then, after loading some sample data in CAS, we train the model with the “GPU=TRUE” option and see in the log that the GPU device was identified and used.
Then we also score the model with GPU=TRUE.
Hurray! The message in the log confirms that CAS has found and is using our GPU device!
Note: If you'd like to use this same validation program in your own GPU-enabled environment, you can download the data and the program from my personal GitHub repository.
Finally, if (like me) you never really believe things until you see them for real, you might want to monitor the GPU processing at the system level while the model is trained and scored 😊
There are various ways to do it, but in our case we just ran the NVIDIA-provided program (nvidia-smi) from the CAS GPU pod to check the memory and GPU utilization in real time. In the screenshot below, taken during the model training, we can see that the utilization of our P100 GPU device reaches 72%.
What we’ve seen in this post is basically a demonstration of how you can leverage GPU processing for CAS deep learning in the cloud.
BUT… it also shows how to play with node affinities and tolerations to assign different CASDeployments (within the same Viya deployment) to different CAS node pools (even without talking about GPUs).
It could be an interesting scenario to configure several CAS servers for distinct business units and assign not only different topologies (SMP/MPP) but also different instance types (more or less power, faster or slower storage, etc.) to the different BUs' CAS servers.
As I finish writing this blog (in the last few days of April 2022), it should be noted that SAS now also has the capability for the IML procedure to leverage GPU processing from a SAS Compute Server session.
Finally, during the experimentation, I tested both techniques for CAS resource allocation (auto-resources and custom), and both worked well.
OK, that's it for today!
Thanks again to my colleagues who helped me along the way to make all this work: Beth Ebersole, Liping Cai, Frederik Vandenberghe, Nicolas Robert, Uttam Kumar, and David Zanter.
Find more articles from SAS Global Enablement and Learning here.
Thank you Raphael!
I think your last link is pointing to an internal SAS site
Thank you @JuanS_OCS, I guess you are talking about the IML documentation link? I will fix it.
Raphael,
thanks for the code.
I tested this to see if our NVIDIA GPU was used.
A little result:
NOTE: Using device: CPU. real time 49.34 seconds
NOTE: Using device: GPU 0. real time 9.31 seconds
Dik
Thank you @paterd2 for sharing your results. For the example you used, that seems like quite a compelling case for using GPU processing in certain use cases.
--Simon