In this post we will look at using SAS Enterprise Session Monitor (ESM) to monitor SAS Container Runtime deployments running on Kubernetes.
The published SAS Container Runtime container images (that is, the published models and decisions) can run on many platforms. If we choose to run them in Kubernetes, it is possible to use monitoring tools such as Prometheus/Grafana and SAS Enterprise Session Monitor.
Let’s take a dive into the world of SAS Container Runtime observability using SAS Enterprise Session Monitor.
To run the published model or decision, you must create the manifests. This gives you flexibility in the naming of the pods that are running the images and other configuration elements. For example, you can set up a Kubernetes Deployment definition to run the SAS Container Runtime image. This makes it very easy to create an environment that ESM can monitor.
We will start by discussing the ESM configuration.
SAS Enterprise Session Monitor provides a suite of functions that allow you to look at and/or filter by workload type. The Dashboard that you see on login has a ‘Load by Type’ portlet, which shows the detected session types.
While this portlet will discover the SAS Viya processes by default, it is possible to configure the ESM Agent for additional workload types (session types), such as the SAS Container Runtime images (pods). To filter on specific pod types and define a custom name, the ‘esm-agents.yaml’ file has a set of regular expression statements that can be updated to meet your specific needs. The filters are defined under the ‘pod_types:’ section.
To monitor the SAS Container Runtime pods you need to update the filters for these pods. Two factors come into play here. The first is that each filter (the regular expression) must find a unique match for the specific pod type. The second is that the pods of a given type must have a unique name to be separately identified, or at the very least a name that a regular expression can target.
This brings me back to my opening comment. You create the manifests to deploy the SAS Container Runtime images, so you have complete control over the naming of the pods. A little planning is therefore required, but having a naming standard for the SAS Container Runtime deployments (pods, services, ingress, etc.) is always a good thing.
Let’s look at an example…
Using a text editor, I updated the ‘esm-agents.yaml’ file with the following setting:
pod_types:
  - pod_log_level: WARN
    pod_regex: .*scr.*
    pod_type: SCR
This meant that any pod name containing “scr” would be assigned the type ‘SCR’.
When defining the rules, it is important to understand that they are processed sequentially, from top to bottom. So, you need to define the most specific rules first in the list.
For example, to target any pod starting with “scr” you would use:
pod_types:
  - pod_log_level: WARN
    pod_regex: ^scr.*
    pod_type: SCR
Using this as an example, if you had both definitions (with different pod_type names) in the ESM Agent configuration, the more specific rule (^scr.*) would have to be placed before the first, more general example (.*scr.*).
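Combining the two rules, the pod_types section might look something like this (a sketch; the pod_type names used here are just illustrative):

pod_types:
  - pod_log_level: WARN
    pod_regex: ^scr.*
    pod_type: SCR_PREFIX
  - pod_log_level: WARN
    pod_regex: .*scr.*
    pod_type: SCR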
Here is a link to a Microsoft tutorial on the Regular Expression Language.
I then deployed the ESM server and agents to my Kubernetes cluster.
For my testing I used two models from the SAS Model Manager Quick Start Tutorial: the QS_Reg1 and QS_Tree1 models. I ran each model using a Kubernetes Deployment, with 2 pod replicas per model.
Here is a snippet of the QS_Tree1 manifest showing the name for the model pods.
---
apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    app.kubernetes.io/name: qstree1
    workload/class: models
  name: scr-qstree1-model
spec:
  # modify replicas to support the requirements
  replicas: 2
  selector:
    matchLabels:
      app.kubernetes.io/name: qstree1
  template:
    metadata:
      labels:
        app: qs-tree1
        app.kubernetes.io/name: qstree1
        workload/class: models
    spec:
      affinity:
In the example above, you can see that the pods are named “scr-qstree1-model”. I created a similar deployment for the QS_Reg1 model, but for those pods I used the name “qsreg1-scr-model”. Hence, I needed a wild-card definition in the ESM Agent setup, looking for ‘scr’ anywhere in the pod name.
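For reference, the metadata for my QS_Reg1 deployment looked something like the following (a trimmed sketch; the labels shown are assumptions, mirroring the QS_Tree1 manifest above):

---
apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    app.kubernetes.io/name: qsreg1
    workload/class: models
  name: qsreg1-scr-model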
I then deployed the models and created a bash script to generate some workload. Let’s look at the results in ESM.
For my testing I was working in Azure and created a dedicated node pool for running the models. This included the SAS Container Runtime pods and the SAS Micro Analytic Service (MAS) pod.
I will not go into the details of using ESM but will show some screenshots to illustrate monitoring the SAS Container Runtime sessions. In this first image you can see the ESM dashboard. I have selected the “models” node (aks-models-24548859-vmss000000).
Looking at the image, you can see that I now have a session type of ‘SCR’; it appears in both the “CPU Performance and Session Count” and “Load by Type” portlets.
Looking at the “CPU Performance and Session Count” portlet you can see the CPU usage and that I had 4 pods (sessions) running.
The “Load by Type” portlet in the dashboard image shows the various workload types (session types) that have been detected. As the node is dedicated to running models (the SAS Container Runtime pods), you can see the ‘SCR’ session type along with the system / Kubernetes resource consumption. There is also a small amount of workload tagged as “viya”; this was from the SAS Micro Analytic Service pod that was running on the node.
I’m getting a little off track here, but if you also wanted to differentiate the SAS Micro Analytic Service pod(s) you could use the following expression:
- pod_log_level: WARN
  pod_regex: ^sas-microanalytic-score.*
  pod_type: MAS
This would allow you to clearly identify the pods that are running the models and decisions, for both SCR and MAS.
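Putting it all together, the pod_types section of the ‘esm-agents.yaml’ file would then contain both rules, something along these lines:

pod_types:
  - pod_log_level: WARN
    pod_regex: ^sas-microanalytic-score.*
    pod_type: MAS
  - pod_log_level: WARN
    pod_regex: .*scr.*
    pod_type: SCR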
Coming back to monitoring the SAS Container Runtime deployment, the Timespans function allows you to drill in on a node. As I described earlier, all the SAS Container Runtime pods were running on the models node. Here we can see the processes running on that node.
As I was focusing on the SAS Container Runtime pods, I used the ‘Category’ selector to filter on the ‘SCR’ session type that I defined. I then used the ‘Group By’ function on the ‘Process List’ menu to group the process list by pod id (name). For example:
As you can see, this removed all the unwanted processes, allowing me to just focus on the pods running my models.
Now it is very easy to get information on the pods. In the ‘Process List’ you can see that I had 2 replicas of the QS_Reg1 model and 2 replicas of the QS_Tree1 model running.
The node portlet is showing summary information for the ‘aks-models-24548859-vmss000000’ node. By hovering the mouse pointer over the graph, we can see that there is 9.7% CPU usage and 7.25% memory usage. This is a live view of the system.
ESM offers many ways to dive into the details. The next image shows a view of the ‘Subsessions for java’. Under Live View, I selected Distributed Search, which allowed me to focus on the java processes, that is, the SAS Container Runtime processes.
In the ‘Process List’ you can see the java process running in each SAS Container Runtime pod. On the bottom left we can see a Heat Map, and on the right we can see the CPU being used by each of these processes.
Here we can see that process ID ‘256730’ is using the most CPU; this is the first ‘scr-qstree1-model’ pod. The red arrows highlight process ‘256751’, which is using 2% CPU. It was also possible to see the memory consumption using the Heat Map on the left.
Finally, from this view you can get additional information on several performance metrics (CPU, memory, IO, page faults, etc.). In this final image you can see that I have selected ‘Major Faults’; the graph now shows the major page faults for each pod.
Within this context, a ‘Minor Fault’ occurs when a process accesses a page of data that is already in physical memory but is not yet mapped into the process’s address space. If the page is not in memory at all and must be read from disk, it is a ‘Major Fault’. Here you can see the faults being detected for the 4 model pods.
Ideally, you want to see very few major faults, as the more major faults a process has, the less performant it’s going to be, since the system must wait (CPU wait) while the IO subsystem returns the requested page of data.
From this view we can see that the process under the greatest stress is ‘256730’, with 2 major faults/sec. From the Process List we can see that this is one of the QS_Tree1 pods. This type of information is important when trying to understand the performance of the running model or decision at a system level.
For this test I had several bash scripts that were generating calls to score data using the QS_Tree1 and QS_Reg1 models. While I don’t have an exact number, I was probably generating around 7 to 10 transactions per second for the QS_Tree1 model.
It was very easy to configure the ESM Agent to monitor the SAS Container Runtime pods. We can see that ESM provides valuable insights into the system performance of each model or decision being executed.
As can be seen here, this can help with system tuning and identification of possible problems. I see this being particularly valuable when the model or decision is being integrated within a real-time business process.
I hope this post has given you a sense for what is possible when using SAS Enterprise Session Monitor to monitor the running models and decisions.
Finally, while I haven’t tested monitoring the SAS Container Runtime Batch Agent, it should be possible to take the same approach. That sounds like a post for another day 😊
Find more articles from SAS Global Enablement and Learning here.