SAS Container Runtime Observability

In this post we will look at observability for SAS Container Runtime. We will discuss monitoring the container runtime pods running on Kubernetes.

I will show how to extend the SAS Viya Monitoring for Kubernetes framework, but the core configuration tasks are also relevant for a vanilla Grafana / Prometheus environment.

For the purposes of this post when I say “model” I’m talking about any model or decision that has been published as a SAS Container Runtime image.

The first step to collecting the metrics data is to deploy a monitor into the namespace where the SAS Container Runtime pods are running. As I alluded to above, this is required regardless of how you have deployed the Prometheus and Grafana environment.

The monitor configuration is used to target the pods to be monitored. Therefore, the monitoring requirements influence the manifest used to deploy the model images. This is the start of the planning for the deployment / configuration.

In the pod monitor (PodMonitor) definition, you need to define how the required pods will be targeted, using the pod name, a label, or some other unique attribute. Probably the easiest element to target is a label; for this you use a “matchLabels” selector. The label named in the selector is the label you must add to the model pod.

For this example, I added the label “app.kubernetes.io/component=scr” as my top-level filter to target all SAS Container Runtime pods. I also added the label “app.kubernetes.io/name” to the pod(s) and used it to specify the model or decision name. You will see these labels in the model deployment YAML later in this post.

If you want to further filter or group the models in the Grafana dashboard, for example by business unit or functional area, you might add more labels to the SAS Container Runtime pods. For example, “app.kubernetes.io/department=marketing” or “app.kubernetes.io/function=risk”.

Therefore, in planning the monitoring configuration you not only need to determine how you will target the SAS Container Runtime pods, but you also need to think about the labels you might want to use when creating the Grafana visualisations.

Here is the PodMonitor YAML that I used.

apiVersion: monitoring.coreos.com/v1
kind: PodMonitor
metadata:
  name: sas-scr-pods
  labels:
    app/monitoring-base: scr-monitoring
spec:
  selector:
    matchLabels:
      app.kubernetes.io/component: scr
  podMetricsEndpoints:
    # By default SCR uses /prometheus as the endpoint
    - port: http
      path: /prometheus
      tlsConfig:
        insecureSkipVerify: true
      relabelings:
      - sourceLabels: [__meta_kubernetes_pod_label_app_kubernetes_io_name]
        targetLabel: model
      - sourceLabels: [__meta_kubernetes_pod_label_app_kubernetes_io_component]
        targetLabel: component
      - sourceLabels: [__meta_kubernetes_pod_node_name]
        targetLabel: node
      - sourceLabels: [__meta_kubernetes_pod_annotationpresent_prometheus_io_scheme,__meta_kubernetes_pod_annotation_prometheus_io_scheme]
        action: replace
        regex: true;(.+)
        targetLabel: __scheme__
        replacement: $1
      - targetLabel: cluster
        replacement: student-aks-cluster

There are several dependencies between the PodMonitor configuration and the model deployment configuration. I created the following illustration to help describe the required configuration.

On the left (in the blue box) is the PodMonitor configuration and on the right, you can see two snippets of the model deployment yaml.

Looking at the dependencies:

1. Because the PodMonitor configuration uses a “matchLabels” selector, the label definition must match in both places. This is shown in “1”.
2. The port name must match in both configurations. In this example the name “http” has been used. This is shown in “2”.
3. In the SAS Container Runtime image, the default metrics endpoint is “/prometheus”. If you wish to override this, for example to use “/metrics”, you must set the SAS_SCR_PROMETHEUS_ENDPOINT environment variable. This is shown in “3”. The value of the variable must include the forward slash (“/”).
4. Finally, in “4”, you can see the annotations that must be added to the model deployment configuration. There are four required annotations, and their values must match the pod configuration.
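
The first three dependencies can be checked mechanically. The following is a minimal Python sketch, not part of SAS Viya Monitoring: the two dictionaries are hypothetical, trimmed-down views of the PodMonitor and of the pod template in a model deployment, and the helper name check_monitor_matches_pod is my own. The container port 8080 and the model name are illustrative values only.

```python
# Hypothetical, trimmed-down view of the PodMonitor shown earlier.
pod_monitor = {
    "selector": {"matchLabels": {"app.kubernetes.io/component": "scr"}},
    "podMetricsEndpoints": [{"port": "http", "path": "/prometheus"}],
}

# Hypothetical pod template of a model deployment (illustrative values).
pod_template = {
    "labels": {
        "app.kubernetes.io/component": "scr",
        "app.kubernetes.io/name": "qs_tree1",
    },
    "ports": [{"name": "http", "containerPort": 8080}],
    "env": {},  # SAS_SCR_PROMETHEUS_ENDPOINT not set, so the default applies
}

DEFAULT_ENDPOINT = "/prometheus"  # default SCR metrics endpoint


def check_monitor_matches_pod(monitor, pod):
    """Return a list of mismatches; an empty list means the two agree."""
    problems = []
    # Dependency 1: every matchLabels entry must be present on the pod.
    for key, value in monitor["selector"]["matchLabels"].items():
        if pod["labels"].get(key) != value:
            problems.append(f"label {key}={value} is missing on the pod")
    port_names = {port["name"] for port in pod["ports"]}
    pod_path = pod["env"].get("SAS_SCR_PROMETHEUS_ENDPOINT", DEFAULT_ENDPOINT)
    for endpoint in monitor["podMetricsEndpoints"]:
        # Dependency 2: the monitor refers to the port by name.
        if endpoint["port"] not in port_names:
            problems.append(f"port name {endpoint['port']!r} is not defined on the pod")
        # Dependency 3: the scrape path must be the path the pod serves.
        if endpoint["path"] != pod_path:
            problems.append(f"monitor scrapes {endpoint['path']} but pod serves {pod_path}")
    return problems


print(check_monitor_matches_pod(pod_monitor, pod_template))
```

With the values above the list is empty. Renaming the pod port, or setting SAS_SCR_PROMETHEUS_ENDPOINT without updating the monitor path, produces a message for each mismatch.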

For the model deployment there are several environment variables that relate to the monitoring. In the image above you can see three of the variables. For a complete list see the Monitoring SAS Container Runtime Metrics documentation.

The SAS_SCR_PROMETHEUS_ENABLED variable is enabled by default. However, the SAS_SCR_METRICS variable needs to be enabled to expose the execution metrics data. Without setting this variable you will only collect the system level data.

The last part of the PodMonitor configuration to review is the “relabelings:” configuration. This is an optional configuration and provides the ability to map labels from the Kubernetes configuration to labels for Prometheus/Grafana, and to provide additional labels.

To take a closer look at a few of the mappings that I used:

relabelings:
- sourceLabels: [__meta_kubernetes_pod_label_app_kubernetes_io_name]
  targetLabel: model
- sourceLabels: [__meta_kubernetes_pod_label_app_kubernetes_io_component]
  targetLabel: component

- targetLabel: cluster
  replacement: student-aks-cluster

Previously I described how I used two labels (“app.kubernetes.io/component=scr” and “app.kubernetes.io/name”) to target and report on the SAS Container Runtime pods. I wanted to use these labels in the PromQL queries, and I also wanted to map them to more human-readable names (“component” and “model”). For more information on PromQL see the following link to the Prometheus documentation: Query basics

A key detail to understand when working with the labels is that Prometheus maps the dot and “/” characters in a Kubernetes label name to an underscore, because those characters are not valid in the label names used here. Therefore, app.kubernetes.io/component becomes app_kubernetes_io_component.
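
As a rough illustration of that mapping, here is a small Python sketch. It is my own approximation of the rule described above (any character other than a letter, digit or underscore becomes an underscore), not code taken from Prometheus, and the function name is hypothetical.

```python
import re

# Prefix that Prometheus service discovery puts in front of pod labels.
POD_LABEL_PREFIX = "__meta_kubernetes_pod_label_"


def kubernetes_label_to_source_label(label_name: str) -> str:
    """Approximate the source label name to use in a relabeling rule."""
    # Replace every character that is not a letter, digit or underscore.
    sanitized = re.sub(r"[^a-zA-Z0-9_]", "_", label_name)
    return POD_LABEL_PREFIX + sanitized


print(kubernetes_label_to_source_label("app.kubernetes.io/component"))
# prints: __meta_kubernetes_pod_label_app_kubernetes_io_component
```

The result matches the sourceLabels entries used in the PodMonitor, so the same helper can be used to work out the name for any additional pod label you want to map.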

If we look at the configuration snippet above, I have created two new labels. The “component” label is used to target all the model pods. If I wanted to add a label for grouping models, I might use the Kubernetes label “app.kubernetes.io/function” and map that to a “model_group” label.

The “model” label is populated for any pod with the Kubernetes label “app.kubernetes.io/name”. I’m using it to be able to filter on the model name.
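
To show how the mapped labels end up in a query, here is a minimal Python sketch that assembles PromQL strings. The helper names are hypothetical, the queries are only examples of what a dashboard panel might use, and the model name “qs_reg1” is simply one of the models from my environment. The metric scr_total_hit_count is one of the metrics returned by the SAS Container Runtime endpoint.

```python
def promql_selector(metric, **labels):
    """Build a PromQL instant-vector selector with equality matchers."""
    # Note: label values are not escaped here; fine for simple names only.
    matchers = ", ".join(f'{name}="{value}"' for name, value in sorted(labels.items()))
    return f"{metric}{{{matchers}}}"


def total_calls_by_model(component="scr"):
    """Aggregate the hit count of all pods, grouped by the 'model' label."""
    selector = promql_selector("scr_total_hit_count", component=component)
    return f"sum by (model) ({selector})"


# All SCR pods, grouped by model
print(total_calls_by_model())
# prints: sum by (model) (scr_total_hit_count{component="scr"})

# One model only, one series per pod
print(promql_selector("scr_total_hit_count", component="scr", model="qs_reg1"))
# prints: scr_total_hit_count{component="scr", model="qs_reg1"}
```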

As can be seen above, each source label name has the prefix:

“__meta_kubernetes_pod_label_”

It is also possible to add labels that have no Kubernetes source. My AKS cluster (I was testing in Azure) didn’t provide a cluster label, which is normal, so I added one. In the example above the “cluster” label has been given a static value of “student-aks-cluster”.
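
To make the behaviour of the relabeling rules more concrete, the following Python sketch simulates a single rule with the “replace” action. It is my own simplified model, not Prometheus code. It assumes the documented defaults (separator “;”, regex “(.*)”, replacement “$1”) and that the regex must match the whole joined value; it ignores details such as what happens when the resulting value is empty.

```python
import re


def apply_replace(labels, target_label, source_labels=(), regex="(.*)",
                  replacement="$1", separator=";"):
    """Simulate one relabeling rule with action 'replace' (simplified)."""
    # Source label values are joined with the separator before matching.
    joined = separator.join(labels.get(name, "") for name in source_labels)
    match = re.fullmatch(regex, joined)  # the regex is anchored at both ends
    if match is None:
        return dict(labels)  # no match: the labels are left unchanged
    # Translate $1-style group references into Python's \g<1> form.
    template = re.sub(r"\$(\d+)", r"\\g<\1>", replacement)
    result = dict(labels)
    result[target_label] = match.expand(template)
    return result


# Mapping rule: copy the pod label value into the "model" label.
pod = {"__meta_kubernetes_pod_label_app_kubernetes_io_name": "qs_tree1"}
print(apply_replace(pod, "model",
                    ["__meta_kubernetes_pod_label_app_kubernetes_io_name"]))

# Static rule: no source labels, so the replacement is used as-is.
print(apply_replace({}, "cluster", replacement="student-aks-cluster"))

# Scheme rule from the PodMonitor: only applies when the annotation exists.
annotated = {
    "__meta_kubernetes_pod_annotationpresent_prometheus_io_scheme": "true",
    "__meta_kubernetes_pod_annotation_prometheus_io_scheme": "https",
}
print(apply_replace(annotated, "__scheme__", list(annotated), regex="true;(.+)"))
```

The last call shows why the scheme rule uses two source labels: the joined value is “true;https”, the regex captures “https”, and that becomes the scrape scheme. For a pod without the annotation the regex does not match, so the scheme is left alone.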

Deploying the PodMonitor

With the PodMonitor configuration completed I could have applied it manually, but I wanted to integrate it with the standard SAS Viya Monitoring for Kubernetes (SAS Viya Monitoring) configuration, extending the monitoring to pick up the SAS Container Runtime pods.

For this there are two steps:

1. The first step is to copy the monitor definition to the required SAS Viya Monitoring folder. This is under the cloned Git project path of: viya4-monitoring-kubernetes/monitoring/monitors/viya
2. The next step is to use the deploy_monitoring_viya.sh script to deploy the monitor. For example, in my environment I used the following code.

# Install monitoring components to the SCR (models) namespace
cd $HOME/project/v4m/
export USER_DIR=$HOME/project/v4m
# Enable namespace level monitoring for the namespace
export VIYA_NS=models
$HOME/project/viya4-monitoring-kubernetes/monitoring/bin/deploy_monitoring_viya.sh

Setting the “VIYA_NS” environment variable identifies the namespace containing the SAS Container Runtime pods to SAS Viya Monitoring.

This deploys the monitor configuration and the prometheus-pushgateway to the namespace. It will also deploy the standard SAS Viya monitors to the target namespace.

With this configuration in place, you can confirm that the monitor is working using the Prometheus UI and the “Status > Target Health” function. If your selector is working properly, you will see a row for each pod that has been detected. Here I deployed the qs_reg1 model as a single pod and the qs_tree1 model as a Kubernetes deployment with two replicas.

In the screenshot you can see three endpoints, one for each of the three running pods, and that I was using the default metrics endpoint (/prometheus). You can also see that two of the pods are running on the node aks-models-25581144-vmss000001.

In the “Labels” section you can see the labels that have been mapped and created.

Drilling in on the labels, you can see the labels that I defined (cluster, component and model) and the “relabeling” for the node label, plus the labels that were automatically created: container, endpoint, instance, job, namespace and pod.
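
The same check can be scripted against the Prometheus HTTP API, which exposes the target list at /api/v1/targets. The sketch below is only an illustration: the function names, the Prometheus URL you would pass in and the sample payload (including the pod names) are hypothetical, and the payload is cut down to the fields the summary uses.

```python
import json
from urllib.request import urlopen


def fetch_targets(prometheus_url):
    """Fetch the active targets from a Prometheus server (needs network access)."""
    with urlopen(f"{prometheus_url}/api/v1/targets?state=active") as response:
        return json.load(response)


def summarize_targets(payload, component="scr"):
    """Return (model, pod, node, health) for every target with the component label."""
    rows = []
    for target in payload["data"]["activeTargets"]:
        labels = target["labels"]
        if labels.get("component") != component:
            continue  # skip targets that are not SAS Container Runtime pods
        rows.append((labels.get("model", ""), labels.get("pod", ""),
                     labels.get("node", ""), target["health"]))
    return sorted(rows)


# Hypothetical, trimmed response used instead of a live call.
sample_payload = {
    "status": "success",
    "data": {"activeTargets": [
        {"labels": {"component": "scr", "model": "qs_tree1", "pod": "qs-tree1-abc",
                    "node": "aks-models-25581144-vmss000001"}, "health": "up"},
        {"labels": {"job": "other-job", "pod": "other-pod"}, "health": "up"},
    ]},
}

for row in summarize_targets(sample_payload):
    print(row)
```

In a real environment you would pass the result of fetch_targets to summarize_targets instead of the sample payload.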

Now that the monitor is in place it is possible to start creating the Grafana dashboard(s).

Creating the Dashboard

The monitoring metrics are documented in the SAS Container Runtime Help Center (see the section Monitoring SAS Container Runtime Metrics), but it also helps to understand the data that is actually returned. This can be done in several ways: using Grafana or Prometheus, or by interactively querying the endpoint. The benefit of the latter approach is that you get to see the data exactly as it is scraped by Prometheus.

Assuming you are using the default endpoint here is an example command to curl the metrics data from a pod.

kubectl -n namespace exec pod_name -- curl http://localhost:8080/prometheus

The following is an extract showing some sample output.

# HELP jvm_buffer_pool_capacity_bytes Bytes capacity of a given JVM buffer pool.
# TYPE jvm_buffer_pool_capacity_bytes gauge
jvm_buffer_pool_capacity_bytes{pool="direct"} 81920.0
jvm_buffer_pool_capacity_bytes{pool="mapped"} 0.0
jvm_buffer_pool_capacity_bytes{pool="mapped - 'non-volatile memory'"} 0.0
# HELP jvm_buffer_pool_used_buffers Used buffers of a given JVM buffer pool.
# TYPE jvm_buffer_pool_used_buffers gauge
jvm_buffer_pool_used_buffers{pool="direct"} 10.0
# HELP scr_busy_worker_thread_count Number of busy worker threads
# TYPE scr_busy_worker_thread_count gauge
scr_busy_worker_thread_count 3.0
# HELP scr_gc_grace_period_seconds Garbage collection grace period in seconds
# TYPE scr_gc_grace_period_seconds gauge
scr_gc_grace_period_seconds 10.0
# HELP scr_gc_interval_seconds Garbage collection interval in seconds
# TYPE scr_gc_interval_seconds gauge
scr_gc_interval_seconds 60.0
# HELP scr_jvm_max_memory Maximum amount of Java memory allowed to be used
# TYPE scr_jvm_max_memory gauge
scr_jvm_max_memory 8.413773824E9
# HELP scr_jvm_total_memory Total allocated memory reserved for Java
# TYPE scr_jvm_total_memory gauge
scr_jvm_total_memory 1.00663296E8
# HELP scr_memory_high_water_mark_bytes Memory usage high water mark
# TYPE scr_memory_high_water_mark_bytes gauge
scr_memory_high_water_mark_bytes 1.6941056E7

# HELP scr_system_cpu_time System CPU time since startup in microseconds
# TYPE scr_system_cpu_time gauge
scr_system_cpu_time 790000.0
# HELP scr_thread_pool_size Thread pool size
# TYPE scr_thread_pool_size gauge
scr_thread_pool_size 4.0
# HELP scr_threads_high_water_mark Thread usage high water mark
# TYPE scr_threads_high_water_mark gauge
scr_threads_high_water_mark 13.0

# HELP scr_total_hit_count Total number of execute calls since the server has been up
# TYPE scr_total_hit_count gauge
scr_total_hit_count 2.0
# HELP scr_user_cpu_time User CPU time since startup in microseconds
# TYPE scr_user_cpu_time gauge
scr_user_cpu_time 560000.0

Armed with this information you have a list of the metrics being returned by a specific pod. Another reason for using the curl command is that I found the metrics data can vary depending on the model or decision that has been published.

My advice: don’t rely only on the SAS Help Center documentation; query the running pods as well.
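
If you want to compare the metrics returned by different models, the curl output is easy to process. The following Python sketch is a deliberately simple parser using only the standard library. It assumes each sample line ends with the value and has no trailing timestamp, which is true for the output shown above but not guaranteed by the format in general. The function name is my own.

```python
def parse_metrics(text):
    """Parse Prometheus text output into ({metric: type}, {series: value})."""
    types, samples = {}, {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith("# TYPE "):
            # Format: "# TYPE <metric name> <metric type>"
            _, _, name, metric_type = line.split(None, 3)
            types[name] = metric_type
        elif line.startswith("#"):
            continue  # HELP text and other comments
        else:
            # Split on the last whitespace so label values may contain spaces.
            series, value = line.rsplit(None, 1)
            samples[series] = float(value)
    return types, samples


sample_text = """\
# HELP scr_thread_pool_size Thread pool size
# TYPE scr_thread_pool_size gauge
scr_thread_pool_size 4.0
# HELP scr_jvm_max_memory Maximum amount of Java memory allowed to be used
# TYPE scr_jvm_max_memory gauge
scr_jvm_max_memory 8.413773824E9
jvm_buffer_pool_capacity_bytes{pool="mapped - 'non-volatile memory'"} 0.0
"""

types, samples = parse_metrics(sample_text)
print(sorted(types))                  # metric names exposed by this pod
print(samples["scr_jvm_max_memory"])  # value in bytes as a float
```

Running the parser against the output of two different models and comparing the sorted metric names is a quick way to see where they differ.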

You are now ready to create a dashboard...

The Sample Dashboard

For the dashboard to function, the SAS Container Runtime pods need the labels that I described earlier; you can also see this in the dashboard notes. In my environment I was using a node pool dedicated to running the SAS Container Runtime pods.

To enable node affinity and make reporting easier, I applied (added) the label workload/class=models to the Kubernetes nodes hosting the SAS Container Runtime pods.

The following images provide a taste of what is possible. Here is my first dashboard…

A dashboard can have content grouped into rows. In the image above you can see the following rows: Dashboard Notes, Running SCR Pods, Detailed View, Resource View, JVM Details and Node View.

For the Running SCR Pods view, I deployed four models and had a script running to generate some reporting data. You can see this in the next two images.

Both the qs-tree1 and homeloan models were deployed as a Kubernetes deployment. The qs-tree1 model had 2 replicas and the homeloan model had 3 replicas.

Looking at the calls to the running models (in the Detailed View)…

In the bottom two panels (“Total Calls by Pod” and “Execution Time by Pod”) you can see the use of an active filter: the qsreg1 model was selected.

Prometheus Alerts

Integrating the SAS Container Runtime deployments with the SAS Viya Monitoring framework also means that error events are tracked. For the next two screenshots I simulated an image pull error with the Azure Container Registry.

Starting with the “Target health” page you can see that Prometheus has detected 3 pods, but the status is DOWN. The error description shows that there was a connection problem: the connection to the pod was refused.

Using the “Alerts” page we can get more information. Looking at the image, you can see that the pod was in a “CrashLoopBackOff” state. The error reason is shown in the red box: there was an image pull error.

Conclusion

As the deployment of the SAS Container Runtime pods is a customer task, SAS has no control over, or input into, the deployment YAML that is used. Therefore, it is not possible to provide a standard dashboard as part of the SAS Viya Monitoring for Kubernetes project.

However, from the configuration steps shown here I hope I have enabled you to start your SAS Container Runtime observability journey.

This is Part 1 of looking at SAS Container Runtime observability; in Part 2 we will look at the Grafana dashboard definition in more detail.

Find more articles from SAS Global Enablement and Learning here.